
Machine learning experimentation sounds exciting, but honestly, most of my time goes into trial and error, tuning parameters, rerunning models, and figuring out what actually works.
I’ve seen how slow this gets. Some reports suggest up to 80% of ML time goes into experimentation and tuning rather than into building the model itself.
That’s exactly why AutoResearch AI stood out to me.
Instead of manually running experiments, I can define the goal, give it data, and let an AI agent continuously test, evaluate, and improve the model on its own, even on a single GPU.
I ran it myself on a real setup to see how far it actually goes.
In this guide, I’ll break down how it works, what’s happening under the hood, and what I learned from running it in practice.
AutoResearch AI is an open-source framework that automates machine learning experimentation using an AI agent.
Instead of manually tuning models, the system continuously modifies training code, runs experiments, evaluates results, and retains only the changes that improve performance. Over time, this creates a self-improving loop, replacing traditional trial-and-error workflows with continuous optimization.
Training GPT-style transformer models involves significant manual effort and repeated trial-and-error.
Teams often spend most of their time figuring out what works rather than building better models. The process is repetitive, requires deep expertise, and doesn’t scale efficiently.
Typical workflow includes:

• Adjusting hyperparameters or training code by hand
• Rerunning training and waiting for results
• Comparing metrics against the previous run
• Repeating until performance improves
Even small improvements can take hours or days of iteration.
AutoResearch AI removes this bottleneck by automating the experimentation loop, allowing systems to continuously test, evaluate, and improve without constant manual effort.
The core idea behind AutoResearch AI is simple but powerful: define the goal, and let the system figure out how to achieve it.
Instead of repeatedly writing and modifying training code, the process shifts to setting up the problem and letting the AI agent handle experimentation.
In practice, this involves:

• Defining the objective and the agent's instructions in program.md
• Preparing the dataset once with prepare.py
• Letting the agent iterate on train.py, the only file it is allowed to modify
Behind the scenes, the agent continuously modifies training logic, evaluates performance, and retains only the changes that improve results.
This changes the role of engineers from manually running experiments to designing systems that can optimize themselves.
AutoResearch AI is built around a simple structure where each component has a clearly defined role. This separation allows the system to automate experimentation while keeping control points stable.
The architecture revolves around three core files:
| File | Role | Editable by Agent |
|------|------|-------------------|
| program.md | Defines the objective and provides instructions for the AI agent | No (read-only) |
| prepare.py | Handles dataset preparation; runs once before training | No (read-only) |
| train.py | Contains the model architecture and training loop; updated continuously | Yes |
The key idea is that only the training logic (train.py) is modified, while the objective and data pipeline remain fixed. This ensures controlled experimentation without breaking the overall system.
AutoResearch AI operates as a continuous experimentation loop that automatically tests, evaluates, and refines model performance over time.
Instead of running experiments manually, the system takes control of the entire iteration cycle and keeps improving the model based on results.
The workflow looks like this:

1. The agent modifies train.py
2. A short training run executes on the GPU
3. val_bpb is measured on the validation set
4. Improvements are kept; regressions are reverted
5. The loop repeats from the best version so far
Each iteration builds on the previous one, allowing the system to gradually discover better configurations without human intervention.
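The loop above can be sketched in a few lines of Python. This is a toy illustration, not the repo's actual implementation: `propose_change` and `evaluate` are stand-ins for the real agent edit and training run.

```python
def propose_change(config):
    # Stand-in for the AI agent editing train.py; here we simply try a
    # smaller learning rate on each iteration.
    new = dict(config)
    new["lr"] = config["lr"] * 0.8
    return new

def evaluate(config):
    # Stand-in for a short training run returning val_bpb (lower is better).
    # In this toy, performance is best when lr is near an assumed optimum.
    return abs(config["lr"] - 0.003) + 0.55

best_config = {"lr": 0.01}
best_bpb = evaluate(best_config)
for _ in range(10):
    candidate = propose_change(best_config)
    bpb = evaluate(candidate)
    if bpb < best_bpb:
        # Keep only changes that improve val_bpb
        best_config, best_bpb = candidate, bpb
    # Otherwise the change is effectively reverted: best_config is untouched
print(round(best_bpb, 4))  # 0.5503
```

The important property is the keep-or-revert gate: every iteration starts from the best version found so far, so the loop can only ratchet downward.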
This is what makes AutoResearch AI powerful: what normally takes hours or days of manual experimentation can now run continuously in the background, often completing around 100 experiments overnight on a single GPU.
Traditional machine learning workflows rely heavily on manual experimentation, which slows down iteration and limits scalability. AutoResearch AI changes this by automating the entire experimentation loop.
Here’s how they compare:
| Feature | Manual ML Workflow | AutoResearch AI |
|---------|--------------------|-----------------|
| Experimentation | Manual | Automated |
| Speed | Slow (hours or days) | Fast (minutes per run) |
| Human Effort | High | Low |
| Optimization | Trial and error | Continuous improvement |
| Scalability | Limited | High |
| GPU Usage | Inefficient | Optimized |
The key difference is not just speed; it's how experimentation is handled. Manual workflows depend on human intuition and repeated effort, while AutoResearch AI continuously improves models through an automated feedback loop.
AutoResearch evaluates model performance using validation bits-per-byte (val_bpb), a metric that measures how efficiently the model predicts text.
In simple terms, lower val_bpb means better performance, while higher values indicate weaker predictions.
A rough interpretation, based on the runs in this article, looks like this:

• Above ~1.2: essentially unusable — in my setup this produced repeated gibberish tokens
• Around 0.55: the model starts generating coherent, structured text at this small scale
The goal during experimentation is straightforward: continuously reduce val_bpb with each iteration.
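For intuition, bits-per-byte is just the model's average negative log-likelihood converted from nats to bits and normalized per byte of text. A minimal helper (my own illustration, not code from the repo):

```python
import math

def val_bpb(total_nll_nats, total_bytes):
    """Convert summed negative log-likelihood (in nats) into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Sanity check: a model that assigns each byte probability 1/4 has an NLL
# of ln(4) per byte, which is exactly 2 bits per byte.
nll = 1000 * math.log(4)
print(round(val_bpb(nll, 1000), 4))  # 2.0
```

A uniform guess over 256 byte values would score 8.0 bits per byte, so values near 0.55 mean the model has learned a great deal of structure.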
Getting started with AutoResearch AI is straightforward. The setup involves cloning the repository, installing dependencies, preparing the dataset, and running the training loop.
```bash
# Clone the repository and install dependencies
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
pip install uv
uv sync

# Prepare the dataset (one-time)
uv run prepare.py

# Run a single training experiment (~5 minutes)
uv run train.py
```

The first step prepares the dataset, which only needs to be done once. After that, each run executes a short training cycle, allowing the system to evaluate performance and iterate quickly.
To see how AutoResearch AI performs in practice, I ran it on a real setup using a GPT-style transformer model.
Setup:

• A small GPT-style transformer model
• A single GPU
• Short training runs (~5 minutes each)

Initial Run:

• val_bpb: 1.2876 — the unoptimized baseline

After Iterations:

• val_bpb: 0.5503 after 26 runs, with 5 meaningful improvements kept
This shows how quickly the system identifies better configurations. Most of the gains happened early, with a few meaningful improvements retained over time while weaker changes were discarded.
| Stage | val_bpb | Change |
|-------|---------|--------|
| Initial run | 1.2876 | Baseline, model not optimized |
| After 2 runs | ~0.58 | Large jump from early hyperparameter changes |
| After 26 runs | 0.5503 | Steady gains, 5 meaningful improvements kept |
val_bpb Improvement Over 26 Runs:

Blue line: best val_bpb after each run. Green dots: kept improvements. Gray dots: discarded attempts.
Output after the first run:

The model produced gibberish after the first run — repeating "goodngMutj" tokens.
Output after the 26th run:

After 26 runs, the model began generating coherent and structured text:
"Once upon a time, there was a big, friendly dog named Max."
• The agent tested multiple configurations
• Around 5 meaningful improvements were identified
• Poor-performing changes were reverted
• Only beneficial updates were retained
Results log (results.tsv):

Each row is a kept improvement. val_bpb dropped from 0.585 to 0.550 across 5 commits.
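A few lines of Python are enough to pull the best run out of a log like this. The column names below are my assumption for illustration, not necessarily the repo's actual results.tsv schema:

```python
import csv
import io

# Hypothetical results.tsv contents; the real file's columns may differ.
sample = "run\tval_bpb\n1\t0.585\n2\t0.572\n3\t0.550\n"

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
best = min(rows, key=lambda r: float(r["val_bpb"]))  # lower val_bpb wins
print(best["run"], best["val_bpb"])  # 3 0.550
```

Because the log only records kept improvements, the last row is also the current best, but selecting the minimum explicitly is robust if the format ever changes.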
This shows that the system learns how to optimize training configurations over time.
After running multiple iterations, a few clear patterns stood out:

• Most of the gains came early, within the first few runs
• Only a handful of changes (about 5 of 26) were worth keeping
• Weak changes were detected and reverted quickly
Overall, the system behaves like an accelerated trial-and-error loop, except it runs continuously and filters out what doesn’t work much faster than manual experimentation.
Each experiment in AutoResearch follows a continuous cycle designed to improve performance over time.
This cycle allows the system to continuously refine the model, keeping only what works and eliminating ineffective changes.
AutoResearch AI separates responsibilities clearly across different components, allowing experimentation to run in a controlled and scalable way.
Human Role: define the objective and prepare the dataset, then step back.

AI Agent Role: modify the training logic, run experiments, and keep only the changes that improve results.

Training System: execute each short run on the GPU and report metrics back to the agent.
This separation keeps the system stable while enabling continuous optimization, making large-scale experimentation more efficient and reliable.
AutoResearch AI and AutoML tools both aim to simplify machine learning, but they solve very different problems.
While AutoML focuses on predefined pipelines for production use, AutoResearch AI is designed for open-ended experimentation and deeper control over model behavior.
Here’s how they compare:
| Feature | AutoResearch AI | AutoML Tools |
|---------|-----------------|--------------|
| Control | High (code-level access) | Limited |
| Automation | Agent-driven | Predefined pipelines |
| Flexibility | Very high | Moderate |
| Use Case | Research and experimentation | Production workflows |
The key difference is flexibility. AutoResearch AI allows full control over experimentation, while AutoML tools prioritize ease of use and standardization.
AutoResearch AI is best suited for scenarios where rapid experimentation and continuous optimization are required.
It can be used for:

• Optimizing training for transformer-based and small language models
• Rapid hyperparameter and training-code experimentation
• Research settings where continuous, unattended optimization is valuable
To get the most out of AutoResearch AI, a few foundational practices make a significant difference in how effectively the system explores and optimizes models:

• Define a clear objective and a single evaluation metric (such as val_bpb)
• Keep the data pipeline fixed so results stay comparable across runs
• Use short training runs so weak configurations fail fast

These practices ensure the system explores the right directions and avoids wasting cycles on poor configurations.
AutoResearch AI is most effective when speed, iteration, and experimentation are critical.
Use it when:

• You need to run many experiments quickly with limited human effort
• You want continuous optimization rather than one-off tuning
• You have a clear metric to optimize and a fixed dataset
AutoResearch AI represents a shift from manual experimentation to system-driven optimization.
Instead of engineers running experiments step by step, systems are now designed to explore, evaluate, and improve models continuously.
This changes the role of AI: from a tool that assists development to a system that actively participates in the research process.
AutoResearch AI is a system that automates machine learning experimentation by using an AI agent to modify, test, and optimize training configurations.
It runs a continuous loop where the agent updates training code, evaluates performance, and retains only the configurations that improve results.
It significantly reduces manual tuning effort, but still depends on well-defined objectives and evaluation metrics.
It is commonly used for transformer-based and small language models, but can be adapted to other architectures.
It is designed for experimentation and optimization, not direct production deployment.