This blog post is based on our recent paper SWE-Spot.
The community is working hard to build better coding agents. But what does “better” actually mean?
The prevailing paradigm, which we call Generalist Improving, tries to optimize the expected performance of an agent across any given codebase/repository/environment. It can be formulated as:
$$ \hat{\theta_X} = \underset{\theta}{\mathrm{argmax}}\mathop{\mathbb{E}}_{x \sim X}(P_x(\theta)) $$
where
- $\theta$ denotes agent parameters,
- $X$ denotes the set of repositories in the entire world,
- $x$ denotes any repository sampled from $X$ ,
- $P_x(\theta)$ denotes the evaluation performance metric of an agent with parameter $\theta$ in a repository $x$ ,
- $\hat{\theta_X}$ denotes the best agent parameters that Generalist Improving tries to deliver.
Now imagine that, within a limited time period and project scope, we are only working on a small group of repositories $X_0 \subset X$, i.e., all our work is confined to $X_0$. The agent performance optimization that we actually desire in this case is:
$$ \hat{\theta_{X_0}} = \underset{\theta}{\mathrm{argmax}}\mathop{\mathbb{E}}_{x \sim X_0 \subset X}(P_x(\theta)) $$
We call this case Expert Specialization: because of the assumed limited scope of our project, we do not care about the performance on $X - X_0$ at all. In this case, $\hat{\theta_{X_0}}$ (rather than $\hat{\theta_X}$) is the best set of agent parameters that we desire, and Expert Specialization tries to deliver it.
The Necessity of Expert Specialization
The paradigm of Generalist Improving uses $\hat{\theta_X}$ to approximate $\hat{\theta_{X_0}}$. However, a simple case shows the imperfection of this approximation. To construct it, we instantiate the notation above as follows. (Skip to the visualization below for an intuitive understanding.)
- $X := \{A, B\},~ X_0 := \{A\}$
- $P_A(\theta) := \mathcal{N}(\theta; \mu_A, \sigma),~ P_B(\theta) := \mathcal{N}(\theta; \mu_B, \sigma),~ \text{s.t. } |\mu_A - \mu_B| \le 2\sigma$ (this condition ensures the average of the two densities is unimodal, peaking at the midpoint)
- Generalist Improving delivers $\hat{\theta_X} = \arg\max_{\theta}\mathop{\mathbb{E}}_{x \sim \{A, B\}}(P_x(\theta)) = \arg\max_{\theta}{1\over 2}(P_A(\theta) + P_B(\theta)) = \arg\max_{\theta}(\mathcal{N}(\theta; \mu_A, \sigma) + \mathcal{N}(\theta; \mu_B, \sigma)) = {1\over 2}(\mu_A + \mu_B)$
- while Expert Specialization delivers $\hat{\theta_{X_0}} = \arg\max_{\theta} \mathcal{N}(\theta; \mu_A, \sigma) = \mu_A$
- when $\mu_A \neq \mu_B$, $\hat{\theta_X} \neq \hat{\theta_{X_0}}$, meaning that $\hat{\theta_X}$ is suboptimal for the limited scope $X_0$ even though it is optimal for balancing all cases.

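As a quick numerical sanity check of this toy example, here is a minimal sketch; the specific values of $\mu_A$, $\mu_B$, and $\sigma$ below are arbitrary illustrative choices, not from the paper:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_A, mu_B, sigma = 0.0, 1.5, 1.0        # satisfies |mu_A - mu_B| <= 2 * sigma
theta = np.linspace(-4.0, 6.0, 100_001)  # dense grid over agent "parameters"

P_A = gaussian_pdf(theta, mu_A, sigma)
P_B = gaussian_pdf(theta, mu_B, sigma)

generalist = theta[np.argmax(0.5 * (P_A + P_B))]  # argmax of the expectation over X = {A, B}
specialist = theta[np.argmax(P_A)]                # argmax of the expectation over X_0 = {A}

print(f"generalist optimum: {generalist:.3f}")  # ~ (mu_A + mu_B) / 2 = 0.75
print(f"specialist optimum: {specialist:.3f}")  # ~ mu_A = 0.0
```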
This means that, even if we could achieve perfect optimization, there exist cases where Generalist Improving does not deliver the best agent parameters that we desire: its optimization objective simply does not match our actual situation, in which we consider a specific limited scope rather than the entire scope on average.
This is essentially what the no free lunch (NFL) theorem implies; the concrete example above just makes it more intuitive.
The existence of such cases necessitates Expert Specialization.
Our paper (Section 5, Table 6) identifies the existence of such cases in reality.

- We fine-tune the same base model with SFT on SWE-bench-like agentic bug fixing, applying different data recipes.
- The second-to-last row, “Only on X”, means we train only on django and evaluate on django (pass rate 24.29), then train only on matplotlib and evaluate on matplotlib (pass rate 19.79); the same applies to sympy.
- The last row means we train on the combined dataset of django, matplotlib, and sympy (3x data).
- Comparing the last two rows, we observe that training on more repositories does not necessarily yield better results on a given single repository (e.g., performance on matplotlib degrades under combined training), which necessitates Expert Specialization.
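For clarity, the comparison protocol above can be sketched as follows; `train_sft`, `evaluate`, and the trajectory counts are hypothetical stand-ins for the paper’s actual pipeline, not its real code:

```python
# Schematic of the two data recipes compared above; the stubs only
# illustrate the protocol, not the actual training/evaluation logic.
def train_sft(base: str, data: list[str]) -> str:
    return f"{base}+sft[{len(data)} trajectories]"  # stub: returns a model tag

def evaluate(model: str, repo: str) -> str:
    return f"evaluate {model} on {repo}"  # stub: the real eval reports a pass rate

trajectories = {r: [f"{r}-traj"] * 100 for r in ("django", "matplotlib", "sympy")}

# "Only on X": one expert per repository, trained and evaluated in-repo.
for repo, data in trajectories.items():
    print(evaluate(train_sft("base", data), repo))

# Last row: a single model trained on the combined (3x) data, evaluated per repo.
combined = [t for data in trajectories.values() for t in data]
model = train_sft("base", combined)
for repo in trajectories:
    print(evaluate(model, repo))
```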
Let’s summarize Generalist Improving vs. Expert Specialization as follows:
- The role of Generalist Improving includes:
  - Achieving better optimization: without good optimization, the model can converge to a very bad state that satisfies nobody (e.g., underfitting in general). Though $\hat{\theta_X}$ (the optimal generalist) is suboptimal for $X_0$ (a specific domain), it is still likely to be decent.
    - Realistic efforts that fall in this category: larger models, better model architectures, better optimizers, better training strategies, higher data quality, etc.
  - Satisfying most users on average with decent performance: training on very few environments can overfit to certain long-tail cases and disappoint most users. By scaling the number of environments, the optimization prioritizes common cases that share similarities, strengthening them with each other.
    - Realistic efforts that fall in this category: larger datasets (e.g., SWE-Smith for coding agents).
- The role of Expert Specialization includes:
  - Pushing the boundary for certain cases beyond a merely decent performance, not settling for average: the value lies in the performance gap between $\hat{\theta_{X_0}}$ (the optimal specialist) and $\hat{\theta_{X}}$ (the optimal generalist): some possibly high-value problems can only be solved by $\hat{\theta_{X_0}}$ but not $\hat{\theta_X}$, so we try to shift $\hat{\theta_X}$ toward $\hat{\theta_{X_0}}$.
    - Realistic efforts that fall in this category: domain-specific post-training, task-specific context engineering (prompting, retrieval, …), etc. (A toy sketch of the latter follows this list.)
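To make “task-specific context engineering” concrete, here is a toy sketch of repo-level prompt specialization; the repository notes and helper function are invented for illustration, not taken from the paper:

```python
GENERIC_PROMPT = "You are a software engineering agent. Fix the reported issue."

# Hand-written, repo-specific knowledge for a repository inside X_0
# (the notes below are invented for illustration).
REPO_NOTES = {
    "django": (
        "Conventions: tests live under tests/<app_label>/; "
        "run the suite with runtests.py; prefer django.test.TestCase."
    ),
}

def build_prompt(repo: str, issue: str) -> str:
    """Compose a repo-specialized prompt, falling back to the generic one."""
    notes = REPO_NOTES.get(repo, "")
    return "\n\n".join(part for part in (GENERIC_PROMPT, notes, f"Issue:\n{issue}") if part)

print(build_prompt("django", "Ordering is dropped after combining querysets."))
```

The same generalist weights are used throughout; only the context shifts the agent’s behavior toward the target repository.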
The Potentially Higher Efficiency (Lower Cost) of Expert Specialization
Having established the necessity of Expert Specialization, let’s consider efficiency/cost.
Say that at a certain time we have a model $\theta_X$ produced by Generalist Improving. The problem is: within a limited scope $X_0$, how do we further improve its performance toward the best?
There are two ways:
- Continued Generalist Improving: We put more effort into generalist improving, resulting in a generally better model $\theta_X'$ that is very likely to also do better on the limited scope.
- Expert Specialization: We focus only on further optimizing performance on the limited scope to get an expert model $\theta_{X_0}$, regardless of out-of-scope tasks (i.e., we don’t care whether other tasks improve or degrade).
By plotting the performance distribution of a model over tasks, we can visualize the comparison between the two directions as follows.

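To make the picture concrete, here is a toy script that renders this kind of comparison; all numbers below are made up for illustration, not data from the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
tasks = np.arange(100)
in_scope = tasks < 20                         # pretend the first 20 tasks form X_0

base = rng.uniform(0.2, 0.6, size=100)        # current generalist theta_X
generalist_next = np.clip(base + 0.05, 0, 1)  # continued generalist improving: small lift everywhere
specialist = np.where(in_scope, np.clip(base + 0.25, 0, 1), base)  # expert: focused lift on X_0 only

plt.plot(tasks, base, label="current generalist")
plt.plot(tasks, generalist_next, label="continued generalist improving")
plt.plot(tasks, specialist, label="expert specialization")
plt.axvspan(0, 19, alpha=0.15, label="limited scope $X_0$")
plt.xlabel("task")
plt.ylabel("performance")
plt.legend()
plt.show()
```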
Our paper shows that there indeed exist cases where Expert Specialization has higher efficiency (lower cost) than Continued Generalist Improving. Since we only care about the limited scope, this offers a cheaper way forward. (See Section 4.2 in our paper.)

A Complementary Path to “AGI”
The lower cost of Expert Specialization possibly unlocks a complementary path to achieving AGI.
Imagine that at some point in the future the cost of Expert Specialization becomes extremely low, such that for any limited scope we can easily specialize a generalist that is not good enough on that scope into an expert that does extremely well on it. Put differently, if for any given task the effort needed to get an excellent scope-specific model from what we already have (the original generalist) approaches zero ($\text{effort} \to 0$), then we essentially have “AGI”: an “AGI” that consists of a mixture of experts, visualized below.

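A minimal sketch of what such on-demand specialization could look like as a dispatch loop; `specialize` below is a hypothetical stand-in for any low-cost specialization procedure (prompting, fine-tuning, test-time training, …), not an existing API:

```python
from typing import Callable

Model = Callable[[str], str]

def generalist(task: str) -> str:
    return f"[generalist attempt at: {task}]"

def specialize(base: Model, scope: str) -> Model:
    # Hypothetical: derive a scope-expert from the generalist at low cost.
    def expert(task: str) -> str:
        return f"[{scope}-expert solution for: {task}]"
    return expert

experts: dict[str, Model] = {}  # cache of already-built experts

def solve(scope: str, task: str) -> str:
    """Route each task to a scope-expert, building one on demand if missing."""
    if scope not in experts:
        experts[scope] = specialize(generalist, scope)  # "effort -> 0" in the limit
    return experts[scope](task)

print(solve("django", "fix the reported bug"))
print(solve("django", "add a regression test"))  # reuses the cached expert
```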
It is interesting that we may already be on this path, with both research and engineering advancements that lower the cost of specialization:
- Generalist Improving over the past several years establishes a higher starting point that reduces the cost of specialization. For example, mere prompt engineering suffices for specialization on many domains, though it is less powerful than updating model weights.
  - Therefore, we may be interested in when prompt engineering (or any other strategy) works and when it does not, and try to solve the latter cases. Put differently, starting from the same good generalist, some specialization strategies work better than others. (Section 4.3 in our paper explores this a bit.)
- Tinker API removes the hassle of fine-tuning-based Expert Specialization on the infra side.
- Test-Time Training and Nested Learning enable real-time, in-weights/parametric model specialization.
- TTT-Discover demonstrates a concrete example, showing that in a highly specialized setting, environment-specific model specialization can achieve superior performance at relatively low cost, surpassing generalist-based approaches.
- …
To conclude, “AGI” may be realized not with a single static model, but with an (elastic) model combined with on-demand low-cost specialization (artificial elastic intelligence?). Of course, this has several requirements, including but not limited to:
- The generalist needs to be “decent”, otherwise the cost of specialization can be very high. (This is the purpose of pre-training/pre-trained models.) The prevailing path of scaling up is needed to establish this foundation.
- Conversely, the cost of specialization should be scaled down so that it is quite low (while keeping the efficacy), or at least so that the relative cost is affordable given the value of the problems we can additionally solve (for higher-value problems, we can afford higher costs). This includes time cost, compute cost, engineering cost (e.g., building domain-specific environments that support the specialization process), etc.
- The generalist is “plastic”: it is easy to shift its distribution toward an expert, in contrast to a “stubborn” generalist that is hard to change.
In summary, this blog presents the motivations, background, and thoughts behind our paper, going a bit broader than the paper itself. Many of these ideas already exist, or exist in similar forms, in the broader ML area, just not always noticed, such as the no free lunch (NFL) theorem and the loss of plasticity. We try to connect such ideas to (coding) agents and bring these topics to the community’s attention.
Acknowledgement: Thanks to all co-authors of SWE-Spot for their contributions, and to Zilin, Weiliang, Peihan, and Chenxi for the valuable discussions.
@misc{peng2026swespotbuildingsmallrepoexperts,
title={SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning},
author={Jinjun Peng and Magnus Saebo and Tianjun Zhong and Yi-Jie Cheng and Junfeng Yang and Baishakhi Ray and Simin Chen and Yangruibo Ding},
year={2026},
eprint={2601.21649},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.21649},
}
@misc{peng2026specialization,
author={Jinjun Peng},
title={Rethinking the Established Trend on Training (Coding) Agents},
url={https://research.co1in.me/posts/swespot/},
month={2},
year={2026},
}