15 min read · December 8, 2025 · Models

Exploiting parallel tool calls to make agentic search 4x faster

Mitchell Gilmore
Founding Engineer
Eitan Borgnia
Co-founder, COO

Today we're releasing Fast Agentic Search (FAS), a code-specific subagent trained with RL to quickly search through codebases for files relevant to a user request.

FAS uses multi-step reasoning along with parallel executions of view, grep, and bash tools to simultaneously explore several file chains. At the end, it calls a report_back tool to package the results of the search into a minimal set of files the main agent can use as context for implementing the desired changes.

This is the first in our line of small agents co-optimized with infrastructure. You can run FAS on any Relace Repo with relace.repo.search(repoId, { query }), which uses our prebuilt agent harness designed to handle parallel tool calls with low overhead. We also provide a public endpoint you can use to call the model from within your own agent framework.

In this blog post, we do a deep dive on how we trained the FAS model, which was inspired by Cognition's SWE-grep model.

RAG vs. Agentic Search

For months, there's been debate about whether to use RAG or agentic search for codebase retrieval.

RAG used to be the only option when models had limited context windows. You split your codebase into small chunks at file/function boundaries, compute a vector embedding for each chunk, and store them in a database. Given a user query, a fast vector similarity search finds the most relevant sections of code and passes that context to the model.
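
For illustration, a bare-bones version of that pipeline might look like the sketch below. The chunking scheme, embedding model, and index are placeholders rather than any particular production stack:

# Minimal RAG retrieval sketch: chunk, embed, index, retrieve. Names are illustrative.
import numpy as np

def chunk_file(path: str, text: str, max_lines: int = 40):
    # Split a file into line-bounded chunks; real systems split at function boundaries.
    lines = text.splitlines()
    return [(path, "\n".join(lines[i:i + max_lines])) for i in range(0, len(lines), max_lines)]

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for a code-specific embedding model; returns unit-norm vectors.
    vecs = np.random.rand(len(texts), 768)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(query: str, chunks: list, chunk_vecs: np.ndarray, k: int = 10):
    # Single-shot retrieval: cosine similarity between the query and every chunk.
    scores = chunk_vecs @ embed([query])[0]
    return [chunks[i] for i in np.argsort(-scores)[:k]]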

However, as context windows expanded and models got better at multi-step reasoning, the limitations of RAG became more apparent. Even with code-specific embedding/rerank models, getting all the relevant files in a single shot is challenging.

The Claude Code team, led by Boris Cherny, was the first to deviate completely from the standard RAG setup by implementing what they called agentic search. Instead of using a separate system to curate context, the model finds its own context in the codebase by using command line view and grep tools.

The team found that this improved performance, but with a tradeoff:

"At the cost of latency and tokens, you now have really awesome search."

If you've tried Claude Code, you've probably seen why. The model works sequentially, methodically reasoning through what sections of the codebase are relevant.

Sequential operations are particularly slow for tool-calling models. Not only do you pay network latency to reach the inference server on every turn, but each tool call must also return its results before the model can move on to the next turn.

FAS addresses this tradeoff with parallelism. The model retains accuracy via its step-by-step reasoning, but multiple search paths are executed at once using parallel tool calls.
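
To make the difference concrete, here is a minimal sketch of executing one turn's worth of tool calls concurrently rather than one at a time. The tool implementations, handler registry, and message fields are illustrative assumptions, not the actual FAS harness:

# Sketch: run all tool calls from one turn concurrently with asyncio.
import asyncio

async def grep_search(pattern: str, path: str = ".") -> str:
    # Run grep in a subprocess without blocking the event loop.
    proc = await asyncio.create_subprocess_exec(
        "grep", "-rn", pattern, path,
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.DEVNULL)
    out, _ = await proc.communicate()
    return out.decode()[:4000]  # truncate long output before returning it to the model

async def view_file(path: str) -> str:
    with open(path) as f:
        return f.read()[:4000]

TOOL_HANDLERS = {"grep_search": grep_search, "view_file": view_file}

async def run_turn(tool_calls: list[dict]) -> list[dict]:
    # Sequential execution costs the sum of tool latencies; gather costs roughly the max.
    async def run_one(call: dict) -> dict:
        result = await TOOL_HANDLERS[call["name"]](**call["arguments"])
        return {"tool_call_id": call["id"], "content": result}
    return await asyncio.gather(*(run_one(c) for c in tool_calls))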

F1 scores and latency results

Comparison of F1 score vs. latency across RAG and agentic search approaches. FAS achieves 4x latency reduction while maintaining accuracy close to Claude 4.5 Sonnet.

Isolating the Search Task

Of course, a specialized model like FAS is only useful if you can actually separate search from the rest of the agentic coding task.

To measure this separability, we used a coding agent to solve a set of ~1200 coding problems and looked for the first "non-search" tool call in each trace, i.e. a file edit tool or some kind of bash-based testing script.

We found that, at the median, search accounts for 56.6% of all the tokens ingested by the model over the entire trace.
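
A simplified version of that measurement, assuming each trace is a chronological list of messages annotated with a tool name (or None) and an input token count, might look like:

# Sketch: fraction of ingested tokens spent before the first "non-search" tool call.
SEARCH_TOOLS = {"view_file", "view_directory", "grep_search"}  # assumed tool names

def search_token_fraction(trace: list[dict]) -> float:
    total = sum(m["input_tokens"] for m in trace)
    search_tokens = 0
    for m in trace:
        if m.get("tool_name") and m["tool_name"] not in SEARCH_TOOLS:
            break  # first edit/test tool call ends the search phase
        search_tokens += m["input_tokens"]
    return search_tokens / total if total else 0.0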

Search Percentage PDF

Distribution of percentage of tokens spent on search across 1200 coding tasks produced from real vibe coding requests.

Many of the tokens encountered during search also turn out to be irrelevant. Framing context collection as a separate task with search --> filter --> report back steps prevents context pollution.

Taken together, these results motivated a focused effort to train a small, specialized model for search that could operate faster and at lower cost.

Constructing the RL Environment

Prior to FAS, we had only trained models using supervised fine-tuning (SFT) on datasets consisting of extremely high quality input/output pairs.

This was entirely "off-policy", i.e. the data we collected was not sampled from the base model we were training.

Because agentic tasks require long sequences of tool calls with many different paths to success, the models that perform them need to be very flexible. We had observed that models trained with off-policy SFT were very consistent but fragile to distribution shifts and prompt changes.

For low-entropy tasks like Fast Apply, where there is almost always one definitively correct answer and we could sample from the full distribution of production requests, this fragility wasn't really a problem.

To train FAS, we decided it was necessary to set up an "on-policy" reinforcement learning pipeline instead. Here's how we set up the environment:

Defining a Search Example

For FAS, each data point consists of an initial repo state stored as a Relace Repo along with a coding task defined by some user prompt.

We collect these repo/prompt pairs from (1) data partnerships with prompt-to-app companies and (2) pull requests scraped from open source GitHub repos.

The advantage of (1) is the data is sampled directly from the distribution we expect to see in production for our customers. The downside is we need to synthetically construct a ground truth.

To do this, we use a coding agent to implement the task and parse the traces to get a list of files edited by the agent and a list of files it viewed.

With (2) the situation is flipped. We are able to produce an ironclad ground truth by taking the files actually edited by human developers in the pull request. However, we use an LLM to translate the comment history of the pull request into a realistic, complete user prompt that describes the task. For this set of data, we do not have an analog to "viewed" files.

Agent Harness

We run FAS inside a fixed agent harness that exposes a small, purpose-built toolset.

The harness equips the base model with five tools: view_file, view_directory, grep_search, bash, and report_back.

The prompt instructs the model to explore the repo using these tools and then return its findings via the report_back tool with a structured format:

{
  "explanation": "reasoning for file relevance",
  "files": {
    "path/to/file1.py": [
      [10, 25],
      [40, 55]
    ],
    "path/to/file2.js": [[1, 30]]
  }
}

Each file path maps to a list of line ranges [start, end] that are relevant to the task.
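
Downstream code can treat the report as a mapping from file paths to line ranges. A minimal validator for that payload (our production harness may differ) is sketched below:

# Sketch: validate a report_back payload of the shape shown above.
def validate_report(report: dict) -> dict[str, list[tuple[int, int]]]:
    assert isinstance(report.get("explanation"), str), "missing explanation"
    files: dict[str, list[tuple[int, int]]] = {}
    for path, ranges in report.get("files", {}).items():
        parsed = []
        for start, end in ranges:
            assert 1 <= start <= end, f"bad range [{start}, {end}] in {path}"
            parsed.append((start, end))
        files[path] = parsed
    return files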

Using the exact same harness at training and inference time is critical for performance. Any mismatch between the tool interface seen during training and the one used in production can degrade behavior.

We document the full harness and prompt in detail here.

The Reward Function

After reading the literature on deep research model training, we decided to evaluate the FAS model's performance using the F score of the report_back output, i.e. the harmonic mean of precision and recall.

However, we made some key modifications to better suit the nature of our data set. The challenge is that the ground truth for "relevant files" varies by data source:

Edited, Viewed Files Venn Diagram

Hierarchy of code relevance. Edited files are always relevant, viewed files are sometimes relevant, and the remaining codebase is irrelevant.

Edited files are necessary but not always sufficient — e.g. a utils file may be relevant for a request as a reference but not actually modified. Viewed files are sufficient but not always necessary — e.g. the agent might view a file by chance that is not actually related to the request at all.

For GitHub data (2), we only have edited files as ground truth. The model might correctly identify relevant reference files that weren't edited, but these would be marked as false positives. To account for this imperfect ground truth, we use an F₂ score that weights recall 2x more than precision:

$$A_{2} = \frac{5 P_h R_h}{4 P_h + R_h}$$

where $P_h$ and $R_h$ are "hard" precision and recall computed against edited files. This de-emphasizes precision since we expect some false positives to actually be correct.

For agent trace data (1), we have both edited and viewed files. Viewed files include noisy exploration, so using them for ground truth would incorrectly penalize the model. Instead, we use a hybrid F₁ score with "soft" precision computed against viewed files (more lenient) and "hard" recall against edited files (ensures we find what's necessary):

$$A_{1} = \frac{2 P_s R_h}{P_s + R_h}$$

where $P_s$ is precision computed on viewed files and $R_h$ is recall on edited files. This avoids penalizing correct predictions that happened to also be viewed during exploration.
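
In code, the two accuracy terms reduce to set operations over file paths. The sketch below assumes file-level matching between predictions and ground truth, ignoring the line-range granularity of the report for simplicity:

# Sketch of the two accuracy terms, computed over sets of file paths.
def a2_github(predicted: set[str], edited: set[str], beta: float = 2.0) -> float:
    # F_beta against edited files only; beta = 2 weights recall twice as much as precision.
    if not predicted or not edited:
        return 0.0
    p_h = len(predicted & edited) / len(predicted)
    r_h = len(predicted & edited) / len(edited)
    denom = beta ** 2 * p_h + r_h
    return (1 + beta ** 2) * p_h * r_h / denom if denom else 0.0

def a1_trace(predicted: set[str], edited: set[str], viewed: set[str]) -> float:
    # Hybrid F1: "soft" precision against viewed files, "hard" recall against edited files.
    if not predicted or not edited:
        return 0.0
    p_s = len(predicted & viewed) / len(predicted)
    r_h = len(predicted & edited) / len(edited)
    denom = p_s + r_h
    return 2 * p_s * r_h / denom if denom else 0.0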

The second critical component is a parallelism penalty. Though the base model was equipped with instructions for parallel tool calling, it more frequently performed the search serially. Our hypothesis was that we could parallelize the search with 4-12 simultaneous tool calls per turn and significantly drop the total number of turns while maintaining performance.

To encourage fewer turns while being robust to reward hacking, we applied a penalty that has no effect for $n \leq 4$ turns but decays linearly to zero at $n = 24$ turns:

$$P = \max\left(0, \frac{n_{\text{max}} - n}{n_{\text{max}} - n_{\text{min}}}\right) \quad \text{where } n_{\text{min}} = 4,\ n_{\text{max}} = 24$$

Combining these components, the final reward is:

$$R = A \cdot P \quad \text{where} \quad A = \begin{cases} A_{1} & \text{on dataset (1)} \\ A_{2} & \text{on dataset (2)} \end{cases}$$
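
Putting the pieces together, the combined reward can be sketched as follows. In this sketch the penalty is clipped at 1 so that rollouts using fewer than four turns earn no extra credit, consistent with the "no effect for n ≤ 4 turns" behavior described above:

# Sketch of the parallelism penalty and the combined reward.
def turn_penalty(n_turns: int, n_min: int = 4, n_max: int = 24) -> float:
    # 1.0 for n_turns <= n_min, linear decay to 0.0 at n_max, clipped to [0, 1].
    return max(0.0, min(1.0, (n_max - n_turns) / (n_max - n_min)))

def reward(accuracy: float, n_turns: int) -> float:
    # accuracy is A1 (agent-trace data) or A2 (GitHub PR data) from the sketch above.
    return accuracy * turn_penalty(n_turns)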

Training the Model

Setting up the reinforcement learning pipeline to actually have flexibility along the axes we cared about for the search problem took ~1 month.

Open source libraries like verl are helpful, but things get challenging as soon as you decide to go off-road. We may eventually release a blog post on the specifics of our setup along with the experimentation we did around hyperparameter tuning for search, but we'll omit it here for brevity.

In the end, we built on top of verl and used Nebius for access to 8xH200 nodes. Training took ~1.5 days on a single node for 3 epochs on 3.2k very high quality data points.

Below you can see the progression of the reward function along with the number of generation turns required during training:

Reward function and number of generation turns

Training progression showing reward function improvement and reduction in generation turns over 3 epochs.

Emergent Reasoning

The initial prompting of the model can have a significant effect on the dynamics of training, especially with multi-faceted reward functions (also see Anthropic's post).

In early trials of training FAS, we didn't explicitly encourage the model to reason. For these runs, initial reward gains were delivered almost entirely by leaning into parallelization of the tool calls with no accuracy gains. The model performed no reasoning in this initial phase.

Once the number of turns stabilized below $n_{\text{max}}$, however, the model could no longer extract additional reward from parallelism alone. This resulted in a temporary stall in reward. After this plateau, accuracy began to improve, and reward increased again.

Interestingly, when we inspected the training traces from this later phase, we observed the model begin to insert explicit reasoning steps between bursts of parallel tool calls.

This supports the intuition from Boris and the Claude Code team that agentic search outperforms RAG because it's multi-shot — the model can iteratively refine its judgement on file relevance by reasoning about information collected from other files.

For later RL runs, we updated the prompt to encourage reasoning up front. This made the reward increase more smoothly throughout training.

Evaluating the Model

For the evaluations presented in Figure 1, we held out a total of 889 data points to fairly compare against other models on the market — 465 of these data points came from dataset (1) and 424 came from dataset (2).

On both tasks, we significantly dropped the number of turns: from 20 --> 5 on (1) and from 10 --> 4 on (2). This amounted to an over 4x reduction in end-to-end latency in both cases while retaining accuracy comparable to models like Claude 4.5 Sonnet!

Because the search task is prefill-heavy, with a 15:1 input/output ratio, we found output TPS to be much less important than reducing the number of generation turns and agent harness overhead.

The above speed-ups were achieved using a standard inference engine running at ~200 TPS.

F1 scores and latency results

Comparison of F1 score vs. latency across RAG and agentic search approaches. FAS achieves 4x latency reduction while maintaining accuracy close to Claude 4.5 Sonnet.

From the plots we can also see that there is a clear tradeoff here between accuracy and speed. RAG is fully parallelizable and very fast, but it falls short in terms of accuracy compared to agentic search.

SWE-Bench Experiments

Finally, we wanted to measure the downstream performance of actually integrating FAS into a production system. In this setup, FAS acts as a subagent that passes its collected context to a central "Oracle" agent.

Motivation

Our goal for these experiments was twofold:

  • Retain or improve performance of the agent with FAS
  • Reduce overall generation time

The benchmarks in the previous section were good at evaluating the latency gains, but there was no way of making sure the downstream accuracy would be preserved.

To check this, we ran both setups (FAS + Oracle and the standard Oracle alone) on SWE-Bench Verified and recorded both accuracy and latency.

Context hand-off

Prior to running the full 500 examples, we tried various hand-off strategies between FAS and the Oracle to minimize duplication of effort (re-viewing the same files) and performance regressions.

Interestingly, compressing the FAS trace to retain the reasoning steps along with the actual relevant file sections helped the Oracle agent better contextualize the results. We ended up using this method for the full experiment.
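
A rough sketch of that hand-off, assuming access to the FAS reasoning snippets, the final report_back payload, and a helper that reads line ranges from the repo, could look like:

# Sketch: compress a FAS trace into a context block for the Oracle agent.
def build_handoff(reasoning_steps: list[str], report: dict, read_lines) -> str:
    # read_lines(path, start, end) is assumed to return the requested lines as a string.
    parts = ["Search subagent findings:"]
    parts += [f"- {step}" for step in reasoning_steps]
    parts.append("\nRelevant file sections:")
    for path, ranges in report["files"].items():
        for start, end in ranges:
            parts.append(f"\n{path} (lines {start}-{end}):")
            parts.append(read_lines(path, start, end))
    return "\n".join(parts)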

You can read more about the exact hand off instructions in our documentation.

Results

Below are the results of the experiment:

Strategy          Median Latency   Accuracy   Total Tokens
FAS + Oracle      7m 1s            71.4%      890M
Standard Oracle   7m 44s           72.0%      1.03B

The FAS integration reduces median latency by 9.3% (43 seconds) and cuts total token usage by 13.6% (140M tokens) while maintaining comparable accuracy.

At a glance, this improvement seems low given that FAS delivers a ~4× latency reduction for search itself.

However, we found that for SWE-bench traces, the initial search accounts for only ~12% of the total tokens in the full agent trajectory. The problem statements are rich in detail and often directly state the relevant files.

This stands in sharp contrast to our 1200 real-world vibe-coding traces, where users give vague instructions that lead to almost 60% of the trace being search. We expect the gains to be much larger in these production settings.

Thus, the key takeaway is we can cleanly separate search from the rest of the coding task without regressing downstream performance.

Try the Model

FAS is now available on top of Relace Repos with our optimized agent harness. Try it out on our playground, and let us know what you think!

We've also released an OpenAI-compatible endpoint on both the Relace platform and OpenRouter, priced at $1/million input tokens and $3/million output tokens.
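
Because the endpoint is OpenAI-compatible, any OpenAI-style client should work. The base URL, model id, and tool schema below are placeholders; check the documentation for the real values:

# Sketch: calling the FAS endpoint with an OpenAI-compatible client.
from openai import OpenAI

client = OpenAI(base_url="https://api.relace.ai/v1", api_key="YOUR_RELACE_API_KEY")  # placeholder URL

tools = [{  # placeholder schema; supply your own view/grep/bash/report_back tools
    "type": "function",
    "function": {
        "name": "grep_search",
        "description": "Search the repository for a regex pattern.",
        "parameters": {
            "type": "object",
            "properties": {"pattern": {"type": "string"}},
            "required": ["pattern"],
        },
    },
}]

response = client.chat.completions.create(
    model="relace-fast-agentic-search",  # placeholder model id
    messages=[{"role": "user", "content": "Where is rate limiting implemented in this repo?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)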

This is our first version, which we hope to improve even further with feedback from the community!

We're Hiring

If you have gotten this far, chances are you found this interesting!

We’re hiring pragmatic researchers (Physics/Math/CS/ML) and exceptional engineers to ship models like this that real product teams rely on. Check out our careers page, and join us!

Get Started in Minutes

Try out our playground, and start with our free tier to test Relace models in your application.
