Turn Agent Trajectories Into Long-Context Training Data

Weak long-context reasoning is a blocker for serious agent work. Improving it normally means curating long documents or synthesizing contexts with heuristics, and both paths are expensive. A new technique called Agent Context Compilation (ACC) cuts that cost by recycling data you already have: agent trajectories.

The core insight is simple. When an agent solves a problem across many turns, it accumulates tool calls and environment observations that together contain the evidence needed to answer the original question. Standard supervised fine-tuning ignores most of this. It masks tool responses and trains only on turn-level tool selection. The scattered signals across turns never receive direct supervision.
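To make the blind spot concrete, here is a minimal sketch of the kind of loss mask standard agent fine-tuning applies. The `Message` type, the whitespace "tokenizer", and the toy trajectory are all illustrative assumptions, not details from the ACC paper; real pipelines mask at the subword-token level.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str          # "user", "assistant", or "tool"
    tokens: list[str]  # toy whitespace tokens, an assumption for illustration


def loss_mask(messages: list[Message]) -> list[int]:
    """Return 1 for tokens that contribute to the loss, 0 for masked tokens."""
    mask: list[int] = []
    for message in messages:
        # Only the agent's own turns are supervised; the question and
        # tool responses are context the model reads but is not trained on.
        supervised = 1 if message.role == "assistant" else 0
        mask.extend([supervised] * len(message.tokens))
    return mask


# Hypothetical three-turn trajectory.
trajectory = [
    Message("user", "who wrote the report ?".split()),
    Message("assistant", "call search report".split()),
    Message("tool", "doc 1 : the report was drafted by Lee in 2019 ...".split()),
    Message("assistant", "call open doc-1".split()),
    Message("tool", "full text : Lee , with edits by Kim ...".split()),
    Message("assistant", "answer : Lee".split()),
]

mask = loss_mask(trajectory)
supervised_fraction = sum(mask) / len(mask)
print(f"{sum(mask)} of {len(mask)} tokens supervised ({supervised_fraction:.0%})")
```

Even in this tiny example most tokens sit in masked tool responses, and in real trajectories tool output is usually far longer than the agent's own turns.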

ACC fixes that supervision blind spot. It takes trajectories from search, software engineering, and database querying agents and converts them into long-context QA pairs. Each pair combines the original question with all the tool responses and observations gathered across turns. The model is then trained to answer directly, without tool use. This makes the dependency between the question and the evidence explicit, which is exactly what long-context reasoning requires.
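A rough sketch of the compilation step follows. The `Step` and `Trajectory` types, the prompt layout, and the choice of the trajectory's final answer as the training target are assumptions made for illustration; the paper's exact format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool_call: str    # what the agent requested (dropped during compilation)
    observation: str  # what the tool or environment returned


@dataclass
class Trajectory:
    question: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""


def compile_to_qa(traj: Trajectory) -> dict[str, str]:
    """Flatten one multi-turn trajectory into a single tool-free QA example."""
    # Concatenate every observation in the order the agent saw it.
    evidence = "\n\n".join(
        f"[Observation {i}]\n{step.observation}"
        for i, step in enumerate(traj.steps, start=1)
    )
    # Assumed layout: evidence first, then the original question.
    prompt = f"{evidence}\n\nQuestion: {traj.question}\nAnswer:"
    # Assumed target: the answer the agent eventually reached.
    return {"prompt": prompt, "completion": " " + traj.final_answer}


# Hypothetical trajectory from a search agent.
example = Trajectory(
    question="Who drafted the 2019 report?",
    steps=[
        Step("search('2019 report')", "doc-1: The 2019 report was drafted by Lee."),
        Step("open('doc-1')", "Full text: drafted by Lee, with edits by Kim."),
    ],
    final_answer="Lee",
)

pair = compile_to_qa(example)
print(pair["prompt"])
```

Because the compiled example contains no tool calls, the loss falls on an answer that depends on evidence spread across the whole context, rather than on per-turn tool selection.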

The method is model-agnostic and composable. It can be combined with any existing long-context extension or training method, and it requires no additional annotation beyond what the trajectory already contains.

The reported numbers are hard to ignore. Training Qwen3-30B-A3B with ACC reaches 68.3 on MRCR (up 18.1 points) and 77.5 on GraphWalks (up 7.6 points). These benchmarks test cross-turn coreference resolution and graph traversal over extended contexts, exactly the skills agents need. Those scores are comparable to Qwen3-235B-A22B, a model with more than seven times as many total parameters. General capabilities on GPQA, MMLU-Pro, AIME, and IFEval are preserved, so ACC does not trade broad competence for long-context gains.

The paper's mechanism analysis adds useful signal for builders. ACC-trained models show task-adaptive attention restructuring and expert specialization. That suggests the model is not just memorizing; it is learning to route and attend differently depending on what the task demands.

For product engineers building agent systems today, the implication is direct. If you are already running agents in production, you are generating trajectories. Those trajectories are training data you are not using. The ACC approach gives you a concrete pipeline: collect trajectories, compile them into long-context QA pairs, and fold that data into your fine-tuning runs. If the reported results carry over to your setting, you get long-range reasoning improvements on a smaller model, at a fraction of the cost of curating new long documents from scratch.
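The last step of that pipeline, folding compiled pairs into an existing fine-tuning set, might look like the sketch below. The `mix_datasets` helper and its `compiled_fraction` knob are hypothetical; the article does not specify a mixing ratio, so treat the default as a placeholder to tune.

```python
import random


def mix_datasets(base_examples, compiled_examples, compiled_fraction=0.2, seed=0):
    """Blend compiled long-context QA pairs into an existing fine-tuning set.

    compiled_fraction is the target share of compiled examples in the output.
    It is a hypothetical knob, not a value taken from the ACC paper.
    """
    if not 0.0 <= compiled_fraction < 1.0:
        raise ValueError("compiled_fraction must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    # How many compiled examples are needed to hit the target share,
    # capped by how many are actually available.
    wanted = round(len(base_examples) * compiled_fraction / (1.0 - compiled_fraction))
    take = min(wanted, len(compiled_examples))
    mixed = list(base_examples) + rng.sample(list(compiled_examples), take)
    rng.shuffle(mixed)
    return mixed


# Placeholder records standing in for real training examples.
base = [{"source": "existing_sft", "id": i} for i in range(80)]
compiled = [{"source": "acc_compiled", "id": i} for i in range(50)]

mixed = mix_datasets(base, compiled, compiled_fraction=0.2)
n_compiled = sum(1 for example in mixed if example["source"] == "acc_compiled")
print(f"{n_compiled} compiled examples in a mix of {len(mixed)}")
```

Keeping the existing data in the mix is one plausible way to protect general capabilities while adding the long-context signal, though the right ratio is an empirical question.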