Quantifying Token Use in Agentic Software Engineering: Where Costs Lie
LLM-based Multi-Agent (LLM-MA) systems automate complex software tasks, but their token consumption, and thus costs, are poorly understood. New research analyzing the ChatDev framework with GPT-5 reveals that the iterative Code Review stage consumes a striking 59.4% of tokens, with input tokens making up 53.9% of total consumption. This indicates that the primary cost in agentic software engineering lies in refinement and verification, not initial generation, offering crucial insights for cost prediction and workflow optimization.

The landscape of software development is rapidly evolving with the advent of LLM-based Multi-Agent (LLM-MA) systems. These sophisticated agents are increasingly deployed to automate complex tasks across the Software Development Life Cycle (SDLC), from initial requirements engineering and code generation to rigorous testing and documentation. While the promise of such automation is immense, a critical challenge looms: our understanding of their operational efficiency and resource consumption remains surprisingly poor. This lack of clarity about where and how tokens are consumed translates directly into unpredictable costs and environmental impacts, hindering the broader practical adoption of these powerful systems.
The Unseen Costs of Agentic Software Development
For developers integrating or building with LLM-MA systems, the "black box" nature of token usage can be a significant impediment. Without insights into the tokenomics – the distribution and economics of token consumption – it's incredibly difficult to accurately predict operational expenses, optimize workflows, or even justify the large-scale deployment of these agents. Traditional software engineering has well-understood cost models; agentic software engineering, by contrast, introduces a new, opaque dimension of resource expenditure that needs to be illuminated.
A Framework for Token Analysis in the SDLC
To address this critical gap, recent research has introduced a novel methodology for analyzing token consumption patterns within LLM-MA systems operating across the SDLC. The approach involved observing execution traces from 30 distinct software development tasks, all performed by the ChatDev framework leveraging a GPT-5 reasoning model. A crucial step was mapping ChatDev's internal operational phases to a standardized set of development stages familiar to any software engineer:
- Design
- Coding
- Code Completion
- Code Review
- Testing
- Documentation
By categorizing token usage across these well-defined stages, the researchers were able to quantify and compare the distribution of three token types: input tokens (tokens fed into the LLM), output tokens (tokens generated by the LLM), and reasoning tokens (tokens representing the LLM's internal thought processes or intermediate computations). This granular analysis shows how much of the total each phase actually consumes, and in what form.
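To make the bookkeeping concrete, here is a minimal sketch of how per-call token counts could be rolled up by SDLC stage and token type. It is not the study's code. The phase names in `PHASE_TO_STAGE` are modeled on ChatDev-style phase names, but the mapping itself, the trace record layout, and the assumption that the three token counts are disjoint are all illustrative.

```python
from collections import defaultdict

# Hypothetical mapping from framework phases to SDLC stages.
# This is an assumption for illustration, not the study's published mapping.
PHASE_TO_STAGE = {
    "DemandAnalysis": "Design",
    "LanguageChoose": "Design",
    "Coding": "Coding",
    "CodeCompleteAll": "Code Completion",
    "CodeReviewComment": "Code Review",
    "CodeReviewModification": "Code Review",
    "TestErrorSummary": "Testing",
    "TestModification": "Testing",
    "EnvironmentDoc": "Documentation",
    "Manual": "Documentation",
}

# Assumed to be reported as three disjoint counts per LLM call.
TOKEN_TYPES = ("input", "output", "reasoning")


def tokens_by_stage(trace):
    """Sum token counts per stage and token type.

    `trace` is a list of dicts such as
    {"phase": "Coding", "input": 100, "output": 50, "reasoning": 50}.
    Phases missing from the mapping are grouped under "Unmapped".
    """
    totals = defaultdict(lambda: dict.fromkeys(TOKEN_TYPES, 0))
    for call in trace:
        stage = PHASE_TO_STAGE.get(call["phase"], "Unmapped")
        for token_type in TOKEN_TYPES:
            totals[stage][token_type] += call.get(token_type, 0)
    return dict(totals)


def stage_shares(totals):
    """Return each stage's fraction of all tokens in the run."""
    grand_total = sum(sum(counts.values()) for counts in totals.values())
    if grand_total == 0:
        return {stage: 0.0 for stage in totals}
    return {
        stage: sum(counts.values()) / grand_total
        for stage, counts in totals.items()
    }
```

Feeding the output of `tokens_by_stage` into `stage_shares` yields the kind of per-stage percentage breakdown discussed in this article; the same totals can be summed per token type to get the input, output, and reasoning split.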
Surprising Findings: The Dominance of Refinement
Preliminary findings from this analysis offer compelling insights that challenge common assumptions about where LLM-MA costs are incurred:
- Code Review is the Major Token Sink: Strikingly, the iterative Code Review stage was found to account for the majority of token consumption, averaging 59.4% of all tokens used across the entire development process. This suggests that the back-and-forth, analytical, and refining nature of code review demands disproportionately more tokens than any other stage.
- Input Tokens Lead the Charge: Consistently, input tokens constituted the largest share of overall token consumption, averaging 53.9%. This finding points to potential inefficiencies in how agents collaborate, possibly by transmitting extensive or redundant contextual information repeatedly.
These results underscore a crucial point: the primary cost driver in agentic software engineering isn't the initial generation of code or other artifacts. Instead, it resides in the subsequent, iterative processes of automated refinement and verification. This shifts the focus from optimizing initial creation to making the iterative feedback loops and contextual exchanges more efficient.
Practical Takeaways for Developers and Architects
For software developers and architects working with LLM-MA systems, these findings offer immediate practical implications:
- Cost Prediction: The methodology provides a framework for more accurately predicting expenses associated with agentic development workflows, enabling better budgeting and resource allocation.
- Workflow Optimization: By identifying Code Review as the most token-intensive stage, developers can strategically target this phase for optimization. This might involve refining agent prompts to be more concise during review cycles, implementing smarter context management to reduce redundant input, or exploring alternative communication protocols that minimize token exchange.
- Agent Design: The emphasis on input tokens suggests a need for agent designs that are more discerning about the context they transmit. Future agent architectures could focus on summarizing information, identifying key changes, or employing more efficient data structures to represent conversational history, thereby reducing the burden of input tokens.
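As a rough illustration of the cost-prediction point, the sketch below turns a token total and an input-token share into a cost estimate. The 53.9% default comes from the reported average; everything else is an assumption, including the function name, the per-million-token prices passed in by the caller, and the simplification that all non-input tokens (output plus reasoning) are billed at the output rate.

```python
def estimate_cost(total_tokens, price_in_per_m, price_out_per_m, input_share=0.539):
    """Estimate the cost of a run from its total token count.

    price_in_per_m / price_out_per_m are caller-supplied prices per one
    million tokens (hypothetical values; use your provider's real rates).
    Assumption: every non-input token is billed at the output rate.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")

    input_tokens = total_tokens * input_share
    other_tokens = total_tokens - input_tokens

    input_cost = input_tokens * price_in_per_m / 1_000_000
    other_cost = other_tokens * price_out_per_m / 1_000_000
    return {
        "input_cost": input_cost,
        "other_cost": other_cost,
        "total_cost": input_cost + other_cost,
    }
```

For a real budget, the average share is only a starting point: replacing it with shares measured on your own traces, per stage, should give a tighter estimate.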
Charting a Course for Token-Efficient Agents
The research not only quantifies current inefficiencies but also provides a clear direction for future innovation. Developing more token-efficient agent collaboration protocols is paramount. This could involve exploring new LLM architectures optimized for iterative tasks, advanced techniques for summarizing and filtering agent communications, or dynamic context window management that adapts to the specific needs of each SDLC stage. Ultimately, understanding the tokenomics of agentic software engineering is crucial for building sustainable, cost-effective, and widely adoptable AI-driven development tools.
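One way to picture stage-aware context management is a per-stage token budget that keeps only the most recent messages that fit. The sketch below is an illustration under stated assumptions, not a technique taken from the study: the budgets are made-up numbers, and `approx_tokens` is a crude whitespace word count standing in for a real tokenizer.

```python
# Hypothetical per-stage budgets, in approximate tokens.
STAGE_BUDGETS = {"Code Review": 50, "Testing": 30}
DEFAULT_BUDGET = 40


def approx_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages, stage, budgets=STAGE_BUDGETS):
    """Keep the newest messages whose combined size fits the stage budget.

    Walks the history from newest to oldest and stops at the first
    message that would exceed the budget, so the kept messages are
    always a contiguous, most-recent slice in original order.
    """
    budget = budgets.get(stage, DEFAULT_BUDGET)
    kept = []
    used = 0
    for message in reversed(messages):
        cost = approx_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

Dropping old messages outright is the simplest policy; a summarizing variant would replace the discarded prefix with a short digest instead, at the price of an extra LLM call.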
FAQ
Q: What are "input tokens," "output tokens," and "reasoning tokens" in the context of this study?
A: Based on the study's framework, "input tokens" refer to the tokens supplied to the LLM by the agents for processing, typically representing context or instructions. "Output tokens" are the tokens generated by the LLM in response to those inputs. "Reasoning tokens" are distinct from input/output and likely represent the tokens used internally by the LLM for its analytical processes, intermediate thought steps, or complex decision-making within a task, though the abstract does not provide a more detailed definition.
Q: Why is the "Code Review" stage so token-intensive compared to other SDLC stages like "Coding" or "Testing"?
A: The study attributes the high token consumption in Code Review to its iterative nature. Unlike initial coding or discrete testing phases, code review often involves multiple rounds of analysis, feedback generation, and refinement, requiring agents to process and produce substantial amounts of contextual information, proposed changes, and explanations. This continuous, looping exchange of data and analytical thought drives up token usage significantly.
Q: How can the finding that "input tokens consistently constitute the largest share of consumption" be leveraged to optimize agentic software engineering workflows?
A: This finding highlights that a significant portion of token cost comes from providing context to the LLMs. To optimize, practitioners can focus on reducing the verbosity or redundancy of information passed to agents. This might involve designing agents that maintain more efficient internal states, summarizing lengthy prior conversations or code sections before re-feeding them, or implementing intelligent filters to only transmit the most relevant changes or contextual details during iterative processes.
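As a small sketch of the "only transmit the most relevant changes" idea, a review agent could be sent a unified diff between revisions rather than the full file on every round. This uses Python's standard `difflib` module; the function name `review_payload`, the default file name, and the choice of one line of surrounding context are illustrative assumptions, and the study does not say ChatDev works this way.

```python
import difflib


def review_payload(old_code, new_code, filename="main.py"):
    """Return a unified diff between two revisions of a file.

    The result is an empty string when the revisions are identical.
    Only one line of context is kept around each change (n=1) to keep
    the payload small; raise n if reviewers need more surroundings.
    """
    diff_lines = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
        lineterm="",
        n=1,
    )
    return "\n".join(diff_lines)
```

For a small edit to a long file, the diff is far shorter than the file itself, which is exactly the case that recurs in iterative review loops. The trade-off is that the reviewing agent sees less surrounding code, so a first review round may still need the full file.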