Mike’s Agent Insights #4
In a world where AI increasingly shapes our lives and AI agents will increasingly transform even more aspects of our day to day lives, it's important to stay informed and understand the implications of this technology. That's why I'm launching Mike's Agent Insights, a newsletter dedicated to exploring the latest research, investment, advancements and applications of AI agents.
This week I want to explore:
Agents (and agent-ish things) in the news
New Agent Evals
Guest Contributor - Tavin Turner: As the fall NLP Venues near, preprint research papers paint what's on the research horizon for agents
Agents in the news
Automation systems are masquerading as AI agents, making it crucial to distinguish between true agents and sophisticated automation - (Evergreen and Bornet VentureBeat)
The distinction is critical because true AI agents possess "full process autonomy," enabling them to manage entire workflows independently, whereas automation systems cannot scale to that level of complexity. Misidentifying automation as AI agents can lead to ineffective solutions and wasted resources. The tech sector has seen an explosion of announcements about AI agents, with companies like Salesforce, Microsoft and Amazon unveiling their own versions. However, many of these "agents" are actually sophisticated automation systems in disguise.
The masquerade is not necessarily problematic, as many business processes benefit from reliable automation. However, as businesses become increasingly dependent on agentic AI, organizations must develop clear frameworks for evaluating and implementing these technologies. To identify true AI agents, look for systems that can:
Research: gather the information needed to achieve a given goal.
Reason: analyze that information to make informed decisions.
Make decisions: decide autonomously, without human intervention.
Take action independently: act on those decisions to achieve the goal.
Improve over time through learning: learn from experience and improve performance over time.
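The five criteria above map naturally onto a loop: research, reason, decide, act, learn. Here is a minimal, hypothetical sketch of that loop in Python; every class and method name is illustrative, not taken from any specific framework or the article.

```python
# A hypothetical sketch of the agent loop implied by the five criteria
# above. All names here are illustrative, not from a real framework.

class Agent:
    def __init__(self):
        self.memory = []  # experiences used to improve over time

    def research(self, goal):
        # Gather information relevant to the goal (e.g. search, retrieval).
        return {"goal": goal, "facts": ["fact-a", "fact-b"]}

    def reason(self, context):
        # Analyze the gathered information and form a plan.
        return ["step-1", "step-2"]

    def decide(self, plan):
        # Choose the next action autonomously, without human intervention.
        return plan[0]

    def act(self, action):
        # Execute the action independently (e.g. call a tool or API).
        return f"result of {action}"

    def learn(self, action, result):
        # Record the outcome so future decisions can improve.
        self.memory.append((action, result))

    def run(self, goal):
        context = self.research(goal)
        plan = self.reason(context)
        action = self.decide(plan)
        result = self.act(action)
        self.learn(action, result)
        return result

print(Agent().run("summarize this week's agent news"))
```

A system that only wires an LLM into a fixed workflow typically implements `act` but not the surrounding loop, which is exactly the distinction the article draws.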
My Thoughts: This has been one of the biggest challenges in working with and talking about agents. We have called many things agents because the word has historical meanings, like "customer service agent." But we are now reaching a phase where the term has a specific technical meaning: a set of abilities that must all be present for a system to qualify. Often what really matters is the outcome, but generative AI is growing fast and there is a lot of confusion about what these new capabilities actually are, so it is important that people get the solution they are actually being promised and sold. What I have seen is that many orchestration systems have adopted LLM capabilities but cannot actually reason, make decisions, or take actions independently. This will matter both in deploying agentic systems and in the ongoing support needed to keep up with changes in dynamic environments. To the author's point, this matters less today, when most people don't yet understand what an agent is, but it will matter significantly as the technology advances over the next couple of years.
Agentic AI is the top strategic technology trend for 2025, with companies investing heavily in AI agents that autonomously plan and take actions to meet user-defined goals - (Gartner in ZDNet)
Agentic AI offers a virtual workforce that can offload and augment human work, driving a significant increase in IT spending, with worldwide IT spending expected to total $5.74 trillion in 2025, a 9.3% increase from 2024. This technology will deliver more adaptable software systems, capable of completing a wide variety of tasks. Gartner's latest forecast reveals that agentic AI systems will autonomously plan and take actions to meet user-defined goals, with spending on software expected to increase 14% to reach $1.23 trillion in 2025.
In this report Gartner predicts that by 2028 at least 15% of day-to-day work decisions will be done by agentic AI, up from 0% in 2024. The adoption of agentic AI will have far-reaching implications, including:
By 2028, at least 15% of day-to-day work decisions will be taken autonomously through agentic AI.
20% of organizations will use AI to flatten their organizational structure, eliminating more than half of current middle management positions by 2026.
10% of global boards will use AI guidance to challenge executive decisions by 2029.
40% of large enterprises will deploy AI to manipulate and measure employee mood and behavior by 2028.
By 2027, 70% of new employee contracts will include licensing and fair usage clauses for AI representations of their personas.
Through 2027, Fortune 500 companies will shift $500 billion from energy operational spending to microgrids to mitigate chronic energy risks and AI demand.
My Thoughts: Gartner is Gartner; they have good years and not-so-good years (like their Windows Phone market share prediction). In general, though, they reflect both enterprise customers and companies' future plans to build these solutions. The survey across companies shows why we are seeing such rapid growth and investment in enterprise agentic (or pretending-to-be-agentic) solutions. 15% by 2028 is a significant number. Increases in education levels between 1948 and 1990 resulted in between an 11% and 20% increase in productivity (cite). The 15% may have an even greater impact on productivity, since automating manual, time-consuming tasks frees employees to focus on more strategic activities, and that additional time for strategic work should improve the quality of decisions. This is a very favorable view of where we could go with agents, and there are things we need to invest in to make it a reality. How and where we integrate agents will determine how much of this we unlock. There will also be social pushback against some of these changes, as we recently saw with the moves to automate many of the major ports and the longshoremen's strikes.
Google's CEO reveals that AI generates over 25% of new code for Google's products, with human oversight - (Edwards at ArsTechnica)
The use of AI in coding has a significant impact on software development, boosting productivity and efficiency, but also raises concerns about potential bugs and errors. The integration of AI in coding began gaining momentum with GitHub Copilot in 2021, utilizing OpenAI's Codex model to suggest code continuations and create new code. Since then, AI-based coding solutions have expanded, with improvements from various tech companies.
The shift to AI-assisted coding is part of the ongoing evolution of software development tools, similar to past transitions from assembly language to higher-level languages and the adoption of object-oriented programming. While there are risks, AI augmentation aims to enhance human capability, not replace it.
Over 76% of developers are using or planning to use AI tools in their development process, according to Stack Overflow's 2024 Developer Survey.
A 2023 GitHub survey found that 92% of US-based software developers are already using AI coding tools.
Critics worry about potential bugs and errors, citing a 2023 Stanford University study that found developers using AI coding assistants included more bugs while believing their code was more secure.
Experts emphasize the need for skilled human oversight to ensure the quality of AI-generated code.
My Thoughts: One of the places I think agents will have the most significant impact is the software development cycle. As you introduce the combination of reasoning, planning, and tooling, which can include quality and security checks, you will see a step change in the quality of code from engineers who already use AI tools. Generative AI coding assistants have already helped engineers work through manual and menial tasks, which has begun to let them focus on more than a single line of code at a time. The reason I think planning, reasoning, and tools will deliver is that you will be able to define what you are trying to do and apply those changes across a larger section of a codebase. This will require a fairly significant change to how software engineers work with their code, and I am excited to see the kinds of tools that get built to accomplish it. This is also one of those topics I would love to hear from you on: how are you thinking about going from LLM interaction with code to agents in your own workflow?
New agent evals
SimpleQA: A Factuality Benchmark for Language Models - (OpenAI) (Research Paper)
SimpleQA helps evaluate the ability of language models to provide factually correct answers, addressing the problem of "hallucinations" and improving trustworthiness. Current language models often produce false or unsubstantiated outputs, making factuality a critical concern. SimpleQA was created to fill this gap by focusing on short, fact-seeking queries.
SimpleQA contributes to the development of more reliable AI models, enabling broader applications and advancing research in factuality and calibration.
Key Features
High correctness: Reference answers supported by two independent AI trainers.
Diversity: Covers various topics, including science, technology, and entertainment.
Challenging: Designed to test frontier models, with GPT-4o scoring less than 40%.
Good researcher UX: Fast, simple, and efficient grading.
Dataset Creation
AI trainers created short, fact-seeking questions with strict criteria.
Questions were verified by a second independent AI trainer.
A third AI trainer validated a random sample, achieving 94.4% agreement.
Measuring Calibration
SimpleQA assesses model confidence and accuracy correlation.
Larger models show better calibration, but overstate confidence.
Frequency of responses also indicates calibration, with o1-preview performing best.
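To make the grading and calibration ideas above concrete, here is a hypothetical sketch of how a SimpleQA-style eval might score a model: each answer is graded "correct", "incorrect", or "not_attempted", and calibration compares the model's stated confidence with its actual accuracy within each confidence band. The sample data is invented for illustration and is not from the paper.

```python
# Hypothetical SimpleQA-style scoring sketch. Grades and confidences
# below are invented sample data, not results from the actual benchmark.
from collections import defaultdict

graded = [
    # (grade, model's stated confidence)
    ("correct", 0.9), ("incorrect", 0.9), ("correct", 0.8),
    ("not_attempted", 0.2), ("incorrect", 0.7), ("correct", 0.9),
]

# Accuracy is computed over attempted questions only.
attempted = [g for g in graded if g[0] != "not_attempted"]
accuracy = sum(1 for grade, _ in attempted if grade == "correct") / len(attempted)

# Bucket results by stated confidence; a well-calibrated model's accuracy
# in each bucket should roughly match the bucket's confidence level.
buckets = defaultdict(list)
for grade, conf in attempted:
    buckets[round(conf, 1)].append(grade == "correct")

calibration = {conf: sum(hits) / len(hits) for conf, hits in sorted(buckets.items())}

print(f"accuracy on attempted: {accuracy:.2f}")
print(f"accuracy by stated confidence: {calibration}")
```

In this toy sample the model claims 0.9 confidence but is only right two times out of three in that bucket, which is the kind of overconfidence the paper reports for larger models.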
My Thoughts: This isn’t directly an agent eval, but I think it is important to consider how a true agentic system should perform on it. Both Claude and GPT-4o scored below 50%; a goal of the eval was to be challenging for frontier models. Even o1-preview, with its reasoning abilities, scored highest at only 42.7% of answers correct. For a system to be truly agentic, it needs to reason, develop a plan, and backtrack when it hits a problem. It also needs to interact with the world around it, using some kind of tooling to supplement what is not trained into the model. I expect the current leaderboard to change significantly as truly agentic systems are tested: a reasoned path to an answer should yield both higher factuality and a lower percentage of unanswered questions. Time will tell on this one.
Guest Contributor/Tavin Turner: As the fall NLP Venues near, preprint research papers paint what's on the research horizon for agents
NLP Venues refer to conferences, workshops, and other events focused on Natural Language Processing (NLP), and these conferences are calling for contributions on the next steps for agents. Let's look at a few papers that represent these visions. In the hot seat: aligned reasoning, multiagent systems, and structured knowledge.
Aligned reasoning is still opaque but underlies explanation - (Sullivan and Elsayed - ARXIV)
Marketable agents compete to control foundation models' fuzzy reasoning with their design. Sullivan and Elsayed contest LLMs' inherent symbolic reasoning skills, and suggest that fast, explainable symbolic reasoning comes from external symbolic tools, which we will see later.
Tavin's thoughts: Reliable reasoning is an unchecked item on agent wishlists for healthcare, law, and other sensitive domains. External symbolic tools are like "showing your work" for a decision. Healthcare and legal providers know that there is both art and science in their thought, and stringent rationalism is at odds with their nuance, so structured-but-flexible reasoning is in-demand but off-market.
Multiagent systems distribute workload and enable specialists - (Ha et al. offer SARA - ARXIV) and (Jonas Becker - ARXIV)
Some think info-sharing multiagent systems can solve the inductive structuring problems of state-of-the-art architectures while maintaining a purer dependence on foundation models.
Ha et al. offer SARA, a multiagent model that uses a reason agent to break down a prompt, then organizes other agents to execute a plan derived from this structure. Their results suggest that multiagent structures with some fast thinkers and some slow thinkers dominate modern mainstays (Chain of Thought, ReAct, etc.).
Reviewing current multiagent approaches, Jonas Becker finds the good and the bad. Multiagent flows are more accurate, scale better with problem complexity, and more explainable than solo agents. That said, they have one big problem: better reasoning comes from longer discussions, and talk-time lets models get distracted and risk alignment collapse. Solving this problem could be monumental for new agent architectures.
Tavin's thoughts: SARA resembles orchestrative architectures from the neurosymbolic NLP literature. This reimagination realizes their potential, and there's more to come. Becker calls to reduce the strain on foundation model coherence by diversifying agent roles within multiagent systems. Ultimately, this is a promising take on decomposition, but computational and time barriers stand in the way.
Structured representations make for more effective agents - (Yu and Lu - ARXIV)
There's a lot of talk about introspecting on agents and depending more on high-level thinking, like SARA. Yu and Lu are one of many making tangible structures part of the reasoning process. Their Minecraft-playing agent builds a "causal graph" to explain how complex tasks like crafting items and building structures break down into common modular skills. This paper offers a way for open-world agents – who interact with an uncharted world – to take notes on their experiences. It turns out agents are pretty good at this (their agents' causal graphs were "almost perfect" compared to those of human players), and they outperform their model peers when taking advantage of it.
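The causal-graph idea is easy to picture as a dependency graph: complex tasks decompose into prerequisite skills, and walking the graph in topological order yields an executable plan. A minimal sketch using Python's standard library, with invented Minecraft-flavored skill names that are illustrative only, not taken from the paper:

```python
# Hypothetical sketch of a skill "causal graph": each skill maps to the
# skills it depends on, and a topological sort produces a plan that
# satisfies every prerequisite before the skill that needs it.
# Skill names are invented examples, not from Yu and Lu's paper.
from graphlib import TopologicalSorter

causal_graph = {
    "craft_wooden_pickaxe": {"craft_planks", "craft_sticks"},
    "craft_sticks": {"craft_planks"},
    "craft_planks": {"chop_wood"},
    "mine_stone": {"craft_wooden_pickaxe"},
}

# static_order() emits prerequisites before the skills that depend on them.
plan = list(TopologicalSorter(causal_graph).static_order())
print(plan)
```

An agent that learns this graph from experience can reuse the modular skills ("craft_planks") across many higher-level goals instead of re-deriving them each time.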
Tavin's thoughts: Symbolic reasoning is a major artery of System 2 reasoning. If agents are to think slowly, they need a framework of structured representations baked into their design. Agentive systems must either foster or approximate structured representations to be interpretable, generalizable, and augmentable beyond what we already expect.
To close out, these three leads render a view of the coming agentive edge: structured representations supporting symbolic reasoning distributed over architected systems of agents. I believe that this dogma will take agents from nascent to mainstream, and fast.
Why a guest contributor, and who is Tavin? I volunteer to help mentor students in the Denver area. Tavin has been a standout for a while and asked if he could contribute; he is also looking for an internship for 2025 (resume) for any of you who are interested. In his own words, he is a rising technologist mixing intelligent systems with interdisciplinarity, bringing tech-led change to diverse users. His young start in tech brought full stack experience and a curiosity for more. Self-starting in NLP, he enrolled at CU Boulder after learning from Jurafsky and Martin's textbook. There, since 2022, he has worked with Blueprint Boulder to accelerate nonprofits with tech and joined in NLP research. Today, he works in the BLAST lab to enrich structured discourse representations for statutory analysis.
Mike, I'm really enjoying your blog. This is a great post. All the talk of agents is exciting, but it's such a widely used word that it almost has no meaning. This is a clearly written post on what an agent is, what it isn't, and how we might think about evaluating them. The impact is going to be really amazing to see, but there's so much we don't know yet. What an exciting time!