1. Introduction: Humans, Tools, and the Genesis of AI Agents
The whitepaper begins by drawing a parallel between how humans solve problems and how large language models (LLMs) or Generative AI models might solve problems when given the ability to use external resources. Humans, it states, are naturally skilled at pattern recognition, but we frequently augment ourselves with specialized tools: we consult books for domain expertise, we use search engines like Google to stay current on recent information, and we reach for a calculator when arithmetic would be time-consuming or error-prone by hand.
In the same vein, a Generative AI model on its own may have an enormous capacity for language understanding and content generation, but it is limited to its pre-trained knowledge. It has no awareness of developments after its training cut-off date unless it can pull in external or real-time data. Nor can it, by default, act on behalf of the user in the real world, such as emailing someone, transferring funds, or fetching dynamic content from a web API.
When we take that powerful core—the generative model—and connect it to external tools (for instance, a web search API, a database call, or an application endpoint that triggers an action in the real world), we effectively create an agent. An agent is an application that can observe the world, act on it, and plan further steps in an autonomous fashion to achieve a given goal. This synergy is the heart of the whitepaper: how to extend a language model into something that can plan, reason, and connect with the outside environment.
Why Agents Are Needed
- Static training data: A language model cannot spontaneously update its internal knowledge unless it is trained or fine-tuned again. Therefore, if you need up-to-date or situation-specific data—like an item’s current price or real-time sports scores—you must connect to an external source.
- Complex tasks: Many tasks require multiple, structured steps. Rather than forcing a user to specify each micro-step to the model manually, an agent can follow a high-level objective, plan a sequence of actions, and execute them in the correct order.
- Real-world actions: Some processes go beyond retrieving text. They may require creating calendar events, sending emails, or booking flights. An agent, with appropriate permission and the right architecture, can perform these tasks automatically.
Hence, the introduction concludes by emphasizing that the combination of a large model with tool usage, plus logic to manage planning and execution, defines the modern concept of a Generative AI agent.
2. What Is an Agent?
Building on that introduction, the paper introduces a succinct working definition:
“In its most fundamental form, a Generative AI agent can be defined as an application that attempts to achieve a goal by observing the world and acting upon it using the tools that it has at its disposal.”
Several elements are worth stressing in this definition:
- Autonomy: Agents can act independently of continuous human oversight, provided they have been given clear instructions or end goals.
- Proactivity: An agent may deduce the next steps it should take, even if the human user has not spelled out each subtask.
- Generative AI: While “agent” in AI is a broad term, the whitepaper focuses on Generative AI agents that rely on language models. These models leverage advanced reasoning capabilities, instruction-following, or chain-of-thought techniques.
The Cognitive Architecture
Though agent behaviors can vary widely in complexity, the whitepaper isolates three fundamental components that collectively shape the agent’s “cognitive architecture”:
- The Model
This is the large language model (LLM) at the center of the decision-making process. It can be a single model or multiple models, fine-tuned or general-purpose, depending on the use case. The critical point is that this model is capable of following certain reasoning or logic frameworks. The authors note that popular frameworks include ReAct (Reason + Act), Chain-of-Thought (CoT), and Tree-of-Thoughts (ToT).
The paper remarks that, in production, developers often choose a model that is specifically suited to the domain or has been pre-trained on data relevant to the types of tools the agent will be using. Though the internal training set might not directly incorporate the configuration of the agent or the specifics of tool usage, the model can still be "taught" or "prompted" to use those tools effectively at inference time.
- Tools
Traditional LLMs (without specialized tool-usage methods) are relatively siloed, with knowledge strictly limited to their training data. Tools open the door to real-time data or actions. When integrated properly, an agent can call an external weather API, a database retrieval function, or a transaction endpoint, effectively bridging the gap between the text-based generation environment and external systems. Tools can take a variety of shapes: APIs with GET/POST/PATCH/DELETE methods, local code interpreters, or specialized third-party services.
- The Orchestration Layer
Within an agent, the process rarely ends with a single prompt-response iteration. Instead, there is a cyclical or iterative loop of reasoning, action, and observation. The orchestration layer ensures that the agent keeps track of context or memory, decides whether another tool invocation is needed, and finalizes a suitable outcome. Depending on complexity, this orchestration might be minimal or might include advanced logic, additional ML algorithms, or probabilistic reasoning steps.
Together, these three components constitute an agent architecture that is robust and generalizable, allowing for intricate tasks that go far beyond a single-shot language model generation.
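To see how these three components might fit together in code, here is a minimal, illustrative skeleton. It is not from the paper: the `model` callable, the tool registry, and the `FINAL:` stopping convention are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """Minimal sketch of the three-part cognitive architecture."""
    model: Callable[[str], str]             # the LLM: prompt text -> decision text
    tools: dict[str, Callable[[str], str]]  # tool name -> callable tool
    history: list[str] = field(default_factory=list)

    def run(self, goal: str, max_steps: int = 5) -> str:
        """The orchestration layer: a loop of reasoning, action, observation."""
        self.history.append(f"Goal: {goal}")
        for _ in range(max_steps):
            decision = self.model("\n".join(self.history))
            self.history.append(decision)
            if decision.startswith("FINAL:"):        # model signals completion
                return decision.removeprefix("FINAL:").strip()
            tool_name, _, tool_input = decision.partition(" ")
            observation = self.tools[tool_name](tool_input)
            self.history.append(f"Observation: {observation}")
        return "Stopped: step budget exhausted."
```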
3. Agents vs. Standalone Models
In one section, the whitepaper offers a direct comparison between plain models and agents. Here’s the gist:
- Knowledge Base:
- Standalone model: Knowledge is locked to what it learned in training. If the training data is old or incomplete, the model might hallucinate or produce outdated answers.
- Agent: By linking to tools, an agent can access external data that is fresh, up-to-date, or domain-specific.
- Conversation and Memory:
- Standalone model: Typically provides a single inference per query, lacking a built-in mechanism for multi-turn “conversational memory,” unless externally managed.
- Agent: Manages session history to allow multi-step inference, keeping track of context across different tool calls or user messages.
- Tool Implementation:
- Standalone model: There is no baked-in concept of calling external APIs or code. You could instruct it in a prompt to mimic a function call, but that is entirely user-driven.
- Agent: Tools are natively implemented and tied into its decision-making. The agent can decide, “Now I should fetch data,” or “Now I need to run a snippet of code.”
- Logic Layer:
- Standalone model: You can attempt to impose logic by writing a complex prompt, but the model does not inherently run loops or plan steps.
- Agent: Employs an explicit "cognitive architecture" (e.g., a ReAct or chain-of-thought framework) that systematically shapes how the agent handles incoming data, partial results, and next steps.
4. Cognitive Architectures: Under the Hood
The Chef Analogy
The whitepaper uses a helpful analogy of a chef who aims to create a dish in a busy kitchen. This chef:
- Gathers the relevant information: the order, the pantry contents, the kitchen's resources.
- Reasons about it: "Given what I have in stock and the diner's preferences, what dish should I prepare, and how?"
- Executes the tasks: the chef chops vegetables, sears meat, sets timers, etc.
- Adjusts as needed: if the sauce is running low or the diner changes a request, the chef dynamically updates the plan.
This repeated cycle—information intake, internal planning, actual action, and a final check for the result—demonstrates exactly how an agent orchestration layer can guide a model to handle multi-step tasks. If we link these steps to AI, we see:
- Information intake: The agent obtains the user’s question plus any relevant environmental data (from external tools).
- Planning/Reasoning: The model (with or without a chain-of-thought) decides which tool to call next or how to refine its approach.
- Execution: The agent calls that tool, obtains a real result.
- Adjustment: The result might change the agent’s approach, prompting it to re-plan.
Reasoning Frameworks
The paper references several popular frameworks that further strengthen the orchestration layer:
- ReAct
- Introduces a structure for the agent’s prompts: a user question is followed by the model’s “Thought” (an internal rationale), an “Action” (tool usage decision), “Action Input” (the parameters to that tool), an “Observation” (the tool’s response), and so on. Eventually, the model outputs a “Final Answer.”
- This method can reduce hallucination by forcing the model to consider whether an external source is needed.
- Chain-of-Thought (CoT)
- Encourages the model to produce step-by-step reasoning. The basic premise is to let the model “think out loud” before concluding. This might be done with zero-shot prompts or with few-shot examples.
- Sub-variants mentioned include “self-consistency” or “active prompting,” each aiming to further refine the accuracy or thoroughness of the reasoned steps.
- Tree-of-Thoughts (ToT)
- Generalizes the idea of chain-of-thought by exploring multiple possible lines of reasoning in a tree structure. The model can branch out, consider alternative pathways, evaluate partial outcomes, then converge on the best final solution.
- This approach can be effective for tasks requiring strategic lookahead, such as combinatorial reasoning or puzzle solving.
In a ReAct-based agent scenario, for example, the typical sequence is: user query → the agent decides on a next step → the agent picks a tool → the agent reads the tool’s response → the agent either picks another tool or finalizes the answer. This loop can iterate as needed until a stopping criterion (e.g., user’s question is fully answered) is met.
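Concretely, a single pass through that loop might produce a trace like the following. This is an illustrative transcript in the ReAct format described above, reusing the paper's flight-booking scenario; the tool name and the observation text are hypothetical.

```
Question: I want to book a flight from Austin to Zurich next Friday.
Thought: I need real flight data; the Flights extension can provide it.
Action: Flights
Action Input: {"departureCity": "Austin", "arrivalCity": "Zurich"}
Observation: 3 matching flights found, earliest departure 08:45 ...
Thought: I now have enough information to answer.
Final Answer: I found three flights from Austin to Zurich next Friday ...
```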
5. Tools: The Gateway to External Data and Services
A crucial premise of the whitepaper is that tools enable the agent to break free of the typical LLM constraints. Tools might be invoked through well-known API calls, local code running, or even specialized connectors to third-party platforms. The overarching idea is to let the agent consult or manipulate real data.
Extensions
The easiest way to conceptualize an Extension is as a standard interface between the agent and an external API. If you want your agent to query Google Flights, you can build an Extension specifying:
- The correct endpoint (URL)
- The required parameters (such as “departureCity,” “arrivalCity,” “departureDate”)
- Any authentication details or data formatting
- Examples that illustrate how the agent might use this extension in typical queries
When a user says, “I want to book a flight from Austin to Zurich next Friday,” the agent no longer has to guess what to do. It sees that an Extension for Google Flights is available, notices the query’s mention of departure city “Austin” and arrival “Zurich,” and calls the extension with the right parameters. The Extension returns an actual real-time result, which the agent can incorporate in its final output.
This approach is far more robust than having custom code parse user queries in an ad-hoc manner. The agent learns from examples how to invoke the Extension correctly, and the user’s utterances can be more flexible: if the user forgets to specify the departure city, the agent can detect that mismatch, possibly ask clarifying questions, or supply defaults.
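As a concrete illustration, the pieces listed above might be captured in a definition like this sketch. Everything here is hypothetical: the field names, the flights.example.com endpoint, and the sample call are illustrative, not a real Google Flights API or the paper's actual schema.

```python
# Hypothetical Extension definition, for illustration only.
flights_extension = {
    "name": "google_flights",
    "endpoint": "https://flights.example.com/v1/search",   # hypothetical URL
    "parameters": {
        "departureCity": {"type": "string", "required": True},
        "arrivalCity":   {"type": "string", "required": True},
        "departureDate": {"type": "string", "required": True},  # ISO date
    },
    "auth": {"type": "api_key", "header": "X-Api-Key"},
    # Few-shot examples teach the agent when and how to invoke the extension.
    "examples": [
        {
            "query": "I want to book a flight from Austin to Zurich next Friday",
            "call": {
                "departureCity": "Austin",
                "arrivalCity": "Zurich",
                "departureDate": "<resolved from 'next Friday'>",
            },
        },
    ],
}
```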
Standardized Example Types
The paper notes that, within the Vertex AI environment, "Extensions" come with a mechanism to teach an agent how to call them. Google's system includes built-in example templates that show the agent how to fill out extension parameters. This fosters dynamic selection: if the user pivots from "book a flight" to "book a hotel," the model can detect that the hotel extension is more appropriate.
Prebuilt Extensions
Google also supplies prebuilt "hub" extensions, such as Code Interpreter. Rather than building an entire code-running environment from scratch, a developer can simply call `Extension.from_hub("code_interpreter")`. The agent then gains the ability to generate and run code snippets in Python, pass in data, and capture outputs, like a more advanced version of a notebook environment integrated into the agent.
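A minimal usage sketch based on the Vertex AI Extensions preview SDK follows; treat the operation name and parameter shape as assumptions to check against current documentation, and the project ID as a placeholder.

```python
import vertexai
from vertexai.preview import extensions

vertexai.init(project="your-project-id", location="us-central1")  # placeholder project

# Pull the prebuilt Code Interpreter extension from Google's hub.
code_interpreter = extensions.Extension.from_hub("code_interpreter")

# Ask the extension to write and run code; the operation name and parameters
# below are assumptions based on the preview documentation.
response = code_interpreter.execute(
    operation_id="generate_and_execute",
    operation_params={"query": "Compute the first 10 Fibonacci numbers"},
)
print(response)
```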
Functions
The paper then differentiates Functions from Extensions. While an Extension typically executes on the agent side (the agent calls the endpoint directly), a Function is defined on the client side. The model’s job is to propose a function name and the parameters that should be passed, but the actual code execution or API call takes place outside the agent environment.
Why is this relevant? Because it grants the developer finer control over the data pipeline, security constraints, or custom transformations. Sometimes, you do not want your agent directly calling third-party services. You might have a complicated authentication scheme or prefer to keep the agent behind a firewall. Instead of letting the agent dial the external API, you have the model produce a function call in a structured JSON block. You, as the developer, parse that JSON and decide how or when to run the code.
For instance, a user says, "I'd like a list of family-friendly ski resorts." The agent might produce a function call for `display_cities` with arguments `["Aspen", "Vail", "Park City"]`. That is effectively the agent's best guess for a suitable list. Your client code, upon receiving that function call, might perform further filtering or fetch additional data from an internal system. You could then show final results to the user or pass them back to the agent. This separation of concerns proves beneficial in multi-step workflows, asynchronous tasks, or corporate compliance contexts where direct agent access to sensitive systems is disallowed.
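The pattern can be sketched with the Vertex AI SDK's function-calling interface. The schema below is an assumption modeled on the `display_cities` example, not the paper's exact code, and the project ID is a placeholder.

```python
import vertexai
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

vertexai.init(project="your-project-id", location="us-central1")  # placeholder project

# Declare the function's schema. The model only *proposes* calls to it;
# execution stays on the client side.
display_cities = FunctionDeclaration(
    name="display_cities",
    description="Display a list of cities to the user.",
    parameters={
        "type": "object",
        "properties": {
            "cities": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["cities"],
    },
)

model = GenerativeModel(
    "gemini-1.5-flash-001",
    tools=[Tool(function_declarations=[display_cities])],
)
response = model.generate_content("I'd like a list of family-friendly ski resorts.")

# The model returns a structured call; client code decides how to act on it.
call = response.candidates[0].content.parts[0].function_call
print(call.name, dict(call.args))  # e.g. display_cities {'cities': [...]}
```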
Data Stores
Next, the whitepaper addresses how to keep an LLM “current” on knowledge it was never explicitly trained on. The solution is to connect the agent to a Data Store. Typically, this means a vector database. When the user’s question arises, the agent can run a retrieval step: it transforms the user query into an embedding, queries the vector database, and obtains relevant documents or data. The agent then leverages the retrieved information to inform its final answer, effectively “augmenting” the model’s generation.
Some highlights of Data Stores:
- They can store structured data (CSV, spreadsheets), unstructured data (PDFs, websites, text), or domain-specific reference docs.
- The agent often uses retrieval-based prompting: the best-matching chunks of text are appended to the model prompt so the model has direct access to relevant passages.
- This approach is widely referred to as Retrieval-Augmented Generation (RAG).
In short, Data Stores expand an LLM's effective knowledge. Instead of "hallucinating" an answer about, say, your company's internal policy, the agent can reference an actual PDF that was indexed into the vector store, grounding its response in current, verifiable source material.
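The retrieval step can be illustrated with a deliberately tiny, self-contained sketch. A real system would swap the bag-of-words `embed` stand-in for a learned embedding model and the in-memory list for a vector database; the policy "chunks" here are invented for the example.

```python
import math
from collections import Counter

# Toy "embedding": a bag-of-words vector, just to keep the sketch runnable.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexed chunks of a hypothetical internal policy document.
chunks = [
    "Employees may work remotely up to three days per week.",
    "Travel expenses require manager approval within 30 days.",
    "The office closes at 6 pm on Fridays.",
]
index = [(embed(chunk), chunk) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k best-matching chunks for the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# The retrieved chunks are appended to the model prompt (the RAG step).
question = "How many days can I work from home?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)
```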
6. Putting It All Together: Extensions, Functions, Data Stores
A table in the paper clarifies the typical usage of each “tool type,” noting that:
- Extensions are often best if you want the agent to call an external API directly (Agent-Side Execution).
- Functions are suitable when you want the model to propose a function call that your application will handle (Client-Side Execution).
- Data Stores are used to implement retrieval-augmented workflows where the agent queries a vector database to find relevant documents or data, typically in service of generating more accurate or more detailed answers.
The agent developer can decide which or all of these are appropriate. They can be used simultaneously—for instance, an agent might retrieve documents from a Data Store, then call a code-execution Extension, then propose a client-side Function.
7. Improving Performance Through Targeted Learning
One of the complexities in building agents is ensuring that the model chooses the right tools at the right time and handles edge cases gracefully. The whitepaper enumerates three high-level strategies:
- In-Context Learning
- You supply a prompt with instructions and a handful of examples that demonstrate correct usage of each tool. Because LLMs are powerful pattern matchers, they observe the examples and attempt to replicate that style.
- This can be quickly prototyped (no extra training time needed) but depends heavily on prompt engineering and can be less reliable for complex tasks.
- Retrieval-Based In-Context Learning
- Similar to above, but your examples are pulled from a repository dynamically. If the user’s request changes or a new domain arises, you fetch the relevant examples from an “examples database.” This approach scales better to varied tasks.
- Fine-Tuning
- Involves collecting a dataset of many examples—possibly logs from agent usage—and training or fine-tuning the model to internalize patterns for tool usage. The final model might more naturally produce correct function calls or extension calls, even in tricky situations. Fine-tuning is more resource-intensive but can yield robust, domain-specific performance gains.
An analogy is drawn to cooking: giving a single recipe (in-context example) vs. having an entire reference library (retrieval-based) vs. sending the chef to culinary school (fine-tuning).
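As a concrete illustration of the first strategy (in-context learning), a prompt might package tool-usage examples like this sketch; the tool names and formatting are illustrative, not from the paper.

```python
# A few-shot prompt for in-context learning: the worked examples teach the
# model the Thought/Action format and when each (hypothetical) tool applies.
FEW_SHOT_PROMPT = """You can use these tools: search(query), calculator(expression).

Example 1
Question: What is the population of Zurich?
Thought: This needs fresh data, so I should search.
Action: search("population of Zurich")

Example 2
Question: What is 17% of 8,250?
Thought: This is pure arithmetic, so I should use the calculator.
Action: calculator("0.17 * 8250")

Question: {user_question}
Thought:"""

print(FEW_SHOT_PROMPT.format(user_question="How many moons does Jupiter have?"))
```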
8. Example: A Quick Agent in LangChain
To showcase how agents work in practice, the whitepaper includes a short code example using LangChain and LangGraph, two open-source libraries that facilitate multi-step reasoning pipelines.
- Tools: The snippet defines two: a `search` tool that uses SerpAPI for Google Search, and a `places` tool that taps into the Google Places API.
- Model: They instantiate a Vertex AI model named `gemini-1.5-flash-001`.
- Agent: They create a ReAct agent that can parse the user’s question, decide to call `search` or `places`, observe the result, and ultimately produce the final answer.
A sample query might be:
“Who did the Texas Longhorns play in football last week? What is the address of the other team’s stadium?”
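A sketch consistent with the paper's description might look like the following; the exact imports and streaming loop are assumptions based on the public LangChain and LangGraph APIs, not a verbatim copy of the paper's snippet.

```python
from langgraph.prebuilt import create_react_agent
from langchain_core.tools import tool
from langchain_community.utilities import SerpAPIWrapper
from langchain_community.tools import GooglePlacesTool
from langchain_google_vertexai import ChatVertexAI

@tool
def search(query: str) -> str:
    """Run a Google Search via SerpAPI (requires SERPAPI_API_KEY)."""
    return SerpAPIWrapper().run(query)

@tool
def places(query: str) -> str:
    """Look up place details via the Google Places API (requires GPLACES_API_KEY)."""
    return GooglePlacesTool().run(query)

model = ChatVertexAI(model_name="gemini-1.5-flash-001")
agent = create_react_agent(model, [search, places])

query = ("Who did the Texas Longhorns play in football last week? "
         "What is the address of the other team's stadium?")

# Stream the agent's intermediate steps: tool choices, observations, answer.
for state in agent.stream({"messages": [("human", query)]}, stream_mode="values"):
    state["messages"][-1].pretty_print()
```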
The agent, orchestrating everything, might do the following:
- Step 1: Call the search tool with “Texas Longhorns football schedule” to identify the relevant opponent.
- Step 2: Receive a textual result indicating the last game. Suppose it was “Georgia Bulldogs.”
- Step 3: Then call the places tool with “Georgia Bulldogs stadium” to get the stadium details.
- Step 4: Return a final textual answer to the user: “They played the Georgia Bulldogs, and the address is 100 Sanford Dr, Athens, GA 30602.”
In the snippet’s logs, you see the multi-step reasoning: first it chooses the “search” tool, then it chooses the “places” tool, then it synthesizes a final statement. This is a simplified demonstration of how an agent can chain together external resources to produce a more useful answer than a single LLM response alone.
9. Production Applications with Vertex AI Agents
While demonstration code is important for proof-of-concept, the whitepaper then underscores the additional complexities of going to production. These complexities include:
- Handling user interfaces
- Evaluating agent performance
- Debugging errors
- Managing continuous updates or expansions of tool sets
Vertex AI simplifies many of these tasks by providing a managed environment. Through Vertex, a developer can define:
- High-level goals and tasks: The user might say, “I want an agent that can handle travel booking, plus summarizing user documents, plus emailing results.”
- Appropriate sub-agents: Possibly one sub-agent for flight booking, one for hotel booking, one for summarizing user data, etc.
- Tools: You can incorporate Vertex Extensions or custom functions.
- Memory and Example Stores: Vertex has a built-in system for storing examples, retrieval-based context, or logs of prior interactions.
Moreover, the platform provides mechanisms for:
- Testing and evaluation: You can measure how well the agent performs on a set of test queries, see which tools it chooses, and track success or error states.
- Debugging: If the agent is calling the wrong extension repeatedly, you can add clarifying examples or refine the extension definitions.
- Continuous improvement: Over time, you might gather logs of user interactions, identify repeated user requests, and incorporate that data into a targeted fine-tuning step.
Sample End-to-End Architecture
A figure (Figure 15 in the paper) illustrates an end-to-end agent architecture on Vertex AI:
- User Query: The user’s message arrives.
- Data Retrieval / Memory: The system might do a vector search for relevant context.
- Agent Orchestration: The agent determines the best plan: “Check the user’s profile from this extension,” or “Confirm flight details from that extension,” or “Propose a function call.”
- External APIs: The agent uses the chosen tools if needed, obtains real data.
- Response: The agent finalizes an answer, possibly returning it to the user or triggering an action.
Vertex AI can thus be seen as a robust, fully managed environment for turning the whitepaper’s conceptual building blocks—model, orchestration, and tools—into a scalable production pipeline.
10. Final Thoughts and Future Directions
The whitepaper ends with a concise wrap-up:
- Agents as Model Extenders: They emphasize that agents extend the capabilities of LLMs by providing real-time data retrieval and the capacity to take action.
- Central Role of Orchestration: At the heart of every agent is some form of cyclical “orchestration layer”—that is, a reasoning framework that handles decision-making and keeps track of partial results.
- Tools as Keys to the External World: Without external tools, even the largest language model is stuck with its pre-trained knowledge. Tools such as Extensions, Functions, or Data Stores turn a static model into a dynamic participant in real workflows.
- Power of Agent Chaining: They suggest that the future might involve “agent chaining,” a strategy where multiple specialized agents (each a domain expert) collaborate. If one agent is an expert in financial transactions and another in flight booking, a higher-level or “manager” agent might orchestrate calls to these specialists to solve multifaceted tasks.
- Iterative Improvement: Designing and deploying real agents can be an iterative process. You prototype, measure results, identify shortcomings, refine prompts or training data, and repeat. Over time, the agent will become more stable and reliable.
In summary, the paper lays the groundwork for how and why to integrate external resources into the generative model environment. In doing so, it posits that truly autonomous, multi-step, and context-aware AI applications require not just a powerful language model but a structured loop of reasoning and the right set of tools.