Every production AI agent worth running has six components. The model that reasons. The system prompt that sets the rules. The tools that let it act. The memory that carries state. The orchestrator that runs the loop. The observability that tells you what happened. Skip any one and the agent fails in a predictable way. This component breakdown mirrors the framing in Anthropic's Building Effective Agents and IBM's reference architecture for agents.
Model = the brain. System prompt = the job description. Tools = the hands. Memory = the notebook. Orchestrator = the manager that runs the loop. Observability = the log so you can debug when it goes wrong.
Most articles on this topic talk about agent architectures: ReAct, Reflexion, plan-execute, multi-agent orchestration. Those are useful for builders. For the rest of us (and most builders, honestly), what matters is the components. Architectures get reshuffled every six months. The components have been the same since 2023.
If you can evaluate an agent against these six components, you can compare any two agents on the market in 10 minutes. That's the working framework I use on SellerShorts when deciding whether a new agent listing is solid or weak.
The LLM that does the reasoning. In 2026 this is almost always one of Claude (Anthropic), GPT-5 (OpenAI), Gemini 3 (Google), or Llama (Meta, when self-hosted). For complex reasoning, the "thinking" or "reasoning" variants (Claude 4 with extended thinking, GPT-5's thinking mode, Gemini 3 thinking) are the production tier.
The model choice affects three things: quality of output, cost per token, and speed. The three trade off against each other. Better quality usually costs more. Faster usually means less reasoning. Most agent vendors pick the model based on the job, not because one model is universally best.
What good looks like: the vendor can tell you which model they use and why. Vague answers ("we use the best AI") usually mean they don't know.
The vendor-written text that defines the agent's role, hard rules, and behavior. The seller never sees this directly. It's the layer that makes the same Claude model behave like a listing optimizer for one product and a PPC analyst for another.
A strong system prompt includes: role, goal, constraints, output format, error handling, and (often) examples of good and bad output. A weak system prompt is just "you are a helpful assistant for Amazon sellers." The difference is night and day in output quality.
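One way to keep those six parts from being forgotten is to treat the system prompt as structured data and render it to text. The sketch below is illustrative only: `SystemPromptSpec`, its field names, and the sample wording are assumptions made for this example, not the format any particular vendor uses.

```python
from dataclasses import dataclass, field


@dataclass
class SystemPromptSpec:
    """Hypothetical container for the parts of a strong system prompt."""
    role: str
    goal: str
    constraints: list[str]
    output_format: str
    error_handling: str
    examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Assemble the parts into one labeled block of text.
        sections = [
            f"ROLE: {self.role}",
            f"GOAL: {self.goal}",
            "CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in self.constraints),
            f"OUTPUT FORMAT: {self.output_format}",
            f"ERROR HANDLING: {self.error_handling}",
        ]
        if self.examples:
            sections.append("EXAMPLES:\n" + "\n".join(f"- {e}" for e in self.examples))
        return "\n\n".join(sections)


# Sample values are made up for illustration.
spec = SystemPromptSpec(
    role="Listing optimizer for Amazon sellers",
    goal="Rewrite a product title and bullets for clarity and search relevance",
    constraints=["Never invent product claims", "Decline requests outside listing copy"],
    output_format="JSON with keys 'title' and 'bullets'",
    error_handling="If required product details are missing, ask for them instead of guessing",
    examples=["Good: specific title built from the provided attributes",
              "Bad: generic title with unverified claims"],
)
prompt = spec.render()
print(prompt)
```

Because every field except `examples` is required, a one-line "you are a helpful assistant" prompt cannot be constructed by accident.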
What good looks like: consistent output quality across runs, few off-task answers, and graceful refusal of requests outside the agent's scope.
Anything the agent can call to gather information or act on the world. APIs, databases, code execution, other models, external services. Without tools, the agent can only talk. With tools, it can do.
For Amazon-seller agents, the production-tier tools as of 2026 include:
What good looks like: the agent's tool list is documented. You can see what it can and cannot do. No mystery tool calls behind the scenes.
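A documented tool list can be enforced in code rather than left to a wiki page. The following is a minimal sketch, assuming a hypothetical `ToolRegistry` class: a tool cannot be registered without a description, and the agent cannot call anything that was not registered. The `word_count` tool is a stand-in for a real API or database call.

```python
from typing import Callable


class ToolRegistry:
    """Hypothetical registry: every callable tool carries a description."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., object]]] = {}

    def register(self, name: str, description: str, fn: Callable[..., object]) -> None:
        if not description.strip():
            raise ValueError(f"tool {name!r} needs a description")
        self._tools[name] = (description, fn)

    def describe(self) -> dict[str, str]:
        # The documented tool list a buyer could inspect.
        return {name: desc for name, (desc, _) in self._tools.items()}

    def call(self, name: str, **kwargs: object) -> object:
        # Refuse anything that was never registered: no mystery tool calls.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        _, fn = self._tools[name]
        return fn(**kwargs)


def word_count(text: str) -> int:
    """Stand-in tool; a real agent would wrap an API or database call."""
    return len(text.split())


registry = ToolRegistry()
registry.register("word_count", "Count the words in a piece of listing copy", word_count)
print(registry.describe())
print(registry.call("word_count", text="stainless steel water bottle"))
```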
How the agent holds state. Four layers, covered in depth in the memory and learning guide: working memory (within a run), episodic memory (across runs for a user), semantic memory (knowledge base via RAG), and procedural memory (model weights).
Most production agents in 2026 do working + episodic + RAG. Real procedural learning (fine-tuning per customer) is rare and usually unnecessary for ecommerce use cases.
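The working + episodic + RAG combination can be pictured as three stores with different lifetimes. This is a rough sketch under stated assumptions: `AgentMemory` is a hypothetical class, the knowledge snippets are invented, and the word-overlap ranking is a deliberately naive stand-in for the vector search a real RAG layer would use.

```python
from collections import defaultdict


class AgentMemory:
    """Hypothetical three-layer memory: working, episodic, semantic."""

    def __init__(self, knowledge: list[str]) -> None:
        self.working: list[str] = []  # lives for a single run
        self.episodic: dict[str, dict[str, str]] = defaultdict(dict)  # per user, across runs
        self.knowledge = knowledge  # semantic store (knowledge base)

    def start_run(self) -> None:
        # Working memory resets every run; episodic memory does not.
        self.working = []

    def remember(self, user_id: str, key: str, value: str) -> None:
        self.episodic[user_id][key] = value

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Naive word-overlap ranking; a stand-in for vector search.
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self.knowledge]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked[:top_k] if score > 0]

    def build_context(self, user_id: str, query: str) -> str:
        prefs = "; ".join(f"{k}={v}" for k, v in sorted(self.episodic[user_id].items()))
        facts = " | ".join(self.retrieve(query))
        steps = " > ".join(self.working)
        return f"prefs: {prefs}\nfacts: {facts}\nsteps: {steps}"


# Invented knowledge-base entries, for illustration only.
memory = AgentMemory([
    "Brand guide: titles lead with the product type",
    "Brand guide: bullets lead with the benefit",
])
memory.remember("seller-1", "brand_voice", "playful")
memory.start_run()
memory.working.append("drafted title")
context = memory.build_context("seller-1", "how should titles start")
print(context)
```

Calling `start_run()` again clears the step history but keeps `brand_voice`, which is the behavior that stops the agent from asking the same setup questions every time.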
What good looks like: the agent remembers your preferences and brand-voice details between runs. It doesn't ask you the same setup questions every time.
The component that runs the agent loop. Receives input, calls the model, executes tool calls, manages context, handles errors, decides when to stop. Most builders use a framework (LangGraph, CrewAI, OpenAI Agents SDK) rather than writing the loop from scratch in custom code.
The orchestrator is invisible to users when it works. When it doesn't (the agent loops forever, costs run away, errors aren't handled gracefully), the orchestrator is usually the failure point.
What good looks like: agents that terminate cleanly, error messages that make sense, retry behavior that doesn't loop infinitely.
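Retry behavior that cannot loop forever comes down to a fixed attempt budget and a readable error once the budget is spent. The helper below is a minimal sketch, not any framework's API: `call_with_retries`, `ToolCallFailed`, and the `flaky_tool` example are names invented here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ToolCallFailed(Exception):
    """Raised once the retry budget is spent, with a readable message."""


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.0) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # a real orchestrator would catch narrower error types
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff between attempts.
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise ToolCallFailed(f"gave up after {max_attempts} attempts: {last_error}") from last_error


# Hypothetical tool that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_tool() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


result = call_with_retries(flaky_tool, max_attempts=3)
print(result, attempts["count"])
```

The loop is bounded by `max_attempts`, so the worst case is a `ToolCallFailed` error that names the last underlying failure instead of a run that never ends.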
Everything the agent did, recorded in a way you can review. Which tools it called, what they returned, what the model decided at each step, how long it took, what it cost.
This is the component most often missing in early-stage vendors. Without observability, when an agent produces bad output, you can't debug it. With observability, you can read the trace and see exactly where it went wrong.
What good looks like: the vendor can show you the agent's trace on request. For your own runs you can see iteration count, tool calls, and final reasoning.
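A trace does not need a dedicated product to exist. As a sketch, assuming hypothetical `TraceEvent` and `RunTrace` classes and made-up durations and costs, the fields listed above (tool calls, model decisions, time, cost) can be captured as plain records and summarized per run:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: int
    kind: str  # "model" or "tool"
    name: str
    detail: str
    duration_s: float
    cost_usd: float


@dataclass
class RunTrace:
    run_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, name: str, detail: str,
               duration_s: float, cost_usd: float = 0.0) -> None:
        step = len(self.events) + 1
        self.events.append(TraceEvent(step, kind, name, detail, duration_s, cost_usd))

    def summary(self) -> dict:
        return {
            "run_id": self.run_id,
            "iterations": sum(1 for e in self.events if e.kind == "model"),
            "tool_calls": [e.name for e in self.events if e.kind == "tool"],
            "total_seconds": round(sum(e.duration_s for e in self.events), 3),
            "total_cost_usd": round(sum(e.cost_usd for e in self.events), 6),
        }

    def to_json(self) -> str:
        # Full step-by-step trace, suitable for showing on request.
        return json.dumps(asdict(self), indent=2)


# Durations and costs below are invented sample values.
trace = RunTrace("run-001")
trace.record("model", "plan", "decided to fetch the listing", 1.2, 0.004)
trace.record("tool", "fetch_listing", "returned 1 listing", 0.3)
trace.record("model", "final", "wrote the new title", 0.9, 0.003)
summary = trace.summary()
print(summary)
```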
| Component | Function | Fails by... | Common implementations |
|---|---|---|---|
| Model | Reasoning, generation | Hallucinations, weak reasoning | Claude, GPT-5, Gemini 3, Llama |
| System prompt | Defines role and rules | Off-task answers, inconsistency | Hand-crafted text, iteratively tuned |
| Tools | Acts on the world | Tool errors, wrong tool choice | SP-API, MCP servers, custom APIs |
| Memory | Holds state across steps and runs | Forgetting, context overflow | SQL, vector DBs, in-context summary |
| Orchestrator | Runs the loop | Infinite loops, cost runaway | LangGraph, CrewAI, custom code |
| Observability | Records what happened | Black-box debugging | LangSmith, Helicone, custom logging |
The components don't operate independently. They feed each other in a specific shape that's worth visualizing in your head.
The orchestrator kicks off a run. It loads the system prompt and presents it to the model. The model produces a plan, which usually includes a tool call. The orchestrator executes the tool, captures the result, adds it to memory, and feeds the updated state back to the model. Every step, the orchestrator logs to observability. The loop runs until the model declares the goal achieved or a limit is reached. Final output goes back to the user.
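The loop described above fits in a few lines once the model is replaced with a stub. This is a toy sketch, not a framework: `scripted_model` stands in for a real LLM call, the decision dictionaries are an invented format, and `max_steps` plays the role of the limit that stops a runaway run.

```python
from typing import Callable


def scripted_model(system_prompt: str, memory: list[str]) -> dict:
    """Stand-in for an LLM call: asks for one tool call, then finishes."""
    if not any(item.startswith("tool:") for item in memory):
        return {"action": "tool", "name": "word_count", "args": {"text": memory[0]}}
    return {"action": "finish", "output": f"Done. {memory[-1]}"}


def run_agent(user_input: str, system_prompt: str,
              model: Callable[[str, list[str]], dict],
              tools: dict[str, Callable[..., object]],
              max_steps: int = 5) -> tuple[str, list[dict]]:
    memory = [user_input]  # working memory for this run
    trace: list[dict] = []  # observability: one entry per step
    for step in range(1, max_steps + 1):
        decision = model(system_prompt, memory)
        trace.append({"step": step, "decision": decision["action"]})
        if decision["action"] == "finish":
            return decision["output"], trace
        # Execute the requested tool and feed the result back via memory.
        result = tools[decision["name"]](**decision["args"])
        memory.append(f"tool:{decision['name']} -> {result}")
    return "Stopped: step limit reached", trace


tools = {"word_count": lambda text: len(text.split())}
output, trace = run_agent("stainless steel water bottle", "You count words.",
                          scripted_model, tools)
print(output)
print(trace)
```

All six components appear here in miniature: the model (`scripted_model`), the system prompt, the `tools` dictionary, `memory`, the orchestrating `run_agent` loop, and the `trace` list.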
Six components, one loop. That's every production agent in 2026.
From watching agents launch and fail on SellerShorts, the two components vendors most often underbuild are memory and observability.
Agents without episodic memory feel forgetful. Every run is Groundhog Day. The seller has to re-explain their brand voice, target audience, and preferred output format. Adoption drops because the friction is high.
Agents without observability leave the support team blind. When something goes wrong (and it will), they can't tell you what happened. The agent produced a weird output, the user complains, the team shrugs because they don't have a trace. Trust evaporates fast.
Both are unsexy backend work that doesn't show up in a demo. Both are non-negotiable for production-grade agents. If you're evaluating a vendor, ask specifically about memory and observability. The answers will tell you whether they're early or mature.
Concrete view. A typical SellerShorts AI agent has:
The platform provides two of the six components (orchestrator and observability). The creator focuses on the other four. That's the marketplace value proposition: someone else takes care of the boring-but-critical infrastructure.
The six components have been stable since 2023. What's changed in 2026 is:
Six questions, one per component, that surface whether a vendor is mature or early.
Strong vendors answer all six clearly. Mid-tier vendors get four out of six. Weak vendors stumble on three or more. Use this as a buyer's checklist.
Every production AI agent has six components: the model (the reasoning brain), the system prompt (the role and rules), tools (APIs and external calls), memory (state across steps and runs), the orchestrator (loop manager), and observability (the trace log). Skip any one and the agent fails in a predictable way.
The system prompt is the vendor-written instruction set that defines the agent's role, hard rules, output format, and error handling. The seller never sees it directly. The user prompt is whatever the seller types or selects when they kick off a run. The system prompt makes the same Claude model behave like a listing optimizer for one product and a PPC analyst for another.
Most production agents use Claude 4 (Anthropic), GPT-5 (OpenAI), Gemini 3 (Google), or self-hosted Llama. For complex reasoning, the thinking variants (Claude 4 extended thinking, GPT-5's thinking mode, Gemini 3 thinking) are the production tier. A good vendor can tell you which model they use and why.
Memory and observability. Without episodic memory the agent feels forgetful, the seller has to re-explain their brand voice every run, and adoption drops. Without observability the support team cannot tell you what happened when a run produced a weird output. Both are unsexy backend work that does not show up in a demo, but both are non-negotiable for production-grade agents.
The Model Context Protocol (MCP) is Anthropic's open standard for how agents declare and call tools. Before MCP, every tool needed a custom integration. After Amazon shipped the Amazon Ads MCP Server in open beta on February 2, 2026, tools became more interchangeable across agents and frameworks.
SellerShorts handles the orchestrator and observability components for you. Creators bring the model choice, system prompt, tools, and memory. You just run the agent.