Agents are not “chatbots with a longer prompt”.
An agent is a system that can work toward a goal using a loop: plan, act with tools, check the result, then finish, iterate, or hand off to a human.
This post is a practical, easy read on that loop, the coordination patterns built on top of it, and the three main ways to build: platform-managed, hybrid, and code-first.
```mermaid
flowchart TD
    classDef step fill:#111827,stroke:#6366f1,color:#eef2ff;
    classDef tool fill:#0b1220,stroke:#38bdf8,color:#e0f2fe;
    classDef guard fill:#2d1b0b,stroke:#f59e0b,color:#fffbeb;
    classDef out fill:#052e1a,stroke:#22c55e,color:#dcfce7;
    Goal["Goal / request"]:::step --> Plan["Plan"]:::step
    Plan --> Act["Act (use tools)"]:::step
    Act --> Check["Check result"]:::guard
    Check -->|good| Done["Done"]:::out
    Check -->|needs more| Plan
    Check -->|risk/unclear| Human["Human review / handoff"]:::guard --> Done
    subgraph Tools["Tools / data access"]
        DB["Datastores"]:::tool
        API["APIs"]:::tool
        Search["Search / retrieval"]:::tool
        Workflow["Workflows (Logic Apps, etc.)"]:::tool
    end
    Act --> DB
    Act --> API
    Act --> Search
    Act --> Workflow
```
If you build everything around this loop, you’ll avoid most common “agent failures”: runaway loops, unverified answers, and risky actions that never reach a human.
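The loop is small enough to sketch directly. All names here are illustrative, not a real SDK: `plan`, `act`, and `check` are plain callables you supply, and the budget plus human handoff are baked into the loop itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopResult:
    answer: str
    needs_human: bool = False
    steps: int = 0

def run_agent(
    goal: str,
    plan: Callable,   # decides the next action from goal + evidence
    act: Callable,    # executes the action (tool call, model call, ...)
    check: Callable,  # cheap verification gate: (ok, verdict)
    max_steps: int = 4,
) -> LoopResult:
    """Generic plan → act → check loop with a hard step budget."""
    evidence: list = []
    for step in range(1, max_steps + 1):
        action = plan(goal, evidence)
        evidence.append(act(action))
        ok, verdict = check(goal, evidence)
        if ok:
            return LoopResult(answer=verdict, steps=step)
    # Budget exhausted without a verified answer: escalate, don't guess.
    return LoopResult(answer="budget exhausted", needs_human=True, steps=max_steps)
```

The point of keeping `plan`/`act`/`check` as injected callables is that the loop itself stays model- and tool-agnostic; you can unit-test it with stubs before wiring in an LLM.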
Different tasks need different coordination. Here are the patterns you’ll see most often.
```mermaid
flowchart TB
    classDef box fill:#0b1220,stroke:#334155,color:#e5e7eb;
    classDef hub fill:#111827,stroke:#6366f1,color:#eef2ff;
    classDef good fill:#052e1a,stroke:#22c55e,color:#dcfce7;
    subgraph Sequential["Sequential"]
        S1["Agent A"]:::box --> S2["Agent B"]:::box --> S3["Agent C"]:::box
    end
    subgraph Concurrent["Concurrent"]
        C0["Coordinator"]:::hub --> C1["Agent A"]:::box
        C0 --> C2["Agent B"]:::box
        C0 --> C3["Agent C"]:::box
        C1 --> C0
        C2 --> C0
        C3 --> C0
    end
    subgraph Handoff["Handoff"]
        H0["Router"]:::hub --> H1["Specialist A"]:::box --> H2["Specialist B"]:::box --> H3["Specialist C"]:::box
    end
    subgraph GroupChat["Group chat (debate + converge)"]
        G0["Facilitator"]:::hub --> G1["Worker"]:::box
        G0 --> G2["Reviewer"]:::box
        G1 --> G2 --> G0
    end
    subgraph Workflow["Workflow process (DAG)"]
        W1["Step 1"]:::box --> W2["Step 2"]:::box --> W3["Step 3"]:::box
        W2 --> W4["Parallel branch"]:::box --> W3
    end
    Sequential --> Concurrent --> Handoff --> GroupChat --> Workflow
```
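To make the difference concrete, here is a hedged sketch of the two simplest patterns above, with agents modeled as plain functions (in practice each would wrap a model plus tools):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]

def sequential(agents: list[Agent], task: str) -> str:
    """Each agent refines the previous agent's output (A → B → C)."""
    for agent in agents:
        task = agent(task)
    return task

def concurrent(agents: list[Agent], task: str) -> list[str]:
    """A coordinator fans the same task out and collects every result."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        return list(pool.map(lambda agent: agent(task), agents))

# Illustrative "agents" — stand-ins for model-backed specialists.
draft = lambda t: f"draft({t})"
review = lambda t: f"review({t})"

print(sequential([draft, review], "ticket"))   # review(draft(ticket))
print(concurrent([draft, review], "ticket"))   # ['draft(ticket)', 'review(ticket)']
```

Handoff and group chat are variations on these: handoff is sequential routing chosen at runtime, and group chat is concurrent agents plus a facilitator that loops until they converge.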
You can build agents purely in code, purely on a platform, or (most common) as a hybrid.
Microsoft describes Agent Framework as an open-source engine for building agentic apps, with durability, safety hooks, and paths to production. See the overview: Introducing Microsoft Agent Framework.
Azure AI Foundry provides a managed place to run agents with enterprise features (deployment, governance, identity, monitoring). Product entry point: Azure AI Foundry Agent Service.
For SDK and endpoints, see: Get started with Foundry SDKs and endpoints.
In many teams, the “agent architecture” stays the same, but you swap models as cost, latency, and capability requirements change.
Microsoft has positioned Foundry as a multi-model layer, including Anthropic Claude models in the catalog: Claude models in Microsoft Foundry.
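One way to keep the architecture fixed while swapping models is to push model choice into configuration. A hedged sketch (the deployment names and env-var convention here are made up, not a Foundry API):

```python
import os

# Illustrative deployment names — substitute your actual model deployments.
MODEL_BY_TASK = {
    "routing":   "small-fast-model",    # cheap, low-latency classification
    "synthesis": "large-capable-model", # higher-quality final answers
}

def pick_model(task: str) -> str:
    """Resolve the model for a task, with an env-var override so ops can
    swap models (e.g. to a Claude model in the catalog) without a deploy."""
    return os.getenv(f"MODEL_{task.upper()}", MODEL_BY_TASK[task])

print(pick_model("routing"))  # small-fast-model, unless overridden
```

The agent code only ever asks `pick_model("routing")`; which vendor's model answers is an operational decision, not an architectural one.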
If you prefer to own orchestration in your repo, you can build the loop in code and call models via the OpenAI SDK. In production, this usually means owning the policies yourself: retries, budgets, tool allow-lists, and observability.
Best when you want a managed “run and operate” layer.
```mermaid
flowchart LR
    classDef sys fill:#0b1220,stroke:#334155,color:#e5e7eb;
    classDef step fill:#111827,stroke:#6366f1,color:#eef2ff;
    classDef guard fill:#2d1b0b,stroke:#f59e0b,color:#fffbeb;
    classDef out fill:#052e1a,stroke:#22c55e,color:#dcfce7;
    U["User / trigger"]:::step --> A["Foundry Agent Service"]:::step
    A --> T["Built-in tools + connectors\n(files, search, workflows)"]:::sys
    A --> ID["Identity + RBAC\n(Entra, least privilege)"]:::guard
    A --> Obs["Tracing + logs + eval runs"]:::sys
    A --> H["Human approval gate\n(optional)"]:::guard
    H --> O["Action / answer"]:::out
    A --> O
```
Start with the SDK entry point: Foundry SDK overview.
Best when you want an SDK/workflow engine in your repo, but still want platform operations.
```mermaid
flowchart TB
    classDef sys fill:#0b1220,stroke:#334155,color:#e5e7eb;
    classDef step fill:#111827,stroke:#6366f1,color:#eef2ff;
    classDef guard fill:#2d1b0b,stroke:#f59e0b,color:#fffbeb;
    Repo["Your repo\n(agent code + tests + evals)"]:::sys
    AF["Agent Framework\n(router + specialists + reviewer)"]:::step
    Tools["Tools as code\n(OpenAPI clients, DB adapters)"]:::sys
    Foundry["Azure AI Foundry runtime\n(host + identity + monitoring)"]:::step
    Repo --> AF --> Tools
    AF --> Foundry --> Guard["RBAC + approvals + policies"]:::guard
```
Agent Framework overview: Microsoft Agent Framework blog.
Microsoft Learn also provides Agent Framework docs (example agent types): Azure AI Foundry agent type.
Best when you want maximum portability and you already have strong engineering around reliability.
```mermaid
flowchart LR
    classDef sys fill:#0b1220,stroke:#334155,color:#e5e7eb;
    classDef step fill:#111827,stroke:#6366f1,color:#eef2ff;
    classDef guard fill:#2d1b0b,stroke:#f59e0b,color:#fffbeb;
    classDef out fill:#052e1a,stroke:#22c55e,color:#dcfce7;
    UI["API / UI"]:::sys --> Orchestrator["Your agent loop\n(plan → tool → check)"]:::step
    Orchestrator --> Model["Model via OpenAI SDK"]:::step
    Orchestrator --> Tooling["Tools\n(DB, HTTP, search)"]:::sys
    Orchestrator --> Safety["Policies\nbudgets, allow-lists"]:::guard
    Orchestrator --> Result["Answer / action"]:::out
```
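In this setup the model sits behind a seam you own. A minimal, hedged sketch: the `ModelClient` protocol and the budget check are illustrative, and the comment only shows roughly where an OpenAI SDK call would plug in behind that seam.

```python
from typing import Protocol

class ModelClient(Protocol):
    """Narrow interface the orchestrator depends on — swap vendors here."""
    def complete(self, prompt: str) -> str: ...

# A real implementation would wrap the OpenAI SDK, roughly:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4o-mini",
#       messages=[{"role": "user", "content": prompt}],
#   )
#   return resp.choices[0].message.content

def answer_with_budget(model: ModelClient, question: str, max_question_chars: int = 2000) -> str:
    """Policy decisions (budgets, allow-lists) live outside the model call,
    in code you own and can test deterministically."""
    if len(question) > max_question_chars:
        raise ValueError("question exceeds budget")
    return model.complete(question)
```

Because the orchestrator only knows `ModelClient`, you can test the whole loop with a stub and keep the vendor SDK at the edge of your system.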
If you want a vendor-neutral convention for “agent instructions”, OpenAI has pushed AGENTS.md via the Agentic AI Foundation: OpenAI: Agentic AI Foundation.
This is a compact “production-shaped” baseline: typed inputs/outputs, one router, a couple of tools, retries, and clear boundaries.
```shell
pip install "pydantic>=2.7" "httpx>=0.27" "tenacity>=8.2" "structlog>=24.1" "python-dotenv>=1.0"
```
Optional but common in real teams:
```shell
pip install "opentelemetry-api>=1.26" "opentelemetry-sdk>=1.26" "pytest>=8.0"
```
```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Literal, Optional

import httpx
import structlog
from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

log = structlog.get_logger()

# -----------------------------
# Types (clear contracts)
# -----------------------------
Department = Literal["billing", "tech", "general"]

class Ticket(BaseModel):
    id: str
    subject: str
    body: str
    customer_tier: Literal["free", "pro", "enterprise"] = "free"

class RoutedTask(BaseModel):
    department: Department
    reason: str = Field(..., max_length=240)

class ToolResult(BaseModel):
    ok: bool
    summary: str
    source: str

class FinalAnswer(BaseModel):
    answer: str
    citations: list[str] = Field(default_factory=list)
    needs_human: bool = False

# -----------------------------
# Tools (small, testable units)
# -----------------------------
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(initial=0.2, max=2.0))
def fetch_billing_policy(query: str) -> ToolResult:
    # Example tool: replace with your KB/search
    base_url = os.getenv("POLICY_API_BASE_URL", "https://example.internal")
    with httpx.Client(timeout=5.0) as client:
        resp = client.get(f"{base_url}/policies/search", params={"q": query})
        resp.raise_for_status()
        data = resp.json()
    return ToolResult(ok=True, summary=data["top_match"]["summary"], source=data["top_match"]["url"])

def fetch_runbook_snippet(query: str) -> ToolResult:
    # Example tool: replace with your runbooks/CMDB
    return ToolResult(ok=True, summary=f"Runbook hint for: {query}", source="runbook://oncall")

# -----------------------------
# Orchestration (the loop)
# -----------------------------
@dataclass(frozen=True)
class AgentConfig:
    max_steps: int = 4
    allow_write_actions: bool = False  # start read-only

def route(ticket: Ticket) -> RoutedTask:
    text = f"{ticket.subject}\n{ticket.body}".lower()
    if any(k in text for k in ["invoice", "billing", "refund", "payment"]):
        return RoutedTask(department="billing", reason="Billing keywords detected")
    if any(k in text for k in ["error", "crash", "timeout", "bug", "api"]):
        return RoutedTask(department="tech", reason="Technical issue keywords detected")
    return RoutedTask(department="general", reason="Default route")

def verify(answer: str, tool_results: list[ToolResult]) -> tuple[bool, str]:
    """Cheap deterministic verification:
    - ensure we actually used tool evidence
    - prevent risky claims when evidence is missing
    """
    if not tool_results:
        return False, "No tool evidence retrieved"
    if len(answer.strip()) < 20:
        return False, "Answer too short"
    return True, "OK"

def handle_ticket(ticket: Ticket, cfg: AgentConfig = AgentConfig()) -> FinalAnswer:
    routed = route(ticket)
    log.info("routed_ticket", ticket_id=ticket.id, department=routed.department, reason=routed.reason)
    tool_results: list[ToolResult] = []
    for step in range(cfg.max_steps):
        if routed.department == "billing":
            tool_results.append(fetch_billing_policy(ticket.subject))
        elif routed.department == "tech":
            tool_results.append(fetch_runbook_snippet(ticket.subject))
        else:
            tool_results.append(fetch_runbook_snippet("support triage basics"))

        # Minimal “answer synthesis” (swap with an LLM call later)
        citations = [r.source for r in tool_results if r.ok]
        answer = (
            f"Here’s what I found for ticket {ticket.id}:\n"
            f"- Route: {routed.department} ({routed.reason})\n"
            f"- Evidence: {tool_results[-1].summary}\n"
            f"Next step: confirm details and apply the documented policy/runbook."
        )
        ok, why = verify(answer, tool_results)
        log.info("verify", ticket_id=ticket.id, ok=ok, why=why, step=step)
        if ok:
            return FinalAnswer(answer=answer, citations=citations, needs_human=False)

    return FinalAnswer(
        answer="I couldn’t verify a safe answer automatically. Please review the evidence and decide next steps.",
        citations=[r.source for r in tool_results if r.ok],
        needs_human=True,
    )
```
Why this is a good baseline:

- Typed contracts (Pydantic models) make inputs and outputs explicit and testable.
- Tools are small, deterministic units you can mock, with retries (tenacity) at the edge.
- A cheap `verify` step gates every answer on actual tool evidence.
- Failure escalates explicitly via `needs_human` instead of guessing.
This is where agent projects succeed or fail.
