When we set out to automate our daily on-call reporting workflow, I thought it would be straightforward: connect an LLM to Datadog, ask it to fetch some metrics, and generate a nice Markdown report. Simple, right?
Well, not quite. What started as a weekend experiment turned into a deep exploration of agent frameworks, the Model Context Protocol (MCP), and the surprising challenges of making LLMs behave deterministically. Along the way, we discovered Fast-Agent — a framework that made building multi-agent workflows feel natural rather than painful.
This is the story of building that system, the patterns we discovered, and the hard-won lessons about making AI agents work reliably in production.
Why We Needed This
Our platform engineering team monitors multiple production Kubernetes clusters and serverless functions. Every morning, someone on-call would spend 30-60 minutes cobbling together a status report:
- Check Datadog for active incidents and monitor alerts
- Query pod error states across clusters (CrashLoopBackOff, OOMKilled, ImagePullBackOff)
- Review resource utilization trends
- Scan error logs for new patterns
- Check service mesh health and serverless function metrics
- Compile everything into a readable report
The data was there. The queries were repeatable. But the manual compilation was tedious and error-prone. We needed automation, but not the brittle shell-script-and-jq kind. We needed something that could understand the data contextually—when a pod restart spike matters versus when it’s routine churn.
Enter Fast-Agent and MCP
Why Fast-Agent?
I’d looked at several agent frameworks (LangChain, AutoGen, CrewAI), but most felt either too heavyweight or too opinionated. Fast-Agent stood out for three reasons:
1. Decorator-based agent definition: Agents are just async functions with a decorator. No class hierarchies, no complex configuration schemas.
```python
from fast_agent import FastAgent
from fast_agent.core.prompt import Prompt

fast = FastAgent("Platform On-Call Report")

@fast.agent(
    name="k8s_pod_overview",
    servers=["datadog"],
    instruction="Report pod error states across clusters",
    max_tokens=24000
)
async def k8s_pod_overview(task: str) -> str:
    return "Kubernetes pod overview generated."
```
That’s it. You define what the agent does via the instruction parameter (more on that later), specify which MCP servers it can access, and you’re done.
2. First-class MCP support: Fast-Agent was built with MCP in mind. You configure servers in fastagent.config.yaml, and they’re available to agents automatically:
```yaml
mcp:
  servers:
    datadog:
      transport: http
      url: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp
      headers:
        DD_API_KEY: ${DD_API_KEY}
        DD_APPLICATION_KEY: ${DD_APPLICATION_KEY}
    filesystem:
      command: "npx"
      args: ["-y", "@modelcontextprotocol/server-filesystem", ".", "/tmp"]
```
No manual tool registration, no adapter layers. Agents just work with MCP servers.
3. Built-in parallel execution: The @fast.parallel decorator handles fan-out/fan-in workflows natively. This turned out to be critical for performance—more on that below.
What is MCP, Really?
If you’re not familiar with the Model Context Protocol, think of it as an API standard that lets LLMs interact with external systems in a structured way. Instead of manually implementing function calling for every tool, MCP servers expose resources (like Datadog dashboards) and tools (like querying metrics) in a standard format.
Datadog’s MCP server gives you tools like:
- `mcp_datadog_get_datadog_metric` - Fetch time-series metrics
- `mcp_datadog_search_datadog_logs` - Query logs with filters
- `mcp_datadog_search_datadog_monitors` - List monitor alerts
- `mcp_datadog_search_datadog_spans` - Retrieve APM traces
The beauty is that the LLM decides which tools to call based on your instruction prompt. You don’t write explicit code to fetch metrics—you tell the agent what you want, and it figures out the API calls.
Architecture: The Prompt Server Pattern
We settled on what we call the “prompt server” pattern. Each section of our report (incidents, pod errors, resource consumption, service mesh metrics, etc.) is:
- A separate agent with its own prompt file
- Executed in parallel alongside other sections
- Strictly output-only—no conversational fluff, just structured data
Here’s how it works:
The Decorator Factory
We created a decorator factory to reduce boilerplate:
```python
def shared_agent(
    name: str,
    prompt_filename: str,
    servers: list[str] | None = None,
    max_tokens: int | None = None,
):
    """Decorator factory for shared fast-agents."""
    servers = servers or ["datadog"]
    request_params = None
    if max_tokens:
        request_params = RequestParams(maxTokens=max_tokens)

    def _decorator(fn):
        agent_kwargs = {
            "name": name,
            "servers": servers,
            "instruction": _build_instruction(prompt_filename),
        }
        if request_params:
            agent_kwargs["request_params"] = request_params
        return fast.agent(**agent_kwargs)(fn)

    return _decorator
```
This lets us define agents cleanly:
@shared_agent("k8s_pod_overview", "03-k8s-pod-overview.md",
servers=["datadog", "filesystem"], max_tokens=24000)
async def k8s_pod_overview(task: str) -> str:
return "Kubernetes pod overview generated."
@shared_agent("serverless_health", "10-serverless-health.md", max_tokens=20000)
async def serverless_health(task: str) -> str:
return "Serverless health report generated."
Notice something interesting: the function body is trivial. The real work happens via the instruction parameter, which loads from a Markdown file.
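For completeness, `_build_instruction` is just a file loader. A minimal sketch, assuming the prompt specs sit in a `prompts/` directory next to the agent module (our real version also prepends the shared parameters file mentioned later, which is omitted here):

```python
from pathlib import Path

# Assumed layout: prompt specs live in a prompts/ directory next to this module.
PROMPTS_DIR = Path(__file__).parent / "prompts"


def _build_instruction(prompt_filename: str) -> str:
    """Read a Markdown prompt spec and return it as the agent's instruction string."""
    # The real version also prepends shared parameters (cluster names, Datadog
    # query quirks); that part is left out of this sketch.
    return (PROMPTS_DIR / prompt_filename).read_text(encoding="utf-8")
```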
Prompt Files as Configuration
This was a key insight: prompts are specifications, not conversations. Each prompt file is a detailed spec for what data to fetch and how to format it.
Here’s a simplified example from 05-active-monitor-alerts.md:
```markdown
# Active Monitor Alerts — Platform

## 🚨 STOP - READ THIS FIRST - MANDATORY RULES 🚨

YOU ARE FORBIDDEN FROM OUTPUTTING ANY COMMENTARY, EXPLANATIONS, OR ANALYSIS.

YOUR FIRST LINE MUST BE EXACTLY:
**24-Hour Alert Activity (team:platform):**

IF YOU WRITE EVEN ONE WORD BEFORE THIS LINE, YOU HAVE FAILED THIS TASK.

---

## Objective
Produce a Datadog monitor report that:
- Highlights currently active alerts (alert and warn)
- Counts alerts that triggered in the last 24 hours
- Separately counts "no data" alerts

## Tools
- `mcp_datadog_search_datadog_monitors`
- `mcp_datadog_search_datadog_events`

## Queries & Processing
...detailed query specifications...

## Output Format
Return ONLY this content, in this exact order:
1) First line: **24-Hour Alert Activity (team:platform):**
2) Activity summary:
   - ⚡ Triggered (last 24h): {count}
   - ✅ Resolved (last 24h): {count}
...
```
The prompt is aggressive about suppressing LLM commentary. More on why this was necessary below.
Parallel Execution
With nine sections to generate (incidents, infrastructure, pod status, monitors, logs, metrics, service mesh, serverless, etc.), sequential execution would take several minutes. Fast-Agent’s built-in async handling makes parallelization trivial:
```python
async def _collect_section_outputs(agent) -> Dict[str, str]:
    results: Dict[str, str] = {}

    async def _invoke(section_name: str):
        print(f"\n🔎 Gathering data for section: {section_name}")
        response = await getattr(agent, section_name).send(Prompt.user("run"))
        results[section_name] = response.strip()

    tasks = [asyncio.create_task(_invoke(name)) for name in ACTIVE_SECTION_NAMES]
    await asyncio.gather(*tasks)
    return results
```
All sections run concurrently, each making independent Datadog MCP calls. What took 3-4 minutes sequentially now takes ~45 seconds.
Template Assembly
Once we have all section outputs, we use simple string replacement to assemble the final report:
```python
template_str = REPORT_TEMPLATE.read_text(encoding="utf-8")
rendered_report = _render_template(template_str, {
    "TIMESTAMP": ts,
    "INCIDENTS_OUTPUT": sections.get("incidents", "No data"),
    "POD_OVERVIEW_OUTPUT": sections.get("k8s_pod_overview", "No data"),
    "SERVICE_MESH_OUTPUT": sections.get("service_mesh_metrics", "No data"),
    # ... etc
})
```
No LLM involvement here—just deterministic text substitution. The LLM’s job is to generate well-formatted section content, not to understand the overall report structure.
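`_render_template` is nothing more than placeholder substitution. A minimal sketch, assuming `{{KEY}}`-style placeholders in the template (the exact marker syntax doesn't matter):

```python
def _render_template(template: str, values: dict[str, str]) -> str:
    """Replace {{KEY}} placeholders with section outputs; no LLM involved."""
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{{" + key + "}}", value or "No data")
    return rendered
```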
The Hard Problems
Building this system exposed several non-obvious challenges. Here’s what we learned.
Challenge 1: LLMs Are Conversational By Default
This was the biggest surprise. LLMs are trained to be helpful conversational assistants. When you ask for a report, they naturally want to explain what they’re doing:
```
Great! Now I have all the data I need. Let me analyze the results...

Based on the metrics collected, I can see that cluster-a has...

### Pod Error States
| Cluster | Pod | Error | Count |
```
For a chat interface, this is great. For programmatic report generation, it’s terrible. The commentary pollutes the structured output and breaks template substitution.
Our solution: aggressively explicit prompt engineering.
We added “MANDATORY OUTPUT RULES” sections at the top of every prompt:
```markdown
## 🚨 MANDATORY OUTPUT RULES 🚨

**YOU ARE FORBIDDEN FROM OUTPUTTING ANY COMMENTARY, EXPLANATIONS, OR THOUGHTS.**

**EXAMPLES OF FORBIDDEN OUTPUT:**
❌ "Let me query..." / "Now I'll check..."
❌ "I can see..." / "Based on the data..."
❌ Any explanation of what you're about to do

**YOUR FIRST LINE MUST BE:** `### Cluster Status Table:`

**IF YOU WRITE EVEN ONE WORD BEFORE THAT HEADING, YOU HAVE COMPLETELY FAILED.**
```
The strong language (“COMPLETELY FAILED”) and specific negative examples were necessary. Subtle hints didn’t work. We had to be blunt.
We also discovered that showing what not to do was more effective than just stating rules. Listing actual phrases to avoid (“Let me query…”, “Based on the data…”) dramatically reduced unwanted commentary.
Challenge 2: Datadog Metrics Don’t Support OR Syntax
This caught us off-guard. We wanted to query metrics for specific clusters:
sum:kubernetes.cpu.usage.total{cluster:(cluster-a OR cluster-b OR cluster-c)}
But Datadog’s metrics API returns a 400 error: “Error parsing query”. Turns out, OR syntax works for logs but not metrics.
The fix:
sum:kubernetes.cpu.usage.total{*} by {kube_cluster_name}
Query all clusters, then filter in post-processing. Not elegant, but it works. We documented this quirk explicitly in our shared parameters file so every agent prompt knows about it:
**CRITICAL**: Datadog metric queries DO NOT support OR syntax in tags.
**❌ WRONG** (will cause 400 error):
sum:metric{cluster:(A OR B OR C)}
**✅ CORRECT**:
sum:metric{*} by {cluster}
Challenge 3: Token Limits and Data Volume
Querying metrics for all production clusters over 24 hours generates a lot of data. We hit token limits constantly in early iterations.
Our multi-layered solution:
1. Dynamic token scaling in prompts:
```markdown
## Query Strategy
1. Start with max_tokens: 15000
2. If response truncated or URL-only:
   - Retry with max_tokens: 30000
   - If still truncated: 50000
3. If still failing: reduce time window or increase rollup
```
The “URL-only” case is interesting—when Datadog’s response would be huge, it sometimes returns a link to the metrics explorer instead of actual data. We trained our agents to detect this and retry with higher token limits.
2. Rollup intervals:
sum:kubernetes.cpu.usage.total{*} by {cluster}.rollup(sum, 60).as_rate()
The .rollup(sum, 60) aggregates data into 60-second buckets, dramatically reducing the response size while preserving trends.
3. Top-N filtering:
Every prompt includes instructions to return only the top 10 items by severity. No need to report every single pod across all production clusters, just the worst offenders.
Challenge 4: Output Format Consistency
LLMs occasionally wrapped tables in code blocks:
```
| Cluster | Status |
|---------|--------|
| cluster-a | Active |
```
This breaks Markdown rendering in our template. We added explicit format constraints:
**CRITICAL: Output tables as actual markdown tables, NOT inside code blocks (```).
Do NOT wrap your output in triple backticks.**
We also had to enforce:
- No line numbers (`1 |`, `2 |` prefixes)
- No extra indentation
- Use `-` for empty fields (not `N/A` or `null`)
- Exact emoji usage (🔴 🟡 🟢 only)
Turns out, making LLMs produce deterministic structured output requires treating the prompt like a strict API specification.
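Because these rules are mechanical, they can also be checked in plain code before template assembly. We don't do this yet (see "Add Evaluation Loop" below), but a hypothetical check might look like:

```python
import re

ALLOWED_EMOJI = {"🔴", "🟡", "🟢"}
EMOJI_PATTERN = re.compile(r"[\U0001F300-\U0001FAFF]")
FORBIDDEN_OPENERS = ("Let me", "Now I'll", "I can see", "Based on the data")


def check_section_output(text: str) -> list[str]:
    """Return a list of format violations for one section's output (illustrative only)."""
    problems: list[str] = []
    stripped = text.lstrip()
    if stripped.startswith("```"):
        problems.append("output wrapped in a code block")
    if re.search(r"^\s*\d+ \|", text, flags=re.MULTILINE):
        problems.append("line-number prefixes present")
    first_line = stripped.splitlines()[0] if stripped else ""
    if first_line.startswith(FORBIDDEN_OPENERS):
        problems.append("conversational preamble before the required heading")
    stray = set(EMOJI_PATTERN.findall(text)) - ALLOWED_EMOJI
    if stray:
        problems.append(f"unexpected emoji: {sorted(stray)}")
    return problems
```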
What Fast-Agent Got Right
Looking back, Fast-Agent’s design choices aligned perfectly with our needs:
1. Lightweight Agents
Agents are just functions. No boilerplate, no mandatory base classes. You can start simple and add complexity only when needed:
```python
@fast.agent(name="simple")
async def simple_agent(task: str) -> str:
    return "Done"
```
This is a valid agent. As requirements grow, you add servers, instruction, max_tokens, etc., but the core pattern stays clean.
2. Configuration Over Code
The fastagent.config.yaml approach means we can switch MCP servers or LLM providers without touching application code. Want to test with GPT-4 instead of Claude? Change one line:
default_model: "gpt-4o"
Want to add a new MCP server for Slack notifications? Add it to the config:
```yaml
mcp:
  servers:
    datadog: {...}
    filesystem: {...}
    slack:
      command: "npx"
      args: ["-y", "@modelcontextprotocol/server-slack"]
```
Agents that specify servers=["slack"] now have access.
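An agent that uses it looks just like the others. A hypothetical notifier, for illustration only (not part of our current pipeline):

```python
# Illustrative only: name, instruction, and behavior are placeholders.
@fast.agent(
    name="slack_notifier",
    servers=["slack"],
    instruction="Post the provided Markdown report to the on-call Slack channel.",
)
async def slack_notifier(task: str) -> str:
    return "Report posted."
```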
3. Context Handling
Fast-Agent automatically manages conversation history and MCP server connections. In our parallel execution model, each agent gets its own isolated context. We never worry about cross-contamination between sections.
4. Built-in Async
Everything is async by default. Parallel execution, MCP calls, LLM requests—it all just works. The framework handles connection pooling, retries, and timeouts transparently.
Patterns That Emerged
Building this system surfaced some useful patterns:
Section Metadata as Control Center
Instead of hardcoding which sections to run, we use a dictionary:
```python
SECTION_METADATA = {
    "incidents": incidents,
    "infrastructure_overview": infrastructure_overview,
    "k8s_pod_overview": k8s_pod_overview,
    "k8s_resource_consumption": k8s_resource_consumption,
    "active_monitor_alerts": active_monitor_alerts,
    "recent_error_logs": recent_error_logs,
    "performance_metrics": performance_metrics,
    "service_mesh_metrics": service_mesh_metrics,
    "serverless_health": serverless_health,
}

ACTIVE_SECTION_NAMES = list(SECTION_METADATA.keys())
```
Comment out a line to disable a section—no other code changes needed. This made iteration fast during development.
Executive Summary as Post-Processing
Rather than trying to generate the executive summary in parallel with data collection, we do it as a second pass:
```python
# Collect all sections in parallel
sections = await _collect_section_outputs(agent)

# Generate executive summary using section outputs
summary_payload = {
    "k8s_pod_overview_full": sections.get("k8s_pod_overview"),
    "service_mesh_metrics_full": sections.get("service_mesh_metrics"),
    "section_snippets": {
        name: _truncate_text(sections.get(name, ""), max_chars=1200)
        for name in ["active_monitor_alerts", "recent_error_logs", "serverless_health"]
    }
}

summary_response = await agent.executive_summary.send(
    Prompt.user(json.dumps(summary_payload))
)
```
The executive summary agent gets truncated versions of section outputs (to stay within token limits) and synthesizes key insights. This two-stage approach keeps prompts focused and improves reliability.
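For reference, `_truncate_text` is deliberately simple. A minimal sketch (the truncation marker is arbitrary):

```python
def _truncate_text(text: str, max_chars: int = 1200) -> str:
    """Clip a section's output so the summary prompt stays within its token budget."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "\n...[truncated]"
```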
What We’d Do Differently
Use Fast-Agent’s Built-in Parallel Pattern
We currently use asyncio.gather() manually. Fast-Agent actually has a @fast.parallel decorator for fan-out/fan-in workflows:
```python
@fast.parallel(
    fan_out=["incidents", "k8s_pods", "monitors", "service_mesh", "serverless"],
    fan_in="executive_summary",
    name="generate_report"
)
async def generate_report(task: str) -> str:
    pass
```
This would give us automatic progress tracking, better error handling, and cleaner code. We plan to migrate once we validate output consistency.
Add Evaluation Loop
Right now, if a section generates malformed output, we don’t catch it until template assembly fails. An evaluator agent could validate each section’s output before proceeding:
```python
@fast.evaluator_optimizer(
    generator="k8s_pod_overview",
    evaluator="format_validator",
    min_rating="GOOD"
)
```
Fast-Agent supports this pattern natively—we just haven’t implemented it yet.
Explore Router Pattern
We generate the same report structure every day. It would be more efficient to route based on urgency:
```python
@fast.router(
    name="smart_report",
    agents=["full_report", "critical_only", "summary_only"]
)
```
On quiet days, generate a summary. When there’s an active incident, generate the full detailed report. The router agent decides based on context.
Practical Takeaways
If you’re building something similar, here’s what we’d recommend:
1. Start With Clear Constraints
LLMs need boundaries. Don’t assume they’ll “figure it out.” Write prompts like API specs:
- Exact output format (with examples)
- Forbidden patterns (with specific phrases to avoid)
- Fallback behaviors (if query fails, do X)
- Token budgets and limits
The more explicit, the better.
2. Separate Data Fetching from Synthesis
Don’t ask one agent to “fetch metrics and generate insights.” Split it:
- Agent 1: Fetch and format metrics (deterministic)
- Agent 2: Synthesize insights from formatted data (creative)
This makes debugging easier and improves reliability.
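In Fast-Agent terms, the split is just two agents. A sketch using only the decorator pattern shown earlier (names and instructions are illustrative):

```python
# Agent 1: deterministic data fetching and formatting.
@fast.agent(
    name="latency_fetcher",
    servers=["datadog"],
    instruction="Fetch p95 latency per service and output a Markdown table. No commentary.",
)
async def latency_fetcher(task: str) -> str:
    return "Latency table generated."


# Agent 2: synthesis over already-formatted data; no MCP servers needed.
@fast.agent(
    name="latency_analyst",
    instruction="Given a latency table, list the three most concerning services and why.",
)
async def latency_analyst(task: str) -> str:
    return "Latency analysis generated."
```

At runtime, the fetcher's output becomes the analyst's user prompt, the same way the executive summary consumes section outputs.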
3. Embrace Parallelization
MCP calls are I/O-bound. Running agents in parallel is nearly free and dramatically improves latency. Fast-Agent makes this trivial.
4. Test Prompts in Isolation
We built a prompt_test.py script to test individual prompts without running the full pipeline:
python prompt_test.py --prompt prompts/10-serverless-health.md --max-tokens 30000
This was invaluable for iteration. Prompts are loaded and cached at startup, so testing a change through the full system required a restart every time; isolated testing gave us instant feedback.
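The harness itself is small. A rough sketch of what it could look like, reusing the patterns above; `fast.run()` as the entry point and the environment-variable configuration are assumptions made to keep the example compact:

```python
# prompt_test.py (sketch): run a single prompt spec in isolation.
# The real script exposes --prompt / --max-tokens flags; environment variables
# are used here only to keep the example short.
import asyncio
import os
from pathlib import Path

from fast_agent import FastAgent
from fast_agent.core.prompt import Prompt

PROMPT_FILE = os.environ.get("PROMPT_FILE", "prompts/10-serverless-health.md")
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "15000"))

fast = FastAgent("Prompt Test Harness")


# Mirrors the first agent example in this post (max_tokens passed directly);
# our production code routes it through RequestParams instead.
@fast.agent(
    name="prompt_under_test",
    servers=["datadog", "filesystem"],
    instruction=Path(PROMPT_FILE).read_text(encoding="utf-8"),
    max_tokens=MAX_TOKENS,
)
async def prompt_under_test(task: str) -> str:
    return "done"


async def main() -> None:
    # Assumption: fast.run() exposes registered agents as attributes,
    # the same way the report pipeline accesses them.
    async with fast.run() as agent:
        print(await agent.prompt_under_test.send(Prompt.user("run")))


if __name__ == "__main__":
    asyncio.run(main())
```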
5. Document API Quirks in Prompts
Every API has gotchas (like Datadog’s OR syntax limitation). Don’t document them separately—put them directly in the prompts:
**CRITICAL**: Datadog metrics DO NOT support OR syntax.
**WRONG**: {cluster:(A OR B)}
**CORRECT**: {*} by {cluster}
This ensures agents always have the right context.
6. Version Your Prompts
Prompts are code. We track them in Git, and changes go through code review. This prevented subtle regressions and made debugging easier (“which version of the prompt generated this output?”).
The Results
Our agent now generates a comprehensive daily report covering:
- Active incidents and recent alert activity
- Infrastructure status across all monitored clusters
- Pod error states (CrashLoopBackOff, OOMKilled, etc.)
- Resource utilization trends
- Recent error log patterns
- Service mesh health metrics
- Serverless function health
- Executive summary with critical action items
The report runs unattended every morning and takes ~45 seconds. It’s consistent, comprehensive, and actionable—the on-call engineer gets a clear picture without manual data gathering.
More importantly, it’s maintainable. Adding a new section is just:
- Write a prompt file
- Add an agent definition with the `@shared_agent` decorator
- Register it in `SECTION_METADATA`
No plumbing changes, no framework wrestling.
Why Fast-Agent Worked For Us
Reflecting on this project, Fast-Agent succeeded because it got out of our way. We didn’t fight the framework—it aligned with how we naturally thought about the problem:
- Agents as functions: Simple mental model, easy to reason about
- MCP as the integration layer: Standard protocol, broad ecosystem
- Configuration over code: Swap servers/models without refactoring
- Async by default: Parallel execution “just works”
The framework had opinions (decorator-based agents, YAML config, Prompt objects), but they were lightweight opinions. We could still structure our code however made sense for our use case.
What’s Next
We’re exploring several extensions:
1. Slack Integration: Post reports automatically via an MCP Slack server
2. Anomaly Detection: Add an agent that compares today’s metrics to historical trends and flags unusual patterns
3. Interactive Mode: Use Fast-Agent’s built-in interactive prompt for ad-hoc queries during incidents
4. Multi-Report Support: Extend beyond daily reports—weekly summaries, incident retrospectives, capacity planning reports
The architecture we built is flexible enough to support all of these without major changes. That’s the beauty of the modular agent approach.
Closing Thoughts
Building agents that work reliably in production is harder than it looks. The demo-to-production gap is real. LLMs are powerful but nondeterministic, APIs have quirks, and integrating multiple systems introduces failure modes you don’t see in tutorials.
But when you pair the right abstractions (Fast-Agent) with the right protocol (MCP) and invest in good prompt engineering, you end up with something that actually works. Not a proof-of-concept, not a demo—a tool your team uses every day.
If you’re building something similar, I hope our experience helps you avoid some of the pitfalls we encountered. And if you’re evaluating agent frameworks, give Fast-Agent a look. It might not be the flashiest option, but for production systems that need to work, it’s been solid.
The code for our on-call report agent is internal, but the patterns and techniques are broadly applicable. Fast-Agent is open source and well-documented at fast-agent.ai, and Datadog’s MCP server is publicly available. If you’re working on similar problems, feel free to reach out—I’m always curious to hear about other approaches.