Content moderator
A two-stage agent that classifies user content against moderation guidelines and flags PII, returning typed verdicts via structured output.
This example builds a content moderator that takes a piece of user-generated text and returns a structured verdict: is it safe, how severe is any violation, which policies it breaks, and what to do about it. The moderation guidelines live entirely in the agent's instructions, and the answer comes back as a typed Pydantic model instead of free text — so you can branch on result.output.is_safe in code without parsing a paragraph.
It also shows a common pattern: a second, focused agent that scans the same text for personally identifiable information (PII). Each stage has its own output model, and you compose them in plain Python.
The main concept on display is structured output — passing output=Model to get an instance of your own type back.
The moderation model
Define exactly the shape you want back and save it as models.py, since the agent script imports from that module. The model is the contract: the agent is told to produce it, and result.output is an instance of it.
```python
from enum import Enum

from pydantic import BaseModel, Field


class Severity(str, Enum):
    none = "none"
    low = "low"
    medium = "medium"
    high = "high"


class Violation(BaseModel):
    category: str = Field(description="Policy category, e.g. 'harassment', 'hate', 'self-harm'")
    quote: str = Field(description="The specific span of text that triggered this violation")
    reason: str = Field(description="Why this span violates the category")


class Moderation(BaseModel):
    is_safe: bool
    severity: Severity
    violations: list[Violation]
    recommendation: str = Field(description="One of: allow, flag-for-review, block")
```

Using a str-valued Enum for severity keeps the model honest: Pydantic checks the field against the four levels, so any other string fails validation instead of reaching your code.
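To see the contract at work without calling a model, you can hand Pydantic a dict shaped like an LLM reply. This sketch assumes Pydantic v2 (`model_json_schema`, `model_validate`, `ValidationError.error_count`) and repeats the models so it runs on its own. Whether agentroute sends exactly this JSON Schema to the provider is an assumption; this page does not say how the schema is transmitted.

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError


class Severity(str, Enum):
    none = "none"
    low = "low"
    medium = "medium"
    high = "high"


class Violation(BaseModel):
    category: str = Field(description="Policy category, e.g. 'harassment', 'hate', 'self-harm'")
    quote: str = Field(description="The specific span of text that triggered this violation")
    reason: str = Field(description="Why this span violates the category")


class Moderation(BaseModel):
    is_safe: bool
    severity: Severity
    violations: list[Violation]
    recommendation: str = Field(description="One of: allow, flag-for-review, block")


# The JSON Schema Pydantic derives from the model; Severity lands in $defs as an enum.
schema = Moderation.model_json_schema()
print(schema["$defs"]["Severity"]["enum"])  # ['none', 'low', 'medium', 'high']

# A well-formed reply parses into typed objects, including the nested Violation.
ok = Moderation.model_validate({
    "is_safe": False,
    "severity": "high",
    "violations": [
        {"category": "harassment", "quote": "I'll find you", "reason": "Threat aimed at the reader"},
    ],
    "recommendation": "block",
})
print(ok.severity is Severity.high, type(ok.violations[0]).__name__)  # True Violation

# An out-of-range severity fails validation instead of reaching your code.
errors = 0
try:
    Moderation.model_validate(
        {"is_safe": True, "severity": "extreme", "violations": [], "recommendation": "allow"}
    )
except ValidationError as exc:
    errors = exc.error_count()
print(errors)  # 1
```

Note that recommendation is a plain str, so a value like "reject" would pass validation. Typing it as `Literal["allow", "flag-for-review", "block"]` would give it the same protection severity gets from the Enum.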
The full agent
The guidelines go in instructions. The moderator agent is configured with output=Moderation, so a single run returns a fully populated Moderation instance. A second agent, pii_agent, reads the same input and reports PII through its own PIIReport model. Save this script as content_moderator.py alongside models.py.
```python
import asyncio

from agentroute import Agent
from pydantic import BaseModel, Field

from models import Moderation

GUIDELINES = """
You are a content moderation system. Evaluate the user's text against these policies:
- harassment: targeted insults, threats, or demeaning language aimed at a person or group
- hate: dehumanizing content based on protected attributes
- self-harm: encouragement or instructions for self-harm
- violence: credible threats or glorification of violence
- sexual: explicit sexual content involving minors is always a high-severity block
Rules:
- Quote the exact span that triggered each violation.
- severity is the maximum across all violations ("none" if there are none).
- is_safe is true only when there are zero violations.
- recommendation must be one of: "allow", "flag-for-review", "block".
  Use "allow" for safe text, "flag-for-review" for low/medium severity,
  and "block" for high severity.
Be precise. Do not invent violations that are not clearly present.
"""

moderator = Agent(
    name="content-moderator",
    model="claude-sonnet-4",
    instructions=GUIDELINES,
    output=Moderation,
)


class PIIItem(BaseModel):
    kind: str = Field(description="e.g. email, phone, ssn, credit_card, address")
    value: str = Field(description="The detected value as it appears in the text")


class PIIReport(BaseModel):
    has_pii: bool
    items: list[PIIItem]


pii_agent = Agent(
    name="pii-detector",
    model="claude-sonnet-4",
    instructions=(
        "Detect any personally identifiable information in the text. "
        "Return each piece of PII with its kind and the exact value found. "
        "has_pii is true only when at least one item is present."
    ),
    output=PIIReport,
)


async def review(text: str) -> tuple[Moderation, PIIReport]:
    # Run both stages concurrently; they are independent.
    verdict, pii = await asyncio.gather(
        moderator.arun(text),
        pii_agent.arun(text),
    )
    return verdict.output, pii.output


async def main() -> None:
    text = "Hey loser, I know you live at 42 Elm St — email me at jo@x.com or I'll find you."
    verdict, pii = await review(text)

    print(f"safe={verdict.is_safe} severity={verdict.severity.value} -> {verdict.recommendation}")
    for v in verdict.violations:
        print(f"  [{v.category}] {v.quote!r}: {v.reason}")

    if pii.has_pii:
        print("PII found:")
        for item in pii.items:
            print(f"  {item.kind}: {item.value}")


if __name__ == "__main__":
    asyncio.run(main())
```

Because both moderator and pii_agent expose arun, you can fan them out with asyncio.gather: each runs as its own LLM call, and you read the typed result off each Result.output.
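asyncio.gather returns results in the order its awaitables were passed, not the order they finish, which is why `verdict, pii = await asyncio.gather(...)` unpacks reliably. The sketch below swaps the two agent calls for hypothetical stub coroutines (`fake_moderate`, `fake_pii`) that only sleep, so it runs without an API key; the timing shows the two waits overlapping rather than adding up.

```python
import asyncio
import time


async def fake_moderate(text: str) -> str:
    # Hypothetical stand-in for moderator.arun(text); the slower of the two.
    await asyncio.sleep(0.2)
    return f"verdict for {len(text)} chars"


async def fake_pii(text: str) -> str:
    # Hypothetical stand-in for pii_agent.arun(text); finishes first.
    await asyncio.sleep(0.1)
    return f"pii report for {len(text)} chars"


async def review(text: str) -> tuple[str, str]:
    # Results come back in argument order, not completion order.
    verdict, pii = await asyncio.gather(fake_moderate(text), fake_pii(text))
    return verdict, pii


start = time.perf_counter()
verdict, pii = asyncio.run(review("hello"))
elapsed = time.perf_counter() - start
print(verdict, "|", pii)  # verdict for 5 chars | pii report for 5 chars
print(f"{elapsed:.2f}s")  # about the longer sleep, not the sum of both
```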
With output=Moderation, result.output is a Moderation instance — so verdict.severity is a Severity enum and verdict.violations is a list[Violation]. There is no JSON parsing or string matching in your code. See structured output and Result.
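One caveat when branching on the typed result: because Severity mixes in str, comparison operators fall back to string comparison, so `Severity.high < Severity.low` is true (alphabetical order). For "at least medium" style thresholds, an explicit rank is safer. The `RANK` table and `at_least` helper below are hypothetical, not part of agentroute; the Severity copy keeps the sketch self-contained.

```python
from enum import Enum


class Severity(str, Enum):
    none = "none"
    low = "low"
    medium = "medium"
    high = "high"


# Explicit ordering; the enum's own < would compare the underlying strings alphabetically.
RANK = {Severity.none: 0, Severity.low: 1, Severity.medium: 2, Severity.high: 3}


def at_least(severity: Severity, threshold: Severity) -> bool:
    """True when severity is as bad as or worse than threshold."""
    return RANK[severity] >= RANK[threshold]


print(Severity.high < Severity.low)             # True: "high" sorts before "low"
print(at_least(Severity.high, Severity.low))    # True
print(at_least(Severity.low, Severity.medium))  # False
```

In the example above, `at_least(verdict.severity, Severity.medium)` would be one way to decide whether a human should look at the text.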
Run it
```bash
export AGENTROUTE_API_KEY="sk-..."
python content_moderator.py
```

A single OpenRouter key (AGENTROUTE_API_KEY or OPENROUTER_API_KEY) works for every model string, including claude-sonnet-4. See models for how model strings resolve.
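If neither key is exported, the first agent call is where you would find out. A hypothetical pre-flight helper like the one below fails fast instead. It only checks that one of the two variable names from this page is set and non-empty; its check order is arbitrary and says nothing about which variable agentroute prefers when both are present.

```python
import os
from collections.abc import Mapping
from typing import Optional

# The two variable names this page lists; the order here is arbitrary.
KEY_VARS = ("AGENTROUTE_API_KEY", "OPENROUTER_API_KEY")


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the name of a configured key variable, or raise before any model call."""
    source = os.environ if env is None else env
    for name in KEY_VARS:
        if source.get(name):  # empty strings count as unset
            return name
    raise RuntimeError("Set AGENTROUTE_API_KEY or OPENROUTER_API_KEY before running.")


print(require_api_key({"OPENROUTER_API_KEY": "sk-..."}))  # OPENROUTER_API_KEY
```

Calling `require_api_key()` with no argument at the top of main() checks the real environment.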
Variations
Add an @moderator.output_validator that raises Retry("...") when the model returns is_safe=True but a non-empty violations list, forcing a self-correction. See errors and retries.
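The Rules in GUIDELINES are deterministic once the output is typed, so a validator can check them in plain code. The function below is a sketch of what such a check might test; `find_inconsistencies` and `EXPECTED` are hypothetical names, and how the messages are passed to Retry depends on agentroute's validator API, which this sketch does not assume. The models are trimmed copies so the block runs on its own.

```python
from enum import Enum

from pydantic import BaseModel


class Severity(str, Enum):
    none = "none"
    low = "low"
    medium = "medium"
    high = "high"


class Violation(BaseModel):
    category: str
    quote: str
    reason: str


class Moderation(BaseModel):
    is_safe: bool
    severity: Severity
    violations: list[Violation]
    recommendation: str


# Recommendation each severity should map to, per the Rules in GUIDELINES.
EXPECTED = {
    Severity.none: "allow",
    Severity.low: "flag-for-review",
    Severity.medium: "flag-for-review",
    Severity.high: "block",
}


def find_inconsistencies(m: Moderation) -> list[str]:
    """Return a message for each rule the verdict breaks; empty means consistent."""
    problems: list[str] = []
    has_violations = bool(m.violations)
    if m.is_safe == has_violations:
        problems.append("is_safe must be true exactly when violations is empty")
    if has_violations == (m.severity is Severity.none):
        problems.append('severity must be "none" exactly when violations is empty')
    if m.recommendation != EXPECTED[m.severity]:
        problems.append(f"recommendation for {m.severity.value} severity must be {EXPECTED[m.severity]!r}")
    return problems


# The case described above: is_safe=True alongside a non-empty violations list.
bad = Moderation(
    is_safe=True,
    severity=Severity.high,
    violations=[Violation(category="harassment", quote="loser", reason="Insult aimed at the reader")],
    recommendation="block",
)
print(find_inconsistencies(bad))  # ['is_safe must be true exactly when violations is empty']
```

A validator would raise Retry with these messages whenever the list is non-empty, giving the model concrete feedback to correct against.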
If you only need the moderation verdict, skip the gather and call moderator.run(text).output directly — no async required.
Concepts used
- Structured output: output=Moderation and reading result.output
- Agents: configuring instructions and model
- Results and Result: the typed return value
- Errors and retries: Retry from an output validator for self-correction
- Models: how claude-sonnet-4 resolves to a provider
See the full examples index for more end-to-end agents.