Building a new agent

A developer guide for creating LLM-backed microservices that live alongside the AarhusAI docker stack. The retrieval-agent is the worked example throughout - every pattern documented here is taken from there.

What this guide is for
Prerequisites
Glossary
Overview
Recommended stack
Repository layout
Step-by-step setup
PydanticAI patterns
Production - docker compose server
Production builds (multi-arch)
Verification checklist
Common pitfalls / FAQ
Reference index

1. What this guide is for

This guide is for a developer new to the project who’s been asked to build an agent-based tool that plugs into the AarhusAI platform. If you’ve never seen this codebase before, start here.

AarhusAI docker is the Docker orchestration around a customised fork of Open WebUI - the AI chat platform users interact with. “Agents” in this org are standalone FastAPI microservices that extend the platform with LLM-backed workflows: retrieval, summarisation, classification, multi-step reasoning, anything that benefits from running outside the main UI process. The canonical example shipped today is retrieval-agent, a sibling repo to aarhusai-docker/; every pattern in this guide is taken from it.

This guide focuses on building a new agent and is not a tour of the existing stack.

2. Prerequisites

Install on your machine

Docker with buildx (recent Docker Desktop, or docker-buildx-plugin on Linux).
Go Task: brew install go-task (macOS) or apt install go-task (Debian/Ubuntu). Commands are invoked as task <name>.
An editor of choice. Python 3.11+ locally is helpful for IDE integration, but all Python tooling actually runs inside the container - no local venv required.

Access you’ll need

A local clone of AarhusAI docker, ideally as a sibling directory of the one where you’ll create your new agent.
A ghcr.io Personal Access Token with write:packages scope, for publishing production images.
A LiteLLM virtual key if your agent actually calls models - ask whoever owns the stack.
Read access to the AarhusAI GitHub org for the existing agent repos.

Knowledge level assumed

Comfortable with Python and async/await.
Have used Docker and docker-compose before (you don’t need to be an expert).
Have read or skimmed FastAPI’s tutorial. PydanticAI is introduced with snippets in §8 - no prior exposure needed.

3. Glossary

The terms below show up throughout the rest of the doc. Skim once and refer back as needed.

Term	Means	Where you’ll see it
AarhusAI docker	This monorepo of Docker orchestration; the “parent stack”.	Throughout
Open WebUI	The AI chat platform (forked at AarhusAI/open-webui), the user-facing UI.	Parent stack; not modified here
LiteLLM	LLM proxy at `http://litellm:4000/v1` inside the stack; agents route every LLM call through it.	§4, §7, §8, §9
Qdrant	Vector database; used by retrieval-agent for RAG.	§9, retrieval-agent
Traefik	Reverse proxy / ingress that fronts every public-facing service.	§7.4, §9
`frontend` network	External Docker network Traefik reads from; every service that takes user traffic joins it.	§7.4, §9
`app` network	Internal Docker network where services talk to each other (LiteLLM, Qdrant, Postgres, Redis).	§7.4, §9
retrieval-agent	Canonical example agent. Repo: https://github.com/AarhusAI/retrieval-agent.	Throughout
PydanticAI	LLM agent framework - bounded loops, tool calls, structured output.	§5, §8
Go Task	Task runner (`Taskfile.yml`) used across the stack. `task <name>` typically runs commands inside the container.	§7.5
ghcr.io	GitHub Container Registry. Production images publish to `ghcr.io/aarhusai/<repo>`.	§10

4. Overview

A “new agent” in this stack is a standalone FastAPI microservice that:

Lives in its own git repository, sibling to aarhusai-docker/.
Builds a multi-arch Docker image (linux/amd64,linux/arm64) pushed to ghcr.io/aarhusai/<repo-name>.
Runs locally from its own docker-compose.yml (optional), joined to the parent stack’s app network (for service-to-service calls like LiteLLM, Qdrant, Postgres, Redis) and Traefik’s frontend network (for ingress).
In production, is referenced as a pre-built image from the parent stack’s docker-compose.yml / docker-compose.server.yml - the parent stack pulls the image, no source mount.

flowchart LR
    user([user])

    subgraph frontend["frontend network"]
        traefik[Traefik]
    end

    subgraph app["app network"]
        agent["my-agent<br/>(FastAPI :8000)"]
        litellm["litellm :4000"]
        qdrant["qdrant :6333"]
        postgres["postgres :5432"]
        redis["redis :6379"]
    end

    user --> traefik --> agent
    agent --> litellm
    agent --> qdrant
    agent -.-> postgres
    agent -.-> redis

Solid arrows are typical dependencies (almost every agent calls LiteLLM; many call Qdrant). Dashed arrows are optional - most agents don’t need Postgres or Redis directly. The parent stack provides all four; you join the app network and pick what you actually use.

Use this template when you need:

An LLM-backed HTTP endpoint (summarization, classification, extraction, agentic workflows).
Structured output via PydanticAI.
Tool-calling against LiteLLM-proxied models.
A long-lived stateful client (vector DB, blob store, cache).

Don’t use it for:

Long-running batch jobs - use a task runner / cron worker pattern instead.
Pure frontend changes - patch Open WebUI directly via the parent stack’s patch system.
Anything that needs Open WebUI’s session/auth context. Open WebUI integration patterns (RAG external retrieval, OpenAI-compatible endpoints, tool servers) are out of scope for this guide - wire that up per agent type once the service exists.

5. Recommended stack

Use this stack unless you have a specific reason not to - sticking with it means future maintainers (and Claude) can read your repo without context-switching.

Concern	Choice	Why
Web framework	`fastapi>=0.115` + `uvicorn[standard]>=0.30`	Async, OpenAPI docs for free, mature dep-injection
Config	`pydantic-settings>=2.0`	Env-first, typed, fails fast at import time
Validation	`pydantic>=2.0`	Already a peer dep; use the same models for I/O and config
Agent loop / tool-calling	`pydantic-ai-slim[openai]>=0.2.0,<1.0`	Bounded iteration, structured output, tool-call retry
Single-shot LLM calls	`httpx>=0.27`	Talk directly to the OpenAI-compatible API; cheaper than spinning up a PydanticAI Agent for one call
HTTP client	`httpx`	Async, used for all outbound HTTP
Tests	`pytest>=8` + `pytest-asyncio>=0.25` (`asyncio_mode=auto`) + `pytest-cov>=7`	Async-native, mature fixtures
Lint + format	`ruff>=0.9`	One tool, replaces black/isort/flake8
Python	`3.12-slim` container, `requires-python = ">=3.11"` in `pyproject.toml`	Container parity, allows local 3.11 for tooling
Container build	Docker buildx + QEMU	Multi-arch (`linux/amd64,linux/arm64`); devs on Apple Silicon need arm64
Task runner	Go Task (`Taskfile.yml`)	Matches parent stack convention

PydanticAI vs raw httpx - rule of thumb: if a request hits the LLM more than once (tool-calling, retries, multi-step reasoning), use PydanticAI. If it’s a single completion (extract → respond), use plain httpx against the OpenAI-compatible endpoint. The retrieval-agent does both: app/services/agent.py for the agent loop, and app/services/query_generation.py for single-shot calls.

6. Repository layout

Every file below has a job. The “routes are thin, services hold logic” convention is what keeps an agent readable as it grows - you can scan the routes to learn the public surface, then dive into services for behaviour.

The canonical sibling-repo structure, mirroring retrieval-agent:

my-agent/
├── app/
│   ├── __init__.py
│   ├── main.py            # FastAPI app, lifespan, health endpoints
│   ├── config.py          # pydantic-settings - env-driven config
│   ├── auth.py            # Bearer-token middleware
│   ├── models.py          # Request/response schemas (pydantic)
│   ├── routes/            # Route handlers - thin shells, auth + validation only
│   │   ├── __init__.py
│   │   └── <endpoint>.py
│   └── services/          # Business logic - one module per concern
│       ├── __init__.py
│       └── <service>.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py        # Env setup BEFORE app imports + autouse cleanup
│   ├── test_health.py
│   ├── test_auth.py
│   └── services/
│       └── test_<service>.py
├── Dockerfile             # Multi-stage: base → dev → prod
├── docker-compose.yml     # Standalone dev compose
├── Taskfile.yml           # task wrapper (build, test, lint, build:image)
├── pyproject.toml         # deps, ruff, pytest config
├── .env.example           # Required + optional env vars, no secrets
├── .dockerignore          # Allowlist style
├── .gitignore             # Standard Python + .env
├── README.md
└── CLAUDE.md              # Agent-specific Claude Code instructions

Convention: routes are thin, services hold the logic. A route does (1) auth via Depends(verify_api_key), (2) request validation via the pydantic body model, (3) one call into app/services/.... Everything else lives in services. Reference: retrieval-agent/app/routes/search.py is a 20-line shell over retrieval-agent/app/services/pipeline.py.

7. Step-by-step setup

Each subsection introduces what the file is for before showing the template. Replace my-agent with your repo name throughout.

7.1 Bootstrap the repo

This is the housekeeping that keeps secrets and build artefacts out of git and the Docker build context.

mkdir my-agent && cd my-agent
git init

Once app/ and tests/ exist (after §7.6 and §7.10), make sure each Python package directory has an empty __init__.py: app/, app/routes/, app/services/, tests/, tests/services/. This is easy to forget, and a missing file tends to surface later as a confusing import or test-discovery error.

Create .gitignore:

__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.ruff_cache/
.coverage
htmlcov/
.env
.env.local
.venv/

Create .dockerignore (allowlist style - only what the Dockerfile needs):

# Exclude everything by default (allowlist approach)
*

# Allow only what the Dockerfile needs
!pyproject.toml
!app/

# Deny patterns inside allowed dirs
app/**/__pycache__
app/**/*.pyc
app/**/*.pyo

Reference: retrieval-agent/.dockerignore.

7.2 `pyproject.toml`

This is the single source of truth for dependencies, the lint/format config, and the test runner config. Container builds install from this file, so changes here mean rebuilding the container (see §12).

Drop in this template, change name/description, then add agent-specific deps:

[project]
name = "my-agent"
version = "0.1.0"
description = "Short description of what this agent does"
requires-python = ">=3.11"
dependencies = [
    "fastapi>=0.115.0",
    "uvicorn[standard]>=0.30.0",
    "pydantic>=2.0",
    "pydantic-settings>=2.0",
    "httpx>=0.27.0",
    "pydantic-ai-slim[openai]>=0.2.0,<1.0",
]

[project.optional-dependencies]
dev = [
    "ruff>=0.9.0",
    "pytest>=8.0",
    "pytest-asyncio>=0.25.0",
    "pytest-cov>=7.0",
]

[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"

[tool.ruff]
target-version = "py311"
line-length = 99
exclude = ["build"]

[tool.ruff.lint]
select = ["E", "W", "F", "I", "UP", "B", "SIM", "RUF"]
ignore = ["B008"]  # Allow Depends() in function defaults (FastAPI pattern)

[tool.ruff.lint.isort]
known-first-party = ["app"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]

Reference: retrieval-agent/pyproject.toml.

7.3 `Dockerfile`

Multi-stage build: base (shared deps) → dev (adds lint/test tools - the running container doubles as your test environment) → prod (slimmer image you ship). Both runnable targets run as non-root and expose 8000.

FROM python:3.12-slim AS base

WORKDIR /app

RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*

COPY pyproject.toml .

# --- Dev target: includes test/lint tools ---
FROM base AS dev
RUN pip install --no-cache-dir ".[dev]"
COPY app/ app/
RUN adduser --system --no-create-home appuser
USER appuser

EXPOSE 8000
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

# --- Prod target: runtime deps only ---
FROM base AS prod
RUN pip install --no-cache-dir .
COPY app/ app/
RUN adduser --system --no-create-home appuser
USER appuser

EXPOSE 8000
HEALTHCHECK CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

If your agent needs an on-disk model cache (HuggingFace, fastembed, etc.), add these lines to both dev and prod after COPY app/ app/, and mount a named volume to /cache in docker-compose.yml:

RUN adduser --system --no-create-home appuser \
 && mkdir -p /cache/hf /cache/fastembed \
 && chown -R appuser /cache
USER appuser

Docker’s named-volume first-mount copies the directory’s ownership into the volume, so creating /cache in the image and chowning it to appuser is what lets the non-root process write to the mounted volume. Reference: retrieval-agent/Dockerfile (which uses this pattern for the BM42 sparse model cache).

7.4 Docker compose

In development (local), the agent is mounted and built from source. The retrieval-agent’s existing entry (around aarhusai-docker/docker-compose.yml:517-560) is the template.

Add this block to docker-compose.yml:

services:
  my-agent:
    build:
      dockerfile: Dockerfile
      context: ./my-agent
      target: ${TARGET:-dev}
    command: [ "python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload" ]
    networks:
      - app
      - frontend
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "8000"
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8000/health" ]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    environment:
      API_KEY: ${API_KEY:-CHANGE_ME_NOW}
      AGENT_MODEL: ${AGENT_MODEL:-gpt-4o-mini}
      AGENT_API_BASE_URL: ${AGENT_API_BASE_URL:-http://litellm:4000/v1}
      AGENT_API_KEY: ${AGENT_API_KEY:-}
      DEBUG: ${DEBUG:-false}
    volumes:
      - ./my-agent:/app
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=frontend"
      - "traefik.http.routers.${COMPOSE_PROJECT_NAME}.rule=Host(`${COMPOSE_DOMAIN}`)"
      - "traefik.http.routers.${COMPOSE_PROJECT_NAME}.middlewares=redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"

Reference: retrieval-agent/docker-compose.yml.

7.5 `Taskfile.yml`

Go Task is the dev-experience layer over Docker. task up, task test, task lint etc. all wrap docker compose exec my-agent … under the hood, so every Python command runs inside the running container - no local venv needed. If task isn’t on your PATH yet, see §2.

# https://taskfile.dev
version: "3"

dotenv: [ ".env.local", ".env" ]

vars:
  # NOTE: the default values in this file are assumptions - compare with retrieval-agent/Taskfile.yml.
  CONTAINER_RUNTIME: '{{.CONTAINER_RUNTIME | default "docker"}}'
  DOCKER_COMPOSE: '{{.DOCKER_COMPOSE | default "docker compose"}}'
  SERVICE: my-agent
  # Prefix that runs a command inside the running service container.
  PYTHON: "{{.DOCKER_COMPOSE}} exec {{.SERVICE}}"

tasks:
  default:
    desc: Show available tasks
    cmds:
      - task --list

  # --- Code quality ---
  lint:
    desc: Run all linters
    cmds:
      - task: lint:check
      - task: lint:format:check

  lint:check:
    desc: Check code with ruff
    cmds:
      - "{{.PYTHON}} ruff check ."

  lint:fix:
    desc: Fix code issues with ruff
    cmds:
      - "{{.PYTHON}} ruff check --fix ."

  lint:format:
    desc: Format code with ruff
    cmds:
      - "{{.PYTHON}} ruff format ."

  lint:format:check:
    desc: Check code formatting with ruff
    cmds:
      - "{{.PYTHON}} ruff format --check ."

  # --- Testing ---
  test:
    desc: Run all tests
    cmds:
      - "{{.PYTHON}} pytest -v"

  test:coverage:
    desc: Run tests with coverage report
    cmds:
      - "{{.PYTHON}} pytest --cov=app --cov-report=term-missing -v"

  # --- CI ---
  ci:
    desc: Run all CI checks (lint + test)
    cmds:
      - task: lint
      - task: test

  # --- Build & push production image (multi-arch) ---
  build:image:
    desc: "Build and push production image to ghcr.io (multi-arch). Override PLATFORMS to build one arch (e.g. PLATFORMS=linux/amd64) for faster local builds."
    vars:
      IMAGE: ghcr.io/aarhusai/my-agent
      TAG: '{{.TAG | default "latest"}}'
      PLATFORMS: '{{.PLATFORMS | default "linux/amd64,linux/arm64"}}'
    cmds:
      - task: build:image:builder
      - docker buildx build --builder my-agent-builder --platform {{.PLATFORMS}} --target prod -t {{.IMAGE}}:{{.TAG}} --push .

  build:image:builder:
    desc: "Ensure the buildx builder + QEMU binfmt handlers exist (idempotent first-run setup)."
    internal: true
    silent: true
    cmds:
      - cmd: |
          if ! docker buildx inspect my-agent-builder >/dev/null 2>&1; then
            echo "First-run setup: registering QEMU binfmt handlers (cross-arch emulation)..."
            docker run --privileged --rm tonistiigi/binfmt --install all
            echo "Creating buildx builder 'my-agent-builder' (docker-container driver)..."
            docker buildx create --name my-agent-builder --driver docker-container --bootstrap
          fi

Reference: retrieval-agent/Taskfile.yml.

7.6 `app/main.py`

The entry point: FastAPI instance, a lifespan context manager for client setup/teardown, and two health endpoints. Liveness (/health) always returns 200 if the process is up. Readiness (/health/ready) actually probes downstreams - that’s what the platform’s orchestrator uses to decide if your service is ready for traffic.

import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.responses import JSONResponse

from app.config import settings

# from app.routes.<your_route> import router as <your_router>
# from app.services import <external_clients>

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
if settings.debug:
    logging.getLogger("app").setLevel(logging.DEBUG)
log = logging.getLogger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI):
    log.info("Starting %s", app.title)
    log.info("Agent model: %s (%s)", settings.agent_model, settings.agent_api_base_url)

    # Eagerly initialize long-lived clients here so first-request latency
    # doesn't include connection setup. Example:
    # await some_service.preload()

    yield

    # Shutdown - close every long-lived client.
    # await some_service.close()
    log.info("%s shut down", app.title)


app = FastAPI(
    title="My Agent",
    description="What this agent does",
    version="0.1.0",
    lifespan=lifespan,
)


# app.include_router(<your_router>)


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.get("/health/ready")
async def health_ready():
    """Readiness probe - verifies downstream connectivity."""
    try:
        # Probe each downstream you depend on. Example:
        # await some_service.ping()
        return {"status": "ok"}
    except Exception as exc:
        log.warning("Readiness check failed: %s", exc)
        return JSONResponse(
            status_code=503,
            content={"status": "error", "detail": str(exc)},
        )

Reference: retrieval-agent/app/main.py.

Why eager init in lifespan, not lazy in services? The first request shouldn’t pay startup cost (cold model load, DNS resolution, vector-store capability probes). Lifespan runs once per process; failures here surface as clear startup errors instead of a confusing 500 on the first request.
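The fail-fast behaviour can be sketched without FastAPI, using only the standard library. Everything here is illustrative: FakeVectorClient and its preload()/close() methods are hypothetical stand-ins for a real long-lived client, and this lifespan takes the client as its argument (FastAPI passes the app instead).

```python
import asyncio
from contextlib import asynccontextmanager


class FakeVectorClient:
    """Hypothetical stand-in for a long-lived downstream client."""

    def __init__(self, reachable: bool) -> None:
        self.reachable = reachable
        self.connected = False

    async def preload(self) -> None:
        if not self.reachable:
            raise ConnectionError("vector store unreachable")
        self.connected = True

    async def close(self) -> None:
        self.connected = False


@asynccontextmanager
async def lifespan(client: FakeVectorClient):
    # Eager init: a broken downstream raises here, before any request is served.
    await client.preload()
    try:
        yield
    finally:
        # Shutdown: always release the client, even if serving raised.
        await client.close()


async def serve(client: FakeVectorClient) -> bool:
    """Return whether the client was connected while 'serving'."""
    async with lifespan(client):
        return client.connected
```

With a reachable client, serve() sees a connected client and the client is closed afterwards; with an unreachable one, the ConnectionError is raised during startup rather than inside a request handler.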

7.7 `app/config.py`

All runtime config comes from environment variables - no config.yaml, no command-line flags. pydantic-settings turns env vars into a typed object you import everywhere. Define every setting with a type and (where reasonable) a default, and instantiate settings at module level so Settings() runs at import time and crashes fast on missing required vars.

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    model_config = {"env_file": ".env", "env_file_encoding": "utf-8", "extra": "ignore"}

    # --- Auth ---
    api_key: str  # required - no default

    # --- LLM (OpenAI-compatible API, typically LiteLLM proxy) ---
    agent_model: str = "gpt-4o-mini"
    agent_api_base_url: str = "http://litellm:4000/v1"
    agent_api_key: str = ""
    agent_timeout: int = 60

    # --- Debug ---
    debug: bool = False

    # --- Server ---
    host: str = "0.0.0.0"
    port: int = 8000


settings = Settings()

Things to know:

extra: "ignore" lets the same .env carry vars for other tools without choking validation.
A field without a default is required - missing it crashes at import time, which is what you want.
Document any cross-system contracts with an inline comment (e.g., “must match the value the X service was configured with”). The retrieval-agent’s embedding_prefix_query is the canonical example.

Reference: retrieval-agent/app/config.py.

7.8 `app/auth.py`

Bearer-token auth - whoever calls your agent (Open WebUI, another service, a curl command) must include the key in an Authorization: Bearer <key> header. FastAPI’s HTTPBearer extracts the credential; hmac.compare_digest does a constant-time compare to prevent timing attacks against the key. The dependency returns the validated key so routes can declare Depends(verify_api_key) to gate themselves.

import hmac

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

from app.config import settings

_bearer = HTTPBearer()


async def verify_api_key(
        credentials: HTTPAuthorizationCredentials = Depends(_bearer),
) -> str:
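    # Compare as bytes: hmac.compare_digest raises TypeError for non-ASCII str input.
    presented = credentials.credentials.encode("utf-8")
    if not hmac.compare_digest(presented, settings.api_key.encode("utf-8")):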
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid API key",
        )
    return credentials.credentials

Reference: retrieval-agent/app/auth.py. Usage on a route:

@router.post("/do-thing")
async def do_thing(req: MyRequest, _api_key: str = Depends(verify_api_key)) -> MyResponse:
    return await my_service.handle(req)
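
If more than one entry point needs the same check, the comparison can be factored into a small helper. This is a minimal sketch, not code taken from the retrieval-agent; the name token_matches is illustrative. It compares UTF-8 bytes rather than str, because hmac.compare_digest raises TypeError when given str arguments that contain non-ASCII characters.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens with hmac.compare_digest, tolerating non-ASCII input."""
    # Encoding first avoids the TypeError that compare_digest raises for
    # str arguments containing non-ASCII characters.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A dependency such as verify_api_key could then call token_matches(credentials.credentials, settings.api_key) and raise the 401 when it returns False.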

7.9 Routes and services

This is where your agent’s actual behaviour lives. Routes do auth + request validation only; services hold the work - every agent in this org follows that split, so sticking to it keeps your code familiar to anyone reading it later. A minimal “echo with LLM completion” example:

app/models.py:

from pydantic import BaseModel, Field


class EchoRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=10_000)


class EchoResponse(BaseModel):
    original: str
    rewritten: str

app/routes/echo.py:

import logging

from fastapi import APIRouter, Depends

from app.auth import verify_api_key
from app.models import EchoRequest, EchoResponse
from app.services.rewrite import rewrite

log = logging.getLogger(__name__)
router = APIRouter()


@router.post("/echo", response_model=EchoResponse)
async def echo(req: EchoRequest, _api_key: str = Depends(verify_api_key)) -> EchoResponse:
    log.info("Echo request: %d chars", len(req.text))
    rewritten = await rewrite(req.text)
    return EchoResponse(original=req.text, rewritten=rewritten)

app/services/rewrite.py - single-shot LLM call via httpx (no PydanticAI needed for one round-trip):

import httpx

from app.config import settings

_client: httpx.AsyncClient | None = None


def _get_client() -> httpx.AsyncClient:
    global _client
    if _client is None:
        _client = httpx.AsyncClient(
            headers={"Authorization": f"Bearer {settings.agent_api_key}"},
            timeout=settings.agent_timeout,
        )
    return _client


async def close_client() -> None:
    global _client
    if _client is not None:
        await _client.aclose()
        _client = None


async def rewrite(text: str) -> str:
    client = _get_client()
    resp = await client.post(
        f"{settings.agent_api_base_url}/chat/completions",
        json={
            "model": settings.agent_model,
            "messages": [
                {"role": "system", "content": "Rewrite the user message more concisely."},
                {"role": "user", "content": text},
            ],
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

A note on httpx.AsyncClient(base_url=…): httpx does not apply RFC 3986 reference resolution when merging a request path into base_url. It appends the path to the base URL, so AsyncClient(base_url="http://litellm:4000/v1").post("/chat/completions") targets http://litellm:4000/v1/chat/completions. The RFC 3986 behaviour, where a leading-slash path replaces /v1 and LiteLLM returns 404, applies if you build the URL yourself with urllib.parse.urljoin or httpx.URL.join. Passing the full URL (as above) avoids the question entirely.
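
For comparison, RFC 3986 reference resolution can be observed with the standard library’s urllib.parse.urljoin. This sketch only illustrates the joining rules; it makes no httpx calls.

```python
from urllib.parse import urljoin

BASE = "http://litellm:4000/v1"

# A leading-slash path is absolute: it replaces the whole base path.
absolute = urljoin(BASE, "/chat/completions")

# Without a trailing slash on the base, the last segment ("v1") is replaced.
no_slash = urljoin(BASE, "chat/completions")

# Trailing slash on the base plus a relative path keeps /v1.
kept = urljoin(BASE + "/", "chat/completions")

print(absolute)  # http://litellm:4000/chat/completions
print(no_slash)  # http://litellm:4000/chat/completions
print(kept)      # http://litellm:4000/v1/chat/completions
```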

Don’t forget to wire the router in app/main.py:

from app.routes.echo import router as echo_router

# inside the file:
app.include_router(echo_router)

And close the client on shutdown in lifespan:

from app.services import rewrite as rewrite_service

# inside lifespan after `yield`:
await rewrite_service.close_client()

Pattern: module-level client + close_client(). Long-lived async clients are cached at module level (lazy init via _get_client()). Lifespan calls close_client() on shutdown. Tests reset _client = None between tests (see §7.10).

For multi-step agentic flows, swap rewrite.py for a PydanticAI Agent - see §8.

7.10 `tests/conftest.py`

Tests run inside the same container as the app, but need to override env vars to point at fake downstreams. The single most important rule: set env vars before any app import. Settings() runs at import time, so by the time from app.main import app finishes, every config decision is baked in - late-set env vars are silently ignored. See §12 if you hit this.

import os

# Override env vars BEFORE any app imports (Settings() runs at import time)
os.environ["API_KEY"] = "test-api-key"
os.environ["AGENT_API_BASE_URL"] = "http://fake-agent:4000/v1"
os.environ["AGENT_API_KEY"] = "fake-agent-key"

import pytest
from httpx import ASGITransport, AsyncClient

from app.config import settings
from app.main import app
from app.services import rewrite

# Belt-and-braces: also force-overwrite attributes in case the container's
# compose env disagrees with what tests expect.
settings.api_key = "test-api-key"


@pytest.fixture
def api_headers():
    return {"Authorization": "Bearer test-api-key"}


@pytest.fixture
async def client():
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as c:
        yield c


@pytest.fixture(autouse=True)
def reset_clients():
    yield
    # Reset module-level clients between tests so a mock in one test doesn't
    # leak into the next. Add every service that caches a client.
    rewrite._client = None

Reference: retrieval-agent/tests/conftest.py.
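
Why the ordering matters can be shown with a standard-library stand-in. The Settings class below is a simplified, hypothetical substitute for the pydantic-settings class in §7.7; it only mimics the "read the environment once, at instantiation" behaviour.

```python
import os


class Settings:
    """Simplified stand-in: reads the environment once, when instantiated."""

    def __init__(self) -> None:
        # Raises KeyError if unset, comparable to a missing required field.
        self.api_key = os.environ["API_KEY"]


# 1. Set the variable first ...
os.environ["API_KEY"] = "test-api-key"

# 2. ... then create the settings object (what importing app.config does).
settings = Settings()

# 3. Changing the environment afterwards has no effect on the existing object.
os.environ["API_KEY"] = "set-too-late"

print(settings.api_key)  # test-api-key
```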

Mock external services (the LLM endpoint, vector DBs, …) at the call boundary - typically by patching the module-level client. respx is a useful drop-in for mocking httpx calls.
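
As a rough sketch of mocking at the call boundary with only unittest.mock: the rewrite function below is a trimmed copy of the §7.9 service that takes the client as a parameter, purely to keep the example self-contained. In a real test you would assign the mock to the module-level _client instead. The URL and response payload are made up.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def rewrite(client, text: str) -> str:
    """Trimmed copy of the §7.9 service; the client is injected for the demo."""
    resp = await client.post(
        "http://fake-agent:4000/v1/chat/completions",
        json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": text}]},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def make_fake_client(reply: str) -> AsyncMock:
    """Build a mock whose awaited .post() returns a canned chat completion."""
    response = MagicMock()  # sync methods: raise_for_status(), json()
    response.json.return_value = {"choices": [{"message": {"content": reply}}]}
    client = AsyncMock()  # async methods: post()
    client.post.return_value = response
    return client


fake = make_fake_client("shorter text")
result = asyncio.run(rewrite(fake, "a rather long-winded message"))
print(result)  # shorter text
```

Because the mock records its calls, a test can also assert on the outgoing request body, for example via fake.post.await_args.kwargs["json"].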

7.11 `.env.example`

Human-readable documentation of your agent’s runtime knobs. Real secrets go in .env (gitignored); .env.example is committed and shows newcomers what they need to set. Document every variable, split required vs optional, never commit a real .env:

# --- Required ---
API_KEY=change-me

# --- LLM (OpenAI-compatible, typically LiteLLM proxy) ---
AGENT_MODEL=gpt-4o-mini
AGENT_API_BASE_URL=http://litellm:4000/v1
AGENT_API_KEY=
AGENT_TIMEOUT=60

# --- Local dev plumbing ---
COMPOSE_PROJECT_NAME=my-agent
COMPOSE_DOMAIN=my-agent.local.itkdev.dk
DEBUG=false

Generate the production API_KEY with:

python -c "import secrets; print(secrets.token_urlsafe(32))"

7.12 `README.md`

README.md is the public-facing API doc. Cover:

One-paragraph what-it-does.
Quick start (task setup, env vars to set).
Endpoint list: method, path, request body, response, auth header.
curl examples for each endpoint.
Full env var reference (with defaults).

8. PydanticAI patterns

Skip this section on your first build if you only need a single LLM call per request - the §7.9 httpx example is enough. Come back here when you’re adding tool-calling, multi-step reasoning, or structured output that justifies a real agent loop.

PydanticAI is the recommended framework for any agent that hits the LLM more than once per request (tool-calling, retries, multi-step). For single-shot calls, stay on httpx (§7.9).

The retrieval-agent’s app/services/agent.py is the canonical worked example for the patterns below.

8.1 Single-shot structured output

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

from app.config import settings


class Classification(BaseModel):
    label: str
    confidence: float


_agent: Agent[None, Classification] | None = None


def _get_agent() -> Agent[None, Classification]:
    global _agent
    if _agent is None:
        model = OpenAIModel(
            settings.agent_model,
            provider=OpenAIProvider(
                base_url=settings.agent_api_base_url,
                api_key=settings.agent_api_key,
            ),
        )
        _agent = Agent(model, output_type=Classification, system_prompt="Classify the input.")
    return _agent


async def classify(text: str) -> Classification:
    result = await _get_agent().run(text)
    return result.output

8.2 Tool-calling loop with per-request state

Use a dataclass for per-request dependencies and tool-side state. Tools append to a list on the deps to accumulate side-channel results without dumping the full payload back into the agent’s context.

from dataclasses import dataclass, field
from typing import Any

from pydantic_ai import Agent, RunContext
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

from app.config import settings


@dataclass
class AgentDeps:
    user_id: str
    full_results: list[dict[str, Any]] = field(default_factory=list)


model = OpenAIModel(
    settings.agent_model,
    provider=OpenAIProvider(
        base_url=settings.agent_api_base_url,
        api_key=settings.agent_api_key,
    ),
)

agent = Agent(
    model,
    deps_type=AgentDeps,
    system_prompt="Use the lookup tool to answer the user's question.",
)


@agent.tool
async def lookup(ctx: RunContext[AgentDeps], query: str) -> list[dict]:
    # my_search is a placeholder for your own async search function.
    docs = await my_search(query)  # full results
    ctx.deps.full_results.extend(docs)  # side channel - full payload
    return [{"id": d["id"], "preview": d["text"][:200]} for d in docs]  # truncated for LLM


async def handle(user_id: str, question: str) -> list[dict]:
    deps = AgentDeps(user_id=user_id)
    await agent.run(question, deps=deps)
    return deps.full_results  # return the full payload, NOT the agent's reply

Why side-channel results? Token budgets. The tool returns the minimal preview the LLM needs to grade relevance; the full doc bodies / metadata never enter the conversation. The retrieval-agent’s AGENT_TOOL_PREVIEW_CHARS (default 200) caps the per-doc preview, and AGENT_PREVIEW_K caps how many previews per tool call. The final API response is built from deps.full_results, not from anything the model saw.
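
The two caps can be sketched without PydanticAI. This is a minimal illustration, not the retrieval-agent’s code: PREVIEW_CHARS and PREVIEW_K are assumed stand-ins for AGENT_TOOL_PREVIEW_CHARS and AGENT_PREVIEW_K, and record_and_preview is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any

PREVIEW_CHARS = 200  # assumed to mirror AGENT_TOOL_PREVIEW_CHARS
PREVIEW_K = 3  # illustrative value standing in for AGENT_PREVIEW_K


@dataclass
class Deps:
    full_results: list[dict[str, Any]] = field(default_factory=list)


def record_and_preview(deps: Deps, docs: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Store full docs on deps; return only capped previews for the LLM."""
    deps.full_results.extend(docs)
    return [
        {"id": d["id"], "preview": d["text"][:PREVIEW_CHARS]}
        for d in docs[:PREVIEW_K]
    ]


deps = Deps()
docs = [{"id": f"doc-{i}", "text": "x" * 500, "score": 0.9} for i in range(5)]
previews = record_and_preview(deps, docs)

print(len(previews), len(deps.full_results))  # 3 5
```

The model sees three 200-character previews, while all five full documents, including metadata such as score, stay available for the API response.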

8.3 Wall-clock timeout with partial-result return

The agent loop has two budgets: AGENT_MAX_ITERATIONS (logical iterations) and a wall-clock timeout. On timeout, return whatever the tools already wrote to deps.

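import asyncio
import logging

from pydantic_ai.exceptions import UnexpectedModelBehavior

# agent, AgentDeps and settings are the objects defined in §8.2.
log = logging.getLogger(__name__)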


async def handle(user_id: str, question: str) -> list[dict]:
    deps = AgentDeps(user_id=user_id)
    try:
        await asyncio.wait_for(
            agent.run(question, deps=deps),
            timeout=settings.agent_timeout,
        )
    except (asyncio.TimeoutError, UnexpectedModelBehavior) as exc:
        log.warning("Agent run cut short: %s", exc)
    return deps.full_results
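
The partial-result behaviour can be tried out with the standard library alone. In this sketch fake_agent_run is a hypothetical stand-in for agent.run(): it records one result, then stalls past the time budget.

```python
import asyncio


async def fake_agent_run(results: list[dict]) -> None:
    """Hypothetical stand-in for agent.run(): one fast tool call, then a stall."""
    results.append({"id": "doc-1"})  # first tool call completes
    await asyncio.sleep(10)  # model stalls past the budget
    results.append({"id": "doc-2"})  # never reached when the timeout fires


async def handle(timeout: float) -> list[dict]:
    results: list[dict] = []
    try:
        await asyncio.wait_for(fake_agent_run(results), timeout=timeout)
    except asyncio.TimeoutError:
        pass  # fall through and return whatever was collected
    return results


partial = asyncio.run(handle(timeout=0.05))
print(partial)  # [{'id': 'doc-1'}]
```

asyncio.wait_for cancels the stalled coroutine when the timeout fires, so the second append never runs and the caller still receives the first result.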

8.4 Strict-tools toggle

Some models (Mistral, older Llamas) don’t support OpenAI’s strict tool schemas. Drive strict-mode from an env flag so you can flip it without code changes:

# app/config.py
agent_strict_tools: bool = True


# app/services/<your_agent>.py
@agent.tool(strict=settings.agent_strict_tools)
async def lookup(ctx: RunContext[AgentDeps], query: str) -> list[dict]:
    ...

Set AGENT_STRICT_TOOLS=false in .env for models that need it. Apply the same strict=settings.agent_strict_tools to every tool the agent owns.

8.5 Fallback parser for off-protocol output

When the model emits tool calls as plain text or vendor-specific syntax (e.g. Mistral’s [TOOL_CALLS]) instead of an OpenAI tool call, parse it manually. Reference: _parse_fallback_queries() in retrieval-agent/app/services/agent.py. Pattern: try JSON, then regex for vendor syntax, then bail; wrap downstream calls in try/except so embedding/DB failures inside the fallback return empty results, not 500s.
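
A minimal sketch of the "try JSON, then regex, then bail" pattern follows. It is not the retrieval-agent's `_parse_fallback_queries()`; the function names, the regex, and the assumed payload shape (a `query` field, optionally nested under `arguments`) are all illustrative assumptions.

```python
import json
import re

# Assumed vendor syntax: the literal marker [TOOL_CALLS] followed by a JSON array.
_TOOL_CALLS_RE = re.compile(r"\[TOOL_CALLS\]\s*(\[.*\])", re.DOTALL)


def _queries_from(payload: object) -> list[str]:
    """Pull query strings out of a decoded payload (the shape is an assumption)."""
    if isinstance(payload, dict):
        payload = [payload]
    if not isinstance(payload, list):
        return []
    queries: list[str] = []
    for item in payload:
        if isinstance(item, str):
            queries.append(item)
        elif isinstance(item, dict):
            args = item.get("arguments", item)
            query = args.get("query") if isinstance(args, dict) else None
            if isinstance(query, str):
                queries.append(query)
    return queries


def parse_fallback_queries(text: str) -> list[str]:
    # 1. The whole reply is plain JSON.
    try:
        return _queries_from(json.loads(text))
    except json.JSONDecodeError:
        pass
    # 2. Vendor-specific syntax wrapped around a JSON array.
    match = _TOOL_CALLS_RE.search(text)
    if match:
        try:
            return _queries_from(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass
    # 3. Bail: an empty list lets the caller return empty results instead of failing.
    return []
```

For example, `parse_fallback_queries('[TOOL_CALLS] [{"name": "lookup", "arguments": {"query": "parking"}}]')` yields the single query `parking`, while free-form prose yields an empty list.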

8.6 Pointing at LiteLLM

From inside a container on the app network: AGENT_API_BASE_URL=http://litellm:4000/v1. From outside (e.g. task on the host, or a non-stack environment): https://litellm.itkdev.dk/v1.

8.7 Exposing the agent over MCP

Once the agent works over HTTP, you can additionally expose it as a Model Context Protocol tool so MCP-native clients (Open WebUI, Claude Desktop) can call it directly - the same capability, a second transport. The move is to factor the request handling into one service function and have both the POST /search route and the MCP tool call it, so their behaviour can’t drift. The MCP-specific wiring (mounting FastMCP alongside these routes, the session-manager lifespan, the pure-ASGI bearer guard, and host validation) lives in the Build a tool guide. Reuse the same constant-time token_matches from §7.8 across both transports.
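
The "one service function, two transports" idea can be sketched without any framework. Everything below is illustrative: `run_search`, `http_search`, and `mcp_search` are hypothetical names, and this `token_matches` only approximates the §7.8 helper. The real FastAPI and FastMCP wiring is covered in the Build a tool guide.

```python
import hmac
from dataclasses import asdict, dataclass


def token_matches(presented: str, expected: str) -> bool:
    """Constant-time comparison, in the spirit of the §7.8 helper."""
    return hmac.compare_digest(presented.encode(), expected.encode())


@dataclass
class SearchResult:
    id: str
    text: str


def run_search(query: str) -> list[SearchResult]:
    """The single service function; a stand-in for the real pipeline."""
    return [SearchResult(id="doc-1", text=f"result for {query!r}")]


def http_search(body: dict, bearer: str, api_key: str) -> tuple[int, list[dict]]:
    """Shape of the POST /search route: check auth, then delegate."""
    if not token_matches(bearer, api_key):
        return 401, []
    return 200, [asdict(r) for r in run_search(body["query"])]


def mcp_search(query: str) -> list[dict]:
    """Shape of the MCP tool: the transport guard handles auth, then delegate."""
    return [asdict(r) for r in run_search(query)]
```

Because both adapters only translate inputs and outputs, any behaviour change happens once, in `run_search`.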

9. Production - docker compose server

In production, the agent is not built from source; the server pulls the image published to ghcr.io (see §10). The retrieval-agent’s existing entry (around aarhusai-docker/docker-compose.server.yml:517-560) is the template. Add this block to docker-compose.server.yml:

  my-agent:
    image: ghcr.io/aarhusai/my-agent:${MY_AGENT_VERSION:-latest}
    restart: unless-stopped   # production only - omit in docker-compose.yml
    command:
      - "python"
      - "-m"
      - "uvicorn"
      - "app.main:app"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
    networks:
      - app
      - frontend
    ports:
      - "8000"
    environment:
      API_KEY: ${MY_AGENT_API_KEY:?}
      AGENT_MODEL: ${MY_AGENT_MODEL:-gpt-4o-mini}
      AGENT_API_BASE_URL: ${MY_AGENT_API_BASE_URL:-http://litellm:4000/v1}
      AGENT_API_KEY: ${MY_AGENT_API_KEY_LLM:?}
      DEBUG: ${MY_AGENT_DEBUG:-false}
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:8000/health" ]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
    depends_on:
      # Only declare deps for services your agent actually needs.
      litellm:
        condition: service_started
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=frontend"
      - "traefik.http.routers.my-agent.rule=Host(`my-agent.${COMPOSE_SERVER_DOMAIN}`)"
      - "traefik.http.routers.my-agent.middlewares=redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"

Then add the env vars to the parent stack’s .env (or .env.default if they have sensible defaults):

MY_AGENT_VERSION=v0.1.0
MY_AGENT_API_KEY=<secrets.token_urlsafe(32)>
MY_AGENT_API_KEY_LLM=<LiteLLM virtual key>
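
The `<secrets.token_urlsafe(32)>` placeholder refers to Python's standard-library `secrets` module. A short sketch of generating the value (the printed variable name matches the block above; run it wherever a Python interpreter is available):

```python
import secrets

# 32 random bytes, URL-safe base64 encoded without padding (43 characters).
api_key = secrets.token_urlsafe(32)
print(f"MY_AGENT_API_KEY={api_key}")
```

Paste the printed line into .env and never commit it.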

Things to know:

Sanity-check the diff with docker compose -f docker-compose.server.yml config before merging.

10. Production builds (multi-arch)

You’ll push to ghcr.io, which requires a Personal Access Token (one-time setup) - see §2 Prerequisites for what scope it needs. There is no GitHub Actions pipeline in this org; releases are triggered by hand via task build:image.

The agent is built and pushed to ghcr.io/aarhusai/<repo-name> for both linux/amd64 (production servers) and linux/arm64 (Apple Silicon devs).

10.1 One-time setup

Log in to ghcr.io once with a Personal Access Token that has the write:packages scope:

echo "$GHCR_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin

The buildx builder + QEMU binfmt handlers are bootstrapped automatically on the first task build:image (see the build:image:builder task in §7.5).

10.2 Build & push

# Tagged release (recommended)
task build:image TAG=v0.1.0

# Latest only
task build:image

# Faster local build - single arch, useful for iterating
task build:image TAG=v0.1.0-dev PLATFORMS=linux/amd64

Important details:

--target prod - never push the dev image. Prod target has no test/lint tools and a smaller surface.
--push - multi-arch manifests can only be pushed; Docker’s local image store doesn’t hold them. task build:image hardcodes --push. To test a built image locally without pushing, bypass the task and run docker buildx build --target prod --platform linux/amd64 --load -t my-agent:test . directly (single-arch only - --load doesn’t work with multi-arch).
Two tags per release: push both vX.Y.Z and latest. The Taskfile only pushes one tag at a time; run it twice:
```
task build:image TAG=v0.1.0
task build:image TAG=latest
```

11. Verification checklist

Walk through this before declaring your new agent done.

Skeleton builds. task build succeeds from a clean checkout.
Service starts. task up brings the container healthy (docker compose ps shows (healthy)).
Health endpoints respond. From inside the container itself - task shell, then:
```
curl http://localhost:8000/health
curl http://localhost:8000/health/ready
```
Or via the public Traefik URL from your host: curl -k https://${COMPOSE_DOMAIN}/health (uses your .env).

Auth works. From task shell, against the /echo example endpoint:

# Wrong key → 401
curl -H "Authorization: Bearer wrong" -H "Content-Type: application/json" \
     -d '{"text":"hi"}' http://localhost:8000/echo
# Correct key → 200
curl -H "Authorization: Bearer $API_KEY" -H "Content-Type: application/json" \
     -d '{"text":"hi"}' http://localhost:8000/echo

task ci is green. Lint + format check + tests pass.
task test:coverage reports coverage and prints the missing-lines table.
Single-arch build works. task build:image TAG=v0.0.1-test PLATFORMS=linux/amd64 - verify against a registry you control, then untag.
Multi-arch build works. task build:image TAG=v0.0.1-test produces a manifest list. Inspect with docker buildx imagetools inspect ghcr.io/aarhusai/my-agent:v0.0.1-test - should show both linux/amd64 and linux/arm64.
Parent compose parses. Add the service block from §9 to a working copy of the parent compose; run docker compose -f docker-compose.yml -f docker-compose.server.yml config - it should exit 0 and emit the resolved YAML.
README documents the public API. Endpoints, auth header, request/response examples, env var reference. Use the retrieval-agent’s README.md as a model.
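
For the multi-arch item in the checklist above, the platform check can be scripted. This sketch assumes the JSON printed by `docker buildx imagetools inspect --raw <image>` (an image index with a `manifests` array); `platforms_in_index` is a hypothetical helper, and the sample below is a trimmed, made-up index rather than real registry output.

```python
import json


def platforms_in_index(raw_index: str) -> set[str]:
    """Return the os/arch pairs listed in an image index (manifest list)."""
    index = json.loads(raw_index)
    platforms: set[str] = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform", {})
        os_name, arch = platform.get("os"), platform.get("architecture")
        # Attestation entries are listed as unknown/unknown; skip them.
        if os_name and arch and os_name != "unknown":
            platforms.add(f"{os_name}/{arch}")
    return platforms


# Trimmed, made-up sample of an image index.
sample = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:ccc", "platform": {"os": "unknown", "architecture": "unknown"}},
    ],
})
found = platforms_in_index(sample)
missing = {"linux/amd64", "linux/arm64"} - found
```

An empty `missing` set means both expected architectures are present in the pushed tag.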

12. Common pitfalls / FAQ

The traps below catch nearly every newcomer at least once. Skim now; come back when something breaks.

1. task up fails with “frontend network does not exist.” The parent AarhusAI-docker stack (which owns Traefik) isn’t running, or you haven’t created the network yourself. Either start the parent stack first, or run docker network create frontend if you’re running standalone.

2. task: command not found. Go Task isn’t installed - and it’s not GNU make. Install it: brew install go-task (macOS) or apt install go-task (Debian/Ubuntu). See §2.

3. Env var changes aren’t picked up. Task loads .env.local then .env via the dotenv: directive at the top of Taskfile.yml. After editing .env, recreate the container (task restart) - up -d alone won’t push new env into a running container.

4. Hot reload doesn’t pick up file changes. Check that docker-compose.yml has both volumes: ./:/app and --reload on the uvicorn command. Editing files outside ./ won’t propagate. On macOS, large node_modules-style directories also slow the file-event delivery - keep them out of the mounted tree.

5. Tests can’t see my env overrides - Settings() still has the container env. Settings() runs at import time. os.environ[...] = ... must happen before from app.main import app. The tests/conftest.py template (§7.10) puts env writes at the top of the file for this reason.

6. task build:image fails: denied: permission_denied to ghcr.io. You haven’t logged in. Run echo "$GHCR_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin (see §10.1). The PAT needs the write:packages scope.

7. task build:image fails: “multiple platforms feature is currently not supported for docker driver.” The buildx builder didn’t get created (or wasn’t selected). Run task build:image:builder directly, or delete the existing builder and retry: docker buildx rm my-agent-builder.

8. On Apple Silicon, the multi-arch build is painfully slow. The linux/amd64 half builds under QEMU emulation. For local iteration, build only your native arch under a throwaway tag: task build:image TAG=v0.1.0-dev PLATFORMS=linux/arm64. The task always pushes, so omitting TAG would overwrite latest with an arm64-only image. The full multi-arch build is only needed for the actual release push.

9. Agent can’t reach LiteLLM. Container-internal URL is http://litellm:4000/v1; external is https://litellm.itkdev.dk/v1. If you point the container at the external URL by mistake, you’ll either hit the public internet (and probably hit a firewall) or fail TLS verification. The same virtual key works at both URLs.

10. Healthcheck shows (unhealthy). curl must be available inside the image. The Dockerfile template installs it in the base stage (apt-get install curl). If you copied a slimmer base or removed the install line, the healthcheck silently fails and the container stays (unhealthy).

11. pip install from your laptop fails or hangs. Don’t install Python deps on the host. Every Python tool runs inside the container: task install, or task shell then pip .... The whole point of the Taskfile wrapper is that you never set up a local venv.

12. Added a dep to pyproject.toml but task test still errors with ImportError. The container only re-installs deps when rebuilt. After editing pyproject.toml: task build && task restart, or just task install to install into the running container without a rebuild (faster for iteration).

13. task lint / task test says “service my-agent is not running.” Most tasks shell into a running container via docker compose exec. You need task up first.

14. Traefik shows “no route” or 404 for your agent’s host. Check that ${COMPOSE_PROJECT_NAME} and ${COMPOSE_DOMAIN} are set (Taskfile loads them via dotenv) and that they match the labels in your compose. Verify with docker compose config - if the rendered Host(...) is empty, the env var didn’t propagate.

15. Works locally, won’t start in production: Permission denied: '/cache/...'. The non-root appuser doesn’t own a volume that was mounted before the ownership fix. Recreate the named volume (docker volume rm <vol>), and double-check that the Dockerfile chown -R appuser /cache runs in the prod target too - easy to add to dev only and forget.
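
Pitfall 5 (env overrides set after import) is easiest to understand by reproducing it. The sketch below simulates a config module whose body reads the environment once; `CONFIG_SOURCE` and `import_config` are hypothetical stand-ins for app/config.py and the import statement, not code from the retrieval-agent.

```python
import os
import types

# Stand-in for app/config.py: reads the environment once, when the module body runs.
CONFIG_SOURCE = """
import os
API_KEY = os.environ.get("API_KEY", "container-default")
"""


def import_config() -> types.ModuleType:
    """Simulate `import app.config` by executing the module body once."""
    module = types.ModuleType("fake_config")
    exec(CONFIG_SOURCE, module.__dict__)
    return module


os.environ.pop("API_KEY", None)

# Wrong order: import first, override afterwards. The module keeps the old value.
late = import_config()
os.environ["API_KEY"] = "test-key"
late_value = late.API_KEY

# Right order: the override is already in place when the module body runs.
early = import_config()
early_value = early.API_KEY
```

This is why the conftest.py template writes to os.environ before it imports the app.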

13. Reference index

Every file in retrieval-agent this guide leans on - open these when you’re stuck. All retrieval-agent/... paths below link to main on GitHub.

File	What to look at
`retrieval-agent/Dockerfile`	Multi-stage base → dev → prod, non-root user, model cache pattern
`retrieval-agent/docker-compose.yml`	Networks, healthcheck, Traefik labels, `target: ${TARGET:-dev}`
`retrieval-agent/Taskfile.yml`	Full task catalogue; `build:image` + `build:image:builder` for multi-arch
`retrieval-agent/pyproject.toml`	Deps, optional dev deps, ruff + pytest config
`retrieval-agent/.dockerignore`	Allowlist-style exclude-everything-then-allow pattern
`retrieval-agent/app/main.py`	FastAPI app, lifespan eager init, health + readiness endpoints
`retrieval-agent/app/config.py`	pydantic-settings pattern, cross-system contract comments
`retrieval-agent/app/auth.py`	`HTTPBearer` + `hmac.compare_digest` constant-time check
`retrieval-agent/app/routes/search.py`	Thin route shell - auth + validation + service call
`retrieval-agent/app/services/pipeline.py`	Where the actual work lives
`retrieval-agent/app/services/agent.py`	PydanticAI agent loop, side-channel results, fallback parser, timeout handling, strict-tools toggle
`retrieval-agent/app/services/query_generation.py`	Single-shot LLM call via httpx (no PydanticAI)
`retrieval-agent/tests/conftest.py`	Env-before-import setup + autouse client reset fixture
`retrieval-agent/.env.example`	Documented env var template
`retrieval-agent/README.md`	Public-facing API docs and config reference
`retrieval-agent/CLAUDE.md`	Non-obvious internals worth capturing for future maintainers
`aarhusai-docker/docker-compose.yml`	Dev embedding pattern in the parent stack (service `retrieval`)
`aarhusai-docker/docker-compose.server.yml`	Production embedding pattern with `restart: unless-stopped` (service `retrieval`)