Monitoring & Metrics
This suite provides enterprise-grade observability out of the box using the Prometheus, Grafana, and Jaeger stack.
📊 Real-Time Dashboard (Grafana)
The suite includes a comprehensive Grafana Dashboard designed for SREs and Reliability Engineers.

Key Metrics Tracked
- SLO Tracking: Monitors the Error Budget Burn Rate to detect reliability risks early.
- Latency Distribution: Tracks P99 Latency to ensure performance for tail-end users.
- System Resilience: Visualizes Circuit Breaker states in real-time.
Import Instructions
- Access Grafana at
http://localhost:3030. - Navigate to Dashboards → New → Import.
- Upload the configuration file:
infra/grafana/dashboard.json. - Select Prometheus as the data source.
🔗 Distributed Tracing (Jaeger)
Every request is traced using OpenTelemetry, providing visibility into how requests flow through your system.
Trace Propagation
- Inbound: Middleware automatically injects
trace_id,span_id, and a user-facingcorrelation_id. - Outbound: The instrumented HTTP client (
src/infrastructure/http_client.py) propagates context to external services (like Groq or OpenAI) via W3Ctraceparentheaders.
Dashboard Access
View live traces at http://localhost:16686.
🚨 Alerting Strategy
Prometheus is configured with Golden Signal alerts to proactively notify you of system distress.
| Alert | Condition | Severity |
|---|---|---|
| HighErrorRate | 5xx Errors > 5% | Critical |
| HighLatency | P99 Latency > 1s | Warning |
| CircuitBreakerOpen | Breaker state is "Open" | Warning |
Alert Config
Rules are defined in infra/prometheus/alert_rules.yml.