Observability & Monitoring
Observability is the property of a system that allows you to understand its internal state from its external outputs (logs, metrics, and traces). The difference from simple monitoring is that monitoring tells you “something is wrong,” while observability gives you the means to understand “why it is wrong.” Production AI services also need to track metrics that traditional web services do not — LLM API latency, cost, and quality.
Monitoring vs Observability
| Aspect | Monitoring | Observability |
|---|---|---|
| Answers the question | “Is something wrong?” | “Why is it wrong?” |
| Approach | Alert on known failure patterns | Investigate unknown failures from external outputs |
| Assumption | Handles anticipated problems | Can investigate unanticipated problems too |
| Data | Pre-defined metrics | Logs, metrics, traces (three pillars) |
The Three Pillars of Observability
1. Metrics
Metrics are numerical data collected over time. They answer “how much?” about the current state of the system.
Examples:
- Latency: p99 = 2.4s (99% of requests complete within 2.4 seconds)
- Error rate: 0.3% (0.3% of all requests result in an error)
- Throughput: 150 req/s
- CPU usage: 62%
- LLM cost: $0.023/request
Because metrics are aggregated numbers, they are storage-efficient and suitable for long-term retention.
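As a rough illustration of how figures like the p99 latency and error rate above are derived, here is a minimal sketch using the nearest-rank percentile method. The function names and sample data are hypothetical; production metric backends usually compute percentiles from histograms rather than raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def error_rate(status_codes):
    """Share of requests that ended in a 5xx response."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical samples: request latencies in seconds and HTTP status codes.
latencies_s = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 1.2, 0.95, 1.05]
statuses = [200] * 997 + [500, 502, 503]

print("p50:", percentile(latencies_s, 50))
print("p99:", percentile(latencies_s, 99))
print("error rate:", error_rate(statuses))
```

With only ten samples, p99 is simply the slowest request; percentiles become meaningful once the sample window holds hundreds of requests or more.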
2. Logs
Logs are timestamped records of events. They provide detailed context of “what happened.” Using structured logs (JSON format) makes aggregation and search easier afterwards.
{
  "timestamp": "2026-05-13T10:23:45.123Z",
  "level": "ERROR",
  "service": "chat-service",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "message": "LLM API timeout after 30000ms",
  "model": "claude-opus-4-5",
  "input_tokens": 850,
  "error_code": "TIMEOUT"
}
Unstructured logs (plain text) are human-readable but difficult to aggregate automatically. Structured logs are recommended for production environments.
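One way to emit logs in this shape is a custom formatter for Python's standard `logging` module. The sketch below is an assumption-laden example, not a prescribed setup: the `JsonFormatter` class, the `fields` key passed through `extra=`, and the `build_logger` helper are all hypothetical names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} become record.fields.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service="chat-service"):
    """Create a logger that writes one JSON line per event to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.error(
    "LLM API timeout after 30000ms",
    extra={"fields": {"request_id": "req_abc123", "error_code": "TIMEOUT", "input_tokens": 850}},
)
```

Keeping the request ID in every entry is what later lets a log search be joined with the trace for the same request.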
3. Traces
Distributed tracing records the entire processing path of a single request as it passes through multiple services. In a microservice architecture, a request flows through API gateway → app server → LLM API → database. Traces reveal exactly where latency is occurring.
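To make the idea of spans concrete, here is a minimal sketch using only the standard library. The `Trace` class is a hypothetical in-process stand-in for a real tracing SDK such as OpenTelemetry, not its API; it assumes unique span names and is not thread-safe.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects timed, nested spans for one request."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.spans = []   # finished spans, in completion order
        self._stack = []  # names of the spans currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append({"name": name, "parent": parent, "duration_ms": elapsed_ms})

    def bottleneck(self):
        """Return the slowest span that has no child spans."""
        parents = {s["parent"] for s in self.spans}
        leaves = [s for s in self.spans if s["name"] not in parents]
        return max(leaves, key=lambda s: s["duration_ms"])


# Sleeps stand in for real work; the durations are illustrative only.
trace = Trace("req_xyz789")
with trace.span("app_server"):
    with trace.span("jwt_validation"):
        time.sleep(0.001)
    with trace.span("llm_api_call"):
        time.sleep(0.05)
    with trace.span("db_write"):
        time.sleep(0.002)

print(trace.bottleneck()["name"])
```

A real tracer additionally propagates the trace ID across service boundaries, which is what allows spans from different services to be stitched into one tree.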
Trace example (request ID: req_xyz789):
├── API Gateway (5ms)
├── App Server (4,250ms)
│   ├── JWT validation (3ms)
│   ├── DB query: fetch conversation (12ms)
│   ├── LLM API call (4,100ms)  ← bottleneck
│   └── DB write: save message (35ms)
└── Total: 4,255ms
Observability Architecture
graph TD
Client["Client"] --> APIGW["API Gateway\n▶ Send metrics\n▶ Access logs"]
APIGW --> App["App Server\n▶ Start trace span\n▶ Structured log output"]
App --> LLM["LLM API call\n▶ Trace span\n▶ Latency measurement\n▶ Token count & cost recording"]
App --> DB["DB query\n▶ Trace span\n▶ Query execution time"]
App --> Cache["Redis\n▶ Cache hit rate"]
APIGW --> OtelCollector["OpenTelemetry\nCollector"]
App --> OtelCollector
LLM --> OtelCollector
DB --> OtelCollector
Cache --> OtelCollector
OtelCollector --> Metrics["Metrics DB\nPrometheus / CloudWatch"]
OtelCollector --> LogStore["Log Store\nElasticsearch / CloudWatch Logs"]
OtelCollector --> TraceStore["Trace Store\nJaeger / X-Ray"]
Metrics --> Dashboard["Dashboard\nGrafana / Datadog"]
LogStore --> Dashboard
TraceStore --> Dashboard
Metrics --> Alert["Alerting\nPagerDuty / OpsGenie"]
The Four Golden Signals
Google’s SRE Book defines four signals to monitor in any service as the “Golden Signals.”
| Signal | Meaning | How to Measure |
|---|---|---|
| Latency | Time to process a request | p50/p95/p99 percentiles |
| Error Rate | Proportion of requests resulting in an error | 5xx response count / total request count |
| Traffic | Scale of demand on the system | Requests per second (req/s) |
| Saturation | Resource utilization | CPU usage, memory, DB connection count |
AI-System-Specific Metrics
In addition to standard web service metrics, AI services using LLMs need to track the following.
LLM API Performance
| Metric | Description | Target |
|---|---|---|
| TTFT (Time to First Token) | From request sent to first token received | < 2 seconds |
| TPS (Tokens per Second) | Generation speed | Model-dependent |
| Total Latency | End-to-end request time | < 30 seconds (with streaming) |
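The three measurements in this table can be taken in one pass over a streamed response. The sketch below assumes the response is exposed as a plain Python iterable of tokens; `measure_stream` and `fake_stream` are hypothetical names, and TPS is defined here as tokens after the first divided by the time elapsed since the first token, which is one of several definitions in use.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterable and report TTFT, total latency, and tokens per second."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = clock()  # first token arrived
        count += 1
    end = clock()
    if first_token_at is None:
        return {"ttft_s": None, "total_s": end - start, "tokens": 0, "tps": 0.0}
    generation_s = end - first_token_at
    tps = (count - 1) / generation_s if count > 1 and generation_s > 0 else 0.0
    return {
        "ttft_s": first_token_at - start,
        "total_s": end - start,
        "tokens": count,
        "tps": tps,
    }


def fake_stream(n_tokens, first_delay_s, per_token_s):
    """Stand-in for a streaming LLM response; the delays are illustrative."""
    time.sleep(first_delay_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(20, 0.05, 0.002)))
```

Passing the clock as a parameter keeps the function testable with a fake clock.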
Cost Tracking
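Cost tracking usually starts from per-request token counts. The following is a minimal sketch of rolling those counts up into a daily record of the shape shown below. The `Usage` and `Price` types, the model name, and the per-million-token prices are all hypothetical placeholders; substitute your provider's current rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass
class Price:
    input_per_mtok: float   # USD per million input tokens (placeholder value)
    output_per_mtok: float  # USD per million output tokens (placeholder value)


def request_cost_usd(usage, price):
    """Cost of one request from its token counts."""
    return (
        usage.input_tokens * price.input_per_mtok
        + usage.output_tokens * price.output_per_mtok
    ) / 1_000_000


def daily_rollup(date, model, usages, price):
    """Aggregate per-request usage for one model into a daily cost record."""
    total = sum(request_cost_usd(u, price) for u in usages)
    n = len(usages)
    return {
        "date": date,
        "model": model,
        "requests": n,
        "input_tokens_total": sum(u.input_tokens for u in usages),
        "output_tokens_total": sum(u.output_tokens for u in usages),
        "cost_usd_total": round(total, 2),
        "cost_per_request_avg": round(total / n, 4) if n else 0.0,
    }


example_price = Price(input_per_mtok=3.0, output_per_mtok=15.0)  # not real pricing
example_usages = [Usage(850, 1200), Usage(400, 900), Usage(1300, 2500)]
print(daily_rollup("2026-05-13", "example-model", example_usages, example_price))
```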
{
  "date": "2026-05-13",
  "model": "claude-opus-4-5",
  "requests": 12450,
  "input_tokens_total": 8234000,
  "output_tokens_total": 15670000,
  "cost_usd_total": 421.30,
  "cost_per_request_avg": 0.0338
}
RAG Metrics
| Metric | Description |
|---|---|
| Retrieval Latency | Time taken for vector search |
| Chunks Retrieved | Number of document chunks retrieved per request |
| Cache Hit Rate | Hit rate for LLM response cache |
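Cache hit rate is easiest to track if the cache counts its own hits and misses. Here is a minimal in-memory sketch; the `CountingCache` class is hypothetical, and it assumes the cache key is the exact prompt string, whereas real response caches often normalize or embed the prompt first.

```python
class CountingCache:
    """In-memory response cache that tracks its own hit rate."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute(key) on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)
        self._store[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CountingCache()
for prompt in ["What is an SLO?", "What is TTFT?", "What is an SLO?", "Define p99"]:
    # The lambda stands in for a real LLM call.
    cache.get_or_compute(prompt, lambda p: f"answer to: {p}")

print(f"cache hit rate: {cache.hit_rate:.0%}")
```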
Quality Metrics
| Metric | Measurement Method |
|---|---|
| User thumbs-down rate | Proportion of “not helpful” feedback |
| Regeneration request rate | Rate at which regeneration is triggered in the same conversation |
| Session duration | Length of time users continue a conversation |
Designing SLOs (Service Level Objectives)
An SLO (Service Level Objective) is a target value for the quality a service should provide. Defining SLOs upfront clarifies alert thresholds and incident priority.
# SLO example
slo:
  chat_api:
    latency_p99: "< 5s"        # Process 99% of requests within 5 seconds
    error_rate: "< 0.5%"       # Maintain error rate below 0.5%
    availability: "99.5%"      # Monthly uptime above 99.5%
  llm_api_cost:
    daily_budget_usd: 500      # Daily cost ceiling
    per_request_max_usd: 0.10  # Per-request maximum
Alerting and Alert Fatigue
Too many alerts cause alert fatigue, increasing the risk of missing critical alerts.
Alert design principles:
- Set only actionable alerts: Alerts that the receiving person cannot act on are not useful
- Alert on symptoms: “Error rate > 1%” is more useful than “CPU usage > 80%”
- Graduated severity: Separate those requiring immediate action (PagerDuty notification) from those that can wait until morning (Slack only)
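These principles can be sketched as a small evaluation function that alerts only on symptoms and assigns a severity to each alert. The thresholds reuse the SLO example values from this page; the `Alert` type, the 80% budget warning level, and the choice of which conditions page someone are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    severity: str  # "page" = immediate action, "ticket" = can wait until morning
    message: str


# Thresholds taken from the SLO example; budget_warn_ratio is an assumed value.
SLO = {
    "error_rate_max": 0.005,
    "latency_p99_max_s": 5.0,
    "daily_budget_usd": 500,
    "budget_warn_ratio": 0.8,
}


def evaluate(snapshot, slo=SLO):
    """Return alerts for user-facing symptoms; CPU usage is deliberately not checked."""
    alerts = []
    if snapshot["error_rate"] >= slo["error_rate_max"]:
        alerts.append(Alert("error_rate", "page", f"error rate {snapshot['error_rate']:.2%}"))
    if snapshot["latency_p99_s"] >= slo["latency_p99_max_s"]:
        alerts.append(Alert("latency_p99", "page", f"p99 latency {snapshot['latency_p99_s']}s"))
    spent = snapshot["cost_usd_today"] / slo["daily_budget_usd"]
    if spent >= 1.0:
        alerts.append(Alert("daily_cost", "page", "daily LLM budget exhausted"))
    elif spent >= slo["budget_warn_ratio"]:
        alerts.append(Alert("daily_cost", "ticket", f"{spent:.0%} of daily LLM budget used"))
    return alerts


snapshot = {"error_rate": 0.012, "latency_p99_s": 3.2, "cost_usd_today": 420.0, "cpu": 0.91}
for alert in evaluate(snapshot):
    print(alert.severity, alert.name, alert.message)
```

High CPU appears in the snapshot but triggers nothing on its own; it would show up as an alert only once it degrades latency or error rate.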
Choosing an Observability Stack
| Stack | Type | Cost | Key Features | AI Service Fit |
|---|---|---|---|---|
| Prometheus + Grafana + Jaeger | OSS | Low (with operational overhead) | High customizability | Possible with configuration |
| Datadog | Managed | High | Feature-rich, includes LLM observability | High |
| New Relic | Managed | Medium–High | AI monitoring features available | High |
| Honeycomb | Managed | Medium | Strong at high-cardinality trace analysis | Medium |
| AWS CloudWatch | Cloud-native | Medium | High affinity with AWS environments | Medium |
| Google Cloud Monitoring | Cloud-native | Medium | Integrated with GCP | Medium |
On-Call Dashboard Design
The on-call dashboard should show everything needed to assess the situation at a glance.
┌─────────────────────────────────────────────────┐
│ Golden Signals (last 1 hour)                    │
│ Error Rate: 0.12% ✓     Latency p99: 3.2s ✓     │
│ Throughput: 145 req/s   Availability: 99.98% ✓  │
├─────────────────────────────────────────────────┤
│ AI Service Specifics                            │
│ LLM TTFT p95: 1.8s ✓    Cost/hr: $18.4          │
│ Cache Hit Rate: 23%     Errors: 0               │
├─────────────────────────────────────────────────┤
│ Infrastructure                                  │
│ CPU: 45% ✓      Memory: 67% ✓   DB Conn: 45/100 │
└─────────────────────────────────────────────────┘
Summary
- Monitoring detects “something is wrong”; observability provides the means to understand “why”
- Combining the three pillars (metrics, logs, traces) makes it possible to investigate production incidents
- AI-service-specific metrics (TTFT, token cost, RAG quality) require additional tracking
- Define SLOs, set alert thresholds from them, and configure only actionable alerts
- For small-scale deployments, CloudWatch or Grafana is a reasonable starting point; for serious production use, Datadog and New Relic are strong choices
Frequently Asked Questions
Q: What should I monitor first?
A: Start with the Golden Signals (error rate, latency, throughput, saturation). For AI services, I also recommend tracking LLM API latency and cost from the beginning. Output logs in structured JSON format from the start; making that decision upfront allows aggregation later regardless of which monitoring tool is added.
Q: How do I trace AI model calls?
A: Integrate the OpenTelemetry SDK into the application and manually create trace spans before and after LLM API HTTP requests. AI-specialized observability tools such as Datadog’s “LLM Observability,” Arize Phoenix, and LangSmith have matured by 2026 and can auto-instrument prompts, inputs, outputs, and latency.
Q: Logs and metrics seem to overlap — do I need both?
A: They serve different roles. Metrics are time-series aggregations of numbers (suitable for long-term storage and alerting), and logs are detailed records of individual events (used for root cause analysis during incidents). Neither alone is sufficient. Traces additionally provide visibility into processing paths that span multiple services.
Q: Should I choose open-source or a managed service?
A: For small teams with limited operational resources, a managed service (Datadog, etc.) is recommended. Open source (Prometheus + Grafana) offers high customizability, but the team takes on the responsibility of maintaining and scaling the infrastructure. A practical approach is to start with Datadog or New Relic’s free trial, and consider migrating to OSS if cost becomes an issue.
See the references for the external specifications and background sources used on this page.[1][2][3][4][5]