Observability & Monitoring

Reading time: about 5 minutes

Audience: engineers learning monitoring design for production AI services, and those who want to understand the three pillars of observability

Prerequisite: understanding the overall structure from Cloud Architecture Overview will help

Observability is the property of a system that allows you to understand its internal state from its external outputs (logs, metrics, and traces). The difference from simple monitoring is that monitoring tells you “something is wrong,” while observability gives you the means to understand “why it is wrong.” Production AI services also need to track metrics that traditional web services do not — LLM API latency, cost, and quality.

Monitoring vs Observability

Aspect	Monitoring	Observability
Answers the question	“Is something wrong?”	“Why is it wrong?”
Approach	Alert on known failure patterns	Investigate unknown failures from external outputs
Assumption	Handles anticipated problems	Can investigate unanticipated problems too
Data	Pre-defined metrics	Logs, metrics, traces (three pillars)

The Three Pillars of Observability

1. Metrics

Metrics are numerical data collected over time. They answer “how much?” and “how often?” questions about the system's current state.

Examples:
- Latency: p99 = 2.4s (99% of requests complete within 2.4 seconds)
- Error rate: 0.3% (0.3% of all requests result in an error)
- Throughput: 150 req/s
- CPU usage: 62%
- LLM cost: $0.023/request

Because metrics are aggregated numbers, they are storage-efficient and suitable for long-term retention.
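To illustrate why aggregation is storage-efficient, here is a minimal sketch of a fixed-bucket latency histogram. It keeps one counter per bucket rather than every raw sample, so memory stays constant regardless of request volume. The bucket bounds and the `LatencyHistogram` class are assumptions made for illustration, not part of any particular metrics library.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (an assumption for this sketch).
BUCKET_BOUNDS = [0.1, 0.5, 1.0, 2.5, 5.0, 10.0]


class LatencyHistogram:
    """Keeps one counter per bucket instead of storing every raw sample."""

    def __init__(self, bounds=BUCKET_BOUNDS):
        self.bounds = list(bounds)
        # One extra counter for samples above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, seconds: float) -> None:
        # bisect_left returns the index of the first bound >= seconds.
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        if self.total == 0:
            raise ValueError("no samples observed")
        target = q * self.total
        cumulative = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return float("inf")


histogram = LatencyHistogram()
for _ in range(98):
    histogram.observe(0.3)   # typical requests
for _ in range(2):
    histogram.observe(2.4)   # slow tail
print(histogram.quantile_upper_bound(0.99))
```

The trade-off is precision: the histogram can only say that p99 falls within a bucket, so bucket bounds should be chosen around the latency thresholds that matter for the service.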

2. Logs

Logs are timestamped records of events. They provide detailed context of “what happened.” Using structured logs (JSON format) makes aggregation and search easier afterwards.

{
  "timestamp": "2026-05-13T10:23:45.123Z",
  "level": "ERROR",
  "service": "chat-service",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "message": "LLM API timeout after 30000ms",
  "model": "claude-opus-4-5",
  "input_tokens": 850,
  "error_code": "TIMEOUT"
}

Unstructured logs (plain text) are human-readable but difficult to aggregate automatically. Structured logs are recommended for production environments.
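As one possible way to produce records like the JSON example above, the sketch below uses only Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the list of context fields are assumptions made for illustration; a real service might use a dedicated structured-logging library instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical context fields, mirroring the example record above.
    CONTEXT_FIELDS = (
        "service", "request_id", "user_id",
        "model", "input_tokens", "error_code",
    )

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(name: str = "chat-service") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger()
logger.error(
    "LLM API timeout after 30000ms",
    extra={"service": "chat-service", "request_id": "req_abc123",
           "error_code": "TIMEOUT"},
)
```

Because every record shares the same keys, a log store can filter on `request_id` or count by `error_code` without parsing free text.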

3. Traces

Distributed tracing records the entire processing path of a single request as it passes through multiple services. In a microservice architecture, a request flows through API gateway → app server → LLM API → database. Traces reveal exactly where latency is occurring.

Trace example (request ID: req_xyz789):
├── API Gateway (5ms)
├── App Server (4,250ms)
│   ├── JWT validation (3ms)
│   ├── DB query: fetch conversation (12ms)
│   ├── LLM API call (4,100ms)  ← bottleneck
│   └── DB write: save message (35ms)
└── Total: 4,255ms
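The trace above can be modeled as a tree of spans, which makes the bottleneck search mechanical. The following is a minimal sketch, not a tracing library: the `Span` class and `slowest_leaf` function are hypothetical names, and the self-time calculation assumes child spans run sequentially.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time spent in this span outside its children (assumes sequential children)."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the largest duration in the tree."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)


# The trace from the example above, expressed as a span tree.
trace = Span("request req_xyz789", 4255, [
    Span("API Gateway", 5),
    Span("App Server", 4250, [
        Span("JWT validation", 3),
        Span("DB query: fetch conversation", 12),
        Span("LLM API call", 4100),
        Span("DB write: save message", 35),
    ]),
])

bottleneck = slowest_leaf(trace)
print(f"{bottleneck.name}: {bottleneck.duration_ms} ms")
app_server = trace.children[1]
print(f"App Server self time: {app_server.self_time_ms()} ms")
```

Self time is worth checking alongside the slowest leaf: in this trace the App Server's four child spans add up to 4,150 ms, leaving 100 ms spent in the app server's own code.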

Observability Architecture

graph TD
    Client["Client"] --> APIGW["API Gateway\n▶ Send metrics\n▶ Access logs"]

    APIGW --> App["App Server\n▶ Start trace span\n▶ Structured log output"]

    App --> LLM["LLM API call\n▶ Trace span\n▶ Latency measurement\n▶ Token count & cost recording"]

    App --> DB["DB query\n▶ Trace span\n▶ Query execution time"]

    App --> Cache["Redis\n▶ Cache hit rate"]

    APIGW --> OtelCollector["OpenTelemetry\nCollector"]
    App --> OtelCollector
    LLM --> OtelCollector
    DB --> OtelCollector
    Cache --> OtelCollector

    OtelCollector --> Metrics["Metrics DB\nPrometheus / CloudWatch"]
    OtelCollector --> LogStore["Log Store\nElasticsearch / CloudWatch Logs"]
    OtelCollector --> TraceStore["Trace Store\nJaeger / X-Ray"]

    Metrics --> Dashboard["Dashboard\nGrafana / Datadog"]
    LogStore --> Dashboard
    TraceStore --> Dashboard

    Metrics --> Alert["Alerting\nPagerDuty / OpsGenie"]

The Four Golden Signals

Google’s SRE Book defines four signals to monitor in any service as the “Golden Signals.”

Signal	Meaning	How to Measure
Latency	Time to process a request	p50/p95/p99 percentiles
Error Rate	Proportion of requests resulting in an error	5xx response count / total request count
Traffic	Scale of demand on the system	Requests per second (req/s)
Saturation	Resource utilization	CPU usage, memory, DB connection count
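The measurements in the table can be computed from a window of request records. The sketch below is illustrative only: the `Request` record, the `golden_signals` function, and the choice of DB connection usage as the saturation signal are assumptions, and percentiles use the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float
    status: int  # HTTP status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests: list[Request], window_s: float,
                   db_conn_in_use: int, db_conn_max: int) -> dict:
    """Summarize the four golden signals for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r.latency_s for r in requests]
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
        "error_rate": errors / len(requests),         # 5xx count / total count
        "traffic_rps": len(requests) / window_s,      # requests per second
        "saturation_db_conn": db_conn_in_use / db_conn_max,
    }


window = [Request(0.2, 200)] * 196 + [Request(3.0, 500)] * 4
print(golden_signals(window, window_s=10, db_conn_in_use=45, db_conn_max=100))
```

In production these values would normally come from the metrics backend rather than in-process lists; the point here is only to make the definitions concrete.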

AI-System-Specific Metrics

In addition to standard web service metrics, AI services using LLMs need to track the following.

LLM API Performance

Metric	Description	Target
TTFT (Time to First Token)	From request sent to first token received	< 2 seconds
TPS (Tokens per Second)	Generation speed	Model-dependent
Total Latency	End-to-end request time	< 30 seconds (with streaming)
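For a streaming response, all three metrics can be derived from timestamps taken while consuming the token stream. This is a sketch under stated assumptions: `measure_stream` is a hypothetical helper, the stream is any iterable of tokens, and TPS is defined here as output tokens divided by the time from the first token to the end of the stream (other definitions exist).

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT, TPS, and total latency in seconds."""
    start = clock()                      # request sent
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now         # first token received
        count += 1
    end = clock()                        # stream finished
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    generation_s = end - first_token_at
    return {
        "ttft_s": first_token_at - start,
        "tokens": count,
        # TPS is undefined if the whole stream arrived in a single instant.
        "tps": count / generation_s if generation_s > 0 else None,
        "total_latency_s": end - start,
    }


# Usage with a stand-in generator; a real caller would pass the LLM client's stream.
def fake_stream():
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield token


print(measure_stream(fake_stream()))
```

Passing the clock as a parameter keeps the function testable with fixed timestamps, which is useful when asserting on the TTFT target in CI.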

Cost Tracking

{
  "date": "2026-05-13",
  "model": "claude-opus-4-5",
  "requests": 12450,
  "input_tokens_total": 8234000,
  "output_tokens_total": 15670000,
  "cost_usd_total": 421.30,
  "cost_per_request_avg": 0.0338
}
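A record like the one above can be built by pricing each request from its token counts and then averaging. The sketch below assumes per-million-token pricing; the `Pricing` class and the prices used in the usage lines are placeholders for illustration, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens; real values must come from the provider's price list."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of a single request from its token counts."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


def cost_per_request_avg(cost_usd_total: float, requests: int) -> float:
    """Average cost per request, rounded to four decimal places."""
    return round(cost_usd_total / requests, 4)


# Placeholder prices (assumption): $2 per 1M input tokens, $10 per 1M output tokens.
placeholder = Pricing(input_per_mtok=2.0, output_per_mtok=10.0)
print(request_cost_usd(850, 1200, placeholder))

# Reproduces the average in the record above: 421.30 / 12450.
print(cost_per_request_avg(421.30, 12450))
```

Recording the cost per request at call time, rather than reconstructing it later from daily totals, makes it possible to break spend down by model, endpoint, or user.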

RAG Metrics

Metric	Description
Retrieval Latency	Time taken for vector search
Chunks Retrieved	Number of document chunks retrieved per request
Cache Hit Rate	Hit rate for LLM response cache

Quality Metrics

Metric	Measurement Method
User thumbs-down rate	Proportion of “not helpful” feedback
Regeneration request rate	Rate at which regeneration is triggered in the same conversation
Session duration	Length of time users continue a conversation

Designing SLOs (Service Level Objectives)

An SLO (Service Level Objective) is a target value for the quality a service should provide. Defining SLOs upfront clarifies alert thresholds and incident priority.

# SLO example
slo:
  chat_api:
    latency_p99: "< 5s"        # Process 99% of requests within 5 seconds
    error_rate: "< 0.5%"       # Maintain error rate below 0.5%
    availability: "99.5%"      # Monthly uptime above 99.5%
  
  llm_api_cost:
    daily_budget_usd: 500      # Daily cost ceiling
    per_request_max_usd: 0.10  # Per-request maximum
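An availability target translates directly into an error budget, and the other targets can be checked against observed values. The sketch below mirrors the thresholds in the SLO example; the `SLO` dictionary layout and both function names are assumptions made for illustration.

```python
# Thresholds copied from the SLO example above.
SLO = {
    "latency_p99_s": 5.0,
    "error_rate": 0.005,
    "availability_pct": 99.5,
    "daily_budget_usd": 500.0,
}


def error_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Downtime allowed in the period while still meeting the availability target."""
    total_minutes = days * 24 * 60
    return (100 - availability_pct) * total_minutes / 100


def slo_violations(observed: dict) -> list[str]:
    """Return the names of the SLO targets that the observed values break."""
    violations = []
    if observed["latency_p99_s"] >= SLO["latency_p99_s"]:
        violations.append("latency_p99_s")
    if observed["error_rate"] >= SLO["error_rate"]:
        violations.append("error_rate")
    if observed["daily_cost_usd"] > SLO["daily_budget_usd"]:
        violations.append("daily_budget_usd")
    return violations


print(error_budget_minutes(SLO["availability_pct"]))
print(slo_violations({"latency_p99_s": 3.2, "error_rate": 0.0012,
                      "daily_cost_usd": 421.30}))
```

For a 30-day month, 99.5% availability leaves 0.5% of 43,200 minutes, which is 216 minutes of allowable downtime. Tracking how much of that budget has been spent is one common way to decide how urgently an incident needs attention.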

Alerting and Alert Fatigue

Too many alerts cause alert fatigue, increasing the risk of missing critical alerts.

Alert design principles:

- Set only actionable alerts: Alerts that the receiving person cannot act on are not useful
- Alert on symptoms: “Error rate > 1%” is more useful than “CPU usage > 80%”
- Graduated severity: Separate those requiring immediate action (PagerDuty notification) from those that can wait until morning (Slack only)
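The three principles above can be encoded as a small routing rule. This is a sketch of one possible policy, not a prescribed design: the `AlertRule` fields, the `Route` values, and the use of "has a runbook" as a proxy for "actionable" are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"            # immediate notification, e.g. PagerDuty
    CHAT = "chat"            # can wait until morning, e.g. Slack only
    DASHBOARD = "dashboard"  # shown on a dashboard, not sent to a person


@dataclass
class AlertRule:
    name: str
    is_symptom: bool   # user-visible symptom (error rate) vs. cause (CPU usage)
    has_runbook: bool  # proxy for "the receiver can act on it"
    urgent: bool       # requires action now rather than next morning


def route(rule: AlertRule) -> Route:
    """Decide where an alert goes based on the three design principles."""
    if not rule.has_runbook:
        return Route.DASHBOARD      # not actionable: do not notify anyone
    if rule.is_symptom and rule.urgent:
        return Route.PAGE           # urgent symptom: wake someone up
    return Route.CHAT               # everything else waits for working hours


rules = [
    AlertRule("Error rate > 1%", is_symptom=True, has_runbook=True, urgent=True),
    AlertRule("CPU usage > 80%", is_symptom=False, has_runbook=True, urgent=False),
    AlertRule("Disk I/O spike", is_symptom=False, has_runbook=False, urgent=False),
]
for rule in rules:
    print(f"{rule.name} -> {route(rule).value}")
```

Reviewing alert rules against a policy like this on a regular basis is one way to keep the paging volume low enough that critical alerts are not missed.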

Choosing an Observability Stack

Stack	Type	Cost	Key Features	AI Service Fit
Prometheus + Grafana + Jaeger	OSS	Low (with operational overhead)	High customizability	Possible with configuration
Datadog	Managed	High	Feature-rich, includes LLM observability	High
New Relic	Managed	Medium–High	AI monitoring features available	High
Honeycomb	Managed	Medium	Strong at high-cardinality trace analysis	Medium
AWS CloudWatch	Cloud-native	Medium	High affinity with AWS environments	Medium
Google Cloud Monitoring	Cloud-native	Medium	Integrated with GCP	Medium

On-Call Dashboard Design

The on-call dashboard should show everything needed to assess the situation at a glance.

┌─────────────────────────────────────────────────┐
│ Golden Signals (last 1 hour)                     │
│  Error Rate: 0.12%  ✓   Latency p99: 3.2s  ✓   │
│  Throughput: 145 req/s   Availability: 99.98% ✓ │
├─────────────────────────────────────────────────┤
│ AI Service Specifics                             │
│  LLM TTFT p95: 1.8s  ✓   Cost/hr: $18.4         │
│  Cache Hit Rate: 23%   Errors: 0                 │
├─────────────────────────────────────────────────┤
│ Infrastructure                                   │
│  CPU: 45%  ✓   Memory: 67%  ✓   DB Conn: 45/100 │
└─────────────────────────────────────────────────┘

Summary

- Monitoring detects “something is wrong”; observability provides the means to understand “why”
- Combining the three pillars (metrics, logs, traces) makes it possible to investigate production incidents
- AI-service-specific metrics (TTFT, token cost, RAG quality) require additional tracking
- Define SLOs, set alert thresholds from them, and configure only actionable alerts
- At small scale, CloudWatch or Grafana is a reasonable starting point; for larger production workloads, Datadog and New Relic are strong choices

Frequently Asked Questions

Q: What should I monitor first?

A: Start with the Golden Signals (error rate, latency, throughput, saturation). For AI services, I also recommend tracking LLM API latency and cost from the beginning. Output logs as structured JSON from the start; making that decision upfront allows aggregation later, regardless of which monitoring tool is added.

Q: How do I trace AI model calls?

A: Integrate the OpenTelemetry SDK into the application and manually create trace spans before and after LLM API HTTP requests. AI-specialized observability tools such as Datadog’s “LLM Observability,” Arize Phoenix, and LangSmith have matured by 2026 and can auto-instrument prompts, inputs, outputs, and latency.

Q: Logs and metrics seem to overlap — do I need both?

A: They serve different roles. Metrics are time-series aggregations of numbers (suitable for long-term storage and alerting), and logs are detailed records of individual events (used for root cause analysis during incidents). Neither alone is sufficient. Traces additionally provide visibility into processing paths that span multiple services.

Q: Should I choose open-source or a managed service?

A: For small teams with limited operational resources, a managed service (Datadog, etc.) is recommended. Open source (Prometheus + Grafana) offers high customizability, but the team takes on the responsibility of maintaining and scaling the infrastructure. A practical approach is to start with Datadog or New Relic’s free trial, and consider migrating to OSS if cost becomes an issue.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

References
