Observability & Monitoring

About 5 minutes

Target audience: Engineers learning monitoring design for production AI services, and anyone who wants to understand the three pillars of observability
Prerequisites: Familiarity with the overall structure described in Cloud Architecture Overview is helpful

Observability is the property of a system that allows you to understand its internal state from its external outputs (logs, metrics, and traces). The difference from simple monitoring is that monitoring tells you “something is wrong,” while observability gives you the means to understand “why it is wrong.” Production AI services also need to track metrics that traditional web services do not — LLM API latency, cost, and quality.

| Aspect | Monitoring | Observability |
|---|---|---|
| Answers the question | “Is something wrong?” | “Why is it wrong?” |
| Approach | Alert on known failure patterns | Investigate unknown failures from external outputs |
| Assumption | Handles anticipated problems | Can investigate unanticipated problems too |
| Data | Pre-defined metrics | Logs, metrics, traces (three pillars) |

Metrics are numerical data collected over time. They quantify the current state of the system and show how it changes.

Examples:
- Latency: p99 = 2.4s (99% of requests complete within 2.4 seconds)
- Error rate: 0.3% (0.3% of all requests result in an error)
- Throughput: 150 req/s
- CPU usage: 62%
- LLM cost: $0.023/request

Because metrics are aggregated numbers, they are storage-efficient and suitable for long-term retention.
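
As a rough illustration of how raw request records collapse into the aggregated numbers listed above, here is a minimal sketch in Python. The record shape `(latency_seconds, http_status)`, the function names, and the nearest-rank percentile method are assumptions chosen for this example; real metrics backends typically use histograms or sketches instead of sorting raw samples.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(requests, window_seconds):
    """requests: list of (latency_seconds, http_status) tuples (hypothetical shape)."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "latency_p50_s": percentile(latencies, 50),
        "latency_p99_s": percentile(latencies, 99),
        "error_rate": errors / len(requests),
        "throughput_rps": len(requests) / window_seconds,
    }

# 100 synthetic requests in a 10-second window: 98 fast, one slow success, one slow 5xx.
sample = [(0.2, 200)] * 98 + [(2.4, 200), (3.0, 503)]
summary = summarize(sample, window_seconds=10)
print(summary)
# {'latency_p50_s': 0.2, 'latency_p99_s': 2.4, 'error_rate': 0.01, 'throughput_rps': 10.0}
```

Four numbers now stand in for 100 records, which is why metrics are cheap to retain, and also why they cannot say which request failed or why.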

Logs are timestamped records of events. They provide detailed context of “what happened.” Using structured logs (JSON format) makes aggregation and search easier afterwards.

{
  "timestamp": "2026-05-13T10:23:45.123Z",
  "level": "ERROR",
  "service": "chat-service",
  "request_id": "req_abc123",
  "user_id": "user_456",
  "message": "LLM API timeout after 30000ms",
  "model": "claude-opus-4-5",
  "input_tokens": 850,
  "error_code": "TIMEOUT"
}

Unstructured logs (plain text) are human-readable but difficult to aggregate automatically. Structured logs are recommended for production environments.
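
One way to produce logs like the JSON example above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a recommended production setup: `JsonFormatter`, `build_logger`, and the convention of passing extra fields under a `context` key are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the entry
        # (the "context" key is a convention invented for this sketch).
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

def build_logger(service, stream=None):
    """Create a logger that writes one JSON object per line (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger

logger = build_logger("chat-service")
logger.error(
    "LLM API timeout after 30000ms",
    extra={"context": {"request_id": "req_abc123", "error_code": "TIMEOUT"}},
)
```

Because every line is a self-contained JSON object, a log store can filter on `request_id` or `error_code` without parsing free text. The timestamp here ends in `+00:00` rather than `Z`; both are valid ISO 8601 forms for UTC.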

Distributed tracing records the entire processing path of a single request as it passes through multiple services. In a microservice architecture, a single request may pass through an API gateway, an app server, an LLM API, and a database. Traces reveal exactly where along that path latency is occurring.

Trace example (request ID: req_xyz789):
├── API Gateway (5ms)
├── App Server (4,250ms)
│   ├── JWT validation (3ms)
│   ├── DB query: fetch conversation (12ms)
│   ├── LLM API call (4,100ms)  ← bottleneck
│   └── DB write: save message (35ms)
└── Total: 4,255ms
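
To make the idea behind the trace example above concrete, here is a toy in-process tracer built only on the Python standard library. It is a sketch of the concept (nested, timed spans with a parent link), not the OpenTelemetry API; the `Tracer` class, its `bottleneck` helper, and the span names are hypothetical, and it assumes span names are unique within a trace.

```python
import time
from contextlib import contextmanager

class Tracer:
    """Toy tracer: records nested spans with their parent and duration."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.finished = []   # dicts with name, parent, duration_ms
        self._stack = []     # names of currently open spans

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = self.clock()
        try:
            yield
        finally:
            duration_ms = (self.clock() - start) * 1000
            self._stack.pop()
            self.finished.append(
                {"name": name, "parent": parent, "duration_ms": duration_ms}
            )

    def bottleneck(self):
        """Slowest span that has no child spans (a leaf of the trace tree)."""
        parents = {s["parent"] for s in self.finished}
        leaves = [s for s in self.finished if s["name"] not in parents]
        return max(leaves, key=lambda s: s["duration_ms"])

# Sleeps stand in for real work, scaled down from the trace example.
tracer = Tracer()
with tracer.span("app_server"):
    with tracer.span("jwt_validation"):
        time.sleep(0.001)
    with tracer.span("db_fetch_conversation"):
        time.sleep(0.002)
    with tracer.span("llm_api_call"):
        time.sleep(0.05)
    with tracer.span("db_save_message"):
        time.sleep(0.003)

worst = tracer.bottleneck()
print(worst["name"], round(worst["duration_ms"], 1), "ms")
```

A real tracing SDK additionally propagates a trace ID across process boundaries so that spans from different services can be stitched into one tree.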
graph TD
    Client["Client"] --> APIGW["API Gateway\n▶ Send metrics\n▶ Access logs"]

    APIGW --> App["App Server\n▶ Start trace span\n▶ Structured log output"]

    App --> LLM["LLM API call\n▶ Trace span\n▶ Latency measurement\n▶ Token count & cost recording"]

    App --> DB["DB query\n▶ Trace span\n▶ Query execution time"]

    App --> Cache["Redis\n▶ Cache hit rate"]

    APIGW --> OtelCollector["OpenTelemetry\nCollector"]
    App --> OtelCollector
    LLM --> OtelCollector
    DB --> OtelCollector
    Cache --> OtelCollector

    OtelCollector --> Metrics["Metrics DB\nPrometheus / CloudWatch"]
    OtelCollector --> LogStore["Log Store\nElasticsearch / CloudWatch Logs"]
    OtelCollector --> TraceStore["Trace Store\nJaeger / X-Ray"]

    Metrics --> Dashboard["Dashboard\nGrafana / Datadog"]
    LogStore --> Dashboard
    TraceStore --> Dashboard

    Metrics --> Alert["Alerting\nPagerDuty / OpsGenie"]

Google’s SRE Book defines four signals to monitor in any service as the “Golden Signals.”

| Signal | Meaning | How to Measure |
|---|---|---|
| Latency | Time to process a request | p50/p95/p99 percentiles |
| Error Rate | Proportion of requests resulting in an error | 5xx response count / total request count |
| Traffic | Scale of demand on the system | Requests per second (req/s) |
| Saturation | Resource utilization | CPU usage, memory, DB connection count |

In addition to standard web service metrics, AI services using LLMs need to track the following.

| Metric | Description | Target |
|---|---|---|
| TTFT (Time to First Token) | From request sent to first token received | < 2 seconds |
| TPS (Tokens per Second) | Generation speed | Model-dependent |
| Total Latency | End-to-end request time | < 30 seconds (with streaming) |

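
The three latency metrics above can be measured by timing a streaming response as it is consumed. The sketch below assumes the response is exposed as a plain Python iterable of tokens; `measure_stream` and `fake_llm_stream` are hypothetical names, and TPS is computed here as tokens after the first divided by the time since the first token, which is only one of several definitions in use.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Consume an iterable of tokens and report streaming latency metrics."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    generation_time = end - first_token_at
    return {
        "ttft_s": first_token_at - start,
        "total_latency_s": end - start,
        "tokens": count,
        # Rate of tokens generated after the first one arrived.
        "tokens_per_second": (count - 1) / generation_time if generation_time > 0 else None,
    }

def fake_llm_stream():
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(0.05)              # simulated wait before the first token
    for token in ["Obs", "erv", "ability"]:
        yield token
        time.sleep(0.01)          # simulated per-token generation time

print(measure_stream(fake_llm_stream()))
```

Recording TTFT separately from total latency matters for streaming UIs, where users perceive the wait for the first token far more than the total generation time.
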
{
  "date": "2026-05-13",
  "model": "claude-opus-4-5",
  "requests": 12450,
  "input_tokens_total": 8234000,
  "output_tokens_total": 15670000,
  "cost_usd_total": 421.30,
  "cost_per_request_avg": 0.0338
}
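
A daily record like the one above can be derived from per-request token counts. In that record, `cost_per_request_avg` is simply `cost_usd_total` divided by `requests` (421.30 / 12450 ≈ 0.0338). The sketch below shows one possible aggregation; the function name, the per-call tuple shape, and especially the per-million-token prices are placeholders for illustration, not real pricing for any model.

```python
def summarize_costs(date, model, calls, input_usd_per_mtok, output_usd_per_mtok):
    """Aggregate per-request token counts into a daily cost record.

    calls: list of (input_tokens, output_tokens) tuples, one per request.
    Prices are USD per million tokens and must be supplied by the caller.
    """
    input_total = sum(i for i, _ in calls)
    output_total = sum(o for _, o in calls)
    cost_total = (
        input_total * input_usd_per_mtok + output_total * output_usd_per_mtok
    ) / 1_000_000
    return {
        "date": date,
        "model": model,
        "requests": len(calls),
        "input_tokens_total": input_total,
        "output_tokens_total": output_total,
        "cost_usd_total": round(cost_total, 2),
        "cost_per_request_avg": round(cost_total / len(calls), 4) if calls else 0.0,
    }

# Placeholder prices (3.0 and 15.0 USD per million tokens), NOT real pricing.
record = summarize_costs(
    "2026-05-13",
    "example-model",
    [(850, 1200), (400, 300)],
    input_usd_per_mtok=3.0,
    output_usd_per_mtok=15.0,
)
print(record)
```

Keeping prices as parameters rather than constants makes it easier to re-run the same aggregation when a provider changes its pricing or when several models are compared.
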

| Metric | Description |
|---|---|
| Retrieval Latency | Time taken for vector search |
| Chunks Retrieved | Number of document chunks retrieved per request |
| Cache Hit Rate | Hit rate for LLM response cache |

| Metric | Measurement Method |
|---|---|
| User thumbs-down rate | Proportion of “not helpful” feedback |
| Regeneration request rate | Rate at which regeneration is triggered in the same conversation |
| Session duration | Length of time users continue a conversation |

An SLO (Service Level Objective) is a target value for the quality a service should provide. Defining SLOs upfront clarifies alert thresholds and incident priority.

# SLO example
slo:
  chat_api:
    latency_p99: "< 5s"        # Process 99% of requests within 5 seconds
    error_rate: "< 0.5%"       # Maintain error rate below 0.5%
    availability: "99.5%"      # Monthly uptime above 99.5%
  
  llm_api_cost:
    daily_budget_usd: 500      # Daily cost ceiling
    per_request_max_usd: 0.10  # Per-request maximum
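
To show how SLO targets translate into concrete numbers, the sketch below computes the downtime allowed by an availability target and checks observed values against the `chat_api` targets above. The function names and dictionary keys are hypothetical, and a 30-day month is assumed.

```python
def error_budget_minutes(availability_pct, days=30):
    """Downtime allowed in the window while still meeting the availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100

def check_slo(observed, slo):
    """Return the names of SLO targets that the observed values violate."""
    violations = []
    if observed["latency_p99_s"] >= slo["latency_p99_s"]:
        violations.append("latency_p99")
    if observed["error_rate_pct"] >= slo["error_rate_pct"]:
        violations.append("error_rate")
    if observed["availability_pct"] < slo["availability_pct"]:
        violations.append("availability")
    return violations

# Targets mirror the chat_api SLO example; observed values are illustrative.
chat_api_slo = {"latency_p99_s": 5.0, "error_rate_pct": 0.5, "availability_pct": 99.5}
healthy = {"latency_p99_s": 3.2, "error_rate_pct": 0.12, "availability_pct": 99.98}
degraded = {"latency_p99_s": 6.1, "error_rate_pct": 0.12, "availability_pct": 99.2}

print(error_budget_minutes(99.5))         # 216.0 minutes (3.6 hours) per 30 days
print(check_slo(healthy, chat_api_slo))   # []
print(check_slo(degraded, chat_api_slo))  # ['latency_p99', 'availability']
```

Expressing 99.5% as "about 3.6 hours of downtime per month" tends to make priority discussions easier than the percentage alone.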

Too many alerts cause alert fatigue, increasing the risk of missing critical alerts.

Alert design principles:

  1. Set only actionable alerts: Alerts that the receiving person cannot act on are not useful
  2. Alert on symptoms: “Error rate > 1%” is more useful than “CPU usage > 80%”
  3. Graduated severity: Separate those requiring immediate action (PagerDuty notification) from those that can wait until morning (Slack only)
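
The principles above can be sketched as a small rule table: every rule watches a symptom or a budget rather than a resource such as CPU, and its severity decides where the alert goes. The rule names, thresholds, and channel identifiers below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str   # "page" = act now, "notify" = can wait until morning

# Hypothetical rules; thresholds are illustrative, not recommendations.
RULES = [
    AlertRule("high_error_rate", "error_rate_pct", 1.0, "page"),
    AlertRule("slow_p99", "latency_p99_s", 5.0, "page"),
    AlertRule("daily_cost_near_budget", "daily_cost_usd", 400.0, "notify"),
]

# Severity decides the destination (graduated severity).
CHANNELS = {"page": "pagerduty", "notify": "slack"}

def evaluate(metrics, rules=RULES):
    """Return (channel, rule_name) for every rule whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append((CHANNELS[rule.severity], rule.name))
    return fired

print(evaluate({"error_rate_pct": 2.3, "latency_p99_s": 3.2, "daily_cost_usd": 450.0}))
# [('pagerduty', 'high_error_rate'), ('slack', 'daily_cost_near_budget')]
```

In practice these rules would live in the alerting system's own configuration; the point of the sketch is that each rule maps to an action someone can take.
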

| Stack | Type | Cost | Key Features | AI Service Fit |
|---|---|---|---|---|
| Prometheus + Grafana + Jaeger | OSS | Low (with operational overhead) | High customizability | Possible with configuration |
| Datadog | Managed | High | Feature-rich, includes LLM observability | High |
| New Relic | Managed | Medium–High | AI monitoring features available | High |
| Honeycomb | Managed | Medium | Strong at high-cardinality trace analysis | Medium |
| AWS CloudWatch | Cloud-native | Medium | High affinity with AWS environments | Medium |
| Google Cloud Monitoring | Cloud-native | Medium | Integrated with GCP | Medium |

The on-call dashboard should show everything needed to assess the situation at a glance.

┌─────────────────────────────────────────────────┐
│ Golden Signals (last 1 hour)                     │
│  Error Rate: 0.12%  ✓   Latency p99: 3.2s  ✓   │
│  Throughput: 145 req/s   Availability: 99.98% ✓ │
├─────────────────────────────────────────────────┤
│ AI Service Specifics                             │
│  LLM TTFT p95: 1.8s  ✓   Cost/hr: $18.4         │
│  Cache Hit Rate: 23%   Errors: 0                 │
├─────────────────────────────────────────────────┤
│ Infrastructure                                   │
│  CPU: 45%  ✓   Memory: 67%  ✓   DB Conn: 45/100 │
└─────────────────────────────────────────────────┘
  • Monitoring detects “something is wrong”; observability provides the means to understand “why”
  • Combining the three pillars (metrics, logs, traces) makes it possible to investigate production incidents
  • AI-service-specific metrics (TTFT, token cost, RAG quality) require additional tracking
  • Define SLOs, set alert thresholds from them, and configure only actionable alerts
  • At small scale, CloudWatch or Grafana is a reasonable starting point; for larger production deployments, Datadog and New Relic are strong choices

Q: What should I monitor first?

A: Start with the Golden Signals (error rate, latency, throughput, saturation). For AI services, I also recommend tracking LLM API latency and cost from the beginning. Emit logs as structured JSON from the start; making that decision upfront allows aggregation later regardless of which monitoring tool is added.

Q: How do I trace AI model calls?

A: Integrate the OpenTelemetry SDK into the application and manually wrap each LLM API HTTP request in a trace span. AI-specialized observability tools such as Datadog’s “LLM Observability,” Arize Phoenix, and LangSmith have matured by 2026 and can auto-instrument prompts, inputs, outputs, and latency.

Q: Logs and metrics seem to overlap — do I need both?

A: They serve different roles. Metrics are time-series aggregations of numbers (suitable for long-term storage and alerting), and logs are detailed records of individual events (used for root cause analysis during incidents). Neither alone is sufficient. Traces additionally provide visibility into processing paths that span multiple services.

Q: Should I choose open-source or a managed service?

A: For small teams with limited operational resources, a managed service (Datadog, etc.) is recommended. Open source (Prometheus + Grafana) offers high customizability, but the team takes on the responsibility of maintaining and scaling the infrastructure. A practical approach is to start with Datadog or New Relic’s free trial, and consider migrating to OSS if cost becomes an issue.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

  1. Google SRE Book - Monitoring Distributed Systems
  2. OpenTelemetry Documentation
  3. Datadog APM & Distributed Tracing
  4. Prometheus Documentation
  5. Grafana Documentation