Cloud Architecture Overview

About 5 minutes

Developers who want to understand the structure of production AI services, or engineers learning cloud architecture fundamentals

Basic knowledge from Python API Basics is helpful

Cloud architecture is the design of how applications and services are structured and operated on cloud infrastructure (AWS, GCP, Azure, etc.) accessible over the internet. A production AI service involves multiple components working together: an API gateway receiving user requests, LLM API calls, databases, authentication, and monitoring — all connected in a layered design.

Learning Path for Cloud Architecture

This section covers the components that power production AI services.

This page: Understand the overall structure of a production AI service and the role of each component
API Design & Gateway: Learn REST, GraphQL, gRPC tradeoffs plus rate limiting and authentication integration
Database Design Patterns: Understand the tradeoffs between relational databases, vector databases, and caches
Observability & Monitoring: Learn logging, metrics, and tracing design
Deployment & CI/CD: Learn deployment using containers, Kubernetes, and GitHub Actions

Why Cloud Architecture Knowledge Matters

Taking a locally working AI script to a production environment raises the following questions:

When multiple users send requests simultaneously, can the system scale?
Are API keys managed securely so they cannot be leaked?
If the system goes down, is there an alert mechanism to catch it quickly?
Are there backups so database data is not lost?
Is each deployment automated so it does not require manual steps?

Cloud architecture knowledge provides systematic answers to these problems.

Typical Production AI Service Architecture

graph TD
    User["User\n(Browser/App)"] --> CDN["CDN\nCloudflare / CloudFront"]
    CDN --> APIGW["API Gateway\nRate Limiting & Routing"]
    APIGW --> Auth["Auth Service\nOAuth2 / JWT"]
    Auth --> App["Application Server\nFastAPI / Node.js"]
    App --> DB["Relational DB\nPostgreSQL"]
    App --> VDB["Vector DB\nPinecone / pgvector"]
    App --> Cache["Cache\nRedis"]
    App --> Queue["Message Queue\nSQS / Pub/Sub"]
    App --> LLMAPI["LLM API\nAnthropic / OpenAI"]
    Queue --> Worker["Async Worker"]
    Worker --> DB
    App --> Monitor["Monitoring & Logs\nDatadog / CloudWatch"]
    APIGW --> Monitor

Component Reference

Component	Role	Representative Services
CDN	Edge delivery for static files, DDoS protection	Cloudflare, AWS CloudFront
API Gateway	Request routing, rate limiting, SSL termination	AWS API Gateway, Kong, nginx
Auth Service	JWT issuance/validation, OAuth2 flow, session management	Auth0, AWS Cognito, custom implementation
Application Server	Business logic, LLM API calls, response formatting	FastAPI, Express, Django
Relational DB	Persistent storage for user data, conversation history, settings	PostgreSQL, MySQL, Cloud SQL
Vector DB	Storage and similarity search for embeddings (for RAG)	Pinecone, pgvector, Weaviate
Cache	Fast responses for frequent requests, session storage	Redis, Memcached
Message Queue	Async processing, buffering during peak load	AWS SQS, Google Pub/Sub, RabbitMQ
LLM API	External API calls for text generation and embeddings	Anthropic, OpenAI, Gemini
Monitoring & Logs	Error detection, performance monitoring, log aggregation	Datadog, CloudWatch, Grafana

Core Design Principles

1. Stateless Design

Application servers should not hold state (such as session information) in memory. Externalizing state to Redis or a database enables load balancing across multiple server instances (scale-out).

2. Principle of Least Privilege

Each component holds only the minimum permissions it needs. The application server gets read/write access to the database; the worker gets read access to the queue. Fine-grained control is implemented using IAM roles (AWS) or service accounts (GCP).

3. Fault Tolerance

Eliminate single points of failure (SPOF). Databases should have replicas (read-only copies), and application servers should run as multiple instances. Retry logic is also needed for cases where the LLM API temporarily fails to respond.

4. Observability

To quickly identify the root cause when problems occur, collect three types of data:

Type	Content	Example Tools
Logs	Application operation records, error messages	CloudWatch Logs, Datadog
Metrics	Latency, error rate, CPU/memory usage	Prometheus, CloudWatch Metrics
Traces	How a request passed through each component	AWS X-Ray, Jaeger, OpenTelemetry

5. Defense in Depth

Rather than relying on a single security measure, protect the system across multiple layers:

[CDN Layer]         DDoS protection, WAF (Web Application Firewall)
  ↓
[API Gateway Layer] Rate limiting, API key validation
  ↓
[Application Layer] Authentication/authorization, input validation
  ↓
[Data Layer]        Encryption, access control

Starting Simple and Growing the Architecture

There is no need to build the full architecture from the start. A phased approach is practical:

graph LR
    A["Phase 1\nMVP"] --> B["Phase 2\nProduction-Ready"]
    B --> C["Phase 3\nScale-Ready"]

    subgraph A["Phase 1: MVP"]
        A1["Single server\n(e.g., Railway, Render)"]
        A2["PostgreSQL\n(managed)"]
        A3["Direct LLM API calls"]
    end

    subgraph B["Phase 2: Production-Ready"]
        B1["Containerization\n(Docker)"]
        B2["Add auth\n(Auth0, etc.)"]
        B3["Add monitoring\n(Datadog, etc.)"]
    end

    subgraph C["Phase 3: Scale-Ready"]
        C1["Kubernetes / ECS"]
        C2["CDN + API Gateway"]
        C3["Add cache & queue"]
    end

Design Considerations Specific to AI Services

AI services using LLMs have unique considerations compared to standard web services:

Latency and Streaming

LLM responses take time to generate. Streaming responses (via Server-Sent Events or WebSocket) that deliver text in real time as it is generated are important for user experience.

Cost Management

LLM APIs charge based on token usage. Monitor request counts and token consumption, and set cost alert thresholds. Caching repeated identical requests with Redis improves both cost and speed.

Prompt Injection Defense

User inputs may contain malicious strings designed to manipulate the prompt. Input validation and a design that clearly separates the system prompt from user input are necessary.

Summary

A production AI service is a multi-layered system where the API gateway, authentication, databases, vector DB, LLM API, and monitoring all work together
The core principles are: stateless design, least privilege, fault tolerance, observability, and defense in depth
Starting simple and incrementally adding features is the practical approach
AI-specific challenges (latency, cost, prompt injection) require dedicated design attention

FAQ

Q: Which cloud platform should I choose — AWS, GCP, or Azure?

A: All three can support production AI services. In Japan, AWS is often chosen because existing enterprise systems tend to use it (and Amazon Bedrock offers strong LLM integration). GCP has strong integration with Vertex AI and Gemini, and Azure integrates well with Microsoft 365 environments.

Q: Should I use containers or serverless?

A: Because LLM responses take several seconds to tens of seconds to generate, serverless functions (AWS Lambda, etc.) can be problematic due to cold starts. Containers with a persistent runtime (ECS, Cloud Run, Kubernetes) tend to be more stable for this use case.

Q: Is a vector database mandatory?

A: Not if you are building a simple AI chat without RAG. A vector database becomes necessary when adding features like internal document search or personalization. For small-scale use, the pgvector extension for PostgreSQL is a low-friction starting point.

Q: How can I deploy a production AI service as an individual developer?

A: Managed platforms like Railway, Render, or Fly allow you to deploy to production with just a Dockerfile and no server management. PostgreSQL is also available on the same platforms, keeping initial costs low.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

References

API Design & Gateway

Data Protection & Privacy