Skip to content
LinkedInX

Cloud Architecture Overview

About 5 minutes

Target audience: Developers who want to understand the structure of production AI services, or engineers learning cloud architecture fundamentals
Prerequisites: Basic knowledge from Python API Basics is helpful

Cloud architecture is the design of how applications and services are structured and operated on cloud infrastructure (AWS, GCP, Azure, etc.) accessible over the internet. A production AI service involves multiple components working together: an API gateway receiving user requests, LLM API calls, databases, authentication, and monitoring — all connected in a layered design.

This section covers the components that power production AI services.

  1. This page: Understand the overall structure of a production AI service and the role of each component
  2. API Design & Gateway: Learn REST, GraphQL, gRPC tradeoffs plus rate limiting and authentication integration
  3. Database Design Patterns: Understand the tradeoffs between relational databases, vector databases, and caches
  4. Observability & Monitoring: Learn logging, metrics, and tracing design
  5. Deployment & CI/CD: Learn deployment using containers, Kubernetes, and GitHub Actions

Taking a locally working AI script to a production environment raises the following questions:

  • When multiple users send requests simultaneously, can the system scale?
  • Are API keys managed securely so they cannot be leaked?
  • If the system goes down, is there an alert mechanism to catch it quickly?
  • Are there backups so database data is not lost?
  • Is each deployment automated so it does not require manual steps?

Cloud architecture knowledge provides systematic answers to these problems.

Typical Production AI Service Architecture

Section titled “Typical Production AI Service Architecture”
graph TD
    User["User\n(Browser/App)"] --> CDN["CDN\nCloudflare / CloudFront"]
    CDN --> APIGW["API Gateway\nRate Limiting & Routing"]
    APIGW --> Auth["Auth Service\nOAuth2 / JWT"]
    Auth --> App["Application Server\nFastAPI / Node.js"]
    App --> DB["Relational DB\nPostgreSQL"]
    App --> VDB["Vector DB\nPinecone / pgvector"]
    App --> Cache["Cache\nRedis"]
    App --> Queue["Message Queue\nSQS / Pub/Sub"]
    App --> LLMAPI["LLM API\nAnthropic / OpenAI"]
    Queue --> Worker["Async Worker"]
    Worker --> DB
    App --> Monitor["Monitoring & Logs\nDatadog / CloudWatch"]
    APIGW --> Monitor
ComponentRoleRepresentative Services
CDNEdge delivery for static files, DDoS protectionCloudflare, AWS CloudFront
API GatewayRequest routing, rate limiting, SSL terminationAWS API Gateway, Kong, nginx
Auth ServiceJWT issuance/validation, OAuth2 flow, session managementAuth0, AWS Cognito, custom implementation
Application ServerBusiness logic, LLM API calls, response formattingFastAPI, Express, Django
Relational DBPersistent storage for user data, conversation history, settingsPostgreSQL, MySQL, Cloud SQL
Vector DBStorage and similarity search for embeddings (for RAG)Pinecone, pgvector, Weaviate
CacheFast responses for frequent requests, session storageRedis, Memcached
Message QueueAsync processing, buffering during peak loadAWS SQS, Google Pub/Sub, RabbitMQ
LLM APIExternal API calls for text generation and embeddingsAnthropic, OpenAI, Gemini
Monitoring & LogsError detection, performance monitoring, log aggregationDatadog, CloudWatch, Grafana

Application servers should not hold state (such as session information) in memory. Externalizing state to Redis or a database enables load balancing across multiple server instances (scale-out).

Each component holds only the minimum permissions it needs. The application server gets read/write access to the database; the worker gets read access to the queue. Fine-grained control is implemented using IAM roles (AWS) or service accounts (GCP).

Eliminate single points of failure (SPOF). Databases should have replicas (read-only copies), and application servers should run as multiple instances. Retry logic is also needed for cases where the LLM API temporarily fails to respond.

To quickly identify the root cause when problems occur, collect three types of data:

TypeContentExample Tools
LogsApplication operation records, error messagesCloudWatch Logs, Datadog
MetricsLatency, error rate, CPU/memory usagePrometheus, CloudWatch Metrics
TracesHow a request passed through each componentAWS X-Ray, Jaeger, OpenTelemetry

Rather than relying on a single security measure, protect the system across multiple layers:

[CDN Layer]         DDoS protection, WAF (Web Application Firewall)

[API Gateway Layer] Rate limiting, API key validation

[Application Layer] Authentication/authorization, input validation

[Data Layer]        Encryption, access control

Starting Simple and Growing the Architecture

Section titled “Starting Simple and Growing the Architecture”

There is no need to build the full architecture from the start. A phased approach is practical:

graph LR
    A["Phase 1\nMVP"] --> B["Phase 2\nProduction-Ready"]
    B --> C["Phase 3\nScale-Ready"]

    subgraph A["Phase 1: MVP"]
        A1["Single server\n(e.g., Railway, Render)"]
        A2["PostgreSQL\n(managed)"]
        A3["Direct LLM API calls"]
    end

    subgraph B["Phase 2: Production-Ready"]
        B1["Containerization\n(Docker)"]
        B2["Add auth\n(Auth0, etc.)"]
        B3["Add monitoring\n(Datadog, etc.)"]
    end

    subgraph C["Phase 3: Scale-Ready"]
        C1["Kubernetes / ECS"]
        C2["CDN + API Gateway"]
        C3["Add cache & queue"]
    end

Design Considerations Specific to AI Services

Section titled “Design Considerations Specific to AI Services”

AI services using LLMs have unique considerations compared to standard web services:

LLM responses take time to generate. Streaming responses (via Server-Sent Events or WebSocket) that deliver text in real time as it is generated are important for user experience.

LLM APIs charge based on token usage. Monitor request counts and token consumption, and set cost alert thresholds. Caching repeated identical requests with Redis improves both cost and speed.

User inputs may contain malicious strings designed to manipulate the prompt. Input validation and a design that clearly separates the system prompt from user input are necessary.

  • A production AI service is a multi-layered system where the API gateway, authentication, databases, vector DB, LLM API, and monitoring all work together
  • The core principles are: stateless design, least privilege, fault tolerance, observability, and defense in depth
  • Starting simple and incrementally adding features is the practical approach
  • AI-specific challenges (latency, cost, prompt injection) require dedicated design attention

Q: Which cloud platform should I choose — AWS, GCP, or Azure?

A: All three can support production AI services. In Japan, AWS is often chosen because existing enterprise systems tend to use it (and Amazon Bedrock offers strong LLM integration). GCP has strong integration with Vertex AI and Gemini, and Azure integrates well with Microsoft 365 environments.

Q: Should I use containers or serverless?

A: Because LLM responses take several seconds to tens of seconds to generate, serverless functions (AWS Lambda, etc.) can be problematic due to cold starts. Containers with a persistent runtime (ECS, Cloud Run, Kubernetes) tend to be more stable for this use case.

Q: Is a vector database mandatory?

A: Not if you are building a simple AI chat without RAG. A vector database becomes necessary when adding features like internal document search or personalization. For small-scale use, the pgvector extension for PostgreSQL is a low-friction starting point.

Q: How can I deploy a production AI service as an individual developer?

A: Managed platforms like Railway, Render, or Fly allow you to deploy to production with just a Dockerfile and no server management. PostgreSQL is also available on the same platforms, keeping initial costs low.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

  1. AWS Well-Architected Framework
  2. Google Cloud Architecture Framework
  3. The Twelve-Factor App
  4. OpenTelemetry Documentation
  5. FastAPI Production Deployment