Cloud Architecture Overview
About 5 minutes
Cloud architecture is the design of how applications and services are structured and operated on cloud infrastructure (AWS, GCP, Azure, etc.) accessible over the internet. A production AI service involves multiple components working together: an API gateway receiving user requests, LLM API calls, databases, authentication, and monitoring — all connected in a layered design.
Learning Path for Cloud Architecture
Section titled “Learning Path for Cloud Architecture”This section covers the components that power production AI services.
- This page: Understand the overall structure of a production AI service and the role of each component
- API Design & Gateway: Learn REST, GraphQL, gRPC tradeoffs plus rate limiting and authentication integration
- Database Design Patterns: Understand the tradeoffs between relational databases, vector databases, and caches
- Observability & Monitoring: Learn logging, metrics, and tracing design
- Deployment & CI/CD: Learn deployment using containers, Kubernetes, and GitHub Actions
Why Cloud Architecture Knowledge Matters
Section titled “Why Cloud Architecture Knowledge Matters”Taking a locally working AI script to a production environment raises the following questions:
- When multiple users send requests simultaneously, can the system scale?
- Are API keys managed securely so they cannot be leaked?
- If the system goes down, is there an alert mechanism to catch it quickly?
- Are there backups so database data is not lost?
- Is each deployment automated so it does not require manual steps?
Cloud architecture knowledge provides systematic answers to these problems.
Typical Production AI Service Architecture
Section titled “Typical Production AI Service Architecture”graph TD
User["User\n(Browser/App)"] --> CDN["CDN\nCloudflare / CloudFront"]
CDN --> APIGW["API Gateway\nRate Limiting & Routing"]
APIGW --> Auth["Auth Service\nOAuth2 / JWT"]
Auth --> App["Application Server\nFastAPI / Node.js"]
App --> DB["Relational DB\nPostgreSQL"]
App --> VDB["Vector DB\nPinecone / pgvector"]
App --> Cache["Cache\nRedis"]
App --> Queue["Message Queue\nSQS / Pub/Sub"]
App --> LLMAPI["LLM API\nAnthropic / OpenAI"]
Queue --> Worker["Async Worker"]
Worker --> DB
App --> Monitor["Monitoring & Logs\nDatadog / CloudWatch"]
APIGW --> MonitorComponent Reference
Section titled “Component Reference”| Component | Role | Representative Services |
|---|---|---|
| CDN | Edge delivery for static files, DDoS protection | Cloudflare, AWS CloudFront |
| API Gateway | Request routing, rate limiting, SSL termination | AWS API Gateway, Kong, nginx |
| Auth Service | JWT issuance/validation, OAuth2 flow, session management | Auth0, AWS Cognito, custom implementation |
| Application Server | Business logic, LLM API calls, response formatting | FastAPI, Express, Django |
| Relational DB | Persistent storage for user data, conversation history, settings | PostgreSQL, MySQL, Cloud SQL |
| Vector DB | Storage and similarity search for embeddings (for RAG) | Pinecone, pgvector, Weaviate |
| Cache | Fast responses for frequent requests, session storage | Redis, Memcached |
| Message Queue | Async processing, buffering during peak load | AWS SQS, Google Pub/Sub, RabbitMQ |
| LLM API | External API calls for text generation and embeddings | Anthropic, OpenAI, Gemini |
| Monitoring & Logs | Error detection, performance monitoring, log aggregation | Datadog, CloudWatch, Grafana |
Core Design Principles
Section titled “Core Design Principles”1. Stateless Design
Section titled “1. Stateless Design”Application servers should not hold state (such as session information) in memory. Externalizing state to Redis or a database enables load balancing across multiple server instances (scale-out).
2. Principle of Least Privilege
Section titled “2. Principle of Least Privilege”Each component holds only the minimum permissions it needs. The application server gets read/write access to the database; the worker gets read access to the queue. Fine-grained control is implemented using IAM roles (AWS) or service accounts (GCP).
3. Fault Tolerance
Section titled “3. Fault Tolerance”Eliminate single points of failure (SPOF). Databases should have replicas (read-only copies), and application servers should run as multiple instances. Retry logic is also needed for cases where the LLM API temporarily fails to respond.
4. Observability
Section titled “4. Observability”To quickly identify the root cause when problems occur, collect three types of data:
| Type | Content | Example Tools |
|---|---|---|
| Logs | Application operation records, error messages | CloudWatch Logs, Datadog |
| Metrics | Latency, error rate, CPU/memory usage | Prometheus, CloudWatch Metrics |
| Traces | How a request passed through each component | AWS X-Ray, Jaeger, OpenTelemetry |
5. Defense in Depth
Section titled “5. Defense in Depth”Rather than relying on a single security measure, protect the system across multiple layers:
[CDN Layer] DDoS protection, WAF (Web Application Firewall)
↓
[API Gateway Layer] Rate limiting, API key validation
↓
[Application Layer] Authentication/authorization, input validation
↓
[Data Layer] Encryption, access controlStarting Simple and Growing the Architecture
Section titled “Starting Simple and Growing the Architecture”There is no need to build the full architecture from the start. A phased approach is practical:
graph LR
A["Phase 1\nMVP"] --> B["Phase 2\nProduction-Ready"]
B --> C["Phase 3\nScale-Ready"]
subgraph A["Phase 1: MVP"]
A1["Single server\n(e.g., Railway, Render)"]
A2["PostgreSQL\n(managed)"]
A3["Direct LLM API calls"]
end
subgraph B["Phase 2: Production-Ready"]
B1["Containerization\n(Docker)"]
B2["Add auth\n(Auth0, etc.)"]
B3["Add monitoring\n(Datadog, etc.)"]
end
subgraph C["Phase 3: Scale-Ready"]
C1["Kubernetes / ECS"]
C2["CDN + API Gateway"]
C3["Add cache & queue"]
endDesign Considerations Specific to AI Services
Section titled “Design Considerations Specific to AI Services”AI services using LLMs have unique considerations compared to standard web services:
Latency and Streaming
Section titled “Latency and Streaming”LLM responses take time to generate. Streaming responses (via Server-Sent Events or WebSocket) that deliver text in real time as it is generated are important for user experience.
Cost Management
Section titled “Cost Management”LLM APIs charge based on token usage. Monitor request counts and token consumption, and set cost alert thresholds. Caching repeated identical requests with Redis improves both cost and speed.
Prompt Injection Defense
Section titled “Prompt Injection Defense”User inputs may contain malicious strings designed to manipulate the prompt. Input validation and a design that clearly separates the system prompt from user input are necessary.
Summary
Section titled “Summary”- A production AI service is a multi-layered system where the API gateway, authentication, databases, vector DB, LLM API, and monitoring all work together
- The core principles are: stateless design, least privilege, fault tolerance, observability, and defense in depth
- Starting simple and incrementally adding features is the practical approach
- AI-specific challenges (latency, cost, prompt injection) require dedicated design attention
Q: Which cloud platform should I choose — AWS, GCP, or Azure?
A: All three can support production AI services. In Japan, AWS is often chosen because existing enterprise systems tend to use it (and Amazon Bedrock offers strong LLM integration). GCP has strong integration with Vertex AI and Gemini, and Azure integrates well with Microsoft 365 environments.
Q: Should I use containers or serverless?
A: Because LLM responses take several seconds to tens of seconds to generate, serverless functions (AWS Lambda, etc.) can be problematic due to cold starts. Containers with a persistent runtime (ECS, Cloud Run, Kubernetes) tend to be more stable for this use case.
Q: Is a vector database mandatory?
A: Not if you are building a simple AI chat without RAG. A vector database becomes necessary when adding features like internal document search or personalization. For small-scale use, the pgvector extension for PostgreSQL is a low-friction starting point.
Q: How can I deploy a production AI service as an individual developer?
A: Managed platforms like Railway, Render, or Fly allow you to deploy to production with just a Dockerfile and no server management. PostgreSQL is also available on the same platforms, keeping initial costs low.
See the references for the external specifications and background sources used on this page.[1][2][3][4][5]