API Design & Gateway

Reading time: about 10 minutes

Audience: engineers learning API design for production AI services, and those who want to understand microservice communication patterns

Prerequisites: familiarity with the overall structure covered in Cloud Architecture Overview is helpful

An API (Application Programming Interface) is a communication contract between different software components. Clients (browsers, mobile apps, other services) learn from the API specification what to send and what to receive in return, without needing to know the server’s internal implementation. In production AI services, multiple microservices collaborate, making API design and a unified API gateway indispensable.

Three API Styles: REST, GraphQL, and gRPC

Three API styles are widely used in modern web services. Each was created to solve different problems.

REST API

REST (Representational State Transfer) is an API design style that leverages HTTP’s built-in mechanisms. It is designed around “resources” and uses HTTP methods to express operations.

HTTP Method	Meaning	Example
`GET`	Retrieve	`GET /users/123` → Fetch user 123
`POST`	Create	`POST /conversations` → Create a new conversation
`PUT`	Full update	`PUT /users/123` → Update all fields of user 123
`PATCH`	Partial update	`PATCH /users/123` → Update specific fields of user 123
`DELETE`	Delete	`DELETE /conversations/456` → Delete conversation 456

The response communicates state via HTTP status codes.

200 OK                 → Success (retrieval or update)
201 Created            → Creation successful
400 Bad Request        → Problem with the client's request
401 Unauthorized       → Authentication required
403 Forbidden          → Authenticated but access not permitted
404 Not Found          → Resource does not exist
429 Too Many Requests  → Rate limit reached
500 Internal Server Error → Server-side error
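
As a rough illustration of how methods, resource paths, and status codes fit together, here is a minimal sketch of a handler for the conversations resource. It uses only the Python standard library, stores data in memory, and the `handle` function and the `title` field are hypothetical names chosen for this example. It also returns 405 Method Not Allowed, a standard status code not listed above.

```python
from http import HTTPStatus
from typing import Optional

# In-memory stand-in for a database (illustrative only).
conversations = {}
_next_id = 1


def handle(method: str, path: str, body: Optional[dict] = None):
    """Dispatch on (method, path) and return (status_code, json_body)."""
    global _next_id
    parts = [p for p in path.split("/") if p]

    if parts[:1] != ["conversations"] or len(parts) > 2:
        return HTTPStatus.NOT_FOUND, {"error": "unknown resource"}

    # Collection endpoint: POST /conversations creates a new resource.
    if len(parts) == 1:
        if method != "POST":
            return HTTPStatus.METHOD_NOT_ALLOWED, {"error": "use POST"}
        if not body or "title" not in body:
            return HTTPStatus.BAD_REQUEST, {"error": "title is required"}
        conversation = {"id": _next_id, "title": body["title"]}
        conversations[_next_id] = conversation
        _next_id += 1
        return HTTPStatus.CREATED, conversation

    # Item endpoint: /conversations/{id}
    if not parts[1].isdigit():
        return HTTPStatus.BAD_REQUEST, {"error": "id must be an integer"}
    conversation_id = int(parts[1])
    if conversation_id not in conversations:
        return HTTPStatus.NOT_FOUND, {"error": "conversation not found"}
    if method == "GET":
        return HTTPStatus.OK, conversations[conversation_id]
    if method == "DELETE":
        del conversations[conversation_id]
        return HTTPStatus.OK, {"deleted": conversation_id}
    return HTTPStatus.METHOD_NOT_ALLOWED, {"error": "unsupported method"}
```

Calling `handle("POST", "/conversations", {"title": "hello"})` creates a conversation and returns 201, while a later `handle("GET", "/conversations/999")` returns 404 because that resource does not exist.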

GraphQL

GraphQL is a query language and API specification originally developed at Facebook (now Meta). It typically exposes a single endpoint (conventionally /graphql), and the client specifies exactly the fields it needs.

# The client specifies only the fields it needs
query {
  user(id: "123") {
    name
    email
    conversations(last: 5) {
      id
      createdAt
      messages(last: 1) {
        content
      }
    }
  }
}

Where REST might require calls to several endpoints to assemble the same data, GraphQL can retrieve it all in a single request.
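
On the wire, a GraphQL query is usually sent as an HTTP POST whose JSON body carries the query text and its variables. The sketch below builds such a request with the standard library without sending it. The endpoint URL, the operation name, and the variable types (`ID!`, `Int!`) are assumptions about a hypothetical schema.

```python
import json
import urllib.request

GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

# Variables keep user-supplied values out of the query text itself.
QUERY = """
query RecentConversations($id: ID!, $count: Int!) {
  user(id: $id) {
    name
    conversations(last: $count) {
      id
      createdAt
    }
  }
}
"""


def build_graphql_request(query: str, variables: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the query and variables."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_graphql_request(QUERY, {"id": "123", "count": 5})
```

Passing `request` to `urllib.request.urlopen` would send it. Every query goes to the same URL, and only the body changes.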

gRPC

gRPC is an open-source RPC framework originally developed by Google. It runs over HTTP/2, defines typed interfaces in Protocol Buffers (.proto files), and exchanges compact binary messages rather than JSON.

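// chat.proto
syntax = "proto3";

service ChatService {
  rpc SendMessage (MessageRequest) returns (MessageResponse);        // unary
  rpc StreamResponse (MessageRequest) returns (stream MessageChunk); // server streaming
}

message MessageRequest {
  string conversation_id = 1;
  string content = 2;
}

// Response fields below are illustrative assumptions, not from a real schema
message MessageResponse { string content = 1; }
message MessageChunk { string text = 1; }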
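
To make "binary rather than JSON" concrete, the following sketch hand-encodes a MessageRequest in the Protocol Buffers wire format and compares its size with the equivalent JSON. In practice you would use generated code from protoc rather than encode by hand. The helper names are hypothetical and only string fields are handled.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as tag, byte length, then UTF-8 payload."""
    data = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data


def encode_message_request(conversation_id: str, content: str) -> bytes:
    """Hand-encode MessageRequest (conversation_id = 1, content = 2)."""
    return encode_string_field(1, conversation_id) + encode_string_field(2, content)


binary = encode_message_request("c1", "hi")
as_json = json.dumps({"conversation_id": "c1", "content": "hi"}).encode("utf-8")
print(binary.hex(), len(binary), len(as_json))
```

Field names never appear in the binary form, only the field numbers from the .proto file. That is why the numbers must stay stable once clients depend on them.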

Comparison: REST vs GraphQL vs gRPC

Item	REST	GraphQL	gRPC
Protocol	HTTP/JSON	HTTP/JSON	HTTP/2 + Protocol Buffers (binary)
Endpoints	Multiple (per resource)	Single (`/graphql`)	Per method
Type definition	OpenAPI (optional)	Schema required	`.proto` file required
Primary use	Public APIs, simple CRUD	Complex frontend data fetching	Microservice-to-microservice
Learning curve	Low	Medium	Medium–High
Browser support	Native	Native	Limited (grpc-web required)
Performance	Standard	Standard	Fast (binary communication)

What Is an API Gateway

An API gateway is a single entry point that receives all client requests. Instead of exposing multiple backend services directly to clients, the gateway handles cross-cutting concerns such as authentication, rate limiting, and routing in one place.

API Gateway Architecture

graph TD
    Browser["Browser"] --> GW["API Gateway"]
    Mobile["Mobile App"] --> GW
    External["External Service"] --> GW

    GW --> AuthCheck["① Auth Check\nJWT validation / API key check"]
    AuthCheck --> RateLimit["② Rate Limit Check\nCounter managed in Redis"]
    RateLimit --> Route["③ Routing\nDetermine target service by path"]

    Route --> UserSvc["User Service\n:8001"]
    Route --> AISvc["AI Chat Service\n:8002"]
    Route --> FileSvc["File Service\n:8003"]

    GW --> Log["④ Log & Metrics Collection\nDatadog / CloudWatch"]

    UserSvc --> DB["PostgreSQL"]
    AISvc --> LLMAPI["LLM API\nAnthropic / OpenAI"]
    FileSvc --> Storage["Object Storage\nS3 / GCS"]

Six Responsibilities of an API Gateway

1. Single Entry Point

Clients only need to know the API gateway’s URL, regardless of how many microservices sit behind it. The backend service topology can change without affecting client code.

2. Authentication and Authorization

Every request goes through JWT validation or API key verification. Invalid requests are rejected before they reach any backend service.

Request header example:
Authorization: Bearer eyJhbGciOiJSUzI1NiJ9...
X-API-Key: sk-proj-xxxxxxxxxxxxx

3. Rate Limiting

Limits the number of requests per user or IP address, protecting backend services from overload and preventing API abuse.

4. Request Routing

Determines the forwarding target based on the request path (e.g., /api/chat/* → AI service, /api/users/* → user service).

5. SSL Termination

SSL termination (more precisely, TLS termination) means that HTTPS (encrypted) is used between the client and the API gateway, while plain HTTP is used within the internal network between the gateway and backend services. Certificate management is centralized at the gateway.

6. Logging and Metrics Collection

All API request logs can be collected in one place. Latency, error rate, and throughput are sent to monitoring tools.
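
The request flow described above can be sketched as a small pipeline: authenticate first, then pick a backend by path prefix. This is a simplified illustration rather than how any particular gateway product works. The API key, the host names, and the function names are hypothetical, and rate limiting, TLS, and logging are left out.

```python
import hmac

# Hypothetical configuration: accepted API keys and path-prefix routes.
VALID_API_KEYS = {"sk-demo-key"}
ROUTES = {
    "/api/users/": "http://user-service:8001",
    "/api/chat/": "http://ai-chat-service:8002",
    "/api/files/": "http://file-service:8003",
}


def is_authenticated(headers: dict) -> bool:
    """Compare the X-API-Key header against known keys in constant time."""
    presented = headers.get("X-API-Key", "").encode("utf-8")
    return any(
        hmac.compare_digest(presented, key.encode("utf-8"))
        for key in VALID_API_KEYS
    )


def resolve_upstream(path: str):
    """Return the backend base URL whose prefix matches the path, or None."""
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return upstream
    return None


def handle_request(path: str, headers: dict):
    """Run the gateway checks in order and return (status, upstream_or_None)."""
    if not is_authenticated(headers):
        return 401, None  # rejected before reaching any backend service
    upstream = resolve_upstream(path)
    if upstream is None:
        return 404, None
    return 200, upstream
```

For example, `handle_request("/api/chat/messages", {"X-API-Key": "sk-demo-key"})` resolves to the AI chat service, while the same path without a key is rejected with 401.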

Three Rate Limiting Algorithms

Rate limiting controls how many requests a client may make per unit of time. The three algorithms below differ mainly in how they handle bursts.

Algorithm	Mechanism	Characteristics
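Fixed Window	Count requests in fixed intervals (e.g., one minute) and reset the counter at each boundary	Simple to implement. A burst straddling a window boundary can briefly let through up to twice the limit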
Sliding Window	Track requests in the last N seconds	Better burst prevention. Slightly more complex
Token Bucket	Refill tokens at a fixed rate. Consume a token per request	Allows limited bursts while constraining average rate

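Major gateways such as AWS API Gateway and Kong ship with built-in rate limiting (AWS API Gateway, for example, throttles requests with a token bucket), so you rarely need to implement these algorithms yourself.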
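
To show how two of these algorithms behave, here is a minimal single-process sketch of a token bucket and a sliding-window limiter. It assumes one limiter instance per user or IP address and keeps state in memory. A gateway running several instances would need shared storage such as Redis instead. The class names and the injectable `clock` parameter are choices made for this example.

```python
import time
from collections import deque


class TokenBucket:
    """Allow bursts up to `capacity` while limiting the average rate."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any trailing `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()  # times of accepted requests, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

A caller would typically return 429 Too Many Requests whenever `allow()` returns `False`. The injectable clock makes the behavior easy to test without sleeping.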

API Versioning

When changing API specifications, version management is needed to avoid breaking existing clients.

URL Versioning (Most Common)

https://api.example.com/v1/chat
https://api.example.com/v2/chat

Visible in the URL, easy to understand, and cache-friendly.

Header Versioning

GET /chat
Accept: application/vnd.example+json; version=2

Keeps URLs clean but makes implementation and testing more complex.

In practice, URL versioning is by far the most common approach. Whichever scheme you choose, announce the retirement of v1 well in advance so that clients have sufficient time to migrate.
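
Both schemes come down to extracting a version number before dispatching to a handler. The sketch below supports either style. The supported versions, the default version, and the rule that a URL prefix takes precedence over the header are assumptions made for illustration.

```python
import re

SUPPORTED_VERSIONS = {1, 2}  # hypothetical set of live API versions
DEFAULT_VERSION = 1          # used when the client specifies nothing

_URL_VERSION = re.compile(r"^/v(\d+)(/.*)?$")
_HEADER_VERSION = re.compile(r";\s*version=(\d+)")


def resolve_version(path: str, headers: dict):
    """Return (version, path_without_prefix); a URL prefix wins over the header."""
    match = _URL_VERSION.match(path)
    if match:
        version = int(match.group(1))
        rest = match.group(2) or "/"
    else:
        header_match = _HEADER_VERSION.search(headers.get("Accept", ""))
        version = int(header_match.group(1)) if header_match else DEFAULT_VERSION
        rest = path
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {version}")
    return version, rest
```

With this in place, `resolve_version("/v2/chat", {})` and a request for `/chat` with `version=2` in its `Accept` header both resolve to version 2 of the `/chat` handler.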

AI API-Specific Design

AI services using LLM APIs have design considerations beyond typical REST APIs.

Streaming Responses (SSE)

LLM generation takes several seconds to tens of seconds. Returning the response only after generation completes means a long silent period for users. Server-Sent Events (SSE) allows generated text to be sent to the client progressively.

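# SSE streaming example with FastAPI
import json

from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

@app.get("/chat/stream")  # hypothetical route path, for illustration
async def stream_chat(message: str):
    async def generate():
        async with anthropic_client.messages.stream(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        ) as stream:
            async for text in stream.text_stream:
                # JSON-encode each chunk so newlines in the text cannot break SSE framing
                yield f"data: {json.dumps(text)}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")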
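
On the receiving side, an SSE stream is plain text: `data:` lines accumulate into an event, and a blank line ends it. Browsers handle this through the built-in EventSource API. The parser below is a simplified Python sketch for non-browser clients, with `iter_sse_data` as a hypothetical helper name. It reads only `data:` fields and ignores `event:`, `id:`, and `retry:`.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event from SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are ignored here.
    # An event left unterminated at the end of the stream is discarded.
```

Multiple `data:` lines within one event are joined with newlines, so `["data: wor\n", "data: ld\n", "\n"]` yields a single payload containing both lines.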

Cost-Per-Call Tracking

LLM APIs are metered by token count. Record input tokens, output tokens, and cost in the DB for every API request.

{
  "request_id": "req_123",
  "user_id": "user_456",
  "model": "claude-opus-4-5",
  "input_tokens": 250,
  "output_tokens": 1800,
  "cost_usd": 0.0234,
  "latency_ms": 4200
}
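
A record like this can be produced by a small helper that computes the cost from token counts and writes a row per request. The sketch below uses an in-memory SQLite database. The per-million-token prices and the model name `example-model` are placeholders rather than real pricing, and the `api_usage` table layout is an assumption.

```python
import sqlite3

# Hypothetical prices in USD per million tokens; check your provider's price list.
PRICES_PER_MILLION = {"example-model": {"input": 3.00, "output": 15.00}}


def compute_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of one call from its token counts."""
    price = PRICES_PER_MILLION[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return round(cost, 6)


def record_usage(conn, request_id, user_id, model, input_tokens, output_tokens, latency_ms):
    """Insert one usage row and return the computed cost."""
    cost = compute_cost_usd(model, input_tokens, output_tokens)
    conn.execute(
        "INSERT INTO api_usage VALUES (?, ?, ?, ?, ?, ?, ?)",
        (request_id, user_id, model, input_tokens, output_tokens, cost, latency_ms),
    )
    conn.commit()
    return cost


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE api_usage (
        request_id TEXT PRIMARY KEY, user_id TEXT, model TEXT,
        input_tokens INTEGER, output_tokens INTEGER,
        cost_usd REAL, latency_ms INTEGER)"""
)
record_usage(conn, "req_123", "user_456", "example-model", 250, 1800, 4200)
```

With rows stored per request, per-user spend becomes a simple query such as `SELECT user_id, SUM(cost_usd) FROM api_usage GROUP BY user_id`.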

Major API Gateway Products

Product	Type	Characteristics
AWS API Gateway	Managed	Easy integration with AWS services. Works well with Lambda and ECS
Kong	OSS/Managed	Rich plugin ecosystem. Can run on-premises
Nginx	OSS	High-performance reverse proxy with high customizability
Cloudflare Workers	Edge	Runs at the edge. Good for minimizing global latency
Traefik	OSS	High affinity with Kubernetes environments

Summary

REST uses HTTP natively, is simple to design, and is optimal for public APIs and CRUD operations
GraphQL allows fetching only the needed fields and is suited for complex frontend data retrieval
gRPC uses binary communication for high speed, requires typed schemas, and is suited for microservice-to-microservice communication
The API gateway centralizes authentication, rate limiting, routing, and log collection at a single entry point
AI services have additional design points: streaming responses (SSE) and cost tracking

Frequently Asked Questions

Q: Do I need an API gateway for a small project?

A: Not always for a single service. However, authentication, rate limiting, and log collection provide value even at small scale. With AWS, you can start with API Gateway from the beginning, or implement equivalent functionality through FastAPI middleware. Consider a full-featured API gateway when you start splitting into microservices.

Q: When do I need API versioning?

A: I recommend designing versioning from the start for APIs used by external clients (partner companies, mobile apps). For internal service-to-service APIs, versioning may not be necessary if deployments can be coordinated for simultaneous updates.

Q: Should I choose REST or GraphQL?

A: If the frontend needs to flexibly fetch large amounts of different data types, GraphQL is a good fit. For simple CRUD or public APIs, REST keeps implementation costs lower. A common approach is to start with REST and consider GraphQL when frontend data fetching becomes complex.

Q: Can I call gRPC directly from a frontend browser?

A: Standard gRPC cannot be called directly from a browser. You need to use gRPC-Web together with a proxy that supports it, or set up a transcoding layer on the backend that exposes gRPC methods over ordinary HTTP/JSON. gRPC remains very effective for internal communication between microservices.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

References

Database Design Patterns

Cloud Architecture Overview