
API Design & Gateway

About 10 minutes

Target audience: Engineers learning API design for production AI services, those who want to understand microservice communication patterns
Prerequisites: Familiarity with the overall structure described in Cloud Architecture Overview will help

An API (Application Programming Interface) is a communication contract between different software components. Clients (browsers, mobile apps, other services) learn from the API specification what to send and what to receive in return, without needing to know the server’s internal implementation. In production AI services, multiple microservices collaborate, making API design and a unified API gateway indispensable.

Three API styles are widely used in modern web services. Each was created to solve different problems.

REST (Representational State Transfer) is an API design style that leverages HTTP’s built-in mechanisms. It is designed around “resources” and uses HTTP methods to express operations.

| HTTP Method | Meaning | Example |
|---|---|---|
| GET | Retrieve | GET /users/123 → Fetch user 123 |
| POST | Create | POST /conversations → Create a new conversation |
| PUT | Full update | PUT /users/123 → Update all fields of user 123 |
| PATCH | Partial update | PATCH /users/123 → Update specific fields of user 123 |
| DELETE | Delete | DELETE /conversations/456 → Delete conversation 456 |

The response reports the outcome through its HTTP status code.

```text
200 OK                     → Success (retrieval or update)
201 Created                → Creation successful
400 Bad Request            → Problem with the client's request
401 Unauthorized           → Authentication required (missing or invalid credentials)
403 Forbidden              → Authenticated but access not permitted
404 Not Found              → Resource does not exist
429 Too Many Requests      → Rate limit reached
500 Internal Server Error  → Server-side error
```
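
To make the method semantics and status codes concrete, here is a minimal, framework-free sketch over a hypothetical in-memory `users` store (`handle`, `users`, and `REQUIRED_FIELDS` are illustrative names, not part of any library). It shows PUT replacing the whole resource while PATCH changes only the supplied fields. It also uses 405 Method Not Allowed and 204 No Content, two standard codes not listed above.

```python
from __future__ import annotations

# Hypothetical in-memory store standing in for a database table
users: dict[str, dict] = {"123": {"name": "Alice", "email": "alice@example.com"}}
REQUIRED_FIELDS = {"name", "email"}

def handle(method: str, path: str, body: dict | None = None) -> tuple[int, dict | None]:
    """Return (status_code, response_body) for /users and /users/{id}."""
    parts = path.strip("/").split("/")
    if parts[0] != "users" or len(parts) > 2:
        return 404, {"error": "not found"}
    user_id = parts[1] if len(parts) == 2 else None

    if user_id is None:  # collection: /users (listing omitted in this sketch)
        if method != "POST":
            return 405, {"error": "method not allowed"}
        if not body or not REQUIRED_FIELDS.issubset(body):
            return 400, {"error": "name and email are required"}
        new_id = str(max(map(int, users), default=0) + 1)
        users[new_id] = dict(body)
        return 201, {"id": new_id, **users[new_id]}

    if user_id not in users:  # item: /users/{id}
        return 404, {"error": "user not found"}
    if method == "GET":
        return 200, users[user_id]
    if method == "PUT":  # full update: the body replaces the whole resource
        if not body or not REQUIRED_FIELDS.issubset(body):
            return 400, {"error": "PUT requires every field"}
        users[user_id] = dict(body)
        return 200, users[user_id]
    if method == "PATCH":  # partial update: only the supplied fields change
        users[user_id].update(body or {})
        return 200, users[user_id]
    if method == "DELETE":
        del users[user_id]
        return 204, None  # 204 No Content: success, nothing to return
    return 405, {"error": "method not allowed"}
```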

GraphQL is a query-language-based API specification developed by Facebook. It uses a single endpoint (/graphql) where the client specifies only the fields it needs.

```graphql
# The client specifies only the fields it needs
query {
  user(id: "123") {
    name
    email
    conversations(last: 5) {
      id
      createdAt
      messages(last: 1) {
        content
      }
    }
  }
}
```

Where REST might need several calls to assemble this data (for example, one for the user, one for their conversations, and one per conversation for the latest message), GraphQL returns it in a single request.
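
The field-selection idea can be mimicked in a few lines of Python. This is a toy that filters nested data by a requested field set, not a GraphQL implementation: a real GraphQL server parses the query text and runs resolvers against a schema. The `USER` data and the `select` helper are hypothetical.

```python
from __future__ import annotations
from typing import Any

# Toy data; a REST API might spread this across several endpoints
USER = {
    "id": "123",
    "name": "Alice",
    "email": "alice@example.com",
    "phone": "555-0100",
    "conversations": [
        {"id": "c1", "createdAt": "2024-05-01", "title": "Trip plan"},
        {"id": "c2", "createdAt": "2024-05-02", "title": "Recipe ideas"},
    ],
}

def select(data: Any, selection: dict[str, dict | None]) -> Any:
    """Keep only requested fields. None marks a scalar leaf; a dict is a nested selection."""
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    return {
        field: data[field] if sub is None else select(data[field], sub)
        for field, sub in selection.items()
    }

# Roughly mirrors: { user(id: "123") { name email conversations { id createdAt } } }
result = select(USER, {"name": None, "email": None,
                       "conversations": {"id": None, "createdAt": None}})
print(result)
```

Fields the client did not ask for (`phone`, `title`) never appear in the response, which is the property that keeps GraphQL payloads tight.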

gRPC is a high-performance RPC (Remote Procedure Call) framework originally developed by Google. It defines typed interfaces in Protocol Buffers (.proto files), runs over HTTP/2, and serializes messages in a compact binary format rather than JSON.

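```protobuf
// chat.proto
syntax = "proto3";

service ChatService {
  rpc SendMessage (MessageRequest) returns (MessageResponse);         // unary
  rpc StreamResponse (MessageRequest) returns (stream MessageChunk);  // server streaming
}

message MessageRequest {
  string conversation_id = 1;
  string content = 2;
}

// Response types below are illustrative; their field names are assumptions
message MessageResponse { string content = 1; }
message MessageChunk { string text = 1; }
```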
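
To see what "binary rather than JSON" means on the wire, this sketch hand-encodes `MessageRequest` using the Protocol Buffers wire format (base-128 varints and length-delimited fields). Real services would use the classes generated by `protoc` instead; the helper names here are illustrative. Only field numbers travel on the wire, not field names, which is one reason the encoding is compact.

```python
import json

def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set when more bytes follow."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number: int, text: str) -> bytes:
    """Length-delimited field (wire type 2): tag, length, UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload

def encode_message_request(conversation_id: str, content: str) -> bytes:
    # Field numbers come from chat.proto: conversation_id = 1, content = 2
    return encode_string_field(1, conversation_id) + encode_string_field(2, content)

binary = encode_message_request("conv_1", "Hi")
as_json = json.dumps({"conversation_id": "conv_1", "content": "Hi"}).encode()
print(binary, len(binary), len(as_json))
```

For this message the binary encoding is 12 bytes, versus 46 bytes for the same data serialized with `json.dumps`.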

| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP/JSON | HTTP/JSON | HTTP/2 + Protocol Buffers (binary) |
| Endpoints | Multiple (per resource) | Single (/graphql) | Per method |
| Type definition | OpenAPI (optional) | Schema required | .proto file required |
| Primary use | Public APIs, simple CRUD | Complex frontend data fetching | Microservice-to-microservice |
| Learning curve | Low | Medium | Medium–High |
| Browser support | Native | Native | Limited (grpc-web required) |
| Performance | Standard | Standard | Fast (binary communication) |

An API gateway is a single entry point that receives all client requests. Instead of exposing multiple backend services directly to clients, the API gateway handles everything centrally.

```mermaid
graph TD
    Browser["Browser"] --> GW["API Gateway"]
    Mobile["Mobile App"] --> GW
    External["External Service"] --> GW

    GW --> AuthCheck["① Auth Check\nJWT validation / API key check"]
    AuthCheck --> RateLimit["② Rate Limit Check\nCounter managed in Redis"]
    RateLimit --> Route["③ Routing\nDetermine target service by path"]

    Route --> UserSvc["User Service\n:8001"]
    Route --> AISvc["AI Chat Service\n:8002"]
    Route --> FileSvc["File Service\n:8003"]

    GW --> Log["④ Log & Metrics Collection\nDatadog / CloudWatch"]

    UserSvc --> DB["PostgreSQL"]
    AISvc --> LLMAPI["LLM API\nAnthropic / OpenAI"]
    FileSvc --> Storage["Object Storage\nS3 / GCS"]
```

Clients only need to know the API gateway's URL, regardless of how many microservices sit behind it, so the backend service topology can change without affecting client code.

Authentication: the gateway validates the JWT or API key on every request, so invalid requests are rejected before they reach any backend service.

Request header examples (a client typically sends one of these, depending on the authentication scheme):

```http
Authorization: Bearer eyJhbGciOiJSUzI1NiJ9...
X-API-Key: sk-proj-xxxxxxxxxxxxx
```
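
A minimal sketch of the gateway's authentication step, using only the Python standard library. To stay self-contained it assumes HS256 (HMAC with a shared secret) instead of the RS256 signature indicated by the sample token header above, and it keeps API keys in an in-memory set; `JWT_SECRET`, `VALID_API_KEYS`, and all helper functions are hypothetical. In production, a maintained JWT library (for example PyJWT) and hashed keys in a datastore would be the safer choice.

```python
from __future__ import annotations
import base64, hashlib, hmac, json, time

JWT_SECRET = b"demo-shared-secret"   # assumption: HS256 shared secret
VALID_API_KEYS = {"sk-demo-123"}     # assumption: hypothetical key store

def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_hs256(claims: dict) -> str:
    """Issue a token (only so the verifier can be exercised)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(JWT_SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"

def verify_hs256(token: str) -> dict | None:
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        sig = b64url_decode(sig_b64)
    except ValueError:  # wrong structure, bad base64, or bad JSON
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None  # pin the algorithm; never accept "none"
    expected = hmac.new(JWT_SECRET, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, sig) or not isinstance(claims, dict):
        return None  # signature mismatch (constant-time comparison)
    if claims.get("exp", float("inf")) < time.time():
        return None  # expired
    return claims

def authenticate(headers: dict[str, str]) -> tuple[int, str]:
    """Return (status, principal or error) the way a gateway's auth step might."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        claims = verify_hs256(auth[len("Bearer "):])
        if claims and "sub" in claims:
            return 200, str(claims["sub"])
        return 401, "invalid or expired token"
    api_key = headers.get("X-API-Key")
    if api_key is not None:
        if any(hmac.compare_digest(api_key.encode(), k.encode()) for k in VALID_API_KEYS):
            return 200, "api-key-client"
        return 401, "invalid API key"
    return 401, "missing credentials"
```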

Rate limiting: the gateway caps the number of requests per user or IP address, protecting backend services from overload and preventing API abuse.

Routing: the gateway determines the forwarding target from the request path (e.g., /api/chat/* → AI Chat Service, /api/users/* → User Service).
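
Path-based routing can be sketched as a longest-prefix lookup. The upstream host names are hypothetical and the ports mirror the services in the diagram; the `/api/files/` prefix is an assumption, since the text names only the chat and users paths.

```python
from __future__ import annotations

# Hypothetical upstreams; ports follow the User / AI Chat / File services
ROUTES = {
    "/api/users/": "http://user-service:8001",
    "/api/chat/": "http://ai-chat-service:8002",
    "/api/files/": "http://file-service:8003",  # assumed prefix for the File Service
}

def resolve_upstream(path: str) -> str | None:
    """Return the URL to forward to, or None (the gateway would answer 404)."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    # Longest prefix wins, so a more specific rule such as "/api/chat/admin/"
    # could be added later without being shadowed by "/api/chat/".
    prefix = max(matches, key=len)
    return ROUTES[prefix] + path  # path forwarded unchanged (no prefix stripping)
```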

SSL/TLS termination: HTTPS (encrypted) is used between the client and the API gateway, while plain HTTP is used inside the internal network between the gateway and the backend services. Certificate management is centralized at the gateway.

Logging and metrics: all API request logs are collected in one place, and latency, error rate, and throughput are sent to monitoring tools such as Datadog or CloudWatch.

Rate limiting is a mechanism that controls the number of requests per unit of time. Its purpose is to protect backend services from overload and prevent API abuse.

| Algorithm | Mechanism | Characteristics |
|---|---|---|
| Fixed Window | Count requests per fixed interval (e.g., per minute) and reset the counter at each boundary | Simple to implement. Bursts of up to twice the limit can occur across a window boundary |
| Sliding Window | Track requests in the last N seconds | Better burst prevention. Slightly more complex |
| Token Bucket | Refill tokens at a fixed rate; each request consumes one token | Allows limited bursts while constraining the average rate |
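
The three algorithms can be sketched as small single-process classes. The clock is injectable so behavior is deterministic in tests, and the class names are illustrative. A real gateway shares these counters across instances (for example in Redis, as in the diagram), which this in-memory sketch does not attempt; the sliding window here is the "log" variant that stores one timestamp per request.

```python
from __future__ import annotations
import time
from collections import deque
from typing import Callable

Clock = Callable[[], float]

class FixedWindow:
    """Allow `limit` requests per `window` seconds; the count resets at each boundary."""
    def __init__(self, limit: int, window: float, clock: Clock = time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.current_window, self.count = None, 0

    def allow(self) -> bool:
        w = int(self.clock() // self.window)
        if w != self.current_window:  # new window: reset the counter
            self.current_window, self.count = w, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

class SlidingWindowLog:
    """Allow at most `limit` requests within any trailing `window` seconds."""
    def __init__(self, limit: int, window: float, clock: Clock = time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.timestamps: deque[float] = deque()

    def allow(self) -> bool:
        now = self.clock()
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()  # drop requests that left the window
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request spends one."""
    def __init__(self, rate: float, capacity: float, clock: Clock = time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = capacity, clock()

    def allow(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `FixedWindow(limit=100, window=60)`, a client can send 100 requests at 0:59 and another 100 at 1:00, which is the boundary burst mentioned in the table. The sliding-window log rejects the second batch, at the cost of remembering every timestamp in the window.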

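Major gateways provide rate limiting out of the box (AWS API Gateway, for example, throttles requests with a token bucket, and Kong offers rate-limiting plugins), so these algorithms rarely need to be implemented by hand.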

When changing API specifications, version management is needed to avoid breaking existing clients.

```text
https://api.example.com/v1/chat
https://api.example.com/v2/chat
```

URL path versioning puts the version in the URL, where it is visible, easy to understand, and cache-friendly.

```http
GET /chat
Accept: application/vnd.example+json; version=2
```

Header versioning keeps URLs clean but makes implementation and testing more complex.

In practice, URL versioning is the most common approach. Before retiring v1, announce a deprecation date far enough in advance for clients to migrate.
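
A gateway or framework has to work out which version each request targets. This sketch accepts both styles discussed above: a `/vN/` path prefix takes precedence, and otherwise a `version=N` parameter in the `Accept` header is used. `SUPPORTED_VERSIONS`, `DEFAULT_VERSION` (falling back to v1), and the precedence rule are assumptions for illustration.

```python
from __future__ import annotations
import re

SUPPORTED_VERSIONS = {1, 2}
DEFAULT_VERSION = 1  # assumption: requests with no version fall back to v1

def resolve_version(path: str, headers: dict[str, str]) -> tuple[int, str]:
    """Return (version, path without the version prefix).

    Real HTTP header names are case-insensitive; a plain dict keeps the sketch short.
    """
    m = re.match(r"/v(\d+)(/.*)?$", path)
    if m:  # URL versioning: /v2/chat -> (2, "/chat")
        version, rest = int(m.group(1)), m.group(2) or "/"
    else:  # header versioning: Accept: ...; version=2
        h = re.search(r"\bversion=(\d+)", headers.get("Accept", ""))
        version, rest = (int(h.group(1)) if h else DEFAULT_VERSION), path
    if version not in SUPPORTED_VERSIONS:
        # A gateway might turn this into a 400 response
        raise ValueError(f"unsupported API version: {version}")
    return version, rest
```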

AI services using LLM APIs have design considerations beyond typical REST APIs.

LLM generation takes several seconds to tens of seconds. Returning the response only after generation completes means a long silent period for users. Server-Sent Events (SSE) allows generated text to be sent to the client progressively.

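```python
# SSE streaming example with FastAPI
from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

@app.get("/api/chat/stream")  # illustrative path; browser EventSource only issues GET
async def stream_chat(message: str):
    async def generate():
        async with anthropic_client.messages.stream(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}],
        ) as stream:
            async for text in stream.text_stream:
                # An SSE event is "data: ...\n\n"; text containing newlines
                # needs one "data:" line per line, or the event framing breaks
                for line in text.split("\n"):
                    yield f"data: {line}\n"
                yield "\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```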

LLM APIs are metered by token count. Record input tokens, output tokens, and cost in the DB for every API request.

```json
{
  "request_id": "req_123",
  "user_id": "user_456",
  "model": "claude-opus-4-5",
  "input_tokens": 250,
  "output_tokens": 1800,
  "cost_usd": 0.0234,
  "latency_ms": 4200
}
```
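
A sketch of building such a record. The price table is a placeholder, not any provider's real pricing, and `example-model` is a hypothetical model name; in practice the token counts come from the API response (Anthropic's Messages API, for instance, reports them as `usage.input_tokens` and `usage.output_tokens`).

```python
from __future__ import annotations
from dataclasses import asdict, dataclass

# Placeholder prices in USD per million tokens: NOT real pricing.
# Look up your provider's current rates before relying on numbers like these.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

@dataclass
class UsageRecord:
    request_id: str
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: int

def build_usage_record(request_id: str, user_id: str, model: str,
                       input_tokens: int, output_tokens: int, latency_ms: int) -> UsageRecord:
    price = PRICES_PER_MTOK[model]
    # Input and output tokens are usually priced differently, so each side is costed separately.
    # For billing-grade accuracy, consider decimal.Decimal or integer micro-dollars.
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return UsageRecord(request_id, user_id, model, input_tokens, output_tokens,
                       round(cost, 6), latency_ms)

record = build_usage_record("req_123", "user_456", "example-model", 250, 1800, 4200)
print(asdict(record))  # ready to insert into a usage table or send to a log pipeline
```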

Commonly used API gateway products:

| Product | Type | Characteristics |
|---|---|---|
| AWS API Gateway | Managed | Easy integration with AWS services. Works well with Lambda and ECS |
| Kong | OSS/Managed | Rich plugin ecosystem. Can run on-premises |
| Nginx | OSS | High-performance, highly customizable reverse proxy |
| Cloudflare Workers | Edge | Runs at the edge. Good for minimizing global latency |
| Traefik | OSS | Integrates closely with Kubernetes environments |

  • REST uses HTTP natively, is simple to design, and is optimal for public APIs and CRUD operations
  • GraphQL allows fetching only the needed fields and is suited for complex frontend data retrieval
  • gRPC uses binary communication for high speed, requires typed schemas, and is suited for microservice-to-microservice communication
  • The API gateway centralizes authentication, rate limiting, routing, and log collection at a single entry point
  • AI services have additional design points: streaming responses (SSE) and cost tracking

Q: Do I need an API gateway for a small project?

A: Not always for a single service. However, authentication, rate limiting, and log collection provide value even at small scale. With AWS, you can start with API Gateway from the beginning, or implement equivalent functionality through FastAPI middleware. Consider a full-featured API gateway when you start splitting into microservices.

Q: When do I need API versioning?

A: I recommend designing versioning from the start for APIs used by external clients (partner companies, mobile apps). For internal service-to-service APIs, versioning may not be necessary if deployments can be coordinated for simultaneous updates.

Q: Should I choose REST or GraphQL?

A: If the frontend needs to flexibly fetch large amounts of different data types, GraphQL is a good fit. For simple CRUD or public APIs, REST keeps implementation costs lower. A common approach is to start with REST and consider GraphQL when frontend data fetching becomes complex.

Q: Can I call gRPC directly from a frontend browser?

A: Not with standard gRPC. Browsers do not expose the fine-grained control over HTTP/2 requests (such as reading trailers) that gRPC relies on, so you need either gRPC-Web with a proxy (such as Envoy) or a transcoding layer on the backend that converts between gRPC and HTTP/JSON. Between microservices, where this limitation does not apply, gRPC is very effective.

See the references for the external specifications and background sources used on this page.[1][2][3][4][5]

  1. REST API Design Best Practices - Microsoft
  2. AWS API Gateway Developer Guide
  3. Kong Gateway Documentation
  4. GraphQL Official Documentation
  5. gRPC Documentation