Claude API and Prompt Caching

Reading time: About 5 minutes

Audience: Developers new to the Claude API, or those optimizing costs with the SDK

Prerequisites: Python basics (variables, functions, pip), Claude Model Comparison and Selection Guide

With the Claude API, adding AI conversation capabilities to an application takes just a few lines of code. With prompt caching, the input cost of a repeated prompt prefix can be reduced by up to 90%.

Claude API Fundamentals

Obtaining and Setting an API Key

The Claude API is an HTTP-based interface provided by Anthropic for calling Claude models programmatically. Authentication with an API key is required.

API keys are obtained from the Anthropic Console. Store the key in an environment variable rather than in source code: a hardcoded key is a security risk because it can leak through version control, logs, or shared files.

# Set the API key as an environment variable (add to ~/.zshrc or ~/.bashrc)
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"
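
A minimal sketch of checking for the key at startup, so that a missing variable fails early with a clear message. The helper name `load_api_key` is hypothetical and not part of the SDK; `anthropic.Anthropic()` already reads `ANTHROPIC_API_KEY` on its own, so this check is only a convenience.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it in your shell profile first."
        )
    return key
```

Passing the environment as a parameter keeps the function easy to test without touching the real shell environment.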

Installing the Python SDK

The Anthropic Python SDK is installable via pip.

pip install anthropic

Python 3.8 or later is required. After installation, the following code can be used to verify that the SDK works correctly.

Basic Message Sending (Python)

# Python 3.8 or later required
import anthropic

# Initialize the client (automatically reads the ANTHROPIC_API_KEY environment variable)
client = anthropic.Anthropic()

# Send a message
message = client.messages.create(
    model="claude-sonnet-4-6",        # Model ID to use
    max_tokens=1024,                   # Maximum number of output tokens
    messages=[
        {"role": "user", "content": "Write a function in Python to generate the Fibonacci sequence."}
    ]
)

# Print the response
print(message.content[0].text)
# Example output: def fibonacci(n): ... (code is returned)

TypeScript / Node.js Example

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
// Automatically reads the ANTHROPIC_API_KEY environment variable

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello, Claude" }],
});

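// content is a list of blocks; narrow to a text block before reading .text
const first = message.content[0];
if (first.type === "text") {
  console.log(first.text);
}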

Key API Parameters

Messages API Structure

The Claude API is centered around the Messages API. Requests specify the following main parameters.

Parameter	Type	Required	Description
`model`	string	Yes	Model ID to use (e.g., `claude-sonnet-4-6`)
`max_tokens`	integer	Yes	Maximum number of output tokens (the upper limit depends on the model)
`messages`	array	Yes	Conversation history (pairs of `role` and `content`)
`system`	string or array	No	System prompt (defines the model’s role and constraints); an array of content blocks is used when setting `cache_control`
`temperature`	float	No	Output randomness (0.0–1.0, default: 1.0)
`stream`	boolean	No	Enable streaming response

# Example with system prompt and temperature
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a Python expert. Provide explanations and inline code comments in English.",
    temperature=0.3,       # Lower values produce more consistent output (recommended for code generation)
    messages=[
        {"role": "user", "content": "Explain list comprehensions."},
        {"role": "assistant", "content": "List comprehensions provide a concise way to create lists from existing iterables."},
        {"role": "user", "content": "Can you show me a more concrete example?"}
    ]
)
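
As a rough sketch, the parameters in the table can be checked locally before a request is sent. `build_request` is a hypothetical helper, not an SDK function, and its checks cover only the rules stated above (required fields, valid roles, the temperature range); the API performs its own validation regardless.

```python
from typing import Any, Dict, List, Optional

VALID_ROLES = {"user", "assistant"}


def build_request(
    model: str,
    max_tokens: int,
    messages: List[Dict[str, Any]],
    system: Optional[str] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate the main parameters and return keyword arguments for messages.create()."""
    if not model:
        raise ValueError("model is required")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    if not messages:
        raise ValueError("messages must contain at least one entry")
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"messages[{i}] has an invalid role: {msg.get('role')!r}")
        if "content" not in msg:
            raise ValueError(f"messages[{i}] is missing content")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")

    # Optional parameters are only included when they were actually provided.
    kwargs: Dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": messages,
    }
    if system is not None:
        kwargs["system"] = system
    if temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs
```

The returned dictionary can then be passed on as `client.messages.create(**kwargs)`.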

Streaming Responses

Streaming responses deliver generated text progressively as it is produced. This is used in chat UIs and other interfaces where showing partial responses immediately is desirable.

# Streaming response example
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a long story."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # Output text in real time
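
Applications often need the complete text after streaming finishes, for example to store it in the conversation history. The sketch below shows one way to display chunks as they arrive while also accumulating them. `collect_stream` is a hypothetical helper that accepts any iterable of strings, so it is assumed here that `stream.text_stream` is passed in as `chunks`.

```python
from typing import Callable, Iterable, List, Optional


def collect_stream(
    chunks: Iterable[str], on_text: Optional[Callable[[str], None]] = None
) -> str:
    """Forward each chunk to on_text as it arrives and return the full text."""
    parts: List[str] = []
    for text in chunks:
        if on_text is not None:
            on_text(text)  # e.g. print the chunk or push it to a UI
        parts.append(text)
    return "".join(parts)


# A plain list stands in for stream.text_stream in this example
full_text = collect_stream(["Once ", "upon ", "a time"], on_text=lambda t: print(t, end=""))
```

Inside the `with` block above, the call would look like `collect_stream(stream.text_stream, on_text=lambda t: print(t, end="", flush=True))`.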

Rate Limits and Cost Structure

Rate Limits

The Claude API enforces rate limits on two dimensions: Tokens Per Minute (TPM) and Requests Per Minute (RPM). Depending on the model, token limits may be tracked separately for input and output tokens.

Limit Type	Description
TPM (Tokens Per Minute)	Maximum number of tokens that can be processed per minute
RPM (Requests Per Minute)	Maximum number of requests that can be sent per minute

Rate limit thresholds vary by usage tier. When a limit is exceeded, the API returns a 429 Too Many Requests error, along with a retry-after header indicating how long to wait. Implementing retries with exponential backoff is recommended.
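
A minimal sketch of how the wait time between retries might be computed. It assumes the caller passes in the server's wait hint in seconds when one is available, and otherwise falls back to exponential backoff with jitter. The function name, the 60-second cap, and the jitter scheme are illustrative choices, not values prescribed by the API.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Prefer the server's hint when one is available.
        return float(retry_after)
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    # "Equal jitter": wait between half the ceiling and the full ceiling,
    # so that many clients do not all retry at the same instant.
    return ceiling / 2 + rng() * (ceiling / 2)
```

Jitter matters mostly when several workers share one rate limit: without it, they tend to retry in lockstep and hit the limit again together.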

Cost Structure

Claude API costs are based on input tokens and output tokens.

Input tokens: System prompt + user messages + conversation history combined
Output tokens: Tokens in the text generated by Claude

Output tokens are generally priced higher per token than input tokens. Setting max_tokens appropriately to avoid unnecessary output, and keeping system prompts concise, are effective cost reduction measures.

# Check token usage
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
print(f"Total tokens: {message.usage.input_tokens + message.usage.output_tokens}")
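
The token counts can be turned into a rough cost estimate. The sketch below is only illustrative: `estimate_cost` is a hypothetical helper, and the per-million-token prices in the example are placeholders rather than actual Claude pricing, which varies by model.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of one request from its token usage."""
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only, not actual Claude pricing
cost = estimate_cost(
    1_000_000, 500_000, input_price_per_mtok=3.0, output_price_per_mtok=15.0
)
print(f"Estimated cost: {cost:.2f}")  # Estimated cost: 10.50
```

With a real response, the first two arguments would be `message.usage.input_tokens` and `message.usage.output_tokens`.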

Prompt Caching

What Is Prompt Caching?

Prompt caching is a feature that significantly reduces the cost of processing the same prompt prefix when it is sent multiple times.

Normally, every API request processes the full system prompt and bills all of its tokens at the regular input rate. With prompt caching enabled, a prefix that matches a previously cached one is read from cache, and those cached tokens are billed at up to 90% less than regular input tokens.

graph TD
  subgraph Without Cache
    R1[Request 1: full system prompt processed] --> C1[100% cost]
    R2[Request 2: full system prompt processed] --> C2[100% cost]
    R3[Request 3: full system prompt processed] --> C3[100% cost]
  end

  subgraph With Cache
    R4[Request 1: cache write] --> C4[125% cost]
    R5[Request 2: cache read] --> C5[10% cost]
    R6[Request 3: cache read] --> C6[10% cost]
  end

How to Enable Caching (cache_control: ephemeral)

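Caching is enabled by adding a cache_control field set to {"type": "ephemeral"} to a content block. Everything in the prompt up to and including that block becomes the cacheable prefix.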

import anthropic

client = anthropic.Anthropic()

# Example of caching a long system prompt
# (In practice, this would contain thousands of tokens of guidelines)
LONG_SYSTEM_PROMPT = """
You are a Python and data science expert.
Follow these guidelines when answering:
1. Always include type annotations in code
2. Write docstrings in Google format
3. Always include error handling
... (thousands of tokens of detailed guidelines)
"""

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache this section
        }
    ],
    messages=[
        {"role": "user", "content": "Write a function to read a CSV file with pandas."}
    ]
)

# Check cache usage
print(f"Cache write tokens: {message.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {message.usage.input_tokens}")
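
The three usage fields can be combined into a single number that shows how well caching is working. This is a sketch under the assumption that `input_tokens` counts only the tokens that were neither written to nor read from the cache, so the three fields add up to the total input. `cache_hit_ratio` is a hypothetical helper, and `SimpleNamespace` stands in for `message.usage`.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Return the share of input tokens that were read from cache (0.0 to 1.0)."""
    # Cache fields may be absent or None, so treat those cases as 0.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    regular = getattr(usage, "input_tokens", 0) or 0
    total = read + written + regular
    return read / total if total else 0.0


# Stand-in for message.usage on a second request that hit the cache
usage = SimpleNamespace(
    input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=1950
)
print(f"Cache hit ratio: {cache_hit_ratio(usage):.1%}")  # Cache hit ratio: 97.5%
```

A ratio that stays near zero on repeated requests usually means the prefix changes between requests or is too short to be cached.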

Cache TTL and Considerations

Item	Details
TTL (Time to Live)	5 minutes from last access
Minimum cache size	Prompts shorter than the model's minimum cacheable length are not cached (1024 tokens for some models, higher for others)
Write cost	Slightly higher than regular input tokens (1.25×)
Read cost reduction	Up to 90% reduction (10% of regular cost)
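
The multipliers in the table make it possible to work out when caching pays off. The sketch below compares the cost of a shared prefix with and without caching, assuming one cache write on the first request and cache reads on every later request within the TTL. `cached_cost_ratio` is a hypothetical helper; the defaults are the 1.25× and 10% figures from the table.

```python
def cached_cost_ratio(
    n_requests: int, write_multiplier: float = 1.25, read_multiplier: float = 0.10
) -> float:
    """Cost of the shared prefix with caching, relative to sending it uncached."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    # One cache write on the first request, then cache reads on the rest.
    with_cache = write_multiplier + (n_requests - 1) * read_multiplier
    without_cache = float(n_requests)  # every request pays the full prefix price
    return with_cache / without_cache


for n in (1, 2, 10):
    print(n, round(cached_cost_ratio(n), 3))
# 1 1.25
# 2 0.675
# 10 0.215
```

Under these assumptions a single request costs 25% more with caching, but the prefix is already cheaper from the second request onward.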

Use Cases Where Caching Is Effective

Prompt caching is particularly effective in the following scenarios.

Long system prompts: Sending thousands of tokens of guidelines or role definitions on every request
Repeated document reference: Asking multiple questions against the same codebase or technical document
RAG (Retrieval-Augmented Generation): Using the same retrieved document for multiple queries
Multi-turn conversations: Chatbots with long conversation histories

# Example of using cache in a RAG workflow
# Setting the retrieved document as the cache target
retrieved_document = "... (long document retrieved from search) ..."

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_document,
                    "cache_control": {"type": "ephemeral"}  # Cache the document
                },
                {
                    "type": "text",
                    "text": "What are the main points of this document?"
                }
            ]
        }
    ]
)
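
When several questions are asked about the same document, the document block has to stay identical and come before the part that changes, since only a matching prefix can be read from cache. A minimal sketch of building such requests follows; `build_document_question` is a hypothetical helper that only constructs the `messages` value and does not call the API.

```python
from typing import Any, Dict, List


def build_document_question(document: str, question: str) -> List[Dict[str, Any]]:
    """Build a messages list whose cached prefix is the document block."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": document,  # must be identical across requests
                    "cache_control": {"type": "ephemeral"},
                },
                # Only this block changes, and it sits after the cache breakpoint.
                {"type": "text", "text": question},
            ],
        }
    ]


document = "... (long document retrieved from search) ..."
questions = ["What are the main points?", "Which risks does it mention?"]
requests = [build_document_question(document, q) for q in questions]
```

Each entry in `requests` can be passed as the `messages` argument of `client.messages.create`.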

Batches API (Bulk Processing)

What Is the Batches API?

The Batches API processes large numbers of requests asynchronously. It offers a 50% cost reduction compared to the standard API and is suited for tasks that do not require real-time responses.

graph LR
  SUBMIT[Submit batch request] --> QUEUE[Processing queue]
  QUEUE --> PROCESS[Asynchronous processing]
  PROCESS --> RESULT["Retrieve results (up to 24 hours)"]

Basic Batches API Usage

import anthropic

client = anthropic.Anthropic()

# Create a batch request
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",  # ID to identify this request
            "params": {
                "model": "claude-haiku-4-5",   # Haiku is recommended for batch processing
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Tokyo' into French."}
                ]
            }
        },
        {
            "custom_id": "request-2",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Osaka' into French."}
                ]
            }
        }
        # ... up to 100,000 requests (or 256 MB in total) can be submitted in one batch
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll for completion
import time

while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)  # Wait 1 minute before checking again

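# Retrieve results (only succeeded results carry a message)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"ID: {result.custom_id}, Result: {result.result.message.content[0].text}")
    else:
        print(f"ID: {result.custom_id}, Status: {result.result.type}")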
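
For larger jobs the request list is usually generated rather than written by hand. The sketch below builds batch entries from a list of prompts and keeps a lookup table from `custom_id` back to the original prompt. This is useful on the assumption that results are not guaranteed to come back in submission order. Both function names are hypothetical, and the model ID and `max_tokens` defaults are copied from the example above.

```python
from typing import Any, Dict, List


def build_batch_requests(
    prompts: List[str], model: str = "claude-haiku-4-5", max_tokens: int = 100
) -> List[Dict[str, Any]]:
    """Turn a list of prompts into batch request entries with unique custom_ids."""
    return [
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts, start=1)
    ]


def prompt_by_custom_id(prompts: List[str]) -> Dict[str, str]:
    """Map each custom_id back to its prompt, for matching results later."""
    return {f"request-{i}": prompt for i, prompt in enumerate(prompts, start=1)}


prompts = ["Translate 'Tokyo' into French.", "Translate 'Osaka' into French."]
batch_requests = build_batch_requests(prompts)
lookup = prompt_by_custom_id(prompts)
```

The list in `batch_requests` can be passed as the `requests` argument of `client.messages.batches.create`, and `lookup[result.custom_id]` recovers the prompt for each result.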

Use Cases Suited for Batches API

Use Case	Description
Data analysis and labeling	Classifying and tagging large volumes of text data
Evaluation and benchmarking	Evaluating model response quality across large test suites
Bulk translation	Translating large volumes of documents at once
Offline report generation	Generating periodic analysis reports

Summary and Best Practices

Recommendations for Implementation

Always manage the API key via environment variables — never hardcode the API key in source code
Choose the model based on the use case — Sonnet as default, Haiku for cost-priority, Opus for high accuracy (see the model selection guide)
Set max_tokens appropriately — unnecessarily large values increase cost
Use prompt caching — always configure it for long system prompts or repeatedly referenced documents
Use the Batches API for non-real-time processing — 50% cost reduction is available
Implement error handling — use exponential backoff for rate limit errors (429)

# Implementation example with best practices
import anthropic
import time

client = anthropic.Anthropic()  # Uses the ANTHROPIC_API_KEY environment variable

def create_message_with_retry(messages, system=None, max_retries=3):
    """Send a message with retry using exponential backoff."""
    for attempt in range(max_retries):
        try:
            kwargs = {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": messages
            }
            if system:
                kwargs["system"] = system

            return client.messages.create(**kwargs)

        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) * 1  # 1s, 2s, 4s
            print(f"Rate limit reached. Waiting {wait_time} seconds...")
            time.sleep(wait_time)

        except anthropic.APIError as e:
            print(f"API error: {e}")
            raise

FAQ

Q: What value should I set for max_tokens?

A general guideline is to set it to 1.5–2 times the expected output length for the task. For code generation, 500–2048 is common; for short summaries, 100–500. max_tokens is an upper bound, and billing is based on the tokens actually generated, so a tight value mainly caps the worst-case cost of an unexpectedly long response.
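
A small sketch of applying this guideline, together with a check for responses that were cut off by the limit. Both helpers are hypothetical. The truncation check relies on the response's `stop_reason` field being `"max_tokens"` when the limit was hit, and `SimpleNamespace` stands in for a real response object.

```python
import math
from types import SimpleNamespace


def suggest_max_tokens(expected_output_tokens: int, factor: float = 1.5) -> int:
    """Apply the 1.5-2x guideline to an expected output length."""
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor is expected to be between 1.5 and 2.0")
    return math.ceil(expected_output_tokens * factor)


def was_truncated(message) -> bool:
    """Return True if generation stopped because it hit the max_tokens limit."""
    return getattr(message, "stop_reason", None) == "max_tokens"


print(suggest_max_tokens(400))               # 600
print(suggest_max_tokens(400, factor=2.0))   # 800
print(was_truncated(SimpleNamespace(stop_reason="max_tokens")))  # True
```

If `was_truncated` returns True often for a task, the expected output length was probably underestimated.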

Q: Is prompt caching available for all models?

It is supported for Claude 3 and later models (Opus, Sonnet, Haiku). For the latest information on supported models, refer to the Anthropic documentation.

Q: How should I decide between streaming and the Batches API?

Use streaming when immediate feedback to users is required, such as in chat UIs. Use the Batches API when real-time responses are not needed and large-scale processing is required, such as data analysis or evaluation pipelines.

Q: What should I do when rate limit errors (429) occur frequently?

First, implement retries with exponential backoff. If that is insufficient, consider moving to a higher usage tier in the Anthropic Console.

See the references for the external specifications and background sources used on this page.[1][2][3]

References
