Claude API and Prompt Caching
With the Claude API, adding AI conversation capabilities to an application takes just a few lines of code. With prompt caching, the input cost of a repeated prompt prefix can be reduced by up to 90%.
Claude API Fundamentals
Obtaining and Setting an API Key
The Claude API is an HTTP-based interface provided by Anthropic for calling Claude models programmatically. Authentication with an API key is required.
API keys are obtained from the Anthropic Console. Store the key in an environment variable rather than in source code: a hardcoded key is a security risk because it can leak through version control or shared files.
# Set the API key as an environment variable (add to ~/.zshrc or ~/.bashrc)
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"

Installing the Python SDK
The Anthropic Python SDK can be installed with pip.
pip install anthropic

Python 3.8 or later is required. After installation, the following code can be used to verify that the SDK works correctly.
Basic Message Sending (Python)
# Python 3.8 or later required
import anthropic

# Initialize the client (automatically reads the ANTHROPIC_API_KEY environment variable)
client = anthropic.Anthropic()

# Send a message
message = client.messages.create(
    model="claude-sonnet-4-6",  # Model ID to use
    max_tokens=1024,  # Maximum number of output tokens
    messages=[
        {"role": "user", "content": "Write a function in Python to generate the Fibonacci sequence."}
    ]
)

# Print the response
print(message.content[0].text)
# Example output: def fibonacci(n): ... (code is returned)

TypeScript / Node.js Example
import Anthropic from "@anthropic-ai/sdk";

// Automatically reads the ANTHROPIC_API_KEY environment variable
const client = new Anthropic();

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello, Claude" }],
});

// content is a list of typed blocks; narrow to a text block before reading .text
const firstBlock = message.content[0];
if (firstBlock?.type === "text") {
  console.log(firstBlock.text);
}

Key API Parameters
Messages API Structure
The Claude API is centered around the Messages API. Requests specify the following main parameters.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID to use (e.g., claude-sonnet-4-6) |
| max_tokens | integer | Yes | Maximum number of output tokens (the upper limit depends on the model) |
| messages | array | Yes | Conversation history (a list of objects, each with a role and content) |
| system | string or array | No | System prompt (defines the model’s role and constraints) |
| temperature | float | No | Output randomness (0.0–1.0, default: 1.0) |
| stream | boolean | No | Enable streaming response |
# Example with system prompt and temperature
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a Python expert. Provide explanations in English with inline comments in English.",
    temperature=0.3,  # Lower values produce more consistent output (recommended for code generation)
    messages=[
        {"role": "user", "content": "Explain list comprehensions."},
        {"role": "assistant", "content": "List comprehensions provide a concise way to create lists from existing iterables."},
        {"role": "user", "content": "Can you show me a more concrete example?"}
    ]
)

Streaming Responses
Streaming responses deliver generated text progressively as it is produced. This is used in chat UIs and other interfaces where showing partial responses immediately is desirable.
# Streaming response example
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a long story."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # Output text in real time

Rate Limits and Cost Structure
Rate Limits
The Claude API enforces two types of rate limits: Tokens Per Minute (TPM) and Requests Per Minute (RPM).
| Limit Type | Description |
|---|---|
| TPM (Tokens Per Minute) | Maximum number of tokens that can be processed per minute |
| RPM (Requests Per Minute) | Maximum number of requests that can be sent per minute |
Rate limit thresholds vary by API plan. When limits are reached, the API returns a 429 Too Many Requests error. Implementing retries with exponential backoff is recommended.
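As a rough illustration of the backoff recommendation, the sketch below computes the wait time before each retry. The helper name `backoff_delays` and its `base`, `cap`, and `jitter` parameters are hypothetical and not part of the Anthropic SDK; the random jitter is an optional addition, assumed here as a common way to keep many clients from retrying at the same instant.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=30.0, jitter=True, rng=None):
    """Return the wait time in seconds before each retry attempt.

    Hypothetical helper: delays double per attempt (base, 2*base, 4*base, ...),
    are capped at `cap`, and optionally get "full jitter" (a uniform random
    value between 0 and the capped delay).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


# Without jitter the schedule is deterministic.
print(backoff_delays(5, jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A retry loop would sleep for each of these values in turn after a 429 response and give up once the list is exhausted.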
Cost Structure
Claude API costs are based on input tokens and output tokens.
- Input tokens: System prompt + user messages + conversation history combined
- Output tokens: Tokens in the text generated by Claude
Output tokens are generally priced higher per token than input tokens. Setting max_tokens appropriately to avoid unnecessary output, and keeping system prompts concise, are effective cost reduction measures.
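To make the pricing model concrete, here is a minimal sketch that estimates the cost of one request from its token counts. The function `estimate_cost_usd` is a hypothetical helper, and the per-million-token prices are placeholders chosen only for illustration; actual prices differ by model and should be taken from the current pricing page.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok):
    """Estimate the cost of one request from its token counts.

    Prices are in USD per million tokens. This is a hypothetical helper,
    not part of the Anthropic SDK.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only (not actual Anthropic pricing).
EXAMPLE_INPUT_PRICE = 3.0
EXAMPLE_OUTPUT_PRICE = 15.0

cost = estimate_cost_usd(2_000, 500, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.4f}")
```

With these placeholder prices, the 500 output tokens cost more than the 2,000 input tokens, which is why capping output length matters.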
# Check token usage
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
print(f"Total tokens: {message.usage.input_tokens + message.usage.output_tokens}")

Prompt Caching
What Is Prompt Caching?
Prompt caching is a feature that significantly reduces the cost of processing the same prompt prefix when it is sent multiple times.
Normally, every API request processes and bills the full prompt, including the system prompt, at the regular input rate. With prompt caching enabled, a prefix that matches an earlier request is read from the cache, and those cached tokens are billed at up to 90% less.
graph TD
    subgraph Without Cache
        R1[Request 1: full system prompt processed] --> C1[100% cost]
        R2[Request 2: full system prompt processed] --> C2[100% cost]
        R3[Request 3: full system prompt processed] --> C3[100% cost]
    end
    subgraph With Cache
        R4[Request 1: cache write] --> C4[125% cost]
        R5[Request 2: cache read] --> C5[10% cost]
        R6[Request 3: cache read] --> C6[10% cost]
    end

How to Enable Caching (cache_control: ephemeral)
Caching is enabled by adding cache_control with the value {"type": "ephemeral"} to the content block that ends the prefix to be cached.
import anthropic
client = anthropic.Anthropic()
# Example of caching a long system prompt
# (In practice, this would contain thousands of tokens of guidelines)
LONG_SYSTEM_PROMPT = """
You are a Python and data science expert.
Follow these guidelines when answering:
1. Always include type annotations in code
2. Write docstrings in Google format
3. Always include error handling
... (thousands of tokens of detailed guidelines)
"""
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache this section
        }
    ],
    messages=[
        {"role": "user", "content": "Write a function to read a CSV file with pandas."}
    ]
)

# Check cache usage
print(f"Cache write tokens: {message.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {message.usage.input_tokens}")

Cache TTL and Considerations
| Item | Details |
|---|---|
| TTL (Time to Live) | 5 minutes, refreshed each time the cache is read |
| Minimum cache size | Only prefixes of at least 1024 tokens are cached (some models require more) |
| Write cost | Slightly higher than regular input tokens (1.25×) |
| Read cost reduction | Up to 90% reduction (10% of regular cost) |
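The write and read multipliers in the table above determine when caching pays off. The sketch below works through that arithmetic under simplifying assumptions: the first request writes the cache, every later request arrives within the TTL and reads it, and only the cached prefix is counted. The function name `relative_prefix_cost` is hypothetical.

```python
def relative_prefix_cost(num_requests, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of a cached prefix over num_requests, relative to no caching.

    Assumes one cache write followed by cache reads on every later request.
    The default multipliers come from the table above. A return value
    below 1.0 means caching is cheaper than sending the prefix uncached.
    """
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    cached = write_multiplier + read_multiplier * (num_requests - 1)
    uncached = 1.0 * num_requests
    return cached / uncached


for n in (1, 2, 10):
    print(f"{n} request(s): {relative_prefix_cost(n):.1%} of the uncached prefix cost")
```

Under these assumptions a single request costs more with caching (the write surcharge), the second request already brings the total below the uncached cost, and at ten requests the prefix costs roughly a fifth of what it would without caching.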
Use Cases Where Caching Is Effective
Prompt caching is particularly effective in the following scenarios.
- Long system prompts: Sending thousands of tokens of guidelines or role definitions on every request
- Repeated document reference: Asking multiple questions against the same codebase or technical document
- RAG (Retrieval-Augmented Generation): Using the same retrieved document for multiple queries
- Multi-turn conversations: Chatbots with long conversation histories
# Example of using cache in a RAG workflow
# Setting the retrieved document as the cache target
retrieved_document = "... (long document retrieved from search) ..."
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_document,
                    "cache_control": {"type": "ephemeral"}  # Cache the document
                },
                {
                    "type": "text",
                    "text": "What are the main points of this document?"
                }
            ]
        }
    ]
)

Batches API (Bulk Processing)
What Is the Batches API?
The Batches API is a feature for processing large numbers of requests asynchronously. It offers a 50% cost reduction compared to the standard API and is suited to tasks that do not require real-time responses.
graph LR
    SUBMIT[Submit batch request] --> QUEUE[Processing queue]
    QUEUE --> PROCESS[Asynchronous processing]
    PROCESS --> RESULT["Retrieve results (up to 24 hours)"]

Basic Batches API Usage
import anthropic
client = anthropic.Anthropic()
# Create a batch request
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",  # ID to identify this request
            "params": {
                "model": "claude-haiku-4-5",  # Haiku is recommended for batch processing
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Tokyo' into French."}
                ]
            }
        },
        {
            "custom_id": "request-2",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Osaka' into French."}
                ]
            }
        }
        # ... many more requests can be added to the same batch
        # (see the Message Batches documentation for the current per-batch limit)
    ]
)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll for completion
import time
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)  # Wait 1 minute before checking again

# Retrieve results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"ID: {result.custom_id}, Result: {result.result.message.content[0].text}")
    else:
        # Errored, canceled, or expired requests carry no message
        print(f"ID: {result.custom_id}, Status: {result.result.type}")

Use Cases Suited for Batches API
| Use Case | Description |
|---|---|
| Data analysis and labeling | Classifying and tagging large volumes of text data |
| Evaluation and benchmarking | Evaluating model response quality across large test suites |
| Bulk translation | Translating large volumes of documents at once |
| Offline report generation | Generating periodic analysis reports |
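For workloads like these, the request list is usually generated from data rather than written by hand. The following sketch builds entries in the same shape as the usage example above (a custom_id plus params) and splits them into smaller groups. The helpers `build_batch_requests` and `chunked`, and the chunk size used here, are hypothetical; no API call is made.

```python
def build_batch_requests(prompts, model, max_tokens=100, id_prefix="request"):
    """Build the `requests` list for a batch, one entry per prompt.

    Hypothetical helper: each entry pairs a unique custom_id with the
    params that would be sent for that prompt.
    """
    return [
        {
            "custom_id": f"{id_prefix}-{index}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for index, prompt in enumerate(prompts, start=1)
    ]


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


cities = ["Tokyo", "Osaka", "Kyoto"]
batch_requests = build_batch_requests(
    [f"Translate '{city}' into French." for city in cities],
    model="claude-haiku-4-5",
)

# A chunk size of 2 is only for demonstration; real batches can be far larger.
for chunk in chunked(batch_requests, size=2):
    print([entry["custom_id"] for entry in chunk])
```

Each chunk could then be passed as the requests argument of a separate batch, with the custom_id values used to match results back to the input data.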
Summary and Best Practices
Recommendations for Implementation
- Always manage the API key via environment variables: never hardcode the API key in source code
- Choose the model based on the use case: Sonnet as default, Haiku for cost-priority, Opus for high accuracy (see the model selection guide)
- Set max_tokens appropriately: an unnecessarily high cap allows long, costly outputs
- Use prompt caching: configure it for long system prompts or repeatedly referenced documents
- Use the Batches API for non-real-time processing: 50% cost reduction is available
- Implement error handling: use exponential backoff for rate limit errors (429)
# Implementation example with best practices
import anthropic
import time

client = anthropic.Anthropic()  # Uses the ANTHROPIC_API_KEY environment variable

def create_message_with_retry(messages, system=None, max_retries=3):
    """Send a message with retry using exponential backoff."""
    for attempt in range(max_retries):
        try:
            kwargs = {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": messages
            }
            if system:
                kwargs["system"] = system
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) * 1  # 1s, 2s, 4s
            print(f"Rate limit reached. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
        except anthropic.APIError as e:
            print(f"API error: {e}")
            raise

Q: What value should I set for max_tokens?
A general guideline is to set it to 1.5–2 times the expected output length for the task. For code generation, 500–2048 is common; for short summaries, 100–500. Setting an excessively large value may generate unnecessary tokens and increase costs.
Q: Is prompt caching available for all models?
It is supported for Claude 3 and later models (Opus, Sonnet, Haiku). For the latest information on supported models, refer to the Anthropic documentation.
Q: How should I decide between streaming and the Batches API?
Use streaming when immediate feedback to users is required, such as in chat UIs. Use the Batches API when real-time responses are not needed and large-scale processing is required, such as data analysis or evaluation pipelines.
Q: What should I do when rate limit errors (429) occur frequently?
First, implement retries with exponential backoff. If that is insufficient, consider upgrading to a higher plan in the Anthropic Console.
See the references for the external specifications and background sources used on this page.[1][2][3]
References
1. Anthropic Documentation: Messages API
2. Anthropic Documentation: Prompt Caching
3. Anthropic Documentation: Message Batches API