Claude API and Prompt Caching

About 5 minutes

Target audience: Developers new to the Claude API, or those optimizing costs with the SDK
Prerequisites: Python basics (variables, functions, pip), Claude Model Comparison and Selection Guide

With the Claude API, adding AI conversation capabilities to an application takes just a few lines of code. With prompt caching, the input cost of a repeated prompt prefix can be reduced by up to 90%.

The Claude API is an HTTP-based interface provided by Anthropic for calling Claude models programmatically. Authentication with an API key is required.

API keys are issued in the Anthropic Console. Store the key in an environment variable rather than in source code: a hardcoded key is a security risk because it can leak through version control, logs, or shared files.

# Set the API key as an environment variable (add to ~/.zshrc or ~/.bashrc)
export ANTHROPIC_API_KEY="sk-ant-xxxxxxxxxxxx"
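
A minimal sketch of failing fast when the variable is missing, so a misconfigured shell produces a clear error at startup rather than an authentication failure on the first request. The helpers `load_api_key` and `mask` are hypothetical names for illustration and are not part of the SDK, which reads the variable on its own.

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the anthropic SDK.
    """
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set. Export it in your shell profile."
        )
    return key


def mask(key, visible=4):
    """Mask a key for logging so the full secret never reaches log files."""
    if len(key) <= visible:
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - visible)


# Demo with an in-memory mapping instead of the real environment.
demo_env = {"ANTHROPIC_API_KEY": "sk-ant-xxxxxxxxxxxx"}
print(mask(load_api_key(demo_env)))
```

Passing a plain dictionary as `env` keeps the check easy to test without touching the real environment.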

The Anthropic Python SDK is installable via pip.

pip install anthropic

Python 3.8 or later is required. After installation, the following code can be used to verify that the SDK works correctly.

# Python 3.8 or later required
import anthropic

# Initialize the client (automatically reads the ANTHROPIC_API_KEY environment variable)
client = anthropic.Anthropic()

# Send a message
message = client.messages.create(
    model="claude-sonnet-4-6",        # Model ID to use
    max_tokens=1024,                   # Maximum number of output tokens
    messages=[
        {"role": "user", "content": "Write a function in Python to generate the Fibonacci sequence."}
    ]
)

# Print the response
print(message.content[0].text)
# Example output: def fibonacci(n): ... (code is returned)

The same kind of request with the TypeScript SDK:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
// Automatically reads the ANTHROPIC_API_KEY environment variable

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello, Claude" }],
});

console.log(message.content[0].text);

The Claude API is centered around the Messages API. Requests specify the following main parameters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model ID to use (e.g., claude-sonnet-4-6) |
| max_tokens | integer | Yes | Maximum number of output tokens (the upper limit depends on the model) |
| messages | array | Yes | Conversation history (objects with role and content) |
| system | string or array of text blocks | No | System prompt (defines the model’s role and constraints) |
| temperature | float | No | Output randomness (0.0–1.0, default: 1.0) |
| stream | boolean | No | Enable streaming response |

# Example with system prompt and temperature
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a Python expert. Provide explanations and inline code comments in English.",
    temperature=0.3,       # Lower values produce more consistent output (recommended for code generation)
    messages=[
        {"role": "user", "content": "Explain list comprehensions."},
        {"role": "assistant", "content": "List comprehensions provide a concise way to create lists from existing iterables."},
        {"role": "user", "content": "Can you show me a more concrete example?"}
    ]
)
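
The Messages API is stateless, so every request carries the whole conversation in `messages`. The sketch below shows one way to accumulate that history between turns. `Conversation` is a hypothetical helper rather than an SDK class, and the commented call at the end assumes a `client` created as in the earlier examples.

```python
class Conversation:
    """Hypothetical container that keeps turns in the order the API expects."""

    VALID_ROLES = ("user", "assistant")

    def __init__(self):
        self.messages = []

    def add(self, role, content):
        """Append one turn; the first turn must come from the user."""
        if role not in self.VALID_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        if not self.messages and role != "user":
            raise ValueError("the first message must have role 'user'")
        self.messages.append({"role": role, "content": content})
        return self

    def ready_to_send(self):
        """True when the latest turn is a user turn awaiting a reply."""
        return bool(self.messages) and self.messages[-1]["role"] == "user"


convo = Conversation()
convo.add("user", "Explain list comprehensions.")
convo.add("assistant", "List comprehensions build lists from iterables.")
convo.add("user", "Can you show me a more concrete example?")
print(len(convo.messages), convo.ready_to_send())

# With a real client, the history is passed unchanged:
# client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
#                        messages=convo.messages)
```

After each response, appending the assistant text with `convo.add("assistant", ...)` keeps the next request consistent with what the user saw.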

Streaming responses deliver generated text progressively as it is produced. This is used in chat UIs and other interfaces where showing partial responses immediately is desirable.

# Streaming response example
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a long story."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # Output text in real time
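
A chat UI usually needs the complete text once streaming finishes, for example to store it in the conversation history. The sketch below echoes chunks as they arrive and returns the joined result. `render_stream` is a hypothetical helper; `chunks` can be any iterable of strings, such as `stream.text_stream` from the example above.

```python
def render_stream(chunks, write=print):
    """Echo text chunks as they arrive and return the complete text."""
    parts = []
    for text in chunks:
        write(text, end="", flush=True)  # show partial output immediately
        parts.append(text)
    return "".join(parts)


# Demo with a plain list standing in for a live stream.
full_text = render_stream(["Once ", "upon ", "a time."])
print()
print(len(full_text))
```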

The Claude API enforces two types of rate limits: Tokens Per Minute (TPM) and Requests Per Minute (RPM).

| Limit Type | Description |
| --- | --- |
| TPM (Tokens Per Minute) | Maximum number of tokens that can be processed per minute |
| RPM (Requests Per Minute) | Maximum number of requests that can be sent per minute |

Rate limit thresholds vary by usage tier and model. When a limit is reached, the API returns a 429 Too Many Requests error, along with a retry-after header indicating how long to wait. Implementing retries with exponential backoff is recommended.
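
As a rough sketch of the backoff schedule itself, the function below computes how long to wait before each retry. It doubles a base delay per attempt, caps it, and applies random "full jitter" so that many clients do not retry in lockstep. The name `backoff_delay` and its defaults are illustrative assumptions; the `retry_after` argument stands for a server-provided wait time, if your client exposes one.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None,
                  rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    A server-provided retry_after value takes priority when available.
    """
    if retry_after is not None:
        return max(0.0, float(retry_after))
    ceiling = min(cap, base * (2 ** attempt))  # 1, 2, 4, ... up to cap
    return ceiling * rng()  # full jitter: uniform in [0, ceiling)


for attempt in range(4):
    print(f"attempt {attempt}: wait up to {min(60.0, 2 ** attempt):.0f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

Jitter is a design choice rather than a requirement: a fixed doubling schedule also works, but it tends to synchronize retries across workers that hit the limit at the same moment.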

Claude API costs are based on input tokens and output tokens.

  • Input tokens: System prompt + user messages + conversation history combined
  • Output tokens: Tokens in the text generated by Claude

Output tokens are generally priced higher per token than input tokens. Setting max_tokens appropriately to avoid unnecessary output, and keeping system prompts concise, are effective cost reduction measures.

# Check token usage
print(f"Input tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
print(f"Total tokens: {message.usage.input_tokens + message.usage.output_tokens}")
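
To turn token counts into an approximate cost, multiply each count by its per-token price. The sketch below does that arithmetic. The `Pricing` values are placeholders chosen for illustration, not actual prices, so substitute the current rates for your model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens (placeholder values below)."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(input_tokens, output_tokens, pricing):
    """Estimated request cost in dollars from token counts."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Placeholder prices for illustration only; check the current price list.
EXAMPLE_PRICING = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

print(f"${estimate_cost(1200, 400, EXAMPLE_PRICING):.4f}")
```

With a real response, the counts would come from `message.usage.input_tokens` and `message.usage.output_tokens`.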

Prompt caching is a feature that significantly reduces the cost of processing the same prompt prefix when it is sent multiple times.

Normally, every API request processes and charges for the full system prompt token count. With prompt caching enabled, the same prefix section is served from cache, reducing processing costs by up to 90%.

graph TD
  subgraph Without Cache
    R1[Request 1: full system prompt processed] --> C1[100% cost]
    R2[Request 2: full system prompt processed] --> C2[100% cost]
    R3[Request 3: full system prompt processed] --> C3[100% cost]
  end

  subgraph With Cache
    R4[Request 1: cache write] --> C4[125% cost]
    R5[Request 2: cache read] --> C5[10% cost]
    R6[Request 3: cache read] --> C6[10% cost]
  end

How to Enable Caching (cache_control: ephemeral)

Caching is enabled by adding cache_control with the value {"type": "ephemeral"} to a content block. The prompt prefix up to and including that block becomes the cached section.

import anthropic

client = anthropic.Anthropic()

# Example of caching a long system prompt
# (In practice, this would contain thousands of tokens of guidelines)
LONG_SYSTEM_PROMPT = """
You are a Python and data science expert.
Follow these guidelines when answering:
1. Always include type annotations in code
2. Write docstrings in Google format
3. Always include error handling
... (thousands of tokens of detailed guidelines)
"""

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # Cache this section
        }
    ],
    messages=[
        {"role": "user", "content": "Write a function to read a CSV file with pandas."}
    ]
)

# Check cache usage
print(f"Cache write tokens: {message.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")
print(f"Regular input tokens: {message.usage.input_tokens}")

| Item | Details |
| --- | --- |
| TTL (Time to Live) | 5 minutes from last access |
| Minimum cache size | Applies only to prefixes above a model-dependent minimum (1024 tokens on many models, higher on some) |
| Write cost | Slightly higher than regular input tokens (1.25×) |
| Read cost reduction | Up to 90% reduction (10% of regular cost) |
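
The multipliers above imply a break-even point, which the following sketch works out in relative units, where 1.0 is the cost of processing the prefix once without caching. It assumes every request after the first arrives within the TTL and hits the cache; the function names are illustrative.

```python
WRITE_MULTIPLIER = 1.25  # first request writes the cache
READ_MULTIPLIER = 0.10   # later requests read from it


def prefix_cost_without_cache(requests):
    """Relative cost of sending the same prefix `requests` times."""
    return float(requests)


def prefix_cost_with_cache(requests):
    """Relative cost with one cache write followed by cache reads."""
    if requests <= 0:
        return 0.0
    return WRITE_MULTIPLIER + READ_MULTIPLIER * (requests - 1)


for n in (1, 2, 10):
    plain = prefix_cost_without_cache(n)
    cached = prefix_cost_with_cache(n)
    print(f"{n:>2} requests: {plain:.2f} without cache, {cached:.2f} with cache")
```

Under these assumptions a single request costs 1.25 units instead of 1.0, so caching only pays off from the second request onward, where the total is 1.35 units instead of 2.0.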

Prompt caching is particularly effective in the following scenarios.

  • Long system prompts: Sending thousands of tokens of guidelines or role definitions on every request
  • Repeated document reference: Asking multiple questions against the same codebase or technical document
  • RAG (Retrieval-Augmented Generation): Using the same retrieved document for multiple queries
  • Multi-turn conversations: Chatbots with long conversation histories

# Example of using cache in a RAG workflow
# Setting the retrieved document as the cache target
retrieved_document = "... (long document retrieved from search) ..."

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_document,
                    "cache_control": {"type": "ephemeral"}  # Cache the document
                },
                {
                    "type": "text",
                    "text": "What are the main points of this document?"
                }
            ]
        }
    ]
)
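
For the multi-turn scenario, one common approach is to place the cache marker on the last content block of the latest message, so that the following request can read the earlier conversation from cache. The sketch below is a hypothetical helper, `with_cache_breakpoint`, that returns a marked copy of a `messages` list without modifying the original. It assumes text content, given either as a string or as a list of blocks.

```python
import copy


def with_cache_breakpoint(messages):
    """Return a copy of `messages` with cache_control on the final block."""
    marked = copy.deepcopy(messages)
    if not marked:
        return marked
    last = marked[-1]
    content = last["content"]
    if isinstance(content, str):
        # Plain strings cannot carry cache_control, so wrap in a text block.
        content = [{"type": "text", "text": content}]
    if not content:
        return marked
    content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return marked


history = [
    {"role": "user", "content": "Summarize chapter 1."},
    {"role": "assistant", "content": "Chapter 1 introduces the API."},
    {"role": "user", "content": "Now summarize chapter 2."},
]
marked = with_cache_breakpoint(history)
print(marked[-1]["content"][-1]["cache_control"])
```

The marked list would be passed as `messages` in place of the original; whether a given request actually reads from cache can be confirmed through the usage fields shown earlier.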

The Batches API is a feature for processing large numbers of requests asynchronously. It offers 50% cost reduction compared to the standard API. It is suited for tasks that do not require real-time responses.

graph LR
  SUBMIT[Submit batch request] --> QUEUE[Processing queue]
  QUEUE --> PROCESS[Asynchronous processing]
  PROCESS --> RESULT["Retrieve results (up to 24 hours)"]

import anthropic

client = anthropic.Anthropic()

# Create a batch request
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "request-1",  # ID to identify this request
            "params": {
                "model": "claude-haiku-4-5",   # A lower-cost model suits simple bulk tasks like this one
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Tokyo' into French."}
                ]
            }
        },
        {
            "custom_id": "request-2",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 100,
                "messages": [
                    {"role": "user", "content": "Translate 'Osaka' into French."}
                ]
            }
        }
        # ... more requests can be added (the documented per-batch limit is 100,000 requests or 256 MB)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Poll for completion
import time

while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(60)  # Wait 1 minute before checking again

# Retrieve results (only succeeded entries carry a message)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"ID: {result.custom_id}, Result: {result.result.message.content[0].text}")
    else:
        print(f"ID: {result.custom_id}, Status: {result.result.type}")

| Use Case | Description |
| --- | --- |
| Data analysis and labeling | Classifying and tagging large volumes of text data |
| Evaluation and benchmarking | Evaluating model response quality across large test suites |
| Bulk translation | Translating large volumes of documents at once |
| Offline report generation | Generating periodic analysis reports |
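
For use cases like these, the request list is normally generated from data rather than written by hand. The sketch below builds entries in the same shape as the batch example and splits them into chunks. Both helper names are hypothetical, and the chunk size is a caller-chosen value that should stay within the documented per-batch limits.

```python
def build_batch_requests(prompts, model, max_tokens=100, prefix="request"):
    """Build one batch entry per prompt, each with a unique custom_id."""
    return [
        {
            "custom_id": f"{prefix}-{index}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for index, prompt in enumerate(prompts, start=1)
    ]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


cities = ["Tokyo", "Osaka", "Kyoto", "Nagoya", "Sapporo"]
prompts = [f"Translate '{city}' into French." for city in cities]
requests = build_batch_requests(prompts, model="claude-haiku-4-5")

for batch_number, chunk in enumerate(chunked(requests, 2), start=1):
    ids = [entry["custom_id"] for entry in chunk]
    print(f"batch {batch_number}: {ids}")
    # With a real client: client.messages.batches.create(requests=chunk)
```

Keeping `custom_id` unique across all chunks makes it possible to match results back to the input rows afterwards.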

  1. Always manage the API key via environment variables: never hardcode the API key in source code
  2. Choose the model based on the use case: Sonnet as default, Haiku for cost-priority, Opus for high accuracy (see the model selection guide)
  3. Set max_tokens appropriately: unnecessarily large values increase cost
  4. Use prompt caching: always configure it for long system prompts or repeatedly referenced documents
  5. Use the Batches API for non-real-time processing: 50% cost reduction is available
  6. Implement error handling: use exponential backoff for rate limit errors (429)

# Implementation example with best practices
import anthropic
import time

client = anthropic.Anthropic()  # Uses the ANTHROPIC_API_KEY environment variable

def create_message_with_retry(messages, system=None, max_retries=3):
    """Send a message with retry using exponential backoff."""
    for attempt in range(max_retries):
        try:
            kwargs = {
                "model": "claude-sonnet-4-6",
                "max_tokens": 1024,
                "messages": messages
            }
            if system:
                kwargs["system"] = system

            return client.messages.create(**kwargs)

        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, then 2s, doubling on each retry
            print(f"Rate limit reached. Waiting {wait_time} seconds...")
            time.sleep(wait_time)

        except anthropic.APIError as e:
            print(f"API error: {e}")
            raise

Q: What value should I set for max_tokens?

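A general guideline is to set it to 1.5–2 times the expected output length for the task. For code generation, 500–2048 is common; for short summaries, 100–500. max_tokens is a cap rather than a target: billing follows the tokens actually generated, but an excessively large cap lets an unexpectedly long response run up costs, while a cap that is too small truncates the output.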
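
One way to tune the value is to check whether responses are being cut off. A response that stopped because it reached the cap reports a `stop_reason` of `"max_tokens"`. The sketch below wraps that check in a hypothetical helper, `was_truncated`, and uses a stand-in object in place of a real API response.

```python
from types import SimpleNamespace


def was_truncated(message):
    """True when generation stopped because max_tokens was reached."""
    return getattr(message, "stop_reason", None) == "max_tokens"


# Stand-in objects for demonstration; a real message comes from
# client.messages.create(...).
cut_off = SimpleNamespace(stop_reason="max_tokens")
complete = SimpleNamespace(stop_reason="end_turn")

for name, msg in (("cut_off", cut_off), ("complete", complete)):
    if was_truncated(msg):
        print(f"{name}: hit the cap, consider raising max_tokens")
    else:
        print(f"{name}: finished normally")
```

If truncation shows up often for a task, raising the cap is usually cheaper than re-requesting the missing remainder.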

Q: Is prompt caching available for all models?

It is supported for Claude 3 and later models (Opus, Sonnet, Haiku). For the latest information on supported models, refer to the Anthropic documentation.

Q: How should I decide between streaming and the Batches API?

Use streaming when immediate feedback to users is required, such as in chat UIs. Use the Batches API when real-time responses are not needed and large-scale processing is required, such as data analysis or evaluation pipelines.

Q: What should I do when rate limit errors (429) occur frequently?

First, implement retries with exponential backoff. If that is insufficient, reduce the request rate or tokens per request, or consider moving to a higher usage tier in the Anthropic Console.


See the references for the external specifications and background sources used on this page.[1][2][3]

  1. Anthropic Documentation — Messages API
  2. Anthropic Documentation — Prompt Caching
  3. Anthropic Documentation — Message Batches API
