Prompt Caching

OpenAI offers prompt caching, a feature that can significantly reduce both latency and cost for your API requests. Caching applies to prompts exceeding 1024 tokens and can cut latency by up to 80% for longer prompts of over 10,000 tokens.

Prompt caching is enabled for the following models:

  • gpt-4o (excludes gpt-4o-2024-05-13)

  • gpt-4o-mini

  • o1-preview

  • o1-mini

Portkey supports OpenAI's prompt caching feature out of the box. Here is an example of how to use it:

import json

from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",
    virtual_key="OPENAI_VIRTUAL_KEY",
)

# Define tools (for function calling example)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    }
]


# Example: Function calling with caching
response = portkey.chat.completions.create(
  model="gpt-4o",  # use one of the caching-enabled models listed above
  messages=[
    {"role": "system", "content": "You are a helpful assistant that can check the weather."},
    {"role": "user", "content": "What's the weather like in San Francisco?"}
  ],
  tools=tools,
  tool_choice="auto"
)
print(json.dumps(response.model_dump(), indent=2))
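
Subsequent requests that share the same prompt prefix (system message, tool definitions, and earlier messages) reuse the cached portion automatically once that prefix exceeds 1024 tokens; no additional parameters are required.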

What Can Be Cached

  • Messages: The complete messages array, encompassing system, user, and assistant interactions.

  • Images: Images included in user messages, either as links or as base64-encoded data; multiple images can be sent. Ensure the detail parameter is set identically across requests, as it affects image tokenization (see the sketch after this list).

  • Tool use: Both the messages array and the list of available tools can be cached, contributing to the minimum 1024 token requirement.

  • Structured outputs: The structured output schema serves as a prefix to the system message and can be cached.
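
For example, an image input can be part of the cached prefix as long as its detail setting stays identical between requests. The sketch below reuses the Portkey client from the earlier example and assumes a placeholder image URL:

# Example: image input in a cacheable prompt (placeholder URL)
response = portkey.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You describe images in detail."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.jpg",  # placeholder
                        "detail": "high",  # keep identical across requests for cache hits
                    },
                },
            ],
        },
    ],
)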

What's Not Supported

  • Completions API (only Chat Completions API is supported)

  • Streaming responses (caching still works with streamed requests, but streaming itself is not affected by it)

Monitoring Cache Performance

Cache usage for prompt caching requests and responses is reported based on OpenAI's calculations:

All requests, including those with fewer than 1024 tokens, will include a cached_tokens field in the usage.prompt_tokens_details object of the chat completion response, indicating how many of the prompt tokens were a cache hit.

For requests under 1024 tokens, cached_tokens will be zero.
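
As a rough illustration, the cached token count can be read from the same response object used in the earlier example; the field names below follow OpenAI's usage schema for chat completions:

# Example: inspect cache usage on a chat completion response
usage = response.model_dump().get("usage", {})
details = usage.get("prompt_tokens_details") or {}
print("Prompt tokens:", usage.get("prompt_tokens"))
print("Cached prompt tokens:", details.get("cached_tokens", 0))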

Key Features:

  • Reduced Latency: Especially significant for longer prompts.

  • Lower Costs: Cached portions of prompts are billed at a discounted rate.

  • Improved Efficiency: Allows for more context in prompts without increasing costs proportionally.

  • Zero Data Retention: No data is stored during the caching process, making it eligible for zero data retention policies.
