Extended thinking gives Claude 3.7 Sonnet enhanced reasoning capabilities for complex tasks, while also providing transparency into its step-by-step thought process before it delivers its final answer.

How extended thinking works

When extended thinking is turned on, Claude creates thinking content blocks where it outputs its internal reasoning. Claude incorporates insights from this reasoning before crafting a final response.

The API response will include both thinking and text content blocks.

In multi-turn conversations, only thinking blocks associated with a tool use session or with an assistant turn in the last message position are visible to Claude and billed as input tokens. Thinking blocks associated with earlier assistant messages are not visible to Claude during sampling and are not billed as input tokens.

Implementing extended thinking

Add the thinking parameter, with a specified token budget for extended thinking, to your API request.

The budget_tokens parameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. Larger budgets can improve response quality by enabling more thorough analysis of complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32K.

Your budget_tokens must always be less than the max_tokens specified.
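
For example, a request might look like the following sketch using the Python SDK; the prompt and budget values are illustrative:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,            # must be greater than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 8000,   # cap on tokens for internal reasoning
    },
    messages=[{
        "role": "user",
        "content": "Are there an infinite number of primes p where p mod 4 == 3?",
    }],
)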

The API response will include both thinking and text content blocks:

{
    "content": [
        {
            "type": "thinking",
            "thinking": "To approach this, let's think about what we know about prime numbers...",
            "signature": "zbbJhbGciOiJFU8zI1NiIsImtakcjsu38219c0.eyJoYXNoIjoiYWJjMTIzIiwiaWFxxxjoxNjE0NTM0NTY3fQ...."
        },
        {
            "type": "text",
            "text": "Yes, there are infinitely many prime numbers such that..."
        }
    ]
}

Understanding thinking blocks

Thinking blocks represent Claude’s internal thought process. In order to allow Claude to work through problems with minimal internal restrictions while maintaining our safety standards and our stateless APIs, we have implemented the following:

  • Thinking blocks contain a signature field. This field holds a cryptographic token which verifies that the thinking block was generated by Claude, and is verified when thinking blocks are passed back to the API. When streaming responses, the signature is added via a signature_delta inside a content_block_delta event just before the content_block_stop event. It is only strictly necessary to send back thinking blocks when using tool use with extended thinking. Otherwise you can omit thinking blocks from previous turns, or let the API strip them for you if you pass them back.
  • Occasionally Claude’s internal reasoning will be flagged by our safety systems. When this occurs, we encrypt some or all of the thinking block and return it to you as a redacted_thinking block. These redacted thinking blocks are decrypted when passed back to the API, allowing Claude to continue its response without losing context.

Here’s an example showing both normal and redacted thinking blocks:

{
  "content": [
    {
      "type": "thinking",
      "thinking": "Let me analyze this step by step...",
      "signature": "WaUjzkypQ2mUEVM36O2TxuC06KN8xyfbJwyem2dw3URve/op91XWHOEBLLqIOMfFG/UvLEczmEsUjavL...."
    },
    {
      "type": "redacted_thinking",
      "data": "EmwKAhgBEgy3va3pzix/LafPsn4aDFIT2Xlxh0L5L8rLVyIwxtE3rAFBa8cr3qpP..."
    },
    {
      "type": "text",
      "text": "Based on my analysis..."
    }
  ]
}

Seeing redacted thinking blocks in your output is expected behavior. The model can still use this redacted reasoning to inform its responses while maintaining safety guardrails.

If you need to test redacted thinking handling in your application, you can use this special test string as your prompt: ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB
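
A minimal test sketch using the Python SDK (the model name and budget values are illustrative):

import anthropic

client = anthropic.Anthropic()

# The special test string reliably triggers a redacted_thinking block,
# so you can exercise your handling code.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB",
    }],
)

assert any(block.type == "redacted_thinking" for block in response.content)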

When passing thinking and redacted_thinking blocks back to the API in a multi-turn conversation, you must include the complete, unmodified blocks for the last assistant turn.

This is critical for maintaining the model’s reasoning flow. We suggest always passing back all thinking blocks to the API. For more details, see the Preserving thinking blocks section below.

Suggestions for handling redacted thinking in production

When building customer-facing applications that use extended thinking:

  • Be aware that redacted thinking blocks contain encrypted content that isn’t human-readable
  • Consider providing a simple explanation like: “Some of Claude’s internal reasoning has been automatically encrypted for safety reasons. This doesn’t affect the quality of responses.”
  • If showing thinking blocks to users, you can filter out redacted blocks while preserving normal thinking blocks (see the sketch after this list)
  • Be transparent that using extended thinking features may occasionally result in some reasoning being encrypted
  • Implement appropriate error handling to gracefully manage redacted thinking without breaking your UI
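
A minimal filtering sketch, assuming content blocks are dicts shaped like the JSON responses above:

def displayable_blocks(content):
    # Drop encrypted redacted_thinking blocks before rendering in a UI.
    # Keep normal thinking blocks and text blocks for display; the full,
    # unmodified blocks should still be passed back to the API.
    return [
        block for block in content
        if block["type"] in ("thinking", "text")
    ]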

Streaming extended thinking

When streaming is enabled, you’ll receive thinking content via thinking_delta events. Here’s how to handle streaming with thinking:
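
A sketch of a streaming handler using the Python SDK; the event fields mirror the raw stream events shown below:

import anthropic

client = anthropic.Anthropic()

stream = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "What is 27 * 453?"}],
    stream=True,
)

for event in stream:
    if event.type == "content_block_start":
        print(f"\n[{event.content_block.type}]")
    elif event.type == "content_block_delta":
        if event.delta.type == "thinking_delta":
            print(event.delta.thinking, end="", flush=True)
        elif event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
        # signature_delta arrives last within a thinking block; preserve it
        # if you plan to pass the thinking block back to the API.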

Example streaming output:

event: message_start
data: {"type": "message_start", "message": {"id": "msg_01...", "type": "message", "role": "assistant", "content": [], "model": "claude-3-7-sonnet-20250219", "stop_reason": null, "stop_sequence": null}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "thinking", "thinking": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "Let me solve this step by step:\n\n1. First break down 27 * 453"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "\n2. 453 = 400 + 50 + 3"}}

// Additional thinking deltas...

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "signature_delta", "signature": "EqQBCgIYAhIM1gbcDa9GJwZA2b3hGgxBdjrkzLoky3dl1pkiMOYds..."}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: content_block_start
data: {"type": "content_block_start", "index": 1, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 1, "delta": {"type": "text_delta", "text": "27 * 453 = 12,231"}}

// Additional text deltas...

event: content_block_stop
data: {"type": "content_block_stop", "index": 1}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn", "stop_sequence": null}}

event: message_stop
data: {"type": "message_stop"}

About streaming behavior with thinking

When using streaming with thinking enabled, you might notice that text sometimes arrives in larger chunks alternating with smaller, token-by-token delivery. This is expected behavior, especially for thinking content.

The streaming system needs to process content in batches for optimal performance, which can result in this “chunky” delivery pattern. We’re continuously working to improve this experience, with future updates focused on making thinking content stream more smoothly.

redacted_thinking blocks have no associated deltas and are sent as a single event.

Important considerations when using extended thinking

Working with the thinking budget: The minimum budget is 1,024 tokens. We suggest starting at the minimum and increasing the thinking budget incrementally to find the optimal range for Claude to perform well for your use case. Higher token counts may allow you to achieve more comprehensive and nuanced reasoning, but there may also be diminishing returns depending on the task.

  • The thinking budget is a target rather than a strict limit - actual token usage may vary based on the task.
  • Be prepared for potentially longer response times due to the additional processing required for the reasoning process.
  • Streaming is required when max_tokens is greater than 21,333.

For thinking budgets above 32K: We recommend using batch processing for workloads where the thinking budget is set above 32K to avoid networking issues. Requests that push the model to think above 32K tokens create long-running requests that might run up against system timeouts and open connection limits.

Thinking compatibility with other features:

  • Thinking isn’t compatible with temperature, top_p, or top_k modifications, or with forced tool use.
  • You cannot pre-fill responses when thinking is enabled.
  • Changes to the thinking budget invalidate cached prompt prefixes that include messages. However, cached system prompts and tool definitions will continue to work when thinking parameters change.

Pricing and token usage for extended thinking

Extended thinking tokens count towards the context window and are billed as output tokens. Since thinking tokens are treated as normal output tokens, they also count towards your rate limits. Be sure to account for this increased token usage when planning your API usage.

For Claude 3.7 Sonnet, the pricing is:

Token use                                 | Cost
Input tokens                              | $3 / MTok
Output tokens (including thinking tokens) | $15 / MTok
Prompt caching write                      | $3.75 / MTok
Prompt caching read                       | $0.30 / MTok

Batch processing for extended thinking is available at 50% off these prices and often completes in less than 1 hour.

All extended thinking tokens (including redacted thinking tokens) are billed as output tokens and count toward your rate limits.
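
As a rough illustration of the billing above (a sketch; prices are the Claude 3.7 Sonnet rates from the table, and the token counts correspond to the API's usage fields):

# Prices per million tokens for Claude 3.7 Sonnet (see table above).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # output_tokens already includes thinking and redacted thinking tokens,
    # so no separate line item is needed.
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000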

In multi-turn conversations, thinking blocks associated with earlier assistant messages do not get billed as input tokens.

When extended thinking is enabled, a specialized 28 or 29 token system prompt is automatically included to support this feature.

Extended output capabilities (beta)

Claude 3.7 Sonnet can produce substantially longer responses than previous models with support for up to 128K output tokens (beta)—more than 15x longer than other Claude models. This expanded capability is particularly effective for extended thinking use cases involving complex reasoning, rich code generation, and comprehensive content creation.

This feature can be enabled by passing an anthropic-beta header of output-128k-2025-02-19.
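
For example, with the Python SDK the header can be passed per request (a sketch; extra_headers is a standard SDK option, and the prompt is illustrative):

import anthropic

client = anthropic.Anthropic()

# Streaming is used because max_tokens exceeds 21,333 (see above).
with client.messages.stream(
    model="claude-3-7-sonnet-20250219",
    max_tokens=128000,
    thinking={"type": "enabled", "budget_tokens": 32000},
    messages=[{"role": "user", "content": "Write a comprehensive technical report."}],
    extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)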

When using extended thinking with longer outputs, you can allocate a larger thinking budget to support more thorough reasoning, while still having ample tokens available for the final response.

We suggest using streaming or batch mode with this extended output capability; for more details see our guidance on network reliability considerations for long requests.

Using extended thinking with prompt caching

Prompt caching with thinking has several important considerations:

Thinking block inclusion in cached prompts

  • Thinking blocks are only generated as part of the current assistant turn and are not meant to be cached.
  • Thinking blocks from previous turns are ignored.
  • If thinking is later disabled, any thinking content passed to the API is simply ignored.

Cache invalidation rules

  • Alterations to thinking parameters (enabling/disabling or budget changes) invalidate cache breakpoints set in messages.
  • System prompts and tools maintain caching even when thinking parameters change.

Examples of prompt caching with extended thinking
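
A sketch of caching a large system prompt while extended thinking is enabled (cache_control usage follows the prompt caching API; the document content is illustrative):

import anthropic

client = anthropic.Anthropic()

LONG_DOCUMENT = "..."  # a large reference document worth caching

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    system=[{
        "type": "text",
        "text": LONG_DOCUMENT,
        # Cached system prompts survive changes to thinking parameters;
        # cache breakpoints set in messages do not.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize the key points."}],
)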

Max tokens and context window size with extended thinking

In older Claude models (prior to Claude 3.7 Sonnet), if the sum of prompt tokens and max_tokens exceeded the model’s context window, the system would automatically adjust max_tokens to fit within the context limit. This meant you could set a large max_tokens value and the system would silently reduce it as needed.

With Claude 3.7 Sonnet, max_tokens (which includes your thinking budget when thinking is enabled) is enforced as a strict limit. The system will now return a validation error if prompt tokens + max_tokens exceeds the context window size.

How context window is calculated with extended thinking

When calculating context window usage with thinking enabled, there are some considerations to be aware of:

  • Thinking blocks from previous turns are stripped and not counted towards your context window
  • Current turn thinking counts towards your max_tokens limit for that turn


The effective context window is calculated as:

context window =
  (current input tokens - previous thinking tokens) +
  (thinking tokens + redacted thinking tokens + text output tokens)

We recommend using the token counting API to get accurate token counts for your specific use case, especially when working with multi-turn conversations that include thinking.
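
A sketch of counting prompt tokens before choosing max_tokens (count_tokens is the token counting endpoint; the headroom logic and limits shown are illustrative):

import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Analyze the following dataset..."}]

count = client.messages.count_tokens(
    model="claude-3-7-sonnet-20250219",
    messages=messages,
)

CONTEXT_WINDOW = 200_000  # Claude 3.7 Sonnet context window
# prompt tokens + max_tokens must not exceed the context window,
# and max_tokens includes the thinking budget.
max_tokens = min(64_000, CONTEXT_WINDOW - count.input_tokens)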

You can read through our guide on context windows for a more thorough deep dive.

Managing tokens with extended thinking

Given the new context window and max_tokens behavior of extended thinking models like Claude 3.7 Sonnet, you may need to:

  • More actively monitor and manage your token usage
  • Adjust max_tokens values as your prompt length changes
  • Potentially use the token counting endpoints more frequently
  • Be aware that previous thinking blocks don’t accumulate in your context window

This change has been made to provide more predictable and transparent behavior, especially as maximum token limits have increased significantly.

Extended thinking with tool use

When using extended thinking with tool use, be aware of the following behavior pattern:

  1. First assistant turn: When you send an initial user message, the assistant response will include thinking blocks followed by tool use requests.

  2. Tool result turn: When you pass the user message with tool result blocks, the subsequent assistant message will not contain any additional thinking blocks.

In more detail, the normal order of a tool use conversation with thinking follows these steps:

  1. User sends initial message
  2. Assistant responds with thinking blocks and tool requests
  3. User sends message with tool results
  4. Assistant responds with either more tool calls or just text (no thinking blocks in this response)
  5. If more tools are requested, repeat steps 3-4 until the conversation is complete

This design allows Claude to show its reasoning process before making tool requests, but not repeat the thinking process after receiving tool results. Claude will not output another thinking block until after the next non-tool_result user turn.
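
A sketch of the tool result turn, passing the assistant's thinking and tool_use blocks back unmodified (client, tools, and first_response carry over from the initial request; the tool_use_id and tool result content are illustrative):

# first_response.content contains thinking + tool_use blocks
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    # Pass the assistant content back complete and unmodified,
    # including its thinking block(s).
    {"role": "assistant", "content": first_response.content},
    {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": "toolu_01...",  # id from the tool_use block
            "content": "18°C, partly cloudy",
        }],
    },
]

follow_up = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    tools=tools,  # same tool definitions as the first request
    messages=messages,
)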


Preserving thinking blocks

During tool use, you must pass thinking and redacted_thinking blocks back to the API, complete and unmodified. This is critical for maintaining the model’s reasoning flow and conversation integrity.

While you can omit thinking and redacted_thinking blocks from prior assistant role turns, we suggest always passing back all thinking blocks to the API for any multi-turn conversation. The API will:

  • Automatically filter the provided thinking blocks
  • Use the relevant thinking blocks necessary to preserve the model’s reasoning
  • Only bill for the input tokens for the blocks shown to Claude

Why thinking blocks must be preserved

When Claude invokes tools, it pauses construction of a response to await external information. When tool results are returned, Claude continues building that existing response. This necessitates preserving thinking blocks during tool use, for two reasons:

  1. Reasoning continuity: The thinking blocks capture Claude’s step-by-step reasoning that led to tool requests. When you post tool results, including the original thinking ensures Claude can continue its reasoning from where it left off.

  2. Context maintenance: While tool results appear as user messages in the API structure, they’re part of a continuous reasoning flow. Preserving thinking blocks maintains this conceptual flow across multiple API calls.

Important: When providing thinking or redacted_thinking blocks, the entire sequence of consecutive thinking or redacted_thinking blocks must match the outputs generated by the model during the original request; you cannot rearrange or modify the sequence of these blocks.

Tips for making the best use of extended thinking mode

To get the most out of extended thinking:

  1. Set appropriate budgets: Start with larger thinking budgets (16,000+ tokens) for complex tasks and adjust based on your needs.

  2. Experiment with thinking token budgets: The model may perform differently at different maximum thinking budget settings. Increasing the maximum thinking budget can make the model reason more deeply, at the cost of increased latency. For critical tasks, consider testing different budget settings to find the optimal balance between quality and performance.

  3. You do not need to remove previous thinking blocks yourself: The Anthropic API automatically ignores thinking blocks from previous turns and they are not included when calculating context usage.

  4. Monitor token usage: Keep track of thinking token usage to optimize costs and performance.

  5. Use extended thinking for particularly complex tasks: Enable thinking for tasks that benefit from step-by-step reasoning like math, coding, and analysis.

  6. Account for extended response time: Factor in that generating thinking blocks may increase overall response time.

  7. Handle streaming appropriately: When streaming, be prepared to handle both thinking and text content blocks as they arrive.

  8. Prompt engineering: Review our extended thinking prompting tips if you want to maximize Claude’s thinking capabilities.

Next steps