Reducing latency
Latency refers to the time it takes for the model to process a prompt and and generate an output. Latency can be influenced by various factors, such as the size of the model, the complexity of the prompt, and the underlying infrastucture supporting the model and point of interaction.
It’s always better to first engineer a prompt that works well without model or prompt constraints, and then try latency reduction strategies afterward. Trying to reduce latency prematurely might prevent you from discovering what top performance looks like.
How to measure latency
When discussing latency, you may come across several terms and measurements:
- Baseline latency: This is the time taken by the model to process the prompt and generate the response, without considering the input and output tokens per second. It provides a general idea of the model’s speed.
- Time to first token (TTFT): This metric measures the time it takes for the model to generate the first token of the response, from when the prompt was sent. It’s particularly relevant when you’re using streaming (more on that later) and want to provide a responsive experience to your users.
For a more in-depth understanding of these terms, check out our glossary.
How to reduce latency
1. Choose the right model
One of the most straightforward ways to reduce latency is to select the appropriate model for your use case. Anthropic offers a range of models with different capabilities and performance characteristics. Consider your specific requirements and choose the model that best fits your needs in terms of speed and output quality. For more details about model metrics, see our models overview page.
2. Optimize prompt and output length
Minimize the number of tokens in both your input prompt and the expected output, while still maintaining high performance. The fewer tokens the model has to process and generate, the faster the response will be.
Here are some tips to help you optimize your prompts and outputs:
- Be clear but concise: Aim to convey your intent clearly and concisely in the prompt. Avoid unnecessary details or redundant information, while keeping in mind that claude lacks context on your use case and may not make the intended leaps of logic if instructions are unclear.
- Ask for shorter responses:: Ask Claude directly to be concise. The Claude 3 family of models has improved steerability over previous generations. If Claude is outputting unwanted length, ask Claude to curb its chattiness.
Due to how LLMs count tokens instead of words, asking for an exact word count or a word count limit is not as effective a strategy as asking for paragraph or sentence count limits.
- Set appropriate output limits: Use the
max_tokens
parameter to set a hard limit on the maximum length of the generated response. This prevents Claude from generating overly long outputs.Note: When the response reaches
max_tokens
tokens, the response will be cut off, perhaps midsentence or mid-word, so this is a blunt technique that may require post-processing and is usually most appropriate for multiple choice or short answer responses where the answer comes right at the beginning. - Experiment with temperature: The
temperature
parameter controls the randomness of the output. Lower values (e.g., 0.2) can sometimes lead to more focused and shorter responses, while higher values (e.g., 0.8) may result in more diverse but potentially longer outputs.
Finding the right balance between prompt clarity, output quality, and token count may require some experimentation.
3. Leverage streaming
Streaming is a feature that allows the model to start sending back its response before the full output is complete. This can significantly improve the perceived responsiveness of your application, as users can see the model’s output in real-time.
With streaming enabled, you can process the model’s output as it arrives, updating your user interface or performing other tasks in parallel. This can greatly enhance the user experience and make your application feel more interactive and responsive.
Visit streaming Messages to learn about how you can implement streaming for your use case.
Was this page helpful?