Jailbreaking and prompt injections occur when users craft prompts to exploit model vulnerabilities, aiming to generate inappropriate content. While Claude is inherently resilient to such attacks, here are additional steps to strengthen your guardrails.

Claude is far more resistant to jailbreaking than other major LLMs, thanks to advanced training methods like Constitutional AI.
  • Harmlessness screens: Use a lightweight model like Claude 3 Haiku to pre-screen user inputs.

  • Input validation: Filter prompts for jailbreaking patterns. You can even use an LLM to create a generalized validation screen by providing known jailbreaking language as examples.

  • Prompt engineering: Craft prompts that emphasize ethical boundaries.

  • Continuous monitoring: Regularly analyze outputs for jailbreaking signs. Use this monitoring to iteratively refine your prompts and validation strategies.

Advanced: Chain safeguards

Combine strategies for robust protection. Here’s an enterprise-grade example with tool use:

By layering these strategies, you create a robust defense against jailbreaking and prompt injections, ensuring your Claude-powered applications maintain the highest standards of safety and compliance.