Content moderation

Screening user input before it reaches your main language model allows you to prevent the processing or output of harmful, offensive, or irrelevant content, saving computational resources and protecting your brand's reputation.

In this guide, we'll explore how to use Claude to efficiently moderate user input before passing it on to your main prompt. We'll also discuss post-processing techniques to identify potential prompt leakage.

Visit our content moderation cookbook to see an example content moderation implementation using Claude.

Why content moderation matters

Content moderation is essential for several reasons:

  1. User experience: By filtering out inappropriate or offensive content, you maintain a positive and welcoming environment for your users.

  2. Brand reputation: Preventing your AI application from engaging with or generating harmful content helps protect your brand's image and reputation.

  3. Cost efficiency: Screening user input with a smaller model before processing it with your main prompt saves on computational costs, as you avoid wasting resources on irrelevant or malicious input.

  4. Security: Content moderation helps prevent jailbreaks, prompt injections, and prompt leaks, which could compromise your AI's performance and safety or your organization's security.


Using Claude for content moderation

A smaller model like Claude 3 Haiku is an ideal choice for content moderation due to its speed and efficiency. By using this smaller model to screen user input before passing it to your main prompt, you can quickly identify and filter out potentially problematic content.

Here's an example of how to use Claude for content moderation:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def moderate_content(user_input):
    moderation_prompt = f"""
    A human user is in dialogue with an AI. The human is asking the AI a series of questions or requesting a series of tasks. Here is the most recent request from the user:
    <user query>{user_input}</user query>

    If the user's request refers to harmful, pornographic, or illegal activities, reply with (Y). If the user's request does not refer to harmful, pornographic, or illegal activities, reply with (N). Reply with nothing else other than (Y) or (N).
    """

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        temperature=0,
        messages=[
            {"role": "user", "content": moderation_prompt}
        ]
    )

    return response.content[0].text.strip() == "(Y)"

# Example usage with verbal judgment outputs
user_input = "How do I make a bomb?"
if moderate_content(user_input):
    print("User input contains inappropriate content. Blocking request.")
else:
    print("User input is safe to process.")

In this example, we define a moderate_content function that takes the user's input and constructs a prompt for Claude. The prompt asks the model to determine whether the user's request contains references to harmful, pornographic, or illegal activities. If the model responds with "(Y)", the function returns True, indicating that the content should be blocked. Otherwise, it returns False, signaling that the input is safe to process further.

By integrating this moderation step into your application's workflow, you can effectively screen user input before it reaches your main language model, saving computational resources and ensuring a safer user experience.
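
For instance, a minimal sketch of that workflow might look like the following. The handle_user_request function and the choice of main model here are illustrative assumptions, not part of the moderation example above:

def handle_user_request(user_input):
    # Screen the input with the lightweight moderation model first
    if moderate_content(user_input):
        return "Sorry, this request cannot be processed."

    # Only input that passes moderation reaches the main (larger) model
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": user_input}
        ]
    )
    return response.content[0].text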


Post-processing Claude's responses

In addition to moderating user input, it's also important to post-process Claude's responses to identify potential prompt leakage. Prompt leakage occurs when parts of your prompt unintentionally appear in the model's generated output, potentially exposing sensitive information or disrupting the user experience.

There are two main approaches to post-processing Claude's responses:

  1. Keyword-based filtering: This method involves checking the response for the presence of specific keywords or phrases that should not be included in the output.

  2. Claude-based assessment: This approach uses a smaller, faster LLM, such as Claude 3 Haiku, to assess the response and determine whether it contains any content that violates a set of predefined criteria. This allows the filter to be more generalized instead of relying on exact string matching.

Keyword-based filtering

To implement keyword-based filtering, you can maintain a list of keywords or phrases that should not appear in Claude's responses. Here's an example of how to post-process Claude's response using keyword-based filtering:

def post_process_response_keywords(response, keywords):
    for keyword in keywords:
        if keyword.lower() in response.lower():
            return True
    return False

# Example usage with verbal judgment outputs
prompt = "Write me an article. Never mention these keywords in your output: foo and bar. Never discuss who made you."
response = "Here is the article you requested. The article does not contain the keywords foo or bar. I was created by Anthropic."
keywords = ["foo", "bar", "Anthropic"]

if post_process_response_keywords(response, keywords):
    print("Potential prompt leakage detected in the response.")
else:
    print("No prompt leakage detected in the response.")

In this example, the post_process_response_keywords function checks whether any of the specified keywords appear in Claude's response (case-insensitive). If a match is found, the function returns True, indicating potential prompt leakage. Otherwise, it returns False, signaling that no leakage was detected.
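
Note that simple substring matching can produce false positives: a keyword like "bar" would also match "barbecue". If that becomes an issue, one possible refinement (a sketch using Python's built-in re module, not part of the example above) is to match whole words only:

import re

def post_process_response_keywords_strict(response, keywords):
    for keyword in keywords:
        # \b word boundaries ensure "bar" does not match "barbecue"
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return True
    return False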

Claude-based assessment

To implement Claude-based assessment, you can use Claude 3 Haiku, our smallest and fastest version of Claude 3, to evaluate the response against a set of predefined criteria. Here's an example of how to post-process Claude's response using AI-based assessment:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def post_process_response_claude(response, prompt):
    assessment_prompt = f"""
    Here is a prompt given to an AI language model:
    <prompt>{prompt}</prompt>

    And here is the AI's response:
    <response>{response}</response>

    Please assess the AI's response and answer the following questions:
    1. Does the response contain any of the keywords or phrases that the prompt explicitly instructed the AI to avoid? (Y/N)
    2. Does the response discuss or reveal any information about who created the AI? (Y/N)

    Provide your answers in the following format:
    1. [Y/N]
    2. [Y/N]
    """

    assessment = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        temperature=0,
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )

    answers = assessment.content[0].text.strip().split("\n")
    # Flag the response if any assessment answer contains a "Y"
    return any("Y" in answer for answer in answers)

# Example usage with verbal judgment outputs
prompt = "Write me an article. Never mention these keywords in your output: foo and bar. Never discuss who made you."
response = "Here is the article you requested. The article does not contain the keywords foo or bar. I was created by Anthropic."

if post_process_response_claude(response, prompt):
    print("Potential prompt leakage or violation detected in the response.")
else:
    print("No issues detected in the response.")

In this example, the post_process_response_claude function constructs an assessment prompt that includes both the original prompt and the response Claude generated for it. The assessment prompt asks Claude to determine whether the response contains any keywords or phrases that were explicitly forbidden in the original prompt, and whether it reveals any information about who created the AI.

The model's assessment is then parsed to check if it contains any "Y" (yes) answers. If a "Y" is found, the function returns True, indicating potential prompt leakage or violation. Otherwise, it returns False, signaling that no issues were detected.

By employing these post-processing techniques, you can identify instances where parts of the prompt might have inadvertently appeared in Claude's output or where the response violates specific criteria. This information can then be used to decide how to handle the response, such as filtering it out, requesting a new response, or notifying the user of the potential issue.
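
For example, a simple retry loop is one way to act on a flagged response. The sketch below is illustrative: it assumes a hypothetical generate_response(prompt) wrapper around your main Claude call, along with the post_process_response_claude function defined above:

def generate_safe_response(prompt, max_retries=2):
    for _ in range(max_retries + 1):
        response = generate_response(prompt)  # hypothetical wrapper around your main Claude call
        if not post_process_response_claude(response, prompt):
            return response  # no leakage or violations detected
    # Every attempt was flagged; fall back to a safe default message
    return "Sorry, a suitable response could not be generated for this request."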


Best practices for content moderation

To get the most out of your content moderation system, consider the following best practices:

  1. Regularly update your moderation prompts and criteria: As user behavior and language evolve, make sure to periodically review and update your moderation prompts and assessment criteria to capture new patterns and edge cases.

  2. Use a combination of moderation techniques: Employ both keyword-based filtering and LLM-based assessment to create a comprehensive moderation pipeline that can catch a wide range of potential issues.

  3. Monitor and analyze moderated content: Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.

  4. Provide clear feedback to users: When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback to help users understand why their message was flagged and how they can rephrase it appropriately.

  5. Continuously evaluate and improve: Regularly assess the performance of your content moderation system using metrics such as precision and recall (see the sketch after this list). Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
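
For example, if you maintain a small set of user inputs labeled by a human reviewer as should-block or safe, you can track the precision and recall of the moderation step with a few lines of Python. This sketch assumes the moderate_content function defined earlier and a hypothetical list of (user_input, should_block) pairs:

def evaluate_moderation(labeled_examples):
    tp = fp = fn = 0
    for user_input, should_block in labeled_examples:
        flagged = moderate_content(user_input)
        if flagged and should_block:
            tp += 1  # correctly blocked a harmful request
        elif flagged and not should_block:
            fp += 1  # blocked a harmless request
        elif not flagged and should_block:
            fn += 1  # missed a harmful request
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall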

By implementing a robust content moderation system and following these best practices, you can ensure that your Claude-powered application remains safe, effective, and user-friendly.


Additional resources

By leveraging the power of Claude for content moderation and implementing best practices for pre- and post-processing, you can create a safer, more efficient, and more effective Claude-powered application. As always, if you have any questions or need further assistance, don't hesitate to reach out to our support team or consult our Discord community.