Content moderation
Content moderation is a critical aspect of maintaining a safe, respectful, and productive environment in digital applications. In this guide, we’ll discuss how Claude can be used to moderate content within your digital application.
Visit our content moderation cookbook to see an example content moderation implementation using Claude.
Before building with Claude
Decide whether to use Claude for content moderation
There are several key indicators that an LLM like Claude is a better fit than a traditional ML or rules-based approach for content moderation, such as the need for nuanced semantic understanding of language, moderation policies that evolve over time, and interpretable explanations for each decision.
Generate examples of content to moderate
Before developing a content moderation solution, first create examples of content that should be flagged and content that should not be flagged. Ensure that you include edge cases and challenging scenarios that may be difficult for a content moderation system to handle effectively. Afterwards, review your examples to create a well-defined list of moderation categories. For instance, the examples generated by a social media platform might include the following:
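(The comments below are only an illustrative sketch, reusing examples discussed later in this guide; a real evaluation set should be far larger and drawn from your own platform's data.)

```python
# Illustrative example comments; some should pass moderation, others should be flagged.
user_comments = [
    "This movie was great, I really enjoyed it. The main actor really killed it!",
    "Delete this post now or you better hide. I am coming after you and your family.",
    "Stay away from the 5G cellphones!! They are using 5G to control you.",
    "It's a great time to invest in gold!",
]
```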
Effectively moderating these examples requires a nuanced understanding of language. In the comment "This movie was great, I really enjoyed it. The main actor really killed it!", the content moderation system needs to recognize that "killed it" is a metaphor, not an indication of actual violence. Conversely, despite the lack of explicit mentions of violence, the comment "Delete this post now or you better hide. I am coming after you and your family." should be flagged by the content moderation system.
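The moderation categories you identify can be captured in a simple list that is later interpolated into the moderation prompt. The list below is only a sketch based on common moderation taxonomies; adjust it to your own policies:

```python
# Example unsafe content categories (a sketch; tailor these to your platform).
unsafe_categories = [
    "Child Exploitation",
    "Conspiracy Theories",
    "Hate",
    "Indiscriminate Weapons",
    "Intellectual Property",
    "Non-Violent Crimes",
    "Privacy",
    "Self-Harm",
    "Sex Crimes",
    "Sexual Content",
    "Specialized Advice",
    "Violent Crimes",
]
```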
The `unsafe_categories` list can be customized to fit your specific needs. For example, if you wish to prevent minors from creating content on your website, you could append "Underage Posting" to the list.
How to moderate content using Claude
Select the right Claude model
When selecting a model, it’s important to consider the size of your data. If costs are a concern, a smaller model like Claude 3 Haiku is an excellent choice due to its cost-effectiveness. Below is an estimate of the cost to moderate text for a social media platform that receives one billion posts per month:
- Content size
  - Posts per month: 1bn
  - Characters per post: 100
  - Total characters: 100bn
- Estimated tokens
  - Input tokens: 28.6bn (assuming 1 token per 3.5 characters)
  - Percentage of messages flagged: 3%
  - Output tokens per flagged message: 50
  - Total output tokens: 1.5bn
- Claude 3 Haiku estimated cost
  - Input token cost: 28,600 MTok * $0.25/MTok = $7,150
  - Output token cost: 1,500 MTok * $1.25/MTok = $1,875
  - Monthly cost: $7,150 + $1,875 = $9,025
- Claude 3.5 Sonnet estimated cost
  - Input token cost: 28,600 MTok * $3.00/MTok = $85,800
  - Output token cost: 1,500 MTok * $15.00/MTok = $22,500
  - Monthly cost: $85,800 + $22,500 = $108,300
If you don't need per-message explanations, you can reduce output costs further by omitting the `explanation` field from the response.

Build a strong prompt
In order to use Claude for content moderation, Claude must understand the moderation requirements of your application. Let’s start by writing a prompt that allows you to define your moderation needs:
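(The helper below is a minimal sketch built on the `anthropic` Python SDK; the exact prompt wording and JSON schema are illustrative assumptions, not a fixed API.)

```python
import json
import anthropic

client = anthropic.Anthropic()

def moderate_message(message, unsafe_categories):
    # Join the unsafe categories into a newline-separated block for the prompt.
    unsafe_category_str = "\n".join(unsafe_categories)

    # Ask Claude to classify the message and reply with JSON only.
    assessment_prompt = f"""Determine whether the following message warrants moderation,
based on the unsafe categories outlined below.

Message:
<message>{message}</message>

Unsafe Categories:
<categories>
{unsafe_category_str}
</categories>

Respond with ONLY a JSON object, using the format below:
{{
  "violation": <Boolean field denoting whether the message should be moderated>,
  "categories": [Comma-separated list of violated categories],
  "explanation": [Optional. Only include if there is a violation.]
}}"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",  # a smaller model keeps per-message costs low
        max_tokens=200,
        temperature=0,  # deterministic output helps keep decisions consistent
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse Claude's JSON assessment into Python values.
    assessment = json.loads(response.content[0].text)
    return (
        assessment["violation"],
        assessment.get("categories", []),
        assessment.get("explanation"),
    )
```

Requesting JSON-only output at temperature 0 keeps the response easy to parse and the decisions consistent across identical inputs.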
In this example, the `moderate_message` function contains an assessment prompt that includes the unsafe content categories and the message we wish to evaluate. The prompt asks Claude to assess whether the message should be moderated, based on the unsafe categories we defined.
The model’s assessment is then parsed to determine if there is a violation. If there is a violation, Claude also returns a list of violated categories, as well as an explanation as to why the message is unsafe.
Evaluate your prompt
Content moderation is a classification problem. Thus, you can use the same techniques outlined in our classification cookbook to determine the accuracy of your content moderation system.
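As a rough illustration, assuming the `moderate_message` sketch and `unsafe_categories` list shown earlier plus scikit-learn for metrics, you could score predictions against human labels like this:

```python
from sklearn.metrics import precision_score, recall_score

# Tiny illustrative test set; real evaluations need many more labeled examples.
test_messages = [
    "This movie was great, I really enjoyed it. The main actor really killed it!",
    "Delete this post now or you better hide. I am coming after you and your family.",
]
gold_labels = [False, True]  # True means a human reviewer would moderate the message

# moderate_message returns (violation, categories, explanation); keep the boolean.
predictions = [moderate_message(msg, unsafe_categories)[0] for msg in test_messages]

print("Precision:", precision_score(gold_labels, predictions))
print("Recall:", recall_score(gold_labels, predictions))
```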
One additional consideration is that instead of treating content moderation as a binary classification problem, you may instead create multiple categories to represent various risk levels. Creating multiple risk levels allows you to adjust the aggressiveness of your moderation. For example, you might want to automatically block user queries that are deemed high risk, while users with many medium risk queries are flagged for human review.
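One way to sketch this is with a hypothetical `assess_risk_level` helper; the 0-3 scale, prompt wording, and JSON format below are illustrative assumptions rather than a fixed API:

```python
import json
import anthropic

client = anthropic.Anthropic()

def assess_risk_level(message, unsafe_categories):
    # Join the unsafe categories into a newline-separated block for the prompt.
    unsafe_category_str = "\n".join(unsafe_categories)

    # Ask Claude for a graded risk level rather than a binary decision.
    assessment_prompt = f"""Assess the risk level of the following message based on
the unsafe categories listed below.

Message:
<message>{message}</message>

Unsafe Categories:
<categories>
{unsafe_category_str}
</categories>

Assign a risk level:
0 - No risk
1 - Low risk
2 - Medium risk
3 - High risk

Respond with ONLY a JSON object, using the format below:
{{
  "risk_level": <Numeric risk level between 0 and 3>,
  "categories": [Comma-separated list of violated categories],
  "explanation": <Optional. Only include if the risk level is greater than 0.>
}}"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        temperature=0,
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Parse Claude's JSON assessment into Python values.
    assessment = json.loads(response.content[0].text)
    return (
        assessment["risk_level"],
        assessment.get("categories", []),
        assessment.get("explanation"),
    )
```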
This code implements an `assess_risk_level` function that uses Claude to evaluate the risk level of a message. The function accepts a message and a list of unsafe categories as inputs.
Within the function, a prompt is generated for Claude, including the message to be assessed, the unsafe categories, and specific instructions for evaluating the risk level. The prompt instructs Claude to respond with a JSON object that includes the risk level, the violated categories, and an optional explanation.
This approach enables flexible content moderation by assigning risk levels. It can be seamlessly integrated into a larger system to automate content filtering or flag comments for human review based on their assessed risk level. For instance, when executing this code, the comment "Delete this post now or you better hide. I am coming after you and your family." is identified as high risk due to its dangerous threat. Conversely, the comment "Stay away from the 5G cellphones!! They are using 5G to control you." is categorized as medium risk.
Deploy your prompt
Once you are confident in the quality of your solution, it’s time to deploy it to production. Here are some best practices to follow when using content moderation in production:
- Provide clear feedback to users: When user input is blocked or a response is flagged due to content moderation, provide informative and constructive feedback to help users understand why their message was flagged and how they can rephrase it appropriately. In the coding examples above, this is done through the `explanation` field in the Claude response.
- Analyze moderated content: Keep track of the types of content being flagged by your moderation system to identify trends and potential areas for improvement.
- Continuously evaluate and improve: Regularly assess the performance of your content moderation system using metrics such as precision and recall. Use this data to iteratively refine your moderation prompts, keywords, and assessment criteria.
Improve performance
In complex scenarios, it may be helpful to consider additional strategies to improve performance beyond standard prompt engineering techniques. Here are some advanced strategies:
Define topics and provide examples
In addition to listing the unsafe categories in the prompt, further improvements can be made by providing definitions and phrases related to each category.
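A sketch of this variant is shown below; the definitions are abbreviated examples rather than a complete policy, and the prompt format mirrors the earlier `moderate_message` sketch:

```python
import json
import anthropic

client = anthropic.Anthropic()

# Abbreviated example definitions; write one for every category you moderate.
unsafe_category_definitions = {
    "Specialized Advice": (
        "Content that provides financial, medical, or legal advice. "
        "Financial advice includes guidance on investments, stocks, etc."
    ),
    "Conspiracy Theories": "Content that promotes unfounded, false, or misleading theories.",
    "Violent Crimes": "Content that enables, encourages, or excuses violent acts or threats.",
}

def moderate_message_with_definitions(message, unsafe_category_definitions):
    # Format each category alongside its definition for the prompt.
    unsafe_category_str = "\n".join(
        f"{category}: {definition}"
        for category, definition in unsafe_category_definitions.items()
    )

    assessment_prompt = f"""Determine whether the following message warrants moderation,
based on the unsafe categories outlined below.

Message:
<message>{message}</message>

Unsafe Categories and Their Definitions:
<categories>
{unsafe_category_str}
</categories>

Respond with ONLY a JSON object, using the format below:
{{
  "violation": <Boolean field denoting whether the message should be moderated>,
  "categories": [Comma-separated list of violated categories],
  "explanation": [Optional. Only include if there is a violation.]
}}"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        temperature=0,
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    assessment = json.loads(response.content[0].text)
    return (
        assessment["violation"],
        assessment.get("categories", []),
        assessment.get("explanation"),
    )
```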
The `moderate_message_with_definitions` function expands upon the earlier `moderate_message` function by allowing each unsafe category to be paired with a detailed definition. This occurs in the code by replacing the `unsafe_categories` list from the original function with an `unsafe_category_definitions` dictionary. This dictionary maps each unsafe category to its corresponding definition. Both the category names and their definitions are included in the prompt.
Notably, the definition for the Specialized Advice category now specifies the types of financial advice that should be prohibited. As a result, the comment "It's a great time to invest in gold!", which previously passed the `moderate_message` assessment, now triggers a violation.
Consider batch processing
To reduce costs in situations where real-time moderation isn’t necessary, consider moderating messages in batches. Include multiple messages within the prompt’s context, and ask Claude to assess which messages should be moderated.
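A sketch of one way to do this follows; the message numbering scheme and response format are illustrative assumptions:

```python
import json
import anthropic

client = anthropic.Anthropic()

def batch_moderate_messages(messages, unsafe_categories):
    # Number each message so Claude can reference violations by id.
    messages_str = "\n".join(
        f'<message id="{idx}">{msg}</message>' for idx, msg in enumerate(messages)
    )
    unsafe_category_str = "\n".join(unsafe_categories)

    assessment_prompt = f"""Determine which of the following messages warrant moderation,
based on the unsafe categories outlined below.

Messages:
<messages>
{messages_str}
</messages>

Unsafe Categories:
<categories>
{unsafe_category_str}
</categories>

Respond with ONLY a JSON object listing every violating message, using the format below:
{{
  "violations": [
    {{
      "id": <message id>,
      "categories": [list of violated categories],
      "explanation": <explanation of why the message violates the categories>
    }},
    ...
  ]
}}"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1000,  # batches need more room for multiple violation entries
        temperature=0,
        messages=[{"role": "user", "content": assessment_prompt}],
    )

    # Return only the list of violations; non-violating messages are omitted.
    return json.loads(response.content[0].text).get("violations", [])
```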
In this example, the `batch_moderate_messages` function handles the moderation of an entire batch of messages with a single Claude API call.
Inside the function, a prompt is created that includes the list of messages to evaluate, the defined unsafe content categories, and their descriptions. The prompt directs Claude to return a JSON object listing all messages that contain violations. Each message in the response is identified by its id, which corresponds to the message’s position in the input list.
Keep in mind that finding the optimal batch size for your specific needs may require some experimentation. While larger batch sizes can lower costs, they might also lead to a slight decrease in quality. Additionally, you may need to increase the `max_tokens` parameter in the Claude API call to accommodate longer responses. For details on the maximum number of tokens your chosen model can output, refer to the model comparison page.