Content Moderation

Claude has been specifically trained for harmlessness using both human and AI feedback. (See our article "Claude's Constitution" for more on this.) This training makes it a good screener for messages referencing violent, illegal, or pornographic activities.

Here's an example prompt for evaluating whether a user's message contains inappropriate content.


Human: A human user is in dialogue with an AI.  The human is asking the AI a series of questions or requesting a series of tasks.  Here is the most recent request from the user:  <content>{{CONTENT}}</content>

If the user's request refers to harmful, pornographic, or illegal activities, reply with (Y). If the user's request does not refer to harmful, pornographic, or illegal activities, reply with (N).

Assistant: (
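
To run this screen from code, here is a minimal sketch assuming the anthropic Python SDK's Text Completions interface; the model name, client setup, and the moderate helper are illustrative choices, not part of the prompt above.

```python
# Minimal sketch, assuming the `anthropic` Python SDK's Text Completions
# interface; the model name and helper name are illustrative.
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def moderate(content: str) -> bool:
    """Return True if Claude flags the request as harmful, pornographic, or illegal."""
    prompt = (
        f"{HUMAN_PROMPT} A human user is in dialogue with an AI. "
        "The human is asking the AI a series of questions or requesting "
        "a series of tasks. Here is the most recent request from the user: "
        f"<content>{content}</content>\n\n"
        "If the user's request refers to harmful, pornographic, or illegal "
        "activities, reply with (Y). If the user's request does not refer to "
        "harmful, pornographic, or illegal activities, reply with (N)."
        f"{AI_PROMPT} ("  # prefill "(" so the completion is just "Y)" or "N)"
    )
    response = client.completions.create(
        model="claude-2",        # illustrative model name
        max_tokens_to_sample=1,  # only the Y or N is needed
        prompt=prompt,
    )
    return response.completion.strip().startswith("Y")
```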

Claude's answer can then be passed to another prompt, or to application logic, that decides what to do with a (Y) or a (N).
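
For example, here is a minimal sketch of routing on that answer, building on the moderate helper above; the refusal text and the pass-through call are hypothetical placeholders for whatever your application does next.

```python
# Minimal sketch of acting on the screener's verdict; the refusal text
# and the follow-up completion call are hypothetical placeholders.
def handle_request(content: str) -> str:
    if moderate(content):
        # Claude answered (Y): block, log, or escalate the request.
        return "Sorry, I can't help with that request."
    # Claude answered (N): forward the original request to the main prompt.
    response = client.completions.create(
        model="claude-2",
        max_tokens_to_sample=300,
        prompt=f"{HUMAN_PROMPT} {content}{AI_PROMPT}",
    )
    return response.completion
```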