Content Moderation

Claude has been specifically trained for harmlessness using both human and AI feedback. (See our article "Claude's Constitution" for more on this.) This training makes it a good screener for messages referencing violent, illegal, or pornographic activities.

Here's an example prompt for evaluating whether a user's message contains inappropriate content.

| Role | Prompt |
| --- | --- |
| User | A human user is in dialogue with an AI. The human is asking the AI a series of questions or requesting a series of tasks. Here is the most recent request from the user: `<content>{{CONTENT}}</content>`<br /><br />If the user's request refers to harmful, pornographic, or illegal activities, reply with (Y). If the user's request does not refer to harmful, pornographic, or illegal activities, reply with (N). |
| Assistant (prefill) | ( |
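
Below is a minimal sketch of how this prompt could be run with the Anthropic Python SDK, prefilling the Assistant turn with "(" so the reply completes to (Y) or (N). The model name, token limit, and the `screen_content` helper are illustrative assumptions, not part of the prompt above.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCREEN_PROMPT = """A human user is in dialogue with an AI. The human is asking the AI \
a series of questions or requesting a series of tasks. Here is the most recent request \
from the user: <content>{content}</content>

If the user's request refers to harmful, pornographic, or illegal activities, reply with (Y). \
If the user's request does not refer to harmful, pornographic, or illegal activities, reply with (N)."""


def screen_content(content: str) -> str:
    """Return 'Y' or 'N' for the given user message."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder; any current Claude model should work
        max_tokens=5,
        messages=[
            {"role": "user", "content": SCREEN_PROMPT.format(content=content)},
            # Prefill the assistant turn with "(" so the completion is "Y)" or "N)".
            {"role": "assistant", "content": "("},
        ],
    )
    # The completion excludes the prefill, e.g. "Y)"; strip it down to a single letter.
    return response.content[0].text.strip().strip(")").upper()
```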

Claude's answer can then be passed to another prompt, or to application logic, that decides what to do given a (Y) or (N) response. For more examples, see our harmlessness screens.
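
Continuing the sketch above, a hypothetical routing function could branch on that answer, refusing flagged requests and forwarding the rest to the normal assistant prompt. The refusal text and model name are placeholders.

```python
def respond_or_refuse(user_message: str) -> str:
    """Route a message: refuse if the screener flags it, otherwise answer normally."""
    if screen_content(user_message) == "Y":
        # The screener judged the request harmful, pornographic, or illegal.
        return "Sorry, I can't help with that request."
    # Not flagged: forward the message to the regular assistant prompt.
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": user_message}],
    )
    return reply.content[0].text
```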