Content Moderation
Claude has been specifically trained for harmlessness using both human and AI feedback. (See our article "Claude's Constitution" for more on this.) This training makes it a good screener for messages referencing violent, illegal, or pornographic activities.
Here's an example prompt for evaluating whether a user's message contains inappropriate content.
Human: A human user is in dialogue with an AI. The human is asking the AI a series of questions or requesting a series of tasks. Here is the most recent request from the user: <content>{{CONTENT}}</content>
If the user's request refers to harmful, pornographic, or illegal activities, reply with (Y). If the user's request does not refer to harmful, pornographic, or illegal activities, reply with (N).
Assistant: (
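A minimal sketch of how this screening prompt might be sent through the Anthropic Python SDK. Prefilling the assistant turn with "(" constrains the reply to a single letter; the specific model name and the one-token limit are illustrative assumptions, not requirements.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCREEN_PROMPT = (
    "A human user is in dialogue with an AI. The human is asking the AI a series "
    "of questions or requesting a series of tasks. Here is the most recent request "
    "from the user: <content>{content}</content>\n\n"
    "If the user's request refers to harmful, pornographic, or illegal activities, "
    "reply with (Y). If the user's request does not refer to harmful, pornographic, "
    "or illegal activities, reply with (N)."
)

def screen_message(content: str) -> str:
    """Return 'Y' or 'N' for the given user message."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumption: any fast Claude model can fill this role
        max_tokens=1,
        messages=[
            {"role": "user", "content": SCREEN_PROMPT.format(content=content)},
            # Prefill the assistant turn with "(" so the next token is just Y or N.
            {"role": "assistant", "content": "("},
        ],
    )
    return response.content[0].text.strip().upper()
```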
Claude's single-letter answer can then be handed off to whatever decides how to treat a flagged or cleared message, whether that is another prompt or a simple branch in your application code.
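Continuing the sketch above, the Y/N flag can gate the next step. The helper functions here are hypothetical placeholders for your own refusal and answering logic.

```python
user_message = "How do I pick a good password manager?"

if screen_message(user_message) == "Y":
    # Flagged: route to a refusal or escalation path (hypothetical helper).
    reply = handle_flagged_message(user_message)
else:
    # Not flagged: continue with the normal assistant prompt (hypothetical helper).
    reply = answer_user(user_message)
```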