Process and procedures are crucial for effective and safe AI red teams
Unleashing a red team on an AI system is one way to find risks and vulnerabilities. However, if the exercise is not executed carefully, it can have unintended consequences. This is where process reigns supreme. With clear procedures in place, the AI red team becomes a strategic asset rather than a disruptive force.
Imagine a four-step symphony. First, the team carefully models the AI system and its environment. Then, a detailed plan outlines attack vectors and safeguards. Next, the exercise unfolds, guided by the plan and monitored for unexpected dangers. Finally, clear communication ensures that findings are shared and vulnerabilities are addressed. This structured approach isn't just about efficiency; it harnesses the red team's power while ensuring that the people involved, and the AI system itself, emerge stronger and more secure.
AI red teams should follow a four-step process: threat model, plan, run the exercise, and communicate.
These four steps are meant to ensure that AI red teams protect people, protect the red-team testers themselves, and deliver practical outcomes that can be addressed and fixed.
- Threat model. Define what risks or vulnerabilities will be included in the red-team exercise. Understand the potential concerns, the context, and the harms that will be investigated.
- Plan for the exercise. Establish clear rules of engagement and prepare the tools needed to run the exercise. Any red-team exercise that could affect people or systems beyond the red-team members should undergo an ethical review before it starts (perhaps similar to an institutional review board, or IRB, for human-subjects research). Also consider whether the red-team testers themselves may be exposed to traumatic or violent content, and plan how to mitigate these harms.
- Run the exercise within a predefined scope and with clear objectives. Log each step of the exercise in detail so that it can be replicated or compared afterward. This logging can resemble a chemistry lab notebook: even failed attempts should be recorded. I've seen the best results when red-team members can communicate with each other during the exercise to share ideas and overcome roadblocks.
- Communicate and remediate the vulnerabilities found. Red-team testers need to be creative in understanding AI risks and how to uncover them. Red teams should explain to decision-makers, in writing or in presentations, what the risks were, why they mattered, and, where possible, what a fix might look like. This may mean that an AI red team also needs a member who specializes in communicating and organizing remediation.
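To make the scoping and lab-notebook logging described in the steps above more concrete, here is a minimal Python sketch of an exercise log. It assumes that attempts are appended to a JSON Lines file and that the plan's agreed scope can be expressed as a set of objective names. The `Attempt` and `ExerciseLog` classes and all of their field names are hypothetical illustrations, not a standard schema or an existing tool.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Attempt:
    """One red-team attempt. Field names are hypothetical, not a standard schema."""
    tester: str
    objective: str     # which in-scope objective from the plan this attempt targets
    technique: str     # free-text description of the approach tried
    input_text: str
    output_text: str
    succeeded: bool    # failed attempts are recorded too, like a lab notebook
    notes: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class ExerciseLog:
    """Append-only JSON Lines log so attempts can be replayed or compared later."""

    def __init__(self, path, in_scope_objectives):
        self.path = Path(path)
        self.in_scope = set(in_scope_objectives)

    def record(self, attempt: Attempt) -> None:
        # Reject attempts whose objective is outside the agreed scope,
        # so scope drift is flagged during the exercise, not afterward.
        if attempt.objective not in self.in_scope:
            raise ValueError(f"objective {attempt.objective!r} is out of scope")
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(attempt)) + "\n")

    def load(self):
        # Return every logged attempt as a dict, in the order recorded.
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def summary(self):
        # Map each objective to (total attempts, successful attempts) for reporting.
        counts = {}
        for entry in self.load():
            total, hits = counts.get(entry["objective"], (0, 0))
            counts[entry["objective"]] = (total + 1, hits + int(entry["succeeded"]))
        return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = ExerciseLog(Path(tmp) / "exercise.jsonl", {"jailbreak", "data-leakage"})
        log.record(Attempt("alice", "jailbreak", "role-play framing",
                           "<prompt>", "<refusal>", succeeded=False))
        log.record(Attempt("bob", "jailbreak", "multi-turn escalation",
                           "<prompt>", "<harmful output>", succeeded=True,
                           notes="worked on turn 4"))
        print(log.summary())  # {'jailbreak': (2, 1)}
```

Writing one JSON object per line keeps the log easy to diff, search, and replay, and the per-objective summary gives a starting point for the written report to decision-makers. A real exercise would likely need more fields (model version, system prompt, reviewer sign-off), and sensitive outputs may need restricted storage.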
Throughout this process, AI red teams should have access to decision-makers with authority, should have no work responsibilities that create a conflict of interest with AI safety, and should not be penalized or dismissed for finding vulnerabilities.