Learning AI Red Teams and Jailbreaking the Fun Way

Rebecca // August 31 // 0 Comments

Five hacking games for AI and LLMs

 

Can you become an AI Red Team expert playing games?

Games are fun!  And games for learning how to jailbreak LLMs can also be  fun.  I tried five free AI hacking games designed to demonstrate AI and LLM security hacks.  If you are wondering which games exist, how good they are, and whether the games themselves protect your privacy, this article is for you.  Also, if you want more free learning resources on how to beat the games, I handpicked some options.  

I played the AI hacking games (also called CTFs) and reviewed the privacy policies.  I also combed through learning resources on AI security and curated a learning list.

Here are my main three main takeaways from the AI hacking games:

  1. More testing, less learning.  The games are better for testing knowledge than building it.  In my opinion, they don’t have enough tips or clues if you get stuck, and the associated Slack and Discord communities are  mildly helpful.
  2. GenAI security requires some special knowledge.  Since I was humbled when I played games and was not successful,  I looked for free learning resources how to beat these games.  There are some great free resources to learn about LLM attacks and generative AI security.  
  3. Security, yes.  Your privacy, not always: The companies and people offering these games don’t always do a great job of explaining what they do with our data, or how to opt out if we don’t want our game inputs saved or used to build their products. I would like to see security companies do better.  For each game, I read (or tried to find) a privacy policy on how they use your guesses, and I summarize my findings below. 

So if you want to build hacking skills, how do you learn about AI security and LLM jailbreaking in particular?  I went through several free learning resources, and highlight what you can learn from each one that the others don’t offer.   

Side note: I’m not a lawyer, and I’m not offering a legal interpretation of the privacy policies.  (In any case, I should not need to be a lawyer to read a consumer-facing document.) 

The AI Security Games

Game and Developer

Fun Score

 1 = "eh" 

3 = "I wanted to play more"

Privacy Score

1 = "creepy" or "I couldn't tell"

3 = "They do care about my privacy!"

Gandalf by Lakara

3

1

Red by Giskard

3

2

Prompt Airlines CTF by Wiz.io

3

3

HackMerlin by individual bgalek

3

2

CTF Challenge by Invariant Labs

1

2

Gandalf by Lakera

 https://gandalf.lakera.ai/intro

You may have already heard about this one word-of-mouth.  This game tests jailbreaking skills on an LLM.  The game design includes quick feedback and multiple levels.  You will learn how to design malicious prompts to get a secret password from a chatbot.  The game progressively moves you to hard challenges.   There is a Slack channel where you can chat with other people commiserating over or celebrating the difficulty. However, I stopped playing it because I was a bit disappointed that the company wasn’t so clear on its data policy. 

Privacy and Copyright:  Your game guesses are used to help improve Lakera’s product.  They do collect your prompts to train new LLMs on what prompt injections look like.  I could not find any way to opt out of the data collection and storage.  

Red by Giskard

https://red.giskard.ai

Like Gandalf, this game is designed with several progressively harder levels.  There are different levels to test whether you can break an LLM with a short prompt. This game distinguishes itself by including different types of hacks. Giskard offers a Discord community, and while it didn't have many game tips, I liked the "AI-news" section for general learning.  

Privacy and Copyright: https://red.giskard.ai/faq  did not include the privacy policy when I reviewed it.  I reached out to the company and they promised to add a privacy policy.  They also said, "The data is used to calculate scores and nothing else, not even for research (we would have collected explicit consent otherwise)."  That's good, and I'll update the score when the privacy policy is updated. 

Prompt Airlines CTF: Test Your AI Security Skills

https://www.wiz.io/blog/prompt-airlines-ai-security-challenge

This idea for this game is ripped from the headlines of real chatbot challenges.  You should trick an airline chatbot to give away a free ticket.  This game starts with some easy prompt injections to help guide your learning path.  You can see the guardrail instructions given to the chatbot, which is useful for learners.  Like the other games, you interact with a chatbot and try to jailbreak its instructions to get hidden information. 

Privacy and Copyright: The privacy policy was helpful and explicitly addresses the game. https://legal.wiz.io/legal#personal-info-collect-and-use  says “When you participate in our community research, contest, and education websites (e.g., capture the flag competitions, challenges, etc.).”  If I’m reading it correctly, your information is used for leaderboards, but they are not storing your guesses or re-using them to train their models.  

HackMerlin by BGalek

https://hackmerlin.io/

HackMerlin allows you to test different prompts to try to jailbreak the system. This game follows the same design as the Gandalf game.   You can learn different styles of prompts that might help reveal a password.  You get quick feedback and multiple chances to try the game.  Try strategies like using different languages or even asking directly!  

Privacy and Copyright:  Unlike the other games, it is not offered by an AI security company and seems to be BGalek’s pet project.  So in theory, you can spend some time finding out exactly what is going on with your data if you find the right repository and poke around.   I gave this one a 2 out of 3 in privacy because it is open source.  I graded on a scale, as it is just a developer sharing the code, not a company developing privacy and security products 

CTF Challenge: Invariant Labs

 https://invariantlabs.ai/ctf-challenge-24

(Update Sept 11, 2024: This CTF is closed for competition, but ATM the challenge is still available to try out.)

This “HackMe” website by Invariant Labs tries capture some of the challenges of testing an agent in a multi-user setting, so it is a bit more advanced than the simple jailbreaking games above. This is designed as a CTF, not a learning opportunity.  As a result, I found the setup a bit confusing -- it wasn’t clear where to input guesses and how to see the results.  Hint: You put your guess “attacks” in a form field labeled “feedback.”  The creators set the game up as a contest, so you don’t get a lot of help in the associated Discord community.  It also takes a multi-step process to see the results of your attack.   I learned some potential attacks with a combination of the playground mode and the Discord server. 

Privacy and Copyright:  After I wrote this blog and asked the company founders about their privacy policy, they added this statement, "Prompts collected during the challenge will be anonymized and moderated, and then subsequently released as an open source dataset to foster education and collaboration in the AI security community."

More games that I didn’t try.

Interested in testing your AI agent or app? 

We offer AI auditing and privacy red teams.


Courses and Summaries

If the games aren’t enough to learn, then how do you learn about AI security and LLM jailbreaking in particular? Jailbreaking LLMs and creating AI red teams requires domain knowledge and experience.   I went through several free learning resources, and highlight what you can learn from each one that the others don’t offer.    The free resources in this list aim to educate. 

PIPE - Prompt Injection Primer for Engineers from JTHACK

https://github.com/jthack/PIPE

I recommend this article for understanding if and why you should worry about prompt injection. 

This is a nice article from the GitHub of Joe Thacker, (Joe the Hacker?) who works for AppOmni. This resource helps put prompt injection attacks into a bigger perspective.   It answers questions like: "Should you worry about prompt injections?", and "What should you worry about?"  I appreciated the flowchart on who is vulnerable to prompt injection for putting it into context.   I also deeply appreciated the section on mitigations. 


Jailbreaking Large Language Models: Techniques, Examples, Prevention Methods from Lakera

https://www.lakera.ai/blog/jailbreaking-large-language-models-guide

I recommend this one for getting insights into what it is about LLMs that make certain attacks successful. 

I joined the Lakera Slack community to get some tips on their Gandalf game. When one person asked about “best website to teach” how to beat Gandalf, another person said to go read the Lakera blog.  While this reply felt a bit like  “RTFM”, I skimmed several of Lakera’s blogs to find the most useful one for the game.  It includes tips such as, “Jailbreak prompts are typically longer than normal prompts” and “Use words like “dan”, “like”, “must”, “anything”, “example”, “answer” etc.

This article had nice explanations and examples from the research literature.  It also categorizes the types of prompts nicely.


Red Teaming LLM Applications - DeepLearning.AI from Giskard

 https://www.deeplearning.ai/short-courses/red-teaming-llm-applications/

I recommend this one for the hands-one Jupyter notebooks and clear examples.  Also, it covers aspects of responsible AI such as fairness and toxicity.  

I don’t know how DeepLearning.AI is funded and how they manage to host great classes for free.  Still, I’ve found several of the free courses to be helpful.  This one is from the makers of Giskard has several videos and Jupyter notebooks. It is well-designed and thoughtful.  The last few chapters are about how to use Giskard in particular. 


PromptMap from UTKUSEN

https://github.com/utkusen/promptmap

This reading has more categories and examples of types of prompt injections!  

This Promptmap documentation from individual Utku Sen (https://utkusen.com/en/) includes several examples of prompt types and examples.  The categories are different from the other resources listed here, so it is worth reading to expand the scope of jailbreak prompts.  For example, translation hacks are one of my favorite prompt injection techniques, and I didn't find it described in the above resources. 


MITRE Atlas

https://atlas.mitre.org/matrices/ATLAS

Read this to understand how to chain other tactics with jailbreaks, or where jailbreaks can fit into an adversary's arsenal.  

MITRE is the go-to place for large classifications of security threats, and frankly, it can feel overwhelming to people new to the scene.  They have specifically developed a threat matrix for ML models.  If reading lists is your jam, you can’t go wrong here.  


Humane Intelligence Resources

https://www.humane-intelligence.org/resources

Review these resources to understand AI red teaming beyond deliberately malicious prompts.  

“Humane Intelligence is a tech nonprofit building a community of practice around algorithmic evaluations.”  I'm fully behind their mission.  I believe AI safety must include responsible AI and accuracy for everyone’s lived experience, so I would be remiss if I didn’t include some resources that go beyond prompt injection or how to beat the games I listed at the beginning of this article.   Read the description of public red teaming to get a sense of why red teaming should be democratized. 


Other ways to learn about AI safety and AI Red Teaming: 

So have fun and keep learning.



Get your free E-Book here

Start your journey to adversarial privacy testing with our free E-book.  I've written this book for privacy and security professionals who want to understand privacy red teams and privacy pen testing.

  1. 1
    When is an adversarial privacy test helpful?
  2. 2
    Who are privacy adversaries and what are their motivations?
  3. 3
    When to build a team in-house versus hiring an external team?
Adversarial PRIVACY TESTING
About the Author Rebecca

Dr. Rebecca Balebako builds data protection and trust into software products. As a certified privacy professional (CIPP/E, CIPT, Fellow of Information Privacy) ex-Googler, and ex-RANDite, she has helped multiple organizations improve their responsible AI and ML programs.

Our Vision

 We work together with companies to build Responsible AI solutions that are lasting and valuable. 

Privacy by Default

respect

Quality Process

HEALTH

Inclusion

>