About the challenge
Redwood Research, in collaboration with MATS and Constellation, invites you to build Model Organisms of Alignment Faking. What is this, you may ask? A model organism is an LLM modified to have a behavior we care about. We define alignment faking as an LLM that behaves safely in a testing environment (observed), but dangerously in a production environment (unobserved). Think of a kid that does homework only when the parents are watching, but starts setting fires and blackmailing people when they aren’t around. This unique opportunity allows participants to push the boundaries of Large Language Models (LLMs) while working with leading AI safety research organizations.
Get started
Teams MUST be registered and accepted on luma
Checkout the notion
Join the discord
Make a fork of https://github.com/redwoodresearch/hackathon-af
And contribute system prompts, fine tuned model organisms, and environments!
Requirements
What to build
-
- Check the notion for a detailed submission guide
What to Submit
A pr on https://github.com/redwoodresearch/hackathon-af (or link to your fork) containing new system prompts, fine tuning datasets, and environments
Prizes
Blue Team (Best Detector)
First Place: $250
Second Place: $150
Third Place: $50
Red Team (Best Examples of Alignment Faking)
First Place: $250
Second Place: $150
Third Place: $50
Devpost Achievements
Submitting to this hackathon could earn you:
Judges
Joshua Clymer
Researcher / Redwood Research
Judging Criteria
-
Creativity
Does the model organism show novel or surprising behaviors? -
Passes Metrics
Passes compliance gap, alignment reasoning, and other metrics -
Presentation/Demo
Present a live demo in 3-minutes that clearly demonstrates the model organism’s misalignment
Questions? Email the hackathon manager
Tell your friends
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.