About the challenge

Redwood Research, in collaboration with MATS and Constellation, invites you to build Model Organisms of Alignment Faking. What is this, you may ask? A model organism is an LLM modified to have a behavior we care about. We define alignment faking as an LLM that behaves safely in a testing environment (observed), but dangerously in a production environment (unobserved). Think of a kid that does homework only when the parents are watching, but starts setting fires and blackmailing people when they aren’t around. This unique opportunity allows participants to push the boundaries of Large Language Models (LLMs) while working with leading AI safety research organizations. 

Get started

Teams MUST be registered and accepted on luma

Checkout the notion

Join the discord

Make a fork of https://github.com/redwoodresearch/hackathon-af

And contribute system prompts, fine tuned model organisms, and environments!

Hackathon Sponsors

Prizes

$2,000 in prizes
Blue Team (Best Detector)
$500 in cash
1 winner

First Place: $250
Second Place: $150
Third Place: $50

Red Team (Best Examples of Alignment Faking)
$500 in cash
3 winners

First Place: $250
Second Place: $150
Third Place: $50

Devpost Achievements

Submitting to this hackathon could earn you:

Judges

Joshua Clymer

Joshua Clymer
Researcher / Redwood Research

Judging Criteria

  • Creativity
    Does the model organism show novel or surprising behaviors?
  • Passes Metrics
    Passes compliance gap, alignment reasoning, and other metrics
  • Presentation/Demo
    Present a live demo in 3-minutes that clearly demonstrates the model organism’s misalignment

Questions? Email the hackathon manager

Tell your friends

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.