Was ChatGPT a good idea? W04
In this week’s ML & AI Safety Update, we hear Paul Christiano’s take on one of OpenAI’s main alignment strategies, dive into the second-round winners of the Inverse Scaling Prize, and share the many fascinating projects from our mechanistic interpretability hackathon. Stay tuned until the end for some unique opportunities in AI safety!
Opportunities (https://ais.pub/aistraining)
- The PIBBSS deadline is coming up in 10 days: https://ais.pub/pibbss
- EAG London is coming up in May: https://ais.pub/eag
- Introduction to ML safety: https://ais.pub/gt2
- Alignment competitions: https://ais.pub/aawards
Sources
- RLHF 2015: https://ai-alignment.com/efficient-feedback-a347748b1557
- Christiano on RLHF: https://www.alignmentforum.org/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
- Inverse scaling prize winners: https://www.lesswrong.com/posts/DARiTSTx5xDLQGrrz/inverse-scaling-prize-second-round-winners
- We discovered “ an” neuron: https://itch.io/jam/mechint/rate/1890024
- Identifying a preliminary circuit for predicting gendered pronouns in GPT-2 small with the automatic circuit identification algorithm: https://itch.io/jam/mechint/rate/1889871
- Automated identification of potential feature neurons: https://itch.io/jam/mechint/rate/1889215
- Soft prompts are a convex set: https://itch.io/jam/mechint/rate/1889669
- Mentaleap team: https://mentaleap.ai/
- Prompt tuning: https://arxiv.org/abs/2104.08691
- Results page: https://itch.io/jam/mechint/results
