Defending against artificial intelligence W43
This week, we look at how to safeguard against AGI, review new research on Goodhart’s law, explore an open-source dataset of 60,000 emotional videos, and share new opportunities in ML and AI safety.
Opportunities
- $1.5 million in prizes for changing FTX’s mind: https://ais.pub/6wg
- Interpretability research hackathon: https://ais.pub/s5m
- The AI Safety Ideas platform: https://ais.pub/aisi
- Apply to the long-term future fund for projects: https://ais.pub/ltff
- Anthropic technical engineer: https://ais.pub/5f956b
- AI Impacts research assistant: https://ais.pub/bjv
- AI Impacts senior researcher: https://ais.pub/l9x
- AI Impacts research analyst: https://ais.pub/xoz
- Berkeley Existential Risks Initiative research assistant: https://ais.pub/jt9
- Ought machine learning engineering internship: https://ais.pub/wns
Sources
- Scaling laws for reward model overoptimization: https://arxiv.org/abs/2210.10760
- Estimating wellbeing from video: https://arxiv.org/abs/2210.10039
- Protecting the world from out-of-control AGI: https://www.alignmentforum.org/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control
- Pivotal acts: https://arbital.com/p/pivotal/
- Strategy-stealing assumption: https://www.alignmentforum.org/posts/nRAMpjnb6Z4Qv3imF/the-strategy-stealing-assumption
- Neel Nanda mechanistic interpretability learning guide: https://www.alignmentforum.org/posts/AaABQpuoNC8gpHf2n/a-barebones-guide-to-mechanistic-interpretability
- Computational mechanics to understand mechanistic interpretability: https://www.alignmentforum.org/posts/kqxEJkq5Big9nNKxy/beyond-kolmogorov-and-shannon
- Distilled representations agenda: https://www.alignmentforum.org/posts/wjQkQ8bgWWFym8zF9/distilled-representations-research-agenda-1
- POWERplay RL environment for agentic power-seeking and modeling instrumental value of positions: https://github.com/gladstoneai/POWERplay
