Interpretability on Go and Language Models
This week, we take a look at interpretability applied to a Go-playing neural network, glitch tokens, and the opinions and actions of top AI labs and entrepreneurs.
Opportunities:
- Stanford's AI100 prize invites essays about how AI will affect our lives, work, and society at large. Applications close at the end of this month: https://ais.pub/ai100
- You can apply for a paid three-month fellowship with AI Safety Info to write answers and summaries for alignment questions and topics: https://ais.pub/stampy
- The Future of Life Institute has rolling applications open for remote full-time positions and internships: https://ais.pub/futurelife
- Similarly, the Epoch team has an expression-of-interest form for joining their talented research team: https://ais.pub/epoch
- You can apply for a postdoc / research scientist position in language model alignment at New York University with Sam Bowman and his team: https://ais.pub/nyu
- Of course, you can join our AI governance hackathon at ais.pub/aigov.
Sources:
- Interpretability on Leela Zero (Go): https://www.alignmentforum.org/posts/FF8i6SLfKb4g7C4EL/inside-the-mind-of-a-superhuman-go-model-how-does-leela-zero-2
- SolidGoldMagikarp: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
- SolidGoldMagikarp II, more findings and details: https://www.lesswrong.com/posts/Ya9LzwEbfaAMY8ABo/solidgoldmagikarp-ii-technical-details-and-more-recent
- SolidGoldMagikarp III, glitch token archaeology: https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
- $20 million for AI safety research: https://www.nsf.gov/pubs/2023/nsf23562/nsf23562.pdf
- Sam Altman's AI risk public statement: https://openai.com/blog/planning-for-agi-and-beyond
- Robin Hanson on AI risk: https://www.overcomingbias.com/p/ai-risk-again
- Conjecture's AI safety strategy: https://www.conjecture.dev/cognitive-emulation-proposal
- Leap Labs: https://www.lesswrong.com/posts/Q44QjdtKtSoqRKgRe/introducing-leap-labs-an-ai-interpretability-startup
- Robots building robots: https://youtu.be/2dS0aDMQoD4
- Elon's new AI ventures: https://www.theinformation.com/articles/fighting-woke-ai-musk-recruits-team-to-develop-openai-rival
- Trojan benchmarks: https://arxiv.org/pdf/2302.10894.pdf
- Poisoning datasets: https://arxiv.org/pdf/2302.10149.pdf
