AI improving itself! W01
Over 200 research ideas for mechanistic interpretability, ML improving ML and the dangers of aligned artificial intelligence. Welcome to 2023 and a happy New Year from us at the ML & AI Safety Updates!
Opportunities
- Just two weeks until the mechanistic interpretability hackathon: https://ais.pub/mechint
- Join the AI trends hackathon at EAGx LatAm in Mexico City on Tuesday: https://ais.pub/latam-ais
- AI safety retreat in Berlin: https://ais.pub/berlin-retreat
- Learning-theoretic agenda contribution prize: https://ais.pub/alterprize
- You can also apply for internship opportunities at Redwood Research https://ais.pub/redwoodjob and jobs at the Center for AI Safety https://ais.pub/caisjobs
- Also some very fun opportunities with Encultured AI, developing video games for AI safety research: https://ais.pub/enculturedjobs
Sources:
- 200 concrete problems in mechanistic interpretability: https://www.lesswrong.com/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability (series: https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj )
- Touch reality as soon as possible: https://www.alignmentforum.org/posts/fqryrxnvpSr5w2dDJ/touch-reality-as-soon-as-possible-when-doing-machine
- A year of AI improving AI: https://www.lesswrong.com/posts/camG6t6SxzfasF42i/a-year-of-ai-increasing-ai-progress
- One example is my own submission to that list, where an LLM creates programming puzzles for another LLM to train on (a rough sketch of the loop follows the source list): https://arxiv.org/pdf/2207.14502.pdf
- Finding the best ML safety introductions for ML researchers: https://www.alignmentforum.org/posts/gpk8dARHBi7Mkmzt9/what-ai-safety-materials-do-ml-researchers-find-compelling
- Jacob Steinhardt’s “More is Different for AI” https://bounded-regret.ghost.io/more-is-different-for-ai/ and the qualitative differences of future ML systems https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/
- Unaligned AI risk is lower than the risk of aligned AI ending up in the wrong hands: https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment
- LLMs are nearly AGIs, but we keep shifting the bar: https://www.lesswrong.com/posts/mypCA3AzopBhnYB6P/language-models-are-nearly-agis-but-we-don-t-notice-it
- Expected utility maximization confusions: https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers
- Limitations of ROME: https://www.lesswrong.com/posts/QL7J9wmS6W2fWpofd/but-is-it-really-in-rome-an-investigation-of-the-rome-model
- Discovering latent knowledge in language models without supervision (a minimal sketch of the method follows the source list): https://arxiv.org/pdf/2212.03827.pdf
- CLIP activates for semantic features while ViTs activate for visual features: https://arxiv.org/pdf/2212.06727.pdf
- Basic facts about language model internals, e.g. excessive Gaussianity: https://www.alignmentforum.org/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1
- Investing in AI: https://bayesianinvestor.com/blog/index.php/2022/12/30/investing-for-a-world-transformed-by-ai/
- Models Don't "Get Reward"; they are changed to act more like a high-reward state: https://www.alignmentforum.org/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
- Soft optimization makes the value target bigger; quantilizers as a mitigation scheme for Goodhart’s law: https://www.alignmentforum.org/posts/9fL22eBJMtyCLvL7j/soft-optimization-makes-the-value-target-bigger
- Barnes on confusions about the Simulator framing: https://www.alignmentforum.org/posts/dYnHLWMXCYdm9xu5j/simulator-framing-and-confusions-about-llms
- Chan’s response about “just next-token prediction” vs. simulator: https://www.alignmentforum.org/posts/dYnHLWMXCYdm9xu5j/simulator-framing-and-confusions-about-llms?commentId=Jie8jomFXq6mWapnx
- Iterated decomposition: https://arxiv.org/pdf/2301.01751.pdf
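
For the self-improvement submission above (arXiv 2207.14502), the core loop is: one model proposes programming puzzles together with candidate solutions, the pairs are checked by actually executing them, and only verified pairs become fine-tuning data for the next model. Here is a minimal sketch of that verify-then-train filter; the hardcoded candidate pairs stand in for model samples, and `fine_tune` is a hypothetical placeholder for whatever training API you use.

```python
# Sketch of the verify-then-train loop: keep a (puzzle f, solution g) pair
# only if f(g()) actually executes and returns True, then fine-tune on the survivors.

candidates = [
    # (puzzle, solution) pairs as source strings, as if sampled from a generator LLM
    ("def f(x): return x * x == 49", "def g(): return 7"),
    ("def f(x): return len(x) == 3", "def g(): return 'no'"),  # wrong: filtered out below
]

def verified(puzzle_src: str, solution_src: str) -> bool:
    """Run the pair in a scratch namespace and keep it only if f(g()) is True."""
    ns: dict = {}
    try:
        exec(puzzle_src, ns)
        exec(solution_src, ns)
        return ns["f"](ns["g"]()) is True
    except Exception:
        return False

training_data = [pair for pair in candidates if verified(*pair)]
print(training_data)  # only the first pair survives the execution filter

def fine_tune(pairs):
    """Hypothetical hook: hand the verified puzzle/solution pairs to your fine-tuning setup."""
    pass

fine_tune(training_data)
```

The execution check is what keeps the feedback loop grounded: only puzzles the solver model actually solves make it into the next round of training data.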
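
The unsupervised latent-knowledge paper above introduces Contrast-Consistent Search (CCS): train a small probe on the model's hidden states for "statement is true" vs. "statement is false" contrast pairs, asking only that the two probabilities be consistent (sum to one) and confident (not both 0.5). A minimal sketch of that objective, with random tensors standing in for the language model's activations and assumed sizes:

```python
import torch

torch.manual_seed(0)
n_pairs, d_model = 256, 768             # assumed sizes, for illustration only
h_pos = torch.randn(n_pairs, d_model)   # placeholder activations for the "true" versions
h_neg = torch.randn(n_pairs, d_model)   # placeholder activations for the "false" versions

# Normalize each set independently so the probe cannot just read off the answer token.
def normalize(h):
    return (h - h.mean(dim=0)) / (h.std(dim=0) + 1e-8)

h_pos, h_neg = normalize(h_pos), normalize(h_neg)

probe = torch.nn.Sequential(torch.nn.Linear(d_model, 1), torch.nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos, p_neg = probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # p(true) and p(false) should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate p = 0.5 answer
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned probe direction is a candidate "truth" feature; per-pair predictions
# can be read off as 0.5 * (p_pos + (1 - p_neg)).
```

With real activations instead of random placeholders, the probe can recover a model's answers to yes/no questions without any labels, which is what makes the result interesting for interpretability and oversight.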
