Can we predict the abilities of future AI? W44
This week, we look at broken scaling laws, surgical fine-tuning, interpretability in the wild, and threat models of AI.
Opportunities
Redwood Research is inviting 30-50 researchers to join them in Berkeley for a very interesting mechanistic interpretability research programme. https://ais.pub/remix
Anthropic is looking for operations managers, recruiters, researchers, engineers, and product managers: https://ais.pub/anthropic
AI Safety Ideas: https://ais.pub/aisi
Hackathons: https://ais.pub/jam
Timestamps
00:20 Broken scaling laws & surgical fine-tuning
01:34 Debate & interpretability
02:55 Threat models in ML safety
04:30 Other news
05:30 REMIX, Anthropic, and hackathons
Sources
Caution when interpreting DeepMind's in-context RL paper: https://www.lesswrong.com/posts/avvXAvGhhGgkJDDso/caution-when-interpreting-deepmind-s-in-context-rl-paper
Interpretability in the Wild: https://twitter.com/kevrowan/status/1587601532639494146
Precision machine learning, Max Tegmark: https://arxiv.org/pdf/2210.13447.pdf
Broken neural scaling laws: https://arxiv.org/pdf/2210.14891.pdf
Debate does not work for question-answering: https://arxiv.org/pdf/2210.10860.pdf
Vision for meta-science: https://scienceplusplus.org/metascience/index.html#how-does-the-culture-of-science-change
AI X-risk >35%, mostly based on a recent peer-reviewed argument: https://www.lesswrong.com/posts/XtBJTFszs8oP3vXic/ai-x-risk-greater-than-35-mostly-based-on-a-recent-peer
Clarifying X-risk: https://www.lesswrong.com/posts/GctJD5oCDRxCspEaZ/clarifying-ai-x-risk
Threat model literature review: https://www.lesswrong.com/posts/wnnkD6P2k2TfHnNmt/threat-model-literature-review
Boundaries vs. frames: https://www.alignmentforum.org/posts/SZjHimszxqjNJzQWK/boundaries-vs-frames
Surgical fine-tuning improves robustness against distributional shifts: https://arxiv.org/pdf/2210.11466.pdf
