Interpretability state-of-the-art W37
Interpretability research is doing great, and AI is still getting better
Citations:
Circuits: https://distill.pub/2020/circuits/zoom-in/
Interpretability survey: https://arxiv.org/abs/2207.13243, see Twitter summary: https://twitter.com/StephenLCasper/status/1569401262558576642, and PDF: https://arxiv.org/pdf/2207.13243.pdf
Activation atlas: https://distill.pub/2019/activation-atlas/
Changing training data (texture vs. shape bias in CNNs): https://arxiv.org/pdf/1811.12231.pdf
Editing factual associations in GPT: https://arxiv.org/pdf/2202.05262.pdf
Natural language descriptions of deep visual features: https://arxiv.org/pdf/2201.11114.pdf
Robust feature-level adversaries are interpretability tools: https://arxiv.org/pdf/2110.03605.pdf
Samotsvety's AI risk forecast: https://forum.effectivealtruism.org/posts/EG9xDM8YRz4JN4wMN/samotsvety-s-ai-risk-forecasts
Date of AGI: https://www.metaculus.com/questions/5121/date-of-artificial-general-intelligence/
(June) Forecasting TAI with biological anchors summary: https://www.lesswrong.com/s/B9Qc8ifidAtDpsuu8/p/wgio8E758y9XWsi8j
Monitoring for deceptive alignment: https://www.alignmentforum.org/posts/Km9sHjHTsBdbgwKyi/monitoring-for-deceptive-alignment
Deceptive alignment: https://www.alignmentforum.org/posts/zthDPAjh9w6Ytbeks/deceptive-alignment
Quintin's alignment papers roundup (week 1): https://www.lesswrong.com/posts/7cHgjJR2H5e4w4rxT/quintin-s-alignment-papers-roundup-week-1
Most people start with the same few bad ideas: https://www.lesswrong.com/posts/Afdohjyt6gESu4ANf/most-people-start-with-the-same-few-bad-ideas
Beth Barnes is starting an evaluations project at ARC (hiring a researcher and a web developer): https://www.alignmentforum.org/posts/svhQMdsefdYFDq5YM/evaluations-project-arc-is-hiring-a-researcher-and-a-webdev-1
Cognitive biases in LLMs: https://arxiv.org/pdf/2206.14576.pdf
Academia vs. industry: https://www.alignmentforum.org/posts/HXxHcRCxR4oHrAsEr/an-update-on-academia-vs-industry-one-year-into-my-faculty
