Interpretability state-of-the-art W37

🎉 Interpretability research is doing great, and 📈 AI is still getting better
🐛 Citations:
Interpretability survey (see also Twitter summary and PDF):
Activation atlas:
Changing training data:
Editing factual associations in GPT: 
Natural language descriptions of deep visual features:
Robust feature-level adversaries are interpretability tools:
Samotsvety's AI risk forecast:
Date of AGI: 
(June) Forecasting TAI with biological anchors summary:
Monitoring for deceptive alignment: 
Deceptive alignment: 
Quintin's alignment paper lineup:
Most people start with the same few bad ideas:
Beth Barnes starting a risks and development evaluations group at ARC:
Cognitive biases in LLMs:
Academia vs. industry: