Violent language models & neural hacking W38
This Safe AI Progress Report covers the past week's developments in ML and AI safety. Follow along for regular updates on the scientific field's work to keep artificial intelligence safe.
👩‍🔬 Hosted by Sabrina Zaki
Citations
- Reinforcement learning from human feedback: https://twitter.com/anthropicai/status/1514277273070825476?lang=en
- First SAIPR: https://www.youtube.com/watch?v=ETknJbbL3PY&t=5s&ab_channel=ApartResearch
- Red teaming LLMs: https://arxiv.org/pdf/2202.03286.pdf
- Adversarial training [Redwood]: https://arxiv.org/abs/2205.01663
- Robust injury classifier [Redwood]: https://www.alignmentforum.org/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood
- Red teaming language models to reduce harms: Review [Anthropic]: https://arxiv.org/abs/2209.07858
- Aligning language models: https://arxiv.org/abs/2209.00731
- Refine’s third blog post day/week: https://www.alignmentforum.org/posts/PhKSe9BT4h5peqrHL/refine-s-third-blog-post-day-week
- Refine as a concept: https://www.alignmentforum.org/posts/5uiQkyKdejX3aEHLM/how-to-diversify-conceptual-alignment-the-model-behind
- Ordering capability thresholds: https://www.alignmentforum.org/posts/ttRyu8u9vqX3jZFjr/ordering-capability-thresholds
- Levels of goals and alignment: https://www.alignmentforum.org/posts/rzkCTPnkydQxfkZsX/levels-of-goals-and-alignment
- Representational tether: https://www.alignmentforum.org/posts/h7BA7TQTo3dxvYrek/representational-tethers-tying-ai-latents-to-human-ones
- Coordinate-free interpretability theory: https://www.alignmentforum.org/posts/sxhfSBej6gdAwcn7X/coordinate-free-interpretability-theory
- Backdoor bench: https://arxiv.org/abs/2206.12654
- Leon Lang’s summary of AGISF readings: https://www.alignmentforum.org/posts/eymFwwc6jG9gPx5Zz/summaries-alignment-fundamentals-curriculum
- Vanessa Kosoy’s ALTER prize for learning-theoretic progress in alignment: https://www.alignmentforum.org/posts/8BL7w55PS4rWYmrmv/prize-and-fast-track-to-alignment-research-at-alter
- Apart Research: https://apartresearch.com
- AI Safety Ideas: https://aisafetyideas.com
