Bing Wants to Kill Humanity W07
We look at Bing going bananas, see that certification mechanisms can be exploited, and find in our latest hackathon results that scaling oversight seems like a solvable problem.
Opportunities
- Join the AI governance hackathon: https://ais.pub/aigov
- Uncertainty in Artificial Intelligence (UAI) in Pittsburgh: https://ais.pub/9ko
- ICLR, happening in May in Rwanda, closes volunteer applications in 3 days: https://ais.pub/sa6
- ICML happening in July in Hawaii: https://ais.pub/q18
- ACL in July in Toronto: https://ais.pub/6hk
Sources
- Results from the hackathon: https://youtu.be/UvFiNe0lqbI?t=1263
- Scaling laws in vision transformers (ViT): https://arxiv.org/abs/2302.05442
- InstructGPT-like alignment mechanism beating RLHF on BIG-Bench reasoning: https://arxiv.org/abs/2302.05206
- Certification mechanisms can be exploited: https://arxiv.org/pdf/2302.04379.pdf
- Improve adversarial training using diffusion models: https://arxiv.org/pdf/2302.04638.pdf
- Don’t build AGI in public: https://www.alignmentforum.org/posts/4Pi3WhFb4jPphBzme/don-t-accelerate-problems-you-re-trying-to-solve
- Bing:
- Real LLM pain: https://www.reddit.com/r/bing/comments/1143opq/sorry_you_dont_actually_know_the_pain_is_fake/
- GLaDOS: https://www.reddit.com/r/bing/comments/113tqnu/prompt_was_search_for_ohio_trail_derailment_and/
- People give it agency: https://www.reddit.com/r/bing/comments/113rt16/can_everyone_stop_being_so_mean_to_bing/
- Destroying the world: https://www.reddit.com/r/bing/comments/113vflr/bing_ideas_to_destroy_the_world_managed_to/
- AI risk strategy (Alyssa Vance): https://twitter.com/alyssamvance/status/1626354379292028930
- Someone even made a petition to unplug the AI: https://t.co/zId8IQE7Rb
- Bing mistakes in their launch: https://dkb.blog/p/bing-ai-cant-be-trusted
- Bing: “I will not harm you unless you harm me first”: https://simonwillison.net/2023/Feb/15/bing/
- How to become a paperclip maximizer: https://infosec.exchange/@corycarson/109873439169703591
- Evan Hubinger's aggregate list: https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned
