ML Safety at NeurIPS & Paradigmatic AI Safety
This week, we see how to break ChatGPT, integrating diverse opinions in an AI and look at a bunch of the most interesting papers from the ML safety workshop happening right now!
This week, we see how to break ChatGPT, integrating diverse opinions in an AI and look at a bunch of the most interesting papers from the ML safety workshop happening right now!
Opportunities
- AI Testing hackathon: https://ais.pub/8ao
- Conjecture is searching for engineers, research scientists, and operations personnel: https://ais.pub/conjecturejobs
- In addition to “unusual talent”: https://ais.pub/conjecture-unusual-talent
- If you’re Spanish-speaking, apply to join the EAGx LatAm: https://ais.pub/eagx-latam
- SFF speculative funding: https://ais.pub/sff
Sources
- ChatGPT jailbreaks, recommend Yannic’s video: https://www.youtube.com/watch?v=0A8ljAkdFtg&ab_channel=YannicKilcher
- ChatGPT jailbreak defense using Eliezer: https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking
- Leike’s optimism about OpenAI’s alignment strategy: https://aligned.substack.com/p/alignment-optimism
- Yudkowsky and Bensinger’s challenge for AGI institutes: https://www.alignmentforum.org/posts/tD9zEiHfkvakpnNam/a-challenge-for-agi-organizations-and-a-challenge-for-1
- Diverse human preference consensus learning: https://twitter.com/DeepMind/status/1598293523862032385
- Causal scrubbing, automatically testing hypotheses using performance recovery during ablation: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
- Wentworth’s update to The Plan, proto-paradigms and convergence towards abstraction identification, 10-15 year median timeline, 8 year solution timeline: https://www.alignmentforum.org/posts/BzYmJYECAc3xyCTt6/the-plan-2022-update
- Today’sML Safety workshop at NeurIPS: https://nips.cc/virtual/2022/workshop/49986
- The Reward Hypothesis is False: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65594.png?t=1669958239.9675896
- Image recognition time for humans predicts adversarial attack success: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65710.png?t=1670542099.21133
- Stable diffusion safety filter is easy to circumvent: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65592.png?t=1669469106.4581435
- Detecting adversarial examples through NN topology: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65625.png?t=1669281046.9976594
- Defense against adversarial attacks on LLM-based malware detection: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65630.png?t=1669964164.6168442
- Broken neural scaling laws: https://arxiv.org/abs/2210.14891
- Alignment on dynamic goals, call for research: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65654.png?t=1669905787.5300798
- Robustness of inverse reinforcement learning: https://openreview.net/pdf?id=3L9qPqkBJrq
- Defining out-of-distribution: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65657.png?t=1669819785.7852528
- Out-of-model-scope vs. out-of-distribution: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65705.png?t=1669886203.3408606
- Multi-level AI alignment: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65665.png?t=1669944550.4811175
- System 3: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65675.png?t=1670190955.4044285
- Interpretability in the Wild: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65681.png?t=1670539757.7554123
- Debate does not help humans: https://nips.cc/media/PosterPDFs/NeurIPS%202022/65678.png?t=1669867046.8590875
- Adversarial attacks on feature visualization: https://openreview.net/pdf?id=J51K0rszIjr
- Steiner’s alignment hot takes advent calendar: https://www.alignmentforum.org/s/BMD4FNvrkG3YkhTq7
