Lessons from the Impossibility of Safety
MPI-IS, 14 November 2025, N0.002 MPI-IS (Lecture Hall)
What kinds of results are impossible in safety research, and what pathways forward can we hope for? First, we will discuss theoretical results on rule-following that show token-level jailbreaks to be an architectural inevitability of attention (LogicBreaks). While initially pessimistic, these theoretical insights can also be leveraged to steer models to state-of-the-art performance in five lines of code (InstABoost). Lastly, we will argue for a shift in safety strategy away from aligning model weights and toward stateful monitoring, as the only level at which one can hope to stop misuse (https://modelmisuse.com/).
If you can’t come to the lecture hall of the MPI-IS, you can also listen to the talk online:
Join Zoom Meeting
https://eu02web.zoom-x.de/j/69394856294?pwd=WulBcrYUnaqn8RUI6Zrxi0cs22YhLi.1
Meeting ID: 693 9485 6294
Passcode: 485387
Speaker Biography:
Eric Wong (Assistant Professor)
University of Pennsylvania