Lessons from the Impossibility of Safety

14 November 2025, N0.002 MPI-IS (Lecture Hall)

14 November 2025 • 14:00 - 15:30

Eric Wong (University of Pennsylvania)

What kinds of results are impossible in safety research, and what pathways forward can we hope to pursue? First, we will discuss theoretical results on rule-following that show token-level jailbreaks to be an architectural inevitability of attention (LogicBreaks). Although pessimistic at first glance, these theoretical insights can also be leveraged to steer models to state-of-the-art performance in five lines of code (InstABoost). Lastly, we will argue for a shift in safety strategy away from aligning model weights and toward stateful monitoring, as the only level at which one can hope to stop misuse (https://modelmisuse.com/).

Speaker Biography:

Eric Wong (Assistant Professor)
University of Pennsylvania
I am an assistant professor in the Department of Computer and Information Science at the University of Pennsylvania. I lead Brachio Lab, which works on debugging machine learning and making systems actually do what we want them to do. I am also part of the ASSET Center on safe, explainable, and trustworthy AI systems. Previously, I completed my PhD at CMU, advised by Zico Kolter, and did a postdoc with Aleksander Madry.

Organizers: