Lessons from the Impossibility of Safety

14 November 2025 • MPI-IS, N0.002 (Lecture Hall)

14 November 2025 • 14:00 - 15:30

What kinds of results are impossible in safety research, and what pathways forward can we hope for? First, we will discuss theoretical results on rule-following which show that token-level jailbreaks are an architectural inevitability of attention (LogicBreaks). Although these results are pessimistic at first glance, the same theoretical insights can be leveraged to steer models to state-of-the-art performance in five lines of code (InstABoost). Lastly, we will argue for a shift in safety strategy away from aligning model weights and toward stateful monitoring, as the only level at which one can hope to stop misuse (https://modelmisuse.com/).

If you can’t come to the lecture hall of the MPI-IS, you can also listen to the talk online:

Join Zoom Meeting

https://eu02web.zoom-x.de/j/69394856294?pwd=WulBcrYUnaqn8RUI6Zrxi0cs22YhLi.1

Meeting ID: 693 9485 6294

Passcode: 485387

Speaker Biography:

	Eric Wong (Assistant Professor)
	University of Pennsylvania

I am an assistant professor in the Department of Computer and Information Science at the University of Pennsylvania. I lead the Brachio Lab, which works on debugging machine learning and making systems actually do what we want them to do. I’m also a part of the ASSET Center on safe, explainable, and trustworthy AI systems. Previously, I completed my PhD at CMU advised by Zico Kolter, and did a postdoc with Aleksander Madry.

Organizers:

Maksym Andriushchenko
