[Thanks to Valerie Morris for help editing this post.]

Overview

Large Language Models (LLMs) and their safety properties are often studied from the perspective of a single pass: what is the single next token the LLM will produce? But this is almost never how they are deployed. In practice, LLMs are almost always run autoregressively: they produce tokens sequentially, taking earlier outputs as part of their input, until a halting condition is met (such as a special end-of-sequence token or a token limit). In this post I discuss how one might study LLMs as autoregressive systems.
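To make the loop concrete, here is a minimal sketch in Python. The next_token function is a stand-in for a real model's forward pass, and the vocabulary size, EOS token, and token limit are all illustrative assumptions rather than any particular model's API.

```python
# A minimal sketch of the autoregressive loop described above. Everything
# here (the toy next_token rule, vocabulary size, EOS token, token limit)
# is an illustrative assumption, not any particular model's API.

EOS = 0           # hypothetical special end-of-sequence token
VOCAB_SIZE = 100  # toy vocabulary size
MAX_TOKENS = 20   # token-limit halting condition

def next_token(context: list[int]) -> int:
    """Stand-in for a single LLM forward pass: map the context seen so far
    to one next token. A real model would produce logits over the vocabulary
    and sample from them; this toy rule just hashes the context."""
    return hash(tuple(context)) % VOCAB_SIZE

def generate(prompt: list[int]) -> list[int]:
    """Run the model autoregressively: each new token is appended to the
    context and fed back in as input on the next step."""
    tokens = list(prompt)
    while len(tokens) < MAX_TOKENS:      # halting condition 1: token limit
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:                   # halting condition 2: special token
            break
    return tokens

print(generate([42, 7]))
```

The point of the sketch is the feedback loop in generate: the single-pass perspective studies next_token in isolation, while a deployed system is the iterated map, with each output folded back into the next input.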
The STAMP (Systems-Theoretic Accident Model and Processes) model might be relevant to some of this. It's concerned with dangerous states in complex systems (like airplanes) where traditional point-of-failure models ("this component failed") are impractical or incorrect: there may be no single point of failure, because two components that were each working correctly had a dysfunctional interaction.
I went to a conference on it many years ago, and the presenters weren't great at explaining it, but I got glimpses of a cool top-down approach to evaluating risk in complex systems. Here's their website:
https://stamp-consulting.com/what-is-stamp/