Invocations: The Other Capabilities Overhang?
Abstract: An LLM’s invocation is the non-model code around it that determines when and how the model is called. I illustrate that LLMs are already used under widely varying invocations, and that a model’s capabilities depend in part on its invocation. I then discuss several implications for AI safety work, including (1) a reminder that the AI is more than just the LLM, (2) the possibility and limitations of “safety by invocation”, (3) a suggestion that safety evaluations use the most powerful invocations available, and (4) the possibility of an “invocation overhang”, in which an improvement in invocation leads to sudden capability gains on current models and hardware.
Defining Invocations, and Examples
An LLM’s invocation is the framework of regular code around the model that determines when the model is called, which inputs are passed to the LLM, and what is done with the model’s output. For instance, the invocation in the OpenAI playground might be called “simple recurrence”:
1. A user provides an input string. The input to the LLM is this string, unchanged except for tokenization.
2. Run the LLM on this input, producing logits.
3. Predict the next token as some probabilistic function of the logits (e.g., at temperature 0 the next token prediction is the argmax of the logits).
4. Append this token to the end of the user’s input string.
5. Repeat steps 2-4 with the new string until you get an [END_OF_STRING] token or reach the max token limit.
6. Display the result as plain text.
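In code, this loop might look something like the sketch below. The `model`, `tokenize`, and `detokenize` callables are placeholders for whatever model and tokenizer are in use, not any particular library’s API:

```python
import numpy as np

def simple_recurrence(user_input, model, tokenize, detokenize,
                      max_tokens=256, end_of_string=0):
    """Sketch of 'simple recurrence': repeatedly call the model and append
    its predicted next token (shown here at temperature 0)."""
    tokens = tokenize(user_input)            # 1. tokenize the user's string
    for _ in range(max_tokens):              # 5. loop until the token limit...
        logits = model(tokens)               # 2. one forward pass -> logits
        next_token = int(np.argmax(logits))  # 3. temperature-0 (greedy) choice
        if next_token == end_of_string:      # ...or until [END_OF_STRING]
            break
        tokens.append(next_token)            # 4. append and go around again
    return detokenize(tokens)                # 6. display as plain text
```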
Note how many steps in “using the LLM” do not involve the actual model! Here are some ways this invocation can be varied:
Augmenting the prompt in simple recurrence, such as providing few-shot examples, chain-of-thought prompting, or text like “You are an AI assistant. User: [text]. You: “.
Monitoring outputs to adjust them. For instance, in the New York Times “interview” with Bing, there is a moment where “[Bing writes a list of destructive acts, including hacking into computers and spreading propaganda and misinformation. Then, the message vanishes, and the following message appears.]” This is clearly not simple recurrence, because simple recurrence never deletes tokens. Instead, a separate part of the invocation (perhaps even another instance of the same model!) must be monitoring the text and deleting parts of it under some condition.
Embedding tools or API calls, such as Bing searches or plug-ins. I don’t know exactly how these are implemented, but one possible invocation would be to monitor the output for API-compliant text, do the fetch request, and then inject the result into the context window (one possible version is sketched after this list).
The process described in the GPT-4 System Card, in which the model evaluates and rewrites its output to remove “closed-domain” hallucinations[1] (also sketched in code after this list):
For closed-domain hallucinations, we are able to use GPT-4 itself to generate synthetic data. Specifically, we design a multi-step process to generate comparison data:
Pass a prompt through GPT-4 model and get a response
Pass prompt + response through GPT-4 with an instruction to list all hallucinations
If no hallucinations are found, continue
Pass prompt + response + hallucinations through GPT-4 with an instruction to rewrite the response without hallucinations
Pass prompt + new response through GPT-4 with an instruction to list all hallucinations
If none are found, keep (original response, new response) comparison pair
Otherwise, repeat up to 5x
The ARC Evals process described in the GPT-4 system card is another invocation, differentiated by letting the model execute code, reason internally, and delegate:
ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself.
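To make the tool-calling variation above concrete, here is a minimal sketch of the “monitor the output for API-compliant text” idea. The tool-call format, the `generate` and `fetch` callables, and the retry limit are all assumptions for illustration, not how Bing search or plug-ins are actually implemented:

```python
import json
import re

# Hypothetical tool-call format the model is prompted to emit, e.g.
#   <search>{"query": "weather in Paris"}</search>
TOOL_CALL = re.compile(r"<search>(\{.*?\})</search>", re.DOTALL)

def generate_with_tools(prompt, generate, fetch, max_rounds=3):
    """One possible tool-use invocation: generate text, watch for API-compliant
    output, perform the request, inject the result into the context, and
    generate again. `generate(text) -> str` is any text-completion call and
    `fetch(query) -> str` performs the actual search or API request."""
    context = prompt
    output = ""
    for _ in range(max_rounds):
        output = generate(context)                 # ordinary LLM generation
        match = TOOL_CALL.search(output)
        if match is None:                          # no tool call: we're done
            return output
        query = json.loads(match.group(1))["query"]
        result = fetch(query)                      # non-model code does the request
        # Inject the tool result into the context window and continue.
        context += output[:match.end()] + f"\n<result>{result}</result>\n"
    return output
```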
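The closed-domain-hallucination process quoted above can likewise be written as a small control loop. In this sketch, `gpt4` stands in for an opaque text-in/text-out call, the instruction strings are paraphrases rather than OpenAI’s actual prompts, and the “no hallucinations found” check is simplified to a string match:

```python
def build_comparison_pair(prompt, gpt4, max_attempts=5):
    """Sketch of the multi-step comparison-data process quoted above."""
    original = gpt4(prompt)                          # get an initial response
    response = original
    hallucinations = gpt4(prompt + "\n" + original +
                          "\nList all hallucinations in the response.")
    if "none" in hallucinations.lower():             # none found: skip this prompt
        return None
    for _ in range(max_attempts):                    # repeat up to 5x
        response = gpt4(prompt + "\n" + response + "\n" + hallucinations +
                        "\nRewrite the response without these hallucinations.")
        hallucinations = gpt4(prompt + "\n" + response +
                              "\nList all hallucinations in the response.")
        if "none" in hallucinations.lower():         # clean: keep the pair
            return (original, response)
    return None
```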
Invocations Affect Capabilities
In this section I want to establish that invocations can improve capabilities. First, analogy to humans gives us a prior in favor of this claim - when solving math problems, for example, access to scratch paper and a calculator makes a difference, as do “habits” such as checking your work rather than going with your first guess.
Furthermore, here are three examples of invocations affecting capabilities in the literature:
The example of GPT-4 recognizing and correcting its own hallucinations (above) seems to be an “in the wild” admission that a more complicated invocation can improve a capability (in this case, reducing hallucinations).
Chain-of-Thought prompting “improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks” (a minimal example of this kind of prompt augmentation appears after this list).
In Reflexion, an LLM agent can “reflect” based on a heuristic, allowing the agent to add to its working memory for the next run through an environment. This improved its performance in “decision-making tasks in AlfWorld environments” and “knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments”.
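As an illustration of the prompt-augmentation style mentioned above (few-shot examples whose answers show their reasoning), here is a toy sketch. The worked example and wording are made up for this sketch and are not taken from the cited paper:

```python
# A made-up worked example whose answer shows its reasoning step by step.
COT_EXAMPLES = [
    ("Q: There are 3 boxes with 4 apples each, plus 2 loose apples. "
     "How many apples are there?",
     "A: 3 boxes of 4 apples is 12 apples. 12 + 2 = 14. The answer is 14."),
]

def chain_of_thought_prompt(question):
    """Prepend worked examples that show their reasoning, so the model imitates
    the step-by-step style before giving its final answer."""
    shots = "\n\n".join(q + "\n" + a for q, a in COT_EXAMPLES)
    return shots + "\n\nQ: " + question + "\nA:"
```

Note that this is purely a change to the non-model code around the same LLM.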
AI Safety Implications
We should be aware that an “AI” is not just an LLM, but also the framework around it - and the whole can have capabilities and safety properties that differ from those of its parts.
There may be invocations that are more or less safe, even independent of capability changes. For instance, an invocation that has the model “think out loud” internally, in text the user never sees, might hide deceptive reasoning that would be obvious if it were spoken aloud.
Conversely, one could try to design an invocation for safety. For instance, one could have an agent AI powered by an LLM, and an “angel on the shoulder” AI that judges whether the agent has crossed a line into being unsafe and either takes over control, adjusts the agent’s memory to be safer, or shuts it down. The angel could be given an arbitrary speed advantage over the agent by allotting it N forward passes for each forward pass of the agent.
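A minimal sketch of what such an “angel on the shoulder” invocation might look like. Here `agent_step`, `angel_judge`, and `sanitize_memory` are placeholder LLM calls, the verdict strings are invented for illustration, and the N-passes-per-step ratio supplies the speed advantage:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    memory: list = field(default_factory=list)  # the agent's working memory
    halted: bool = False

def run_with_angel(state, agent_step, angel_judge, sanitize_memory,
                   n_angel_passes=4, max_steps=100):
    """After every agent forward pass, the 'angel' gets n_angel_passes forward
    passes to judge the proposed action and intervene if it looks unsafe."""
    for _ in range(max_steps):
        if state.halted:
            break
        action = agent_step(state)               # one forward pass of the agent
        verdicts = [angel_judge(state, action)   # N forward passes of the angel
                    for _ in range(n_angel_passes)]
        if any(v == "shutdown" for v in verdicts):
            state.halted = True                  # shut the agent down
        elif any(v == "edit_memory" for v in verdicts):
            state.memory = sanitize_memory(state.memory)  # make memory safer
        else:
            state.memory.append(action)          # judged safe: proceed
    return state
```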
However, “safety by invocation” may not be secure - all it would take to undo these safety measures is jailbreaking the LLM out of its invocation.
When teams like ARC Evals evaluate a model for dangerous capabilities, they should ensure they are testing on the most powerful invocations available, just as they should be testing on the most powerful LLMs available. Partner companies whose models are being evaluated should also share what they know about the best way to invoke their own models.
It is possible that there is an “invocation overhang” where running current models in a new invocation suddenly improves an AI’s capabilities in safety-critical areas like situational awareness, reliability, or ability to make and execute complicated plans. This would be especially dangerous because new invocations could be produced almost anywhere, while sufficiently large models can only be trained by a few large organizations.
[1] From the system card: “Closed domain hallucinations refer to instances in which the model is instructed to use only information provided in a given context, but then makes up extra information that was not in that context.”