OpenAI develops new tool to explain language models’ behaviors

OpenAI is developing a tool that automatically identifies which parts of a large language model (LLM) are responsible for particular behaviours.

The engineers behind it emphasise that it is still in its early stages, but the code to run it is open source and available on GitHub as of this morning.

“We’re trying to [develop ways to] anticipate what the problems with an AI system will be,” William Saunders, the interpretability team manager at OpenAI, told TechCrunch.

“We want to really be able to know that we can trust what the model is doing and the answer that it produces.”

OpenAI’s tool (ironically) uses a language model to work out the functions of the components of other, architecturally simpler LLMs, in particular OpenAI’s own GPT-2.

To do so, the tool breaks the model under study down into its individual neurons.

First, the programme passes text sequences through the model under consideration and looks for cases where a specific neuron “activates” repeatedly.
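The first step can be pictured with a small sketch. It is an illustration only, not OpenAI’s code: the function names and the toy “year” neuron below are hypothetical, and a real run would read activations from GPT-2 itself.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def top_activating_sequences(
    sequences: List[Tokens],
    neuron_activation: Callable[[Tokens], List[float]],
    k: int = 3,
) -> List[Tuple[float, Tokens]]:
    """Rank token sequences by the neuron's peak activation and keep the top k."""
    scored = [
        (max(neuron_activation(seq), default=0.0), seq)
        for seq in sequences
    ]
    # Sort on the activation value only, strongest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_year_neuron(tokens: Tokens) -> List[float]:
    """Hypothetical stand-in for a real neuron: fires on four-digit tokens."""
    return [1.0 if t.isdigit() and len(t) == 4 else 0.0 for t in tokens]


if __name__ == "__main__":
    texts = [["born", "in", "1999"], ["hello", "world"], ["the", "cat"]]
    for peak, seq in top_activating_sequences(texts, toy_year_neuron, k=2):
        print(peak, " ".join(seq))
```

The highest-ranked sequences are the kind of evidence that would then be handed to the explaining model.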

Next, it “shows” GPT-4, OpenAI’s most recent text-generating AI model, these highly active neurons and asks it to generate an explanation.

To assess how accurate the explanation is, the tool feeds text sequences to GPT-4 and asks it to predict, or simulate, how the neuron would respond. The simulated responses are then compared with the real neuron’s behaviour.
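One plausible way to put a number on an explanation’s accuracy is to measure how closely the simulated activations track the real ones. The sketch below uses a plain Pearson correlation for that purpose; the metric choice and the function name are assumptions made for illustration, not a description of the released code.

```python
import math
from typing import Sequence


def correlation_score(real: Sequence[float], simulated: Sequence[float]) -> float:
    """Pearson correlation between real and simulated per-token activations.

    Returns a value between -1 and 1; by convention here, 0.0 is returned
    when either series is constant and the correlation is undefined.
    """
    if len(real) != len(simulated) or len(real) < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    n = len(real)
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_r == 0 or var_s == 0:
        return 0.0
    return cov / math.sqrt(var_r * var_s)


if __name__ == "__main__":
    real = [0.0, 1.0, 0.0, 1.0]       # what the neuron actually did
    simulated = [0.0, 2.0, 0.0, 2.0]  # what the explanation predicts
    print(correlation_score(real, simulated))
```

A score near 1 would suggest the explanation predicts the neuron well; a score near 0 would suggest it captures little of the neuron’s behaviour.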

The researchers were able to generate explanations for all 307,200 neurons in GPT-2, which were put into a dataset that was made available alongside the tool code.

According to the researchers, tools like this could one day be used to improve an LLM’s performance, for example by reducing bias or toxicity. They concede, however, that the tool has a long way to go before it is truly useful: it was confident in its explanations for only about 1,000 of the 307,200 neurons, a small fraction (roughly 0.3%) of the total.
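For a sense of scale, the share of neurons with confident explanations can be worked out from the two figures quoted above. Since “about 1,000” is itself approximate, the result is only a rough estimate; the function name is illustrative.

```python
def confident_share(confident: int, total: int) -> float:
    """Fraction of neurons for which the tool was confident in its explanation."""
    if total <= 0:
        raise ValueError("total must be positive")
    return confident / total


if __name__ == "__main__":
    share = confident_share(1_000, 307_200)
    print(f"{share:.2%}")  # prints 0.33%
```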
