Inspect: An open-source framework for large language model evaluations

Welcome

Welcome to Inspect, a framework for large language model evaluations created by the UK AI Safety Institute.

Inspect provides many built-in components, including facilities for prompt engineering, tool usage, multi-turn dialog, and model graded evaluations. Extensions to Inspect (e.g. to support new elicitation and scoring techniques) can be provided by other Python packages.

We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on Workflow, Solvers, Tools, Scorers, Datasets, and Models to learn how to create more advanced evaluations.

Getting Started

First, install Inspect with:

    $ pip install inspect-ai

To develop and run evaluations, you’ll also need access to a model, which typically requires installing a Python package and making sure the appropriate API key is available in the environment.

Assuming you have written an evaluation in a script named arc.py, here’s how you would set up and run the eval for a few different model providers:

OpenAI

    $ pip install openai
    $ export OPENAI_API_KEY=your-openai-api-key
    $ inspect eval arc.py --model openai/gpt-4

Anthropic

    $ pip install anthropic
    $ export ANTHROPIC_API_KEY=your-anthropic-api-key
    $ inspect eval arc.py --model anthropic/claude-3-opus-20240229

Google

    $ pip install google-generativeai
    $ export GOOGLE_API_KEY=your-google-api-key
    $ inspect eval arc.py --model google/gemini-1.0-pro

Mistral

    $ pip install mistralai
    $ export MISTRAL_API_KEY=your-mistral-api-key
    $ inspect eval arc.py --model mistral/mistral-large-latest

HF

    $ pip install torch transformers
    $ export HF_TOKEN=your-hf-token
    $ inspect eval arc.py --model hf/meta-llama/Llama-2-7b-chat-hf

Together

    $ pip install openai
    $ export TOGETHER_API_KEY=your-together-api-key
    $ inspect eval arc.py --model together/Qwen/Qwen1.5-72B-Chat

In addition to the model providers shown above, Inspect also supports models hosted on Azure AI, AWS Bedrock, and Cloudflare. See the documentation on Models for additional details.
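Because each provider reads its key from the environment, it can be useful to confirm which keys are set before launching an eval. The following is a minimal sketch using only the Python standard library. The helper names (configured_providers, missing_key) and the mapping are illustrative assumptions built from the variable names in the commands above; they are not part of Inspect.

```python
import os

# Environment variable used for each provider prefix in the commands above.
# This mapping is an illustration only; it is not an Inspect API.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "hf": "HF_TOKEN",
    "together": "TOGETHER_API_KEY",
}


def configured_providers(environ=None):
    """Return the provider prefixes whose variable is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return [
        provider
        for provider, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


def missing_key(model, environ=None):
    """Return the unset variable name for a model such as 'openai/gpt-4'.

    Returns None when the variable is set or the provider prefix is unknown.
    """
    env = os.environ if environ is None else environ
    provider = model.split("/", 1)[0]
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None or env.get(var, "").strip():
        return None
    return var


if __name__ == "__main__":
    print("Providers with a key set:", configured_providers())
```

Calling missing_key("openai/gpt-4") before running inspect eval gives an early, readable hint about which export line was skipped.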

Hello, Inspect

Inspect evaluations have three main components:

- Datasets contain a set of labeled samples. Datasets are typically just a table with input and target columns, where input is a prompt and target is either literal value(s) or grading guidance.
- Solvers are composed together in a plan to evaluate the input in the dataset. The most elemental solver, generate(), just calls the model with a prompt and collects the output. Other solvers might do prompt engineering, multi-turn dialog, critique, etc.
- Scorers evaluate the final output of solvers. They may use text comparisons, model grading, or other custom schemes.

Let’s take a look at a simple evaluation that aims to see how models perform on the Sally-Anne test, which assesses the ability of a person to infer false beliefs in others. Here are some samples from the dataset:

    input:  Jackson entered the hall. Chloe entered the hall. The boots is in the bathtub. Jackson exited the hall. Jackson entered the dining_room. Chloe moved the boots to the pantry. Where was the boots at the beginning?
    target: bathtub

    input:  Hannah entered the patio. Noah entered the patio. The sweater is in the bucket. Noah exited the patio. Ethan entered the study. Ethan exited the study. Hannah moved the sweater to the pantry. Where will Hannah look for the sweater?
    target: pantry

Here’s the code for the evaluation (the numbered comments are explained below):

    from inspect_ai import Task, eval, task
    from inspect_ai.dataset import example_dataset
    from inspect_ai.scorer import model_graded_fact
    from inspect_ai.solver import (
        chain_of_thought, generate, self_critique
    )

    @task
    def theory_of_mind():
        return Task(                                    # (1)
            dataset=example_dataset("theory_of_mind"),
            plan=[                                      # (2)
                chain_of_thought(),
                generate(),
                self_critique()
            ],
            scorer=model_graded_fact()                  # (3)
        )

1. The Task object brings together the dataset, solvers, and scorer, and is then evaluated using a model.
2. In this example we are chaining together three standard solver components. It’s also possible to create a more complex custom solver that manages state and interactions internally.
3. Since the output is likely to have pretty involved language, we use a model for scoring.

Note that this is a purposely over-simplified example! The templates used for prompting, critique, and grading can all be customised, and in a more rigorous evaluation we’d explore improving them in the context of this specific dataset.

The @task decorator applied to the theory_of_mind() function is what enables inspect eval to find and run the eval in the source file passed to it. For example, here we run the eval against GPT-4:

    $ inspect eval theory_of_mind.py --model openai/gpt-4

By default, eval logs are written to the ./logs sub-directory of the current working directory. When the eval is complete you will find a link to the log at the bottom of the task results summary. You can also explore eval results using the Inspect log viewer.
Run inspect view to open the viewer (you only need to do this once, as the viewer will automatically update when new evals are run):

    $ inspect view

See the Log Viewer section for additional details on using Inspect View.

This example demonstrates evals being run from the terminal with the inspect eval command. There is also an eval() function which can be used for exploratory work; this is covered further in Workflow.
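To make the relationship between the three components in this section more concrete, here is a toy model of the dataset, plan, and scorer flow in plain Python. It is a conceptual sketch only: it does not use or reproduce Inspect's API, and every name in it (Sample, State, add_reasoning_prompt, call_model, includes_target, run_eval, stub_model) is hypothetical. Inspect's real solvers and scorers have their own signatures, described in the Solvers and Scorers documentation.

```python
from dataclasses import dataclass
from typing import Callable, List

# A model is anything that maps a prompt string to a completion string.
Model = Callable[[str], str]


@dataclass
class Sample:
    input: str   # the prompt
    target: str  # the expected answer (or grading guidance)


@dataclass
class State:
    """Per-sample working state passed from solver to solver (hypothetical)."""
    prompt: str
    output: str = ""


# A solver takes the current state plus the model and returns the new state.
Solver = Callable[[State, Model], State]


def add_reasoning_prompt(state: State, model: Model) -> State:
    """Prompt-engineering step: ask for step-by-step reasoning."""
    state.prompt += "\nThink step by step before giving your final answer."
    return state


def call_model(state: State, model: Model) -> State:
    """Most elemental step: send the prompt to the model, keep the output."""
    state.output = model(state.prompt)
    return state


def includes_target(output: str, target: str) -> bool:
    """Text-comparison scorer: correct if the target appears in the output."""
    return target.lower() in output.lower()


def run_eval(samples: List[Sample], plan: List[Solver],
             scorer: Callable[[str, str], bool], model: Model) -> float:
    """Run every sample through the plan, score it, and return accuracy."""
    correct = 0
    for sample in samples:
        state = State(prompt=sample.input)
        for solver in plan:
            state = solver(state, model)
        correct += int(scorer(state.output, sample.target))
    return correct / len(samples)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "The boots were in the bathtub."
```

Passing a list such as [add_reasoning_prompt, call_model] as the plan mirrors the idea of chaining solvers in order. Swapping includes_target for a function that asks a second model to grade the answer would correspond to model-graded scoring.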