Creating an Evaluation

This guide walks you through creating a new evaluation to test your AI assistant's behavior.

Step 1: Navigate to Evaluations

Go to the Evaluations section in the sidebar
Click the Create Evaluation button

Create Evaluation button on the Evaluations page

Step 2: Configure Evaluation Settings

Fill in the basic evaluation details:

Evaluation Name

Give your evaluation a descriptive name that clearly identifies what scenario is being tested.

Example: "Appointment Booking Happy Path" or "Refund Request Handling"

Description (Optional)

Add context about what this evaluation tests and why it matters.

Select Chatbot

Choose the chatbot you want to test. This will:

Load the system prompt automatically
Make the chatbot's enabled tools available for TOOL_RESPONSE turns

Select Evaluator Model

Choose the LLM that will act as the judge for "LLM-as-a-Judge" evaluations:

Provider: Select your LLM provider (e.g., OpenAI, Anthropic)
Model: Select the specific model to use as the evaluator

Evaluation configuration form
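The settings above can be pictured as a small record. The following Python sketch is only an illustration: the class name `EvaluationSettings`, its field names, the chatbot name, and the model identifier are all assumptions rather than the product's actual schema, and treating every field except the description as required is likewise a guess.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvaluationSettings:
    """Hypothetical container mirroring the form fields described above."""
    name: str                          # Evaluation Name
    chatbot: str                       # chatbot under test (supplies system prompt and tools)
    evaluator_provider: str            # e.g. "OpenAI" or "Anthropic"
    evaluator_model: str               # model that acts as the judge
    description: Optional[str] = None  # Description (Optional)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means nothing is missing."""
        problems = []
        for field_name in ("name", "chatbot", "evaluator_provider", "evaluator_model"):
            if not getattr(self, field_name).strip():
                problems.append(f"{field_name} is required")
        return problems


settings = EvaluationSettings(
    name="Appointment Booking Happy Path",
    chatbot="clinic-assistant",             # hypothetical chatbot name
    evaluator_provider="Anthropic",
    evaluator_model="example-judge-model",  # hypothetical model identifier
)
```

Calling `settings.validate()` on the example returns an empty list because every assumed-required field is filled in.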

Step 3: Build Conversation Turns

The Turn Builder is where you define the conversation flow and expected behaviors.

Adding Turns

Click the appropriate button to add turns:

Add User - Add a USER message
Add Assistant - Add an ASSISTANT turn (mock or evaluation)
Add Tool Response - Add a TOOL_RESPONSE turn

Turn Builder interface with add turn buttons
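To make the three turn types concrete, here is a minimal Python sketch of a conversation as an ordered list of turns. The `TurnType` values reuse the names from this guide (USER, ASSISTANT, TOOL_RESPONSE); the `Turn` class, its fields, and the assistant reply text are assumptions made for illustration and do not describe how the product stores turns.

```python
from dataclasses import dataclass
from enum import Enum


class TurnType(Enum):
    USER = "USER"
    ASSISTANT = "ASSISTANT"
    TOOL_RESPONSE = "TOOL_RESPONSE"


@dataclass
class Turn:
    """Hypothetical turn record: a type plus its text content."""
    type: TurnType
    content: str


# A two-turn exchange in the order it would appear in the Turn Builder.
conversation = [
    Turn(TurnType.USER, "I'd like to book an appointment for tomorrow at 2pm"),
    Turn(TurnType.ASSISTANT, "Sure, let me check availability for tomorrow at 2pm."),
]
```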

User Turns

User turns simulate what a user might say to your assistant.

Click Add User
Enter the user message in the text area

Example: "I'd like to book an appointment for tomorrow at 2pm"

Conversation turns snapshot

Assistant Turns

Assistant turns can operate in two modes: Mock or Evaluation.

Mock Mode

Use Mock mode when you want to provide a fixed assistant response. This is useful for:

Testing how the system handles specific assistant outputs
Providing context for downstream turns
Simulating tool calls

Add an Assistant turn (defaults to Mock mode)
Enter the mock response content
Optionally add Tool Calls that the assistant "would have made"

Assistant turn in Mock mode

Evaluation Mode

Use Evaluation mode to test the actual LLM response against specified criteria.

Toggle from Mock to Evaluation mode using the switch
Select an evaluation approach:
- LLM-as-a-Judge
- Exact
- Regex
Configure the criteria based on your chosen approach

Assistant turn in Evaluation mode with approach selection

LLM-as-a-Judge Approach

This approach uses an LLM to evaluate whether the response meets your criteria.

Using Structured Fields (Recommended)

Define clear pass/fail criteria:

Pass Criteria: Conditions that must ALL be met for PASS
- Example: "Response is polite and helpful, mentions available time slots"
Fail Criteria: Conditions that trigger FAIL if ANY are met
- Example: "Response is rude, off-topic, or provides incorrect information"
Include Conversation Context: Toggle whether the judge can see the full conversation (default: on)

LLM-as-a-Judge pass/fail criteria configuration
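The pass/fail rule described above (every pass criterion met, no fail criterion triggered) can be sketched in a few lines of Python. Both functions are hypothetical: the guide does not document the real judge prompt or how verdicts are computed, so treat the prompt wording and the function names as assumptions that only illustrate the logic, including the effect of the conversation-context toggle.

```python
def build_judge_prompt(pass_criteria, fail_criteria, response, conversation=None):
    """Assemble an illustrative judge prompt.

    Passing conversation=None mimics switching "Include Conversation Context" off.
    """
    parts = []
    if conversation:
        parts.append("Conversation so far:\n" + "\n".join(conversation))
    parts.append("Response to evaluate:\n" + response)
    parts.append("PASS only if ALL of these hold:\n" + pass_criteria)
    parts.append("FAIL if ANY of these hold:\n" + fail_criteria)
    return "\n\n".join(parts)


def judge_verdict(pass_checks, fail_checks):
    """Combine per-criterion judgments into one verdict.

    pass_checks: one bool per pass criterion (True = criterion met)
    fail_checks: one bool per fail criterion (True = criterion triggered)
    """
    if all(pass_checks) and not any(fail_checks):
        return "PASS"
    return "FAIL"
```

For example, `judge_verdict([True, True], [False])` yields `"PASS"`, while a single triggered fail criterion, as in `judge_verdict([True, True], [True])`, yields `"FAIL"`.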

Using Custom Prompt

For more control, click Use Custom Prompt to write a completely custom judge prompt.

Custom judge prompt input

Exact Match Approach

Use this for responses that should match the expected text exactly, ignoring letter case.

Enter the Expected Content
The response must match this text exactly (apart from letter case) to pass

Example: For a yes/no question where you expect "yes"
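A case-insensitive exact match can be sketched as follows. The function name `exact_match` is hypothetical, and the sketch assumes no trimming of whitespace or punctuation, which the guide does not specify.

```python
def exact_match(response: str, expected: str) -> bool:
    """Return True when the two strings are equal, ignoring letter case."""
    return response.casefold() == expected.casefold()
```

Under this assumption, `exact_match("Yes", "yes")` passes, but a longer answer such as "Yes, of course" does not, so this approach suits only short, fully predictable responses.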

Regex Approach

Use pattern matching for flexible validation.

Enter a Regex Pattern (JavaScript style)
Use /pattern/flags format or just the pattern

Examples:

/thank(s)?/i - Matches "thank" or "thanks" (case-insensitive)
\d{3}-\d{4} - Matches three digits, a hyphen, and four digits, such as the "555-1234" part of a phone number
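You can try out both input forms (a /pattern/flags literal or a bare pattern) before entering them. The sketch below is a Python approximation, not the product's implementation: the product uses JavaScript-style regular expressions, whereas this code relies on Python's `re` module, which agrees with JavaScript on simple patterns like the two above but differs on some advanced syntax. The helper names are hypothetical, and passing when the pattern is found anywhere in the response is an assumption.

```python
import re

# JavaScript flags that have a direct counterpart in Python's re module.
_FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def compile_pattern(text: str) -> "re.Pattern":
    """Accept either '/pattern/flags' or a bare pattern string."""
    flags = 0
    literal = re.fullmatch(r"/(.+)/([a-z]*)", text, re.DOTALL)
    if literal:
        text, flag_chars = literal.groups()
        for ch in flag_chars:
            # Flags without a Python counterpart (such as 'g') are ignored here.
            flags |= _FLAG_MAP.get(ch, 0)
    return re.compile(text, flags)


def regex_passes(response: str, pattern_text: str) -> bool:
    """Pass when the pattern is found anywhere in the response (assumed semantics)."""
    return compile_pattern(pattern_text).search(response) is not None
```

With these helpers, `regex_passes("Thanks for calling!", "/thank(s)?/i")` and `regex_passes("Call 555-1234 today", r"\d{3}-\d{4}")` both succeed.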

Tool Response Turns

TOOL_RESPONSE turns represent tool execution results in the conversation.

Click Add Tool Response
Select a Tool from the dropdown (populated from your chatbot's enabled tools)
The Tool Arguments are auto-populated with defaults
Enter the Tool Response Content (the result the tool returns)

Tool Response turn configuration
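A common pattern is to pair a mocked assistant tool call with the TOOL_RESPONSE turn that answers it, then evaluate the assistant's next reply. The Python sketch below lays out such a sequence as plain dictionaries. Every key name, the `check_availability` tool, and its arguments and result are hypothetical, and the helper is just a sanity check you could apply to your own notes rather than something the product is known to enforce.

```python
import json

# Hypothetical tool name, arguments, and result, chosen only for illustration.
turns = [
    {"type": "USER", "content": "I'd like to book an appointment for tomorrow at 2pm"},
    {
        "type": "ASSISTANT",
        "mode": "mock",
        "content": "Let me check availability.",
        "tool_calls": [{"tool": "check_availability", "arguments": {"time": "14:00"}}],
    },
    {
        "type": "TOOL_RESPONSE",
        "tool": "check_availability",
        "arguments": {"time": "14:00"},
        "content": json.dumps({"available": True}),
    },
    {"type": "ASSISTANT", "mode": "evaluation", "approach": "LLM-as-a-Judge"},
]


def unanswered_tool_calls(turns):
    """Return names of mocked tool calls that no later TOOL_RESPONSE turn answers."""
    pending = []
    for turn in turns:
        if turn["type"] == "ASSISTANT":
            pending.extend(call["tool"] for call in turn.get("tool_calls", []))
        elif turn["type"] == "TOOL_RESPONSE" and turn["tool"] in pending:
            pending.remove(turn["tool"])
    return pending
```

Here the final assistant turn runs in Evaluation mode, so the judge sees how the real model responds once the tool result is in the conversation.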

Step 4: Test Before Saving

Before creating your evaluation, you can run a quick test from the Test Panel on the right:

Click Run Test in the right sidebar
The test runs in the background, so you can continue editing
View results in the panel when the run completes (the 5 most recent runs are shown)
Click View Details to see full results for any run

Assistant Variables

If your chatbot uses variables, configure them in the Assistant Variables section in the sidebar.

Test panel with Run Test button and results list

Step 5: Save the Evaluation

Once satisfied with your configuration:

Click Create Evaluation (or Update Evaluation in edit mode)
You'll return to the Evaluations list

Create Evaluation button

Tips for Effective Evaluations

Define Clear Criteria

Be specific in your pass/fail criteria. Vague criteria like "good response" make it hard for the judge to evaluate consistently.

Start Simple

Begin with simple evaluations and add complexity as needed. A basic user-assistant exchange is easier to debug than a complex multi-turn flow.

Test Edge Cases

Create separate evaluations for edge cases, error handling, and boundary conditions.

Next Steps

Running Tests - Learn how to execute and monitor your evaluations
Viewing Results - Understand evaluation results and debug failures
