Test your system prompts across different models and scenarios without building a full agent. Perfect for prompt optimization and A/B testing.

Getting Started

1. Configure Your Model

Choose your preferred LLM provider and model:
Available Models:
  • OpenAI: GPT-4o, GPT-4o mini, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo
  • Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku
  • Google: Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 1.0 Pro
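Note that these are display names. If you later reproduce a run directly against a provider API, you will need the corresponding API model identifiers. A rough mapping (illustrative only; exact identifiers vary by provider and release, so check each provider's documentation):

# Display name -> API model identifier (illustrative; verify current IDs with each provider)
MODEL_IDS = {
    "GPT-4o": "gpt-4o",
    "GPT-4o mini": "gpt-4o-mini",
    "Claude 3.5 Sonnet": "claude-3-5-sonnet-20240620",
    "Claude 3 Haiku": "claude-3-haiku-20240307",
    "Gemini 1.5 Pro": "gemini-1.5-pro",
    "Gemini 1.5 Flash": "gemini-1.5-flash",
}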

2. Input Your System Prompt

Enter your complete system prompt in the text area; this is what guides the model’s behavior during testing. Example:
You are a helpful customer service agent for a delivery company. 
Your goal is to resolve customer issues quickly and empathetically.

Guidelines:
1. Always ask for the order number first
2. Acknowledge the customer's frustration
3. Provide clear next steps
4. Offer compensation when appropriate

Keep responses under 100 words and maintain a professional tone.
Remove Placeholder Variables: Make sure to remove any placeholder text like {customer_name} or {order_id} from your prompt. Use static examples instead.
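A quick way to catch leftover placeholders before you run a test is a simple pattern check. The sketch below uses Python's standard re module and assumes curly-brace placeholders like the examples above:

import re

def find_placeholders(prompt: str) -> list[str]:
    # Match {snake_case}-style placeholders such as {customer_name} or {order_id}
    return re.findall(r"\{[A-Za-z_][A-Za-z0-9_]*\}", prompt)

system_prompt = "You are a helpful agent. Greet {customer_name} and look up {order_id}."
leftover = find_placeholders(system_prompt)
if leftover:
    print("Replace these placeholders with static examples:", leftover)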

3. Configure Model Parameters

Temperature (0.0 - 1.0)
  • 0.0-0.3: Consistent, predictable responses
  • 0.4-0.7: Balanced creativity and consistency
  • 0.8-1.0: More creative and varied responses
Response Format
  • Text: Standard text responses
  • JSON: Structured JSON output (specify schema in prompt)
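These settings map directly onto the underlying provider parameters. For example, a low-temperature JSON run reproduced against the OpenAI API might look like the sketch below (assuming the official openai Python SDK; the JSON keys are illustrative):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.3,                          # consistent, predictable responses
    response_format={"type": "json_object"},  # the prompt itself must mention JSON
    messages=[
        {
            "role": "system",
            "content": 'You are a support agent. Respond only with a JSON object with keys "reply" and "needs_escalation".',
        },
        {"role": "user", "content": "My package never arrived."},
    ],
)
print(response.choices[0].message.content)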

4. Set Evaluation Context

This is crucial for accurate testing. Add any information from your prompt that the evaluator needs to know.

What to Include:
  • Key guidelines or rules from your prompt
  • Expected response format or structure
  • Specific goals or success criteria
  • Any constraints or limitations
Example Evaluation Context:
The agent should:
- Always ask for order number first
- Acknowledge customer frustration
- Keep responses under 100 words
- Maintain professional tone
- Offer compensation when issues warrant it

Success criteria:
- Issue resolution within 3 exchanges
- Customer satisfaction maintained
- Company policies followed
Why This Matters: The simulated user and evaluator don’t see your system prompt during testing. The evaluation context ensures they understand what behavior to expect and how to measure success.
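To make this concrete, here is a rough illustration of how an LLM-as-judge evaluator could be prompted with your evaluation context instead of your system prompt. This is a sketch of the idea, not the platform's internal implementation:

evaluation_context = """The agent should:
- Always ask for order number first
- Acknowledge customer frustration
- Keep responses under 100 words
- Maintain professional tone
"""

def build_evaluator_prompt(transcript: str) -> str:
    # The evaluator never sees the real system prompt, only this context and the conversation
    return (
        "You are evaluating a customer service conversation.\n"
        f"Expected behavior:\n{evaluation_context}\n"
        f"Transcript:\n{transcript}\n"
        "Rate how well the agent met the expected behavior and explain your reasoning."
    )

print(build_evaluator_prompt("USER: My package is two days late!\nAGENT: I'm sorry about the delay..."))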

5. Advanced Features

Tool Calls & Function Calling

If your prompt involves tool calls or function calling, use our Custom Chat Agent setup instead, which provides full control over tool definitions and execution.

Best Practices

Prompt Clarity

Clear Instructions
  • Use specific, actionable guidelines
  • Include examples of good responses
  • Define success criteria explicitly

Testing Strategy

Effective Testing
  • Start with edge cases
  • Test across different user personas
  • Compare performance across models

Example Workflow

  1. Input your system prompt
  2. Select OpenAI GPT-4o with temperature 0.3
  3. Add evaluation context about expected behavior
  4. Choose scenarios like “frustrated delivery customer”
  5. Run simulations across multiple personas
  6. Analyze results and iterate on your prompt
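If you want to script the same loop outside the UI, the sketch below shows the shape of it (assuming the openai Python SDK; the scenarios and opening messages are illustrative):

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful customer service agent for a delivery company. "
    "Always ask for the order number first and keep responses under 100 words."
)

# Scenario name -> opening user message (a full harness would simulate multi-turn users)
SCENARIOS = {
    "frustrated delivery customer": "My package was supposed to arrive two days ago. Where is it?!",
    "address change request": "Hi, I need to change the delivery address on an order I placed yesterday.",
}

for name, opening_message in SCENARIOS.items():
    reply = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": opening_message},
        ],
    )
    print(f"--- {name} ---\n{reply.choices[0].message.content}\n")

From there you would pass each transcript to an evaluator (as in step 4), compare results across models, and iterate on your prompt.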

Common Use Cases

  • Customer Service: Testing support agent responses
  • Content Creation: Evaluating writing assistant prompts
  • Educational: Testing tutoring or explanation prompts
  • Classification: Testing categorization and tagging prompts
  • Summarization: Testing document or conversation summaries
Ready to test more complex scenarios? Check out our Custom Chat Agent guide for tool calling, multi-turn conversations, and custom API integrations.