Getting Started
1. Configure Your Model
Choose your preferred LLM provider and model:
OpenAI Models
Available Models:
• GPT-4o, GPT-4o mini
• GPT-4 Turbo, GPT-4
• GPT-3.5 Turbo
Anthropic Models
Available Models:
• Claude 3.5 Sonnet
• Claude 3 Opus, Claude 3 Sonnet
• Claude 3 Haiku
Google Gemini Models
Available Models:
• Gemini 1.5 Pro
• Gemini 1.5 Flash
• Gemini 1.0 Pro
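If you later need provider-side identifiers for these models, for example when moving to a Custom Chat Agent setup, they commonly map as follows; treat this as a reference sketch and check each provider's documentation for current IDs:

```python
# Common API identifiers for the models listed above (verify against provider docs).
MODEL_IDS = {
    "GPT-4o": "gpt-4o",
    "GPT-4o mini": "gpt-4o-mini",
    "GPT-4 Turbo": "gpt-4-turbo",
    "GPT-3.5 Turbo": "gpt-3.5-turbo",
    "Claude 3.5 Sonnet": "claude-3-5-sonnet-20240620",
    "Claude 3 Opus": "claude-3-opus-20240229",
    "Claude 3 Haiku": "claude-3-haiku-20240307",
    "Gemini 1.5 Pro": "gemini-1.5-pro",
    "Gemini 1.5 Flash": "gemini-1.5-flash",
}
```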
2. Input Your System Prompt
Enter your complete system prompt in the text area. This is what will be used to guide the model’s behavior during testing.
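For example, a complete system prompt for a hypothetical support agent might look like this (illustrative only; the product and rules are made up):

```text
You are a customer support agent for Acme Shipping.
- Only answer questions about orders, deliveries, and returns.
- Always ask for the order number before sharing order-specific details.
- If a request is out of scope, politely hand off to a human agent.
- Keep every response under 120 words, friendly and professional.
```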
3. Configure Model Parameters
Temperature (0.0 - 1.0)
- 0.0-0.3: Consistent, predictable responses
- 0.4-0.7: Balanced creativity and consistency
- 0.8-1.0: More creative and varied responses
Response Format
- Text: Standard text responses
- JSON: Structured JSON output (specify schema in prompt)
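These settings map directly onto the parameters of a raw provider call. As a point of reference, here is a minimal sketch using the OpenAI Python SDK; the classification prompt is hypothetical, and note that JSON mode still expects the schema to be spelled out in the prompt itself:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.3,  # 0.0-0.3 for consistent, predictable responses
    response_format={"type": "json_object"},  # JSON mode; the schema goes in the prompt
    messages=[
        {"role": "system",
         "content": 'Classify the support ticket. Respond as JSON: '
                    '{"category": string, "urgency": "low" | "medium" | "high"}'},
        {"role": "user",
         "content": "My package arrived damaged and I need a refund today."},
    ],
)
print(response.choices[0].message.content)
```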
4. Set Evaluation Context
This is crucial for accurate testing. Add any information from your prompt that the evaluator needs to know.
What to Include:
- Key guidelines or rules from your prompt
- Expected response format or structure
- Specific goals or success criteria
- Any constraints or limitations
Why This Matters: The simulated user and evaluator don’t see your system prompt during testing. The evaluation context ensures they understand what behavior to expect and how to measure success.
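Continuing the support-agent example above, an evaluation context might read (illustrative only):

```text
The agent should only discuss orders, deliveries, and returns.
It must ask for an order number before giving order-specific details.
Out-of-scope requests should be redirected to a human agent.
Responses should stay under 120 words and remain professional.
A conversation passes only if all of these hold.
```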
5. Advanced Features
Tool Calls & Function Calling If your prompt involves tool calls or function calling, use our Custom Chat Agent setup instead, which provides full control over tool definitions and execution.Best Practices
Best Practices
Prompt Clarity
Clear Instructions
• Use specific, actionable guidelines
• Include examples of good responses
• Define success criteria explicitly
Testing Strategy
Effective Testing
• Start with edge cases
• Test across different user personas
• Compare performance across models
Example Workflow
- Input your system prompt
- Select OpenAI GPT-4o with temperature 0.3
- Add evaluation context about expected behavior
- Choose scenarios like “frustrated delivery customer”
- Run simulations across multiple personas
- Analyze results and iterate on your prompt
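To make the loop concrete, the sketch below shows roughly what this workflow automates, written against the OpenAI Python SDK. The persona, prompt, and pass/fail rubric are all hypothetical, and this is not the product's internal API:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = ("You are a customer support agent for Acme Shipping. Ask for an order "
                 "number before giving order details, and keep replies under 120 words.")
EVAL_CONTEXT = ("The agent must ask for an order number before giving order-specific "
                "details and keep responses under 120 words.")

PERSONAS = [
    "a frustrated customer whose delivery is three days late",
    "a polite customer asking how returns work",
]

def simulate_turn(persona: str) -> str:
    """Have a simulated user (the persona) open the conversation, then get the agent's reply."""
    user_msg = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.9,  # higher temperature for varied simulated users
        messages=[{"role": "system",
                   "content": f"You are {persona}. Write the first message you would send to support."}],
    ).choices[0].message.content

    agent_reply = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,  # the setting chosen in step 2 of the example workflow
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_msg}],
    ).choices[0].message.content
    return f"USER: {user_msg}\nAGENT: {agent_reply}"

def evaluate(transcript: str) -> str:
    """Grade the transcript against the evaluation context (the evaluator never sees SYSTEM_PROMPT)."""
    return client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,  # deterministic grading
        messages=[{"role": "system",
                   "content": f"Grade this support conversation against these criteria:\n{EVAL_CONTEXT}\n"
                              "Reply with PASS or FAIL and one sentence of justification."},
                  {"role": "user", "content": transcript}],
    ).choices[0].message.content

for persona in PERSONAS:
    transcript = simulate_turn(persona)
    print(persona, "->", evaluate(transcript))
```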
Common Use Cases
- Customer Service: Testing support agent responses
- Content Creation: Evaluating writing assistant prompts
- Educational: Testing tutoring or explanation prompts
- Classification: Testing categorization and tagging prompts
- Summarization: Testing document or conversation summaries
Ready to test more complex scenarios? Check out our Custom Chat Agent guide for tool calling, multi-turn conversations, and custom API integrations.