agents or workflows.
Test Types
At present, we support a single type of test, `type: consistency`, which
measures the consistency between two results. Within agents, this can be
implemented as follows:
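A minimal sketch, assuming the tests live under a top-level `tests` key in the agent file (the file name and question are illustrative):

```yaml
# my_agent.agent.yml (other agent configuration omitted)
tests:
  - type: consistency
    task_description: "What was total revenue by region last quarter?"  # illustrative question
    n: 10  # run the agent 10 times against this question
```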
The `task_description` field is the question you want to test the LLM's performance on (note: we don't call this `prompt` because the `task_description` is nested within a separate prompt that runs the evaluation, so `prompt` would be ambiguous here). `n` indicates the number of times to run the agent to produce a response to the `task_description` request.
For workflows, `task_description` is not required; instead, a `task_ref` value should be provided, as shown below:
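A sketch of a workflow test, again assuming a `tests` key at the workflow root (the task name is illustrative):

```yaml
# my_workflow.workflow.yml (other workflow configuration omitted)
tests:
  - type: consistency
    task_ref: summarize_sales  # name of the workflow task to evaluate (illustrative)
    n: 10
```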
The `task_ref` field indicates the name of the task to be tested. No `task_description` is required because the referenced task's existing prompt is used for the evaluation.
Consistency Runs in Workflow Agent Tasks
For agent tasks within workflows, you can enable consistency checking directly on the task by using the `consistency_run` field. This runs the agent multiple times and selects the most consistent output:
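A sketch of an agent task with `consistency_run` enabled (the task name is illustrative and other task fields are omitted):

```yaml
tasks:
  - name: analyze_revenue   # illustrative task name; other task fields omitted
    type: agent
    consistency_run: 5      # run the agent 5 times and keep the most consistent output
```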
| Field | Type | Default | Description |
|---|---|---|---|
| `consistency_run` | integer | 1 | Number of times to run the agent (must be > 1 to enable) |
| `consistency_prompt` | string | (built-in) | Custom evaluation prompt for consistency checking. Supports workflow-level and task-level configuration. |
The `consistency_prompt` can be configured at two levels:
- Workflow-level: Add `consistency_prompt` at the workflow root to apply to all agent tasks
- Task-level: Add `consistency_prompt` to a specific task to override the workflow prompt
- Default: If not specified, a built-in evaluation prompt optimized for data analysis is used
Custom prompts at either level can use the following template variables:
- `{{ submission_1 }}` - First agent output
- `{{ submission_2 }}` - Second agent output
- `{{ task_description }}` - The task prompt being evaluated
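A sketch showing both placements; the prompt wording, task name, and surrounding task fields are illustrative, not the built-in prompt:

```yaml
# Workflow-level prompt: applies to every agent task in this workflow
consistency_prompt: |
  Task: {{ task_description }}
  Are these two outputs factually consistent?
  Output 1: {{ submission_1 }}
  Output 2: {{ submission_2 }}
tasks:
  - name: summarize_sales   # illustrative; other task fields omitted
    type: agent
    consistency_run: 3
    # Task-level override (optional):
    # consistency_prompt: |
    #   <stricter evaluation instructions for this task only>
```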
How Consistency Testing Works
Consistency tests evaluate whether your agent produces factually consistent results across multiple runs. Since Oxy is primarily used for data analysis (not general factual Q&A), the consistency evaluator is optimized for numerical data and analytical insights.
The Evaluation Process
When you run a consistency test with `n: 10`:
- Generate N outputs: Your agent runs 10 times with the same input
- Pairwise comparison: Outputs are compared in pairs (e.g., output 1 vs 2, 1 vs 3, etc.)
- LLM evaluation: An LLM judge evaluates if each pair is factually consistent
- Consistency score: Each output receives a score based on how many comparisons it passed
- Best output selected: The most consistent output is returned
What Consistency Testing Ignores
For data analysis use cases, the evaluator intelligently ignores:
1. Numerical Rounding & Precision (< 0.1% difference)
2. Grammar & Style Variations
3. Formatting Differences
Why This Matters for Data Analysis
In data analysis workflows:
- SQL queries may return slightly different precision depending on database settings
- Different phrasing of the same insight is acceptable (“revenue increased” vs “revenue went up”)
- Rounding is common and expected (databases, visualization tools, reporting systems)
- Trends and insights matter more than exact decimal places
Configuring Consistency Tests
Basic Configuration
The simplest consistency test uses the default evaluation prompt, which treats outputs as consistent when they differ only in:
- Minor numerical differences from rounding (e.g., in a figure like 1,081,395.67)
- Grammar and style variations (e.g., “amounts to” vs “amount to”)
- Formatting differences (e.g., “1000” vs “1,000”)
Default Prompt Reference: The built-in
CONSISTENCY_PROMPT provides detailed evaluation guidelines optimized for data analysis. View the full default prompt in the source code to understand its logic or use it as a starting point for custom prompts.
Advanced Configuration with Custom Prompts
For specific use cases, you can fully customize the evaluation logic by providing a custom `prompt`:
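A sketch with an illustrative strict prompt; the question and prompt wording are not from the built-in template, and the sketch assumes that answering A marks a consistent pair, so check the built-in CONSISTENCY_PROMPT for the exact A/B convention:

```yaml
tests:
  - type: consistency
    task_description: "Report monthly revenue totals for 2024"  # illustrative
    n: 10
    prompt: |
      You are comparing two answers to the task: {{ task_description }}
      Submission 1: {{ submission_1 }}
      Submission 2: {{ submission_2 }}
      Treat any numerical difference, however small, as inconsistent.
      Answer A if the submissions are consistent, otherwise answer B.
```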
Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| `n` | integer | 10 | Number of times to run the test |
| `task_description` | string | required | The question/query to test |
| `prompt` | string | (built-in) | Custom evaluation prompt template. Must include `{{ submission_1 }}`, `{{ submission_2 }}`, `{{ task_description }}` variables and return A or B. |
When to Customize the Evaluation Prompt
Use the default prompt for general analysis (recommended):
- General data analysis
- Sales reporting, KPIs, dashboards
- Most analytical queries
- Handles minor numerical differences intelligently
Customize with a stricter prompt when exactness matters:
- Financial calculations where precision matters
- Critical metrics that must be exact
- Compliance reporting
- Zero-tolerance scenarios
Customize with a more lenient, meaning-focused prompt for:
- Trend analysis and directional insights
- Sentiment or qualitative analysis
- High-level summaries
- Focus on meaning over exact wording
Running Tests
Basic Usage
These tests can be run with the `oxy test` command, pointing it at either an agent file or a workflow file.
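For example (the file paths are illustrative):

```bash
# Test an agent
oxy test my_agent.agent.yml

# Test a workflow
oxy test my_workflow.workflow.yml
```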
Output Formats
The `oxy test` command supports two output formats for flexibility in different environments:
Pretty Format (Default)
The default format provides colored, human-readable output with detailed metrics.
JSON Format (CI/CD)
For continuous integration and automated pipelines, use the `--format json` flag to get machine-readable output (see the example after this list). This format is ideal for:
- CI/CD pipelines
- Automated quality gates
- Parsing with tools like `jq`
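For example (the file path is illustrative):

```bash
oxy test my_agent.agent.yml --format json
```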
Accuracy Thresholds
You can enforce minimum accuracy requirements using the `--min-accuracy` flag. This is useful for CI/CD pipelines to prevent regressions (see the example after this list). With a threshold of 0.8, the command will:
- Exit with code 0 if accuracy meets or exceeds 80%
- Exit with code 1 if accuracy falls below 80%
- Output results to stdout regardless of pass/fail
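A sketch of the invocation behind the 80% example above (the file path is illustrative):

```bash
oxy test my_agent.agent.yml --min-accuracy 0.8
```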
Threshold Modes for Multiple Tests
When your test file contains multiple tests, you can control how the threshold is evaluated:
Average Mode (Default)
Checks whether the average of all test accuracies meets the threshold. For example, test accuracies of [0.85, 0.92, 0.78] average to 0.85, which passes a 0.8 threshold.
All Mode
Requires every individual test to meet the threshold. With the same accuracies, [0.85, 0.92, 0.78] would fail because Test 3 (0.78) is below the 0.8 threshold.
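A sketch of enforcing this per-test behavior (the file path is illustrative):

```bash
oxy test my_workflow.workflow.yml --min-accuracy 0.8 --threshold-mode all
```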
When a test falls below the threshold in this mode, an error is reported and the command exits with code 1.
Quiet Mode
The `--quiet` flag suppresses progress bars and detailed output during test execution (see the example after this list). This is useful for:
- Clean CI logs
- Parsing output programmatically
- Reducing noise in automated environments
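For example, combining quiet mode with JSON output (the file path is illustrative):

```bash
oxy test my_agent.agent.yml --quiet --format json
```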
CLI Reference
oxy test Command
Syntax:
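A sketch based on the argument and flags documented below:

```bash
oxy test <file> [--format <format>] [--min-accuracy <threshold>] [--threshold-mode <mode>] [--quiet]
```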
`<file>` - Path to the `.agent.yml` or `.workflow.yml` file to test (required)
| Flag | Short | Description | Default |
|---|---|---|---|
| `--format <format>` | | Output format: `pretty` or `json` | `pretty` |
| `--min-accuracy <threshold>` | | Minimum accuracy threshold (0.0-1.0). Exit code 1 if below threshold | None |
| `--threshold-mode <mode>` | | Threshold evaluation mode: `average` or `all` | `average` |
| `--quiet` | `-q` | Suppress detailed output and show only results summary | `false` |
CI/CD Integration Examples
GitHub Actions
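A sketch of a GitHub Actions job; the workflow name, file path, and threshold are illustrative, and the installation step is a placeholder since it depends on how you distribute the oxy CLI:

```yaml
name: Agent tests
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # How you install the oxy CLI depends on your setup (binary download,
      # package manager, or a prebuilt image); this step is a placeholder.
      - name: Install oxy
        run: echo "install the oxy CLI here"
      - name: Run consistency tests
        run: oxy test my_agent.agent.yml --format json --min-accuracy 0.8 --quiet
```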
Docker
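A sketch of running the tests inside a container; `<your-oxy-image>` is a placeholder for an image that has the oxy CLI (and your project's dependencies) installed, and the file path is illustrative:

```bash
docker run --rm \
  -v "$(pwd):/workspace" \
  -w /workspace \
  <your-oxy-image> \
  oxy test my_workflow.workflow.yml --format json --min-accuracy 0.8
```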
Parsing JSON Output
Extract specific metrics using `jq`:
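A sketch of piping the JSON output through `jq`; the `.tests[].accuracy` path is hypothetical, so inspect the real output of `oxy test --format json` and adjust the filter accordingly:

```bash
# ".tests[].accuracy" is a hypothetical path - adjust it to the real JSON schema.
oxy test my_agent.agent.yml --format json --quiet | jq '.tests[].accuracy'
```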
Best Practices
- Multiple Tests: Write multiple tests to cover different aspects of your agent’s behavior
- Threshold Mode: Use `--threshold-mode all` for critical quality gates, `average` for overall performance monitoring
- Version Control: Commit your test files (`.agent.yml`, `.workflow.yml`) to track test definitions
- CI Integration: Always use `--format json` in CI pipelines for reliable parsing
- Quiet Mode: Combine `--quiet` with `--format json` in automated environments for clean logs
Error Handling
- Execution Errors: If tests fail to run (e.g., connection issues), the errors are written to stderr and don't affect the JSON output on stdout
- Threshold Failures: The command only exits with code 1 when `--min-accuracy` is specified and the threshold isn't met
- Missing Metrics: If no accuracy metrics are found but `--min-accuracy` is specified, a warning is displayed but the command succeeds (exit code 0)