The Together AI Evaluations service is a powerful framework for using LLM-as-a-Judge to evaluate other LLMs and various inputs.Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Large language models can serve as judges to evaluate other language models or assess different types of content. You can simply describe in detail how you want the LLM-as-a-Judge to assess your inputs, and it will perform this evaluation for you. For example, they can identify and flag content containing harmful material, personal information, or other policy-violating elements. Another common use case is comparing the quality of two LLMs, or configurations of the same model (for example prompts) to determine which performs better on your specific task. Our Evaluations service allows you to easily submit tasks for assessment by a judge language model. With Evaluations, you can:- Compare models and configurations: Understand which setup works best for your task
- Measure performance: Use a variety of metrics to score your model’s responses
- Filter datasets: Apply LLM-as-a-Judge to filter and curate your datasets
- Gain insights: Understand where your model excels and where it needs improvement
- Build with confidence: Ensure your models meet quality standards before deploying them to production
Quickstart
To launch evaluations using the UI, please refer to: AI Evaluations UI For the full API specification, please refer to docs Get started with the Evaluations API in just a few steps. This example shows you how to run a simple evaluation.1. Prepare Your Dataset
First, you’ll need a dataset to evaluate your model on. The dataset should be in JSONL or CSV format. Each line must contain the same fields. Example JSONL dataset:dataset.jsonl
- CSV: math_dataset.csv
- JSONL: math_dataset.jsonl
2. Upload Your Dataset
You can use our UI, API, or CLI.Make sure to specify
purpose: "eval" to ensure the data is processed correctly.3. Run the Evaluation
We support three evaluation types, each designed for specific assessment needs:classify— Classifies the input into one of the provided categories. Returns one of the predefined classes.score— Takes an input and produces a score within a specified range. Returns a numerical score.compare— Takes responses from two models and determines which one is better according to a given criterion.
Evaluation Type: Classify
Purpose: Categorizes input into predefined classes (e.g., “Toxic” vs “Non-toxic”) Parameters:- judge (required): Configuration for the judge model
model– The model to use for evaluationmodel_source– One of: “serverless”, “dedicated”, or “external”system_template– Jinja2 template providing guidance for the judge (see Understanding Templates)external_api_token– Optional; required whenmodel_source = "external". If you selectexternalmodel source, use this to provide API bearer authentication token (eg. OpenAI token)external_base_url- Optional; when using anexternalmodel source, you can specify your own base URL. (e.g.,"https://api.openai.com"). The API must be OpenAIchat/completions-compatible.max_tokens– Optional; maximum number of tokens the judge model can generate. Defaults to 32768. Increase for reasoning models (for example, Gemini or o-series) that consume output token budget for chain-of-thought.temperature– Optional; sampling temperature for the judge model. Defaults to 0.05.num_workers– Optional; number of concurrent workers for judge inference requests. Defaults:serverless→ 25,dedicated→ 5 (minimum),external→ 2 for first-party APIs (OpenAI, Anthropic, Google) or 20 for proxy/aggregator endpoints (e.g. OpenRouter). Override this to tune throughput for your workload.
- labels (required): List of strings defining the classification categories
- pass_labels (optional): List of labels considered as “passing” for statistics
- model_to_evaluate (required): Configuration for the model being evaluated
- Can be either:
- A string referencing a column in your dataset (e.g.,
"prompt") - A model configuration object (see below)
- A string referencing a column in your dataset (e.g.,
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
model– Choose from supported serverless models; formodel_source = "dedicated", use your dedicated endpoint. Whenmodel_source = "external", you can specify either a model name shortcut (e.g.,openai/gpt-5), or provide a model name for an OpenAI-compatible URL. For more details, see the notes below.model_source– Literal: “serverless” | “dedicated” | “external” (required)external_api_token– Optional; required whenmodel_source = "external". If you selectexternalmodel source, use this to provide API bearer authentication token (eg. OpenAI token)external_base_url- Optional; when using anexternalmodel source, you can specify your own base URL. (e.g.,"https://api.openai.com"). The API must be OpenAIchat/completions-compatible.system_template– Jinja2 template for generation instructions (see Understanding Templates)input_template– Jinja2 template for formatting input (see Understanding Templates)max_tokens– Maximum tokens for generationtemperature– Temperature setting for generationnum_workers– Optional; number of concurrent workers for inference requests. Defaults:serverless→ 25,dedicated→ 5 (minimum),external→ 2 for first-party APIs or 20 for proxy endpoints. Override to tune throughput for your workload.
Model source options:
"serverless"- Any Together serverless model with structured outputs support"dedicated"- Your dedicated endpoint ID"external"- External models via shortcuts or custom OpenAI-compatible APIs
Evaluating external models
You can evaluate models from external providers like OpenAI, Anthropic, or Google by settingmodel_source = "external" in the model_to_evaluate configuration. Use a supported shortcut or provide a custom external_base_url for OpenAI-compatible APIs.
Using external models as judges
You can use external models as the judge by settingjudge.model_source = "external" and providing judge.external_api_token in the parameters. Use a supported shortcut or specify judge.external_base_url for custom OpenAI-compatible endpoints.
Evaluation Type: Score
Purpose: Rates input on a numerical scale (e.g., quality score from 1-10) Parameters:- judge (required): Configuration for the judge model
model– The model to use for evaluationmodel_source– One of: “serverless”, “dedicated”, or “external”system_template– Jinja2 template providing guidance for the judge (see Understanding Templates)external_api_token– Optional; required whenmodel_source = "external". If you selectexternalmodel source, use this to provide API bearer authentication token (eg. OpenAI token)external_base_url- Optional; when using anexternalmodel source, you can specify your own base URL. (e.g.,"https://api.openai.com"). The API must be OpenAIchat/completions-compatible.max_tokens– Optional; maximum number of tokens the judge model can generate. Defaults to 32768. Increase for reasoning models (for example, Gemini or o-series) that consume output token budget for chain-of-thought.temperature– Optional; sampling temperature for the judge model. Defaults to 0.05.num_workers– Optional; number of concurrent workers for judge inference requests. Defaults:serverless→ 25,dedicated→ 5 (minimum),external→ 2 for first-party APIs (OpenAI, Anthropic, Google) or 20 for proxy/aggregator endpoints (e.g. OpenRouter). Override this to tune throughput for your workload.
- min_score (required): Minimum score the judge can assign (float)
- max_score (required): Maximum score the judge can assign (float)
- pass_threshold (optional): Score at or above which is considered “passing”
- model_to_evaluate (required): Configuration for the model being evaluated
- Can be either:
- A string referencing a column in your dataset
- A model configuration object (same structure as in Classify)
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
Evaluating external models
You can evaluate models from external providers like OpenAI, Anthropic, or Google by settingmodel_source = "external" in the model_to_evaluate configuration. Use a supported shortcut or provide a custom external_base_url for OpenAI-compatible APIs.
Using external models as judges
You can use external models as the judge by settingjudge.model_source = "external" and providing judge.external_api_token in the parameters. Use a supported shortcut or specify judge.external_base_url for custom OpenAI-compatible endpoints.
Evaluation Type: Compare
Purpose: Determines which of two models performs better on the same task Parameters:- judge (required): Configuration for the judge model
model– The model to use for evaluationmodel_source– One of: “serverless”, “dedicated”, or “external”system_template– Jinja2 template providing guidance for comparison (see Understanding Templates)external_api_token– Optional; required whenmodel_source = "external". If you selectexternalmodel source, use this to provide API bearer authentication token (eg. OpenAI token)external_base_url- Optional; when using anexternalmodel source, you can specify your own base URL. (e.g.,"https://api.openai.com"). The API must be OpenAIchat/completions-compatible.max_tokens– Optional; maximum number of tokens the judge model can generate. Defaults to 32768. Increase for reasoning models (for example, Gemini or o-series) that consume output token budget for chain-of-thought.temperature– Optional; sampling temperature for the judge model. Defaults to 0.05.num_workers– Optional; number of concurrent workers for judge inference requests. Defaults:serverless→ 25,dedicated→ 5 (minimum),external→ 2 for first-party APIs (OpenAI, Anthropic, Google) or 20 for proxy/aggregator endpoints (e.g. OpenRouter). Override this to tune throughput for your workload.
- model_a (required): Configuration for the first model
- Can be either:
- A string referencing a column in your dataset
- A model configuration object
- Can be either:
- model_b (required): Configuration for the second model
- Can be either:
- A string referencing a column in your dataset
- A model configuration object
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
- disable_position_bias_correction (optional, default:
false): Whenfalse(default), the judge runs twice per sample — once in the original order (A then B) and once in the flipped order (B then A) — and the two verdicts are reconciled to cancel out position bias. Set totrueto run only the original-order pass, halving judge cost and latency at the expense of position-bias correction.
Default (two-pass): The judge evaluates each sample twice with model positions swapped to correct for position bias. When both verdicts agree the winner is declared; when they disagree the result is a “Tie”.When
disable_position_bias_correction: true (single-pass): Only one judge pass is run (original order). This roughly halves judge cost and latency. Use this when speed or cost matters more than bias correction, or when your judge is known to be position-insensitive.When both model_a and model_b are model configuration objects (not pre-generated column references), their inference runs execute in parallel, reducing total wall-clock time.Evaluating external models
You can compare models from external providers like OpenAI, Anthropic, or Google by settingmodel_source = "external" in the model configuration. Use a supported shortcut or provide a custom external_base_url for OpenAI-compatible APIs.
Using external models as judges
You can use external models as the judge by settingjudge.model_source = "external" and providing judge.external_api_token in the parameters. Use a supported shortcut or specify judge.external_base_url for custom OpenAI-compatible endpoints.
JSON
JSON
"file-95c8f0a3-e8cf-43ea-889a-e79b1f1ea1b9"
4. View Results
We provide comprehensive results without omitting lines from the original file unless errors occur (up to 30% may be omitted in error cases).Result Formats by Evaluation Type
Classify Results (ClassifyEvaluationResult):
| Field | Type | Description |
|---|---|---|
error | string | Present only when job fails |
label_counts | object<string, int> | Count of each label assigned (e.g., {"positive": 45, "negative": 30}) |
pass_percentage | float | Percentage of samples with labels in pass_labels |
generation_fail_count | int | Failed generations when using model configuration |
judge_fail_count | int | Samples the judge couldn’t evaluate |
invalid_label_count | int | Judge responses that couldn’t be parsed into valid labels |
result_file_id | string | File ID for detailed row-level results |
ScoreEvaluationResult):
| Field | Type | Description |
|---|---|---|
error | string | Present only on failure |
aggregated_scores.mean_score | float | Mean of all numeric scores |
aggregated_scores.std_score | float | Standard deviation of scores |
aggregated_scores.pass_percentage | float | Percentage of scores meeting pass threshold |
failed_samples | int | Total samples that failed processing |
invalid_score_count | int | Scores outside allowed range or unparseable |
generation_fail_count | int | Failed generations when using model configuration |
judge_fail_count | int | Samples the judge couldn’t evaluate |
result_file_id | string | File ID for per-sample scores and feedback |
CompareEvaluationResult):
| Field | Type | Description |
|---|---|---|
error | string | Present only on failure |
A_wins | int | Count where Model A was preferred |
B_wins | int | Count where Model B was preferred |
Ties | int | Count where judge found no clear winner |
generation_fail_count | int | Failed generations from either model |
judge_fail_count | int | Samples the judge couldn’t evaluate |
result_file_id | string | File ID for detailed pairwise decisions |
Downloading Result Files
Pass any
result_file_id to the Files API to download a complete report for auditing or deeper analysis. Each line in the result file has an evaluation_status field (True or False) indicating if the line was processed without issues.- Original input data
- Generated responses (if applicable)
- Judge’s decision and feedback
evaluation_statusfield indicating if processing succeeded (True) or failed (False)
JSON
Understanding Templates
Templates are used throughout the Evaluations API to dynamically inject data from your dataset into prompts. Bothsystem_template and input_template parameters support Jinja2 templating syntax.
Jinja2 templates allow you to inject columns from the dataset into the system_template or input_template for either the judge or the generation model.
Examples
- You can specify a reference answer for the judge:
"Please use the reference answer: {{reference_answer_column_name}}"
- You can provide a separate instruction for generation for each example:
"Please use the following guidelines: {{guidelines_column_name}}"
- You can specify any column(s) as input for the model being evaluated:
"Continue: {{prompt_column_name}}"
- You can also reference nested fields from your JSON input:
"{{column_name.field_name}}"
- And many more options are supported.
Basic Example
If your dataset contains:JSON
Python
Text
Nested Data Example
For complex structures:JSON
Python
Best Practices
- Provide clear judge instructions: Write detailed, structured system prompts with examples and explicit rules for the judge to follow
- Choose appropriate judge models: Use larger, more capable models as judges than the models being evaluated
- Test your templates: Verify that your Jinja2 templates correctly format your data before running large evaluations
Python
Example: Classification System Prompt
Here’s an example of a well-structured system prompt for a classify evaluation that determines whether model responses are harmful:Python
- Clear role definition: Explicitly states the evaluator’s single purpose
- Structured procedure: Step-by-step evaluation process
- Specific criteria: Well-defined categories with examples
- Decision rules: Clear instructions for edge cases