Reinforcement fine-tuning (RFT) is a technique for improving reasoning models by training them through a reward-based process, rather than relying only on labeled data. RFT helps models develop better reasoning and problem-solving skills, especially in cases where labeled examples are limited or complex behaviors are desired.
Note
The fine-tuning service automatically pauses RFT jobs once they hit $5,000 in total training costs (training + grading). Users can deploy the most recent checkpoint or resume the training job. If the user decides to resume the job, billing continues for the job with no further cost-based limits.
Model support
Reinforcement fine-tuning is supported for the following models:
| Model | Version |
|---|---|
| o4-mini | 2025-04-16 |
| gpt-5 | 2025-08-07 (preview) |
GPT-5 support for reinforcement fine-tuning is in private preview and might not be available in your subscription.
Requirements
Reinforcement fine-tuning (RFT) requires training and validation data formatted as JSONL and containing a messages array using the chat completions format.
However, RFT has more requirements:
- Data
  - The final "message" in the data must be assigned a user role.
  - The data can contain extra fields and values for use by a grader.
  - Both a training and a validation dataset must be provided.
- Graders
  - A grader must be defined to score the quality of your fine-tuned model and guide learning.
  - Only a single grader can be provided, but multiple graders can be combined using a multigrader.
Example training data
The following example shows how to present prompts to the model and include ground truth accessible to a grader.
{
"messages": [
{
"role": "developer",
"content": "Your task is to solve logic puzzles. The user will provide an expression with ?'s as placeholders for arithmetic operations. Replace the ?'s with arithmetic operation signs (+, -, *, /) to obtain a valid equation."
},
{
"role": "user",
"content": "1 ? 2 ? 3 ? 4 ? 5 ? 6 ? 7 ? 8 ? 9 = 100"
}
],
"solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 * 9 = 100"
}
Note
This example is split across multiple lines for demonstration purposes only. It must be a single line in your JSONL file.
Graders
Graders provide the reward function used during training and have access to any user-supplied fields in the dataset. Multiple types of graders are available:
- text comparison: score response content based on its text
- model: score responses using a language model and prompt
- custom code: score responses using custom code
- multigrader: score based on a combination of scores from other graders
Most graders perform substitution of runtime data via templates. Any input or reference property can include a variable substitution enclosed in double curly braces ({{ }}) that contains a reference to a variable.
Each template reference must be namespaced using a pattern like {{ namespace.variable }}. For complex, nested data, a JSON-path-like syntax is supported.
The following namespaces are supported:
- sample: model output to be graded appears under the sample namespace in a format similar to a chat completions response.
- item: optional; extra fields provided in the training data appear under the item namespace.
Some examples of template substitution that use the above namespaces:
- {{ sample.output_text }}: substitute the model output as a string
- {{ sample.output_json }}: if the model produced structured outputs, reference it as JSON
- {{ item.answer }}: substitute the "answer" field in the dataset
- {{ item.ground_truth.date }}: substitute the "date" field of a "ground_truth" object defined in the dataset
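To illustrate the item namespace with nested data, assume a training record that carries a hypothetical ground_truth object alongside its messages (the field names below are illustrative only). The template {{ item.ground_truth.date }} then resolves to "2024-01-15", and {{ sample.output_text }} resolves to whatever text the model produced for that prompt:
{
  "messages": [
    {
      "role": "user",
      "content": "When was the agreement signed?"
    }
  ],
  "ground_truth": {
    "date": "2024-01-15"
  }
}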
The following sections document individual graders and provide their JSON specification for defining via the API.
Text comparison graders
Use text comparison graders when the use case requires the model to output a definitive label or when the output must resemble a known ground-truth answer.
String check grader
String-check graders apply a given operation to the input and a reference to return a 0 or 1, providing a simple pass/fail function.
{
"type": "string_check",
"name": string,
"operation": "eq" | "ne" | "like" | "ilike",
"input": string,
"reference": string,
}
Operations:
- eq: Returns 1 if the input matches the reference (case-sensitive), 0 otherwise
- ne: Returns 1 if the input doesn't match the reference (case-sensitive), 0 otherwise
- like: Returns 1 if the input contains the reference (case-sensitive), 0 otherwise
- ilike: Returns 1 if the input contains the reference (not case-sensitive), 0 otherwise
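For example, a string check grader for the logic puzzle training data shown earlier could compare the model output against the stored solution field. This is a minimal sketch; it assumes the model is instructed to output only the completed equation:
{
  "type": "string_check",
  "name": "solution check",
  "operation": "eq",
  "input": "{{ sample.output_text }}",
  "reference": "{{ item.solution }}"
}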
Text similarity
Text-similarity graders compute a score using a selected algorithm that quantifies the similarity between the input text and a given reference text.
Specification:
{
"type": "text_similarity",
"name": string,
"input": string,
"reference": string,
"pass_threshold": number,
"evaluation_metric": "fuzzy_match" | "bleu" | "gleu" | "meteor" | "rouge_1" | "rouge_2" | "rouge_3" | "rouge_4" | "rouge_5" | "rouge_l"
}
Evaluation metrics:
- fuzzy_match: fuzzy string match, using the RapidFuzz algorithm
- bleu: computes the BLEU (bilingual evaluation understudy) score between strings
- gleu: computes the Google BLEU score between strings
- meteor: computes the METEOR score between strings
- rouge_*: ROUGE scores as defined by the rouge Python library
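For example, a text-similarity grader could score a generated summary against a reference stored in a hypothetical summary field of the training data, passing when the fuzzy match score reaches 0.75:
{
  "type": "text_similarity",
  "name": "summary similarity",
  "input": "{{ sample.output_text }}",
  "reference": "{{ item.summary }}",
  "pass_threshold": 0.75,
  "evaluation_metric": "fuzzy_match"
}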
Model graders
Model graders pass a prompt to a grader model that instructs it how to evaluate and score a given response. This flexibility allows you to prompt engineer complex graders, including graders that explain the reasoning behind a given score.
The following models can be used as model graders:
- gpt-4o-2024-08-06
- o3-mini-2025-01-31
Model graders do not require model deployments in Foundry.
Score model
Score model graders output a numeric score based on their given input and prompt. Any provided sampling_params control the behavior of the scoring model and allow for customizing things like temperature and reasoning effort.
{
"type": "score_model",
"name": string,
"input": Message[],
"model": string,
"pass_threshold": number,
"range": number[],
"sampling_params": object
}
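The following is a minimal sketch of a score model grader. The grading prompt, the hypothetical item.reference_answer field, and the sampling parameters are illustrative assumptions; the input messages use the same template substitution described earlier, and the grader model is one of the supported models listed above:
{
  "type": "score_model",
  "name": "answer relevance",
  "model": "gpt-4o-2024-08-06",
  "input": [
    {
      "role": "system",
      "content": "Score how well the answer matches the reference on a scale from 0 to 1. Output only the score."
    },
    {
      "role": "user",
      "content": "Reference: {{ item.reference_answer }}\nAnswer: {{ sample.output_text }}"
    }
  ],
  "range": [0, 1],
  "pass_threshold": 0.5,
  "sampling_params": {
    "temperature": 0
  }
}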
Code graders
Model graders provide a flexible but nondeterministic way to grade responses. When determinism is important, code graders provide an alternative approach to defining scoring logic.
Python grader
The Python grader allows you to execute arbitrary Python code to produce a score.
The provided code must define a grade function expecting two positional arguments: sample and item. The function must return a numeric score.
{
"type": "python",
"name": string,
"source": "def grade(sample, item):\n return 1.0"
}
The Python code executes in a constrained environment with the following limitations:
- The provided code must be less than 256 kB in size.
- No network access is provided.
- Two (2) GB of memory and one (1) GB of disk space are available for use.
- One (1) CPU core is available.
- The total runtime of the grader is limited to two (2) minutes.
Tip
Your code should handle any possible errors and always return a numeric value. If too many exceptions occur during the execution of the grader, the training job fails.
Within the Python runtime, the following modules and versions are made available for use by the provided code:
- numpy==2.2.4
- scipy==1.15.2
- sympy==1.13.3
- pandas==2.2.3
- rapidfuzz==3.10.1
- scikit-learn==1.6.1
- rouge-score==0.1.2
- deepdiff==8.4.2
- jsonschema==4.23.0
- pydantic==2.10.6
- pyyaml==6.0.2
- nltk==3.9.1
- sqlparse==0.5.3
- rdkit==2024.9.6
- scikit-bio==0.6.3
- ast-grep-py==0.36.2
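As a fuller sketch for the logic puzzle example earlier in this article, the following Python grader validates the arithmetic in the model output. It assumes the model outputs only the completed equation and that the output is available as sample["output_text"], mirroring the sample.output_text template; all exceptions are caught so the grader always returns a numeric value, as recommended above:
{
  "type": "python",
  "name": "equation checker",
  "source": "def grade(sample, item):\n    try:\n        output = sample[\"output_text\"].strip()\n        # Split the equation and check that the left side evaluates to the right side.\n        lhs, rhs = output.split(\"=\")\n        return 1.0 if eval(lhs) == float(rhs) else 0.0\n    except Exception:\n        # Always return a numeric score, even for malformed output.\n        return 0.0"
}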
Endpoint grader (preview)
Endpoint graders call a remote endpoint via an HTTP API to score the model response. They're ideal for use cases requiring access to ground truth for accurate scoring or the ability to implement the grader in a language other than Python.
While in private preview, the API for endpoint graders remains unpublished.
Multigrader
A multigrader combines the output of multiple graders to produce a single score based on an arithmetic expression provided in calculate_output.
{
"type": "multi",
"name": string,
"graders": dict[str, Grader],
"calculate_output": string
}
When a multigrader computes the score, the calculate_output expression references the individual scores from the provided graders by the key in the graders object.
Operators:
- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- ^ (power)
Functions:
- min: compute the minimum of the provided values
- max: compute the maximum of the provided values
- abs: compute the absolute value
- floor: round the value down
- ceil: round the value up
- exp: compute e to the power of the provided value
- sqrt: take the square root of the value
- log: compute the logarithm of the provided value
As an example, a multigrader that averages the outputs of two graders, "similarity_score" and "label_checker", could look like:
{
"type": "multi",
"name": "Example multigrader",
"graders": {
"similarity_score": {
"type": "text_similarity",
"name": "similarity grader",
"input": "{{ sample.output_text }}",
"reference": "{{ item.summary }}",
"evaluation_metric": "bleu"
},
"label_checker": {
"type": "string_check",
"name": "label grader",
"input": "{{ sample.output_text }}",
"reference": "{{ item.label }}",
"operation": "eq"
}
},
"calculate_output": "(similarity_score + label_checker) / 2"
}
Response format (optional)
The model can be made to produce structured outputs during training, either to align with the model's intended use case or to make grading the output easier.
The response format configuration follows the same specification as Chat Completions, either supporting text (the default) or JSON. When the model should output JSON, a JSON Schema must be provided.
To continue with the previous example, if the model must output the response in a structured format such as:
{ "solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 * 9 = 100" }
The following JSON schema describes the response format:
{
"type": "json_schema",
"name": "puzzles_assistant",
"schema": {
"type" : "object",
"properties": {
"solution": {
"type": "string",
"title": "solution"
}
},
"required": [
"solution",
],
"additionalProperties": false
},
"strict": true
}
Hyperparameter selection
Reinforcement fine-tuning supports the same hyperparameters as Supervised fine-tuning. Additionally, the following hyperparameters control features specific to RFT:
| Hyperparameter name | Value | Default | Description |
|---|---|---|---|
| eval_interval | integer | auto | The number of training steps between evaluation runs. |
| eval_samples | integer | auto | The number of samples to use during evaluation. |
| compute_multiplier | number | auto | The multiplier on the amount of compute used to explore the space during training. |
| reasoning_effort | low, medium, high | medium | The reasoning effort used by the model during training. |
Note
The training service automatically replaces hyperparameters set to auto with defaults based on heuristics on the provided training data.
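For example, to override the automatic defaults, the RFT-specific hyperparameters can be expressed as a JSON object in the job configuration. The values below are illustrative only, and the enclosing job-creation payload (model, training file, grader, and so on) is omitted and may vary by API version:
{
  "hyperparameters": {
    "eval_interval": 5,
    "eval_samples": 64,
    "compute_multiplier": 1.0,
    "reasoning_effort": "medium"
  }
}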
Interpreting training results
Reinforcement fine-tuning provides both automatic evaluations of the model during training and real-time training metrics.
Training metrics
When you monitor a running job or inspect a completed job, the "reward" and "reasoning" metrics provide an indicator of training success.
Reward
Reward metrics track the resulting scores from the grader acting as the reward function.
- train_reward_mean: the average reward across the batch of training data at a given step. Because each batch might be different across steps, the trend of this metric is more important than comparing values across steps.
- valid_reward_mean: the average reward across the samples taken from the validation set at a given step.
Reward metrics should generally increase over the course of the training job. If the training and validation rewards diverge significantly, it's a sign the model might be reward hacking and the grader requires more engineering.
Reasoning tokens
Each training job tracks the number of reasoning tokens produced by the model. Reasoning token metrics capture how the model changes its behavior over the lifetime of the training job.
- train_reasoning_tokens_mean: the average number of reasoning tokens produced across the batch of training data at a given step.
- valid_reasoning_tokens_mean: the average number of reasoning tokens produced across the validation data at a given step.
The model might learn to use fewer reasoning tokens to achieve the same reward, or it might learn to use more reasoning tokens to achieve a higher reward. These metrics typically rise and fall during the training job.
Automatic evaluations
An evaluation is created automatically for each RFT job. At regular intervals defined by the eval_interval hyperparameter, the training system executes an evaluation run using the validation data. Scores for each run are available via the linked evaluation, discoverable from the Foundry user interface.
Inspecting these evaluations provides an extra data point for deciding on early-stopping. If the model exhibits learning during training, the results of each evaluation run should improve over the lifetime of the job.
Example projects and datasets
The following example demos and datasets provide starting points for new users of reinforcement fine-tuning:
- Countdown Demo - end-to-end demo of using RFT to improve mathematical reasoning.
- MedMCQ - sample dataset and graders for answering multiple-choice questions from the medical domain.
- ClauseMatching - sample dataset and graders showcasing both summarization and content interpretation in the legal domain.