The Together Fine-tuning API accepts two data formats for training dataset files: text data and tokenized data (in the form of Parquet files). Below, you can learn about the different types of these formats and the scenarios in which each is most useful.
For padding tokens, attention_mask and labels should be set to 0 and -100, respectively, so that the model ignores the padding tokens during prediction and excludes them from the loss calculation. If you need full control over the attention_mask values or want to apply tokenization customizations unique to your setup, you can use the Parquet format as well.

Text datasets must be in the .jsonl format, with fields that indicate the dataset format. You can have other fields, but they will be ignored during training. To speed up the data uploading and processing steps and to maximize the number of examples per file, we recommend removing the unused fields.
Also, if the data has two or more possible formats (e.g., it contains both text and messages), the Together client will show an error at the file check stage before the upload.
For the conversational format, each line must contain a messages field, with role and content specified for each message. Each sample should start with either a system or user message, followed by alternating user and assistant messages. The Together client will reject any dataset that does not follow this pattern.
Optionally, you can add a weight field to any message to control its contribution to the training loss. Messages with weight=0 will be masked during training (their tokens won't contribute to the loss), while messages with weight=1 (default) will be included. Only values 0 and 1 are supported for the weight field.
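For illustration, a single training example combining these pieces might look like the line below (the content is made up, and each example must be one JSON object on a single line); here the first assistant reply is masked with weight=0:

```jsonl
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "And of Italy?"}, {"role": "assistant", "content": "The capital of Italy is Rome.", "weight": 1}]}
```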
By default, the loss is computed only on assistant messages. Use --train-on-inputs true to include other messages in training. See the API Reference for details. Note that if any message in the conversation has a weight field, the train-on-inputs setting will be automatically set to true to respect the provided weights.
For the instruction format, each line must contain prompt and completion fields.
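An illustrative example line (the content is made up; as above, each example is one JSON object per line):

```jsonl
{"prompt": "Translate to French: Good morning!", "completion": "Bonjour !"}
```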
By default, only completion tokens contribute to the loss. Use --train-on-inputs true to include prompts in training. See the API Reference for details.
There are example datasets in this format that you can download from the Hugging Face Hub.
For the plain text format, each line should contain a single text field.
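An illustrative example line (made-up content):

```jsonl
{"text": "Machine learning is a field of study in artificial intelligence concerned with building systems that learn from data."}
```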
Tokenized datasets must be in the .parquet format, with the following fields in each example:

- input_ids (required): List of token ids to be fed to the model.
- attention_mask (required): List of indices specifying which tokens should be attended to by the model.
- labels (optional): List of token ids to be used as target predictions. The default token id to be ignored in the loss calculation is -100. To ignore certain predictions in the loss, replace their corresponding values with -100. If this field is not given, input_ids will be used.

The example below prepares tokenized data for the NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT model. The max sequence length of this model is 32768. To compare the differences between packing and padding, we will run the script twice, with and without --packing. When packing is not applied, each example will be (left-)padded with the tokenizer's own pad token to keep the length of all examples consistent. Note that packing is used during training by default, and we recommend using packing during the tokenization step by passing --packing in the example script. Also note that labels are shifted internally during model training, so you do not need to shift them yourself.
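As a rough sketch (not the official example script), the tokenization step could look like the code below. The --packing flag, output file names, left-padding behavior, and -100 labels follow the description above; the toy texts, constant names, and use of transformers plus pandas/pyarrow are assumptions made for illustration.

```python
import argparse

import pandas as pd
from transformers import AutoTokenizer

MODEL = "NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT"
MAX_LENGTH = 32768   # max sequence length of this model
IGNORE_INDEX = -100  # label value excluded from the loss calculation


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--packing",
        action="store_true",
        help="Concatenate examples into max-length sequences instead of padding.",
    )
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    # Illustrative raw texts; in practice these would come from your own dataset.
    texts = [
        "First training document ...",
        "Second training document ...",
    ]

    rows = []
    if args.packing:
        # Packing: concatenate all token ids, then cut them into MAX_LENGTH chunks.
        all_ids = []
        for text in texts:
            all_ids.extend(tokenizer(text)["input_ids"])
        for start in range(0, len(all_ids), MAX_LENGTH):
            chunk = all_ids[start:start + MAX_LENGTH]  # final chunk may be shorter
            rows.append({
                "input_ids": chunk,
                "attention_mask": [1] * len(chunk),
                "labels": list(chunk),  # no shifting needed; it is done internally
            })
        out_path = "processed_dataset_packed.parquet"
    else:
        # Padding: left-pad every example to MAX_LENGTH with the tokenizer's own
        # pad token (32000 for this tokenizer, as noted below).
        pad_id = tokenizer.pad_token_id
        for text in texts:
            ids = tokenizer(text, truncation=True, max_length=MAX_LENGTH)["input_ids"]
            pad_len = MAX_LENGTH - len(ids)
            rows.append({
                "input_ids": [pad_id] * pad_len + ids,
                "attention_mask": [0] * pad_len + [1] * len(ids),
                # Padded positions get -100 so they are excluded from the loss.
                "labels": [IGNORE_INDEX] * pad_len + ids,
            })
        out_path = "processed_dataset_padded.parquet"

    pd.DataFrame(rows).to_parquet(out_path)
    print(f"Saved {len(rows)} examples to {out_path}")


if __name__ == "__main__":
    main()
```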
When the script is run with --packing, processed_dataset_packed.parquet will be saved under the same directory; when it is run without --packing, processed_dataset_padded.parquet will be saved there instead.
Let's load the generated files to see the results in Python.
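A minimal way to do this, using pandas (assumed here; the datasets library would work as well), with variable names that mirror the discussion below:

```python
import pandas as pd

# Load the two Parquet files produced by the tokenization step above.
dataset_packed = pd.read_parquet("processed_dataset_packed.parquet")
dataset_padded = pd.read_parquet("processed_dataset_padded.parquet")

first_padded = dataset_padded.iloc[0]
print(first_padded["input_ids"][:5])   # leading pad tokens
print(first_padded["labels"][:5])      # -100 at the padded positions

first_packed = dataset_packed.iloc[0]
print(first_packed["input_ids"][:5])   # real tokens, no padding
```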
In the first example of dataset_padded, you will find that the first 31140 tokens are padding tokens and have -100 as their labels, so they are ignored in the loss calculation. The pad token id for this tokenizer is 32000. In dataset_packed, no padding is used, and the first 1628 token ids match the last 1628 token ids of the first example of dataset_padded.