
How long will it take for my job to start?

It depends. Factors that affect waiting time include the number of pending jobs from other customers, the number of jobs currently running, and available hardware. If there are no other pending jobs and hardware is available, your job should start within a minute of submission. Typically, jobs start within an hour of submission. However, waiting time is not guaranteed.

How long will my job take to run?

It depends. Factors that affect your job's run time include model size, training data size, and network conditions when downloading or uploading model and training files. You can estimate how long your job will take to complete training by multiplying the number of epochs by the time it takes to complete the first epoch.
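The rule of thumb above (epochs × first-epoch time) can be written as a tiny helper. This is a sketch only: it ignores queueing time and file transfer time, and the function names are illustrative rather than part of any Together tool.

```python
def estimate_training_time(first_epoch_seconds: float, n_epochs: int) -> float:
    """Rough total training time: number of epochs x duration of the first epoch.

    Ignores time spent waiting in the queue and downloading/uploading files.
    """
    if n_epochs < 1:
        raise ValueError("n_epochs must be at least 1")
    return first_epoch_seconds * n_epochs


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as H:MM:SS."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Example: the first epoch took 12.5 minutes and the job runs 3 epochs.
estimate = estimate_training_time(first_epoch_seconds=12.5 * 60, n_epochs=3)
print(format_duration(estimate))  # 0:37:30
```

Later epochs can run slightly faster or slower than the first, so treat the result as an approximation.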

How can I estimate my fine-tuning job cost?

To estimate the cost of your fine-tuning job:
  1. Calculate approximate training tokens: context_length × batch_size × steps_per_epoch × epochs
  2. Add validation tokens: validation_dataset_size (in tokens) × number of evaluations
  3. Multiply the total tokens by the per-token rate for your chosen model size, fine-tuning type, and implementation method

Fine-Tuning Pricing

Fine-tuning pricing is based on the total number of tokens processed during your job, including training and validation. Cost varies by model size, fine-tuning type (Supervised Fine-tuning or DPO), and implementation method (LoRA or Full Fine-tuning). The total cost is calculated as:

  total_cost = total_tokens_processed × per_token_rate

where:

  total_tokens_processed = (n_epochs × n_tokens_per_training_dataset) + (n_evals × n_tokens_per_validation_dataset)

For current rates, refer to our fine-tuning pricing page. The exact token count and final price are available after tokenization completes, shown in your jobs dashboard or via:

  $ together fine-tuning retrieve $JOB_ID
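As an illustration, the pricing formula can be evaluated before you submit a job. This is a minimal sketch: the `FineTuneJobPlan` class and the helper names are hypothetical, and the rate in the example is a made-up placeholder, not a published price. Use the rate from the pricing page for your model size, fine-tuning type, and implementation method.

```python
from dataclasses import dataclass


@dataclass
class FineTuneJobPlan:
    """Inputs to the pricing formula; field names mirror the formula's terms."""
    n_epochs: int
    n_tokens_per_training_dataset: int
    n_evals: int = 0
    n_tokens_per_validation_dataset: int = 0


def total_tokens_processed(plan: FineTuneJobPlan) -> int:
    # (n_epochs x training tokens) + (n_evals x validation tokens)
    training = plan.n_epochs * plan.n_tokens_per_training_dataset
    validation = plan.n_evals * plan.n_tokens_per_validation_dataset
    return training + validation


def estimate_cost(plan: FineTuneJobPlan, per_token_rate: float) -> float:
    """total_tokens_processed x per_token_rate, in the rate's currency."""
    return total_tokens_processed(plan) * per_token_rate


# Hypothetical job: 3 epochs over a 2M-token training set and 2 evaluations
# over a 100k-token validation set. The rate (USD 0.50 per million tokens)
# is a placeholder for illustration only.
plan = FineTuneJobPlan(n_epochs=3, n_tokens_per_training_dataset=2_000_000,
                       n_evals=2, n_tokens_per_validation_dataset=100_000)
print(total_tokens_processed(plan))                     # 6200000
print(f"${estimate_cost(plan, 0.50 / 1_000_000):.2f}")  # $3.10
```

Your own token counts are estimates until tokenization completes; the dashboard or `together fine-tuning retrieve` shows the exact figures.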

Dedicated Endpoint Charges for Fine-Tuned Models

After fine-tuning, hosting charges apply for dedicated endpoints. These charges are billed per minute, even when the endpoint is not in use, are separate from job costs, and continue until you stop the endpoint. To avoid unexpected charges, stop your endpoint whenever you are not using it.

Why am I getting an error when uploading a training file?

There are two common issues you may encounter:
  • Your API key may be incorrect. A 403 status code indicates that your API key is incorrect.
  • Your balance may be less than the job minimum. We verify that your account balance is at least the minimum job charge ($5). If your balance is insufficient, you can increase your account limit by adding a credit card to your account, adjusting your spending limit if you already have a credit card, or paying your outstanding account balance. If you have sufficient balance on your account, contact support for assistance.

Why was my job cancelled?

There are two reasons that a job may be automatically cancelled:
  • You do not have sufficient balance on your account to cover the cost of the job.
  • You have entered an incorrect WandB API key.

You can determine why your job was cancelled by:
  1. Checking the events list for your job with the Together CLI:
     $ together list-events <job-fine-tune-id>
  2. Using the web interface: https://api.together.ai > Jobs > Cancelled Job > Events List

The following is an example of an event log in the web Jobs tab for a job where the billing limit was reached:

What should I do if my job is cancelled due to billing limits?

You can add a credit card to your account to increase your spending limit. If you already have a credit card on your account, you can make a payment or adjust your spending limit. Contact support if you need assistance with your account balance.

Understanding Refunds When Canceling Fine-Tuning Jobs

When you cancel a running fine-tuning job, you may notice that you don’t receive a full refund of the original amount charged. This is because our fine-tuning billing system works based on the actual steps completed.

How Fine-Tuning Refunds Work

When you cancel a fine-tuning job:
  • You are charged for the steps that have already been completed, accounting for the hardware resources used during that time
  • You receive a refund only for the steps that have not yet been completed

Checking Your Job Progress

To understand how many steps have been completed, you can use the following API call:

  client.fine_tuning.retrieve("your-job-id").total_steps

Replace "your-job-id" with your actual fine-tuning job ID. This will show you how far the job has progressed. If you have questions about a specific fine-tuning job’s billing, please contact support with your job ID for a detailed breakdown.
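For a rough sense of the refund before contacting support, the "charged for completed steps, refunded for the rest" rule can be approximated as below. This sketch assumes, hypothetically, that a job's price is spread evenly across its steps; actual billing also accounts for the hardware resources used, so the support breakdown is authoritative. The inputs would come from your job details; it is an assumption here that the object returned by `client.fine_tuning.retrieve(...)` exposes fields such as `total_price` and `steps_completed` alongside `total_steps`, so check the SDK reference before relying on those names.

```python
def estimate_refund(total_price: float, total_steps: int, steps_completed: int) -> float:
    """Approximate refund for a cancelled job.

    Assumes (hypothetically) that the job's price is spread evenly across its
    steps; real billing also accounts for hardware usage, so treat the result
    as a ballpark figure only.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= steps_completed <= total_steps:
        raise ValueError("steps_completed must be between 0 and total_steps")
    remaining_steps = total_steps - steps_completed
    return total_price * remaining_steps / total_steps


# Hypothetical job: $20.00 charged, cancelled after 100 of 400 steps.
print(f"${estimate_refund(20.0, 400, 100):.2f}")  # $15.00
```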

Why was there an error while running my job?

If your job fails after downloading the training file but before training starts, the most likely source of the error is the training data. For example, your event log might look like this:

You can verify the formatting of your input file with the Together CLI using the following command:

  $ together files check ~/Downloads/unified_joke_explanations.jsonl
  {
    "is_check_passed": true,
    "model_special_tokens": "we are not yet checking end of sentence tokens for this model",
    "file_present": "File found",
    "file_size": "File size 0.0 GB",
    "num_samples": 356
  }

Despite our best efforts, the file checker does not catch all errors. Please contact support if your training data file passes the checks but you are still seeing the error conditions described above.

If you see an error during other steps of your training job, it may be due to internal errors in our training stack (e.g., hardware failure or bugs). We actively monitor job failures and work to resolve these issues as quickly as we can. Once our engineers have resolved the issue, your job will be restarted, either automatically or manually. Charges for the job that had to be restarted will be refunded.
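Before running `together files check`, a quick local pass can catch the most basic JSONL problems. This is a minimal sketch, not a replacement for the CLI checker: it only confirms that every non-blank line parses as a JSON object and counts the samples (comparable to `num_samples` in the checker's output). It does not validate the fields or special tokens a particular model or fine-tuning type expects, and `precheck_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def precheck_jsonl(path):
    """Return (num_samples, errors) for a JSONL training file.

    Blank lines are skipped. Every other line must parse as a JSON object;
    problems are reported with their 1-based line numbers.
    """
    num_samples = 0
    errors = []
    with Path(path).expanduser().open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines rather than counting them
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                errors.append(f"line {lineno}: expected a JSON object")
                continue
            num_samples += 1
    return num_samples, errors
```

For example, `precheck_jsonl("~/Downloads/unified_joke_explanations.jsonl")` returns the sample count and a list of problem lines; an empty list means the file at least parses, after which the CLI checker can do the model-specific checks.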

How do I know if my job was restarted?

A job will be restarted, automatically or manually, if it fails to complete due to an internal error. You can view the event log to see whether the job was restarted, find the new fine-tune ID of the restarted job, and check the refund amount (if applicable). Any charges from the failed job will be refunded when your job is restarted. Here is an example event log for a restarted job:

Common Error Codes During Fine-Tuning

| Code | Cause | Solution |
|------|-------|----------|
| 401 | Missing or invalid API key | Ensure you are using the correct API key and supplying it correctly |
| 403 | Input token count + max_tokens parameter exceeds model context length | Set max_tokens to a lower number. For chat models, you may set max_tokens to null |
| 404 | Invalid endpoint URL or model name | Check that your request is made to the correct endpoint and that the model is available |
| 429 | Rate limit exceeded | Throttle your request rate (see rate limits) |
| 500 | Invalid request | Ensure valid JSON, a correct API key, and the proper prompt format for the model type |
| 503 | Engine overloaded | Try again after a brief wait. Contact support if the problem persists |
| 504 | Timeout | Try again after a brief wait. Contact support if the problem persists |
| 524 | Cloudflare timeout | Try again after a brief wait. Contact support if the problem persists |
| 529 | Server error | Try again after a wait. Contact support if the problem persists |
If you encounter other errors or these solutions don’t work, contact support.
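The "try again after a wait" advice for 429, 503, 504, 524, and 529 is commonly automated with exponential backoff and jitter. The sketch below shows that pattern; `RequestFailed` is a hypothetical stand-in for however your HTTP client or SDK reports a status code, and the delay values are illustrative defaults rather than documented limits. Other codes in the table (401, 403, 404, 500) point to a problem in the request itself, so they are raised immediately instead of retried.

```python
import random
import time

# Codes whose suggested fix is to wait and try again
# (429 is included because backing off is one way to throttle).
RETRYABLE_STATUS_CODES = {429, 503, 504, 524, 529}


class RequestFailed(Exception):
    """Hypothetical error type carrying the HTTP status of a failed request."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(f"HTTP {status_code}: {message}")
        self.status_code = status_code


def call_with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RequestFailed as exc:
            # Non-retryable codes need a fix to the request, not a retry;
            # retryable ones are re-raised once attempts are exhausted.
            if exc.status_code not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the backoff can be tested without actually waiting; in normal use leave it as `time.sleep`. If errors persist after the final attempt, contact support as the table suggests.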

Can I download the weights of my model?

If you would like to use your fine-tuned model outside of our platform, you can download its weights. To download the weights of a fine-tuned model, run:

  $ together fine-tuning download <FT-ID>

This command downloads ZSTD-compressed weights of the model. To extract the weights, run:

  $ tar -xf <filename>

Other arguments:
  • --output, -o (filename, optional): Specify the output filename. Default: <MODEL-NAME>.tar.zst
  • --step, -s (integer, optional): Download a specific checkpoint’s weights. Defaults to the latest weights. Default: -1
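If you script your downloads, the documented options map directly onto an argument list. This is a minimal sketch that assumes the Together CLI is installed and on your PATH, and that your `tar` can read zstd-compressed archives; the helper names and the `ft-123` ID are placeholders, and only the `--output` and `--step` flags described above are used.

```python
import shlex
import subprocess


def build_download_command(ft_id, output=None, step=-1):
    """Assemble `together fine-tuning download` arguments from the documented options."""
    cmd = ["together", "fine-tuning", "download", ft_id]
    if output is not None:
        cmd += ["--output", output]
    if step != -1:  # -1 is the default and means "latest weights"
        cmd += ["--step", str(step)]
    return cmd


def download_and_extract(ft_id, output, step=-1):
    """Download the weights, then unpack the .tar.zst archive with tar.

    Requires the Together CLI on PATH and a tar build that can read
    zstd-compressed archives.
    """
    subprocess.run(build_download_command(ft_id, output, step), check=True)
    subprocess.run(["tar", "-xf", output], check=True)


# "ft-123" is a placeholder ID; this only prints the command it would run.
print(shlex.join(build_download_command("ft-123", output="my-model.tar.zst", step=2)))
# together fine-tuning download ft-123 --output my-model.tar.zst --step 2
```

Passing an explicit `output` keeps the archive name predictable, so the extraction step does not need to guess the default `<MODEL-NAME>.tar.zst` filename.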