Dedicated Containers provide a flexible way to run your own Dockerized workloads on managed GPU infrastructure. You supply the container image, and Together manages everything else: compute provisioning, autoscaling, networking, and observability. The platform is designed for teams that need full control over their runtime environment without the operational complexity of managing GPU clusters directly. With Together Deployments, you can:
- Deploy custom inference, data processing jobs, or long-running workers
- Scale workloads automatically based on demand, including down to zero
- Run queue-based or asynchronous jobs with built-in request handling
- Securely manage secrets, environment variables, and configuration
- Scale from a single replica to thousands of GPUs as traffic grows
## Platform Components
### Jig – Deployment CLI
A lightweight CLI for building, pushing, and deploying containers. Jig handles:

- Dockerfile generation from `pyproject.toml`
- Image building and pushing to Together’s registry
- Deployment creation and updates
- Secrets and volume management
- Log streaming and status monitoring
### Sprocket – Worker SDK
A Python SDK for building inference workers that integrate with Together’s job queue:

- Implement `setup()` and `predict(args) -> dict`
- Automatic file download and upload handling
- Progress reporting for long-running jobs
- Health checks and metrics endpoints
- Graceful shutdown support
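As an illustration, a minimal worker might look like the sketch below. Only `setup()` and `predict(args) -> dict` are documented here; the `sprocket` import path and `Worker` base class are assumptions, so treat this as a shape, not a confirmed API.

```python
# Hypothetical sketch of a Sprocket worker -- the import path and base
# class are assumptions; only setup() and predict() come from the docs.
from sprocket import Worker  # assumed import path

class EchoWorker(Worker):
    def setup(self):
        # Runs once at container startup: load weights, warm caches, etc.
        self.prefix = "echo: "

    def predict(self, args: dict) -> dict:
        # Runs once per job pulled from Together's queue.
        return {"output": self.prefix + args.get("prompt", "")}
```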
### Container Registry
A Together-hosted Docker registry at `registry.together.ai` for storing your container images. Images are private to your organization and referenced by digest for reproducible deployments.
## Available Hardware
Choose from high-performance NVIDIA GPU configurations:

| GPU Type | `gpu_type` value | Memory | Use Case |
|---|---|---|---|
| NVIDIA H100 SXM | `h100-80gb` | 80GB | Large models, high throughput |
| NVIDIA H100 MIG | `h100-40gb-mig` | 40GB | Cost-efficient option for smaller models |
| NVIDIA B200 | `b200-192gb` | 192GB | Next-generation hardware for the largest models |
| CPU-only | `none` | — | Lightweight preprocessing or embedding models |
For multi-GPU workloads, set `gpu_count` in your deployment and use `torchrun` for distributed inference.
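For example, a multi-GPU entrypoint might look like this (the script name is a placeholder):

```bash
# One process per GPU; match --nproc_per_node to gpu_count in your config.
torchrun --nproc_per_node=4 serve.py  # serve.py is a placeholder entrypoint
```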
## When to Use Dedicated Containers
Dedicated Containers are appropriate when:

- **You have a custom model or inference stack** – Custom architectures, fine-tuned models, or proprietary inference code
- **You’ve modified open-source engines** – Customized vLLM, SGLang, or other serving frameworks
- **You’re running media generation** – Audio, image, or video models with variable execution times
- **You need async or batch processing** – Long-running jobs that don’t fit the request-response pattern
- **You want full control** – Specific library versions, custom preprocessing, or non-standard runtimes
## How It Works
1. **Package your model as a Docker container** – Create a container with your runtime, dependencies, and inference code. Use Sprocket for queue integration or bring your own HTTP server.
2. **Configure your deployment** – Define GPU type, replica limits, autoscaling behavior, and environment variables in `pyproject.toml`.
3. **Deploy to Together** – Run `together beta jig deploy` to build, push, and create your deployment. Together provisions GPUs and starts your containers.
4. **Submit jobs** – Use the Queue API to submit jobs. Workers pull jobs from the queue, execute inference, and report results.
5. **Monitor and scale** – View logs, metrics, and job status. The autoscaler adjusts replica count based on queue depth.
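As a rough sketch, a deployment configuration could look like the following. The `[tool.jig]` table name is an assumption; the key names (`gpu_type`, `gpu_count`, `min_replicas`, `max_replicas`) come from this page.

```toml
# Hypothetical pyproject.toml deployment section -- the [tool.jig] table
# name is an assumption; key names follow this page's terminology.
[tool.jig]
gpu_type = "h100-80gb"  # see the hardware table above
gpu_count = 1
min_replicas = 0        # scale to zero when idle
max_replicas = 8
```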
## Monitoring and Observability
### Metrics
Each Sprocket worker exposes a `/metrics` endpoint with Prometheus-compatible metrics.
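The exact metrics depend on your worker. As a generic illustration of the Prometheus exposition format, with hypothetical metric names:

```text
# HELP jobs_processed_total Jobs completed by this worker (hypothetical)
# TYPE jobs_processed_total counter
jobs_processed_total 1027
# HELP jobs_in_progress Jobs currently being processed (hypothetical)
# TYPE jobs_in_progress gauge
jobs_in_progress 2
```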
### Logging

Access deployment logs with `together beta jig logs`.

### Health Checks

The platform monitors your deployment’s `/health` endpoint. Ensure it:
- Returns 200 when ready to accept jobs
- Returns 503 during startup or when unhealthy
- Responds within a reasonable timeout
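If you bring your own HTTP server, a minimal health endpoint could look like this sketch (FastAPI is one option among many; the readiness flag is illustrative):

```python
# Illustrative /health endpoint; flip `ready` once your model has loaded.
from fastapi import FastAPI, Response

app = FastAPI()
ready = False  # set to True after setup completes

@app.get("/health")
def health(response: Response):
    if ready:
        return {"status": "ok"}  # 200: ready to accept jobs
    response.status_code = 503   # 503: starting up or unhealthy
    return {"status": "starting"}
```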
## Autoscaling
### Configuration
Enable autoscaling in your `pyproject.toml`:
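A sketch of what that might look like; the `[tool.jig.autoscaling]` table name is a guess, while the metric name and target values come from the section below.

```toml
# Hypothetical autoscaling block -- the table name is an assumption.
[tool.jig.autoscaling]
min_replicas = 1
max_replicas = 16
metric = "QueueBacklogPerWorker"
target = 1.05  # recommended: 5% overprovisioning
```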
### Metrics
#### QueueBacklogPerWorker

Scales based on queue depth relative to worker count:

- `target = 1.0` — Exact match (queue depth equals worker count)
- `target = 1.05` — 5% overprovisioning (recommended)
- `target = 0.9` — Aggressive scaling (more workers than needed)

The autoscaler computes `desired_replicas = queue_depth / target`.
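For example, with 42 queued jobs, `target = 1.05` yields `42 / 1.05 = 40` replicas, while `target = 0.9` yields `42 / 0.9 ≈ 47`.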
#### CustomMetric
Scales based on any Prometheus metric your application exposes. Your worker must export the metric on its `/metrics` endpoint.
`custom_metric_name` is required and must match `^[a-zA-Z_:][a-zA-Z0-9_:]*$`. `target` defaults to `500` if not specified.
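For instance, scaling on a hypothetical `inflight_requests` gauge might look like this (the table name is again an assumption):

```toml
# Hypothetical example -- the metric name is illustrative.
[tool.jig.autoscaling]
metric = "CustomMetric"
custom_metric_name = "inflight_requests"  # must match the regex above
target = 100                              # defaults to 500 if omitted
```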
### Scaling Behavior
- **Scale Up**: When the metric exceeds the target, new replicas are added
- **Scale Down**: When the metric drops below the target, replicas are removed (respecting `min_replicas`)
- **Graceful Shutdown**: Workers complete their current job before terminating
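Sprocket handles graceful shutdown for you; if you bring your own worker loop, here is a rough sketch of the same behavior (the queue-pull and inference calls are placeholders):

```python
# Sketch: stop pulling new jobs on SIGTERM; let the in-flight job finish.
import signal

shutting_down = False

def _on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # stop accepting new work

signal.signal(signal.SIGTERM, _on_sigterm)

while not shutting_down:
    job = pull_next_job()  # placeholder: your queue-pull logic
    if job:
        process(job)       # placeholder: your inference call
```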
## Troubleshooting
### Common Issues
#### Container fails to start

Symptoms: Deployment status shows “failed” or “error”

Check:

- View logs: `together beta jig logs`
- Verify the health endpoint works locally
- Check for missing environment variables
- Ensure sufficient memory is allocated

#### Jobs are not being processed

Check:

- Deployment status: `together beta jig status`
- Queue status: `together beta jig queue-status`
- Worker logs for errors: `together beta jig logs --follow`
- Verify the `--queue` flag in the startup command

#### Out of memory

Try:

- Increase `memory` in the deployment config
- Use `device_map="auto"` for large models
- Enable gradient checkpointing if training
- Reduce batch size

#### Slow startup or health check timeouts

Try:

- Use volumes for model weights (faster than downloading)
- Pre-download models in the Dockerfile
- Increase the health check timeout
#### `torch.cuda.is_available()` returns False

Check:

- Verify `gpu_count >= 1` in config
- Check CUDA compatibility with the base image
- Ensure PyTorch is installed with CUDA support
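A quick check you can run inside the container:

```python
import torch

print(torch.cuda.is_available())  # should be True on a GPU replica
print(torch.cuda.device_count())  # should equal gpu_count
print(torch.version.cuda)         # CUDA version PyTorch was built against
```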
### Debug Mode
Enable debug logging:
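A common pattern is a log-level environment variable in your deployment configuration; the variable name below is hypothetical, so check the CLI reference for the exact switch.

```toml
# Hypothetical -- the exact debug switch may differ.
[tool.jig.env]
LOG_LEVEL = "debug"
```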
### Getting Help

- View deployment status: `together beta jig status`
- Check the queue: `together beta jig queue-status`
- Stream logs: `together beta jig logs --follow`
- Contact support with your deployment name and request IDs
## FAQs
### General

Q: What’s the difference between Sprocket and a regular HTTP server?

A: Sprocket integrates with Together’s managed job queue, providing automatic job distribution, status reporting, file handling, and graceful shutdown. Use Sprocket for batch/async workloads; use a regular HTTP server for low-latency request-response APIs.

Q: Can I use my own Dockerfile?

A: Yes. Set `dockerfile = "Dockerfile"` in your config and Jig will use your custom Dockerfile instead of generating one.
Q: How do I handle large model weights?
A: Use volumes (`together beta jig volumes create`) to upload weights once, then mount them at runtime. This is faster than including weights in the container image.
### Scaling
Q: How does autoscaling work?
A: The autoscaler monitors queue depth and worker utilization. When the queue backlog grows, it adds replicas. When workers are idle, it removes them (down to `min_replicas`).
Q: What’s the maximum number of replicas?
A: Set `max_replicas` in your config. The actual limit depends on your Together organization’s quota.
Q: How long does scaling take?
A: New replicas typically start within 1-2 minutes, depending on image size and model loading time.
### Jobs
Q: How long can a job run?
A: The default timeout is 5 minutes (`TERMINATION_GRACE_PERIOD_SECONDS`, default 300s). For longer jobs, increase this value in your deployment configuration.
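For example, allowing jobs up to 30 minutes might look like this; the variable name comes from the answer above, but its placement in the config is an assumption.

```toml
# Hypothetical placement -- the variable name is from the FAQ answer above.
[tool.jig.env]
TERMINATION_GRACE_PERIOD_SECONDS = "1800"  # 30 minutes
```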
Q: What happens if a job fails?
A: The job status is set to “failed” with error details. The worker remains healthy and continues processing other jobs.
Q: Can I retry failed jobs?
A: Resubmit the job with the same payload. Automatic retry is not currently supported.
### Billing
Q: How am I billed?
A: You’re billed for GPU-hours while replicas are running. Scale to zero (`min_replicas = 0`) when not in use to minimize costs.
Q: Are there costs for the queue?
A: Queue usage is included. You’re only billed for compute (running replicas).