What are GPU Clusters?

Together GPU Clusters provide on-demand access to high-performance GPU infrastructure for training, fine-tuning, and running large-scale AI workloads. Create clusters in minutes with features like real-time scaling, persistent storage, and support for both Kubernetes and Slurm workload managers.

Concepts

Kubernetes Cluster Architecture

Each GPU cluster is built on Kubernetes, providing a robust container orchestration platform. The architecture includes:
  • Control Plane – Manages cluster state, scheduling, and API access
  • Worker Nodes – GPU-equipped nodes that run your workloads
  • Networking – High-speed InfiniBand for multi-node communication
  • Storage Layer – Persistent volumes, local NVMe, and shared storage
You interact with the cluster using standard Kubernetes tools like kubectl, or through higher-level abstractions like Slurm.
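For example, a quick way to inspect the worker nodes and their GPU capacity is the official Kubernetes Python client (pip install kubernetes). This is a minimal sketch, assuming your cluster's kubeconfig is already downloaded and active locally:

```python
# Minimal sketch: list worker nodes and their advertised GPU capacity.
# Assumes the cluster's kubeconfig has been downloaded and selected
# (e.g. with `kubectl config use-context <your-cluster>`).
from kubernetes import client, config

config.load_kube_config()   # reads ~/.kube/config by default
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs")
```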

Slurm on Kubernetes via Slinky

For users preferring HPC-style workflows, Together runs Slurm on top of Kubernetes using Slinky, an integration layer that bridges traditional HPC scheduling with cloud-native infrastructure:
  • Slurm Controller – Runs as Kubernetes pods, managing job queues and scheduling
  • Login Nodes – SSH-accessible entry points for job submission
  • Compute Nodes – GPU workers registered with both Kubernetes and Slurm
  • Unified Resource Management – Slinky translates Slurm jobs into Kubernetes pods while maintaining Slurm’s familiar interface
This architecture gives you the simplicity of sbatch and srun commands while leveraging Kubernetes’ reliability, scalability, and ecosystem.
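In practice, submission works exactly as it does on a traditional Slurm cluster. As a minimal sketch, the following submits a two-node GPU job programmatically from a login node; train.py, the job name, and the resource sizes are placeholders to adapt to your workload:

```python
# Minimal sketch: submit a multi-node GPU job via sbatch, which reads
# the batch script from stdin when no file is given.
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=train-demo
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:8
    srun python train.py
    """)

result = subprocess.run(
    ["sbatch"], input=batch_script, text=True,
    capture_output=True, check=True,
)
print(result.stdout.strip())  # e.g. "Submitted batch job 1234"
```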

Key Features

  • Fast provisioning – Clusters ready in minutes, not hours or days
  • Flexible scaling – Scale up or down in real time to match workload demands
  • Persistent storage – Long-lived, resizable shared storage with high throughput
  • Multiple workload managers – Choose between Kubernetes or Slurm-on-Kubernetes
  • Full API access – Manage clusters via REST API, CLI, or Terraform (see the sketch after this list)
  • Enterprise integration – Works with SkyPilot and other orchestration tools
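As one illustration of scripted management, the sketch below drives the REST API from Python. The base URL, route, and response shape here are assumptions for illustration only; consult the API reference for the documented endpoints and schemas:

```python
# Hedged sketch of cluster management over REST. The route below is a
# HYPOTHETICAL placeholder, not a documented endpoint.
import os
import requests

API_BASE = "https://api.together.xyz"   # assumed base URL
headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

resp = requests.get(f"{API_BASE}/v1/gpu-clusters", headers=headers)  # hypothetical route
resp.raise_for_status()
print(resp.json())
```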

Available Hardware

Choose from the latest NVIDIA GPU configurations:
  • NVIDIA HGX B200 – Latest generation for maximum performance
  • NVIDIA HGX H200 – Enhanced memory for large models
  • NVIDIA HGX H100 SXM – High-bandwidth training and inference
  • NVIDIA HGX H100 SXM (Inference) – Optimized for single-node inference workloads
All nodes except the inference-optimized variants feature high-speed InfiniBand networking for multi-node training.

Capacity Options

GPU Clusters offer two billing modes to match different workload patterns and budget requirements. You can choose Reserved capacity for predictable, sustained workloads with cost savings, or On-demand capacity for flexible, pay-as-you-go usage.

Reserved Capacity

Reserve GPU capacity upfront for a commitment period of 1 to 90 days at discounted rates.
How It Works:
  • Upfront payment – Your payment method is charged (or credits are deducted) when the cluster is provisioned
  • Fixed duration – Reserve capacity for 1 to 90 days
  • Discounted pricing – Lower rates compared to on-demand
  • Automatic decommission – Clusters are decommissioned when the reservation expires
When to Use Reserved:
  • Predictable workloads where you know the duration
  • Multi-day training runs or experiments
  • Cost optimization with discounted rates
  • Planned workloads with specific commitments
Extending Reservations: To extend a reservation beyond its initial period, contact your account team.

On-demand Capacity

Pay only for what you use with hourly billing and no upfront commitment.
How It Works:
  • Hourly billing – Pay per hour of cluster runtime
  • No commitment – Terminate anytime without penalty
  • Flexible – Scale up and down as needed
  • Standard pricing – Higher per-hour rates than reserved capacity
When to Use On-demand:
  • Variable or unpredictable resource needs
  • Short-term experiments or development work
  • Exploratory testing before committing to longer runs
  • Temporary capacity needs beyond reserved baseline

Mixing Capacity Types

You can combine reserved and on-demand capacity in the same cluster for optimal cost and flexibility:
  1. Start with reserved capacity for your baseline workload (e.g., reserve 8xH100 for 30 days)
  2. Add on-demand capacity during peak periods (e.g., scale to 16xH100 temporarily)
  3. Scale back down when burst period ends – on-demand capacity is removed, reserved capacity remains
Any usage beyond your reserved capacity is automatically billed at on-demand rates.
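As a rough back-of-the-envelope view of how the two rates combine in the example above, here is a sketch with placeholder rates (these are not Together's prices; see the Pricing page for real numbers):

```python
# Placeholder rates for illustration only -- NOT actual pricing.
RESERVED_RATE = 2.00     # assumed $/GPU-hour, reserved
ON_DEMAND_RATE = 3.00    # assumed $/GPU-hour, on-demand

reserved_gpus, reserved_days = 8, 30     # 8xH100 baseline for 30 days
burst_gpus, burst_hours = 8, 48          # 8 extra GPUs for a 48-hour peak

reserved_cost = reserved_gpus * reserved_days * 24 * RESERVED_RATE
burst_cost = burst_gpus * burst_hours * ON_DEMAND_RATE
print(f"reserved: ${reserved_cost:,.2f}, burst: ${burst_cost:,.2f}, "
      f"total: ${reserved_cost + burst_cost:,.2f}")
```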

Choosing the Right Type

Choose Reserved if:
  • ✓ You know the duration of your workload
  • ✓ You’re running multi-day training or experiments
  • ✓ Cost optimization is important
  • ✓ You can commit to a specific period
Choose On-demand if:
  • ✓ Your resource needs are unpredictable
  • ✓ You’re running short experiments
  • ✓ You need maximum flexibility
  • ✓ You’re in development/testing phase
Mix Both if:
  • ✓ You have a predictable baseline with occasional bursts
  • ✓ You want cost savings on steady-state workload
  • ✓ You need flexibility for peak periods

Storage

Clusters include multiple storage tiers:
  • Shared volumes – Multi-NIC bare-metal storage with high throughput, persistent across the cluster lifecycle
  • Local NVMe – Fast local disks on each node for temporary data
  • Shared /home – NFS-mounted home directories for code and configs
Storage can be dynamically resized as your data grows. Learn more about storage →
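As a minimal sketch of requesting and later growing a shared volume through the Kubernetes API (the storage class name is an assumption; run kubectl get sc to see what your cluster actually provides):

```python
# Minimal sketch: create a shared PersistentVolumeClaim, then expand it.
# Dict bodies keep this independent of client model-class versions.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "shared-storage",   # assumed class name
        "resources": {"requests": {"storage": "500Gi"}},
    },
}
v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)

# Growing the claim later is a patch to spec.resources.requests, which
# succeeds only if the storage class allows volume expansion.
v1.patch_namespaced_persistent_volume_claim(
    name="training-data", namespace="default",
    body={"spec": {"resources": {"requests": {"storage": "1Ti"}}}},
)
```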

Workload Management

Kubernetes

Use standard Kubernetes workflows with kubectl to:
  • Deploy pods and jobs
  • Manage persistent volumes
  • Access the Kubernetes Dashboard
  • Integrate with existing K8s tooling
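For instance, deploying a job might look like the following sketch using the official Python client; the image and command are placeholders for a real workload:

```python
# Minimal sketch: run a one-GPU smoke-test Job that prints nvidia-smi.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

container = client.V1Container(
    name="smoke-test",
    image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
    command=["nvidia-smi"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```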

Slurm

For HPC-style workflows, use Slurm with:
  • Direct SSH access to login nodes
  • Familiar commands (sbatch, srun, squeue)
  • Job arrays for distributed processing
  • Traditional batch scheduling
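Job arrays, for example, fan one script out over many shards. A minimal sketch, with process.py and the shard count as placeholders:

```python
# Minimal sketch: submit a 10-task Slurm job array, one GPU per task.
import subprocess
import textwrap

array_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=shard-process
    #SBATCH --array=0-9
    #SBATCH --gres=gpu:1
    srun python process.py --shard "$SLURM_ARRAY_TASK_ID"
    """)

subprocess.run(["sbatch"], input=array_script, text=True, check=True)
```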
Learn more about Slurm →

Getting Started

Ready to create your first cluster?
  1. Follow the Quickstart guide for step-by-step instructions
  2. Review the Capacity Options above to choose the right billing mode
  3. Check Pricing for current rates

Support

  • Capacity unavailable? Use the “Notify Me” option to get alerts when capacity comes online
  • Questions or custom requirements? Contact support@together.ai