What is the maximum context window supported by Together models?
The maximum context window varies significantly by model. Refer to the specific model’s documentation or the inference models page for the exact context length supported by each model.
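As a minimal sketch, assuming the Together Python SDK exposes a context_length field on each entry returned by models.list(), you can check a model's context window programmatically:

```python
# Sketch: listing models and their context lengths with the Together Python SDK.
# Assumes each entry returned by models.list() carries a context_length field;
# treat the model documentation as the authoritative source.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

for model in client.models.list():
    # Fall back gracefully if a given entry does not expose context_length.
    print(model.id, getattr(model, "context_length", "n/a"))
```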
What kind of latency can I expect for inference requests?
Latency depends on the model and prompt length. Smaller models like Mistral may respond in less than 1 second, while larger MoE models like Mixtral may take several seconds. Prompt caching and streaming can help reduce perceived latency.
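For example, here is a minimal streaming sketch using the Together Python SDK (the model name is only an example); tokens are printed as they arrive instead of waiting for the full completion:

```python
# Sketch: streaming a chat completion so tokens are rendered as they arrive.
from together import Together

client = Together()

stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; print it immediately to cut
    # perceived latency instead of waiting for the full response.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```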
Is Together suitable for high-throughput workloads?
Yes. Together supports production-scale inference. For high-throughput applications (e.g., over 100 RPS), contact the Together team for dedicated support and infrastructure.
Does Together support private or VPC-based deployments?
Yes. Together supports private networking and VPC-based deployments for enterprise customers with data residency or regulatory compliance requirements. Contact us for more information.
Does Together offer quantized models?
Yes. Together hosts some models with reduced-precision or quantized weights (e.g., FP8, FP16, INT4) for faster and more memory-efficient inference. Support varies by model.
Can I send batched or concurrent requests?
Yes. Together supports batching and high-concurrency usage. You can send parallel requests from your client and take advantage of backend batching. See Batch Inference for more details.
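As a minimal client-side concurrency sketch (assuming the Together Python SDK; the model name and prompts are placeholders), a thread pool fans out several requests in parallel so the backend can batch them:

```python
# Sketch: sending several requests in parallel from the client.
from concurrent.futures import ThreadPoolExecutor

from together import Together

client = Together()
prompts = ["What is RLHF?", "What is MoE?", "What is speculative decoding?"]

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves prompt order while the requests run concurrently.
    for answer in pool.map(complete, prompts):
        print(answer)
```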
Can I use Together inference with LangChain or LlamaIndex?
Yes. Together is compatible with LangChain via its OpenAI-compatible API interface: set your Together API key and model name in your environment or code. See more about all available integrations: LangChain, LlamaIndex, Hugging Face, Vercel AI SDK.
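A minimal LangChain sketch, assuming the langchain-openai package is installed and TOGETHER_API_KEY is set; it points LangChain's OpenAI-compatible chat model at Together's endpoint (the model name is an example):

```python
# Sketch: using a Together-hosted model through LangChain's ChatOpenAI.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example model name
)

print(llm.invoke("Name one benefit of mixture-of-experts models.").content)
```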