TL;DR
In this post, we explore how leading inference providers perform on the GPT-OSS-120B model using benchmarks from Artificial Analysis. You'll learn what matters most when evaluating inference platforms, including throughput, time to first token, and cost efficiency. We compare Vertex AI, Azure, AWS, Databricks, Clarifai, Together AI, Fireworks, Nebius, CompactifAI, and Hyperbolic on their performance and deployment efficiency.
Introduction
Large language models (LLMs) like GPT-OSS-120B, an open-weight 120-billion-parameter mixture-of-experts model, are designed for advanced reasoning and multi-step generation. Reasoning workloads consume tokens rapidly and place heavy demands on compute, so deploying these models in production requires inference infrastructure that delivers low latency, high throughput, and low cost.
Differences in hardware, software optimizations, and resource allocation strategies can lead to large variations in latency, efficiency, and cost. These differences directly affect real-world applications such as reasoning agents, document understanding systems, and copilots, where even small delays can impact overall responsiveness and throughput.
To evaluate these differences objectively, independent benchmarks have become essential. Instead of relying on vendors' internal performance claims, open, data-driven evaluations offer a more transparent way to assess how different platforms perform under real workloads.
In this post, we compare leading GPU-based inference providers using the GPT-OSS-120B model as a reference benchmark. We examine how each platform performs across key inference metrics such as throughput, time to first token, and cost efficiency, and how these trade-offs affect performance and scalability for reasoning-heavy workloads.
Before diving into the results, let's take a quick look at Artificial Analysis and how its benchmarking framework works.
Artificial Analysis Benchmarks
Artificial Analysis (AA) is an independent benchmarking initiative that runs standardized tests across inference providers to measure how models like GPT-OSS-120B perform in real scenarios. Its evaluations focus on realistic workloads involving long contexts, streaming outputs, and reasoning-heavy prompts rather than short, synthetic samples.
You can explore the full GPT-OSS-120B benchmark results here.
Artificial Analysis evaluates a range of performance metrics, but here we focus on the three key factors that matter most when choosing an inference platform for GPT-OSS-120B: time to first token, throughput, and cost per million tokens.
- Time to First Token (TTFT): The time between sending a prompt and receiving the model's first token. Lower TTFT means output starts streaming sooner, which is critical for interactive applications and multi-step reasoning where delays can disrupt the flow.
- Throughput (tokens per second): The rate at which tokens are generated once streaming begins. Higher throughput shortens total completion time for long outputs and allows more concurrent requests, directly affecting scalability for large-context or multi-turn workloads.
- Cost per million tokens (blended price): A combined metric that weights both input and output token pricing. It gives a clear view of operational costs for extended contexts and streaming workloads, helping teams plan for predictable expenses.
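To make these definitions concrete, here is a minimal Python sketch of how the three metrics can be computed from timestamps you record yourself around a streamed request. The timing fields, example numbers, and prices are hypothetical, and the 3:1 input:output weighting in the blended price is one common benchmarking convention rather than a universal rule; check each provider's pricing page for exact terms.

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    """Timestamps (seconds) and token counts recorded around one streamed request."""
    sent_at: float          # when the prompt was sent
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the final output token arrived
    input_tokens: int
    output_tokens: int

def ttft(t: StreamTiming) -> float:
    """Time to first token: delay before output starts streaming."""
    return t.first_token_at - t.sent_at

def throughput(t: StreamTiming) -> float:
    """Output tokens per second once streaming has begun."""
    return t.output_tokens / (t.finished_at - t.first_token_at)

def blended_price(input_price_per_m: float, output_price_per_m: float,
                  input_ratio: float = 3.0, output_ratio: float = 1.0) -> float:
    """Blended $ per 1M tokens as a weighted mix of input and output pricing
    (3:1 input:output is a common benchmarking convention)."""
    total = input_ratio + output_ratio
    return (input_price_per_m * input_ratio + output_price_per_m * output_ratio) / total

# Hypothetical example: a 1,000-token prompt that streams 500 output tokens
timing = StreamTiming(sent_at=0.0, first_token_at=0.32, finished_at=1.32,
                      input_tokens=1000, output_tokens=500)
print(f"TTFT: {ttft(timing):.2f} s")
print(f"Throughput: {throughput(timing):.0f} tok/s")
print(f"Blended: ${blended_price(0.10, 0.34):.2f} per 1M tokens")
```

Note that TTFT and throughput are deliberately separated: a provider can start streaming quickly yet generate slowly, or vice versa, which is why the benchmarks report both.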
Benchmark Methodology
- Prompt Size: The benchmarks covered in this blog use a 1,000-token input prompt run by Artificial Analysis, reflecting a typical real-world scenario such as a chatbot query or a reasoning-heavy instruction. Benchmarks for significantly longer prompts are also available and can be explored for reference here.
- Median Measurements: The reported values represent the median (p50) over the last 72 hours, capturing sustained performance trends rather than single-point spikes or dips. For the most up-to-date benchmark results, visit the Artificial Analysis GPT-OSS-120B model providers page here.
- Metrics Focus: This summary highlights time to first token (TTFT), throughput, and blended price to provide a practical view for workload planning. Other metrics, such as end-to-end response time, latency by input token count, and time to first answer token, are also measured by Artificial Analysis but are not included in this overview.
With this methodology in mind, we can now compare how different GPU-based platforms perform on GPT-OSS-120B and what those results imply for reasoning-heavy workloads.
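The choice of the median (p50) rather than the mean is worth a quick illustration: a p50 over a 72-hour window is robust to transient spikes such as cold starts or congestion, which would badly distort an average. The sketch below uses hypothetical TTFT samples to show the difference:

```python
from statistics import median

# Hypothetical per-request TTFT samples (seconds) collected over a window.
# A single cold-start spike barely moves the median, unlike the mean.
ttft_samples = [0.31, 0.33, 0.30, 0.34, 0.32, 4.80, 0.31, 0.33]

p50 = median(ttft_samples)
mean = sum(ttft_samples) / len(ttft_samples)
print(f"p50 TTFT:  {p50:.2f} s")   # ~0.32 s, close to typical behavior
print(f"mean TTFT: {mean:.2f} s")  # ~0.88 s, dragged up by one outlier
```

This is why a provider's p50 figures describe sustained behavior you can plan around, while tail behavior would need percentiles like p95 or p99, which are outside the scope of this overview.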
Provider Comparison (GPT-OSS-120B)
Clarifai
- Time to First Token: 0.32 s
- Throughput: 544 tokens/s
- Blended Price: $0.16 per 1M tokens
- Notes: Extremely high throughput; low latency; cost-efficient; a strong choice for reasoning-heavy workloads.
Key Features:
- GPU fractioning and autoscaling options for efficient compute utilization
- Local runners to execute models on your own hardware for testing and development
- On-prem, VPC, and multi-site deployment options
- Control Center for monitoring and managing usage and performance
Google Vertex AI
- Time to First Token: 0.40 s
- Throughput: 392 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Moderate latency and throughput; suitable for general-purpose reasoning workloads.
Key Features:
- Integrated AI tools (AutoML, training, deployment, monitoring)
- Scalable cloud infrastructure for batch and online inference
- Enterprise-grade security and compliance
Microsoft Azure
- Time to First Token: 0.48 s
- Throughput: 348 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Slightly higher latency; balanced performance and cost for general workloads.
Key Features:
- Comprehensive AI services (ML, cognitive services, custom bots)
- Deep integration with the Microsoft ecosystem
- Global enterprise-grade infrastructure
Hyperbolic
- Time to First Token: 0.52 s
- Throughput: 395 tokens/s
- Blended Price: $0.30 per 1M tokens
- Notes: Higher cost than peers; good throughput for reasoning-heavy tasks.
AWS
- Time to First Token: 0.64 s
- Throughput: 252 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Lower throughput and higher latency; suitable for less time-sensitive workloads.
Key Features:
- Broad AI/ML service portfolio (Bedrock, SageMaker)
- Global cloud infrastructure
- Enterprise-grade security and compliance
Databricks
- Time to First Token: 0.36 s
- Throughput: 195 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Lower throughput; acceptable latency; better for batch or background tasks.
Key Features:
- Unified analytics platform (Spark + ML + notebooks)
- Collaborative workspace for teams
- Scalable compute for large ML/AI workloads
Together AI
- Time to First Token: 0.25 s
- Throughput: 248 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Very low latency; moderate throughput; good for real-time reasoning-heavy applications.
Key Features:
- Real-time inference and training
- Cloud/VPC-based deployment orchestration
- Flexible and secure platform
Fireworks AI
- Time to First Token: 0.44 s
- Throughput: 482 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: High throughput and balanced latency; suitable for interactive applications.
CompactifAI
- Time to First Token: 0.29 s
- Throughput: 186 tokens/s
- Blended Price: $0.10 per 1M tokens
- Notes: Low cost; lower throughput; best for cost-sensitive workloads with smaller concurrency needs.
Key Features:
- Efficient, compressed models for cost savings
- Simplified deployment on AWS
- Optimized for high-throughput batch inference
Nebius Base
- Time to First Token: 0.66 s
- Throughput: 165 tokens/s
- Blended Price: $0.26 per 1M tokens
- Notes: Significantly lower throughput and higher latency; may struggle with reasoning-heavy or interactive workloads.
Key Features:
- Basic AI service endpoints
- Standard cloud infrastructure
- Suitable for steady-demand workloads
Best Providers Based on Price and Throughput
Selecting the right inference provider for GPT-OSS-120B requires weighing time to first token, throughput, and cost against your workload. Platforms like Clarifai offer high throughput, low latency, and a competitive price, making them well-suited for reasoning-heavy or interactive tasks. Other providers, such as CompactifAI, prioritize lower cost but come with reduced throughput, which may suit cost-sensitive or batch-oriented workloads. The optimal choice depends on which trade-offs matter most for your applications.
Best for Price
- CompactifAI: Lowest blended price at $0.10 per 1M tokens, with a solid 0.29 s TTFT but lower throughput.
- Clarifai: $0.16 per 1M tokens while also leading on throughput and latency.
Best for Throughput
- Clarifai: Highest throughput at 544 tokens/s with low first-token latency.
- Fireworks AI: Strong throughput at 482 tokens/s with moderate latency.
- Hyperbolic: Good throughput at 395 tokens/s; higher cost, but viable for heavy workloads.
Performance and Flexibility
Along with price and throughput, flexibility is critical for real-world workloads. Teams often need control over scaling behavior, GPU utilization, and deployment environments to manage cost and efficiency.
Clarifai, for example, supports fractional GPU usage, autoscaling, and local runners, features that can improve efficiency and reduce infrastructure overhead.
These capabilities extend beyond GPT-OSS-120B. With the Clarifai Reasoning Engine, custom or open-weight reasoning models can run with consistent performance and reliability. The engine also adapts to workload patterns over time, gradually improving speed on repetitive tasks without sacrificing accuracy.
Benchmark Summary
So far, we've compared providers based on throughput, latency, and cost using the Artificial Analysis benchmark. To see how these trade-offs play out in practice, here's a visual summary of the results across the different providers. These charts come directly from Artificial Analysis.
The first chart highlights output speed vs. price, while the second compares latency vs. output speed.
Output Speed vs. Price
Latency vs. Output Speed
Below is a detailed comparison table summarizing the key metrics for GPT-OSS-120B inference across providers.
| Provider | Throughput (tokens/s) | Time to First Token (s) | Blended Price ($ per 1M tokens) |
|---|---|---|---|
| Clarifai | 544 | 0.32 | 0.16 |
| Google Vertex AI | 392 | 0.40 | 0.26 |
| Microsoft Azure | 348 | 0.48 | 0.26 |
| Hyperbolic | 395 | 0.52 | 0.30 |
| AWS | 252 | 0.64 | 0.26 |
| Databricks | 195 | 0.36 | 0.26 |
| Together AI | 248 | 0.25 | 0.26 |
| Fireworks AI | 482 | 0.44 | 0.26 |
| CompactifAI | 186 | 0.29 | 0.10 |
| Nebius Base | 165 | 0.66 | 0.26 |
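Once the numbers are in tabular form, simple derived rankings are easy to compute yourself. The sketch below, plain Python with the data copied from the table above, ranks providers by throughput per blended dollar, one possible way to fold speed and cost into a single figure (it deliberately ignores TTFT, which you may want to weight as well):

```python
# The comparison table above as data:
# (provider, throughput tok/s, TTFT s, blended $ per 1M tokens)
providers = [
    ("Clarifai",         544, 0.32, 0.16),
    ("Google Vertex AI", 392, 0.40, 0.26),
    ("Microsoft Azure",  348, 0.48, 0.26),
    ("Hyperbolic",       395, 0.52, 0.30),
    ("AWS",              252, 0.64, 0.26),
    ("Databricks",       195, 0.36, 0.26),
    ("Together AI",      248, 0.25, 0.26),
    ("Fireworks AI",     482, 0.44, 0.26),
    ("CompactifAI",      186, 0.29, 0.10),
    ("Nebius Base",      165, 0.66, 0.26),
]

# One simple composite view: tokens per second per blended dollar.
ranked = sorted(providers, key=lambda p: p[1] / p[3], reverse=True)
for name, tps, ttft, price in ranked[:3]:
    print(f"{name}: {tps / price:,.0f} tok/s per $ (TTFT {ttft:.2f} s)")
```

Different workloads call for different scoring functions; a latency-sensitive chat application might rank by TTFT first and treat throughput per dollar as a tiebreaker.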
Conclusion
Choosing an inference provider for GPT-OSS-120B involves balancing throughput, latency, and cost. Each provider handles these trade-offs differently, and the best choice depends on the specific workload and performance requirements.
Providers with high throughput excel at reasoning-heavy or interactive tasks, while those with lower median throughput may be better suited to batch or background processing where speed is less critical. Latency also plays a key role: low time to first token improves responsiveness for real-time applications, while slightly higher latency may be acceptable for less time-sensitive tasks.
Cost considerations remain important. Some providers offer strong performance at low blended prices, while others trade efficiency for price. Benchmarks covering throughput, time to first token, and blended price provide a clear basis for understanding these trade-offs.
Ultimately, the right provider depends on the engineering problem, the workload characteristics, and which trade-offs matter most for the application.
Learn more about Clarifai's reasoning engine
The fastest AI inference and reasoning on GPUs.
Verified by Artificial Analysis