The Real Cost of Cheap GPUs

The 40-60% savings are real. So are the tradeoffs.

Alternative GPU clouds can cut training costs by 40-60%. I've seen this pattern before: when something sounds too good to be true in infrastructure, it usually comes with asterisks. The savings are real. So are the tradeoffs in reliability, security, and support that nobody in the "just use Vast.ai" crowd wants to talk about.

TL;DR

GPU marketplaces offer 40-60% savings over hyperscalers. The tradeoffs—reliability, security, support—are real. Works for training and experimentation with checkpoint/resume. Use hyperscalers for production inference, compliance, or deadline-critical work.

Related: Bootstrap vs VC in 2026: The Math Changed

Every founder I talk to has the same complaint: "We can't afford the GPU compute for AI." They've looked at AWS pricing, done the math, and concluded that serious AI work requires serious VC funding. They're often wrong, but not always, and the nuance matters more than the marketing.

I remember learning COBOL, FORTRAN, and PL/1 in college—time-sharing on a mainframe, submitting batch jobs through punch cards, waiting hours for results. Then PCs arrived and suddenly you owned your cycles. The cloud felt like going backward, paying by the hour for someone else's machine. Now we're watching the same cycle repeat with GPUs. The hyperscalers want you to believe their way is the only way. It's not. But the alternatives have real tradeoffs that the "just use Vast.ai" crowd glosses over.

I learned this distinction the expensive way.

The 97% Lesson

A couple years back, I was fine-tuning an ASR model. Four 3090s on a marketplace provider, maybe $1.20/hour total. I had time. The job would take weeks, but I was juggling other projects anyway. Check in occasionally, watch the loss curve drop, go back to real work. No rush.

I wasn't saving checkpoints externally. The instance had plenty of disk. Why pay for S3 transfers?

Three weeks in, the model hit 97% of target accuracy. I went to bed expecting to wake up to a finished fine-tune. Instead, I woke up to a terminated instance and an empty directory. The host had rebooted for maintenance. No warning. No checkpoint. Three weeks of compute time, gone.

I ended up renting 8 H100s at 4x the hourly rate to redo the job in days instead of weeks. The "savings" from those cheap 3090s cost me a month of calendar time and more money than doing it right from the start would have.

Here's what the math actually looked like:

| Approach | Config | Hourly Rate | Time | Compute Cost | Outcome |
|---|---|---|---|---|---|
| Plan A: "Cheap" | 4× RTX 3090 | ~$1.20/hr | 3 weeks | ~$600 | Lost everything |
| Plan B: Recovery | 8× H100 | ~$16/hr | 4 days | ~$1,500 | Completed |
| Actual total | | | ~1 month | ~$2,100 | Should've been $1,500 |

The H100 cluster was roughly 4-6× faster per GPU than the 3090s for transformer training, and I had twice as many of them. What took weeks on consumer hardware finished in days on data center silicon. The 13× difference in hourly rate was largely offset by the 8-12× difference in effective throughput: I paid far more per hour for far fewer hours.

That was the day I learned that marketplace GPUs aren't cheap if you don't design for failure. The hourly rate is only part of the cost. The real cost includes every hour you lose when (not if) something goes wrong.

Updated February 2026: Refreshed pricing data and added current provider comparisons. Market has matured significantly, with prices stabilizing and some reliability improvements.

The GPU Marketplace Landscape

While AWS, Azure, and GCP dominated enterprise GPU compute, a parallel market emerged. Companies like Vast.ai, RunPod, Lambda Labs, and TensorDock built GPU rental marketplaces with lower prices, but different tradeoffs.

The model varies by provider. Some aggregate idle capacity from data centers and research institutions, while others (like Vast.ai) include individual rig owners. The lower prices come from cutting enterprise sales teams, premium support, and SLA guarantees.

Current pricing comparison (as of early 2026):

| GPU | AWS (On-Demand) | Vast.ai | RunPod | Lambda |
|---|---|---|---|---|
| H100 SXM (NVLink) | $3.90/hr | | $2.69/hr | $2.49/hr |
| H100 PCIe | | $1.87-2.00/hr | $1.99/hr | |
| A100 80GB | ~$3.00/hr | $0.66-0.80/hr | $1.19-1.89/hr | $1.29/hr |
| RTX 4090 | N/A | $0.31-0.40/hr | $0.44/hr | N/A |

Important: AWS P5 instances are full 8-GPU nodes only: you cannot rent a single H100. While the per-GPU rate is ~$3.90/hr, your minimum hourly burn is ~$31/hr. Marketplace providers allow single-GPU rentals, making the actual barrier to entry ~16× lower. Additionally, AWS P5 uses H100 SXM with NVLink (900 GB/s GPU-to-GPU); most marketplace H100s are PCIe (64 GB/s). For single-GPU training, the interconnect doesn't matter. For multi-GPU training, verify you're comparing equivalent hardware. Verify current rates: AWS P5 · Vast.ai · RunPod · Lambda

But hourly rate is only half the story. Training speed determines your actual cost per job.

| GPU | VRAM | FP16 TFLOPS | Relative Speed | Marketplace $/hr | Effective $/job |
|---|---|---|---|---|---|
| RTX 3090 | 24GB | 35.6 | 1.0× (baseline) | ~$0.25 | Cheap but slow |
| RTX 4090 | 24GB | 82.6 | ~1.8× | ~$0.40 | Good value |
| A100 80GB | 80GB | 77.9 | ~2.2× | ~$0.70 | Best $/performance |
| H100 SXM | 80GB | 267 | ~4-6× | ~$1.90 | Fastest wall-clock |

Relative speed varies by workload. Transformer training favors high memory bandwidth (H100 advantage). Smaller models may not saturate H100 tensor cores. Benchmark source.

The counterintuitive insight is this: for time-sensitive work, H100s at 8× the hourly rate can be the cheaper choice overall. The raw compute bill runs somewhat higher, but finishing roughly 5× faster returns weeks of calendar time. The cheap option is only cheap if your time has zero value.
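
Pricing the job rather than the hour makes this concrete. A minimal sketch using the approximate rates and speedups from the table above; the 500-hour baseline and the 5× H100 figure are illustrative assumptions, not benchmarks:

# Rough cost-per-job comparison: the hourly rate alone is misleading.
# Rates and speedups are approximate figures from the table above.
BASELINE_JOB_HOURS = 500  # hypothetical job length on a single RTX 3090 (~3 weeks)

gpus = {
    # name: (marketplace $/hr, relative speed vs RTX 3090)
    "RTX 3090": (0.25, 1.0),
    "RTX 4090": (0.40, 1.8),
    "A100 80GB": (0.70, 2.2),
    "H100 SXM": (1.90, 5.0),  # midpoint of the 4-6x range
}

for name, (rate, speedup) in gpus.items():
    hours = BASELINE_JOB_HOURS / speedup
    cost = hours * rate
    print(f"{name:10s}  {hours:6.0f} h wall-clock  ~${cost:,.0f} compute")

With these placeholder numbers the H100 run costs a bit more in raw compute (~$190 vs ~$125 on the 3090) but finishes in about four days instead of three weeks, which is exactly the tradeoff the table is pointing at.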

The Real Tradeoffs

Let's be honest about what you give up for cheaper compute.

Reliability is genuinely worse. On marketplace platforms, instances get terminated unexpectedly. One Trustpilot reviewer wrote, "Rented a GPU instance for an important project, but the server was suddenly disconnected without warning." This isn't rare. It's the business model. Reviews consistently mention "a lot of bad / non working machines" and instance instability.

Security isolation varies wildly. Vast.ai explicitly states it "doesn't offer secure runtime isolation for executing untrusted or third-party code. There's no built-in sandboxing, syscall filtering, or container-level hardening." If you're training on proprietary data or sensitive IP, you're trusting individual host security practices. RunPod's "Secure Cloud" option addresses this with single-tenant machines, at higher prices.

Support is minimal. When something breaks at 2 AM, you're on your own. The hyperscalers have 24/7 support teams. The marketplaces have Discord channels. For hobby projects, this is fine. For production workloads with deadlines, it's a real risk.

Provider quality is inconsistent. On platforms with community hosts, "some hosts are excellent; others might have connectivity issues or slower drives." You're doing the QA that AWS handles internally.

Hardware isn't equivalent. A "4090" on a marketplace isn't the same as an H100 in a data center. Consumer GPUs thermal throttle under sustained load; that 4090 might drop from 450W TDP to 300W after 20 minutes of training when the host's cooling can't keep up. Data center GPUs have server-grade cooling and power delivery. You're paying less partly because you're getting less consistent compute per dollar-hour.

Network interconnects kill multi-GPU training. This is the one CTOs miss most often. Hyperscalers use InfiniBand (400-800 Gb/s, sub-microsecond latency) for GPU-to-GPU communication. Marketplace providers typically use Ethernet (25-100 Gb/s, higher latency). For single-GPU work, this doesn't matter. For distributed training across 8+ GPUs, the gradient sync overhead on Ethernet can add 30-50% to your training time. You're not just paying for slower GPUs. You're paying for slower communication between GPUs. Always verify the interconnect before committing to multi-node training on marketplace hardware.
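
To get a feel for the gradient-sync tax, here is a back-of-the-envelope estimate of per-step all-reduce time from model size and link bandwidth. It assumes fp16 gradients and a plain ring all-reduce over the slowest link, and ignores latency and communication/compute overlap, so treat the output as order-of-magnitude only:

# Order-of-magnitude all-reduce cost per optimizer step.
def allreduce_seconds(params_billion: float, n_gpus: int, link_gbps: float) -> float:
    grad_bytes = params_billion * 1e9 * 2                  # fp16 = 2 bytes/param
    wire_bytes = 2 * grad_bytes * (n_gpus - 1) / n_gpus    # ring all-reduce traffic
    return wire_bytes * 8 / (link_gbps * 1e9)              # bits / (bits per second)

for label, gbps in [("Ethernet 25 Gb/s", 25),
                    ("Ethernet 100 Gb/s", 100),
                    ("InfiniBand 400 Gb/s", 400)]:
    secs = allreduce_seconds(params_billion=7, n_gpus=8, link_gbps=gbps)
    print(f"{label:20s} ~{secs:.1f} s of gradient sync per step (7B model, 8 GPUs)")

Real frameworks hide much of this behind computation, but the gap between a few hundred milliseconds and several seconds per step is where the 30-50% slowdown comes from.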

Hardware Audit: Consumer GPUs vs. Data Center

Consumer cards like the RTX 4090 are designed for gaming sessions, meaning high bursts followed by idle periods. Running them at 100% utilization 24/7 exposes fundamental hardware limitations:

  • VRM (Voltage Regulator Module): Consumer boards use cheaper VRM components rated for gaming duty cycles, not sustained server loads. I've seen 4090s develop VRM instability after 2-3 months of continuous training.
  • Cooling: Air-cooled consumer cards throttle when ambient temps rise. A gaming PC in a bedroom is not a server room with 68°F controlled air.
  • Memory: Consumer GDDR6X runs hotter than HBM2e in data center cards. Higher temps = higher error rates = training instability.
  • Power delivery: That 12VHPWR connector on your 4090? It's melted in enough rigs that NVIDIA redesigned it. Data center cards use server-grade power connections.

The A100 and H100 aren't just faster. They're built for 24/7/365 operation. Consumer hardware at server workloads is borrowing reliability from your future self.
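
One way to catch throttling in practice is to log clocks and temperature alongside your training metrics. A minimal sketch using the NVML Python bindings (module pynvml, installable as nvidia-ml-py); if SM clocks sag while temperature climbs, the host's cooling isn't keeping up:

# Log GPU clocks/temps periodically to spot thermal throttling on consumer hosts.
# Requires the NVML bindings: pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports milliwatts
        print(f"temp={temp}C  sm_clock={sm_clock}MHz  power={power_w:.0f}W")
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()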

Egress costs can eat your savings. Training on cheap GPUs is only half the problem. Moving terabytes of model weights, datasets, and checkpoints back to S3 (or wherever your production infrastructure lives) triggers egress charges. Here's what moving 1TB actually costs:

| Transfer Direction | Vast.ai | RunPod | AWS (in-region) |
|---|---|---|---|
| Download to instance (dataset in) | Free | Free | Free |
| Upload to S3 (checkpoints out) | ~$50-90/TB* | ~$50/TB | Free |
| Final model to prod | ~$50-90/TB* | ~$50/TB | Free |

*Vast.ai egress varies by host: some have metered bandwidth, others don't. Check before committing.

If your workflow involves pulling 500GB of training data, checkpointing to S3 every 15 minutes, and syncing final weights back, add up the transfer costs. I've seen teams save 40% on compute and lose half of it on data movement. The layer tax applies to bits in motion, not just bits at rest.
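
Here is the kind of back-of-the-envelope check worth running before you commit. The sizes and the $/TB rate below are hypothetical placeholders; substitute your own checkpoint size and your provider's actual bandwidth pricing:

# Rough monthly data-movement bill for a checkpoint-heavy workflow.
# All numbers are placeholders - plug in your own checkpoint size and
# your provider's actual egress rate (see table above).
EGRESS_PER_TB = 70.0        # $/TB leaving the marketplace provider
CKPT_GB = 5                 # model + optimizer state per checkpoint
CKPTS_PER_DAY = 24 * 4      # every 15 minutes
FINAL_MODEL_GB = 30         # weights synced back to production
DAYS = 30

egress_tb = (CKPT_GB * CKPTS_PER_DAY * DAYS + FINAL_MODEL_GB) / 1000
print(f"Egress: ~{egress_tb:.1f} TB/month, roughly ${egress_tb * EGRESS_PER_TB:,.0f}")
# With these placeholders: ~14.4 TB/month, around $1,000 - enough to erase
# a big chunk of the headline compute savings.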

When AWS Actually Makes Sense

I've been critical of hyperscaler costs, but they earn their premium in specific scenarios.

Compliance requirements. HIPAA, SOC2, FedRAMP: if you need regulatory certification, the hyperscalers have it. Vast.ai recently achieved SOC2 Type 2, but most marketplace providers can't offer the audit trail enterprises require.

Production inference with SLAs. When you're serving real-time predictions to paying customers, a 99.9% uptime SLA matters. The cost of an outage, including lost revenue and customer churn, often exceeds the GPU savings.

Predictable capacity planning. If you need guaranteed access to 100 GPUs at 9 AM every Monday, AWS Reserved Instances or Capacity Blocks deliver that. Marketplace availability is first-come, first-served.

Integration with existing infrastructure. If your data is in S3, your auth is in IAM, and your team knows CloudWatch, the operational cost of context-switching to a different platform is real. We ran 3,000 AWS instances. The ecosystem lock-in is genuine.

Support and accountability. When a training run fails and you can't figure out why, having an actual support engineer to call has value. The "figure it out yourself" model breaks down under deadline pressure.

When Cheap GPUs Make Sense

The marketplace model genuinely works for certain workloads.

Training runs that can checkpoint. If your training job saves state every 15 minutes, instance termination is an inconvenience, not a disaster. Resume from checkpoint, continue. Design for interruption and the economics change dramatically.

Experimentation and prototyping. When you're iterating on model architecture, you don't need five-nines uptime. You need cheap cycles to test hypotheses quickly. An RTX 4090 at $0.40/hour lets you experiment at a pace that hyperscaler pricing prohibits.

Batch inference with latency tolerance. If your inference doesn't need sub-100ms latency, you can run it on marketplace GPUs during off-peak hours. Process your queue, download results, shut down.

Academic research and side projects. The barrier to entry for AI experimentation dropped significantly. A graduate student can now afford compute that was enterprise-only five years ago.

The Decision Framework

| Factor | Use Marketplace | Use Hyperscaler |
|---|---|---|
| Workload type | Training, batch inference | Real-time production inference |
| Interruption tolerance | Can checkpoint & resume | Cannot tolerate interruption |
| Data sensitivity | Public data, non-proprietary models | HIPAA, PCI, proprietary IP |
| Support needs | Self-sufficient team | Need vendor support |
| Capacity needs | Flexible, can work around availability | Guaranteed capacity required |
| Budget vs time | More budget-sensitive | More time-sensitive |
| Team experience | Comfortable with DIY infrastructure | Prefer managed services |

The Playbook for Marketplace GPUs

If you decide the tradeoffs are worth it, here's the playbook.

1. Start with interruptible instances. Marketplace pricing can drop significantly for preemptible compute. Design for interruption from day one.

# Search for cheapest reliable GPUs (Vast.ai example; flag names vary by
# CLI version - check `vastai search offers --help` for current syntax)
vast search offers --type bid --gpu-name RTX_4090 --max-price 0.40

# Create instance and start training on boot
vast create instance $OFFER_ID --onstart-cmd "python train.py"

2. Checkpoint religiously, and handle SIGTERM correctly. Marketplace instances don't die gracefully. They get SIGTERM'd with seconds of warning. Your training code needs to catch the signal and save state. But the save can fail if the network is flaky (often the reason you're being terminated). Production code handles this.

The signal handler should only set a flag. Never call sys.exit() from a signal handler: it raises SystemExit at whatever arbitrary point the main thread happens to be executing, racing with your cleanup logic and leaving wandb/database connections dangling. Let the training loop exit cleanly.

import logging
import os
import shutil
import signal
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3
import torch
from botocore.config import Config
from botocore.exceptions import ClientError

# Module-level logger - never configure root logger in library code
logger = logging.getLogger(__name__)

# Multipart threshold: files > 5GB use multipart upload
MULTIPART_THRESHOLD_BYTES = 5 * 1024 * 1024 * 1024  # 5GB
MULTIPART_CHUNKSIZE = 100 * 1024 * 1024  # 100MB chunks


class GracefulCheckpointer:
    """Production checkpointing for interruptible GPU instances.

    Key design: Local save is FAST (blocks training briefly).
    S3 upload is SLOW (runs in background thread, never blocks training).

    Features:
    - Exponential backoff with jitter for transient S3 failures
    - Automatic multipart upload for files > 5GB
    - Graceful signal handling with time-aware shutdown

    THREAD SAFETY WARNING (boto3):
    The boto3 client is thread-safe, but boto3.Session is NOT. This class
    creates the client at init time and uses it from a background thread,
    which is safe. However:

    - DO NOT pass this object to DataLoader workers (multiprocessing.fork())
    - After fork(), the S3 client's connection pool becomes corrupted
    - If using num_workers > 0, create a NEW checkpointer in the main process
      AFTER the DataLoader is initialized, or use 'spawn' start method

    Safe pattern:
        dataloader = DataLoader(..., num_workers=4)
        checkpointer = GracefulCheckpointer(...)  # Create AFTER DataLoader

    Note: OS signals (SIGTERM) are only part of the solution. Spot/preemptible
    instances often provide metadata notifications before the signal. Combine
    this with a polling loop that checks your provider's termination API
    (AWS instance metadata, Vast.ai webhooks, etc.).
    """

    GRACE_PERIOD_SECONDS = 25
    CHECKPOINT_INTERVAL_SECONDS = 900  # 15 minutes
    MAX_RETRIES = 4
    BASE_DELAY_SECONDS = 1.0

    def __init__(
        self,
        s3_bucket: str,
        prefix: str,
        local_fallback: Path | str = "/mnt/checkpoint"
    ):
        config = Config(connect_timeout=5, read_timeout=30, retries={'max_attempts': 0})
        self.s3 = boto3.client('s3', config=config)
        self.bucket = s3_bucket
        self.prefix = prefix
        self.local_fallback = Path(local_fallback)
        self.shutdown_requested = False
        self._shutdown_mono: float | None = None

        # Background thread for S3 uploads - never block the training loop
        self.executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="s3_upload")
        self.pending_upload = None

        # Transfer config for multipart uploads
        from boto3.s3.transfer import TransferConfig
        self.transfer_config = TransferConfig(
            multipart_threshold=MULTIPART_THRESHOLD_BYTES,
            multipart_chunksize=MULTIPART_CHUNKSIZE,
            max_concurrency=4,
            use_threads=True
        )

        signal.signal(signal.SIGTERM, self._flag_shutdown)
        signal.signal(signal.SIGINT, self._flag_shutdown)

    def _flag_shutdown(self, signum, frame):
        logger.warning("Shutdown signal received, flagging for clean exit")
        self.shutdown_requested = True
        self._shutdown_mono = time.monotonic()

    def _time_left(self) -> float:
        if self._shutdown_mono is None:
            return float('inf')
        elapsed = time.monotonic() - self._shutdown_mono
        return max(0.0, self.GRACE_PERIOD_SECONDS - elapsed)

    def _upload_with_retry(self, local_path: Path, s3_key: str) -> bool:
        """Upload to S3 with exponential backoff and multipart support.

        Returns True on success, False on permanent failure.
        """
        import random  # for jitter

        file_size = local_path.stat().st_size
        using_multipart = file_size > MULTIPART_THRESHOLD_BYTES

        if using_multipart:
            logger.info(f"Using multipart upload for {file_size / 1e9:.1f}GB file")

        for attempt in range(self.MAX_RETRIES):
            try:
                self.s3.upload_file(
                    str(local_path),
                    self.bucket,
                    s3_key,
                    Config=self.transfer_config
                )
                logger.info(f"Uploaded to s3://{self.bucket}/{s3_key}")
                return True

            except ClientError as e:
                error_code = e.response.get('Error', {}).get('Code', '')
                # Permanent failures - don't retry
                if error_code in ('AccessDenied', 'NoSuchBucket', 'InvalidBucketName'):
                    logger.error(f"Permanent S3 error: {error_code}")
                    return False

                # Transient failures - retry with backoff
                delay = self.BASE_DELAY_SECONDS * (2 ** attempt)
                jitter = random.uniform(0, delay * 0.1)
                sleep_time = min(delay + jitter, self._time_left() - 1)

                if sleep_time <= 0:
                    logger.warning("No time left for retry, aborting upload")
                    return False

                logger.warning(f"S3 upload failed (attempt {attempt + 1}), "
                             f"retrying in {sleep_time:.1f}s: {e}")
                time.sleep(sleep_time)

            except Exception as e:
                logger.exception(f"Unexpected upload error: {e}")
                return False

        logger.error(f"S3 upload failed after {self.MAX_RETRIES} attempts")
        return False

    def _persist_and_upload(self, local_path: Path, s3_key: str):
        """Runs in background thread. Never blocks training.

        Handles BOTH local persistence AND S3 upload. The local_fallback
        might be a network mount (NFS, EBS) which can block - keep it
        off the main training thread.
        """
        # Step 1: Copy to persistent local storage (may be network mount)
        if self.local_fallback.is_dir():
            fallback_path = self.local_fallback / "checkpoint_latest.pt"
            try:
                shutil.copy2(local_path, fallback_path)
                logger.info(f"Local checkpoint: {fallback_path}")
            except Exception:
                logger.exception("Local persistence failed")

        # Step 2: Upload to S3 with retry logic
        self._upload_with_retry(local_path, s3_key)

    def save(self, model, optimizer, epoch: int, step: int) -> bool:
        # Race condition fix: check BEFORE starting any expensive work
        if self.shutdown_requested and self._time_left() < 3:
            logger.warning("Not enough time left, skipping save")
            return False

        # Step 1: Save to FAST ephemeral /tmp ONLY (NVMe, never network)
        # This is the ONLY blocking I/O in the main thread
        tmp_dir = Path("/tmp")
        local_path = tmp_dir / f"ckpt_{epoch}_{step}.pt"
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'epoch': epoch, 'step': step
        }, local_path)

        # Flush to disk (skip on network mounts: fsync blocks for seconds on EFS/NFS)
        if str(local_path).startswith('/tmp'):
            with local_path.open('rb') as f:
                os.fsync(f.fileno())

        # Step 2: Offload ALL slow I/O to background thread
        # Prevent queue buildup: if previous job still running, skip
        if self.pending_upload and not self.pending_upload.done():
            logger.warning("Previous persist/upload still in progress, skipping")
            return True  # /tmp save worked, that's enough

        s3_key = f"{self.prefix}/checkpoint_latest.pt"
        self.pending_upload = self.executor.submit(
            self._persist_and_upload, local_path, s3_key
        )
        return True

    def wait_for_upload(self, timeout: float = 20.0):
        """Call during shutdown to wait for pending upload."""
        if self.pending_upload:
            try:
                self.pending_upload.result(timeout=timeout)
            except Exception:
                logger.exception("Final upload failed")

    def close(self):
        self.executor.shutdown(wait=False)


def train(model, dataloader, epochs: int, checkpointer: GracefulCheckpointer):
    optimizer = torch.optim.AdamW(model.parameters())
    last_ckpt_mono = time.monotonic()
    global_step = 0

    try:
        for epoch in range(epochs):
            for step, batch in enumerate(dataloader):
                global_step += 1

                if checkpointer.shutdown_requested:
                    checkpointer.save(model, optimizer, epoch, global_step)
                    checkpointer.wait_for_upload(timeout=20)
                    return

                inputs, targets = batch
                loss = model(inputs, targets)
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

                if time.monotonic() - last_ckpt_mono > checkpointer.CHECKPOINT_INTERVAL_SECONDS:
                    checkpointer.save(model, optimizer, epoch, global_step)
                    last_ckpt_mono = time.monotonic()
    finally:
        checkpointer.close()

Here's what it looks like when the host pulls the plug mid-training:

2026-02-01 14:32:15 [INFO] Local checkpoint: /mnt/checkpoint/checkpoint_latest.pt
2026-02-01 14:32:18 [INFO] Uploaded to s3://models/ckpt/checkpoint_latest.pt
2026-02-01 14:47:15 [INFO] Local checkpoint: /mnt/checkpoint/checkpoint_latest.pt
2026-02-01 14:47:16 [WARNING] Shutdown signal received, flagging for clean exit
2026-02-01 14:47:16 [INFO] Local checkpoint: /mnt/checkpoint/checkpoint_latest.pt
2026-02-01 14:47:19 [INFO] Uploaded to s3://models/ckpt/checkpoint_latest.pt
Training complete. Final checkpoint at epoch 47, step 13200.

The key insight is this: local saves are fast (~100ms), network uploads are slow (seconds to minutes). By saving locally first and uploading in a background thread, the training loop never blocks on network I/O. If SIGTERM hits mid-upload, you still have the local checkpoint. The wait_for_upload() call during shutdown uses whatever time remains to try completing the S3 upload, but the local copy is already safe.

Why This Matters at Scale

A naïve implementation would call s3.upload_file() directly in the save method, blocking the training loop for 2-30 seconds depending on checkpoint size and network conditions. At scale, this creates two problems.

  • Stalled heartbeats: Distributed training frameworks expect regular progress. A 30-second block can trigger timeout failures in your orchestrator.
  • Wasted SIGTERM window: You get ~30 seconds between SIGTERM and forced termination. Spending 25 of those waiting on S3 means you can't save final state if the upload fails.

The background thread pattern (or aioboto3 for async) keeps your training loop responsive while uploads happen in parallel. Local-first means you're never racing the network against termination.
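
The docstring above mentions pairing the signal handler with a metadata poll. On AWS Spot, the instance metadata service exposes a spot/instance-action notice roughly two minutes before reclamation; a minimal polling sketch is below. This is AWS-specific, and watch_spot_termination is a helper name I'm introducing here: marketplace providers have their own notification mechanisms (or none), so adapt accordingly.

# Poll the AWS spot termination notice (IMDSv2) and reuse the same shutdown
# flag the signal handler sets. Requires the requests package.
import threading
import time

import requests

IMDS = "http://169.254.169.254/latest"


def watch_spot_termination(checkpointer, interval_seconds: float = 5.0) -> None:
    def _poll():
        while not checkpointer.shutdown_requested:
            try:
                token = requests.put(
                    f"{IMDS}/api/token",
                    headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
                    timeout=2,
                ).text
                resp = requests.get(
                    f"{IMDS}/meta-data/spot/instance-action",
                    headers={"X-aws-ec2-metadata-token": token},
                    timeout=2,
                )
                if resp.status_code == 200:  # reclamation scheduled (~2 min warning)
                    checkpointer._flag_shutdown(None, None)
                    return
            except requests.RequestException:
                pass  # metadata service unreachable; try again next tick
            time.sleep(interval_seconds)

    threading.Thread(target=_poll, daemon=True, name="spot_watch").start()

Call it right after constructing the checkpointer; the training loop's existing shutdown_requested check does the rest.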

DataLoader gotcha: If SIGTERM hits while a PyTorch DataLoader worker is mid-read, you can get zombie processes or corrupted shared memory. Set num_workers=0 during your grace period check, or ensure pin_memory=False before the final save.

Serialization overhead: torch.save() uses pickle, which can spike CPU/RAM before the background thread even starts. For large models (7B+), consider safetensors for zero-copy serialization: it's faster, safer, and doesn't execute arbitrary code on load.
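
Saving is only half the loop; on restart you need to find the newest checkpoint before training begins. A minimal resume sketch that pairs with the GracefulCheckpointer above, preferring the local fallback and then S3. load_latest_checkpoint is a helper I'm introducing, not part of the class:

# Resume path to pair with GracefulCheckpointer: local fallback first
# (fast, survives container restarts on the same host), then S3.
from pathlib import Path

import boto3
import torch
from botocore.exceptions import ClientError


def load_latest_checkpoint(s3_bucket: str, prefix: str,
                           local_fallback: Path = Path("/mnt/checkpoint")):
    local = local_fallback / "checkpoint_latest.pt"
    if local.exists():
        return torch.load(local, map_location="cpu")

    tmp = Path("/tmp/checkpoint_latest.pt")
    try:
        boto3.client("s3").download_file(
            s3_bucket, f"{prefix}/checkpoint_latest.pt", str(tmp)
        )
    except ClientError:
        return None  # no checkpoint anywhere: cold start
    return torch.load(tmp, map_location="cpu")


# Before training starts:
# ckpt = load_latest_checkpoint("my-bucket", "ckpt")
# if ckpt:
#     model.load_state_dict(ckpt["model"])
#     optimizer.load_state_dict(ckpt["optimizer"])
#     start_epoch, start_step = ckpt["epoch"], ckpt["step"]

In the train() function above the optimizer is created inside the loop, so wiring this in means hoisting optimizer creation out or passing the loaded state through.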

3. Use budget controls. Every platform has spending alerts. Set them. Founders have woken up to $10,000 bills because they forgot to terminate an instance.
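
Platform-side alerts are the real safety net, but a cheap in-process guard costs nothing: track elapsed spend against a hard cap and stop the job. A minimal sketch with hypothetical rate/cap values; note that stopping the job doesn't stop the instance, so pair this with the provider's auto-stop setting or billing alerts.

# In-process budget guard: refuse to keep training past a hard spend cap.
import time


class BudgetGuard:
    def __init__(self, hourly_rate: float, max_spend: float):
        self.hourly_rate = hourly_rate
        self.max_spend = max_spend
        self._start = time.monotonic()

    def spent(self) -> float:
        return (time.monotonic() - self._start) / 3600 * self.hourly_rate

    def check(self) -> None:
        if self.spent() >= self.max_spend:
            raise RuntimeError(
                f"Budget cap hit: ~${self.spent():.2f} of ${self.max_spend:.2f}"
            )


# guard = BudgetGuard(hourly_rate=1.90, max_spend=200.0)
# ...inside the training loop, once per step: guard.check()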

4. Have a fallback. When you absolutely need a training run to complete by Thursday, have an AWS or Lambda Labs account ready. The 2x cost is insurance against marketplace volatility.

5. Test provider reliability. Before committing to a platform, run small test workloads. Check actual availability, network speeds, and how often instances get interrupted.

# Makefile for GPU provider benchmarking
# Usage: make benchmark PROVIDER=vastai GPU=4090

PROVIDER ?= vastai
GPU ?= 4090
ITERATIONS ?= 100

.PHONY: benchmark benchmark-memory benchmark-full

# Quick matmul sanity check (a couple of minutes)
benchmark:
	python -c "import torch; \
		x = torch.randn(1024, 1024, device='cuda'); \
		[torch.mm(x, x) for _ in range($(ITERATIONS))]; \
		torch.cuda.synchronize(); print('Matrix ops: OK')"
	@echo "Provider: $(PROVIDER) | GPU: $(GPU)"

# Memory bandwidth test (each clone is one read plus one write per element;
# holds ~10 GB of copies, fine on a 24GB+ card)
benchmark-memory:
	python -c "import torch; import time; \
		size = 1024 * 1024 * 256; \
		x = torch.randn(size, device='cuda'); \
		torch.cuda.synchronize(); t0 = time.time(); \
		[x.clone() for _ in range(10)]; \
		torch.cuda.synchronize(); \
		gb_per_sec = (size * 4 * 2 * 10) / (time.time() - t0) / 1e9; \
		print(f'Memory bandwidth: {gb_per_sec:.1f} GB/s')"

# Full benchmark suite
benchmark-full: benchmark benchmark-memory
	nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.gr --format=csv
	@echo "Benchmark complete. Check for thermal throttling above."

The Honest Math

Consider a startup needing 1,000 GPU-hours of H100 time per month:

  • AWS On-Demand: 1,000 × $3.90 = $3,900/month
  • AWS Spot: 1,000 × $2.50 = $2,500/month (when available)
  • AWS Savings Plan: ~$2,730/month (30% off with 1-year commit)
  • RunPod: 1,000 × $1.99 = $1,990/month
  • Vast.ai: 1,000 × $1.87 = $1,870/month (marketplace rate, variable)

The savings are real: $1,500-2,000/month. Over two years, that's $36,000-48,000. But factor in the operational overhead of managing interruptions, debugging provider-specific issues, and the occasional lost workload. The net savings are real, but smaller than the headline numbers suggest.

What This Actually Means

The GPU compute market has more options than most founders realize. The 40-60% savings on marketplace platforms are genuine, but so are the tradeoffs in reliability, security, and support.

The right answer depends on your specific situation.

Bootstrapped startup with technical founders? The marketplace model probably works. Design for interruption, accept the operational overhead, pocket the savings.

Series A company with production SLAs? The hyperscaler premium is often justified. Downtime costs more than the GPU savings.

Research or experimentation? Marketplace platforms are a clear win. The reliability concerns don't matter when you're testing hypotheses.

The hyperscalers will continue to dominate enterprise AI. But for startups, researchers, and independent developers who can handle the operational complexity, alternatives exist. Whether they're right for you depends on honest assessment of your team's capabilities and your workload's requirements.

The Bottom Line

GPU marketplace platforms offer 40-60% savings over hyperscaler on-demand pricing. The savings are real. So are the tradeoffs, including unreliable instances, weaker security isolation, minimal support, and variable provider quality.

The platforms work well for training and experimentation with interruption-tolerant workloads. They work poorly for production inference with SLAs, compliance requirements, or deadline-critical work.

Before switching, honestly assess your situation. Can your team handle the operational overhead? Can your workload tolerate interruption? Is the savings worth the debugging time when things break at 2 AM?

Sometimes the answer is yes. Sometimes AWS earns its premium. Know which situation you're in.

Disagree? Have a War Story?

I read every reply. If you've seen this pattern play out differently, or have a counter-example that breaks my argument, I want to hear it.

Send a Reply →