mts1b-cloudburst
Cloud-burst worker for GPU bursting: Vast.ai, Runpod, Thunder Compute, SSH runner, budget enforcer.
Repo: github.com/MTS1B/mts1b-cloudburst
Layer: 4
Wave: 2 (months 4-7)
Depends on: foundation, platform, httpx, paramiko
Audience: mts1b-research (ladder sweeps), mts1b-GPUbacktester (heavy backtests)
What it is
A small worker that spins up rented GPU instances, runs a batch job, persists results, and tears down. Used to:
- Run ladder sweeps across millions of param combos in hours instead of weeks
- Backtest the full Russell 3000 on daily data for 2000-2026 (uneconomical on a workstation)
- Train factor-extraction models
Self-hosted by default; uses Vast.ai / Runpod / Thunder Compute as a spot-GPU market.
Supported providers
| Provider | API auth | Hourly cost (approx) | Spot reliability |
|---|---|---|---|
| vast_ai | API key | $0.20-2.00 per RTX 4090 | medium |
| runpod | API key | $0.30-1.00 per RTX 4090 | high |
| thunder | API key | $0.40-1.50 per H100 | high (focus on premium GPUs) |
| ssh | SSH key | (free, your own boxes) | high |
Module layout
mts1b_cloudburst/
├── providers/
│   ├── vast_ai.py
│   ├── runpod.py
│   ├── thunder.py
│   └── ssh.py
├── budget/
│   ├── enforcer.py     # USD cap per job + per day
│   └── ledger.py       # cost tracking
├── runner/
│   ├── job.py          # JobSpec + lifecycle
│   ├── image.py        # build/push container image
│   └── result_sync.py  # rsync results to local datalake
└── cli/
    └── burst.py
API
Submit a job
from mts1b_cloudburst import burst, JobSpec

job = await burst(
    JobSpec(
        name="ladder-sweep-momentum",
        image="ghcr.io/mts1b/mts1b-gpubacktester:0.1.0",
        gpus=1,
        gpu_model="RTX_4090",  # or "H100" or "ANY"
        memory_gb=24,
        max_duration_minutes=120,
        max_cost_usd=10.0,
        command=[
            "mts1b-backtest", "batch",
            "--config", "/data/sweep.yaml",
            "--output", "/data/results/",
        ],
        mount_local=[
            ("./configs", "/data/configs:ro"),
            ("./results", "/data/results:rw"),
        ],
        provider="auto",  # picks cheapest available
    )
)

# Monitor
async for event in job.stream_events():
    print(event)
    # JobEvent(type="provisioned", at=..., cost_per_hour_usd=0.34)
    # JobEvent(type="container_started", at=...)
    # JobEvent(type="progress", at=..., percent=12.5)
    # JobEvent(type="completed", at=..., total_cost_usd=0.68)

# Sync results
await job.sync_results()
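
A minimal sketch of consuming the event stream, for readers who want to block until the job finishes and read the final cost. The `JobEvent` dataclass here is a hypothetical stand-in whose fields are inferred from the sample output above, and `fake_stream` replaces `job.stream_events()` so the example runs without a real job; the library's actual types may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobEvent:
    """Hypothetical event shape, inferred from the sample output above."""
    type: str
    percent: Optional[float] = None
    total_cost_usd: Optional[float] = None


async def fake_stream():
    """Stand-in for job.stream_events(); yields a fixed event sequence."""
    for event in [
        JobEvent("provisioned"),
        JobEvent("container_started"),
        JobEvent("progress", percent=12.5),
        JobEvent("completed", total_cost_usd=0.68),
    ]:
        yield event


async def wait_for_completion(events):
    """Consume events until 'completed'; return (total cost, last progress seen)."""
    last_percent = 0.0
    async for event in events:
        if event.type == "progress" and event.percent is not None:
            last_percent = event.percent
        elif event.type == "completed":
            return event.total_cost_usd, last_percent
    raise RuntimeError("event stream ended before completion")


total_cost, last_percent = asyncio.run(wait_for_completion(fake_stream()))
print(total_cost, last_percent)
```

Raising when the stream ends without a `completed` event is a design choice in this sketch, so a torn-down spot instance is not mistaken for a finished job.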
Provider auto-selection
provider="auto" queries each provider's spot market and picks the cheapest instance that matches the spec. Ties on price are broken by provider reliability score (Runpod > Thunder > Vast.ai, based on historical job-completion rate).
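
The selection rule can be sketched as a sort on (price, reliability rank). This is an illustration under stated assumptions, not the package's implementation: the `Offer` type, the `pick_offer` function, and the numeric ranks are hypothetical, and only the ordering Runpod > Thunder > Vast.ai comes from the text.

```python
from dataclasses import dataclass

# Assumed ranks (lower is better), mirroring the order Runpod > Thunder > Vast.ai.
RELIABILITY_RANK = {"runpod": 0, "thunder": 1, "vast_ai": 2}


@dataclass(frozen=True)
class Offer:
    """Hypothetical spot-market offer; not the library's actual type."""
    provider: str
    gpu_model: str
    usd_per_hour: float


def pick_offer(offers, gpu_model="ANY"):
    """Return the cheapest matching offer, breaking price ties by reliability."""
    matching = [o for o in offers if gpu_model in ("ANY", o.gpu_model)]
    if not matching:
        raise LookupError(f"no offers for gpu_model={gpu_model!r}")
    return min(
        matching,
        key=lambda o: (
            o.usd_per_hour,
            # Unknown providers sort after all known ones.
            RELIABILITY_RANK.get(o.provider, len(RELIABILITY_RANK)),
        ),
    )


offers = [
    Offer("vast_ai", "RTX_4090", 0.34),
    Offer("runpod", "RTX_4090", 0.34),
    Offer("thunder", "H100", 1.40),
]
print(pick_offer(offers, "RTX_4090").provider)  # runpod wins the price tie
```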
Budget enforcement
# Per-job
JobSpec(..., max_cost_usd=10.0) # hard kill at $10
# Per-day (across all jobs)
await burst.set_daily_budget(usd=50.0)
# Subsequent burst() calls fail with BudgetExceededError if they would push spend past the cap
mts1b-platform/messaging sends an alert when daily spend exceeds 80% of the daily budget.
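
A minimal in-memory sketch of the daily cap and the 80% alert threshold. The `DailyBudget` class and its methods are hypothetical, and `BudgetExceededError` is redefined locally as a stand-in. It also assumes the enforcer reserves each job's `max_cost_usd` up front, which the text does not confirm.

```python
class BudgetExceededError(RuntimeError):
    """Local stand-in for the error named above; the real class may differ."""


class DailyBudget:
    """Hypothetical daily budget tracker with an alert threshold."""

    def __init__(self, limit_usd, alert_fraction=0.8):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def reserve(self, max_cost_usd):
        """Reserve a job's worst-case cost, or refuse if it would pass the cap."""
        if self.spent_usd + max_cost_usd > self.limit_usd:
            raise BudgetExceededError(
                f"${self.spent_usd + max_cost_usd:.2f} would exceed "
                f"daily cap of ${self.limit_usd:.2f}"
            )
        self.spent_usd += max_cost_usd

    def should_alert(self):
        """True once spend is above the alert fraction of the cap."""
        return self.spent_usd > self.alert_fraction * self.limit_usd


budget = DailyBudget(limit_usd=50.0)
budget.reserve(30.0)
print(budget.should_alert())  # 30 of 50 is below the 80% threshold
budget.reserve(15.0)
print(budget.should_alert())  # 45 of 50 is above it
```

Reserving the worst case is conservative: it can reject a job that would have finished under budget, but it never lets concurrent jobs overshoot the cap together.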
CLI
mts1b-burst submit \
  --image ghcr.io/mts1b/mts1b-gpubacktester:0.1.0 \
  --gpus 1 --gpu-model RTX_4090 \
  --max-cost-usd 10 \
  --command "mts1b-backtest batch --config /data/sweep.yaml"
mts1b-burst list # active jobs
mts1b-burst show <job_id> # status + logs
mts1b-burst kill <job_id> # tear down early
mts1b-burst budget show
# Today: $12.50 / $50.00 (25%)
# This week: $87.30 / $200.00 (44%)
mts1b-burst providers status
# vast.ai: available, lowest RTX_4090 = $0.22/hr (3 offers)
# runpod: available, lowest RTX_4090 = $0.34/hr (10 offers)
# thunder: available, lowest H100 = $1.40/hr (2 offers)
Result sync
Each job has a result mount. After the job completes, mts1b-cloudburst rsyncs its contents to data/cloudburst/results/<job_id>/. From there, mts1b-research ingests them into the lake.
For large results (>1 GB), use S3 / R2 instead of rsync:
JobSpec(
    ...,
    result_sink="s3://my-bucket/cloudburst/{job_id}/",
)
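
One way to picture how the destination is resolved: with no `result_sink`, results land in the local path described above; otherwise the `{job_id}` placeholder is filled in. The `resolve_destination` helper is hypothetical, and expanding the placeholder with `str.format` is an assumption about how the template is processed.

```python
def resolve_destination(job_id, result_sink=None):
    """Return where a job's results should land.

    Falls back to the local rsync target when no sink is configured.
    """
    if result_sink is None:
        return f"data/cloudburst/results/{job_id}/"
    # Assumption: the sink template uses str.format-style placeholders.
    return result_sink.format(job_id=job_id)


print(resolve_destination("abc123"))
print(resolve_destination("abc123", "s3://my-bucket/cloudburst/{job_id}/"))
```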
Image building
Reference Dockerfile at mts1b-cloudburst/images/gpubacktester.Dockerfile. Build + push:
mts1b-burst image build --image gpubacktester --tag 0.1.0
mts1b-burst image push --image gpubacktester --tag 0.1.0
# Pushes to ghcr.io/mts1b/mts1b-gpubacktester:0.1.0
Workers automatically pull the right image at startup.
SSH provider (your own boxes)
For pre-existing GPU servers:
JobSpec(
    ...,
    provider="ssh",
    ssh_hosts=["gpu1.local", "gpu2.local"],  # round-robin
)
The SSH provider skips spot-market provisioning and runs the job directly on your hardware. Budget enforcement still applies: cost is tracked as GPU-hours × your declared $/hr rate.
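
A short sketch of the two mechanics mentioned here: round-robin host assignment and cost accounting from a declared rate. Both `assign_hosts` and `ssh_job_cost_usd` are hypothetical helpers, and treating the declared rate as a per-GPU hourly figure is an assumption.

```python
from itertools import cycle


def assign_hosts(job_names, ssh_hosts):
    """Assign jobs to hosts in round-robin order."""
    if not ssh_hosts:
        raise ValueError("ssh_hosts must not be empty")
    hosts = cycle(ssh_hosts)
    return {name: next(hosts) for name in job_names}


def ssh_job_cost_usd(gpus, hours, declared_usd_per_gpu_hour):
    """Cost charged against the budget: GPU-hours times the declared rate."""
    return gpus * hours * declared_usd_per_gpu_hour


print(assign_hosts(["a", "b", "c"], ["gpu1.local", "gpu2.local"]))
print(ssh_job_cost_usd(gpus=2, hours=1.5, declared_usd_per_gpu_hour=0.5))
```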
Build + test
pip install -e ".[dev]"
pytest -m unit # mock providers
pytest -m live --provider=vast # spins up actual instance (costs ~$0.20)
Roadmap
| Version | Items |
|---|---|
| 0.1 (Wave 2) | Vast.ai + Runpod + Thunder + SSH, budget enforcer, result sync |
| 0.2 (Wave 2) | AWS spot, GCP preemptible, Azure spot |
| 0.3 (Wave 3) | Multi-job orchestration (DAGs) |
| 1.0 (LTS) | Stable JobSpec |
See also
- mts1b-research: primary consumer (ladder sweeps)
- mts1b-GPUbacktester: runs inside the burst container