Performance notes
Where MTS1B is fast, where it isn't, and how to tune it.
Latency budget (live trading hot path)
End-to-end target: strategy signal → broker submission ≤ 100 ms p99.
| Step | Target | Why |
|---|---|---|
| Strategy signal generation | ≤ 10 ms | factor compute (cached features) |
| Portfolio sizing | ≤ 5 ms | quantkit kelly/vol-target |
| OMS receive + idempotency check | ≤ 5 ms | dedupe lookup |
| Risk gate 1 (idempotency) | ≤ 1 ms | hashed dedupe |
| Risk gate 2 (schema) | ≤ 1 ms | pydantic validate |
| Risk gate 3 (static) | ≤ 2 ms | in-memory lookups |
| Risk gate 4 (position) | ≤ 5 ms | position store + cov compute |
| Risk gate 5 (drawdown) | ≤ 2 ms | NAV diff |
| Risk gate 6 (short) | ≤ 2 ms | borrow cache |
| Risk gate 7 (CRO veto, optional) | ≤ 5 s | LLM call (fail-OPEN) |
| Broker submit | ≤ 20 ms | network to venue |
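| Total (gate 7 excluded) | ≤ 100 ms p99 | end-to-end budget |
The 100 ms budget applies with the optional CRO veto disabled; when gate 7 is enabled, its LLM call can add up to 5 s. Without the CRO veto, the hot path runs at roughly 50 ms p99 (the per-step targets above sum to 53 ms).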
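The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch that copies the per-step targets from the table; the dictionary keys and variable names are illustrative and not part of MTS1B. Percentiles of individual steps do not add exactly, so treat the sum as a rough planning figure.

```python
# Per-step latency targets in milliseconds, copied from the table above.
# Gate 7 (CRO veto) is optional and budgeted in seconds, so it is kept separate.
STEP_TARGETS_MS = {
    "signal_generation": 10,
    "portfolio_sizing": 5,
    "oms_receive": 5,
    "gate_1_idempotency": 1,
    "gate_2_schema": 1,
    "gate_3_static": 2,
    "gate_4_position": 5,
    "gate_5_drawdown": 2,
    "gate_6_short": 2,
    "broker_submit": 20,
}
CRO_VETO_MS = 5_000
BUDGET_MS = 100

hot_path_ms = sum(STEP_TARGETS_MS.values())
headroom_ms = BUDGET_MS - hot_path_ms

print(f"hot path without CRO veto: {hot_path_ms} ms")
print(f"headroom under the {BUDGET_MS} ms budget: {headroom_ms} ms")
print(f"worst case with CRO veto enabled: {hot_path_ms + CRO_VETO_MS} ms")
```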
Backtest throughput
GPU vs CPU comparison (Russell 1000 × 10 years daily)
| Backend | Universe size | Period | Wall time | Speedup |
|---|---|---|---|---|
| CPU (numpy) | 100 | 10 yr | 8 sec | 1x |
| CPU (numpy) | 1000 | 10 yr | 95 sec | 1x |
| CPU (numpy) | 1000 | 10 yr (1m bars) | 18 min | 1x |
| GPU (cupy RTX 4090) | 100 | 10 yr | 1.2 sec | ~7x |
| GPU (cupy RTX 4090) | 1000 | 10 yr | 4.8 sec | ~20x |
| GPU (cupy RTX 4090) | 1000 | 10 yr (1m bars) | 38 sec | ~28x |
| GPU (cupy H100) | 1000 | 10 yr (1m bars) | 14 sec | ~77x |
For large parameter sweeps (ladder), the GPU advantage compounds:
- 100k combos × 1000 universe × 10 yr daily:
  - CPU: ~10 hours
  - GPU (RTX 4090): ~30 minutes
  - GPU (H100): ~10 minutes
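The speedup figures follow directly from the wall times: each is the CPU time divided by the GPU time for the same workload. The sketch below recomputes them from the numbers quoted above; the data structures are illustrative only.

```python
# Wall times in seconds, copied from the benchmark table above.
# Keys are (universe size, bar type).
CPU_BASELINE_S = {
    (100, "daily"): 8.0,
    (1000, "daily"): 95.0,
    (1000, "1m"): 18 * 60.0,
}
GPU_RUNS_S = [
    ("RTX 4090", 100, "daily", 1.2),
    ("RTX 4090", 1000, "daily", 4.8),
    ("RTX 4090", 1000, "1m", 38.0),
    ("H100", 1000, "1m", 14.0),
]

def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Speedup is the ratio of baseline wall time to accelerated wall time."""
    return cpu_seconds / gpu_seconds

speedups = {
    (gpu, universe, bars): speedup(CPU_BASELINE_S[(universe, bars)], seconds)
    for gpu, universe, bars, seconds in GPU_RUNS_S
}
for key, value in speedups.items():
    print(key, f"{value:.1f}x")

# The sweep estimates give the same kind of ratio: ~10 hours vs ~30 or ~10 minutes.
sweep_speedup_4090 = speedup(10 * 60, 30)
sweep_speedup_h100 = speedup(10 * 60, 10)
```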
When CPU is fine
- Universe < 100 symbols
- Period < 5 years daily
- One-off run (no sweep)
- Development on a laptop without CUDA
When GPU pays off
- Universe > 200 symbols
- Intraday bars (1m / 5m)
- Parameter sweeps (anything > 1k combinations)
- Walk-forward CV (multi-fold runs)
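The two lists above can be turned into a simple backend chooser. The sketch below is hypothetical: `BacktestJob` and `choose_backend` are not MTS1B APIs, and it assumes that any single GPU criterion is enough to justify the GPU. The guidance leaves universes between 100 and 200 symbols unspecified, so the sketch defaults to CPU there.

```python
from dataclasses import dataclass

@dataclass
class BacktestJob:
    """Hypothetical job description; field names are illustrative."""
    n_symbols: int
    intraday: bool = False       # 1m / 5m bars
    n_param_combos: int = 1      # 1 means a one-off run
    n_cv_folds: int = 1          # > 1 means walk-forward CV
    cuda_available: bool = True

def choose_backend(job: BacktestJob) -> str:
    """Return "gpu" if any 'GPU pays off' criterion holds, else "cpu"."""
    if not job.cuda_available:
        return "cpu"  # e.g. development on a laptop without CUDA
    gpu_pays_off = (
        job.n_symbols > 200
        or job.intraday
        or job.n_param_combos > 1_000
        or job.n_cv_folds > 1
    )
    return "gpu" if gpu_pays_off else "cpu"

print(choose_backend(BacktestJob(n_symbols=50)))
print(choose_backend(BacktestJob(n_symbols=1000)))
print(choose_backend(BacktestJob(n_symbols=50, n_param_combos=100_000)))
```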
Memory
Foundation library
Tiny — under 10 MB resident.
Platform primitives
| Primitive | Resident |
|---|---|
| Logging setup | ~5 MB |
| Config + Vault client | ~15 MB |
| HTTP client (httpx) | ~10 MB |
| NATS client | ~5 MB |
| Postgres pool | ~30 MB |
| Redis client | ~5 MB |
A typical service
| Service | Idle | Active |
|---|---|---|
| mts1b-foundation (library) | 10 MB | 10 MB |
| mts1b-platform (library) | 50 MB | 80 MB |
| mts1b-marketdata (service) | 80 MB | 200 MB |
| mts1b-oms (service) | 100 MB | 300 MB |
| mts1b-riskengine (service) | 80 MB | 150 MB |
| mts1b-research (service) | 200 MB | 1-4 GB (active sweep) |
| mts1b-GPUbacktester (service) | 100 MB host | 8-24 GB GPU (active backtest) |
| mts1b-datalake (service) | 150 MB | 500 MB-2 GB (active ingest) |
Per-service tuning lives in mts1b.config under each service's section.
Concurrency
Async first
Every I/O-bound function in MTS1B is async. Don't mix in synchronous (blocking) HTTP calls: a single blocking call stalls every coroutine on the event loop until it returns.
❌ Wrong:
import requests
response = requests.get("https://api.example.com") # blocks the event loop
✅ Right:
from mts1b_platform.http import http_client
async with http_client("example") as c:
    response = await c.get("https://api.example.com")
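When a dependency only offers a blocking API, one common workaround is to push the call onto a worker thread with the standard library's `asyncio.to_thread` (Python 3.9+), so the event loop stays responsive. The sketch below is an illustration only; `legacy_blocking_fetch` is a hypothetical stand-in for a synchronous SDK call, not an MTS1B function.

```python
import asyncio
import time

def legacy_blocking_fetch(symbol: str) -> str:
    """Hypothetical synchronous call; time.sleep stands in for blocking I/O."""
    time.sleep(0.05)
    return f"{symbol}:ok"

async def fetch_all(symbols: list[str]) -> list[str]:
    # Each blocking call runs in the default thread pool, so the event loop
    # keeps serving other coroutines while the calls are in progress.
    tasks = [asyncio.to_thread(legacy_blocking_fetch, s) for s in symbols]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)

results = asyncio.run(fetch_all(["AAPL", "MSFT", "NVDA"]))
print(results)
```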
Multi-process (CPU-bound work)
Async helps with I/O, not CPU. For CPU-heavy work (factor compute, optimization):
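# Use ProcessPoolExecutor to spread CPU-bound work across cores
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=8) as ex:
    # run_one_param_set must be a picklable, module-level function
    results = list(ex.map(run_one_param_set, param_grid))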
Or use mts1b-cloudburst to fan out to rented GPU instances.
NATS consumer concurrency
sub = await js.subscribe(
    "mts.v1.oms.fills.created",
    durable="my-consumer",
    max_ack_pending=100,  # allow up to 100 unacknowledged messages in flight
)
Multiple consumer instances with the same durable name share work — like a queue.
Disk I/O
Parquet partitioning
The data lake is partitioned by year/month + symbol/interval. Queries that filter on partition columns are fast:
# Fast: symbol, interval, and date filters all prune partitions
df = lake.equities.bars.read(
    symbols=["AAPL"],
    interval="daily",
    start="2024-01-01", end="2024-06-01",
)
# Slow: no date filter, so every year/month partition for AAPL is read
df = lake.equities.bars.read(
    symbols=["AAPL"],
    interval="daily",  # all dates
)
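To see why the date filter matters, count the year/month partitions each query touches. The sketch below assumes a Hive-style `year=YYYY/month=MM` directory naming, an inclusive end date, and ten years of history starting in 2014; none of these details are confirmed by the data lake documentation, and `partitions_for_range` is a hypothetical helper.

```python
from datetime import date

def partitions_for_range(start: date, end: date) -> list[str]:
    """List the year/month partitions overlapping [start, end], both inclusive."""
    parts = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        parts.append(f"year={year}/month={month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return parts

# With a date filter: only the months from January to June 2024 are opened.
pruned = partitions_for_range(date(2024, 1, 1), date(2024, 6, 1))
# Without one: every month of the assumed history is opened.
full = partitions_for_range(date(2014, 1, 1), date(2024, 6, 1))

print(len(pruned), "partitions with a date filter")
print(len(full), "partitions without one")
```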
Compression
| Type | Compression | Speed | Size reduction vs CSV |
|---|---|---|---|
| Daily bars | snappy | fast | 5x |
| Intraday bars | zstd:3 | medium | 12x |
| News text | zstd:3 | medium | 8x |
| Options chains | snappy | fast | 4x |
Reading Parquet is 5-50x faster than CSV because the columnar layout lets readers skip unused columns and avoids text parsing.
DuckDB for ad-hoc
with lake.duckdb_session() as conn:
    df = conn.execute("""
        SELECT symbol, AVG(close) AS avg_close
        FROM equities.bars
        WHERE ts BETWEEN '2024-01-01' AND '2024-06-01'
        GROUP BY symbol
    """).pl()
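DuckDB pushes filters and projections down into the Parquet reader, so it reads only the referenced columns and skips row groups whose statistics cannot match the predicate.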
Network
Rate limits
Each adapter respects venue rate limits via mts1b_platform.ratelimit.RateLimiter. Limits are shared across processes via Redis.
| Provider | Free tier | Paid |
|---|---|---|
| FMP | 250/day | 30k+/day |
| Polygon | 5/min | unlimited |
| Coinbase Advanced | 30 req/sec | 100+/sec |
| IBKR Gateway | 50 req/sec | unlimited |
Hitting a rate limit triggers exponential backoff via mts1b_platform.retry.
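For readers unfamiliar with the pattern, exponential backoff multiplies the wait by a fixed factor after each failed attempt, usually up to a cap. The sketch below shows the schedule in plain Python. It does not reproduce the actual policy of mts1b_platform.retry, which is not shown here; the function name, base delay, factor, cap, and jitter option are all assumptions.

```python
import random

def backoff_delays(attempts, base_s=0.5, factor=2.0, cap_s=30.0,
                   jitter=False, rng=None):
    """Return the wait in seconds before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Grow geometrically, but never wait longer than the cap.
        delay = min(cap_s, base_s * factor ** attempt)
        if jitter:
            # "Full jitter": pick a random wait up to the computed delay so
            # that many clients do not retry at the same instant.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

print(backoff_delays(8))
```

Jitter is worth enabling whenever several processes share one limit, since otherwise they tend to retry in lockstep.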
Connection pooling
HTTP client factory keeps connections alive:
import asyncio

async with http_client("polygon", base_url="https://api.polygon.io",
                       max_connections=20, max_keepalive_connections=10) as c:
    # All requests share the pool
    results = await asyncio.gather(*[c.get(f"/v2/last/trade/{s}") for s in symbols])
NATS publish latency
Local cluster: < 1 ms p99 for publish-ack. Cross-region: 10-100 ms. JetStream durable adds ~1-3 ms for disk persistence.
Database
Postgres pool size
Default is 20 connections. For a service handling > 100 concurrent requests, increase:
db:
  primary:
    dsn: postgres://...
    pool_size: 50
    max_overflow: 20
Monitor:
postgres_pool_active{pool="primary"}
postgres_pool_idle{pool="primary"}
postgres_pool_waiting{pool="primary"}
If waiting > 0 consistently, increase the pool. If idle = pool_size consistently, decrease it.
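The sizing rule can be written as a small check over sampled gauge values. This is a sketch under stated assumptions: `pool_advice` is a hypothetical helper, the samples are (active, idle, waiting) readings of the three metrics above, and "consistently" is interpreted as at least 90% of samples.

```python
def pool_advice(samples, pool_size, threshold=0.9):
    """Suggest a pool change from (active, idle, waiting) gauge samples."""
    if not samples:
        return "keep"
    n = len(samples)
    # Share of samples in which at least one request was waiting for a connection.
    waiting_share = sum(1 for _, _, waiting in samples if waiting > 0) / n
    # Share of samples in which every pooled connection was idle.
    idle_share = sum(1 for _, idle, _ in samples if idle >= pool_size) / n
    if waiting_share >= threshold:
        return "increase"
    if idle_share >= threshold:
        return "decrease"
    return "keep"

saturated = [(20, 0, 3)] * 10
unused = [(0, 20, 0)] * 10
print(pool_advice(saturated, pool_size=20))
print(pool_advice(unused, pool_size=20))
```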
Indexes
Critical indexes (auto-created):
- orders(order_id): PK
- orders(idempotency_key, created_at): unique within dedup window
- orders(fund_id, created_at): for fund views
- fills(order_id)
- positions(fund_id, symbol): PK
- audit_chain(sequence): PK
- audit_chain(subject_id, timestamp): for "show order trail"
Verify:
SELECT indexname FROM pg_indexes WHERE schemaname = 'public' ORDER BY indexname;
LLM cost
| Persona | Calls/day | Avg tokens | Cost/day |
|---|---|---|---|
| CRO (gate 7) | ~500 | 800 in / 200 out | ~$3 |
| equities_analyst | ~50 | 2000 in / 500 out | ~$1.50 |
| news_summarizer | ~20 | 5000 in / 800 out | ~$2 |
| quant_screener | ~10 | 3000 in / 1500 out | ~$3 |
| ... | | | |
| TOTAL | ~700 | | ~$15-25/day |
Semantic cache hits typically reduce cost by 60-80% on stable workloads.
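The Calls/day and Avg tokens columns imply a daily token volume per persona, which is the quantity a budget has to track. The sketch below computes those volumes from the table rows and adds a cost estimator. The helper names are hypothetical, the per-million-token prices are caller-supplied placeholders and not a real price list, and cache hits are assumed to cost nothing.

```python
# (calls per day, avg input tokens, avg output tokens), copied from the table above.
PERSONAS = {
    "cro_gate_7": (500, 800, 200),
    "equities_analyst": (50, 2000, 500),
    "news_summarizer": (20, 5000, 800),
    "quant_screener": (10, 3000, 1500),
}

def daily_tokens(calls, tokens_in, tokens_out):
    """Return (input_tokens, output_tokens) consumed per day."""
    return calls * tokens_in, calls * tokens_out

def daily_cost_usd(calls, tokens_in, tokens_out,
                   usd_per_m_in, usd_per_m_out, cache_hit_rate=0.0):
    """Estimate daily spend; only uncached calls are billed in this model."""
    total_in, total_out = daily_tokens(calls, tokens_in, tokens_out)
    uncached = 1.0 - cache_hit_rate
    return uncached * (total_in * usd_per_m_in + total_out * usd_per_m_out) / 1_000_000

for name, row in PERSONAS.items():
    tokens_in, tokens_out = daily_tokens(*row)
    print(f"{name}: {tokens_in:,} in / {tokens_out:,} out per day")

total_in = sum(daily_tokens(*row)[0] for row in PERSONAS.values())
total_out = sum(daily_tokens(*row)[1] for row in PERSONAS.values())
print(f"listed personas: {total_in:,} in / {total_out:,} out per day")
```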
Where to look for slowness
mts mts1b-platform tail --slow-only --threshold-ms 50
# Streams any operation > 50ms across all services
mts mts1b-platform metric --top 10 --since 1h
# Top 10 slowest operations in the last hour
mts1b-deploy open grafana
# Browse dashboards → Service Overview
Tuning checklist
- All HTTP calls use mts1b_platform.http.http_client (pooled + retried)
- All Postgres queries go through mts1b_platform.db.get_pool (pooled)
- All NATS publish uses mts1b_platform.eventbus.publish_typed (typed + traced)
- LLM calls bounded by daily budget
- Backtest uses GPU when universe > 100
- Parquet queries filter on partition columns
- Watchdog alerts wired for: drift, vpin, slow consumer, dependency, db health
See also
- Concept: Event bus (NATS sizing)
- mts1b-platform: observability primitives
- Troubleshooting: when things go wrong