Performance notes

Where MTS1B is fast, where it isn't, and how to tune it.

Latency budget (live trading hot path)

End-to-end target: strategy signal → broker submission ≤ 100 ms p99.

Step	Target	Why
Strategy signal generation	≤ 10 ms	factor compute (cached features)
Portfolio sizing	≤ 5 ms	quantkit kelly/vol-target
OMS receive + idempotency check	≤ 5 ms	dedupe lookup
Risk gate 1 (idempotency)	≤ 1 ms	hashed dedupe
Risk gate 2 (schema)	≤ 1 ms	pydantic validate
Risk gate 3 (static)	≤ 2 ms	in-memory lookups
Risk gate 4 (position)	≤ 5 ms	position store + cov compute
Risk gate 5 (drawdown)	≤ 2 ms	NAV diff
Risk gate 6 (short)	≤ 2 ms	borrow cache
Risk gate 7 (CRO veto, optional)	≤ 5 s	LLM call (fail-OPEN)
Broker submit	≤ 20 ms	network to venue
Total	≤ 100 ms p99

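Without the CRO veto, the hot path is ≤ 50 ms p99. The veto's ≤ 5 s bound is far outside the 100 ms total, so treat it as a separate timeout on the optional LLM call rather than as part of the budget.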
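
One way to check the budget in practice is to time each step and compare its observed p99 with the target. The sketch below is a minimal, hypothetical harness: `StepTimer`, `p99`, `over_budget` and the step labels are illustrative names, not part of MTS1B. It uses a nearest-rank percentile and the millisecond targets from the table, leaving out the optional CRO veto.

```python
import time
from contextlib import contextmanager

# Per-step p99 targets in milliseconds, copied from the table above.
# Keys are hypothetical labels; the optional CRO veto (gate 7) is left out.
BUDGET_MS = {
    "signal": 10, "sizing": 5, "oms_receive": 5,
    "gate1_idempotency": 1, "gate2_schema": 1, "gate3_static": 2,
    "gate4_position": 5, "gate5_drawdown": 2, "gate6_short": 2,
    "broker_submit": 20,
}


class StepTimer:
    """Collects wall-clock durations (ms) per named step."""

    def __init__(self):
        self.samples = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples.setdefault(name, []).append(elapsed_ms)


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def over_budget(samples, budget=BUDGET_MS):
    """Return {step: observed_p99} for steps whose p99 exceeds the target."""
    report = {}
    for name, values in samples.items():
        if name in budget and values:
            observed = p99(values)
            if observed > budget[name]:
                report[name] = observed
    return report
```

Wrap each step as `with timer.step("gate4_position"): ...` and call `over_budget(timer.samples)` periodically; an empty result means every measured step is within its target.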

Backtest throughput

GPU vs CPU comparison (Russell 1000 × 10 years daily)

Backend	Universe size	Period	Wall time	Speedup
CPU (numpy)	100	10 yr	8 sec	1x
CPU (numpy)	1000	10 yr	95 sec	1x
CPU (numpy)	1000	10 yr (1m bars)	18 min	1x
GPU (cupy RTX 4090)	100	10 yr	1.2 sec	~7x
GPU (cupy RTX 4090)	1000	10 yr	4.8 sec	~20x
GPU (cupy RTX 4090)	1000	10 yr (1m bars)	38 sec	~28x
GPU (cupy H100)	1000	10 yr (1m bars)	14 sec	~77x
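
The Speedup column is the CPU wall time divided by the GPU wall time for the same workload. A small check of that arithmetic, using only the numbers in the table (the dictionary keys are illustrative labels, and 18 min is converted to 1080 s):

```python
# Wall times in seconds, copied from the table above.
CPU_SECONDS = {"100 daily": 8.0, "1000 daily": 95.0, "1000 1m": 18 * 60.0}
GPU_SECONDS = {
    ("RTX 4090", "100 daily"): 1.2,
    ("RTX 4090", "1000 daily"): 4.8,
    ("RTX 4090", "1000 1m"): 38.0,
    ("H100", "1000 1m"): 14.0,
}


def speedups():
    """Map (device, workload) to CPU time / GPU time for that workload."""
    return {key: CPU_SECONDS[key[1]] / gpu for key, gpu in GPU_SECONDS.items()}


for (device, workload), ratio in speedups().items():
    print(f"{device:9s} {workload:11s} {ratio:5.1f}x")
```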

For large parameter sweeps (ladder), the GPU advantage compounds:

100k combos × 1000 universe × 10 yr daily:
- CPU: ~10 hours
- GPU (RTX 4090): ~30 minutes
- GPU (H100): ~10 minutes

When CPU is fine

- Universe < 100 symbols
- Period < 5 years daily
- One-off run (no sweep)
- Development on a laptop without CUDA

When GPU pays off

- Universe > 200 symbols
- Intraday bars (1m / 5m)
- Parameter sweeps (anything > 1k combinations)
- Walk-forward CV (multi-fold runs)
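
The two lists can be folded into a small decision helper. This is a sketch only: `pick_backend` and `gpu_stack_present` are hypothetical names, MTS1B may select its backend differently, and checking that `cupy` is importable does not prove a CUDA device is actually usable.

```python
import importlib.util


def gpu_stack_present():
    """True if the cupy package is importable (does not probe for a device)."""
    return importlib.util.find_spec("cupy") is not None


def pick_backend(n_symbols, intraday=False, n_combos=1, n_folds=1,
                 gpu_available=None):
    """Return "gpu" or "cpu" using the thresholds listed above."""
    if gpu_available is None:
        gpu_available = gpu_stack_present()
    if not gpu_available:
        return "cpu"  # e.g. a laptop without CUDA
    wants_gpu = (
        n_symbols > 200      # large universe
        or intraday          # 1m / 5m bars
        or n_combos > 1000   # parameter sweep
        or n_folds > 1       # walk-forward CV
    )
    return "gpu" if wants_gpu else "cpu"
```

Universes between 100 and 200 symbols fall in neither list; this sketch keeps them on the CPU unless another criterion applies.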

Memory

Foundation library

Tiny — under 10 MB resident.

Platform primitives

Primitive	Resident
Logging setup	~5 MB
Config + Vault client	~15 MB
HTTP client (httpx)	~10 MB
NATS client	~5 MB
Postgres pool	~30 MB
Redis client	~5 MB

A typical service

Service	Idle	Active
mts1b-foundation (library)	10 MB	10 MB
mts1b-platform (library)	50 MB	80 MB
mts1b-marketdata (service)	80 MB	200 MB
mts1b-oms (service)	100 MB	300 MB
mts1b-riskengine (service)	80 MB	150 MB
mts1b-research (service)	200 MB	1-4 GB (active sweep)
mts1b-GPUbacktester (service)	100 MB host	8-24 GB GPU (active backtest)
mts1b-datalake (service)	150 MB	500 MB-2 GB (active ingest)

Per-service tuning lives in mts1b.config under each section.

Concurrency

Async first

Every I/O-bound function in MTS1B is async. Don't mix in synchronous (blocking) HTTP calls: a single blocking call stalls every other task on the event loop.

❌ Wrong:

import requests
response = requests.get("https://api.example.com")    # blocks the event loop

✅ Right:

from mts1b_platform.http import http_client

async def fetch_example():
    # `async with` and `await` are only valid inside a coroutine
    async with http_client("example") as c:
        return await c.get("https://api.example.com")
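
When a dependency only offers a blocking client, one common workaround (not something this page prescribes) is to push the call onto a worker thread with `asyncio.to_thread`, available from Python 3.9. In the sketch below, `legacy_blocking_call` is a hypothetical stand-in for such a client, and `heartbeat` exists only to show that the event loop keeps running while the blocking call is in progress.

```python
import asyncio
import time


def legacy_blocking_call(seconds):
    """Stand-in for a sync-only client call (hypothetical)."""
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks, interval):
    """Counts ticks to show the event loop is not blocked."""
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(interval)
        count += 1
    return count


async def main():
    # to_thread runs the blocking function in a worker thread and
    # returns an awaitable, so other tasks continue in the meantime.
    result, beats = await asyncio.gather(
        asyncio.to_thread(legacy_blocking_call, 0.2),
        heartbeat(ticks=5, interval=0.02),
    )
    return result, beats


if __name__ == "__main__":
    print(asyncio.run(main()))
```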

Multi-process (CPU-bound work)

Async helps with I/O, not CPU. For CPU-heavy work (factor compute, optimization):

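from concurrent.futures import ProcessPoolExecutor

# run_one_param_set must be a module-level (picklable) function;
# param_grid is any iterable of parameter sets.
if __name__ == "__main__":   # needed where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=8) as ex:
        results = list(ex.map(run_one_param_set, param_grid))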

Or use mts1b-cloudburst to fan out to rented GPU instances.

NATS consumer concurrency

sub = await js.subscribe(
    "mts.v1.oms.fills.created",
    durable="my-consumer",
    max_ack_pending=100,        # process up to 100 in flight
)

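Multiple consumer instances bound to the same durable name share work, like a queue. If js is a plain nats-py JetStream context, note that a push subscription also needs a shared queue group for this to work; pull consumers share a durable without one.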
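
max_ack_pending bounds how many unacknowledged messages the server hands out; the handler side can apply a matching bound on concurrent work. The sketch below shows that idea with plain asyncio and makes no NATS call. `process_bounded` is a hypothetical helper, and `messages` stands for whatever batch the subscription yields.

```python
import asyncio


async def process_bounded(messages, handler, max_in_flight):
    """Run handler(msg) for every message, at most max_in_flight at a time.

    Results are returned in the same order as the input messages.
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def run_one(msg):
        async with gate:  # waits here while max_in_flight handlers are busy
            return await handler(msg)

    return await asyncio.gather(*(run_one(m) for m in messages))
```

In a real consumer the handler would acknowledge the message after processing it, so that the number of in-flight handlers and the number of pending acks stay aligned.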

Disk I/O

Parquet partitioning

The data lake is partitioned by year/month + symbol/interval. Queries that filter on partition columns are fast:

# Fast — uses partition pruning
df = lake.equities.bars.read(
    symbols=["AAPL"],
    interval="daily",
    start="2024-01-01", end="2024-06-01",
)

# Slow: no date filter, so every year/month partition for the symbol is read
df = lake.equities.bars.read(
    symbols=["AAPL"],
    interval="daily",                     # all dates
)
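
To make the pruning concrete, the sketch below lists which partitions a date-bounded read would have to touch. The Hive-style `symbol=/interval=/year=/month=` directory layout and the inclusive end date are assumptions for illustration; the actual lake layout and the helper names (`month_partitions`, `partition_paths`) are not taken from MTS1B.

```python
from datetime import date


def month_partitions(start, end):
    """List (year, month) pairs overlapping [start, end] (ISO date strings)."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if last < first:
        return []
    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def partition_paths(symbol, interval, start, end):
    """Paths under an assumed symbol/interval/year/month layout."""
    return [
        f"symbol={symbol}/interval={interval}/year={y}/month={m:02d}"
        for y, m in month_partitions(start, end)
    ]
```

For the fast query above this yields six month partitions for one symbol; without the date bounds, every month ever ingested for that symbol is a candidate.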

Compression

Type	Compression	Speed	Size reduction vs CSV
Daily bars	snappy	fast	5x
Intraday bars	zstd:3	medium	12x
News text	zstd:3	medium	8x
Options chains	snappy	fast	4x

Reading parquet is 5-50x faster than CSV due to columnar layout.

DuckDB for ad-hoc

with lake.duckdb_session() as conn:
    df = conn.execute("""
        SELECT symbol, AVG(close) AS avg_close
        FROM equities.bars
        WHERE ts BETWEEN '2024-01-01' AND '2024-06-01'
        GROUP BY symbol
    """).pl()

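DuckDB pushes filters and column projections down into the Parquet reader: row groups whose min/max statistics fall outside the WHERE range are skipped, and only the referenced columns are read.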

Network

Rate limits

Each adapter respects venue rate limits via mts1b_platform.ratelimit.RateLimiter. Limits are shared across processes via Redis.

Provider	Free tier	Paid
FMP	250/day	30k+/day
Polygon	5/min	unlimited
Coinbase Advanced	30 req/sec	100+/sec
IBKR Gateway	50 req/sec	unlimited

Hitting a rate limit triggers exponential backoff via mts1b_platform.retry.
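
For readers unfamiliar with the pattern, here is a minimal sketch of exponential backoff with full jitter. It is not the mts1b_platform.retry implementation: `backoff_delays`, `call_with_backoff`, the `RateLimited` exception and all default values are hypothetical.

```python
import asyncio
import random


def backoff_delays(attempts, base=0.5, factor=2.0, cap=30.0, jitter=True,
                   rng=random):
    """Yield one sleep duration (seconds) per retry attempt."""
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] so that clients
        # hitting the same limit do not retry in lockstep.
        yield rng.uniform(0.0, ceiling) if jitter else ceiling


class RateLimited(Exception):
    """Hypothetical error raised when a venue rejects a request as throttled."""


async def call_with_backoff(func, attempts=5, **delay_kwargs):
    """Await func(); on RateLimited, sleep and retry up to `attempts` times."""
    for delay in backoff_delays(attempts, **delay_kwargs):
        try:
            return await func()
        except RateLimited:
            await asyncio.sleep(delay)
    return await func()  # final try; lets the error propagate
```

The cap keeps a long outage from producing multi-minute sleeps, and disabling jitter (`jitter=False`) makes the schedule deterministic for tests.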

Connection pooling

HTTP client factory keeps connections alive:

import asyncio
from mts1b_platform.http import http_client

async def last_trades(symbols):
    async with http_client("polygon", base_url="https://api.polygon.io",
                           max_connections=20, max_keepalive_connections=10) as c:
        # All requests share the pool
        return await asyncio.gather(*[c.get(f"/v2/last/trade/{s}") for s in symbols])

NATS publish latency

Local cluster: < 1 ms p99 for publish-ack. Cross-region: 10-100 ms. JetStream durable adds ~1-3 ms for disk persistence.

Database

Postgres pool size

Default is 20 connections. For a service handling > 100 concurrent requests, increase:

db:
  primary:
    dsn: postgres://...
    pool_size: 50
    max_overflow: 20

Monitor:

postgres_pool_active{pool="primary"}
postgres_pool_idle{pool="primary"}
postgres_pool_waiting{pool="primary"}

If waiting > 0 consistently, increase the pool. If idle = pool_size consistently, decrease it.
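
The rule of thumb above can be applied to a window of scraped gauge values. In this sketch, `pool_advice` is a hypothetical helper, and treating "consistently" as "at least 90% of samples" is an assumption rather than a documented threshold.

```python
def pool_advice(samples, pool_size, threshold=0.9):
    """Suggest "increase", "decrease" or "keep" from gauge samples.

    samples: list of (active, idle, waiting) tuples, one per scrape.
    threshold: share of samples that counts as "consistently" (assumed).
    """
    if not samples:
        return "keep"
    n = len(samples)
    waiting_share = sum(1 for _, _, waiting in samples if waiting > 0) / n
    all_idle_share = sum(1 for _, idle, _ in samples if idle >= pool_size) / n
    if waiting_share >= threshold:
        return "increase"   # callers are queueing for a connection
    if all_idle_share >= threshold:
        return "decrease"   # the whole pool sits unused
    return "keep"
```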

Indexes

Critical indexes (auto-created):

orders(order_id) PK
orders(idempotency_key, created_at) unique within dedup window
orders(fund_id, created_at) for fund views
fills(order_id)
positions(fund_id, symbol) PK
audit_chain(sequence) PK
audit_chain(subject_id, timestamp) for "show order trail"

Verify:

SELECT indexname FROM pg_indexes WHERE schemaname = 'public' ORDER BY indexname;

LLM cost

Persona	Calls/day	Avg tokens	Cost/day
CRO (gate 7)	~500	800 in / 200 out	~$3
equities_analyst	~50	2000 in / 500 out	~$1.50
news_summarizer	~20	5000 in / 800 out	~$2
quant_screener	~10	3000 in / 1500 out	~$3
...
TOTAL	~700		~$15-25/day

Semantic cache hits typically reduce cost by 60-80% on stable workloads.
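
As a rough worked example of what that range means, assume the table's total is the spend before caching (the page does not say which it is). `cost_after_cache` is a hypothetical helper that applies a fractional reduction to a daily cost.

```python
def cost_after_cache(daily_cost, reduction):
    """Daily cost once a fraction `reduction` of spend is avoided by cache hits."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return daily_cost * (1.0 - reduction)


# Bounds from the text: $15-25/day total, 60-80% reduction.
best_case = cost_after_cache(15.0, 0.80)
worst_case = cost_after_cache(25.0, 0.60)
print(f"${best_case:.2f} to ${worst_case:.2f} per day")
```

Under that assumption the cached spend lands at roughly a fifth to two fifths of the uncached figure.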

Where to look for slowness

mts mts1b-platform tail --slow-only --threshold-ms 50
# Streams any operation > 50ms across all services

mts mts1b-platform metric --top 10 --since 1h
# Top 10 slowest operations in the last hour

mts1b-deploy open grafana
# Browse dashboards → Service Overview

Tuning checklist

- All HTTP calls use mts1b_platform.http.http_client (pooled + retried)
- All Postgres queries go through mts1b_platform.db.get_pool (pooled)
- All NATS publish uses mts1b_platform.eventbus.publish_typed (typed + traced)
- LLM calls bounded by daily budget
- Backtest uses GPU when universe > 100
- Parquet queries filter on partition columns
- Watchdog alerts wired for: drift, vpin, slow consumer, dependency, db health
