Own model serving for multiple LLM/speech models on Modal. Build and maintain the APIs around those models. Create the feedback/eval loop to improve quality while meeting strict latency/cost SLOs.

Responsibilities

Host and scale real-time & batch inference on Modal (autoscaling, images/volumes/secrets).
Operate a multi-model fleet (versioning, routing, canaries/blue-green, traffic shaping).
Ship endpoints with auth, RBAC, quotas, rate limits, and telemetry.
Implement feedback pipelines, online A/B evals, and guardrails with actionable alerts.
Drive performance: profiling, batching, quantization, KV-cache, runtime tuning.
Establish observability and reliability (OTel, metrics/logs, SLOs, runbooks, on-call).
Build CI/CD and IaC for reproducible builds and one-click rollbacks.

Must-haves

5+ years in ML Ops/Platform/SRE with production LLM/ML serving.
Strong Python; high-throughput async APIs (FastAPI/Starlette) and GitHub-based CI/CD.
Deep experience with vLLM, TensorRT-LLM, Triton Inference Server, or ONNX Runtime.
Hands-on with Modal or an equivalent GPU/Kubernetes platform.
Solid observability (OTel) and incident response/postmortems.

Preferred

ONNX export expertise (PyTorch→ONNX), quantized/dynamic graphs, custom ops.
Safety/guardrails and constrained decoding.
Systems perf (CUDA/Triton kernels) or Rust for hot paths; load/chaos testing.

See more open positions at Mappa
