Senior Speech & Audio ML Engineer

Mappa

Software Engineering, Data Science
Latina, Province of Latina, Italy
USD 2,500-5,500 / month
Posted on Aug 28, 2025

Job Title: Senior Speech & Audio ML Engineer

Location: Remote

Type: Full-time

Salary Range: $2,500–5,500 USD / month

Role Purpose

We are looking for a Senior ML Engineer to build and ship core models for a speech-driven behavioral engine. You will own end-to-end modeling from raw, long-form audio and layered annotations to production inference.

Responsibilities include:

  • Designing audio features and embeddings.
  • Training and evaluating a suite of models.
  • Delivering reproducible pipelines that meet targets for accuracy, robustness, latency, and cost.

Non-Negotiables

  • Experience: 5+ years building production ML systems, including 2+ years in speech/audio.
  • Speech & Signal Processing: VAD, diarization, segmentation, denoising, spectral features (log-mel/MFCC), prosody (pitch/energy), long-form audio handling.
  • SOTA Audio Models & Embeddings: Wav2Vec2, HuBERT, WavLM (or similar); fine-tuning/self-supervised learning; contrastive/metric learning for downstream tasks.
  • Data Engineering & Quality: SQL, Python data stack (Pandas/Polars), ETL for audio + metadata, stratified sampling, leakage prevention, feature stores.
  • Evaluation Discipline: Golden sets, robust speaker/content splits, ROC/PR/calibration, fairness/bias checks, ablations, drift/shift detection on embeddings and audio quality.
  • MLOps, Serving & Reproducibility: FastAPI/gRPC around HF/torchaudio models, experiment tracking (W&B/MLflow), artifact/model versioning, CI/CD, observability, scalable batch/streaming inference.
  • Proven ability to create and document novel IP (methods, architectures, or training/eval techniques) with clear prior-art awareness.

Nice to Have

  • Tooling: SpeechBrain, Lightning, openSMILE/Praat, Kaldi/Conformer/Emformer, Label Studio.
  • Multimodal Skills: ASR (e.g., Whisper) + paralinguistic features; emotion/prosody modeling; speaker embeddings (x-vectors, ECAPA-TDNN).
  • Performance & Deployment: Quantization/distillation, Triton/CUDA basics, distributed training, real-time/streaming inference, on-device DSP (Rust/C++).
  • Publications/Patents/Competitions: Demonstrating novel audio modeling work.

Details

  • Full-time
  • Payment in USD [5000-5500 USD]
  • Remote