Senior Machine Learning Engineer, On-Device & Mobile AI Optimization

Unity
San Francisco, CA
Job Description
Role Overview

We are building the next generation of AI-driven game experiences, running generative models on-device, right where the players are — on phones, tablets, laptops, and desktops. As a Senior Machine Learning Engineer for On-Device & Mobile AI, you will take state-of-the-art multi-modal models — transformers, diffusion networks, and vision-language models (VLMs) — and make them run fast, small, and reliably on mobile and constrained hardware.

What You Will Do

Inference & On-Device Optimization: Own the optimization pipeline for the models you ship: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific tuning across NPU, mobile GPU, and desktop/laptop GPU. Apply quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets, then validate the optimized models against quality bars.

Why It Might Be a Fit

This role is for an engineer who is energized by the gap between a research model and a shipping, on-device product. If you enjoy profilers, frame captures, op-fusion, and shaving milliseconds and megabytes, this is your role.

Requirements

  • 5+ years in software/ML engineering, with meaningful time focused on on-device / edge inference or real-time, performance-critical systems
  • Production deployment of transformer- and/or diffusion-based models (e.g., ViT, Stable Diffusion, CLIP/SigLIP-style encoders) on mobile, desktop, or embedded hardware — shipped, not just prototyped
  • Hands-on experience with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and a working understanding of operator fusion, memory layout, and runtime scheduling
  • Low-level performance engineering: solid command of at least one GPU/compute API — WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA — and the profiling tools to go with it
  • Working knowledge of model-optimization techniques — quantization (INT4/INT8/FP16), weight sharing, pruning, and distillation — and the judgment to apply them to hit latency and memory budgets
  • Understanding of target hardware: mobile SoCs (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and/or desktop/laptop GPUs (Apple Silicon, NVIDIA, AMD, Intel)
  • Strong Python for export pipelines and training-side tooling; familiarity with the core languages of a browser-native runtime (TypeScript/JavaScript, WGSL) is a plus
  • Working fluency with the models you deploy — enough to read an architecture, modify it for deployment, and reason about accuracy trade-offs
  • A collaborative working style: clear communication, reliable delivery, and a willingness to support and learn from teammates

Benefits

  • Comprehensive health, life, and disability insurance
  • Commuter subsidy
  • Employee stock ownership
  • Competitive retirement/pension plans
  • Generous vacation and personal days
  • Support for new parents through leave and family-care programs
  • Office food and snacks
  • Mental health and wellbeing programs and support
  • Employee Resource Groups
  • Global Employee Assistance Program
  • Training and development programs
  • Volunteering and donation matching program