Senior Machine Learning Engineer, On-Device & Mobile AI Optimization

Unity

San Francisco, CA

Category: Software Engineering

Job Description

Role Overview

We are building the next generation of AI-driven game experiences, running generative models on-device, right where the players are: on phones, tablets, laptops, and desktops. As a Senior Machine Learning Engineer for On-Device & Mobile AI, you will take state-of-the-art multi-modal models, including transformers, diffusion networks, and vision-language models (VLMs), and make them fast, small, and reliable on mobile and constrained hardware.

What You Will Do

Inference & On-Device Optimization: Own the optimization pipeline for the models you ship: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific tuning across NPU, mobile GPU, and desktop/laptop GPU. Apply quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets, and validate the optimized models against quality bars.

Why It Might Be a Fit

This role is for an engineer who is energized by the gap between a research model and a shipping, on-device product. If you enjoy profilers, frame captures, op-fusion, and shaving milliseconds and megabytes, this is your role.

Requirements

5+ years in software/ML engineering, with meaningful time focused on on-device / edge inference or real-time, performance-critical systems
Production deployment of transformer- and/or diffusion-based models (e.g., ViT, Stable Diffusion, CLIP/SigLIP-style encoders) on mobile, desktop, or embedded hardware — shipped, not just prototyped
Hands-on experience with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and a working understanding of operator fusion, memory layout, and runtime scheduling
Low-level performance engineering: solid command of at least one GPU/compute API — WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA — and the profiling tools to go with it
Working knowledge of model-optimization techniques — quantization (INT4/INT8/FP16), weight sharing, pruning, and distillation — and the judgment to apply them to hit latency and memory budgets
Understanding of target hardware: mobile SoCs (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and/or desktop/laptop GPUs (Apple Silicon, NVIDIA, AMD, Intel)
Strong Python for export pipelines and training-side tooling; familiarity with the core languages of a browser-native runtime (TypeScript/JavaScript, WGSL) is a plus
Working fluency with the models you deploy — enough to read an architecture, modify it for deployment, and reason about accuracy trade-offs
A collaborative working style: clear communication, reliable delivery, and a willingness to support and learn from teammates

Benefits

Comprehensive health, life, and disability insurance
Commute subsidy
Employee stock ownership
Competitive retirement/pension plans
Generous vacation and personal days
Support for new parents through leave and family-care programs
Office food and snacks
Mental Health and Wellbeing programs and support
Employee Resource Groups
Global Employee Assistance Program
Training and development programs
Volunteering and donation matching program
