Lead Infrastructure and Reliability Engineer (Systems & Scale)

Luma AI

Palo Alto, CA

Category: Software Engineering

Job Description

We are hiring a Lead Infrastructure and Reliability Engineer to define the direction of our Infrastructure Engineering team. The role requires deep expertise in Linux and distributed systems, experience operating GPU / accelerator clusters, and strong fluency in Kubernetes and modern open-source infrastructure.

Requirements

Deep expertise in Linux and distributed systems
Experience operating GPU / accelerator clusters in real production environments
Strong fluency in Kubernetes and modern open-source infrastructure
Comfortable debugging across hardware → kernel → runtime → orchestration
Strong leadership and partnership skills

Benefits

Competitive salary
Equity
401(k) matching
Health, dental, and vision insurance
Paid time off
