Role Overview
The TRAX Reliability team partners closely with application and infrastructure engineers to embed scalability, resilience, and technical risk management into trading systems from the ground up. We take a data-driven, proactive approach to ensure that our real-time trading systems can scale safely, perform reliably under extreme market conditions, and recover gracefully from failures before issues impact clients.
What You Will Do
- Identify, prioritize, and track scalability and reliability risks across large-scale trading platforms
- Partner with application teams to diagnose and address performance and resilience challenges
- Analyze system behavior under real and simulated load
- Design and run chaos engineering experiments and game-day exercises to validate system capacity and resilience
Why It Might Be a Fit
- Have a direct impact on the stability and resilience of execution platforms relied upon by the world’s leading buy-side firms
- Develop deep expertise in scaling, failure modes, and technical risk management for real-time trading systems
- Collaborate with engineers across New York, London, and Frankfurt
Requirements
- 5+ years of professional experience with a high-level programming language such as Python, Java, or C++, preferably on Unix/Linux
- Solid understanding of Unix/Linux fundamentals
- Hands-on experience contributing to or triaging scaling and reliability issues in production distributed systems
- Experience working with metrics, monitoring, or observability platforms, such as Grafana, Prometheus, or log analytics tools
- Strong analytical skills and the ability to reason about complex system behavior and failure modes
Benefits
- Bonus
- Paid holidays
- Paid time off
- Medical
- Dental
- Vision
- Short- and long-term disability benefits
- 401(k) + match
- Life insurance
- Wellness programs