Staff / Principal Engineer – Core Engineering

TrueFoundry

Date: 12 hours ago
City: San Mateo, California
Contract type: Full time
Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we're redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes—with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

The Role: We are seeking a Staff / Principal Engineer to join our Core Engineering team as a senior technical leader based in the United States.

You will:

  • Solve some of the most complex engineering problems and drive them to resolution alongside a team of engineers & ML researchers
  • Build a deep, holistic understanding of the TrueFoundry platform across all components and shape the product vision and implementation
  • Act as the technical face of engineering for customer-related discussions and escalations
  • Guide and unblock engineers across projects in the US region
  • Partner closely with our CTO and India-based engineering team to drive system design, architecture, and implementation of complex products
  • Lead technical design, critical customer problem-solving, and platform scalability initiatives end-to-end

This is a high-ownership, high-impact role designed for an engineer who loves combining world-class systems thinking with real-world execution.

What You'll Do:

  • Develop deep expertise across TrueFoundry's platform stack — infrastructure, deployment systems, LLM/ML orchestration, observability, cost optimization, and more
  • Drive the system architecture and design for complex, distributed, cloud-native systems
  • Act as the technical point-of-contact for enterprise customer engineering needs and escalations
  • Lead and participate in design reviews, code reviews, and critical incident responses
  • Collaborate closely with the CTO on architectural decisions, scaling strategies, and technical roadmap prioritization
  • Guide and mentor US-based engineers across multiple initiatives, helping them deliver high-quality, scalable systems
  • Identify and drive technical debt cleanup, performance improvements, and resilience upgrades across the platform
  • Bring a product engineering mindset, ensuring that customer needs and feedback translate into scalable engineering solutions

Who You Are:

  • 8+ years of strong backend / systems engineering experience at top technology companies or startups
  • Deep expertise in distributed systems, cloud-native architectures, and scalable system design
  • Strong working knowledge of Kubernetes, containerized workloads, and infrastructure engineering
  • Practical experience building or deploying ML/GenAI applications (or closely working with ML/DS teams)
  • Skilled in programming languages such as Python, Go, or TypeScript
  • Solid understanding of system observability, resiliency design, and SRE practices
  • Strong technical leadership and communication skills — able to work with both customers and engineering teams
  • Ability to think strategically while also executing hands-on when required

Bonus: Experience supporting enterprise deployments of AI/ML infrastructure, model training, or inference systems



Why Join TrueFoundry?

  • Work directly with ex-Facebook engineers and with founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator.
  • First-hand exposure to building and scaling a deep-tech startup—insights you'll carry if you want to start your own one day.
  • Be part of a fearlessly experimental culture focused on customer success and long-term impact.

Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).