Principal Software Engineer, DataRobot

Location

Remote from US

Employment Type

Not explicitly stated, but implied full-time (typical for Principal Software Engineer roles)

Experience Level

Expert level
10+ years of engineering experience, including at least 5 years in infrastructure, platform, or backend systems roles

Company Mission

To change the way businesses all over the world make their most important decisions.

Role

Who you are

10+ years of engineering experience, including at least 5 years in infrastructure, platform, or backend systems roles
Deep expertise in Kubernetes internals and operations (networking, scheduling, scaling, controller patterns)
Proven ability to design and build systems from scratch with pragmatic tradeoffs
Strong proficiency in modern programming languages such as Python or Go
Experience building production-quality, reliable, and observable systems used across engineering organizations
Growth-oriented mindset—driven to teach, learn, and improve systems and people
Experience operating across multiple cloud providers (AWS, GCP, Azure) and/or hybrid environments
Strong experience with Helm, container orchestration patterns, and CI/CD automation
Comfortable working with Infrastructure as Code (IaC) tools like Terraform and Pulumi, and with GitOps workflows
Ability to influence without authority and align diverse stakeholders around technical decisions

Desirable:

Familiarity with Cilium, Kyverno, KEDA, Gateway API, OPA or similar technologies
Experience building and running multi-tenant SaaS platforms
Exposure to on-prem delivery models or regulated environments
Experience with performance tuning for large-scale data or compute workloads
Past success driving infrastructure transformation or decomposing legacy systems
Experience working with GPU infrastructure for training and inference

What the job involves

Technical leadership and vision as a Principal Software Engineer
Lead by example as a hands-on technical contributor: solve complex problems, shape architecture, and mentor engineers for career growth
Work across control plane systems, influence cross-team roadmaps, and bring pragmatic engineering practices to building, testing, and operating infrastructure software
Challenge assumptions and complexity, drive a high-performance culture, bring clarity to ambiguity, and create momentum where inertia exists
Participate in the on-call rotation, supporting a resilient, observable platform that requires minimal intervention
Design, develop, and optimize the inference engine powering DataRobot's agentic infrastructure API, ensuring fast, scalable, and efficient large language model (LLM) serving
Work on full GenAI inference stack: kernels/runtimes/orchestration/memory management
Collaborate with partners like NVIDIA to integrate new model architectures/features (sparsity, activation compression, mixture-of-experts)
Optimize latency, throughput, memory efficiency & hardware utilization across GPUs & accelerators
Build/maintain instrumentation/profiling/tracing tooling to identify bottlenecks & guide optimizations
Develop scalable routing/batching/scheduling/memory management/dynamic loading mechanisms for inference workloads
Integrate federated and distributed inference infrastructure: orchestrate nodes, balance load, and minimize communication overhead
Collaborate cross-functionally with platform engineering, cloud infrastructure, security, and compliance teams
Document/share learnings; contribute to internal best practices & open-source efforts when possible

Skills Mentioned

Programming Languages & Tools:

Python
Go
Terraform
Pulumi

Cloud Providers:

AWS
GCP
Azure

Container & Orchestration:

Kubernetes (internals & operations including networking/scheduling/scaling/controller patterns)
Helm
Container orchestration patterns

CI/CD & DevOps:

CI/CD automation
GitOps workflows

Other Technologies:

Cilium (desirable)
Kyverno (desirable)
KEDA (desirable)
Gateway API (desirable)
OPA (Open Policy Agent) (desirable)

Infrastructure & Systems:

Multi-cloud/hybrid environments
Multi-tenant SaaS platforms (desirable)
On-prem delivery models/regulated environments (desirable)
GPU infrastructure for training/inference (desirable)

Performance & Optimization:

Performance tuning for large-scale data/compute workloads (desirable)
Latency/throughput/memory efficiency/hardware utilization optimization

Inference Engine / AI Specific:

Large language model serving systems
GenAI inference stack components: kernels/runtimes/orchestration/memory management
Model-serving stack optimized for large-scale LLM inference
Integration of sparsity/activation compression/mixture-of-experts features

Salary Information

Salary not provided; no clues or estimates given.

Remote Work Allowed?

Yes — explicitly stated "Remote from US"

Application URL

https://app.welcometothejungle.com/companies/DataRobot