Staff Site Reliability Engineer, Cloud

Who we are

Kentik is the network intelligence platform for modern infrastructure teams. Unlike traditional monitoring and observability tools, we demystify complex network operations, enabling organizations to deliver applications and innovation at scale. Built by network experts to make critical insight accessible to every engineer, Kentik is the real-time source of truth that understands every network in context, from data center to cloud to the internet. This single platform unifies and correlates cloud, device, flow, and synthetic data to turn telemetry into action. Market leaders like Akamai, Booking.com, Dropbox, and Zoom rely on Kentik to run, manage, and optimize their networks.

What we do

Our platform ingests trillions of records and serves hundreds of thousands of queries for our users each day. You will gain experience building a production-quality, high-performance server-and-client SaaS application that handles uniquely high volumes of data. We have built a team of world-class engineers, network experts, and technology thought leaders in a remote-friendly culture from day one. While prior experience in a remote environment is not required, we highly value strong collaboration and communication skills, as well as a high level of independence and autonomy.

What you'll do

Kentik is looking for a Staff-level Site Reliability Engineer (Cloud) to join our Product Engineering team to help build and maintain our Synthetics and Cloud product lines. These products have multiple applications deployed in various cloud providers all over the world. We manage these cloud applications using observability tooling, automated build processes, and adherence to configuration-as-code best practices. We’re looking for an experienced engineer who will work with engineering teams across the company to help grow our hardware and software infrastructure. We operate a well-organized, well-instrumented platform, and offer enormous opportunities for employee growth.

Make sure our real-time, scalable infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware across multiple locations as well as all major cloud vendors.
Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth.
Deep-dive into diverse topics, from firewalls and IP routing to database replication strategies or automating build processes.
Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective.
Assist with expanding our cloud deployments across the major cloud providers.
Contribute code reviews and tools or patches to all kinds of existing code.
Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure.
Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team.

What you'll bring

Studies have shown that some candidates tend to apply to jobs only if they meet 100% of the qualifications. We encourage you to apply if you meet most of the criteria. Even if you don’t match all of the qualifications, your skills and experience could be valuable in this role!

8+ years of experience in cloud-based Systems Administration, IT, or SRE-related projects.
Expertise in public cloud environments such as AWS, GCP, Azure, or OCI.
Strong command of containerization and orchestration using Docker and Kubernetes.
Solid programming and automation skills using Bash, Python, or Go.
Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, or Puppet.
Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk).
Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS).
Networking administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar.
A passion for documenting code, processes, and infrastructure in runbooks and wikis.
Experience with metrics and monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry.
Experience creating and managing tickets with third-party vendors, and owning cloud vendor and partner relationships.

Nice to haves:

Familiarity with Kubernetes automation tools, specifically managing complex deployments with Helm and Helmfile.
Knowledge of scaling Kubernetes workloads and compute infrastructure.
Experience optimizing CI/CD build and deploy pipelines using GitHub Actions or Jenkins.
Knowledge of PagerDuty integrations.

Tech Stack:

Go (core engine), Node.js + Express (app serving), React (UI).
JS and Python for tooling and scripting.
Databases: Postgres, Kafka, MySQL, Redis.
APIs: REST/JSON and gRPC endpoints.
Traffic routing: HAProxy, Envoy.

Compensation Range: $165,000 - $200,000 USD annually

Benefits: Health coverage premiums paid by company, HRA accounts, paid leave, holidays, retirement account options, home office reimbursement, stock options.