Distributed Training Engineer
Menlo Park, Remote
Full Time
6 days ago
Senior Level
Worldwide
Over $120K USD per year

Job Description

Distributed Training Engineer

Location

Menlo Park, Remote

Employment Type

Full time

Department

Bits: LLMs, machine learning, infra, etc.

About Periodic Labs

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

About the role

You will optimize, operate, and develop the large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world’s best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.

You might thrive in this role if you have experience with:

  • Training on clusters with 65,000 GPUs
  • 5D parallel LLM training
  • Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Optimizing training throughput for large-scale Mixture-of-Experts models
Job Expired

This job posting has expired and is no longer accepting applications.

About Periodic Labs

Periodic Labs aims to create AI scientists and autonomous laboratories for them to operate, with a focus on accelerating discovery in the physical sciences. Its AI scientists and autonomous labs generate high-quality experimental data, enabling new scientific discoveries and applications such as higher-temperature superconductors and improved semiconductor manufacturing.
