Our client is building the kind of infrastructure most engineers only read about. They run an AI‑centric cloud that combines huge GPU clusters, high‑speed networks, and cloud‑native tooling into a platform used by enterprises, fast‑growing startups, and advanced research teams. The focus is simple: make it possible to train and run serious AI and simulation workloads without every customer having to build their own supercomputer.
They’re publicly traded and growing quickly, with R&D hubs across North America, Europe, and the Middle East. The culture is very engineering‑driven: low on bureaucracy, high on ownership, and built around people who like hard infrastructure problems and seeing their work show up in real customer workloads. You’ll be working with colleagues who care about doing things properly at scale, not just shipping another dashboard.
You’ll be the person customers turn to when they want to stand up or scale out serious GPU and HPC environments in the cloud: multi‑rack clusters, fast interconnects, complex scheduling, and demanding SLAs around throughput and latency.
As an HPC Specialist Solutions Architect, you’ll design and tune next‑generation platforms for AI training, large simulations, and data‑heavy workloads. You’ll work directly with NVIDIA’s latest hardware (Hopper, Blackwell, and successors), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, and you’ll have a real say in how the platform and reference architectures evolve. If you enjoy going from “here’s the workload” to “here’s the cluster and how we squeeze the last 20–30% out of it,” this will feel like home.
Design real clusters: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. You’ll think about everything from node types and GPU topology to queues, partitions, and failure modes.
Shape GPU‑accelerated infrastructure: Integrate NVIDIA Hopper and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, making sure the hardware layout actually matches the communication patterns of the workloads you run.
Automate GPU and network lifecycle: Deploy and manage GPU Operator and Network Operator so that drivers, CUDA, firmware, and high‑speed networking are consistent and automated across large fleets, not managed box by box.
Make the cloud behave like a supercomputer: Design and validate cloud‑native HPC environments that still deliver low latency, high bandwidth, and predictable scheduling. You’ll dig into utilization, preemption, and fragmentation to squeeze out every bit of performance.
Set the standard for AI/HPC architectures: Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD. When customers ask “how should we do this?”, your work will be what “good” looks like.
Work directly with vendors and partners: Collaborate with NVIDIA and other partners to evaluate new GPU generations, interconnects, and software stacks. You’ll help decide what is ready for prime time and under which conditions.
Debug the hard problems: Benchmark performance, track down bottlenecks across compute, network, and storage, and recommend concrete changes that move the needle rather than just checking a box (for a taste, see the bandwidth‑probe sketch after this list).
Be a trusted voice to customers: Lead design sessions, architecture reviews, and operational excellence check‑ins with customers who care a lot about performance and reliability. You’ll translate between “this job keeps timing out” and “here’s what we’ll change in the topology and scheduler.”
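For a taste of the day‑to‑day, here’s the kind of first‑pass check that pairs with the benchmarking bullet above: a minimal all‑reduce bandwidth probe built on torch.distributed with the NCCL backend. It’s a sketch under stated assumptions, not a house methodology; the payload size, iteration counts, and torchrun launch are illustrative.

```python
# Minimal all-reduce bandwidth probe (sketch). Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_probe.py
# Payload size and iteration counts below are illustrative assumptions.
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")  # NCCL for GPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    nbytes = 1 << 28  # 256 MiB payload; sweep sizes in real testing
    tensor = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    # Warm up so NCCL can establish its channels before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 0:
        # Ring all-reduce moves ~2*(n-1)/n of the payload per rank,
        # the "bus bandwidth" convention used by nccl-tests.
        busbw = (nbytes * 2 * (world - 1) / world) * iters / elapsed / 1e9
        print(f"world={world} size={nbytes / 2**20:.0f}MiB busbw~{busbw:.1f} GB/s")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it at a couple of node counts and compare against the fabric’s line rate; a large gap usually points at topology, placement, or NCCL tuning rather than the hardware itself.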
A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).
3+ years actually building or running HPC or large GPU clusters—on‑prem, cloud, or hybrid. You’ve owned outcomes, not just submitted jobs.
Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.
A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch. You understand why topology and fabric design matter, and you’ve seen what happens when they’re wrong.
Experience with storage and I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar systems where throughput, latency, and contention actually matter.
Comfort with Terraform, Ansible, Helm, and GitOps‑style workflows to keep configurations reproducible and sane.
Good scripting skills in Python or Bash; you’re happy to automate checks, glue systems together, or prototype tooling (a small example of what we mean follows this list).
You write and speak clearly, can lead a design review without losing the room, and can keep both engineers and non‑technical stakeholders on the same page.
Legal authorization to work in the U.S. on a full-time basis without visa sponsorship.
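To make the scripting bullet concrete, here’s a minimal “glue” sketch: shell out to Slurm’s sinfo and flag nodes stuck in unhealthy states, the sort of check you might wire into a cron job or a chat alert. The format string and the set of states to flag are site‑specific assumptions.

```python
# Tiny node-health check (sketch): ask Slurm which nodes look unhealthy.
# The states to flag are a site-specific assumption; adjust per cluster.
import subprocess
import sys

BAD_STATES = {"down", "drain", "drng", "fail", "maint"}  # assumption


def unhealthy_nodes():
    # -N: one line per node, -h: no header, -o "%N %t": "<node> <state>"
    out = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = []
    for line in out.splitlines():
        node, state = line.split()
        # Slurm may append suffixes like '*' (non-responding); strip them.
        if state.rstrip("*~#!%$@^-") in BAD_STATES:
            flagged.append((node, state))
    return flagged


if __name__ == "__main__":
    bad = unhealthy_nodes()
    for node, state in bad:
        print(f"{node}: {state}")
    sys.exit(1 if bad else 0)
```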
Hands‑on with the NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and managing CUDA stacks across production clusters.
Experience with MLflow, Kubeflow, NeMo, or similar for AI/ML pipelines, or with distributed training frameworks like PyTorch DDP, DeepSpeed, or Megatron.
Time spent with Slurm, LSF, PBS, or similar on real clusters, not just in a lab.
Experience with multi‑tenant GPU environments or “AI training farms.”
Familiarity with observability stacks for HPC: Prometheus, DCGM Exporter, Grafana, and NGC tools (see the small scrape sketch after this list).
Any open‑source work in HPC, CUDA, or Kubernetes is a strong plus.
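To ground the observability bullet, here’s a small sketch that reads per‑GPU utilization straight off a dcgm-exporter metrics endpoint and flags idle devices. The endpoint (9400 is the exporter’s usual default port), metric name, and threshold are assumptions to adapt.

```python
# Sketch: scrape a dcgm-exporter endpoint and flag idle GPUs.
# Endpoint, metric name, and threshold are assumptions; tune per fleet.
import urllib.request

ENDPOINT = "http://localhost:9400/metrics"  # assumed local exporter
UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"        # per-GPU utilization gauge
THRESHOLD = 10.0                            # percent


def idle_gpus(endpoint=ENDPOINT):
    text = urllib.request.urlopen(endpoint, timeout=5).read().decode()
    idle = []
    for line in text.splitlines():
        # Prometheus exposition lines look like: NAME{labels} value
        if line.startswith(UTIL_METRIC + "{"):
            labels, value = line.rsplit(" ", 1)
            if float(value) < THRESHOLD:
                idle.append(labels)
    return idle


if __name__ == "__main__":
    for gpu in idle_gpus():
        print("idle:", gpu)
```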
You like understanding a workload deeply, then designing a cluster and config that fits it like a glove.
You’re comfortable saying, “This is fast, but we can make it faster—and here’s how,” and then proving it with numbers.
You enjoy working directly with customers and partners, but you still want to stay close to the technology.
You prefer a low‑ego, high‑ownership environment where people care more about doing the right thing than about title.
Serious compensation: OTE in the $225,000–$315,000 range, plus equity, calibrated to your experience and location.
Real benefits: 100% employer‑paid medical, dental, and vision for you and your family; 4% 401(k) match with immediate vesting; company‑paid short‑ and long‑term disability and life insurance.
Time for life: 20 weeks of paid parental leave for primary caregivers, 12 weeks for secondary caregivers.
Remote‑first: Work from where you are in the US, with support for your home office (mobile + internet stipend).
Hardware you actually want to work on: H200, B200, GB200‑class GPUs, NVLink/NVSwitch, InfiniBand/RoCE, and clusters that are genuinely in “top of the market” territory.
Impact: The platforms you design will be used to train cutting‑edge models and run workloads that actually push the limits of current hardware.
Step 1 – HR screen
Step 2 – Hiring manager interview
Step 3 – Technical assignment / challenge
Step 4 – Leadership meeting
Step 5 – References & background check
Step 6 – Offer
We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.