About Gruve
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.
Position Summary:
We are looking for a highly experienced Network Architect & Operations Lead to drive the design, operation, and optimization of large-scale, AI inferencing data center environments. The ideal candidate will have a strong background in architecting and running complex distributed infrastructures supporting GPU-as-a-Service (GPUaaS) workloads, high-throughput AI inferencing traffic, GPU cluster networking, and globally distributed network topologies.
This role is central to ensuring the uptime, reliability, and scalability of our inferencing data centers. It calls for a senior professional with CCIE-level expertise (or equivalent capability) who can combine architecture, operations, team leadership, and automation skills to deliver operational excellence across a 24/7 GPU inferencing environment. The ideal candidate will lead and mentor a tiered team of L1, L2, and L3 network engineers, owning end-to-end network operations and SLA commitments for our AI inferencing data centers. Experience in high-scale environments such as Lambda Labs, Nvidia, Equinix, AWS, Azure, Google Cloud, or similar large AI infrastructure or hyperscale organizations is highly preferred.
Key Responsibilities:
- Operate large-scale AI inferencing data center network environments across multiple sites.
- Lead and manage a tiered NOC team (L1, L2, L3), driving operational discipline, escalation processes, and continuous uptime of inferencing infrastructure.
- Own uptime and SLA commitments for GPU inferencing data centers, ensuring high availability and rapid incident response.
- Design and maintain scalable network topologies supporting high-throughput GPU cluster traffic, east-west data flows, API traffic, VPN connectivity, and GPUaaS tenant workloads.
- Lead technical design decisions for leaf-spine data center architectures using Arista switching platforms.
- Configure, manage, and troubleshoot advanced routing protocols with a strong focus on BGP, EVPN, VXLAN, and large-scale traffic engineering.
- Manage and optimize Cisco firewall platforms (Firepower/FTD) to ensure secure and efficient traffic flows across tenant and infrastructure networks.
- Support GPU cluster networking including high-bandwidth, low-latency east-west traffic between GPU nodes, with familiarity in RDMA over Converged Ethernet (RoCE) or similar low-latency fabrics.
- Drive operational excellence through structured NOC methodologies, runbook standardization, incident management, and continuous optimization.
- Improve network efficiency through automation, tooling, and operational workflows.
- Perform deep troubleshooting and debugging of complex network and performance issues across Arista, Cisco, and Dell infrastructure.
- Lead network capacity planning, upgrades, and infrastructure scaling initiatives in line with GPU compute growth.
- Collaborate with cross-functional engineering teams (compute, storage, platform, security) to support business growth and ensure high availability.
- Document architecture standards, operational procedures, runbooks, and best practices.
Basic Qualifications:
- 10+ years of hands-on experience in network architecture or network operations within large-scale enterprise, cloud, or AI/HPC data center environments.
- Strong expertise in advanced routing & switching technologies, including BGP, OSPF, EVPN, and VXLAN.
- Deep operational understanding of multi-site, high-scale data center network infrastructure.
- Hands-on experience with Arista EOS and Arista switching platforms in a data center environment.
- Proven experience managing complex network topologies and distributed environments.
- Experience leading and managing NOC teams (L1/L2/L3), including escalation frameworks, shift management, and SLA ownership.
- Familiarity with high-performance compute (HPC) or GPU cluster networking and associated traffic patterns.
- Strong troubleshooting, debugging, and analytical skills across multi-vendor environments.
- Practical experience with network automation and operational optimization.
Preferred Qualifications:
- CCIE certification (or equivalent real-world expertise).
- Arista ACE (Arista Certified Engineer) certification or equivalent hands-on Arista expertise.
- Experience with GPU-as-a-Service (GPUaaS), AI/ML inferencing platforms, or hyperscale compute environments.
- Hands-on experience with Cisco Firepower / FTD firewall platforms and enterprise security frameworks.
- Experience with Dell PowerEdge or similar GPU server environments and their network integration.
- Familiarity with RDMA over Converged Ethernet (RoCE), InfiniBand, or other low-latency GPU interconnect fabrics.
- Experience with Arista and Cisco routing and switching platforms.
- Leaf-spine data center fabric design for AI/ML compute clusters.
- Experience with BGP EVPN and VXLAN overlay architectures.
- Network automation frameworks and scripting (Python, Ansible, Terraform).
- Experience with tenant network isolation, VLANs, and multi-tenancy in GPUaaS environments.
- Additional certifications such as CCNP, Arista ACE-L, or equivalent advanced networking certifications.
Why Gruve
At Gruve, we foster a culture of innovation, collaboration, and continuous learning. We are committed to building a diverse and inclusive workplace where everyone can thrive and contribute their best work. If you’re passionate about technology and eager to make an impact, we’d love to hear from you.
Gruve is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.