About Gruve
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks.
Position Summary:
We are seeking a Solution Architect (AI Infrastructure & Deployment Lead) to lead the strategic design, architecture, and deployment of large-scale, enterprise-grade Red Hat OpenShift and Kubernetes environments. As a technical authority at the L4 level, you will be responsible for defining the blueprint of our cloud-native infrastructure, ensuring it is secure, scalable, and highly automated.
The ideal candidate acts as the bridge between traditional infrastructure and modern DevOps, serving as the lead design authority for global clients. You will collaborate with Network and Firewall Architects to build a unified fabric where containerized workloads, legacy data centers, and hybrid cloud environments coexist seamlessly through advanced automation and Infrastructure-as-Code (IaC).
Key Responsibilities:
- Architect and Design Enterprise OpenShift Solutions: Lead the high-level design (HLD) and low-level design (LLD) for multi-tenant Red Hat OpenShift and Kubernetes clusters across on-prem and hybrid cloud environments.
- Define the technology stack, standards, and blueprints for deploying AI solutions across global, multi-region public clouds (AWS/Azure/GCP) and diverse on-premises hardware.
- Oversee the successful end-to-end rollout of critical services including AI SOC, OpenShift AI, and AI-based Cybersecurity Log Optimization.
- Drive Network DevOps Strategy: Define and standardize the automation roadmap using Ansible, Terraform, and Python to achieve "Zero-Touch" infrastructure provisioning and configuration.
- Lead Customer & Stakeholder Engagement: Act as the primary technical consultant for global clients, leading design workshops, architecture validation, and executive-level technical reviews.
- Integrate Advanced AI Apps, Networking & Security: Collaborate with Pre-sales, AI application developers and engineers, and Firewall Architects to design secure AI agents, use cases, and container networking (CNI) models, and to implement Zero-Trust security, service mesh (Istio), and micro-segmentation within the OpenShift environment.
- Optimize Hybrid Infrastructure: Oversee the seamless integration of OpenShift with physical networking (Cisco ACI, VXLAN) and virtualized platforms (RHEL-V, VMware ESXi).
- GPU & Hardware Orchestration: Design and manage hardware acceleration using the NVIDIA GPU Operator and Node Feature Discovery (NFD). Implement Multi-Instance GPU (MIG) and time-slicing to optimize resource utilization across multi-tenant clusters.
- Establish CI/CD Governance: Architect robust CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions) for infrastructure and application delivery, ensuring compliance and security are baked into the workflow.
- Lead Observability & Reliability: Design comprehensive monitoring and logging architectures using Prometheus, Grafana, and the ELK stack to ensure 99.99% availability of cluster services.
- Mentorship & Technical Leadership: Guide and mentor L2/L3 engineers, providing expert-level escalation support and establishing best practices for the DevOps and Network teams.
- Innovation & R&D: Evaluate and introduce emerging technologies such as Advanced Cluster Management (ACM), Advanced Cluster Security (ACS), and Cloud-Native Networking (OVN-Kubernetes).
Preferred Qualifications:
- Red Hat Certified OpenShift Administrator, AWS Certified AI Practitioner, Certified Information Systems Security Professional (CISSP), Certified Cloud Security Professional (CCSP).
- Security Focus: Exposure to DevSecOps tools (e.g., Quay, StackRox) and zero-trust framework implementation.
- Legacy Integration: Familiarity with Cisco ACI, Arista CloudVision, or Juniper Apstra for end-to-end automation integration.
- Expertise in Red Hat OpenShift AI (RHOAI).
- Familiarity with LLM deployment requirements and vector database infrastructure.
- Background in Cybersecurity infrastructure (SIEM, SOAR, SOC & VAPT platforms).
- Experience with MLOps infrastructure (Kubeflow, MLflow) and high-speed telemetry pipelines.
Why Gruve
At Gruve, we foster a culture of innovation, collaboration, and continuous learning. We are committed to building a diverse and inclusive workplace where everyone can thrive and contribute their best work. If you’re passionate about technology and eager to make an impact, we’d love to hear from you.
Gruve is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.