Location: San Francisco, CA (Onsite | Remote)
Virtue AI sets the standard for advanced AI security platforms. Built on decades of foundational and award-winning research in AI security, its AI-native architecture unifies automated red-teaming, real-time multimodal guardrails, and systematic governance for enterprise apps and agents. Deploy in minutes—across any environment—to keep your AI protected and compliant. We are a well-funded, early-stage startup founded by industry veterans, and we're looking for passionate builders to join our core team.
As an Inference Engineer, you will own how models are served in production. Your job is to make inference fast, stable, observable, and cost-efficient, even under unpredictable workloads.
You will:
- Serve and optimize inference for LLMs, embedding models, and other ML models across multiple model families
- Design and operate inference APIs with clear contracts, versioning, and backward compatibility
- Build routing and load-balancing logic for inference traffic (see the routing sketch after this list):
  - Multi-model routing
  - Fallback and degradation strategies
  - vLLM or SGLang
- Package inference services into production-ready Docker images
- Implement logging and metrics for inference systems (see the metrics sketch below):
  - Latency, throughput, token counts, GPU utilization
  - Prometheus-based metrics
- Analyze server uptime and failure modes (see the watchdog sketch below):
  - GPU OOMs, hangs, slowdowns, fragmentation
  - Recovery and restart strategies
- Design GPU and model placement strategies (see the sizing sketch below):
  - Model sharding, replication, and batching
  - Tradeoffs between latency, cost, and availability
- Work closely with backend, platform (Cloud, DevOps), and ML teams to align inference behavior with product requirements
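For illustration, here is a minimal sketch of the kind of multi-model routing and fallback logic this role involves, assuming OpenAI-compatible completions endpoints (which both vLLM and SGLang can expose). The endpoint URLs, model names, and timeout are hypothetical placeholders, not part of our stack:

```python
"""Minimal multi-model routing sketch with ordered fallback.

Endpoint URLs and model names are hypothetical; the sketch assumes an
OpenAI-compatible /v1/completions route on each backend.
"""
import requests

# Ordered by preference: primary large model first, smaller fallback second.
BACKENDS = [
    {"url": "http://vllm-primary:8000/v1/completions", "model": "llama-70b"},
    {"url": "http://sglang-fallback:8001/v1/completions", "model": "llama-8b"},
]

def route_completion(prompt: str, timeout_s: float = 5.0) -> dict:
    """Try each backend in order; degrade to the next on error or timeout."""
    last_error = None
    for backend in BACKENDS:
        try:
            resp = requests.post(
                backend["url"],
                json={"model": backend["model"], "prompt": prompt, "max_tokens": 128},
                timeout=timeout_s,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # record the failure and fall through to the next backend
    raise RuntimeError(f"all inference backends failed: {last_error}")

if __name__ == "__main__":
    print(route_completion("Hello, world"))
```

A production router would add health checks, retry budgets, and queue-depth awareness; this shows only the ordered-fallback skeleton.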
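A sketch of the Prometheus-based instrumentation the role covers, using the official prometheus_client package. The metric names and the stubbed model call are illustrative assumptions; GPU utilization is typically scraped separately (for example via NVIDIA's DCGM exporter):

```python
"""Sketch of Prometheus instrumentation for an inference service.

Metric names are illustrative, not an established convention.
"""
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end request latency",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
TOKENS_GENERATED = Counter(
    "inference_tokens_generated_total", "Completion tokens produced"
)

def run_model(prompt: str) -> str:
    time.sleep(0.05)          # stand-in for a real inference call
    return "stub completion"

def handle_request(prompt: str) -> str:
    with REQUEST_LATENCY.time():                      # observes latency on exit
        completion = run_model(prompt)
        TOKENS_GENERATED.inc(len(completion.split())) # crude token count
        return completion

if __name__ == "__main__":
    start_http_server(9090)   # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("ping")
```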
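One recovery strategy for GPU memory failure modes, sketched below: a watchdog that samples GPU memory via NVML (the pynvml package) and restarts the serving process past a threshold. The threshold, polling interval, and systemd unit name are assumptions:

```python
"""GPU memory watchdog sketch: detect creeping memory pressure and
trigger a restart. Requires the nvidia-ml-py package (pynvml)."""
import subprocess
import time
import pynvml

MEMORY_THRESHOLD = 0.95   # restart when >95% of GPU memory is in use (assumption)

def gpu_memory_fraction(index: int = 0) -> float:
    """Fraction of device memory currently allocated, per NVML."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return info.used / info.total

def restart_server() -> None:
    # Hypothetical systemd unit name; swap in your own supervisor.
    subprocess.run(["systemctl", "restart", "inference-server"], check=False)

if __name__ == "__main__":
    pynvml.nvmlInit()
    while True:
        if gpu_memory_fraction() > MEMORY_THRESHOLD:
            restart_server()
        time.sleep(10)        # polling interval, also an assumption
```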
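Placement decisions start from memory arithmetic. A back-of-envelope sizing sketch, assuming FP16 weights and the standard per-token KV-cache formula (2 × layers × KV heads × head dim × bytes per element); the model shape is roughly Llama-2-70B-like and purely illustrative:

```python
"""Back-of-envelope sizing used when weighing sharding vs. replication.
All model numbers are illustrative."""
import math

def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """FP16/BF16 weights: params (in billions) x 2 bytes."""
    return params_b * bytes_per_param

def kv_cache_gb_per_token(layers: int, kv_heads: int, head_dim: int,
                          bytes_per_elem: int = 2) -> float:
    """Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 1e9

if __name__ == "__main__":
    w = weights_gb(70)                                    # ~140 GB of weights
    kv = kv_cache_gb_per_token(layers=80, kv_heads=8, head_dim=128)
    batch_tokens = 64 * 4096                              # 64 seqs x 4k context
    total = w + kv * batch_tokens                         # weights + KV cache
    gpus = math.ceil(total / 80)                          # 80 GB per GPU, illustrative
    print(f"weights={w:.0f} GB, kv/token={kv * 1e6:.0f} KB, "
          f"total={total:.0f} GB -> at least {gpus}x 80GB GPUs")
```

The same arithmetic drives the latency/cost/availability tradeoff: sharding across fewer replicas saves GPUs but shrinks your failure domains' independence.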
You understand that inference is a systems problem, not just a model problem. You think in QPS, p99 latency, GPU memory, and failure domains.
Requirements:
- Bachelor’s degree or higher in CS, CE, or a related field
- Strong experience serving LLMs and embedding models in production
- Hands-on experience designing:
  - Inference APIs
  - Load balancing and routing logic
- Experience with SGLang, vLLM, TensorRT, or similar inference frameworks
- Strong understanding of GPU behavior:
  - Memory limits, batching, fragmentation, utilization
- Experience with:
  - Docker
  - Prometheus metrics
  - Structured logging
- Ability to debug and fix real inference failures in production
- Experience with autoscaling inference services
- Familiarity with Kubernetes GPU scheduling
- Experience supporting production systems with real SLAs
- Comfortable operating in a fast-paced startup environment with high ownership
Nice to have:
- Experience with GPU-level optimization:
  - Memory planning and reuse
  - Kernel launch efficiency
  - Reducing fragmentation and allocator overhead
- Experience with kernel- or runtime-level optimization:
  - CUDA kernels, Triton kernels, or custom ops
- Experience with model-level inference optimization (see the vLLM sketch after this list):
  - Quantization (FP8 / INT8 / BF16)
  - KV-cache optimization
  - Speculative decoding or batching strategies
- Experience pushing inference efficiency boundaries (latency, throughput, or cost)
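As an example of the model-level knobs above, a sketch using vLLM's offline LLM API. Exact argument names and FP8 support vary by vLLM version and GPU generation, so treat these as assumptions to verify against the vLLM docs:

```python
"""Sketch of model-level optimization knobs via vLLM's offline API."""
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    quantization="fp8",               # weight quantization (needs recent GPUs)
    gpu_memory_utilization=0.90,      # caps the memory pool, incl. KV cache
    enable_prefix_caching=True,       # reuse KV cache across shared prefixes
)

outputs = llm.generate(
    ["Summarize the benefits of KV-cache reuse."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```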
What we offer:
- Competitive salary + equity
- Direct ownership of inference reliability and performance
- Hard problems at the intersection of systems, GPUs, and AI
- Production impact – your work directly affects latency, cost, and uptime
- Strong technical culture – engineers who debug and optimize, not just prototype