Nearly every organization in the world relies on complex manual work to carry out critical internal processes. These are the processes that keep the world going: enrolling patients in a hospital, underwriting loans inside a bank, or processing new transactions for an airline. Yet most companies don’t have enough resources to properly automate these tasks and are stuck with manual, decades-old ways of doing things.
At Luminai, we develop technology that uses AI to automate long-form, organization-wide workflows of any complexity easily and safely. Luminai helps some of the world’s most critical organizations in sectors like healthcare, finance, and telecommunications delegate mission-critical workflows that previously required hands-on human involvement to autonomous AI systems. Our approach combines frontier AI development with a purpose-built workflow execution engine to achieve this goal.
We've raised significant amounts of capital (including some unannounced) from many of the best Silicon Valley VCs, such as General Catalyst and Y Combinator, and from investors including Kevin Weil (Chief Product Officer at OpenAI), Arash Ferdowsi (co-founder of Dropbox), Katie Stanton (former VP Global Media, Twitter), and the CEOs of companies including Flexport, Notion, Front, Ramp, and Twitch.
As a Maintenance Engineer at Luminai, you will ensure the reliability, performance, and resilience of the systems that power mission-critical AI workflows. You’ll operate at the core of our production infrastructure — maintaining, monitoring, and continuously improving the systems that healthcare, finance, and telecommunications organizations depend on every day.
This is a highly ownership-driven role for someone who thrives on operational excellence, prevents issues before they arise, and takes pride in keeping complex systems running smoothly in high-stakes environments. You’ll work closely with Engineering, Product, and Forward Deployed teams to ensure our deployments are stable, secure, and scalable.
This is a hybrid position. Our team is in-office 3 days a week (Mon, Tue, Thu) in San Mateo, California.
Monitor, maintain, and improve the reliability of production AI systems and workflow infrastructure
Proactively identify, diagnose, and resolve system issues across application, integration, and cloud infrastructure layers
Own incident response processes, including root cause analysis and long-term remediation
Implement monitoring, alerting, and observability tooling to ensure system health and uptime
Collaborate with Engineering to harden deployments and improve system architecture for resilience and scalability
Support customer-facing teams by troubleshooting and resolving technical issues in live environments
Document system configurations, operational procedures, and recovery protocols
Continuously improve reliability standards, deployment practices, and operational safeguards
3+ years of experience in support engineering, site reliability engineering, or infrastructure maintenance
Strong proficiency in Python or other scripting languages
Experience managing cloud infrastructure (AWS, GCP, or Azure)
Strong problem-solving skills and a proactive, preventative mindset
Clear communication skills and ability to collaborate across engineering and customer-facing teams
High ownership and accountability in high-reliability environments