CloudFactory

MLOps Support Engineer

CloudFactory Kathmandu, Bagmati Province, Nepal Today
engineering

About the role:

The MLOps Support Engineer is an operations-first role, focused on ensuring AI/ML systems remain stable, observable, and supportable in production environments. This is not a data science or feature development role.

The primary objective is to maintain continuous performance of ML models and associated pipelines with minimal disruption to both internal and client-facing services. You will provide Tier 1 and Tier 2 support, escalating to Tier 3 Engineering as needed.

What you’ll do:

  • Provide Tier 1 / Tier 2 operational support for AI/ML solutions.
  • Identify failed jobs, degraded pipelines, or performance anomalies.
  • Triage incidents, investigate issues, and coordinate escalation to Tier 3 Engineering.
  • Participate in on-call rotas once established.
  • Validate that pipelines and jobs complete successfully.
  • Monitor data pipeline health, model execution, and basic performance metrics.
  • Identify operational issues before they impact customers.
  • Respond to or alert customers when there is an outage or issue with one of their models.
  • Support incident management, rollback, and recovery activities.
  • Use and maintain runbooks and operational documentation.
  • Work with Engineering to improve supportability and observability.
  • Contribute to knowledge sharing to reduce single points of failure.
  • Work within defined SLAs and support processes as the service matures.
  • Prepare quarterly business reviews that report on the health of the ML models.
  • Evaluate champion/challenger models to see if a new model should be promoted.
  • Monitor for model drift and performance degradation, while validating that updates (new champion models or added data) do not introduce bias.

Requirements

Essential

  • Experience in operations, DevOps, SRE, or platform support roles.
  • Strong troubleshooting skills in production environments.
  • Proficiency in SQL and scripting (Python, Bash) for developing and automating ML workflows.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and their ML services.
  • Git: Solid understanding of version control, particularly in collaborative development environments.
  • Comfortable working from runbooks and structured processes.

Desirable

  • Exposure to AI/ML systems in production.
  • Familiarity with monitoring and observability tools (Grafana, Power BI, New Relic).
  • Knowledge of MLOps tooling and data platforms (MLflow, Databricks).
  • Experience supporting customer-facing platforms.
  • Knowledge of containerization and container orchestration (Kubernetes).
  • Experience with LLM prompt engineering and troubleshooting.
  • Background in computer science, informatics, or related fields.
  • Passion for Machine Learning and AI: An eager learner who is excited about working with cutting-edge ML technologies, keen to learn about complex predictive models, and passionate about optimizing and maintaining ML models in production environments.
  • Early Career in MLOps or ML Engineering: Ideally, a junior ML engineer with a strong desire to grow in the field of MLOps and AI operations.
  • A Collaborative Mindset: You thrive in a team setting and are ready to contribute to model improvement, A/B testing, and iterative development.
  • Attention to Detail: A focus on model performance, bias prevention, and ensuring optimal model behavior as new data and models are introduced.

Additional information:

Nepal

  • This role provides MLOps coverage from 07:45 – 15:45* NPT for US-based customers. You will be required to work during these hours and potentially outside of them if a model has issues.
  • Rotational On-Call work will also be required.

Colombia

  • This role provides MLOps coverage from 11am to 9pm* Colombia time for a US-based customer. You will be required to work on a shift rota covering 8-hour blocks within this period, and potentially outside of it if a model has issues.
  • Rotational On-Call work will also be required.

*Note that these hours are subject to change upon review.
