Weekday AI

AI Evaluation Engineer

Pune, Maharashtra, India · Posted 1 day ago

This role is for one of Weekday's clients.


We are seeking an AI Evaluation Engineer to evaluate, validate, and ensure the quality of AI/ML systems working with complex, real-world data. This role focuses on assessing component mapping, retrieval-augmented generation (RAG) based Q&A systems, and feature extraction from structured and unstructured sources such as repair records, catalogs, free-text inputs, and technical documentation.

This is a hands-on engineering role centered on designing custom evaluation frameworks, datasets, and automated pipelines (including LLM-as-a-judge approaches) to measure quality, detect regressions, and support release readiness. Domain training will be provided, but candidates must take strong ownership of building evaluation intuition and maintaining high-quality test datasets.


Key Responsibilities

AI Evaluation & Quality Assurance

  • Evaluate ML and LLM outputs using defined metrics, benchmarks, and acceptance criteria.
  • Design and maintain automated evaluation pipelines to assess model accuracy, consistency, and reliability.
  • Develop and own high-quality evaluation datasets, golden test cases, and benchmarks.

Testing & Release Validation

  • Execute evaluation-driven smoke tests and regression tests prior to releases.
  • Track quality metrics and provide clear go/no-go signals for production deployments.
  • Detect regressions and unexpected model behavior across releases and data changes.

Analysis & Insights

  • Analyze evaluation results to identify trends, inconsistencies, and failure patterns.
  • Provide actionable insights to improve model performance and system behavior.

System & API Validation

  • Validate AI services at the API level for correctness, robustness, and stability.
  • Monitor system performance, latency, and error rates under production-like workloads.

Cross-Functional Collaboration

  • Work closely with ML, backend, and product teams to define expected AI behavior.
  • Ensure evaluation coverage aligns with real-world use cases and business requirements.

Skills & Experience

Core Skills

  • Strong proficiency in Python for evaluation scripting and automation.
  • Solid understanding of Machine Learning and AI systems, including LLM-based workflows.
  • Experience with data analysis to interpret evaluation metrics and model outputs.

Nice to Have

  • Experience with LLM evaluation frameworks or LLM-as-a-judge techniques.
  • Familiarity with RAG pipelines, NLP systems, or large-scale data processing.
  • Experience building CI/CD-style evaluation or testing pipelines for AI systems.

Skills

Python · Machine Learning · Artificial Intelligence · Data Analytics
