We think conversational AI agents will deliver all professional services in India. We started with astrology. We're a small group of engineers, designers, and product folks building at the intersection of conversational AI and domain expertise. Making an AI agent sound human-like is hard. Making an AI an expert in a domain is also hard. We're doing both together.
We're backed by Accel, Arkam Ventures, and Weekend Fund.
Our AI agent talks to thousands of users every day. The question we obsess over: how do you know if it's actually good?
We went from 60% to 90% quality - not by finding a better model, but by learning to measure properly. The Evaluation team owns this entire problem. You'll build the systems that tell us whether our agent is getting better or worse, catch regressions before they ship, and surface the insights that drive every model and prompt improvement we make.
This is not "run a benchmark and report a number." Our agent conducts multi-turn conversations with deep domain context, tool calls, and real-time astronomical data. Evaluating it requires decomposed metrics, session-level analysis, and an understanding of when aggregate scores lie to you.
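To make that concrete, here's a toy sketch in TypeScript (part of our stack) of what "decomposed" means in practice. The session shape, field names, and slices below are invented for illustration - they are not our schema or our scoring pipeline.

```typescript
// Hypothetical shape of a scored session; field names are illustrative only.
interface ScoredSession {
  sessionId: string;
  usedTools: boolean;     // did the agent make tool calls in this session?
  turnScores: number[];   // per-turn quality scores in [0, 1]
  goalCompleted: boolean; // session-level outcome
}

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// The single aggregate number - the one that can lie to you.
function aggregateTurnQuality(sessions: ScoredSession[]): number {
  return mean(sessions.flatMap((s) => s.turnScores));
}

// The decomposed view: same data, sliced by segment and by session-level goal.
function decomposedReport(sessions: ScoredSession[]) {
  const slice = (pred: (s: ScoredSession) => boolean) => {
    const subset = sessions.filter(pred);
    return {
      sessions: subset.length,
      turnQuality: mean(subset.flatMap((s) => s.turnScores)),
      goalCompletion: mean(subset.map((s) => (s.goalCompleted ? 1 : 0))),
    };
  };
  return {
    overall: slice(() => true),
    toolSessions: slice((s) => s.usedTools),
    chatOnlySessions: slice((s) => !s.usedTools),
  };
}
```

If chat-only sessions dominate the traffic mix, the overall number can hold steady while tool-using sessions quietly regress - that's the kind of lie the aggregate tells, and the kind of slicing you'll be designing.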
Build and own our evaluation infrastructure end-to-end - from data pipelines to scoring systems to dashboards
Design evaluation frameworks for multi-turn, tool-using conversational agents - turn-level quality, session-level goal completion, and everything in between
Build automated regression detection that catches quality drops before they reach users
Work with the AI Research and Agent Orchestration teams to define what "good" means for new capabilities
Design LLM-as-judge systems that actually work - avoiding correlated failures, tuning for the right false-positive vs false-negative tolerance (there's a sketch of what we mean below)
Analyze user conversations at scale to find patterns, failure modes, and opportunities
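To give a flavour of the LLM-as-judge point above: the sketch below decomposes the verdict into narrow, independently judged criteria and gates releases on a pass-rate delta. The `callJudge` wrapper, the criteria strings, and the tolerance value are all made up for illustration - this is not our production pipeline.

```typescript
// Hypothetical judge interface: `callJudge` wraps whatever model endpoint is in
// use and returns a boolean verdict for one criterion on one transcript.
type JudgeFn = (criterion: string, transcript: string) => Promise<boolean>;

// Decompose "is this turn good?" into narrow criteria judged independently,
// so one systematic judge bias doesn't correlate failures across the whole score.
const CRITERIA = [
  "Stays consistent with the birth details given earlier in the session",
  "Uses tool results instead of guessing astronomical facts",
  "Answers the user's actual question in this turn",
] as const;

interface JudgedTurn {
  verdicts: Record<string, boolean>;
  passed: boolean;
}

async function judgeTurn(callJudge: JudgeFn, transcript: string): Promise<JudgedTurn> {
  const verdicts: Record<string, boolean> = {};
  for (const criterion of CRITERIA) {
    verdicts[criterion] = await callJudge(criterion, transcript);
  }
  // Require every criterion to pass; loosen this if false negatives (noise for
  // reviewers) hurt more than false positives (bad turns slipping through).
  const passed = Object.values(verdicts).every(Boolean);
  return { verdicts, passed };
}

// A crude regression gate: block a candidate prompt or model if its pass rate
// drops more than `tolerance` below the current baseline on the same eval set.
function regressed(baselinePassRate: number, candidatePassRate: number, tolerance = 0.02): boolean {
  return candidatePassRate < baselinePassRate - tolerance;
}
```

Where you set that tolerance, and how you validate the judge itself against human labels, is exactly the false-positive vs false-negative tuning this role owns.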
Strong software engineering fundamentals - you can build production systems, not just notebooks
Experience with LLM evaluation, NLP metrics, or ML quality assurance
Comfort with statistics - you understand when a metric is lying to you
Ability to design evaluation criteria for subjective, open-ended outputs
You've worked with or built data pipelines that process conversational or unstructured data
You think in systems, not scripts - evaluation infrastructure that scales with the product
Experience with LLM-as-judge evaluation patterns
Familiarity with our stack: TypeScript/Bun, Elixir, ClickHouse, PostgreSQL
Experience evaluating multi-turn conversational AI or dialogue systems
You've read our article - "Your Eval Is Broken for the Same Reason Your LLM Is" - and had opinions
We care about craft obsessively. Your work gets questioned, pulled apart, and rebuilt - not because we're harsh, but because everyone here holds one another to a standard most places don't bother with. We work out of a hacker house in Vasant Kunj. We strongly encourage everyone to be in the office.
If that sounds like the only way you'd want to work - let's talk.