Spun out of MIT CSAIL, we build general-purpose AI systems that run efficiently across deployment targets, from data center accelerators to on-device hardware, ensuring low latency, minimal memory usage, privacy, and reliability. We partner with enterprises across consumer electronics, automotive, life sciences, and financial services. We are scaling rapidly and need exceptional people to help us get there.
The VLM team builds vision-language models that run on-device, under tight latency and memory constraints, without sacrificing quality. We have released four best-in-class models and we're just getting started. This role blends research and implementation: you'll design experiments, run them, and turn the results into shipped models.
Minimum qualifications:
Hands-on experience in training or evaluating VLMs with demonstrated experimental rigor.
Experience turning research ideas into robust, maintainable implementations, not one-off prototypes.
Proficiency in Python and experience with distributed training frameworks (DeepSpeed, FSDP, Megatron-LM, etc.).
M.S. or Ph.D. in Computer Science, Mathematics, or a related field; or equivalent industry experience.
This role is for you if you have experience in some of the following:
Building or optimizing multimodal training or data pipelines.
Multimodal post-training experience (SFT, preference optimization, RL-style methods).
Dataset design and data quality expertise (scoring, filtering, dedup, long-tail mining).
Prior open-source contributions (models, benchmarks, eval tooling).
Published research at top AI conferences (NeurIPS, ICML, CVPR, ECCV, ICLR, ACL, etc.).
Experience with computer vision or visual representation learning.
What you'll do:
Ship a capability end-to-end. Example: lead visual grounding from task spec through data curation, training recipe, ablations, evaluation, integration into the final run, and open-weight release.
Improve reasoning through RL, preference methods, and better attribution to visual evidence.
Push the quality-efficiency frontier through token efficiency and encoder/connector design. Exemplary outcome: a connector that cuts vision tokens without quality loss.
Build data pipelines that measurably improve model quality: synthetic generation, filtering, dedup, and diagnostics, from captioning to reasoning tasks.
Scale VLM infrastructure and raise the team's bar: multi-node pipelines, reproducible experiments, shared tooling, and hiring.
Success in your first year looks like:
Our VLMs are SOTA across all major benchmarks.
You own a major workstream (video understanding, data quality, or encoder architecture) end-to-end.
At least one model has shipped to production with your direct contribution.
What we offer:
Full ownership: You own your work from architecture to deployment.
Compensation: Competitive base salary with equity in a unicorn-stage company.
Health: We pay 100% of medical, dental, and vision premiums for employees and dependents.
Financial: 401(k) matching up to 4% of base pay.
Time Off: Unlimited PTO plus company-wide Refill Days throughout the year.