About Etched
Etched is building the world’s first AI inference system purpose-built for transformers, delivering over 10x higher performance and dramatically lower cost and latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep and parallel chain-of-thought reasoning agents. Backed by hundreds of millions from top-tier investors and staffed by leading engineers, Etched is redefining the infrastructure layer for the fastest-growing industry in history.
Job Summary
We are seeking a highly skilled and motivated Supercomputing Software Engineer to join our team, responsible for the foundational software that powers our server infrastructure. This role focuses on the development, integration, and debugging of critical system software components, including BIOS, BMC firmware, boot processes (including NetBoot), root of trust implementations, advanced system logging, and kernel-mode drivers. You will play a pivotal role in ensuring the reliability, security, and performance of our server platforms, and contribute to the integration of data center orchestration technologies at the node level.
Key Team Responsibilities
BIOS and BMC Firmware: Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes.
System Performance Tuning: Measure and analyze DRAM timings, PCIe configurations, power state transitions, and related settings to ensure high performance and maximum reliability.
Root of Trust and Security: Validate security features, including root of trust mechanisms, to protect system integrity and data security.
Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis.
Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack.
System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance.
Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues.
Representative Projects
Implement and validate secure boot processes, including root of trust verification.
Design and implement advanced system logging and monitoring solutions.
Optimize BIOS and BMC firmware for improved boot times and system stability.
Integrate node-level container orchestration capabilities into the system software.
Analyze and resolve complex system-level issues related to boot failures, hardware errors, and performance degradation.
Analyze and optimize system-level logging for large-scale server deployments.
You may be a good fit if you have
Proficiency in C/C++ or Python.
Strong understanding of BIOS and BMC firmware architectures.
Experience with server boot processes.
Knowledge of root-of-trust and security principles.
Strong understanding of operating systems (Linux preferred) and server hardware architectures.
Experience with advanced system logging and diagnostic tools.
Ability to analyze complex technical problems and provide effective solutions.
Excellent communication and collaboration skills.
Experience with version control systems (e.g., Git).
Experience with reading and interpreting hardware logs.
Strong candidates may also have (nice-to-have qualifications)
Experience with data center orchestration technologies (Kubernetes, Docker).
Experience with tracing tools like perf, eBPF, ftrace, etc.
Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.).
Experience with CI/CD pipelines.
Experience with Rust.
Experience with kernel-mode driver development and debugging.
Benefits
Competitive compensation, including generous equity
Comprehensive insurance coverage and other top-of-market benefits
How we’re different
Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.
We are a fully in-person team in San Jose and Taipei, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.