
We are hiring PhD research interns for summer 2026, targeting only senior PhD students expected to graduate in 2026 or 2027, as one of our goals is to convert successful interns into full-time employees after graduation.
About Us
The AI and Systems Co-Design team at Meta (formerly known as Facebook), led by Chunqiang Tang (a.k.a. CQ Tang), consists of over 100 employees, mostly PhDs, including many world-class research scientists and engineers.
As reflected in our team name "co-design", we conduct interdisciplinary research and development across AI, hardware, and software, with a focus on performance, efficiency, and scalability.
- We own the company's overall strategy for exploring innovative hardware technologies for CPUs, GPUs, memory, storage, and Meta's custom AI chips, and we productionize them in Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs, powering all Meta products such as Facebook, Instagram, and Meta AI.
- We apply novel optimizations across the whole stack---hardware, ML models, ML systems, applications, and the Linux kernel---to achieve optimal performance.
- We develop innovative AI technologies for large language models (Llama), recommender systems, and more.
In addition to this real-world impact on billions of users of Meta products, our team members have won Best Paper Awards at prestigious conferences such as ISCA, ASPLOS, SOSP, and OSDI, and multiple of our papers have been selected as IEEE Micro Top Picks. We also regularly publish in venues such as ICML, NeurIPS, SC, HPCA, NSDI, VLDB, and MLSys. Overall, our work spans the research communities of systems in general and especially systems for ML (MLSys, SOSP, OSDI, SIGCOMM, NSDI), hardware architecture (ISCA, ASPLOS), ML (NeurIPS, ICML, ICLR), and supercomputing (SC, ICS).
Selected Publications
Here are selected publications that showcase our work in diverse areas:
- Systems for ML
- AI Chips
- ML Models and Kernels
- ML Numerics, Pruning, Distillation, and Optimizer
- HPC and Collective Communication Libraries (MPI, NCCL, RCCL)
- Performance Benchmarking and Projection for AI and Non-AI Workloads
- Hardware and Software Co-design
How We Work
Like research labs, our team consists primarily of PhDs. However, we differ from traditional research labs in several key ways:
- Direct ownership: Like traditional research labs, we build strong partnerships with numerous teams across diverse areas for broad influence. However, what sets us apart is our direct ownership of the hardware strategy for Meta's hyperscale fleet. This enables us to lead in many areas while fostering seamless partnerships in others.
- Production systems: Our primary goal is to develop forward-looking innovations in AI, hardware, and software, and to implement them directly in production systems that serve billions of people. The billions of users of Meta products and Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs are, in effect, our lab. In contrast, traditional research labs often rely on technology transfer, which results in less direct impact.
- Impact: Our impact is widely acknowledged within the company and throughout the industry. We drive Meta's hardware strategy to save billions of dollars, and directly develop innovative technologies in Meta's flagship products like Llama and Ads ranking models.
Open Source Projects
- DCPerf: An open source benchmark suite for hyperscale compute applications
- DLRM: An advanced, open source deep learning recommendation model
- FBGEMM: An open source library of high-performance ML kernels
- PyTorch distributed Shampoo optimizer: This optimizer won the external tuning track of the inaugural AlgoPerf training algorithms benchmark competition.
2026
- [TOCS'26] Detecting Tiny Performance Regressions at Hyperscale
- [IEEE Micro '26] Triton for MTIA: Bridging the Programming Model Gaps for Custom AI Accelerators
- [IEEE Micro '26] Demystifying the Cost Versus Benefits of Sparse LLM Acceleration
- [MLSys'26] Agentic Operator Generation for ML ASICs
- [MLSys'26] Sparing Strategy to Minimize Reliability Impact on Large Scale Training Jobs
- [MLSys'26] Demystifying the Mixture of Experts Serving Tax
- [HPCA'26] ReScue: Reliable and Secure CXL Memory
- [HPCA'26] PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet (Best Paper finalist)
- [ASPLOS'26] SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training
- [ASPLOS'26] SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters
- [ASPLOS'26] Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context
- [CACM Research Highlights'26] TMO: Transparent Memory Offloading in Datacenters
- [ISCA'26] PhaseWeave: Phase-Aware Execution on Heterogeneous Chiplet Architectures for Datacenters
- [ISCA'26] LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
- [ISCA'26] Vistara: Making CXL Real—Full Path from ASIC Design and OS Support to Hyperscale Deployment
- [ISCA'26] Power Management for Large-Scale Heterogeneous AI Workloads
- [ISCA'26] MTIA-300: Meta’s Training Chip with Embedded NIC Chiplets and Communication Offloading Engine
- [OSDI'26] Syncopate: Automatic Fine-Grained Compute-Communication Overlap via Chunk-Centric Scheduling
- [OSDI'26] Hardware Lifecycle-Aware Power Planning in Commercial Hyperscale Datacenters
2025
2024
2023
2022
2021
2020
2019