TLDR: We are hiring managers! Candidates please email us at "codesign AT meta DOT com". Currently, we are not generally hiring individual contributors, interns, or visiting professors, except for specific targeted candidates.

The AI and Systems Co-Design team at Meta (formerly known as Facebook), led by Chunqiang Tang (a.k.a. CQ Tang), consists of over 100 employees, mostly PhDs, including many world-class research scientists and engineers. For example, Pavan Balaji is a recipient of the 2024 ACM Software System Award for his contributions to MPICH.

As reflected in our team name "co-design", we conduct interdisciplinary research and development across AI, hardware, and software, with a focus on performance, efficiency, and scalability.

We own the company's overall strategy for exploring innovative hardware technologies for CPUs, GPUs, memory, storage, and Meta's custom AI chips, and we productionize them in Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs, powering all Meta products such as Facebook, Instagram, and Meta AI.
We apply novel optimizations across the whole stack---hardware, ML models, ML systems, applications, and the Linux kernel---to achieve optimal performance.
We develop innovative AI technologies for large language models (Llama), ranking systems, and more.

In addition to the real-world impact on billions of users of the Meta products, our team members have won Best Paper Awards at prestigious conferences such as ISCA, ASPLOS, SOSP, and OSDI, with multiple papers selected for IEEE Micro Top Picks. Additionally, we regularly publishe in ICML, NeurIPS, SC, HPCA, NSDI, VLDB, MLSys, and more. Overall, our work largely corresponds to the research communities of systems in general and especially systems for ML (MLSys, SOSP, OSDI, SIGCOMM, NSDI), hardware architecture (ISCA, ASPLOS), ML (NeurIPS, ICML, ICLR) and supercomputing (SC, ICS).

Here are selected publications that showcase our work in diverse areas:

Systems for ML
- [OSDI'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
- [ISCA'25] Scaling Llama 3 Training with Efficient Parallelism Strategies
- [OSDI'24] MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
- [ISCA'22] Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
- [VLDB'23] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- The Llama 3 Herd of Models. Our contributions include much of the work described in the paper's Section 3.3 "Infrastructure, Scaling, and Efficiency", Section 6 "Inference", and Section 7.3 "Model Scaling".
AI chip
- [ISCA'25] Meta's Second Generation AI Chip: Model-Chip Co-Design and Productionization Experiences
- [ISCA'23] MTIA: First Generation Silicon Targeting Meta's Recommendation Systems
ML models and kernels
ML numerics, pruning, distillation, and optimizer
HPC and collective communications library (MPI, NCCL, RCCL)
- [SC'24] Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression
- [DLP-KDD'20] Training Deep Learning Recommendation Model with Quantized Collective Communications
Performance benchmarking and projection for both AI and non-AI workloads
- [ISCA'25] DCPerf: An Open-Source, Battle-Tested Performance Benchmark Suite for Datacenter Workloads
- DLRM: An advanced, open source deep learning recommendation model
Hardware and software co-design
- [ISCA'23 Best Paper] Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters
- [ISCA'23] TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory
- [ISCA'19] SoftSKU: Optimizing Server Architectures for Microservice Diversity @Scale

Like research labs, our team consists primarily of PhDs. However, we differ from traditional research labs in several key ways:

Direct ownership: Like traditional research labs, we build strong partnerships with numerous teams across diverse areas for broad influence. However, what sets us apart is our direct ownership of the hardware strategy for Meta's hyperscale fleet. This enables us to lead in many areas while fostering seamless partnerships in others.
Production systems: Our primary goal is to develop forward-looking innovations in AI, hardware, and software, and directly implement them in production systems that serve billions of people. The billions of users of Meta products and Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs are, in effect, our lab. In contrast, traditional research labs often rely on technology transfer for a less direct impact.
Impact: Our impact is widely acknowledged within the company and throughout the industry. We drive Meta's hardware strategy to save billions of dollars, and directly develop innovative technologies in Meta's flagship products like Llama and Ads ranking models.

Open Source Projects

DCPerf: An open source benchmark suite for hyperscale compute applications
DLRM: An advanced, open source deep learning recommendation model
FBGEMM: ML kernels
Pytorch distributed Shampoo optimizer. This work won the competition of the external tuning track of the inaugural AlgoPerf training algorithms benchmark.

Selected Publications

Open Source Projects

Selected Publications

2025

2024

2023

2022

2021

2020

2019