
We are hiring PhD research interns for summer 2026, targeting only senior PhD students expected to graduate in 2026 or 2027, as one of our goals is to convert successful interns into full-time employees after graduation.
About Us
The AI and Systems Co-Design team at Meta (formerly known as Facebook), led by Chunqiang Tang (a.k.a. CQ Tang), consists of over 100 employees, mostly PhDs, including many world-class research scientists and engineers.
As reflected in our team name "co-design", we conduct interdisciplinary research and development across AI, hardware, and software, with a focus on performance, efficiency, and scalability.
- We own the company's overall strategy for exploring innovative hardware technologies for CPUs, GPUs, memory, storage, and Meta's custom AI chips, and we productionize them in Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs, powering all Meta products such as Facebook, Instagram, and Meta AI.
- We apply novel optimizations across the whole stack---hardware, ML models, ML systems, applications, and the Linux kernel---to achieve optimal performance.
- We develop innovative AI technologies for large language models (Llama), recommender systems, and more.
In addition to this real-world impact on billions of users of Meta products, our team members have won Best Paper Awards at prestigious conferences such as ISCA, ASPLOS, SOSP, and OSDI, and multiple of our papers have been selected as IEEE Micro Top Picks. We also regularly publish in venues such as ICML, NeurIPS, SC, HPCA, NSDI, VLDB, and MLSys. Overall, our work spans the research communities of systems in general and especially systems for ML (MLSys, SOSP, OSDI, SIGCOMM, NSDI), hardware architecture (ISCA, ASPLOS), ML (NeurIPS, ICML, ICLR), and supercomputing (SC, ICS).
Selected Publications
Here are selected publications that showcase our work in diverse areas:
- Systems for ML
- AI Chips
- ML Models and Kernels
- ML Numerics, Pruning, Distillation, and Optimizer
- HPC and Collective Communication Libraries (MPI, NCCL, RCCL)
- Performance Benchmarking and Projection for AI and Non-AI Workloads
- Hardware and Software Co-design
How We Work
Like research labs, our team consists primarily of PhDs. However, we differ from traditional research labs in several key ways:
- Direct ownership: Like traditional research labs, we build strong partnerships with numerous teams across diverse areas for broad influence. However, what sets us apart is our direct ownership of the hardware strategy for Meta's hyperscale fleet. This enables us to lead in many areas while fostering seamless partnerships in others.
- Production systems: Our primary goal is to develop forward-looking innovations in AI, hardware, and software, and to implement them directly in production systems that serve billions of people. The billions of users of Meta products and Meta's hyperscale fleet of O(1,000,000) servers and O(100,000) GPUs are, in effect, our lab. In contrast, traditional research labs often rely on technology transfer, which results in less direct impact.
- Impact: Our impact is widely acknowledged within the company and throughout the industry. We drive Meta's hardware strategy to save billions of dollars, and directly develop innovative technologies in Meta's flagship products like Llama and Ads ranking models.
Open Source Projects
- DCPerf: An open source benchmark suite for hyperscale compute applications
- DLRM: An advanced, open source deep learning recommendation model
- FBGEMM: An open source library of high-performance ML kernels
- PyTorch distributed Shampoo optimizer: This optimizer won the external tuning track of the inaugural AlgoPerf training algorithms benchmark competition.
2026
- [TOCS'26] Detecting Tiny Performance Regressions at Hyperscale
- [IEEE Micro '26] Triton for MTIA: Bridging the Programming Model Gaps for Custom AI Accelerators
- [IEEE Micro '26] Demystifying the Cost Versus Benefits of Sparse LLM Acceleration
- [MLSys'26] Agentic Operator Generation for ML ASICs
- [MLSys'26] Sparing Strategy to Minimize Reliability Impact on Large Scale Training Jobs
- [MLSys'26] Demystifying the Mixture of Experts Serving Tax
- [HPCA'26] ReScue: Reliable and Secure CXL Memory
- [HPCA'26] PinDrop: Breaking the Silence on SDCs in a Large-Scale Fleet (Best Paper finalist)
- [ASPLOS'26] SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training
- [ASPLOS'26] SEVI: Silent Data Corruption of Vector Instructions in Hyper-Scale Datacenters
- [ASPLOS'26] Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context
- [CACM Research Highlights'26] TMO: Transparent Memory Offloading in Datacenters
- [ISCA'26] PhaseWeave: Phase-Aware Execution on Heterogeneous Chiplet Architectures for Datacenters
- [ISCA'26] LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
- [ISCA'26] Vistara: Making CXL Real—Full Path from ASIC Design and OS Support to Hyperscale Deployment
- [ISCA'26] Power Management for Large-Scale Heterogeneous AI Workloads
- [ISCA'26] MTIA-300: Meta’s Training Chip with Embedded NIC Chiplets and Communication Offloading Engine
- [OSDI'26] Syncopate: Automatic Fine-Grained Compute-Communication Overlap via Chunk-Centric Scheduling
- [OSDI'26] Hardware Lifecycle-Aware Power Planning in Commercial Hyperscale Datacenters
2025
2024
2023
2022
2021
2020
2019