References
This page lists relevant papers, projects, and resources that have influenced the development of LLMIR or are related to LLM optimization and compilation.
Foundational Work
MLIR Framework
- Lattner, Chris, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 2-14. IEEE, 2021. Link
Large Language Model Inference
- Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” In Advances in Neural Information Processing Systems, 2022. Link
- Sheng, Ying, et al. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” In International Conference on Machine Learning, 2023. Link
LLM Inference Frameworks
vLLM
- Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. Link
SGLang
- Zheng, Lianmin, et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv preprint arXiv:2312.07104 (2024). Link
Compiler Optimization for LLMs
- Frantar, Elias, et al. “GPTQ: Accurate Post-training Quantization for Generative Pre-trained Transformers.” In International Conference on Learning Representations, 2023. Link
- Korthikanti, Vijay Anand, et al. “Reducing Activation Recomputation in Large Transformer Models.” In Proceedings of Machine Learning and Systems, 2023. Link
Hardware-Specific Optimizations
- Wang, Ze, et al. “TensorRT-LLM: A Comprehensive and Efficient Large Language Model Inference Library.” arXiv preprint arXiv:2405.09386 (2024). Link
- Dao, Tri, et al. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv preprint arXiv:2307.08691 (2023). Link
Quantization Techniques
- Dettmers, Tim, et al. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” In Advances in Neural Information Processing Systems, 2022. Link
- Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” In Proceedings of the International Conference on Machine Learning, 2023. Link
Distributed LLM Inference
- Aminabadi, Reza Yazdani, et al. “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2022. Link
- Zheng, Lianmin, et al. “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning.” In 16th USENIX Symposium on Operating Systems Design and Implementation, 2022. Link
Related Projects
- LLVM: The LLVM Compiler Infrastructure Project
- MLIR: Multi-Level Intermediate Representation
- vLLM: High-throughput and memory-efficient inference and serving engine for LLMs
- SGLang: Fast serving framework for LLMs built around a structured generation language
- TensorRT-LLM: NVIDIA’s LLM optimization library
This reference list will be updated as the LLMIR project progresses and new relevant research emerges in the field.