References
This page lists relevant papers, projects, and resources that have influenced the development of LLMIR or are related to LLM optimization and compilation.
Foundational Work
MLIR Framework
- Lattner, Chris, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 2-14. IEEE, 2021. Link
Large Language Model Inference
- Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” In Advances in Neural Information Processing Systems, 2022. Link
- Sheng, Ying, et al. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” In International Conference on Machine Learning, 2023. Link
LLM Inference Frameworks
vLLM
- Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. Link
SGLang
- Zheng, Lianmin, et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv preprint arXiv:2312.07104 (2024). Link
Compiler Optimization for LLMs
- Frantar, Elias, et al. “GPTQ: Accurate Post-training Quantization for Generative Pre-trained Transformers.” In International Conference on Learning Representations, 2023. Link
- Korthikanti, Vijay Anand, et al. “Reducing Activation Recomputation in Large Transformer Models.” In Proceedings of Machine Learning and Systems, 2023. Link
Hardware-Specific Optimizations
- Wang, Ze, et al. “TensorRT-LLM: A Comprehensive and Efficient Large Language Model Inference Library.” arXiv preprint arXiv:2405.09386 (2024). Link
- Dao, Tri, et al. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv preprint arXiv:2307.08691 (2023). Link
Quantization Techniques
- Dettmers, Tim, et al. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” In Advances in Neural Information Processing Systems, 2022. Link
- Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” In Proceedings of the International Conference on Machine Learning, 2023. Link
Distributed LLM Inference
- Aminabadi, Reza Yazdani, et al. “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2022. Link
- Zheng, Lianmin, et al. “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning.” In 16th USENIX Symposium on Operating Systems Design and Implementation, 2022. Link
Related Projects
- LLVM: The LLVM Compiler Infrastructure Project
- MLIR: Multi-Level Intermediate Representation
- vLLM: High-throughput and memory-efficient inference and serving engine for LLMs
- SGLang: Fast serving framework for LLMs built around a structured generation language
- TensorRT-LLM: NVIDIA’s LLM optimization library
This reference list will be updated as the LLMIR project progresses and new relevant research emerges in the field.