LLMIR

Version 0.0.1

Large Language Model IR Compiler Framework

References

This page lists relevant papers, projects, and resources that have influenced the development of LLMIR or are related to LLM optimization and compilation.

Foundational Work 

MLIR Framework 

  • Lattner, Chris, et al. “MLIR: Scaling Compiler Infrastructure for Domain Specific Computation.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 2-14. IEEE, 2021. Link

Large Language Model Inference 

  • Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Advances in Neural Information Processing Systems, 2022. Link

  • Sheng, Ying, et al. “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU.” In International Conference on Machine Learning, 2023. Link

LLM Inference Frameworks 

vLLM 

  • Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. Link

SGLang 

  • Zheng, Lianmin, et al. “SGLang: Efficient Execution of Structured Language Model Programs.” arXiv preprint arXiv:2312.07104 (2023). Link

Compiler Optimization for LLMs 

  • Frantar, Elias, et al. “GPTQ: Accurate Post-training Quantization for Generative Pre-trained Transformers.” In International Conference on Learning Representations, 2023. Link

  • Korthikanti, Vijay Anand, et al. “Reducing Activation Recomputation in Large Transformer Models.” In Proceedings of Machine Learning and Systems, 2023. Link

Hardware-Specific Optimizations 

  • Wang, Ze, et al. “TensorRT-LLM: A Comprehensive and Efficient Large Language Model Inference Library.” arXiv preprint arXiv:2405.09386 (2024). Link

  • Dao, Tri, et al. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv preprint arXiv:2307.08691 (2023). Link

Quantization Techniques 

  • Dettmers, Tim, et al. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” In Advances in Neural Information Processing Systems, 2022. Link

  • Xiao, Guangxuan, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” In Proceedings of the International Conference on Machine Learning, 2023. Link

Distributed LLM Inference 

  • Aminabadi, Reza Yazdani, et al. “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2022. Link

  • Zheng, Lianmin, et al. “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning.” In 16th USENIX Symposium on Operating Systems Design and Implementation, 2022. Link

Related Projects

  • LLVM: The LLVM Compiler Infrastructure Project
  • MLIR: Multi-Level Intermediate Representation
  • vLLM: High-throughput and memory-efficient inference and serving engine for LLMs
  • SGLang: Fast serving framework for LLMs built around a structured generation language
  • TensorRT-LLM: NVIDIA’s LLM optimization library

This reference list will be updated as the LLMIR project progresses and new relevant research emerges in the field.