LLMIR

Version 0.0.1

Large Language Model IR Compiler Framework

Large Language Model Intermediate Representation Overview

The LLMIR project is a novel approach to building reusable and extensible compiler infrastructure for large language model inference. LLMIR aims to unify and optimize LLM inference workflows, improve compilation for heterogeneous hardware, significantly reduce inference latency, and enhance integration between various LLM frameworks.

LLMIR is a dedicated compilation middle layer for platform architects and developers, built on the MLIR framework. It leverages MLIR's flexible infrastructure to represent and transform computational graphs. LLMIR can integrate with multiple LLM inference frameworks (such as vLLM and SGLang) by converting their high-level operators or model graphs into a unified intermediate representation for further optimization.
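
The exact integration surface is still being designed; the sketch below only illustrates the general shape such a frontend layer could take, assuming each framework contributes an adapter that lowers its operators into shared LLMIR operations. The `LLMIROp`, `FrameworkAdapter`, and `import_model` names, and the `llmir.*` op names, are hypothetical assumptions, not part of any published LLMIR API.

```python
# Hypothetical sketch only: these classes and op names are not a published LLMIR API.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class LLMIROp:
    """One operation in the unified intermediate representation."""
    name: str                                    # e.g. "llmir.paged_attention"
    attrs: dict = field(default_factory=dict)    # shapes, dtypes, batching hints, ...


class FrameworkAdapter(ABC):
    """Lowers one framework's operators into unified LLMIR operations."""

    @abstractmethod
    def lower(self, framework_op: dict) -> list[LLMIROp]:
        ...


class VLLMStyleAdapter(FrameworkAdapter):
    """Illustrative adapter for a vLLM-like operator description (not the real vLLM API)."""

    def lower(self, framework_op: dict) -> list[LLMIROp]:
        if framework_op["kind"] == "attention":
            return [LLMIROp("llmir.paged_attention",
                            {"num_heads": framework_op["num_heads"]})]
        return [LLMIROp("llmir." + framework_op["kind"], dict(framework_op))]


def import_model(framework_ops: list[dict], adapter: FrameworkAdapter) -> list[LLMIROp]:
    """Convert a framework-level op list into one unified stream for later optimization."""
    unified: list[LLMIROp] = []
    for op in framework_ops:
        unified.extend(adapter.lower(op))
    return unified


ops = import_model(
    [{"kind": "embedding", "vocab": 32000},
     {"kind": "attention", "num_heads": 32}],
    VLLMStyleAdapter(),
)
print([op.name for op in ops])   # ['llmir.embedding', 'llmir.paged_attention']
```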

Project Status

LLMIR is currently in active development. The project follows the development plan outlined in our GitHub repository. We are in the early phases of building the core infrastructure and defining the MLIR dialect.

What is LLMIR for?

LLMIR is an intermediate representation specialized for optimizing large language model inference. It provides:

  • The ability to represent inference workflows from popular LLM frameworks (such as vLLM and SGLang), including dynamic shapes, batching strategies, and framework-specific operators.
  • Optimizations and transformations designed specifically for LLM inference, such as attention computation fusion, KV cache management, quantization, and pipeline parallelism (a conceptual fusion sketch follows this list).
  • Cross-framework, end-to-end compilation of LLM inference, so these optimizations apply regardless of the originating framework.
  • The ability to target various hardware platforms (GPU, TPU, ASIC, CPU) efficiently by leveraging the MLIR ecosystem.
  • Representation of hardware-specific operations for accelerators specialized in LLM workloads.
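
To make the attention-fusion idea concrete, here is a minimal, framework-agnostic sketch of the kind of graph rewrite such a pass performs, collapsing the matmul → softmax → matmul pattern into one fused operation. The `Op` class and op names are illustrative assumptions, not the LLMIR dialect; a real pass would operate on MLIR operations and also handle scaling and masking.

```python
# Illustrative only: a toy graph rewrite showing what attention fusion means.
# The Op class and op names are assumptions, not LLMIR's actual dialect.
from dataclasses import dataclass


@dataclass
class Op:
    name: str         # "matmul", "softmax", "fused_attention", ...
    inputs: list      # names of consumed values
    output: str       # name of the produced value


def fuse_attention(ops: list[Op]) -> list[Op]:
    """Collapse matmul(Q,K) -> softmax -> matmul(probs, V) into fused_attention(Q,K,V).

    Simplified: assumes the intermediate values have no other consumers."""
    producers = {op.output: op for op in ops}
    rewritten, removed = [], set()
    for op in ops:
        sm = producers.get(op.inputs[0]) if op.name == "matmul" and op.inputs else None
        qk = producers.get(sm.inputs[0]) if sm is not None and sm.name == "softmax" else None
        if qk is not None and qk.name == "matmul":
            rewritten.append(Op("fused_attention",
                                [qk.inputs[0], qk.inputs[1], op.inputs[1]],
                                op.output))
            removed |= {qk.output, sm.output}
        else:
            rewritten.append(op)
    # Drop the producers of the now-unused intermediate values.
    return [op for op in rewritten if op.output not in removed]


graph = [
    Op("matmul", ["Q", "K_T"], "scores"),
    Op("softmax", ["scores"], "probs"),
    Op("matmul", ["probs", "V"], "out"),
]
print([op.name for op in fuse_attention(graph)])   # ['fused_attention']
```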

Core Value Proposition

Compared to using the native execution paths of individual frameworks, LLMIR’s core value lies in providing cross-framework compilation capabilities for end-to-end optimization. This includes:

  • Performance Improvement: Leveraging compilation optimizations to reduce inference latency and increase throughput
  • Resource Efficiency: Optimizing memory usage, supporting longer sequences and larger batch sizes
  • Scalability: Supporting different hardware platforms and inference frameworks
  • Usability: Providing developer-friendly APIs to lower integration barriers

Key Features (Under Development)

  • PagedKVCache: Efficient key-value cache implementation for optimized attention computation (a conceptual sketch follows this list)
  • MLIR Dialect for LLMs: Custom operations and types for language model inference
  • Memory Optimizations: Block-based memory management for efficient, low-fragmentation memory usage
  • Multi-sequence Support: Handle multiple concurrent sequences with varying lengths
  • Hardware Targeting: Backend code generation for various platforms
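
PagedKVCache and the block-based memory manager are still under development, so the following is only a conceptual sketch of the underlying idea (popularized by vLLM's PagedAttention): the cache maps each sequence's logical token positions onto fixed-size physical blocks drawn from a shared pool, so many sequences of different lengths share one buffer with little fragmentation. Class and method names here are illustrative assumptions, not the LLMIR API.

```python
# Conceptual sketch of a paged KV cache block table; names are illustrative only.
class PagedKVCacheSketch:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # shared physical block pool
        self.block_tables = {}                         # seq_id -> [physical block ids]
        self.seq_lens = {}                             # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (physical_block, offset) slot for one new token of seq_id."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:              # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())       # allocate a new physical block
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Two concurrent sequences of different lengths share the same pool.
cache = PagedKVCacheSketch(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token(seq_id=0)       # long sequence: occupies 3 blocks
for _ in range(5):
    cache.append_token(seq_id=1)       # short sequence: occupies 1 block
print(cache.block_tables)              # {0: [7, 6, 5], 1: [4]}
cache.free_sequence(0)                 # blocks become reusable by new requests
```

Because blocks are allocated only as sequences actually grow and returned as soon as a sequence finishes, longer sequences and larger batches fit within the same memory budget, which is the motivation behind the Memory Optimizations and Multi-sequence Support items above.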

Weekly Public Meeting

We host a weekly public meeting about LLMIR and its ecosystem. To be notified of the next meeting, please subscribe to the LLMIR Announcements category on Discourse.

You can subscribe to this public calendar to keep up to date with the schedule.

If you’d like to discuss a particular topic or have questions, please add it to the agenda doc.

More resources

For more information on LLMIR, please see:

  • The LLMIR section of the LLVM forums for any questions.
  • Real-time discussion on the LLMIR channel of the LLVM Discord server.

Citing LLMIR

Please see the FAQ entry on how to cite LLMIR in publications.