LLMIR Architecture
This section provides detailed information about the LLMIR architecture, its key components, and its main features.
Key Features
LLMIR is being developed with several key optimizations for LLM inference:
- KV Cache Optimization: Efficient key-value cache management techniques
- Quantization Support: Comprehensive quantization capabilities
- Distributed Deployment: Support for multi-device inference
- Performance Evaluation: Benchmarking and evaluation methodologies
System Architecture
LLMIR (Large Language Model Intermediate Representation) is a compiler infrastructure for large language models based on MLIR, designed to optimize and accelerate LLM inference through specialized compilation techniques.
LLMIR follows a layered architecture:
```
┌─────────────────┐
│   Application   │
│  vLLM / SGLang  │
└────────┬────────┘
         │
         ▼
┌──────────────────────────────────────────────┐
│                LLMIR Compiler                │
│                                              │
│  ┌──────────────┐    ┌────────────────────┐  │
│  │  Front-end   │ →  │ MLIR Optimization  │  │
│  │  Converters  │    │      Pipeline      │  │
│  └──────────────┘    └─────────┬──────────┘  │
│                                │             │
│                      ┌─────────▼──────────┐  │
│                      │ Backend Generators │  │
│                      └────────────────────┘  │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
┌─────────────────────────────┐
│       Execution Layer       │
│ CUDA / ROCm / LLVM / Accel  │
└─────────────────────────────┘
```
Front-end Converters
The front-end converters are responsible for translating models and operations from existing frameworks into the LLMIR representation (a sketch of a converter entry point follows the list):
- vLLM Converter: Translates vLLM’s model representation and PagedAttention mechanism into LLMIR
- SGLang Converter: Maps SGLang’s computation graphs to LLMIR operations
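To make this concrete, here is a minimal C++ sketch of what a converter entry point could look like, using real MLIR builder APIs. The `VllmGraph` struct, the `convertVllmGraph` function, and the `llmir.paged_attention` op named in the comments are hypothetical placeholders, not confirmed LLMIR API.

```cpp
// Hypothetical front-end converter sketch: builds an MLIR module from a
// captured framework graph. Only the MLIR builder APIs used here are real.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/IR/OwningOpRef.h"

namespace llmir {

// Placeholder for whatever the vLLM front-end hands the converter.
struct VllmGraph { /* weights, layer topology, attention config ... */ };

mlir::OwningOpRef<mlir::ModuleOp> convertVllmGraph(const VllmGraph &graph,
                                                   mlir::MLIRContext &ctx) {
  ctx.loadDialect<mlir::func::FuncDialect>();
  mlir::OpBuilder builder(&ctx);
  auto module = mlir::ModuleOp::create(builder.getUnknownLoc());
  builder.setInsertionPointToEnd(module.getBody());

  // One MLIR function per captured forward pass.
  auto funcType = builder.getFunctionType(/*inputs=*/{}, /*results=*/{});
  auto func = builder.create<mlir::func::FuncOp>(builder.getUnknownLoc(),
                                                 "forward", funcType);
  builder.setInsertionPointToStart(func.addEntryBlock());

  // For each layer in `graph`, emit the matching LLMIR op here, e.g. a
  // hypothetical llmir.paged_attention op mirroring vLLM's PagedAttention.

  builder.create<mlir::func::ReturnOp>(builder.getUnknownLoc());
  return module;
}

} // namespace llmir
```

Keeping conversion separate from optimization lets each front-end target the same dialect without knowing anything about the passes that run later.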
MLIR Optimization Pipeline
The optimization pipeline includes a range of passes designed specifically for LLM inference; a sketch of how such a pipeline might be assembled appears after the list:
- General Optimizations: Common compiler optimizations like constant folding, dead code elimination, and loop optimizations
- LLM-Specific Optimizations: KV cache blocking, attention computation fusion, quantization transformations
- Hardware-Specific Optimizations: Optimizations targeting specific hardware features
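The sketch below assembles such a pipeline with MLIR's standard `PassManager`. The canonicalizer and CSE passes are stock upstream MLIR passes; `createKVCacheBlockingPass` and `createAttentionFusionPass` are hypothetical LLMIR pass names, shown commented out purely as placeholders.

```cpp
#include "llvm/Support/raw_ostream.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

void runLLMIRPipeline(mlir::ModuleOp module) {
  mlir::PassManager pm(module.getContext());

  // General optimizations: standard MLIR passes.
  pm.addPass(mlir::createCanonicalizerPass()); // folding, simplification
  pm.addPass(mlir::createCSEPass());           // common-subexpression elim.

  // LLM-specific optimizations (hypothetical LLMIR passes):
  // pm.addPass(llmir::createKVCacheBlockingPass());
  // pm.addPass(llmir::createAttentionFusionPass());

  // Hardware-specific passes would be appended here per target.

  if (mlir::failed(pm.run(module)))
    llvm::errs() << "LLMIR pipeline failed\n";
}
```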
Backend Generators
Backend generators produce optimized code for different execution targets (see the LLVM IR export sketch after this list):
- CUDA/HIP Code Generation: For NVIDIA and AMD GPUs
- LLVM IR Generation: For CPUs and general platforms
- Specialized Accelerator Code: For ML accelerators like TPUs
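For the CPU/LLVM path, MLIR already ships an LLVM IR exporter, so a backend generator can end in a translation step like the sketch below. It assumes earlier conversion passes have fully lowered the module to MLIR's LLVM dialect; `emitLLVMIR` is a hypothetical wrapper name.

```cpp
#include <memory>

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Target/LLVMIR/Dialect/Builtin/BuiltinToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Dialect/LLVMIR/LLVMToLLVMIRTranslation.h"
#include "mlir/Target/LLVMIR/Export.h"

// Translate a module already in the LLVM dialect into an llvm::Module,
// ready for LLVM's usual CPU code generation.
std::unique_ptr<llvm::Module> emitLLVMIR(mlir::ModuleOp module,
                                         llvm::LLVMContext &llvmCtx) {
  // Register the MLIR -> LLVM IR translation hooks on this context.
  mlir::registerBuiltinDialectTranslation(*module.getContext());
  mlir::registerLLVMDialectTranslation(*module.getContext());
  return mlir::translateModuleToLLVMIR(module, llvmCtx);
}
```

The CUDA/HIP paths would instead route through MLIR's GPU dialect and its NVVM/ROCDL lowerings rather than this generic LLVM export.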
Runtime Library
LLMIR includes a runtime library that provides key functionality; a sketch of the KV cache bookkeeping idea appears below the list:
- Memory Management: Efficient KV cache allocation and scheduling
- Execution Scheduler: Dynamic batching and request management
- Device Communication: Multi-device data exchange for distributed inference
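As a concrete (and deliberately simplified) illustration of the memory-management role, the sketch below does block-based KV cache bookkeeping in the spirit of PagedAttention. The class name and layout are illustrative assumptions, not LLMIR's actual runtime API.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Tracks which fixed-size KV cache blocks belong to which sequence.
class KVBlockAllocator {
public:
  explicit KVBlockAllocator(int32_t numBlocks) {
    for (int32_t b = numBlocks - 1; b >= 0; --b) freeBlocks.push_back(b);
  }

  // Grab one block for a sequence; on failure the scheduler can
  // preempt or queue the request instead of over-allocating.
  std::optional<int32_t> allocate(int64_t seqId) {
    if (freeBlocks.empty()) return std::nullopt;
    int32_t block = freeBlocks.back();
    freeBlocks.pop_back();
    blockTables[seqId].push_back(block);
    return block;
  }

  // Return all of a finished sequence's blocks to the free list.
  void release(int64_t seqId) {
    auto it = blockTables.find(seqId);
    if (it == blockTables.end()) return;
    freeBlocks.insert(freeBlocks.end(), it->second.begin(), it->second.end());
    blockTables.erase(it);
  }

private:
  std::vector<int32_t> freeBlocks;  // ids of currently unused blocks
  std::unordered_map<int64_t, std::vector<int32_t>> blockTables;  // per-sequence
};
```

Because sequences only ever hold whole blocks, fragmentation stays bounded and a freed block can be handed to a new request the moment an old one finishes.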
LLMIR Dialect
The core of LLMIR is a specialized MLIR dialect with custom types and operations tailored to LLM workloads; a rough skeleton is sketched below. For detailed information about specific features, please visit the dedicated pages for the features listed above.
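As an illustration of the dialect's shape, the C++ skeleton below registers an `llmir` namespace with MLIR. Real MLIR dialects are usually declared in TableGen; this hand-written class and its empty op/type lists are hypothetical.

```cpp
#include "mlir/IR/Dialect.h"
#include "mlir/IR/MLIRContext.h"

namespace llmir {

// Hypothetical hand-written skeleton of the LLMIR dialect.
class LLMIRDialect : public mlir::Dialect {
public:
  explicit LLMIRDialect(mlir::MLIRContext *ctx)
      : mlir::Dialect(getDialectNamespace(), ctx,
                      mlir::TypeID::get<LLMIRDialect>()) {
    // addOperations<...>() and addTypes<...>() would register the
    // LLM-specific ops and types (e.g. a paged KV cache type) here.
  }
  static llvm::StringRef getDialectNamespace() { return "llmir"; }
};

} // namespace llmir

// Usage: load the dialect into a context before building llmir ops.
// mlir::MLIRContext ctx;
// ctx.loadDialect<llmir::LLMIRDialect>();
```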
Development Status
LLMIR is being developed in phases according to our development plan:
- Phase 1 (Current Focus): Building the core infrastructure, including MLIR dialect design and implementation
- Phase 2 (Planned): Implementing core optimizations like KV cache management and attention fusion
- Phase 3 (Future): Adding advanced features such as quantization, parallelism strategies, and advanced hardware targeting
For the current status and detailed roadmap, please visit our GitHub repository.
References
For a comprehensive list of related work and publications that have influenced LLMIR, please see our References page.