LLMIR

Version 0.0.1

Large Language Model IR Compiler Framework

LLMIR Developer Guide

This guide provides an overview of how to develop with LLMIR.

Building LLMIR 

LLMIR is built on top of the MLIR ecosystem. To build LLMIR, you’ll need:

  1. A C++ compiler (GCC or Clang) with C++17 support
  2. CMake (3.13.4 or higher)
  3. Python (3.7 or higher)
  4. Ninja or Make build system

Clone the Repository 

git clone https://github.com/chenxingqiang/llmir.git
cd llmir

Configure the Build 

mkdir build && cd build
cmake -G Ninja ..
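# If LLVM/MLIR is installed out of tree, you may need to point CMake at
# its package config, e.g. -DMLIR_DIR=/path/to/llvm/lib/cmake/mlir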

Build 

ninja

LLMIR Project Structure 

The LLMIR project is structured as follows:

include/mlir/Dialect/LLM/       # MLIR dialect definitions
  ├── IR/                       # MLIR operations and types
  └── Runtime/                  # Runtime support headers

lib/Dialect/LLM/                # Implementation
  ├── IR/                       # MLIR operation implementations
  └── Runtime/                  # Runtime library implementations

test/Dialect/LLM/               # Tests
  ├── IR/                       # MLIR operation tests
  └── Runtime/                  # Runtime tests

examples/                       # Example applications
  └── kv_cache_example.cpp      # KV cache example

Core Components (Under Development) 

LLMIR is being developed in phases according to our development plan. The core components are:

Phase 1: Basic Infrastructure 

  • LLM MLIR Dialect: Specialized dialect defining operations and types for LLM inference (see the registration sketch after this list)
  • Custom Type System: Types for representing KV caches, sharded tensors, etc.
  • Core Operations: Attention, linear, layernorm, etc.
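
Because the dialect is still under development, the class and header names below are assumptions rather than LLMIR's settled API; only the DialectRegistry/MLIRContext registration calls are standard MLIR. A minimal C++ sketch of wiring such a dialect into a context:

// Registering a custom dialect with an MLIRContext. `LLMDialect` and its
// header path are hypothetical; the registration APIs are standard MLIR.
#include "mlir/IR/DialectRegistry.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/Dialect/LLM/IR/LLMDialect.h"  // hypothetical LLMIR header

int main() {
  mlir::DialectRegistry registry;
  registry.insert<mlir::llm::LLMDialect>();      // make the dialect available
  mlir::MLIRContext context(registry);
  context.loadDialect<mlir::llm::LLMDialect>();  // eagerly load it
  return 0;
}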

Phase 2: Core Optimizations 

  • KV Cache Management: PagedAttention-style block-based KV cache handling (a minimal allocator sketch follows this list)
  • Attention Computation: Fusion and optimization of attention operations
  • Memory Management: Block allocation and recycling strategies
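
To make the block-based design concrete, here is a small self-contained C++ sketch of the allocation and recycling idea behind a PagedAttention-style cache. It is illustrative only: the names (BlockAllocator, SequenceCache, and so on) are assumptions, not LLMIR's actual runtime API.

#include <cstdint>
#include <vector>

// The KV cache is carved into fixed-size blocks; freed blocks are
// recycled through a free list rather than returned to the allocator.
class BlockAllocator {
public:
  explicit BlockAllocator(int32_t numBlocks) {
    for (int32_t i = numBlocks - 1; i >= 0; --i)
      freeList.push_back(i);
  }

  // Hand out one free block index, or -1 if the pool is exhausted.
  int32_t allocate() {
    if (freeList.empty())
      return -1;
    int32_t block = freeList.back();
    freeList.pop_back();
    return block;
  }

  // Return a block to the pool so another sequence can reuse it.
  void release(int32_t block) { freeList.push_back(block); }

private:
  std::vector<int32_t> freeList;
};

// Per-sequence block table: maps logical token positions to physical
// cache blocks, growing one block at a time as tokens are appended.
struct SequenceCache {
  std::vector<int32_t> blockTable;
  int32_t numTokens = 0;

  // Record one appended token; grab a new block when the last one fills.
  bool appendToken(BlockAllocator &alloc, int32_t blockSize) {
    if (numTokens % blockSize == 0) {
      int32_t block = alloc.allocate();
      if (block < 0)
        return false;  // out of cache memory
      blockTable.push_back(block);
    }
    ++numTokens;
    return true;
  }
};

The block table built here plays the same role as the %block_indices value in the MLIR example later in this guide.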

Phase 3: Advanced Features 

  • Quantization Support: INT8/INT4 quantization transformations (see the sketch after this list)
  • Parallelism Strategies: Tensor and pipeline parallelism
  • Backend Code Generation: CUDA/CPU/accelerator support
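
As a flavour of what these transformations compute, the following C++ sketch shows symmetric per-tensor INT8 quantization, where the scale is chosen so the largest magnitude maps to 127. The function names are illustrative; LLMIR's actual passes may use different schemes (per-channel scales, INT4 packing, etc.).

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor scale: scale = max|x| / 127.
float computeScale(const std::vector<float> &x) {
  float maxAbs = 0.0f;
  for (float v : x)
    maxAbs = std::max(maxAbs, std::fabs(v));
  return maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;
}

// q = clamp(round(x / scale), -127, 127), stored in one byte per value.
std::vector<int8_t> quantize(const std::vector<float> &x, float scale) {
  std::vector<int8_t> q;
  q.reserve(x.size());
  for (float v : x) {
    float r = std::round(v / scale);
    q.push_back(static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r))));
  }
  return q;
}

// Dequantization recovers x ≈ q * scale.
float dequantize(int8_t q, float scale) { return q * scale; }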

Contributing to LLMIR 

LLMIR is in the early phases of development, and contributions are welcome. Here’s how you can contribute:

  1. Review the development plan in our repository
  2. Choose an area to focus on (dialect design, optimization, etc.)
  3. Follow standard MLIR development practices
  4. Submit pull requests with well-tested changes

Development Workflow 

We recommend the following workflow for contributing to LLMIR:

  1. Create a new branch for your feature
  2. Implement the necessary changes with appropriate tests
  3. Update documentation to reflect your changes
  4. Submit a pull request for review

Running Tests 

Once tests are implemented, you can run them from the build directory:

ninja check-llmir

Example: KV Cache in MLIR 

Here’s an example of how a paged KV cache might be represented in LLMIR (syntax may evolve as the project develops):

// Create a paged KV cache type
!kv_cache_t = !llm.paged_kv_cache<f16, 12, 16, 64, 16, 4096>
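// (one plausible reading of the type parameters: element type, number of
// layers, number of heads, head dimension, block size, max sequence length)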

// Append key-value pairs to the cache
%new_kv, %block_indices = llm.append_kv %kv_cache, %keys, %values, %seq_ids {
  block_size = 16 : i32,
  max_seq_len = 4096 : i32
} : (!kv_cache_t, tensor<2x1x16x64xf16>, tensor<2x1x16x64xf16>, tensor<2xi32>) 
    -> (!kv_cache_t, tensor<2x1xi32>)

// Perform paged attention with the KV cache
%output = llm.paged_attention %query, %new_kv, %block_indices, %seq_lens {
  num_heads = 16 : i32,
  head_dim = 64 : i32,
  scale = 0.125 : f32
} : (tensor<2x1x16x64xf16>, !kv_cache_t, tensor<2x1xi32>, tensor<2xi32>) 
    -> tensor<2x1x16x64xf16>
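
In this sketch, llm.append_kv writes the new keys and values into cache blocks and returns the indices of the blocks holding the appended tokens; llm.paged_attention then uses those indices to gather the cached keys and values while computing attention. The scale of 0.125 is 1/sqrt(head_dim) for head_dim = 64.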

For more details on the project roadmap and architecture, please refer to our GitHub repository.