turboquant (0.3.3)

Published 2026-04-19 09:36:07 +00:00 by to

Installation

pip install --index-url  --extra-index-url https://pypi.org/simple turboquant

About this package

High-performance quantization library for local LLM inference

TurboQuant

High-performance quantization library and model compression tool for local LLM inference

Overview

TurboQuant is a Rust library and CLI tool designed for efficient quantization of large language models (LLMs). It enables running AI models locally with minimal resource usage by compressing model weights while maintaining acceptable accuracy.
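To make "compressing model weights" concrete, here is a minimal sketch of symmetric absmax quantization, a generic textbook scheme. It is not taken from TurboQuant's source, and the function names `quantize_absmax`, `dequantize`, and `max_abs_error` are hypothetical. It uses only the standard library.

```rust
/// Symmetric absmax quantization of a slice to signed `bits`-bit integers.
/// Returns the integer codes and the scale needed to dequantize them.
/// (Hypothetical helper, not part of the turboquant API.)
fn quantize_absmax(weights: &[f32], bits: u32) -> (Vec<i8>, f32) {
    assert!((2..=8).contains(&bits), "this sketch supports 2 to 8 bits");
    // Largest representable magnitude, e.g. 7 for 4 bits, 127 for 8 bits.
    let qmax = ((1i32 << (bits - 1)) - 1) as f32;
    let absmax = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when every weight is zero.
    let scale = if absmax == 0.0 { 1.0 } else { absmax / qmax };
    let codes = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-qmax, qmax) as i8)
        .collect();
    (codes, scale)
}

/// Map integer codes back to approximate floating-point weights.
fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

/// Largest absolute difference between two equally long slices.
fn max_abs_error(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    let weights = [0.12f32, -0.50, 0.33, 0.91, -0.07, 0.00, -0.88, 0.45];
    for bits in [8, 4, 2] {
        let (codes, scale) = quantize_absmax(&weights, bits);
        let restored = dequantize(&codes, scale);
        println!(
            "{}-bit: scale = {:.4}, max error = {:.4}",
            bits,
            scale,
            max_abs_error(&weights, &restored)
        );
    }
}
```

With round-to-nearest, the reconstruction error per weight is at most half the scale, so it grows as the bit width shrinks.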

Research Background

This library is inspired by Google Research's TurboQuant paper:

"TurboQuant: Online Vector Quantization with Near-optimal Distortion"
Google Research, April 2025
arXiv:2504.19874

Key insights from the research:

  • Random rotation induces concentrated Beta distribution on coordinates
  • Two-stage quantization: MSE-optimal quantizer + 1-bit QJL transform for unbiased inner products
  • Near-optimal distortion: Within ≈2.7x of Shannon's Lower Bound
  • KV cache compression: 6x memory reduction with zero accuracy loss at 3.5 bits/channel
  • Speed improvements: Up to 8x inference speedup, virtually zero indexing time
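The first insight above, that a random rotation spreads a vector's energy evenly across coordinates, can be illustrated with a small sketch. As an assumption for illustration, it uses a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform) as a cheap stand-in for a general random rotation. It is not the paper's exact construction and is not taken from this library. `SplitMix64`, `hadamard`, `random_rotate`, and `l2_norm` are hypothetical names.

```rust
/// Small deterministic PRNG (SplitMix64) so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is orthonormal.
fn hadamard(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (v[i], v[i + h]);
                v[i] = a + b;
                v[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = (n as f32).sqrt();
    for x in v.iter_mut() {
        *x /= norm;
    }
}

/// Random sign flips followed by a Hadamard transform. The combined map is
/// orthogonal, so it preserves the vector's length.
fn random_rotate(v: &mut [f32], seed: u64) {
    let mut rng = SplitMix64(seed);
    for x in v.iter_mut() {
        if rng.next_u64() & 1 == 1 {
            *x = -*x;
        }
    }
    hadamard(v);
}

fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

fn main() {
    // A vector dominated by a single outlier coordinate.
    let mut v = vec![0.0f32; 8];
    v[3] = 4.0;
    random_rotate(&mut v, 42);
    println!("rotated: {:?}", v);
    println!("norm after rotation: {}", l2_norm(&v));
}
```

After the rotation, every coordinate has the same magnitude while the norm is unchanged. A single scalar quantizer can then cover all coordinates without being dominated by the outlier.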

See docs/TURBOQUANT_RESEARCH.md for detailed paper summary and key quotes.
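The 1-bit QJL idea mentioned in the key insights can be sketched independently of this library. Under stated assumptions, the sketch stores only the signs of Gaussian random projections of a key vector plus the key's norm, and then estimates an inner product with an unquantized query. In the paper the transform is applied to the residual of the MSE-optimal quantizer; here it is applied to the key directly to keep the example short. All names (`projection_matrix`, `encode_key`, `estimate_inner_product`) are hypothetical.

```rust
/// Small deterministic PRNG (SplitMix64) with a Box-Muller Gaussian sampler.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1].
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.next_unit(), self.next_unit());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// An m x d matrix with independent standard normal entries.
fn projection_matrix(m: usize, d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut rng = SplitMix64(seed);
    let mut rows = Vec::with_capacity(m);
    for _ in 0..m {
        let mut row = Vec::with_capacity(d);
        for _ in 0..d {
            row.push(rng.next_gaussian());
        }
        rows.push(row);
    }
    rows
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Key side: one sign bit per projection, plus the key's norm.
struct QjlKey {
    signs: Vec<bool>,
    norm: f64,
}

fn encode_key(s: &[Vec<f64>], key: &[f64]) -> QjlKey {
    QjlKey {
        signs: s.iter().map(|row| dot(row, key) >= 0.0).collect(),
        norm: dot(key, key).sqrt(),
    }
}

/// Estimate <query, key> as sqrt(pi/2) / m * ||key|| * sum_i (s_i . query) * sign_i.
fn estimate_inner_product(s: &[Vec<f64>], query: &[f64], key: &QjlKey) -> f64 {
    let sum: f64 = s
        .iter()
        .zip(&key.signs)
        .map(|(row, &positive)| {
            let p = dot(row, query);
            if positive { p } else { -p }
        })
        .sum();
    (std::f64::consts::PI / 2.0).sqrt() / s.len() as f64 * key.norm * sum
}

fn main() {
    let d = 16;
    let s = projection_matrix(8192, d, 7);
    let key: Vec<f64> = (0..d).map(|i| (i as f64).sin()).collect();
    let query: Vec<f64> = (0..d).map(|i| (0.5 * i as f64).cos()).collect();
    let code = encode_key(&s, &key);
    println!("exact    = {:.4}", dot(&query, &key));
    println!("estimate = {:.4}", estimate_inner_product(&s, &query, &code));
}
```

The estimator is unbiased because, for a Gaussian row s, the expectation of (s · q) · sign(s · k) equals sqrt(2/π) · ⟨q, k⟩ / ‖k‖. Its variance shrinks as the number of projections grows.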

Key Features

  • Multiple Quantization Types: Support for Int8, Int4, Int3, Int2, Int1 (binary), and NF4 (4-bit NormalFloat) quantization
  • High Performance: Multi-threaded processing using Rayon for parallel computation
  • Model Format Support: GGUF, GGML, and Safetensors formats (loading is not yet complete; see Limitations)
  • CLI Tool: Easy-to-use command-line interface for model quantization
  • Library API: Programmatic access for integration into your applications
  • Model Discovery: Automatically find and analyze models in directories
  • Batch Processing: Quantize multiple models in a single operation
  • Benchmarking: Built-in performance benchmarking tools

Installation

From Source

git clone https://github.com/your-org/turboquant
cd turboquant
cargo build --release

The CLI tool will be available at target/release/turboquant-cli.

From Crates.io (when published)

cargo install turboquant

As a Library

Add to your Cargo.toml:

[dependencies]
turboquant = "0.1.0"

CLI Usage

Quantize a Model

# Quantize to 4-bit (recommended)
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4

# Quantize to 8-bit for maximum accuracy
turboquant-cli quantize -i model.gguf -o model-q8.gguf -q int8

# Quantize to 2-bit for minimal memory usage
turboquant-cli quantize -i model.gguf -o model-q2.gguf -q int2

# Use custom thread count
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4 -t 8

# Enable verbose output
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4 -v

Analyze a Model

# Get recommendations for optimal quantization
turboquant-cli analyze -m model.gguf

# Verbose analysis
turboquant-cli analyze -m model.gguf -v

Discover Models

# Find all models in a directory
turboquant-cli discover -d /path/to/models

# Verbose discovery with recommendations
turboquant-cli discover -d /path/to/models -v

Batch Processing

# Quantize all models in a directory
turboquant-cli batch -i /input/models -o /output/models -q int4

# Dry run (see what would be processed)
turboquant-cli batch -i /input/models -o /output/models -q int4 --dry-run

Benchmark

# Run synthetic benchmarks
turboquant-cli benchmark -n 20

# Benchmark with a specific model
turboquant-cli benchmark -m model.gguf -n 10

Get Information

# Show supported formats and quantization types
turboquant-cli info

Library Usage

Basic Quantization

use turboquant::{QuantizationType, TurboQuant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create TurboQuant instance with default config
    let tq = TurboQuant::new();
    
    // Quantize a model to 4-bit
    tq.quantize_model(
        "model.gguf",
        "model-q4.gguf",
        QuantizationType::Int4
    ).await?;
    
    Ok(())
}

Custom Configuration

use turboquant::{QuantizationConfig, TurboQuant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create custom configuration
    let config = QuantizationConfig {
        quantization_bits: 4,
        max_context_length: 2048,
        use_gpu: false,
        batch_size: 1,
        num_threads: 8,
        temperature: 0.7,
        top_p: 0.9,
        top_k: 40,
    };
    
    let tq = TurboQuant::with_config(config);
    
    // Load a model using the custom configuration
    let model = tq.load_model("model.gguf").await?;
    
    Ok(())
}

Model Discovery

use turboquant::{ModelDiscovery, DiscoveryConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = DiscoveryConfig::default();
    let discovery = ModelDiscovery::new(config);
    
    // Discover models in a directory
    let models = discovery.discover_models("/path/to/models").await?;
    
    for model in models {
        println!("Found: {:?}", model.path);
        println!("Format: {:?}", model.format);
        println!("Size: {} MB", model.size / (1024 * 1024));
    }
    
    Ok(())
}
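For readers curious what discovery might involve, here is a minimal standalone sketch using only the standard library. It walks a directory tree and classifies files by extension. The extension-to-format mapping, the `ModelFormat` enum, and the `detect_format` and `discover` functions are assumptions for illustration and are not the crate's `ModelDiscovery` implementation.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical format tag; the real crate may model this differently.
#[derive(Debug, PartialEq)]
enum ModelFormat {
    Gguf,
    Ggml,
    Safetensors,
}

/// Classify a file by its extension (case-insensitive).
/// Assumption: each format is identified by a same-named extension.
fn detect_format(path: &Path) -> Option<ModelFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "gguf" => Some(ModelFormat::Gguf),
        "ggml" => Some(ModelFormat::Ggml),
        "safetensors" => Some(ModelFormat::Safetensors),
        _ => None,
    }
}

/// Recursively collect (path, format, size in bytes) for every model file under `dir`.
fn discover(dir: &Path, found: &mut Vec<(PathBuf, ModelFormat, u64)>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.is_dir() {
            discover(&path, found)?;
        } else if let Some(format) = detect_format(&path) {
            let size = entry.metadata()?.len();
            found.push((path, format, size));
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut found = Vec::new();
    discover(Path::new("."), &mut found)?;
    for (path, format, size) in &found {
        println!("{:?}: {:?}, {} MB", path, format, size / (1024 * 1024));
    }
    Ok(())
}
```

Extension matching is cheap but can misclassify renamed files. A more robust check would also inspect the file's leading bytes.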

Inference Engine

use turboquant::{TurboQuant, QuantizationConfig, InferenceEngine};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tq = TurboQuant::new();
    
    // Load model
    let model = tq.load_model("model-q4.gguf").await?;
    
    // Create inference engine
    let config = QuantizationConfig::default();
    let engine = InferenceEngine::new(model, config)?;
    
    // Generate text
    let result = engine.generate("Hello, how are you?", 100).await?;
    
    println!("Generated: {}", result.text);
    println!("Tokens/sec: {:.2}", result.stats.tokens_per_second);
    
    Ok(())
}

Quantization Types

Type   Bits   Compression (vs. FP32)   Accuracy      Use Case
Int8   8      4x                       Minimal       High-accuracy requirements
Int4   4      8x                       Good          Recommended for general use
Int3   3      10.7x                    Moderate      Memory-constrained environments
Int2   2      16x                      Noticeable    Maximum compression
Int1   1      32x                      Significant   Experimental, extreme compression
NF4    4      8x                       Excellent     Models with normal weight distribution
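The compression figures in the table are consistent with a 32-bit floating-point baseline, that is, ratio = 32 / bits. The short sketch below reproduces them and estimates raw weight storage for a model of a given parameter count. It ignores scales, zero points, and file metadata, so real files will be somewhat larger. The function names are hypothetical.

```rust
/// Compression ratio relative to 32-bit floating-point weights.
fn compression_ratio(bits: u32) -> f64 {
    32.0 / bits as f64
}

/// Approximate weight storage in GiB, ignoring scales, zero points and metadata.
fn weight_size_gib(params: u64, bits: u32) -> f64 {
    params as f64 * bits as f64 / 8.0 / (1u64 << 30) as f64
}

fn main() {
    let types = [("Int8", 8), ("Int4", 4), ("Int3", 3), ("Int2", 2), ("Int1", 1)];
    for (name, bits) in types {
        println!(
            "{}: {:.1}x, 7B-parameter model ~ {:.2} GiB",
            name,
            compression_ratio(bits),
            weight_size_gib(7_000_000_000, bits)
        );
    }
}
```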

Performance

TurboQuant is optimized for:

  • Multi-threaded Processing: Automatically utilizes all available CPU cores
  • Memory Efficiency: Streamlined algorithms minimize memory footprint
  • Parallel Quantization: Layer-wise parallelization using Rayon
  • SIMD Operations: Leveraging modern CPU vector instructions
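Layer-wise parallelism of the kind listed above can be sketched with scoped threads from the standard library. This is a stand-in for illustration only: the README says the library uses Rayon, and `quantize_layer` and `quantize_layers_parallel` are hypothetical names, with a simple int8 absmax scheme as the assumed per-layer work.

```rust
use std::thread;

/// Quantize one layer to int8 with a per-layer absmax scale.
fn quantize_layer(layer: &[f32]) -> (Vec<i8>, f32) {
    let absmax = layer.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let codes = layer.iter().map(|w| (w / scale).round() as i8).collect();
    (codes, scale)
}

/// Split the layers into one contiguous group per worker. Each scoped thread
/// quantizes its group, and the results are joined back in layer order.
fn quantize_layers_parallel(layers: &[Vec<f32>], num_threads: usize) -> Vec<(Vec<i8>, f32)> {
    let num_threads = num_threads.max(1);
    // Ceiling division; at least 1 so `chunks` never receives zero.
    let group = ((layers.len() + num_threads - 1) / num_threads).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = layers
            .chunks(group)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|l| quantize_layer(l)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    // Five small synthetic "layers" of 16 weights each.
    let layers: Vec<Vec<f32>> = (0..5)
        .map(|i| {
            (0..16)
                .map(|j| ((i * 16 + j) as f32 * 0.37).sin())
                .collect::<Vec<f32>>()
        })
        .collect();
    let quantized = quantize_layers_parallel(&layers, 3);
    for (i, (codes, scale)) in quantized.iter().enumerate() {
        println!("layer {}: {} codes, scale {:.5}", i, codes.len(), scale);
    }
}
```

Because each layer is quantized independently, the parallel result is identical to a sequential pass; only the wall-clock time changes.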

Benchmarks

Typical quantization times (on modern hardware):

  • 7B parameter model: ~2-5 minutes (Int4)
  • 13B parameter model: ~5-10 minutes (Int4)
  • 70B parameter model: ~20-40 minutes (Int4)

Architecture

turboquant/
├── src/
│   ├── lib.rs           # Library entry point
│   ├── config.rs        # Configuration structures
│   ├── error.rs         # Error types
│   ├── quantization.rs  # Quantization algorithms
│   ├── model.rs         # Model loading and management
│   ├── inference.rs     # Inference engine
│   ├── tokenizer.rs     # Tokenization support
│   ├── model_discovery.rs # Model discovery utilities
│   └── bin/
│       └── main.rs      # CLI application
├── benches/
│   └── quantization_benchmark.rs
└── Cargo.toml

Configuration

QuantizationConfig

pub struct QuantizationConfig {
    pub quantization_bits: u8,        // 1-8 bits
    pub max_context_length: usize,    // Maximum context window
    pub use_gpu: bool,                // GPU acceleration
    pub batch_size: usize,            // Batch size for inference
    pub num_threads: usize,           // CPU threads
    pub temperature: f32,             // Sampling temperature
    pub top_p: f32,                   // Nucleus sampling
    pub top_k: usize,                 // Top-k sampling
}
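A configuration like this is easy to get subtly wrong, so a validation step can help. The sketch below mirrors the struct locally so it compiles on its own. The `validate` function is hypothetical. Only the 1-8 bit range comes from the comments above; the other bounds are conventional assumptions, not documented behaviour of the crate.

```rust
/// Local mirror of the struct shown above so this sketch is self-contained.
#[derive(Debug, Clone)]
struct QuantizationConfig {
    quantization_bits: u8,
    max_context_length: usize,
    use_gpu: bool,
    batch_size: usize,
    num_threads: usize,
    temperature: f32,
    top_p: f32,
    top_k: usize,
}

/// Hypothetical sanity check. Only the 1-8 bit range is taken from the README;
/// the remaining bounds are assumptions.
fn validate(c: &QuantizationConfig) -> Result<(), String> {
    if !(1..=8).contains(&c.quantization_bits) {
        return Err(format!("quantization_bits must be 1-8, got {}", c.quantization_bits));
    }
    if c.max_context_length == 0 || c.batch_size == 0 || c.num_threads == 0 {
        return Err("max_context_length, batch_size and num_threads must be non-zero".to_string());
    }
    // Written as a negated comparison so that NaN is rejected as well.
    if !(c.temperature >= 0.0) {
        return Err(format!("temperature must be >= 0, got {}", c.temperature));
    }
    if !(c.top_p > 0.0 && c.top_p <= 1.0) {
        return Err(format!("top_p must be in (0, 1], got {}", c.top_p));
    }
    Ok(())
}

fn main() {
    let config = QuantizationConfig {
        quantization_bits: 4,
        max_context_length: 2048,
        use_gpu: false,
        batch_size: 1,
        num_threads: 8,
        temperature: 0.7,
        top_p: 0.9,
        top_k: 40,
    };
    println!("valid config: {:?}", validate(&config));

    let bad = QuantizationConfig { quantization_bits: 9, ..config.clone() };
    println!("9-bit config: {:?}", validate(&bad));
}
```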

Preset Configurations

// CPU-optimized configuration
let config = QuantizationConfig::cpu_optimized();

// GPU-optimized configuration
let config = QuantizationConfig::gpu_optimized();

// Memory-efficient configuration
let config = QuantizationConfig::memory_efficient();

Error Handling

TurboQuant provides comprehensive error types:

use turboquant::TurboQuantError;

match tq.load_model("model.gguf").await {
    Ok(model) => { /* ... */ }
    Err(TurboQuantError::ModelLoading(msg)) => eprintln!("Loading failed: {}", msg),
    Err(TurboQuantError::Io(err)) => eprintln!("IO error: {}", err),
    Err(TurboQuantError::Quantization(msg)) => eprintln!("Quantization failed: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
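The match above implies an enum with at least `ModelLoading`, `Io`, and `Quantization` variants. As a rough guess at its shape, the standalone sketch below defines a local enum with just those three variants, a `Display` implementation, and a `From<io::Error>` conversion so that `?` works on I/O calls. The real type has further variants (the catch-all arm suggests so), and the payload types, the messages, and the helper `read_model_bytes` are assumptions.

```rust
use std::{fmt, fs, io};

/// Local, partial reconstruction for illustration; not the crate's definition.
#[derive(Debug)]
enum TurboQuantError {
    ModelLoading(String),
    Io(io::Error),
    Quantization(String),
}

impl fmt::Display for TurboQuantError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TurboQuantError::ModelLoading(msg) => write!(f, "model loading failed: {}", msg),
            TurboQuantError::Io(err) => write!(f, "I/O error: {}", err),
            TurboQuantError::Quantization(msg) => write!(f, "quantization failed: {}", msg),
        }
    }
}

impl std::error::Error for TurboQuantError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            TurboQuantError::Io(err) => Some(err),
            _ => None,
        }
    }
}

/// Lets `?` convert `io::Error` into `TurboQuantError::Io` automatically.
impl From<io::Error> for TurboQuantError {
    fn from(err: io::Error) -> Self {
        TurboQuantError::Io(err)
    }
}

/// Hypothetical helper: read a file and reject empty input.
fn read_model_bytes(path: &str) -> Result<Vec<u8>, TurboQuantError> {
    let bytes = fs::read(path)?; // io::Error becomes TurboQuantError::Io
    if bytes.is_empty() {
        return Err(TurboQuantError::ModelLoading(format!("{} is empty", path)));
    }
    Ok(bytes)
}

fn main() {
    match read_model_bytes("model.gguf") {
        Ok(bytes) => println!("read {} bytes", bytes.len()),
        Err(TurboQuantError::Io(err)) => eprintln!("IO error: {}", err),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```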

Limitations

  • Incomplete Implementation: Some features (GGUF loading, inference forward pass) are marked as TODO
  • Model Format Support: Full GGUF/GGML loading is planned but not yet complete
  • GPU Acceleration: The use_gpu option exists in the configuration, but GPU execution is not yet implemented
  • Inference Engine: Forward pass implementation is in progress
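Since GGUF loading is listed as incomplete, here is a minimal standalone sketch of the very first step, reading the fixed-size GGUF header. It assumes the version 2 or later layout: the 4-byte magic "GGUF", a little-endian u32 version, then u64 tensor and metadata key-value counts. It is not taken from this crate, and `GgufHeader` and `read_gguf_header` are hypothetical names.

```rust
use std::io::{self, Read};

/// Fixed-size fields at the start of a GGUF file (version 2 or later layout assumed).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read and validate the header from any byte source.
fn read_gguf_header<R: Read>(mut reader: R) -> io::Result<GgufHeader> {
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing GGUF magic"));
    }

    let mut u32_buf = [0u8; 4];
    reader.read_exact(&mut u32_buf)?;
    let version = u32::from_le_bytes(u32_buf);

    let mut u64_buf = [0u8; 8];
    reader.read_exact(&mut u64_buf)?;
    let tensor_count = u64::from_le_bytes(u64_buf);
    reader.read_exact(&mut u64_buf)?;
    let metadata_kv_count = u64::from_le_bytes(u64_buf);

    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}

fn main() -> io::Result<()> {
    // Build a synthetic header in memory instead of requiring a real model file.
    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"GGUF");
    bytes.extend_from_slice(&3u32.to_le_bytes());
    bytes.extend_from_slice(&291u64.to_le_bytes());
    bytes.extend_from_slice(&24u64.to_le_bytes());

    let header = read_gguf_header(bytes.as_slice())?;
    println!("{:?}", header);
    Ok(())
}
```

The same function accepts a `std::fs::File`, since it is generic over `Read`. Parsing the metadata key-value pairs and tensor descriptors that follow the header is the larger part of a real loader and is out of scope here.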

Roadmap

  • Complete GGUF model loading implementation
  • Full inference engine forward pass
  • GPU acceleration support (CUDA, Metal)
  • Additional model format support (Safetensors)
  • Advanced quantization algorithms (AWQ, GPTQ)
  • Model merging and fine-tuning support
  • Python bindings

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by llama.cpp and the GGUF format
  • Uses ndarray for numerical computing
  • Built with tokio for async runtime
  • CLI powered by clap

Citation

@software{turboquant2024,
  author = {Tony Rewin},
  title = {TurboQuant: High-performance LLM Quantization},
  year = {2024},
  url = {https://github.com/your-org/turboquant}
}

Requirements

Requires Python: >=3.9