turboquant (0.3.3)

Published 2026-04-19 09:36:07 +00:00 by to

To install the package using pip, run the following command:

pip install --index-url  --extra-index-url https://pypi.org/simple turboquant

For more information on the PyPI registry, see the documentation.

High-performance quantization library for local LLM inference

TurboQuant

High-performance quantization library and model compression tool for local LLM inference

Overview

TurboQuant is a Rust library and CLI tool designed for efficient quantization of large language models (LLMs). It enables running AI models locally with minimal resource usage by compressing model weights while maintaining acceptable accuracy.

Research Background

This library is inspired by Google Research's TurboQuant paper:

"TurboQuant: Online Vector Quantization with Near-optimal Distortion"
Google Research, April 2025
arXiv:2504.19874

Key insights from the research:

Random rotation induces concentrated Beta distribution on coordinates
Two-stage quantization: MSE-optimal quantizer + 1-bit QJL transform for unbiased inner products
Near-optimal distortion: Within ≈2.7x of Shannon's Lower Bound
KV cache compression: 6x memory reduction with zero accuracy loss at 3.5 bits/channel
Speed improvements: Up to 8x inference speedup, virtually zero indexing time

See docs/TURBOQUANT_RESEARCH.md for detailed paper summary and key quotes.
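The 1-bit stage mentioned above can be illustrated with a small, dependency-free sketch. It shows only a QJL-style sign estimator for inner products, not the full two-stage scheme from the paper and not this crate's implementation. All names here (`SplitMix64`, `SignSketch`, `projection`, `sketch_key`, `estimate_inner_product`) are hypothetical, and the PRNG is included only so the example needs no external crates.

```rust
/// SplitMix64: a tiny deterministic PRNG so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in (0, 1].
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.next_unit(), self.next_unit());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// 1-bit sketch of a key vector: the signs of `m` random projections
/// plus the vector's Euclidean norm.
struct SignSketch {
    signs: Vec<bool>,
    norm: f64,
}

/// Draws an m x d Gaussian projection matrix from a fixed seed.
fn projection(m: usize, d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut rng = SplitMix64(seed);
    (0..m)
        .map(|_| (0..d).map(|_| rng.next_gaussian()).collect())
        .collect()
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Keeps one sign bit per projection row.
fn sketch_key(s: &[Vec<f64>], key: &[f64]) -> SignSketch {
    SignSketch {
        signs: s.iter().map(|row| dot(row, key) >= 0.0).collect(),
        norm: dot(key, key).sqrt(),
    }
}

/// Unbiased estimate of <query, key> from a full-precision query and the
/// 1-bit key sketch: sqrt(pi/2) * ||key|| / m * sum_i (S q)_i * sign((S k)_i).
fn estimate_inner_product(s: &[Vec<f64>], query: &[f64], sk: &SignSketch) -> f64 {
    let m = s.len() as f64;
    let sum: f64 = s
        .iter()
        .zip(&sk.signs)
        .map(|(row, &positive)| {
            let projected = dot(row, query);
            if positive { projected } else { -projected }
        })
        .sum();
    (std::f64::consts::PI / 2.0).sqrt() * sk.norm / m * sum
}
```

The estimate is unbiased, and its error shrinks roughly as 1/sqrt(m), so more projection rows trade memory for accuracy.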

Key Features

Multiple Quantization Types: Support for Int8, Int4, Int3, Int2, Int1 (binary), and NF4 (4-bit NormalFloat) quantization
High Performance: Multi-threaded processing using Rayon for parallel computation
Model Format Support: GGUF, GGML, and Safetensors formats (loading is still partial; see Limitations)
CLI Tool: Easy-to-use command-line interface for model quantization
Library API: Programmatic access for integration into your applications
Model Discovery: Automatically find and analyze models in directories
Batch Processing: Quantize multiple models in a single operation
Benchmarking: Built-in performance benchmarking tools
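To make the integer quantization types listed above concrete, here is a minimal sketch of symmetric "absmax" quantization with a single shared scale. It is a generic textbook scheme, not necessarily the algorithm this crate uses, and the function names `quantize_absmax` and `dequantize` are hypothetical.

```rust
/// Quantizes `weights` to signed `bits`-bit integer codes that share one scale
/// (symmetric "absmax" scheme). Returns the codes and the scale.
fn quantize_absmax(weights: &[f32], bits: u32) -> (Vec<i8>, f32) {
    assert!((2..=8).contains(&bits), "this sketch covers 2 to 8 bits");
    // Largest representable magnitude, e.g. 7 for 4-bit and 127 for 8-bit.
    let qmax = ((1i32 << (bits - 1)) - 1) as f32;
    let absmax = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // An all-zero block gets a dummy scale to avoid dividing by zero.
    let scale = if absmax == 0.0 { 1.0 } else { absmax / qmax };
    let codes = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-qmax, qmax) as i8)
        .collect();
    (codes, scale)
}

/// Maps integer codes back to approximate floating-point weights.
fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}
```

With this scheme the reconstruction error of each weight is at most half the scale, which is why fewer bits (a coarser grid) cost more accuracy.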

Installation

From Source

git clone https://github.com/your-org/turboquant
cd turboquant
cargo build --release

The CLI tool will be available at target/release/turboquant-cli.

From Crates.io (when published)

cargo install turboquant

As a Library

Add to your Cargo.toml:

[dependencies]
turboquant = "0.1.0"

CLI Usage

Quantize a Model

# Quantize to 4-bit (recommended)
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4

# Quantize to 8-bit for maximum accuracy
turboquant-cli quantize -i model.gguf -o model-q8.gguf -q int8

# Quantize to 2-bit for minimal memory usage
turboquant-cli quantize -i model.gguf -o model-q2.gguf -q int2

# Use custom thread count
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4 -t 8

# Enable verbose output
turboquant-cli quantize -i model.gguf -o model-q4.gguf -q int4 -v
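If the CLI needs to be driven from another Rust program, one option is to build the argument list and spawn the process. The sketch below uses only the flags shown above (`-i`, `-o`, `-q`, `-t`) and assumes `turboquant-cli` is on `PATH`. The helper names `quantize_args` and `run_quantize` are hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Builds the argument list for `turboquant-cli quantize`.
fn quantize_args(input: &Path, output: &Path, qtype: &str, threads: Option<usize>) -> Vec<String> {
    let mut args = vec![
        "quantize".to_string(),
        "-i".to_string(),
        input.display().to_string(),
        "-o".to_string(),
        output.display().to_string(),
        "-q".to_string(),
        qtype.to_string(),
    ];
    // The thread count is optional, mirroring the `-t 8` example.
    if let Some(t) = threads {
        args.push("-t".to_string());
        args.push(t.to_string());
    }
    args
}

/// Runs the CLI and reports whether it exited successfully.
/// Assumes `turboquant-cli` is installed and on PATH.
fn run_quantize(
    input: &Path,
    output: &Path,
    qtype: &str,
    threads: Option<usize>,
) -> std::io::Result<bool> {
    let status = Command::new("turboquant-cli")
        .args(quantize_args(input, output, qtype, threads))
        .status()?;
    Ok(status.success())
}
```

Keeping argument construction separate from process spawning makes the command line easy to log or test without running the tool.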

Analyze a Model

# Get recommendations for optimal quantization
turboquant-cli analyze -m model.gguf

# Verbose analysis
turboquant-cli analyze -m model.gguf -v

Discover Models

# Find all models in a directory
turboquant-cli discover -d /path/to/models

# Verbose discovery with recommendations
turboquant-cli discover -d /path/to/models -v

Batch Processing

# Quantize all models in a directory
turboquant-cli batch -i /input/models -o /output/models -q int4

# Dry run (see what would be processed)
turboquant-cli batch -i /input/models -o /output/models -q int4 --dry-run
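A dry run amounts to computing the input/output pairs without touching any model. The sketch below shows one way such a plan could be built. The extension list is taken from the formats named in this README, while the `<stem>-<qtype>.<ext>` output naming and the helper names `plan_batch` and `list_files` are assumptions, not the CLI's documented behavior.

```rust
use std::path::{Path, PathBuf};

/// Extensions treated as model files in this sketch.
const MODEL_EXTENSIONS: [&str; 3] = ["gguf", "ggml", "safetensors"];

/// Pairs every model file in `inputs` with an output path inside `out_dir`.
/// Files with other extensions are skipped. The naming scheme is hypothetical.
fn plan_batch(inputs: &[PathBuf], out_dir: &Path, qtype: &str) -> Vec<(PathBuf, PathBuf)> {
    inputs
        .iter()
        .filter_map(|input| {
            let ext = input.extension()?.to_str()?;
            if !MODEL_EXTENSIONS.contains(&ext) {
                return None;
            }
            let stem = input.file_stem()?.to_str()?;
            let output = out_dir.join(format!("{stem}-{qtype}.{ext}"));
            Some((input.clone(), output))
        })
        .collect()
}

/// Lists the regular files directly inside `dir` (non-recursive), sorted by path.
fn list_files(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in std::fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

Printing the result of `plan_batch(&list_files(dir)?, out_dir, "int4")` would give a dry-run listing, and iterating over the same plan would drive the real run.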

Benchmark

# Run synthetic benchmarks
turboquant-cli benchmark -n 20

# Benchmark with a specific model
turboquant-cli benchmark -m model.gguf -n 10
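For readers who want to time their own quantization code rather than rely on the built-in command, a minimal timing harness can be written with the standard library alone. This sketch is independent of how `turboquant-cli benchmark` is implemented, and the names `mean_runtime` and `mweights_per_second` are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `iterations` times and returns the mean wall-clock time per run.
fn mean_runtime<F: FnMut()>(iterations: u32, mut work: F) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        work();
    }
    start.elapsed() / iterations
}

/// Converts a per-run duration into throughput in millions of weights per second.
fn mweights_per_second(weights: usize, per_run: Duration) -> f64 {
    weights as f64 / per_run.as_secs_f64() / 1e6
}
```

Averaging over several iterations, as the `-n` flag suggests, smooths out one-off effects such as cold caches.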

Get Information

# Show supported formats and quantization types
turboquant-cli info

Library Usage

Basic Quantization

use turboquant::{TurboQuant, QuantizationType};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create TurboQuant instance with default config
    let tq = TurboQuant::new();
    
    // Quantize a model to 4-bit
    tq.quantize_model(
        "model.gguf",
        "model-q4.gguf",
        QuantizationType::Int4
    ).await?;
    
    Ok(())
}

Custom Configuration

use turboquant::{TurboQuant, QuantizationConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create custom configuration
    let config = QuantizationConfig {
        quantization_bits: 4,
        max_context_length: 2048,
        use_gpu: false,
        batch_size: 1,
        num_threads: 8,
        temperature: 0.7,
        top_p: 0.9,
        top_k: 40,
    };
    
    let tq = TurboQuant::with_config(config);
    
    // Load a model using the custom configuration
    let _model = tq.load_model("model.gguf").await?;
    
    Ok(())
}
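The `temperature`, `top_k`, and `top_p` fields in the configuration above are standard sampling parameters. As a rough illustration of what they usually control, the sketch below turns raw logits into a filtered sampling distribution. This is a generic formulation, not the crate's inference code, and `sampling_distribution` is a hypothetical name.

```rust
/// Turns raw logits into a sampling distribution using temperature scaling,
/// top-k filtering, and top-p (nucleus) filtering, in that order.
/// Returns (token_index, probability) pairs sorted by descending probability.
fn sampling_distribution(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(temperature > 0.0 && !logits.is_empty());

    // Temperature-scaled softmax, shifted by the max logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| ((l - max) / temperature).exp()).collect();
    let total: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / total).enumerate().collect();
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Keep the k most likely tokens (at least one).
    probs.truncate(top_k.max(1));

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = 0;
    for &(_, p) in &probs {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    probs.truncate(keep);

    // Renormalize so the surviving probabilities sum to 1.
    let kept_total: f32 = probs.iter().map(|&(_, p)| p).sum();
    for entry in &mut probs {
        entry.1 /= kept_total;
    }
    probs
}
```

Lower temperatures sharpen the distribution, while smaller `top_k` and `top_p` values restrict sampling to the most likely tokens.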

Model Discovery

use turboquant::{ModelDiscovery, DiscoveryConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = DiscoveryConfig::default();
    let discovery = ModelDiscovery::new(config);
    
    // Discover models in a directory
    let models = discovery.discover_models("/path/to/models").await?;
    
    for model in models {
        println!("Found: {:?}", model.path);
        println!("Format: {:?}", model.format);
        println!("Size: {} MiB", model.size / (1024 * 1024));
    }
    
    Ok(())
}
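Discovery ultimately has to decide which files are models. The sketch below shows one plausible approach: guess from the extension, and for GGUF confirm against the four-byte `GGUF` magic at the start of the file. It does not describe how `ModelDiscovery` works internally. The `ModelFormat` enum and the helper functions are hypothetical.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ModelFormat {
    Gguf,
    Ggml,
    Safetensors,
}

/// Guesses the format from the file extension (case-insensitive).
fn format_from_extension(path: &Path) -> Option<ModelFormat> {
    match path.extension()?.to_str()?.to_ascii_lowercase().as_str() {
        "gguf" => Some(ModelFormat::Gguf),
        "ggml" => Some(ModelFormat::Ggml),
        "safetensors" => Some(ModelFormat::Safetensors),
        _ => None,
    }
}

/// GGUF files begin with the four ASCII bytes "GGUF".
fn has_gguf_magic(header: &[u8]) -> bool {
    header.starts_with(b"GGUF")
}

/// Detects the format of a file on disk. The GGUF magic takes priority, and a
/// `.gguf` file without the magic is rejected; other formats rely on the extension.
fn detect_format(path: &Path) -> std::io::Result<Option<ModelFormat>> {
    let mut header = [0u8; 4];
    let read = File::open(path)?.read(&mut header)?;
    if has_gguf_magic(&header[..read]) {
        return Ok(Some(ModelFormat::Gguf));
    }
    match format_from_extension(path) {
        Some(ModelFormat::Gguf) => Ok(None), // extension says GGUF, content does not
        other => Ok(other),
    }
}
```

Checking content as well as the extension avoids treating a renamed or truncated file as a valid model.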

Inference Engine

use turboquant::{TurboQuant, QuantizationConfig, InferenceEngine};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tq = TurboQuant::new();
    
    // Load model
    let model = tq.load_model("model-q4.gguf").await?;
    
    // Create inference engine
    let config = QuantizationConfig::default();
    let engine = InferenceEngine::new(model, config)?;
    
    // Generate text
    let result = engine.generate("Hello, how are you?", 100).await?;
    
    println!("Generated: {}", result.text);
    println!("Tokens/sec: {:.2}", result.stats.tokens_per_second);
    
    Ok(())
}

Quantization Types

Type	Bits	Compression (vs. FP32)	Accuracy	Use Case
Int8	8	4x	Minimal loss	High-accuracy requirements
Int4	4	8x	Good	Recommended for general use
Int3	3	10.7x	Moderate loss	Memory-constrained environments
Int2	2	16x	Noticeable loss	Maximum compression
Int1	1	32x	Significant loss	Experimental, extreme compression
NF4	4	8x	Excellent	Models with normally distributed weights
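The compression figures in the table follow from dividing 32 bits (a full-precision float) by the bit width, for example 32 / 3 ≈ 10.7. The sketch below reproduces that arithmetic and shows how two 4-bit codes can share one byte, which is what makes the 8x figure achievable in storage. Function names are hypothetical, and real formats add some overhead for scales and metadata that this ignores.

```rust
/// Compression ratio relative to 32-bit floats.
fn compression_ratio(bits: u32) -> f64 {
    32.0 / bits as f64
}

/// Approximate weight storage in GiB for `params` parameters at `bits` bits each,
/// ignoring scales, zero points, and file metadata.
fn weight_gib(params: u64, bits: u32) -> f64 {
    (params as f64 * bits as f64 / 8.0) / (1024.0 * 1024.0 * 1024.0)
}

/// Packs signed 4-bit codes (range -8..=7) two per byte, low nibble first.
/// An odd trailing code is padded with a zero nibble.
fn pack_int4(codes: &[i8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| {
            let lo = (pair[0] as u8) & 0x0F;
            let hi = (pair.get(1).copied().unwrap_or(0) as u8) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Reverses `pack_int4`, sign-extending each nibble back to i8.
fn unpack_int4(bytes: &[u8], len: usize) -> Vec<i8> {
    bytes
        .iter()
        .flat_map(|&b| [b & 0x0F, b >> 4])
        .map(|nibble| ((nibble << 4) as i8) >> 4) // arithmetic shift restores the sign
        .take(len)
        .collect()
}
```

By the same arithmetic, a 7B-parameter model needs roughly 3.3 GiB for its weights at 4 bits, compared with about 26 GiB at 32 bits.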

Performance

TurboQuant is optimized for:

Multi-threaded Processing: Automatically utilizes all available CPU cores
Memory Efficiency: Streamlined algorithms minimize memory footprint
Parallel Quantization: Layer-wise parallelization using Rayon
SIMD Operations: Leveraging modern CPU vector instructions
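Layer-wise parallelism can be illustrated without external crates. The sketch below uses scoped threads from the standard library in place of Rayon, so it is a stand-in for the idea and not the crate's actual code. `quantize_layer` and `quantize_layers_parallel` are hypothetical names, and the per-layer work is a simple symmetric 8-bit scheme chosen for brevity.

```rust
use std::thread;

/// Hypothetical per-layer work: symmetric 8-bit quantization with one scale.
fn quantize_layer(weights: &[f32]) -> (Vec<i8>, f32) {
    let absmax = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let codes = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (codes, scale)
}

/// Quantizes layers in parallel by giving each thread one contiguous chunk.
/// Results come back in the original layer order.
fn quantize_layers_parallel(layers: &[Vec<f32>], num_threads: usize) -> Vec<(Vec<i8>, f32)> {
    let threads = num_threads.max(1);
    // Ceiling division so every layer lands in exactly one chunk.
    let chunk = ((layers.len() + threads - 1) / threads).max(1);
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = layers
            .chunks(chunk)
            .map(|part| {
                scope.spawn(move || part.iter().map(|l| quantize_layer(l)).collect::<Vec<_>>())
            })
            .collect();
        // Join in spawn order to preserve layer order.
        let results: Vec<(Vec<i8>, f32)> = handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect();
        results
    })
}
```

Layers are independent of one another during quantization, which is why this kind of work splits cleanly across cores. A work-stealing pool such as Rayon's additionally balances layers of uneven size.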

Benchmarks

Typical quantization times (on modern hardware):

7B parameter model: ~2-5 minutes (Int4)
13B parameter model: ~5-10 minutes (Int4)
70B parameter model: ~20-40 minutes (Int4)

Architecture

turboquant/
├── src/
│   ├── lib.rs           # Library entry point
│   ├── config.rs        # Configuration structures
│   ├── error.rs         # Error types
│   ├── quantization.rs  # Quantization algorithms
│   ├── model.rs         # Model loading and management
│   ├── inference.rs     # Inference engine
│   ├── tokenizer.rs     # Tokenization support
│   ├── model_discovery.rs # Model discovery utilities
│   └── bin/
│       └── main.rs      # CLI application
├── benches/
│   └── quantization_benchmark.rs
└── Cargo.toml

Configuration

QuantizationConfig

pub struct QuantizationConfig {
    pub quantization_bits: u8,        // 1-8 bits
    pub max_context_length: usize,    // Maximum context window
    pub use_gpu: bool,                // GPU acceleration
    pub batch_size: usize,            // Batch size for inference
    pub num_threads: usize,           // CPU threads
    pub temperature: f32,             // Sampling temperature
    pub top_p: f32,                   // Nucleus sampling
    pub top_k: usize,                 // Top-k sampling
}
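Because the struct has public fields, invalid combinations can be constructed by hand. A validation step like the following sketch could catch them early. The struct is repeated so the example compiles on its own, the 1-8 bit range comes from the field comment above, and the other checks and the `validate` method itself are assumptions, not a confirmed part of the crate's API.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct QuantizationConfig {
    pub quantization_bits: u8,
    pub max_context_length: usize,
    pub use_gpu: bool,
    pub batch_size: usize,
    pub num_threads: usize,
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: usize,
}

impl QuantizationConfig {
    /// Hypothetical validation helper that rejects out-of-range values.
    pub fn validate(&self) -> Result<(), String> {
        if !(1..=8).contains(&self.quantization_bits) {
            return Err(format!(
                "quantization_bits must be 1-8, got {}",
                self.quantization_bits
            ));
        }
        if self.max_context_length == 0 || self.batch_size == 0 || self.num_threads == 0 {
            return Err("max_context_length, batch_size, and num_threads must be non-zero".into());
        }
        if self.temperature.is_nan() || self.temperature < 0.0 {
            return Err(format!("temperature must be >= 0, got {}", self.temperature));
        }
        if !(self.top_p > 0.0 && self.top_p <= 1.0) {
            return Err(format!("top_p must be in (0, 1], got {}", self.top_p));
        }
        Ok(())
    }
}
```

Calling `config.validate()?` before constructing an engine would turn a silent misconfiguration into an explicit error.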

Preset Configurations

// CPU-optimized configuration
let config = QuantizationConfig::cpu_optimized();

// GPU-optimized configuration
let config = QuantizationConfig::gpu_optimized();

// Memory-efficient configuration
let config = QuantizationConfig::memory_efficient();

Error Handling

TurboQuant provides comprehensive error types:

use turboquant::TurboQuantError;

// Assumes `tq` is a TurboQuant instance and this runs inside an async function
match tq.load_model("model.gguf").await {
    Ok(model) => { /* ... */ }
    Err(TurboQuantError::ModelLoading(msg)) => eprintln!("Loading failed: {}", msg),
    Err(TurboQuantError::Io(err)) => eprintln!("IO error: {}", err),
    Err(TurboQuantError::Quantization(msg)) => eprintln!("Quantization failed: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
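For context, an error enum with the three variants matched above could be defined roughly as follows. This is a sketch of the common Rust pattern, not the crate's actual definition: the catch-all arm in the example implies the real type has further variants, and the message strings and the `read_header` helper here are hypothetical.

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
pub enum TurboQuantError {
    ModelLoading(String),
    Io(io::Error),
    Quantization(String),
}

impl fmt::Display for TurboQuantError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::ModelLoading(msg) => write!(f, "model loading failed: {msg}"),
            Self::Io(err) => write!(f, "I/O error: {err}"),
            Self::Quantization(msg) => write!(f, "quantization failed: {msg}"),
        }
    }
}

impl std::error::Error for TurboQuantError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Self::Io(err) => Some(err),
            _ => None,
        }
    }
}

/// Lets `?` convert `io::Error` into `TurboQuantError::Io` automatically.
impl From<io::Error> for TurboQuantError {
    fn from(err: io::Error) -> Self {
        Self::Io(err)
    }
}

/// Hypothetical helper showing both error paths: I/O failures are converted
/// by `?`, and a too-short file is reported as a loading error.
fn read_header(path: &str) -> Result<Vec<u8>, TurboQuantError> {
    let bytes = std::fs::read(path)?;
    if bytes.len() < 4 {
        return Err(TurboQuantError::ModelLoading(format!("{path}: file too short")));
    }
    Ok(bytes[..4].to_vec())
}
```

Implementing `std::error::Error` is what allows such a type to be returned through `Box<dyn std::error::Error>`, as the `main` functions in the examples above do.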

Limitations

Incomplete Implementation: Some features (GGUF loading, inference forward pass) are marked as TODO
Model Format Support: Full GGUF/GGML loading is planned but not yet complete
GPU Acceleration: GPU support is configured but not yet implemented
Inference Engine: Forward pass implementation is in progress

Roadmap

Complete GGUF model loading implementation
Full inference engine forward pass
GPU acceleration support (CUDA, Metal)
Additional model format support (Safetensors)
Advanced quantization algorithms (AWQ, GPTQ)
Model merging and fine-tuning support
Python bindings

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Inspired by llama.cpp and the GGUF format
Uses ndarray for numerical computing
Built with tokio for async runtime
CLI powered by clap

Citation

@software{turboquant2024,
  author = {Tony Rewin},
  title = {TurboQuant: High-performance LLM Quantization},
  year = {2024},
  url = {https://github.com/your-org/turboquant}
}

Requires Python: >=3.9

Details

PyPI

2026-04-19 09:36:07 +00:00

MIT

1.7 MiB

Assets (3)

turboquant-0.3.3-cp313-cp313-macosx_11_0_arm64.whl 879 KiB

turboquant-0.3.3.tar.gz 196 KiB

turboquant-0.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 654 KiB
