Getting Started
Introduction
QuantLLM is a library for quantizing and deploying large language models, with a focus on memory efficiency and performance. It supports the GGUF format, progress tracking during quantization, and benchmarking tools for comparing results.
Installation
Install the base package:
pip install quantllm
For GGUF support, install with extras (the quotes keep shells such as zsh from expanding the brackets):
pip install "quantllm[gguf]"
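To confirm that the install succeeded, a short check along these lines can help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for this illustration and is not part of QuantLLM.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical helper, not a QuantLLM API: report what pip actually installed.
version = installed_version("quantllm")
print(f"quantllm version: {version}" if version else "quantllm is not installed")
```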
Quick Start
Here’s a complete example showcasing GGUF quantization and benchmarking:
from quantllm import QuantLLM
from transformers import AutoTokenizer

# 1. Load tokenizer and prepare calibration data
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_text = ["This is an example text for calibration."] * 10
calibration_data = tokenizer(calibration_text, return_tensors="pt", padding=True)["input_ids"]

# 2. Quantize using high-level API
quantized_model, benchmark_results = QuantLLM.quantize_from_pretrained(
    model_name_or_path=model_name,
    bits=4,                       # Quantization bits (2-8)
    group_size=32,                # Group size for quantization
    quant_type="Q4_K_M",          # GGUF quantization type
    calibration_data=calibration_data,
    benchmark=True,               # Run benchmarks
    benchmark_input_shape=(1, 32),
    benchmark_steps=50,
    cpu_offload=False,            # Set to True for large models
    chunk_size=1000,              # Process in chunks for memory efficiency
)

# 3. Save the quantized model
QuantLLM.save_quantized_model(
    model=quantized_model,
    output_path="quantized_model",
    save_tokenizer=True,
)

# 4. Convert to GGUF format
QuantLLM.convert_to_gguf(
    model=quantized_model,
    output_path="model.gguf",
)
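To build intuition for what the `bits` and `group_size` arguments control, the sketch below quantizes a flat list of weights group by group, using one scale per group. It is a generic symmetric round-to-nearest scheme written for illustration only. It is not QuantLLM's implementation and does not reproduce the actual GGUF K-quant block layouts, and the function names are made up for this example.

```python
import math


def quantize_groupwise(weights, bits=4, group_size=32):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # largest signed code, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round each weight to the nearest code and clamp to the valid range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=32):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [math.sin(i) for i in range(64)]
codes, scales = quantize_groupwise(weights, bits=4, group_size=32)
restored = dequantize_groupwise(codes, scales, group_size=32)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(scales)} groups, max reconstruction error {max_error:.4f}")
```

In this scheme the reconstruction error of each weight is bounded by half of its group's scale. A smaller `group_size` gives each scale fewer values to cover, which tends to lower the error at the cost of storing more scales.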
Core Features
Advanced GGUF Quantization
The library supports various GGUF quantization types:
- 2-bit Quantization
  - Q2_K: Best for extreme compression
  - Suitable for smaller models or when size is critical
- 4-bit Quantization
  - Q4_K_S: Standard 4-bit quantization
  - Q4_K_M: 4-bit quantization with improved accuracy
  - Best balance of size and quality
- 8-bit Quantization
  - Q8_0: High-precision 8-bit quantization
  - Best for quality-critical applications
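The size trade-off between these bit widths can be estimated with simple arithmetic: parameters times bits, divided by eight. The sketch below does that for a model with roughly 125 million parameters. It counts nominal weight bits only, so real GGUF files will be somewhat larger because they also store per-block scales and metadata. The function name is an assumption made for this example.

```python
def nominal_weight_size_mb(num_params: int, bits: int) -> float:
    """Nominal storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits / 8 / 1_000_000


NUM_PARAMS = 125_000_000  # approximate count for a 125M-parameter model

for bits in (16, 8, 4, 2):
    size = nominal_weight_size_mb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit: {size:.2f} MB")
```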
Memory-Efficient Processing
- Chunk-based quantization for large models
- Automatic device management
- CPU offloading support
- Progress tracking with memory statistics
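The general idea behind chunk-based processing with progress and memory reporting can be sketched with the standard library alone. This is an illustration of the pattern, not QuantLLM's internals: the helper names are hypothetical, and what `chunk_size` counts inside the library is not specified here.

```python
import tracemalloc


def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_in_chunks(items, chunk_size, fn):
    """Apply `fn` chunk by chunk, printing progress and peak traced memory."""
    results = []
    tracemalloc.start()
    for chunk in iter_chunks(items, chunk_size):
        results.extend(fn(x) for x in chunk)
        _, peak = tracemalloc.get_traced_memory()
        print(f"{len(results)}/{len(items)} done, peak traced memory {peak / 1e6:.2f} MB")
    tracemalloc.stop()
    return results


# Stand-in workload: square 2,500 numbers in chunks of 1,000.
squares = process_in_chunks(list(range(2500)), 1000, lambda x: x * x)
```

Only one chunk's intermediate values are live at a time, which is why a smaller chunk size lowers peak memory while adding some loop overhead.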
Detailed Examples
1. Direct GGUF Quantization
For more control over the quantization process:
from quantllm.quant import GGUFQuantizer
import torch

# Initialize quantizer with detailed configuration
quantizer = GGUFQuantizer(
    model_name="facebook/opt-125m",
    bits=4,
    group_size=32,
    quant_type="Q4_K_M",
    use_packed=True,
    desc_act=False,
    desc_ten=False,
    legacy_format=False,
    batch_size=4,
    device="cuda" if torch.cuda.is_available() else "cpu",
    cpu_offload=False,
    gradient_checkpointing=False,
    chunk_size=1000,
)

# Quantize the model (calibration_data is prepared as in the Quick Start)
quantized_model = quantizer.quantize(calibration_data=calibration_data)

# Convert to GGUF format with progress tracking
quantizer.convert_to_gguf("model.gguf")
2. Comprehensive Benchmarking
Evaluate quantization performance:
from quantllm.utils.benchmark import QuantizationBenchmark

# `model` is a model you have already loaded; `calibration_data` is
# prepared as in the Quick Start.

# Initialize benchmark
benchmark = QuantizationBenchmark(
    model=model,
    calibration_data=calibration_data,
    input_shape=(1, 32),
    num_inference_steps=100,
    device="cuda",                # Assumes a CUDA GPU is available
    num_warmup_steps=10,
)

# Run benchmarks and get detailed metrics
results = benchmark.run_all_benchmarks()

# Print detailed report
benchmark.print_report()

# Optional: Generate visualization
benchmark.plot_comparison("benchmark_results.png")
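The role of warmup steps in a benchmark like the one above can be shown with a small, library-independent timing loop. This sketch is not how `QuantizationBenchmark` is implemented. It only illustrates why untimed warmup calls come before the measured ones, and `time_callable` is a hypothetical name.

```python
import statistics
import time


def time_callable(fn, num_steps=100, num_warmup_steps=10):
    """Time `fn` over `num_steps` calls after `num_warmup_steps` untimed calls."""
    for _ in range(num_warmup_steps):
        fn()  # Warm caches and lazy initialization without recording timings
    timings = []
    for _ in range(num_steps):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "steps": num_steps,
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
    }


# Stand-in workload; in practice `fn` would run one forward pass of the model.
report = time_callable(lambda: sum(range(10_000)), num_steps=50, num_warmup_steps=5)
print(f"mean {report['mean_s'] * 1e6:.1f} us, median {report['median_s'] * 1e6:.1f} us")
```

Reporting the median alongside the mean makes occasional slow outliers easier to spot.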
3. Memory-Efficient Processing
For large models with memory constraints:
from quantllm.quant import GGUFQuantizer

# Configure for memory efficiency
quantizer = GGUFQuantizer(
    model_name="facebook/opt-1.3b",   # Larger model
    bits=4,
    group_size=32,
    cpu_offload=True,                 # Enable CPU offloading
    chunk_size=500,                   # Smaller chunks for memory efficiency
    gradient_checkpointing=True,
)

# Process in chunks with progress display
# (calibration_data is prepared as in the Quick Start)
quantized_model = quantizer.quantize(calibration_data=calibration_data)
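Supported GGUF Types
- Q2_K: 2-bit, for extreme compression
- Q4_K_S: standard 4-bit quantization
- Q4_K_M: 4-bit quantization with improved accuracy
- Q8_0: high-precision 8-bit quantization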
Best Practices
- Memory Management
  - Use cpu_offload=True when the model is larger than about 70% of GPU memory
  - Adjust chunk_size based on available memory
  - Enable gradient_checkpointing for large models
- Quantization Selection
  - Use Q4_K_M for general use cases
  - Use Q2_K for extreme compression needs
  - Use Q8_0 for quality-critical applications
- Performance Optimization
  - Run benchmarks to find optimal settings
  - Use appropriate batch sizes
  - Monitor memory usage with built-in tools
- Progress Tracking
  - Use the built-in progress bars
  - Monitor layer-wise quantization
  - Track memory usage during processing
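The 70% guideline under Memory Management can be turned into a small decision helper. The sketch below is one possible way to encode it. The function name and the default threshold argument are assumptions for illustration, and the sizes are passed in by the caller rather than measured.

```python
GIB = 1024 ** 3


def should_offload_to_cpu(model_bytes, gpu_total_bytes, threshold=0.7):
    """Return True when the model exceeds `threshold` of total GPU memory."""
    if gpu_total_bytes <= 0:
        raise ValueError("gpu_total_bytes must be positive")
    return model_bytes > threshold * gpu_total_bytes


print(should_offload_to_cpu(6 * GIB, 8 * GIB))  # True: 6 GiB is 75% of 8 GiB
print(should_offload_to_cpu(3 * GIB, 8 * GIB))  # False: 3 GiB is 37.5% of 8 GiB
```

On a CUDA machine, the total memory could be read from `torch.cuda.get_device_properties(0).total_memory`, and the result passed to `cpu_offload` when constructing the quantizer.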
Next Steps
- Check out the tutorials (tutorials/index) for more examples
- Read the API Reference for details on each class and function
- See the advanced usage guide (advanced_usage/index) for advanced features
- Visit the deployment guide (deployment) for deployment instructions