π¦ Installationο
Get QuantLLM up and running in minutes.
Requirementsο
Component |
Version |
|---|---|
Python |
3.10+ |
PyTorch |
2.0+ |
CUDA |
11.8+ (for GPU acceleration) |
Quick Installο
From GitHub (Recommended)ο
pip install git+https://github.com/codewithdark-git/QuantLLM.git
From PyPIο
pip install quantllm
Installation Optionsο
Choose the features you need:
# Basic installation
pip install quantllm
# With GGUF export support
pip install "quantllm[gguf]"
# With ONNX export support
pip install "quantllm[onnx]"
# With MLX export (Apple Silicon)
pip install "quantllm[mlx]"
# With Triton kernels (Linux, faster inference)
pip install "quantllm[triton]"
# With Flash Attention
pip install "quantllm[flash]"
# Full installation (everything)
pip install "quantllm[full]"
From Source (Development)ο
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
pip install -e ".[dev]"
Verify Installationο
import quantllm
# Check version
print(f"QuantLLM v{quantllm.__version__}")
# Show banner
quantllm.show_banner()
# Quick test
from quantllm import turbo
model = turbo("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(model.generate("Hello!"))
Expected output:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β
β π QuantLLM v2.1.0rc1 β
β Ultra-fast LLM Quantization & Export β
β β
β β GGUF β ONNX β MLX β SafeTensors β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Optional Dependenciesο
Flash Attention (Faster Inference)ο
pip install flash-attn --no-build-isolation
Note: Requires CUDA toolkit installed on your system.
Triton Kernels (GPU Optimization)ο
pip install triton>=2.1.0
Note: Linux only. Provides fused quantization kernels.
Troubleshootingο
CUDA Not Availableο
python -c "import torch; print(torch.cuda.is_available())"
If False, reinstall PyTorch with CUDA:
# CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
Flash Attention Build Errorsο
Flash Attention requires NVIDIA CUDA toolkit:
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit
# Then install
pip install flash-attn --no-build-isolation
Memory Issuesο
If you encounter OOM errors:
# Use 4-bit quantization
model = turbo("model-name", bits=4)
# Or use a smaller model
model = turbo("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
Windows Issuesο
Some features require Visual C++ Build Tools:
Download Visual Studio Build Tools
Install βDesktop development with C++β
Restart your terminal
Hardware Requirementsο
GPU VRAM |
Recommended Models |
|---|---|
6-8 GB |
1-7B models (4-bit) |
12-24 GB |
7-30B models (4-bit) |
24-80 GB |
70B+ models |
Tested GPUs:
NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
AMD: RX 7900 XTX (with ROCm)
Apple: M1, M2, M3, M4 (via MLX export)