# 📦 Installation Get QuantLLM up and running in minutes. --- ## Requirements | Component | Version | |-----------|---------| | Python | 3.10+ | | PyTorch | 2.0+ | | CUDA | 11.8+ (for GPU acceleration) | --- ## Quick Install ### From GitHub (Recommended) ```bash pip install git+https://github.com/codewithdark-git/QuantLLM.git ``` ### From PyPI ```bash pip install quantllm ``` --- ## Installation Options Choose the features you need: ```bash # Basic installation pip install quantllm # With GGUF export support pip install "quantllm[gguf]" # With ONNX export support pip install "quantllm[onnx]" # With MLX export (Apple Silicon) pip install "quantllm[mlx]" # With Triton kernels (Linux, faster inference) pip install "quantllm[triton]" # With Flash Attention pip install "quantllm[flash]" # Full installation (everything) pip install "quantllm[full]" ``` --- ## From Source (Development) ```bash git clone https://github.com/codewithdark-git/QuantLLM.git cd QuantLLM pip install -e ".[dev]" ``` --- ## Verify Installation ```python import quantllm # Check version print(f"QuantLLM v{quantllm.__version__}") # Show banner quantllm.show_banner() # Quick test from quantllm import turbo model = turbo("TinyLlama/TinyLlama-1.1B-Chat-v1.0") print(model.generate("Hello!")) ``` Expected output: ``` ╔════════════════════════════════════════════════════════════╗ ║ ║ ║ 🚀 QuantLLM v2.1.0rc1 ║ ║ Ultra-fast LLM Quantization & Export ║ ║ ║ ║ ✓ GGUF ✓ ONNX ✓ MLX ✓ SafeTensors ║ ║ ║ ╚════════════════════════════════════════════════════════════╝ ``` --- ## Optional Dependencies ### Flash Attention (Faster Inference) ```bash pip install flash-attn --no-build-isolation ``` > **Note**: Requires CUDA toolkit installed on your system. ### Triton Kernels (GPU Optimization) ```bash pip install triton>=2.1.0 ``` > **Note**: Linux only. Provides fused quantization kernels. --- ## Troubleshooting ### CUDA Not Available ```bash python -c "import torch; print(torch.cuda.is_available())" ``` If `False`, reinstall PyTorch with CUDA: ```bash # CUDA 11.8 pip install torch --index-url https://download.pytorch.org/whl/cu118 # CUDA 12.1 pip install torch --index-url https://download.pytorch.org/whl/cu121 ``` ### Flash Attention Build Errors Flash Attention requires NVIDIA CUDA toolkit: ```bash # Ubuntu/Debian sudo apt install nvidia-cuda-toolkit # Then install pip install flash-attn --no-build-isolation ``` ### Memory Issues If you encounter OOM errors: ```python # Use 4-bit quantization model = turbo("model-name", bits=4) # Or use a smaller model model = turbo("TinyLlama/TinyLlama-1.1B-Chat-v1.0") ``` ### Windows Issues Some features require Visual C++ Build Tools: 1. Download [Visual Studio Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) 2. Install "Desktop development with C++" 3. Restart your terminal --- ## Hardware Requirements | GPU VRAM | Recommended Models | |----------|-------------------| | 6-8 GB | 1-7B models (4-bit) | | 12-24 GB | 7-30B models (4-bit) | | 24-80 GB | 70B+ models | **Tested GPUs:** - NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100 - AMD: RX 7900 XTX (with ROCm) - Apple: M1, M2, M3, M4 (via MLX export) --- ## Next Steps - [Quick Start →](quickstart.md) - [Loading Models →](guide/loading-models.md) - [GGUF Export →](guide/gguf-export.md)