🚀 QuantLLM Documentation

The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export — All in One Line

Welcome to QuantLLM v2.1 (pre-release)

QuantLLM makes working with large language models simple: load a supported model, quantize it automatically, fine-tune it on your data, and export it to GGUF, ONNX, MLX, or SafeTensors in a few lines of code.

from quantllm import turbo

# Load with shared export/push defaults
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)

# Generate text
print(model.generate("Explain quantum computing"))

# Export with the config defaults set above (GGUF, Q4_K_M) for Ollama/llama.cpp
model.export()

# Push to HuggingFace with auto-generated model card
model.push("username/my-model")

✨ Key Features

Feature	Description
🔥 TurboModel API	One unified interface for everything
📦 Multi-Format Export	GGUF, ONNX, MLX, SafeTensors
⚡ Auto-Optimization	Flash Attention, torch.compile, dynamic padding
🎨 Beautiful UI	Orange-themed progress bars and logging
🤗 Hub Integration	One-click push with auto model cards
🧠 45+ Architectures	Llama, Mistral, Qwen, Phi, Gemma, and more

🚀 Quick Examples

Load Any Model

from quantllm import turbo

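# Each call loads a separate model; run only the one you need.
model = turbo("mistralai/Mistral-7B")
model = turbo("Qwen/Qwen2-7B", bits=4)  # request 4-bit quantization explicitly
model = turbo("microsoft/phi-3-mini")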
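
To get a feel for what bits=4 buys on a GPU with limited VRAM, the weight footprint can be estimated with simple arithmetic. The sketch below is not part of QuantLLM; the helper name and the nominal 7e9 parameter count are assumptions for illustration, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


# ASSUMPTION: nominal parameter count for a "7B" model; real checkpoints differ slightly.
params_7b = 7e9

for bits in (16, 8, 4):
    gib = estimate_weight_memory_gib(params_7b, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions a 7B model drops from roughly 13 GiB at 16 bits to about 3.3 GiB at 4 bits, which is why a 4-bit model can fit on a card in the 6GB class while a 16-bit one cannot.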

Export to Any Format

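from quantllm import turbo

model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export()  # no arguments: falls back to the config defaults (GGUF, Q4_K_M)
model.export("onnx", "./model-onnx/")  # explicit format and output directory
model.export("mlx", "./model-mlx/", quantization="4bit")  # per-call quantization setting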

Fine-tune in One Line

model.finetune("training_data.json", epochs=3)
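
Before starting a long fine-tuning run, it can help to sanity-check the data file. This page does not specify the schema of training_data.json, so the layout checked below (a JSON list of objects with "instruction" and "output" keys) is purely an assumption, and check_training_file is a hypothetical helper, not a QuantLLM function. Adjust required_keys to whatever format your data actually uses.

```python
import json
from pathlib import Path


def check_training_file(path: str, required_keys=("instruction", "output")) -> int:
    """Return the number of records, raising ValueError on malformed data.

    ASSUMPTION: the file holds a JSON list of objects containing required_keys.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list) or not records:
        raise ValueError("expected a non-empty JSON list of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} is not a JSON object")
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return len(records)
```

Calling check_training_file("training_data.json") before model.finetune(...) surfaces a malformed record immediately rather than partway through training.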

Push to HuggingFace

model.push("username/my-model")

💻 System Requirements

Python: 3.10+
PyTorch: 2.0+
GPU: NVIDIA with 6GB+ VRAM (recommended)
Platforms: Windows, Linux, macOS
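
The requirements above can be checked from Python before installing. This is a minimal sketch that does not use QuantLLM itself: check_environment is a hypothetical helper, the default thresholds mirror the list above, and the VRAM figure is read only from the first CUDA device, if one is present.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10), min_torch=(2, 0)) -> dict:
    """Compare the running interpreter and any installed PyTorch with the minimums."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_ok": False,
        "cuda_vram_gib": None,  # stays None when no CUDA device is visible
    }
    if report["torch_installed"]:
        import torch

        # torch.__version__ looks like "2.3.1+cu121"; keep only major.minor.
        major, minor = torch.__version__.split(".")[:2]
        report["torch_ok"] = (int(major), int(minor)) >= min_torch
        if torch.cuda.is_available():
            total_bytes = torch.cuda.get_device_properties(0).total_memory
            report["cuda_vram_gib"] = round(total_bytes / 1024**3, 1)
    return report


print(check_environment())
```

A cuda_vram_gib value below 6 does not rule out running, since the GPU is listed as recommended rather than required, but it suggests sticking to smaller models or lower bit widths.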
