🚀 QuantLLM Documentation

The Ultra-Fast LLM Quantization & Export Library
Load → Quantize → Fine-tune → Export — All in One Line

Welcome to QuantLLM v2.1 (pre-release)

QuantLLM makes working with large language models simple. Load any model, quantize it automatically, fine-tune with your data, and export to any format — all with just a few lines of code.

from quantllm import turbo

# Load with shared export/push defaults
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)

# Generate text
print(model.generate("Explain quantum computing"))

# Export to GGUF for Ollama/llama.cpp
model.export()

# Push to HuggingFace with auto-generated model card
model.push("username/my-model")

📚 Documentation


✨ Key Features

Feature

Description

🔥 TurboModel API

One unified interface for everything

📦 Multi-Format Export

GGUF, ONNX, MLX, SafeTensors

Auto-Optimization

Flash Attention, torch.compile, dynamic padding

🎨 Beautiful UI

Orange-themed progress bars and logging

🤗 Hub Integration

One-click push with auto model cards

🧠 45+ Architectures

Llama, Mistral, Qwen, Phi, Gemma, and more


🚀 Quick Examples

Load Any Model

from quantllm import turbo

model = turbo("mistralai/Mistral-7B")
model = turbo("Qwen/Qwen2-7B", bits=4)
model = turbo("microsoft/phi-3-mini")

Export to Any Format

model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export()
model.export("onnx", "./model-onnx/")
model.export("mlx", "./model-mlx/", quantization="4bit")

Fine-tune in One Line

model.finetune("training_data.json", epochs=3)

Push to HuggingFace

model.push("username/my-model")

💻 System Requirements

  • Python: 3.10+

  • PyTorch: 2.0+

  • GPU: NVIDIA with 6GB+ VRAM (recommended)

  • Platforms: Windows, Linux, macOS


Indices and Tables