🚀 Quick Start

Get up and running with QuantLLM in 5 minutes.


Your First Model

from quantllm import turbo

# Load any HuggingFace model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("Explain machine learning in simple terms")
print(response)

That’s it! QuantLLM automatically:

  • ✅ Detects your GPU and available memory

  • ✅ Applies optimal 4-bit quantization

  • ✅ Enables Flash Attention 2 when available

  • ✅ Configures memory management
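
To see why 4-bit quantization matters for a model of this size, the weight memory can be estimated with simple arithmetic. The sketch below is not part of QuantLLM: `estimate_weight_memory_gb` is a hypothetical helper, the 3-billion parameter count is a round figure assumed from the model name, and only weights are counted (no activations, KV cache, or quantization metadata).

```python
def estimate_weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits / 8  # bits per weight -> bytes
    return bytes_total / 1e9


if __name__ == "__main__":
    params = 3e9  # assumed round figure for a "3B" model
    for bits in (16, 8, 4):
        gb = estimate_weight_memory_gb(params, bits)
        print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, the weights shrink from about 6 GB at 16-bit to about 1.5 GB at 4-bit. Real usage will be higher once runtime buffers are included.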


Basic Usage

Generate Text

response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

Chat Mode

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]

response = model.chat(messages, max_new_tokens=200)
print(response)
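
For a multi-turn conversation, the message list has to grow with each exchange. Here is a minimal sketch of that bookkeeping. `chat_turn` and `reply_fn` are hypothetical names, and the `"assistant"` role follows the common chat-message convention rather than anything confirmed for QuantLLM. A stand-in reply function keeps the example runnable without a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(
    history: List[Message],
    user_text: str,
    reply_fn: Callable[[List[Message]], str],
) -> str:
    """Append the user message, obtain a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = reply_fn(history)
    # Assumption: replies are stored under the conventional "assistant" role.
    history.append({"role": "assistant", "content": reply})
    return reply


if __name__ == "__main__":
    # Stand-in for the model so the sketch runs anywhere.
    def fake_reply(messages: List[Message]) -> str:
        return f"(reply to: {messages[-1]['content']})"

    history = [{"role": "system", "content": "You are a helpful coding assistant."}]
    print(chat_turn(history, "How do I read a file in Python?", fake_reply))
    print(chat_turn(history, "And how do I write one?", fake_reply))
```

With a loaded model, `reply_fn` could be something like `lambda msgs: model.chat(msgs, max_new_tokens=200)`, assuming `model.chat` returns the reply as a string, as the `print(response)` call above suggests.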

Streaming Output

for token in model.generate("Count to 10:", stream=True):
    print(token, end="", flush=True)
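
If you also need the full text once streaming finishes, you can collect the tokens while printing them. The helper below is a sketch, not a QuantLLM function: `stream_and_collect` is a hypothetical name, and it assumes the stream yields plain strings, as the loop above implies.

```python
from typing import Iterable


def stream_and_collect(tokens: Iterable[str]) -> str:
    """Print each token as it arrives and return the concatenated text."""
    parts = []
    for token in tokens:
        print(token, end="", flush=True)
        parts.append(token)
    print()  # end the line once the stream is exhausted
    return "".join(parts)


if __name__ == "__main__":
    # A plain list stands in for the token stream here.
    text = stream_and_collect(["1", ", ", "2", ", ", "3"])
    print(f"Collected {len(text)} characters")
```

With a loaded model, the argument would be `model.generate("Count to 10:", stream=True)`.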

Export to Different Formats

GGUF (llama.cpp, Ollama, LM Studio)

# Export with recommended Q4_K_M quantization
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Other quantization options
model.export("gguf", "model.Q8_0.gguf", quantization="Q8_0")   # Higher quality
model.export("gguf", "model.Q2_K.gguf", quantization="Q2_K")   # Smallest size
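
To produce several quantization levels in one pass, the export call can be wrapped in a loop. This sketch uses only the `model.export("gguf", path, quantization=...)` call shape shown above. `export_gguf_variants` and `RecordingModel` are hypothetical names, and the stand-in class records calls instead of writing files so the example runs without a model.

```python
from typing import Iterable, List


def export_gguf_variants(model, base_name: str, quant_types: Iterable[str]) -> List[str]:
    """Export one GGUF file per quantization type and return the file names."""
    paths = []
    for quant in quant_types:
        path = f"{base_name}.{quant}.gguf"  # e.g. model.Q4_K_M.gguf
        model.export("gguf", path, quantization=quant)
        paths.append(path)
    return paths


class RecordingModel:
    """Stand-in that records export calls instead of writing files."""

    def __init__(self):
        self.calls = []

    def export(self, fmt, path, quantization=None):
        self.calls.append((fmt, path, quantization))


if __name__ == "__main__":
    dummy = RecordingModel()
    for name in export_gguf_variants(dummy, "model", ["Q4_K_M", "Q8_0", "Q2_K"]):
        print(name)
```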

ONNX (ONNX Runtime, TensorRT)

model.export("onnx", "./model-onnx/")

MLX (Apple Silicon)

model.export("mlx", "./model-mlx/", quantization="4bit")

SafeTensors (HuggingFace)

model.export("safetensors", "./model-hf/")

Fine-Tune Your Model

Train with your own data in one line:

# Simple training
model.finetune("training_data.json", epochs=3)

# With more control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=16,
    batch_size=4,
)

Supported data formats:

[
  {"instruction": "What is Python?", "output": "Python is a programming language..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question here", "completion": "Answer here"}
]
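
Before starting a training run, it can help to check that every record matches one of these shapes. The validator below is a sketch under assumptions: `invalid_records` is a hypothetical helper, the key sets are copied from the three example records, and extra keys are tolerated, which QuantLLM itself may or may not allow.

```python
import json
from typing import Any, List

# Key sets taken from the three example records above.
RECORD_SHAPES = [
    {"instruction", "output"},
    {"text"},
    {"prompt", "completion"},
]


def invalid_records(records: List[Any]) -> List[int]:
    """Return the indexes of records that match none of the known shapes."""
    bad = []
    for i, record in enumerate(records):
        keys = set(record) if isinstance(record, dict) else set()
        # A record is accepted if it contains all keys of at least one shape.
        if not any(shape <= keys for shape in RECORD_SHAPES):
            bad.append(i)
    return bad


if __name__ == "__main__":
    records = json.loads("""
    [
      {"instruction": "What is Python?", "output": "A programming language."},
      {"text": "Full text for language modeling"},
      {"prompt": "Question here", "completion": "Answer here"},
      {"question": "Unrecognized shape"}
    ]
    """)
    print("Invalid record indexes:", invalid_records(records))
```

For a real file, load it with `json.load` and run the same check before calling `model.finetune`.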

Push to HuggingFace

Share your model with the world:

# Push with auto-generated model card
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.push(
    "your-username/my-awesome-model",
    license="apache-2.0"
)

The model card includes:

  • ✅ Proper YAML frontmatter for HuggingFace

  • ✅ Format-specific usage examples

  • ✅ “Use this model” button compatibility

  • ✅ Quantization details


Configuration Options

Override Auto-Detection

model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,                    # Force 4-bit quantization
    max_length=4096,           # Context length
    device="cuda:0",           # Specific GPU
    dtype="bfloat16",          # Data type
)

View Current Configuration

print(model.config)

Load GGUF Models

Load pre-quantized GGUF models directly:

from quantllm import TurboModel

model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)

print(model.generate("Hello!"))
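
GGUF repositories often ship several files, and the quantization level is usually encoded in the file name, as in the example above. The sketch below pulls that tag out so you can pick the file you want. `quant_tag_from_filename` is a hypothetical helper, it relies on a naming convention rather than on the file contents, and it only recognizes `Q`-style tags.

```python
import re
from typing import Optional

# Matches tags such as Q4_K_M, Q8_0, or Q2_K right before the .gguf extension.
_QUANT_TAG = re.compile(r"\.(Q\d+_[A-Z0-9]+(?:_[A-Z]+)?)\.gguf$", re.IGNORECASE)


def quant_tag_from_filename(filename: str) -> Optional[str]:
    """Return the quantization tag embedded in a GGUF file name, if any."""
    match = _QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    for name in ("llama-2-7b-chat.Q4_K_M.gguf", "model.Q8_0.gguf", "model.gguf"):
        print(name, "->", quant_tag_from_filename(name))
```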

Show the Banner

Display the QuantLLM banner anytime:

import quantllm

quantllm.show_banner()

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║   🚀 QuantLLM v2.1.0rc1                                    ║
║   Ultra-fast LLM Quantization & Export                     ║
║                                                            ║
║   ✓ GGUF  ✓ ONNX  ✓ MLX  ✓ SafeTensors                     ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

Next Steps

Now that you know the basics, explore more: