Quick Start
Get up and running with QuantLLM in 5 minutes.
Your First Model
from quantllm import turbo
# Load any HuggingFace model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("Explain machine learning in simple terms")
print(response)
That's it! QuantLLM automatically:
✓ Detects your GPU and available memory
✓ Applies optimal 4-bit quantization
✓ Enables Flash Attention 2 when available
✓ Configures memory management
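To see why 4-bit quantization matters when fitting a model on a given GPU, a back-of-the-envelope estimate of weight memory can help. The sketch below is not part of QuantLLM: `estimate_weight_memory_gb` is a hypothetical helper, the 3-billion parameter count is a round-number assumption for a 3B-class model, and the estimate ignores quantization scales, the KV cache, and activations.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9

# Assumed round figure for a 3B-class model; the real count differs slightly.
NUM_PARAMS = 3e9

for bits in (16, 8, 4):
    gb = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, 16-bit weights need about 6 GB and 4-bit weights about 1.5 GB, before any runtime overhead.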
Basic Usage
Generate Text
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
Chat Mode
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages, max_new_tokens=200)
print(response)
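For a multi-turn conversation, the usual pattern is to append each reply to the message list before the next call. The sketch below assumes `model.chat` returns the reply as a string (consistent with the `print(response)` above) and that the assistant role is named `"assistant"`. `chat_turn` is a hypothetical helper, and a stand-in function replaces the model so the snippet runs on its own.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def chat_turn(
    chat_fn: Callable[[List[Message]], str],
    history: List[Message],
    user_text: str,
) -> str:
    """Send one user message and record both sides of the exchange in history."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stand-in for the model so this example is self-contained.
def fake_chat(messages: List[Message]) -> str:
    return f"(reply to: {messages[-1]['content']})"

history: List[Message] = [
    {"role": "system", "content": "You are a helpful coding assistant."},
]
chat_turn(fake_chat, history, "How do I read a file in Python?")
chat_turn(fake_chat, history, "And how do I write to one?")
print(len(history))  # system message plus two user/assistant pairs
```

With a real model, something like `lambda msgs: model.chat(msgs, max_new_tokens=200)` could be passed as `chat_fn`.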
Streaming Output
for token in model.generate("Count to 10:", stream=True):
    print(token, end="", flush=True)
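If the full text is also needed after streaming, the chunks can be collected while they are printed. This sketch assumes `stream=True` yields plain string chunks, as the loop above suggests. `stream_and_collect` is a hypothetical helper, and a small generator stands in for the model.

```python
from typing import Iterable, Iterator

def stream_and_collect(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)

# Stand-in for model.generate(..., stream=True).
def fake_stream() -> Iterator[str]:
    yield from ["1", ", ", "2", ", ", "3"]

full_text = stream_and_collect(fake_stream())
```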
Export to Different Formats
GGUF (llama.cpp, Ollama, LM Studio)
# Export with recommended Q4_K_M quantization
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
# Other quantization options
model.export("gguf", "model.Q8_0.gguf", quantization="Q8_0") # Higher quality
model.export("gguf", "model.Q2_K.gguf", quantization="Q2_K") # Smallest size
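A quick sanity check on an exported file is to read its header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit version number. The sketch below uses only the standard library. `read_gguf_version` is a hypothetical helper, not a QuantLLM function, and the demo writes a tiny fake header instead of using a real model file.

```python
import os
import struct
import tempfile

def read_gguf_version(path: str) -> int:
    """Return the GGUF version from the file header, or raise ValueError."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version

# Demo with a fake 8-byte header (magic + version 3), not a real model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
    demo_path = tmp.name

demo_version = read_gguf_version(demo_path)
print(demo_version)
os.remove(demo_path)  # clean up the temporary demo file
```

On a real export, the same helper could be pointed at a path such as `model.Q4_K_M.gguf`.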
ONNX (ONNX Runtime, TensorRT)
model.export("onnx", "./model-onnx/")
MLX (Apple Silicon)
model.export("mlx", "./model-mlx/", quantization="4bit")
SafeTensors (HuggingFace)
model.export("safetensors", "./model-hf/")
Fine-Tune Your Model
Train with your own data in one line:
# Simple training
model.finetune("training_data.json", epochs=3)
# With more control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=16,
    batch_size=4,
)
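To get a feel for how `epochs` and `batch_size` translate into training length, the number of optimizer steps can be estimated up front. This sketch assumes one optimizer step per batch (no gradient accumulation) and a hypothetical dataset of 1,000 examples. `total_training_steps` is an illustrative helper, not a QuantLLM function.

```python
import math

def total_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Estimate optimizer steps, counting a final partial batch as one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical dataset size; batch_size and epochs match the call above.
steps = total_training_steps(num_examples=1000, batch_size=4, epochs=5)
print(steps)
```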
Supported record formats (one example of each):
[
    {"instruction": "What is Python?", "output": "Python is a programming language..."},
    {"text": "Full text for language modeling"},
    {"prompt": "Question here", "completion": "Answer here"}
]
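Before starting a training run, it can be useful to check that every record matches one of the shapes above. The sketch below is a standalone validator, not part of QuantLLM. `classify_record`, `find_invalid`, and the shape names are hypothetical, and it assumes a record must carry all keys of one shape with non-empty string values.

```python
import json
from typing import Dict, List, Optional

# Key sets taken from the three example records above; names are illustrative.
RECORD_SHAPES = {
    "instruction": ("instruction", "output"),
    "text": ("text",),
    "prompt": ("prompt", "completion"),
}

def classify_record(record: Dict[str, object]) -> Optional[str]:
    """Return the name of the first matching shape, or None if none match."""
    for name, keys in RECORD_SHAPES.items():
        if all(isinstance(record.get(k), str) and record[k] for k in keys):
            return name
    return None

def find_invalid(records: List[Dict[str, object]]) -> List[int]:
    """Return the indices of records that match no known shape."""
    return [i for i, r in enumerate(records) if classify_record(r) is None]

sample = json.loads("""
[
  {"instruction": "What is Python?", "output": "Python is a programming language..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question here", "completion": "Answer here"},
  {"prompt": "Missing completion"}
]
""")
invalid_indices = find_invalid(sample)
print(invalid_indices)  # the last record lacks a "completion" key
```

To check a file on disk, load it with `json.load` and pass the resulting list to `find_invalid`.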
Push to HuggingFace
Share your model with the world:
# Push with auto-generated model card
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.push(
    "your-username/my-awesome-model",
    license="apache-2.0",
)
The model card includes:
✓ Proper YAML frontmatter for HuggingFace
✓ Format-specific usage examples
✓ "Use this model" button compatibility
✓ Quantization details
Configuration Options
Override Auto-Detection
model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,            # Force 4-bit quantization
    max_length=4096,   # Context length
    device="cuda:0",   # Specific GPU
    dtype="bfloat16",  # Data type
)
View Current Configuration
print(model.config)
Load GGUF Models
Load pre-quantized GGUF models directly:
from quantllm import TurboModel
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
)
print(model.generate("Hello!"))
Next Steps
Now that you know the basics, explore more:
Loading Models: Advanced model loading options
Text Generation: Generation parameters and modes
GGUF Export: All quantization types explained
Fine-tuning: Training with LoRA
Hub Integration: Push and pull from HuggingFace
API Reference: Full API documentation