📦 GGUF Export
Export models to GGUF format for deployment with llama.cpp, Ollama, and LM Studio.
Quick Export
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Export with recommended Q4_K_M quantization
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
No llama.cpp compilation required! QuantLLM handles everything automatically.
Quantization Types
Choose the right quantization for your needs:
| Type | Bits | Quality | Size | Use Case |
|---|---|---|---|---|
| | 2-bit | Low | Smallest | Extreme compression |
| | 3-bit | Fair | Very small | Memory constrained |
| | 3-bit | Fair | Small | Balanced for 3-bit |
| | 4-bit | Good | Small | Slightly smaller Q4 |
| | 4-bit | Good | Medium | Recommended ⭐ |
| | 5-bit | High | Medium | Quality-focused |
| | 5-bit | High | Medium | Best 5-bit balance |
| | 6-bit | Very High | Large | Near-original |
| | 8-bit | Excellent | Largest | Maximum quality |
| | 16-bit | Original | Full size | Reference |
Size Comparison (7B Model)
| Quantization | Size | Quality Loss |
|---|---|---|
| F16 | ~14 GB | 0% |
| Q8_0 | ~7 GB | <1% |
| Q5_K_M | ~5 GB | ~2% |
| Q4_K_M | ~4 GB | ~3% |
| Q3_K_M | ~3 GB | ~5% |
| Q2_K | ~2 GB | ~10% |
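The sizes above follow roughly from parameter count times bits per weight, divided by 8. The sketch below reproduces that arithmetic. It assumes the nominal bit widths from the quantization table; real GGUF files tend to be somewhat larger because K-quant formats store per-block scales and keep some tensors at higher precision. `estimate_size_gb` and `NOMINAL_BITS` are hypothetical names used for illustration and are not part of QuantLLM.

```python
def estimate_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes): parameters x bits / 8."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal bit widths (an assumption for illustration; actual formats use slightly more).
NOMINAL_BITS = {"F16": 16, "Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

# Estimate each quantization type for a 7B-parameter model.
for name, bits in NOMINAL_BITS.items():
    print(f"{name}: ~{estimate_size_gb(7e9, bits):.1f} GB")
```

The same function can be called with another parameter count (for example `3e9` for a 3B model) to get a first-order estimate before exporting.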
Export Examples
Different Quantization Types
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
# Recommended for most use cases
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
# Higher quality
model.export("gguf", "model.Q5_K_M.gguf", quantization="Q5_K_M")
model.export("gguf", "model.Q8_0.gguf", quantization="Q8_0")
# Smaller size
model.export("gguf", "model.Q3_K_M.gguf", quantization="Q3_K_M")
model.export("gguf", "model.Q2_K.gguf", quantization="Q2_K")
# Full precision (largest)
model.export("gguf", "model.F16.gguf", quantization="F16")
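After an export, it can be useful to confirm that the output really is a GGUF file before deploying it. The sketch below reads the fixed-size file header using only the standard library. It assumes a little-endian file in GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata entry count. `read_gguf_header` is a hypothetical helper and not a QuantLLM function.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header and return its fields."""
    with open(path, "rb") as f:
        header = f.read(24)
    # A valid file starts with the ASCII magic "GGUF".
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata entry count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_entries": kv_count}
```

Calling `read_gguf_header("model.Q4_K_M.gguf")` on an exported file should return a small dictionary. A `ValueError` suggests the export was interrupted or the path points at a different file type.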
Using Exported Models
With llama.cpp
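# Clone and build llama.cpp (current releases build with CMake rather than make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Run your model
./build/bin/llama-cli -m model.Q4_K_M.gguf -p "Hello, world!" -n 100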
With Ollama
# Create a Modelfile
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
# Create the model
ollama create mymodel -f Modelfile
# Run
ollama run mymodel
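The Modelfile can also be generated from Python, which helps when exporting several quantizations in one script. The sketch below builds the file contents from a GGUF path and optional `PARAMETER` lines. `build_modelfile` is a hypothetical helper, and the `temperature` and `num_ctx` values are illustrative choices, not recommendations from QuantLLM.

```python
from typing import Optional


def build_modelfile(gguf_path: str, parameters: Optional[dict] = None) -> str:
    """Return Modelfile text with a FROM line and one PARAMETER line per entry."""
    lines = [f"FROM {gguf_path}"]
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


# Example: point Ollama at the exported file and set two runtime parameters.
content = build_modelfile("./model.Q4_K_M.gguf", {"temperature": 0.7, "num_ctx": 4096})
print(content)
```

Writing `content` to a file named `Modelfile` gives the same starting point as the `echo` command, after which `ollama create` works as shown.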
With LM Studio
Open LM Studio
Go to "My Models" → "Import"
Select your .gguf file
Start chatting!
With Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm(
"Write a poem about the ocean:",
max_tokens=100,
echo=True
)
print(output["choices"][0]["text"])
Push to HuggingFace
Export and push in one step:
from quantllm import turbo
model = turbo(
"meta-llama/Llama-3.2-3B",
config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.push(
"your-username/my-model-gguf",
license="apache-2.0"
)
The model card is automatically generated with:
Usage examples for llama.cpp, Ollama, LM Studio
Quantization details
"Use this model" button compatibility
Direct Conversion
Convert any HuggingFace model without first loading it into a TurboModel:
from quantllm import convert_to_gguf
convert_to_gguf(
model_path="meta-llama/Llama-3.2-3B",
output_path="model.Q4_K_M.gguf",
quant_type="Q4_K_M",
verbose=True,
)
Quantize Existing GGUF
Re-quantize a GGUF file to a different type:
from quantllm import quantize_gguf
quantize_gguf(
input_path="model.F16.gguf",
output_path="model.Q4_K_M.gguf",
quant_type="Q4_K_M"
)
List Available Quantization Types
from quantllm import GGUF_QUANT_TYPES, QUANT_RECOMMENDATIONS
# All available types
print(GGUF_QUANT_TYPES)
# Recommendations
print(QUANT_RECOMMENDATIONS)
Troubleshooting
BitsAndBytes Models
If you loaded a model with BitsAndBytes quantization:
from quantllm import turbo
# This works: QuantLLM dequantizes automatically before export
model = turbo("model-name", bits=4)
model.export("gguf", "model.gguf", quantization="Q4_K_M")
Large Models
For very large models:
# Note: `chunked_conversion=True` supersedes the earlier `streaming=True` guidance.
# `streaming=True` has no effect here, so replace it with `chunked_conversion=True`.
# Use lower quantization
model.export("gguf", "model.Q3_K_M.gguf", quantization="Q3_K_M")
# Enable chunked conversion + smart ordering
model.export(
"gguf",
"model.gguf",
quantization="Q4_K_M",
chunked_conversion=True,
max_shard_size="2GB",
smart_tensor_ordering=True,
)
# Force intermediate files to a dedicated disk offload directory
model.export(
"gguf",
"model.gguf",
quantization="Q4_K_M",
disk_offloading=True,
disk_offload_dir="./quantllm_offload",
)
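Chunked conversion and disk offloading write intermediate files, so a large export can fail partway through if the target disk fills up. The sketch below checks free space with the standard library before starting. `has_free_space` is a hypothetical helper, and the amount of space an export actually needs is an assumption that depends on the model and settings.

```python
import shutil


def has_free_space(directory: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `directory` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1e9


# Example: check the current directory for 20 GB (an illustrative figure, not a QuantLLM requirement).
if not has_free_space(".", 20):
    print("Warning: less than 20 GB free; consider a different offload directory.")
```

The directory passed in must already exist, so create the offload directory first or check its parent.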
Windows Issues
If you encounter issues on Windows:
Install Visual C++ Build Tools
Ensure Python 3.10+ is installed
Try running as administrator
Best Practices
Use Q4_K_M for most deployments (best quality/size balance)
Use Q5_K_M or Q6_K for quality-critical applications
Use Q2_K or Q3_K_M only when size is critical
Test output quality after quantization
Keep the F16 version as a reference
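When file size is the binding constraint, one way to apply these guidelines is to pick the highest-quality type that fits a size budget. The sketch below does this with the approximate 7B sizes from the size comparison table. `SIZES_7B_GB` and `pick_quant` are hypothetical names for illustration, and the figures would need adjusting for other model sizes.

```python
from typing import Optional

# Approximate 7B file sizes in GB from the size comparison table,
# ordered from highest to lowest quality.
SIZES_7B_GB = [
    ("F16", 14),
    ("Q8_0", 7),
    ("Q5_K_M", 5),
    ("Q4_K_M", 4),
    ("Q3_K_M", 3),
    ("Q2_K", 2),
]


def pick_quant(budget_gb: float) -> Optional[str]:
    """Return the highest-quality type whose approximate size fits the budget, or None."""
    for name, size_gb in SIZES_7B_GB:
        if size_gb <= budget_gb:
            return name
    return None


print(pick_quant(4.5))
```

A result of `None` means even Q2_K does not fit, in which case a smaller base model is likely a better option than further compression.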
Next Steps
Hub Integration → Push to HuggingFace
Other Export Formats → ONNX, MLX, SafeTensors
API Reference → Full GGUF API