# 📦 GGUF Export

Export models to GGUF format for deployment with llama.cpp, Ollama, and LM Studio.

---

## Quick Export

```python
from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Export with recommended Q4_K_M quantization
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")
```

**No llama.cpp compilation required!** QuantLLM handles the conversion and quantization automatically.

---

## Quantization Types

Choose the quantization type that fits your memory budget and quality target:

| Type | Bits | Quality | Size | Use Case |
|------|------|---------|------|----------|
| `Q2_K` | 2-bit | Low | Smallest | Extreme compression |
| `Q3_K_S` | 3-bit | Fair | Very small | Memory constrained |
| `Q3_K_M` | 3-bit | Fair | Small | Balanced for 3-bit |
| `Q4_K_S` | 4-bit | Good | Small | Slightly smaller Q4 |
| `Q4_K_M` | 4-bit | Good | Medium | **Recommended** ⭐ |
| `Q5_K_S` | 5-bit | High | Medium | Quality-focused |
| `Q5_K_M` | 5-bit | High | Medium | Best 5-bit balance |
| `Q6_K` | 6-bit | Very High | Large | Near-original |
| `Q8_0` | 8-bit | Excellent | Largest (quantized) | Maximum quality |
| `F16` | 16-bit | Original | Full size | Reference |

### Size Comparison (7B Model)

Sizes and quality-loss figures are approximate and vary by model.

| Quantization | Size | Quality Loss |
|--------------|------|--------------|
| F16 | ~14 GB | 0% |
| Q8_0 | ~7 GB | <1% |
| Q5_K_M | ~5 GB | ~2% |
| Q4_K_M | ~4 GB | ~3% |
| Q3_K_M | ~3 GB | ~5% |
| Q2_K | ~2 GB | ~10% |

---

## Export Examples

### Different Quantization Types

```python
from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

# Recommended for most use cases
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Higher quality
model.export("gguf", "model.Q5_K_M.gguf", quantization="Q5_K_M")
model.export("gguf", "model.Q8_0.gguf", quantization="Q8_0")

# Smaller size
model.export("gguf", "model.Q3_K_M.gguf", quantization="Q3_K_M")
model.export("gguf", "model.Q2_K.gguf", quantization="Q2_K")

# Full precision (largest)
model.export("gguf", "model.F16.gguf", quantization="F16")
```

---

## Using Exported Models

### With llama.cpp

```bash
# Download and build llama.cpp (current releases build with CMake; the old `make` build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run your model (adjust the path to wherever you exported the file)
./build/bin/llama-cli -m model.Q4_K_M.gguf -p "Hello, world!" -n 100
```

### With Ollama

```bash
# Create a Modelfile
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile

# Create the model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel
```

### With LM Studio

1. Open LM Studio
2. Go to "My Models" → "Import"
3. Select your `.gguf` file
4. Start chatting!

### With Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")

output = llm(
    "Write a poem about the ocean:",
    max_tokens=100,
    echo=True,  # include the prompt in the returned text
)

print(output["choices"][0]["text"])
```

---

## Push to HuggingFace

Configure the export format when loading the model, then push:

```python
from quantllm import turbo

model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)

model.push(
    "your-username/my-model-gguf",
    license="apache-2.0"
)
```

The model card is generated automatically and includes:

- Usage examples for llama.cpp, Ollama, LM Studio
- Quantization details
- "Use this model" button compatibility

---

## Direct Conversion

Convert any HuggingFace model without loading it into a TurboModel:

```python
from quantllm import convert_to_gguf

convert_to_gguf(
    model_path="meta-llama/Llama-3.2-3B",
    output_path="model.Q4_K_M.gguf",
    quant_type="Q4_K_M",
    verbose=True,
)
```

---

## Quantize Existing GGUF

Re-quantize a GGUF file to a different type:

```python
from quantllm import quantize_gguf

quantize_gguf(
    input_path="model.F16.gguf",
    output_path="model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
```

---

## List Available Quantization Types

```python
from quantllm import GGUF_QUANT_TYPES, QUANT_RECOMMENDATIONS

# All available types
print(GGUF_QUANT_TYPES)

# Recommendations
print(QUANT_RECOMMENDATIONS)
```

---

## Troubleshooting

### BitsAndBytes Models

If you loaded a model with BitsAndBytes quantization:

```python
from quantllm import turbo

# This works - QuantLLM dequantizes automatically
model = turbo("model-name", bits=4)
model.export("gguf", "model.gguf", quantization="Q4_K_M")
```

### Large Models

For very large models:

```python
# Note: previous `streaming=True` guidance is superseded by `chunked_conversion=True`.
# If you previously used `streaming=True`, replace it with `chunked_conversion=True` (streaming has no effect here).

# `model` is a TurboModel loaded with turbo(...) as in the examples above.

# Use lower quantization
model.export("gguf", "model.Q3_K_M.gguf", quantization="Q3_K_M")

# Enable chunked conversion + smart ordering
model.export(
    "gguf",
    "model.gguf",
    quantization="Q4_K_M",
    chunked_conversion=True,
    max_shard_size="2GB",
    smart_tensor_ordering=True,
)

# Force intermediate files to a dedicated disk offload directory
model.export(
    "gguf",
    "model.gguf",
    quantization="Q4_K_M",
    disk_offloading=True,
    disk_offload_dir="./quantllm_offload",
)
```

### Windows Issues

If you encounter issues on Windows:

1. Install Visual C++ Build Tools
2. Ensure Python 3.10+ is installed
3. Try running as administrator

---

## Best Practices

1. **Use Q4_K_M** for most deployments (best quality/size balance)
2. **Use Q5_K_M or Q6_K** for quality-critical applications
3. **Use Q2_K or Q3_K_M** only when size is critical
4. **Test output quality** after quantization
5. **Keep the F16 version** as a reference

---

## Next Steps

- [Hub Integration →](hub-integration.md): Push to HuggingFace
- [Other Export Formats →](../quickstart.md#export-to-different-formats): ONNX, MLX, SafeTensors
- [API Reference →](../api/gguf.md): Full GGUF API
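
---

## Appendix: Estimating File Size

The figures in the Size Comparison (7B Model) table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces that calculation using the nominal bit widths from the Quantization Types table. `NOMINAL_BITS` and `estimate_size_gb` are illustrative names, not part of QuantLLM, and the result is only a lower bound: real GGUF files come out somewhat larger because block-quantized formats also store per-block scales and usually keep some tensors at higher precision.

```python
# Nominal bits per weight, copied from the "Bits" column of the
# Quantization Types table. Assumption: these ignore per-block scale
# overhead, so the estimates are lower bounds, not exact file sizes.
NOMINAL_BITS = {
    "Q2_K": 2,
    "Q3_K_M": 3,
    "Q4_K_M": 4,
    "Q5_K_M": 5,
    "Q6_K": 6,
    "Q8_0": 8,
    "F16": 16,
}


def estimate_size_gb(num_params: float, quant_type: str) -> float:
    """Return a lower-bound size in GB (10**9 bytes) for the given weight count."""
    bits = NOMINAL_BITS[quant_type]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    # Print estimates for a 7-billion-parameter model, largest first.
    for quant_type in ("F16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"):
        size = estimate_size_gb(7e9, quant_type)
        print(f"{quant_type:8s} >= {size:.2f} GB")
```

To estimate a different model, pass its parameter count (for example `3e9` for a 3B model) and compare the result with your available RAM or VRAM before exporting.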