# 📦 GGUF API

Export models to GGUF format for llama.cpp, Ollama, and LM Studio.

---

## Quick Reference

```python
from quantllm import turbo, convert_to_gguf, quantize_gguf

# Method 1: Via TurboModel
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export("gguf", "model.Q4_K_M.gguf")

# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
```

---

## convert_to_gguf()

Convert a HuggingFace model to GGUF format.

```python
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_path` | str | required | HuggingFace model name or local path |
| `output_path` | str | required | Output .gguf file path |
| `quant_type` | str | "Q4_K_M" | Quantization type |
| `model_dtype` | str | "auto" | Model dtype (auto, f16, f32) |
| `verbose` | bool | True | Show progress |

### Returns

Path to the created GGUF file.

### Example

```python
from quantllm import convert_to_gguf

# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)

# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0"
)
```

---

## quantize_gguf()

Re-quantize an existing GGUF file to a different quantization type.

```python
def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_path` | str | required | Input GGUF file path |
| `output_path` | str | required | Output GGUF file path |
| `quant_type` | str | "Q4_K_M" | Target quantization type |

### Example

```python
from quantllm import quantize_gguf

# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
```

---

## GGUF_QUANT_TYPES

Available quantization types.

```python
from quantllm import GGUF_QUANT_TYPES

print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M',
#  'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
```

### Quantization Comparison

| Type | Bits | Quality | Size (7B) | Use Case |
|------|------|---------|-----------|----------|
| `Q2_K` | 2 | Low | ~2 GB | Extreme compression |
| `Q3_K_S` | 3 | Fair | ~2.5 GB | Small devices |
| `Q3_K_M` | 3 | Fair | ~3 GB | Constrained memory |
| `Q4_K_S` | 4 | Good | ~3.5 GB | Balanced (smaller) |
| `Q4_K_M` | 4 | Good | ~4 GB | **Recommended** ⭐ |
| `Q5_K_S` | 5 | High | ~4.5 GB | Quality focus |
| `Q5_K_M` | 5 | High | ~5 GB | Quality balance |
| `Q6_K` | 6 | Very High | ~5.5 GB | Near original |
| `Q8_0` | 8 | Excellent | ~7 GB | Highest-quality quantized option |
| `F16` | 16 | Original | ~14 GB | Unquantized (half precision) |

`Q3_K_L` and `F32` appear in `GGUF_QUANT_TYPES` but are not covered in this table.

---

## QUANT_RECOMMENDATIONS

Recommended quantization types by available VRAM.

```python
from quantllm import QUANT_RECOMMENDATIONS

print(QUANT_RECOMMENDATIONS)
# {
#     'low_memory': 'Q3_K_M',      # <6 GB VRAM
#     'balanced': 'Q4_K_M',        # 6-12 GB VRAM (recommended)
#     'quality': 'Q5_K_M',         # 12-24 GB VRAM
#     'high_quality': 'Q6_K',      # >24 GB VRAM
#     'maximum': 'Q8_0',           # Maximum quality
# }
```

---

## check_llama_cpp()

Check if llama.cpp is installed.

```python
def check_llama_cpp() -> bool
```

### Example

```python
from quantllm import check_llama_cpp

if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
```

---

## install_llama_cpp()

Install llama.cpp automatically.

```python
def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `install_dir` | str | "./llama.cpp" | Installation directory |
| `force` | bool | False | Force reinstall |

### Example

```python
from quantllm import install_llama_cpp

# Install to default location
install_llama_cpp()

# Install to custom location
install_llama_cpp("./tools/llama.cpp")
```

---

## ensure_llama_cpp_installed()

Ensure llama.cpp is installed, installing if needed.

```python
def ensure_llama_cpp_installed() -> str
```

### Example

```python
from quantllm import ensure_llama_cpp_installed

# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")
```

---

## export_to_gguf()

High-level export function (deprecated, use `convert_to_gguf`).

```python
def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
```

---

## Using Exported Models

### llama.cpp

```bash
./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100
```

### Ollama

```bash
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
```

### LM Studio

1. Import the `.gguf` file
2. Start chatting

### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
```

---

## See Also

- [GGUF Export Guide](../guide/gguf-export.md): Detailed guide
- [TurboModel.export()](turbomodel.md#export): Export via TurboModel
- [Hub Integration](hub.md): Push GGUF to HuggingFace
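
---

## Example: Picking a Quantization Type by VRAM

The VRAM tiers listed under `QUANT_RECOMMENDATIONS` can be turned into a small lookup. The sketch below is a minimal illustration, not part of quantllm: `pick_tier` and `pick_quant_type` are hypothetical helpers, the dictionary is copied by hand from the values shown on this page rather than imported, and the handling of the exact boundaries (12 GB and 24 GB fall into the lower tier) is an assumption, since the documented ranges overlap at those points.

```python
# Tier -> quantization type, copied from the QUANT_RECOMMENDATIONS listing
# on this page. With quantllm installed you would import the real constant.
RECOMMENDATIONS = {
    "low_memory": "Q3_K_M",    # <6 GB VRAM
    "balanced": "Q4_K_M",      # 6-12 GB VRAM
    "quality": "Q5_K_M",       # 12-24 GB VRAM
    "high_quality": "Q6_K",    # >24 GB VRAM
    "maximum": "Q8_0",         # no VRAM range documented; never auto-selected here
}


def pick_tier(vram_gb: float) -> str:
    """Map available VRAM (in GB) to a tier name (hypothetical helper)."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb < 6:
        return "low_memory"
    if vram_gb <= 12:   # assumption: exactly 12 GB stays in 'balanced'
        return "balanced"
    if vram_gb <= 24:   # assumption: exactly 24 GB stays in 'quality'
        return "quality"
    return "high_quality"


def pick_quant_type(vram_gb: float) -> str:
    """Return the quantization type string for the given VRAM (hypothetical helper)."""
    return RECOMMENDATIONS[pick_tier(vram_gb)]


if __name__ == "__main__":
    for vram in (4, 8, 16, 48):
        print(f"{vram} GB -> {pick_quant_type(vram)}")
```

The returned string could then be passed as the `quant_type` argument of `convert_to_gguf()` or `quantize_gguf()`, as in the examples earlier on this page.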