📦 GGUF API
Export models to GGUF format for llama.cpp, Ollama, and LM Studio.
Quick Reference
from quantllm import turbo, convert_to_gguf, quantize_gguf
# Method 1: Via TurboModel
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export("gguf", "model.Q4_K_M.gguf")
# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")
convert_to_gguf()
Convert a HuggingFace model to GGUF format.
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | HuggingFace model name or local path |
| output_path | str | required | Output .gguf file path |
| quant_type | str | "Q4_K_M" | Quantization type |
| model_dtype | str | "auto" | Model dtype (auto, f16, f32) |
| verbose | bool | True | Show progress |
Returns
Path to the created GGUF file.
Example
from quantllm import convert_to_gguf
# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M",
)
# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0",
)
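The examples above name output files as `<model>.<QUANT>.gguf`. A small helper can derive such a name from a model ID or local path. This is only a sketch: `gguf_filename` is a hypothetical name and is not part of quantllm.

```python
def gguf_filename(model_id: str, quant_type: str = "Q4_K_M") -> str:
    """Build '<model-name>.<QUANT>.gguf' from a Hub ID or a local path."""
    # Keep only the last path component, e.g. "Llama-3.2-3B".
    name = model_id.rstrip("/").split("/")[-1]
    return f"{name}.{quant_type}.gguf"


print(gguf_filename("meta-llama/Llama-3.2-3B"))
print(gguf_filename("meta-llama/Llama-3.2-3B", "Q8_0"))
```

The result can be passed as the `output_path` argument of `convert_to_gguf()`.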
quantize_gguf()
Re-quantize an existing GGUF file to a different quantization type.
def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str | required | Input GGUF file path |
| output_path | str | required | Output GGUF file path |
| quant_type | str | "Q4_K_M" | Target quantization type |
Example
from quantllm import quantize_gguf
# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M",
)
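When one F16 file should be re-quantized to several types, it can help to plan the input and output paths first. The sketch below only builds the path list; `requantize_plan` is a hypothetical helper, and the `quantize_gguf` call is left commented out so the snippet runs without quantllm installed.

```python
from pathlib import Path


def requantize_plan(f16_path: str, quant_types: list[str]) -> list[tuple[str, str, str]]:
    """Return (input, output, quant_type) triples for one source GGUF file."""
    src = Path(f16_path)
    # "model.F16.gguf" -> stem "model.F16" -> base "model"
    base = src.stem.rsplit(".", 1)[0]
    return [(str(src), str(src.with_name(f"{base}.{q}.gguf")), q) for q in quant_types]


for inp, out, q in requantize_plan("model.F16.gguf", ["Q4_K_M", "Q8_0"]):
    print(inp, "->", out)
    # quantize_gguf(inp, out, quant_type=q)  # run this with quantllm installed
```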
GGUF_QUANT_TYPES
Available quantization types.
from quantllm import GGUF_QUANT_TYPES
print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M',
# 'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
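A list like this is handy for validating user input before starting a long conversion. The sketch below hard-codes a copy of the printed list so it runs on its own; with quantllm installed, importing `GGUF_QUANT_TYPES` would be the natural choice. `normalize_quant_type` is a hypothetical helper, and whether quantllm itself accepts lowercase names is not something this page states.

```python
# Copied from the printed list above (an assumption that it is complete).
KNOWN_QUANT_TYPES = [
    "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0", "F16", "F32",
]


def normalize_quant_type(name: str) -> str:
    """Upper-case and validate a quantization type name."""
    candidate = name.strip().upper()
    if candidate not in KNOWN_QUANT_TYPES:
        raise ValueError(
            f"Unknown quant type {name!r}; expected one of {KNOWN_QUANT_TYPES}"
        )
    return candidate


print(normalize_quant_type("q4_k_m"))
```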
Quantization Comparison
| Type | Bits | Quality | Size (7B) | Use Case |
|---|---|---|---|---|
| Q2_K | 2 | Low | ~2 GB | Extreme compression |
| Q3_K_S | 3 | Fair | ~2.5 GB | Small devices |
| Q3_K_M | 3 | Fair | ~3 GB | Constrained memory |
| Q4_K_S | 4 | Good | ~3.5 GB | Balanced (smaller) |
| Q4_K_M | 4 | Good | ~4 GB | Recommended |
| Q5_K_S | 5 | High | ~4.5 GB | Quality focus |
| Q5_K_M | 5 | High | ~5 GB | Quality balance |
| Q6_K | 6 | Very High | ~5.5 GB | Near original |
| Q8_0 | 8 | Excellent | ~7 GB | Maximum quality |
| F16 | 16 | Original | ~14 GB | Full precision |
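The sizes in the table roughly follow parameters × bits ÷ 8. The sketch below reproduces that back-of-the-envelope arithmetic for a 7B model, taking 1 GB as 10^9 bytes. It is an approximation only: real files tend to be somewhat larger than the nominal figure, since K-quant variants mix precisions across tensors and the file also carries metadata. `estimate_size_gb` is a hypothetical helper.

```python
def estimate_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Nominal model size in GB: parameters * bits per weight / 8 bits per byte."""
    return n_params_billion * bits_per_weight / 8


for bits in (2, 4, 8, 16):
    print(f"{bits:>2} bits: ~{estimate_size_gb(7, bits):.2f} GB")
```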
QUANT_RECOMMENDATIONS
Preset quantization types keyed by hardware tier (available VRAM).
from quantllm import QUANT_RECOMMENDATIONS
print(QUANT_RECOMMENDATIONS)
# {
# 'low_memory': 'Q3_K_M', # <6 GB VRAM
# 'balanced': 'Q4_K_M', # 6-12 GB VRAM (recommended)
# 'quality': 'Q5_K_M', # 12-24 GB VRAM
# 'high_quality': 'Q6_K', # >24 GB VRAM
# 'maximum': 'Q8_0', # Maximum quality
# }
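The VRAM ranges in the comments above can be turned into a simple lookup. The sketch below copies the printed dictionary so it runs standalone; `recommend_quant` is a hypothetical helper, and how the exact boundaries (6, 12, 24 GB) are assigned is an assumption, since the ranges above do not say which side is inclusive. The `maximum` preset has no VRAM range attached, so it is never selected here.

```python
# Copied from the printed dictionary above.
RECOMMENDATIONS = {
    "low_memory": "Q3_K_M",
    "balanced": "Q4_K_M",
    "quality": "Q5_K_M",
    "high_quality": "Q6_K",
    "maximum": "Q8_0",
}


def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM (in GB) to a preset quantization type."""
    if vram_gb < 6:
        key = "low_memory"
    elif vram_gb <= 12:      # boundary handling is an assumption
        key = "balanced"
    elif vram_gb <= 24:
        key = "quality"
    else:
        key = "high_quality"
    return RECOMMENDATIONS[key]


print(recommend_quant(8))
```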
check_llama_cpp()
Check if llama.cpp is installed.
def check_llama_cpp() -> bool
Example
from quantllm import check_llama_cpp
if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
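For a check that does not depend on quantllm, one option is to look for the `llama-cli` binary on the `PATH`. This is a sketch under that assumption only; `check_llama_cpp()` may use a different mechanism (for example, looking for a local checkout), and `find_llama_cli` is a hypothetical helper.

```python
import shutil
from typing import Optional


def find_llama_cli(binary: str = "llama-cli") -> Optional[str]:
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


path = find_llama_cli()
print(path if path else "llama-cli not on PATH")
```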
install_llama_cpp()
Install llama.cpp automatically.
def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| install_dir | str | "./llama.cpp" | Installation directory |
| force | bool | False | Force reinstall |
Example
from quantllm import install_llama_cpp
# Install to default location
install_llama_cpp()
# Install to custom location
install_llama_cpp("./tools/llama.cpp")
ensure_llama_cpp_installed()
Ensure llama.cpp is available, installing it if it is not already present.
def ensure_llama_cpp_installed() -> str
Example
from quantllm import ensure_llama_cpp_installed
# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")
export_to_gguf()
High-level export function. Deprecated: use convert_to_gguf() instead.
def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str
Using Exported Models
llama.cpp
./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100
Ollama
echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
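When the export runs from a Python script, the Modelfile can be written there as well instead of with `echo`. A minimal sketch: `write_modelfile` is a hypothetical helper, and it emits only the single `FROM` line shown above. The demo writes into a temporary directory so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def write_modelfile(gguf_path: str, directory: str = ".") -> Path:
    """Write a one-line Ollama Modelfile pointing at a GGUF file."""
    modelfile = Path(directory) / "Modelfile"
    modelfile.write_text(f"FROM {gguf_path}\n")
    return modelfile


with tempfile.TemporaryDirectory() as tmp:
    created = write_modelfile("./model.Q4_K_M.gguf", tmp)
    print(created.read_text(), end="")
```

After writing the file next to the model, `ollama create mymodel -f Modelfile` works as shown above.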
LM Studio
Import the .gguf file
Start chatting
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
See Also
GGUF Export Guide: detailed guide
TurboModel.export(): export via TurboModel
Hub Integration: push GGUF to HuggingFace