📦 GGUF API

Export models to GGUF format for llama.cpp, Ollama, and LM Studio.

Quick Reference

from quantllm import turbo, convert_to_gguf, quantize_gguf

# Method 1: Via TurboModel
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export("gguf", "model.Q4_K_M.gguf")

# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

convert_to_gguf()

Convert a Hugging Face model to GGUF format.

def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str

Parameters

Parameter	Type	Default	Description
`model_path`	str	required	HuggingFace model name or local path
`output_path`	str	required	Output .gguf file path
`quant_type`	str	"Q4_K_M"	Quantization type
`model_dtype`	str	"auto"	Model dtype (auto, f16, f32)
`verbose`	bool	True	Show progress

Returns

Path to the created GGUF file.

Example

from quantllm import convert_to_gguf

# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)

# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0"
)
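
The examples above name output files as `<model>.<QUANT>.gguf`. A minimal sketch of a helper that builds such a name from a model ID and a quantization type is shown below. `gguf_filename` is a hypothetical name used for illustration and is not part of quantllm.

```python
from pathlib import PurePosixPath


def gguf_filename(model_id: str, quant_type: str) -> str:
    """Build '<model>.<QUANT>.gguf' from a Hub ID or local path (hypothetical helper)."""
    # Keep only the last path component: 'meta-llama/Llama-3.2-3B' -> 'Llama-3.2-3B'
    base = PurePosixPath(model_id.rstrip("/")).name
    if not base:
        raise ValueError(f"cannot derive a file name from {model_id!r}")
    # Quantization types are written in upper case throughout this page
    return f"{base}.{quant_type.upper()}.gguf"


print(gguf_filename("meta-llama/Llama-3.2-3B", "Q4_K_M"))
# Llama-3.2-3B.Q4_K_M.gguf
```

The result can be passed as `output_path` to `convert_to_gguf()`, so the file name always records which quantization was applied.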

quantize_gguf()

Re-quantize an existing GGUF file to a different quantization type.

def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str

Parameters

Parameter	Type	Default	Description
`input_path`	str	required	Input GGUF file path
`output_path`	str	required	Output GGUF file path
`quant_type`	str	"Q4_K_M"	Target quantization type

Example

from quantllm import quantize_gguf

# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
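
Before re-quantizing, it can help to confirm that the input really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below reads only that header. `read_gguf_version` is a hypothetical helper, not a quantllm function, and the demo writes a synthetic header so that it runs without a real model.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte little-endian version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demo on a synthetic 8-byte header (not a usable model file)
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.gguf")
    with open(demo, "wb") as f:
        f.write(GGUF_MAGIC + struct.pack("<I", 3))
    print(read_gguf_version(demo))
```

A check like this fails fast on a wrong path or a truncated download, before any quantization work starts.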

GGUF_QUANT_TYPES

Available quantization types.

from quantllm import GGUF_QUANT_TYPES

print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M', 
#  'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
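
One use for this list is validating a user-supplied quantization type before starting a long conversion. The sketch below copies the printed values into a local list so that it runs on its own. In real code you would import `GGUF_QUANT_TYPES` from quantllm instead. `normalize_quant_type` is a hypothetical helper.

```python
# Copied from the printed output above so the sketch is self-contained;
# in real code, import GGUF_QUANT_TYPES from quantllm instead.
KNOWN_QUANT_TYPES = [
    "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0", "F16", "F32",
]


def normalize_quant_type(name: str) -> str:
    """Upper-case and validate a quantization type (hypothetical helper)."""
    candidate = name.strip().upper()
    if candidate not in KNOWN_QUANT_TYPES:
        options = ", ".join(KNOWN_QUANT_TYPES)
        raise ValueError(f"unknown quant type {name!r}; expected one of: {options}")
    return candidate


print(normalize_quant_type(" q4_k_m "))
# Q4_K_M
```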

Quantization Comparison

Type	Bits	Quality	Size (7B)	Use Case
`Q2_K`	2	Low	~2 GB	Extreme compression
`Q3_K_S`	3	Fair	~2.5 GB	Small devices
`Q3_K_M`	3	Fair	~3 GB	Constrained memory
`Q4_K_S`	4	Good	~3.5 GB	Balanced (smaller)
`Q4_K_M`	4	Good	~4 GB	Recommended ⭐
`Q5_K_S`	5	High	~4.5 GB	Quality focus
`Q5_K_M`	5	High	~5 GB	Quality balance
`Q6_K`	6	Very High	~5.5 GB	Near original
`Q8_0`	8	Excellent	~7 GB	Maximum quality
`F16`	16	Original	~14 GB	Unquantized (half precision)
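
The sizes in the table roughly follow a simple rule: parameters × bits per weight ÷ 8 bytes. The sketch below applies that rule as a back-of-the-envelope estimate. It uses the nominal bit width only, so real files will differ somewhat: metadata adds a little, and the mixed-precision variants (for example `Q4_K_M`) store some tensors at a higher precision than the nominal width. `estimate_size_gb` is a hypothetical helper.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (10^9 bytes) from parameter count and nominal bit width."""
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# A 7B-parameter model at a few nominal bit widths
for label, bits in [("Q4 (nominal)", 4), ("Q8_0", 8), ("F16", 16)]:
    print(f"{label}: ~{estimate_size_gb(7e9, bits):.1f} GB")
```

For a 7B model this gives about 3.5 GB at 4 bits, 7 GB at 8 bits, and 14 GB at 16 bits, in line with the `Q4_K_S`, `Q8_0`, and `F16` rows above.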

QUANT_RECOMMENDATIONS

A dictionary that maps a hardware profile to a recommended quantization type.

from quantllm import QUANT_RECOMMENDATIONS

print(QUANT_RECOMMENDATIONS)
# {
#     'low_memory': 'Q3_K_M',      # <6 GB VRAM
#     'balanced': 'Q4_K_M',        # 6-12 GB VRAM (recommended)
#     'quality': 'Q5_K_M',         # 12-24 GB VRAM
#     'high_quality': 'Q6_K',      # >24 GB VRAM
#     'maximum': 'Q8_0',           # Maximum quality
# }
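
The VRAM ranges in the comments above can be turned into a lookup. The sketch below is one way to do that, assuming the dictionary has exactly the keys printed above. The values are copied locally so the example runs on its own. `recommend_quant` is a hypothetical helper, and how it treats the exact boundaries (6, 12, and 24 GB) is an assumption.

```python
# Copied from the printed output above; in real code, import
# QUANT_RECOMMENDATIONS from quantllm instead.
RECOMMENDATIONS = {
    "low_memory": "Q3_K_M",
    "balanced": "Q4_K_M",
    "quality": "Q5_K_M",
    "high_quality": "Q6_K",
    "maximum": "Q8_0",
}


def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM (GB) to a quantization type (hypothetical helper)."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb < 6:
        key = "low_memory"
    elif vram_gb <= 12:   # assumption: 12 GB counts as 'balanced'
        key = "balanced"
    elif vram_gb <= 24:   # assumption: 24 GB counts as 'quality'
        key = "quality"
    else:
        key = "high_quality"
    return RECOMMENDATIONS[key]


print(recommend_quant(8))
# Q4_K_M
```

The `maximum` entry has no VRAM range attached, so this sketch never selects it automatically.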

check_llama_cpp()

Check if llama.cpp is installed.

def check_llama_cpp() -> bool

Example

from quantllm import check_llama_cpp

if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
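
To see which llama.cpp executables are visible on your `PATH`, a standalone check can be written with the standard library alone. This is a sketch and not how `check_llama_cpp()` is implemented. quantllm may also look in its own installation directory, so the two checks can disagree. `find_binaries` is a hypothetical helper.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_binaries(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the resolved path for each executable, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# llama-cli and llama-quantize are executables shipped by llama.cpp
found = find_binaries(["llama-cli", "llama-quantize"])
for name, path in found.items():
    print(f"{name}: {path or 'not on PATH'}")
```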

install_llama_cpp()

Install llama.cpp automatically.

def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str

Parameters

Parameter	Type	Default	Description
`install_dir`	str	"./llama.cpp"	Installation directory
`force`	bool	False	Force reinstall

Example

from quantllm import install_llama_cpp

# Install to default location
install_llama_cpp()

# Install to custom location
install_llama_cpp("./tools/llama.cpp")

ensure_llama_cpp_installed()

Ensure llama.cpp is installed, installing it if needed, and return its path.

def ensure_llama_cpp_installed() -> str

Example

from quantllm import ensure_llama_cpp_installed

# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")

export_to_gguf()

High-level export function. Deprecated: use convert_to_gguf() instead.

def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str

Using Exported Models

llama.cpp

./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100

Ollama

echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
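
If you create Modelfiles from a script, the same content can be generated in Python. The sketch below renders a Modelfile using the `FROM`, `PARAMETER`, and `SYSTEM` instructions. `render_modelfile` is a hypothetical helper, and the temperature and system prompt values are placeholders.

```python
from typing import Optional


def render_modelfile(
    gguf_path: str,
    temperature: Optional[float] = None,
    system: Optional[str] = None,
) -> str:
    """Render Ollama Modelfile text for a local GGUF file (hypothetical helper)."""
    lines = [f"FROM {gguf_path}"]
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    if system is not None:
        # Triple quotes allow a multi-line system prompt
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


print(render_modelfile("./model.Q4_K_M.gguf", temperature=0.7), end="")
```

Write the returned text to a file named `Modelfile` and pass it to `ollama create` as shown above.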

LM Studio

1. Import the .gguf file
2. Start chatting

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
