📦 GGUF API

Export models to GGUF format for llama.cpp, Ollama, and LM Studio.


Quick Reference

from quantllm import turbo, convert_to_gguf, quantize_gguf

# Method 1: Via TurboModel
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export("gguf", "model.Q4_K_M.gguf")

# Method 2: Direct conversion
convert_to_gguf("meta-llama/Llama-3.2-3B", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

# Method 3: Re-quantize existing GGUF
quantize_gguf("model.F16.gguf", "model.Q4_K_M.gguf", quant_type="Q4_K_M")

convert_to_gguf()

Convert a HuggingFace model to GGUF format.

def convert_to_gguf(
    model_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
    model_dtype: str = "auto",
    verbose: bool = True,
) -> str

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | HuggingFace model name or local path |
| output_path | str | required | Output .gguf file path |
| quant_type | str | "Q4_K_M" | Quantization type |
| model_dtype | str | "auto" | Model dtype (auto, f16, f32) |
| verbose | bool | True | Show progress |

Returns

Path to the created GGUF file.

Example

from quantllm import convert_to_gguf

# Basic conversion
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)

# Higher quality
convert_to_gguf(
    "meta-llama/Llama-3.2-3B",
    "llama3.Q8_0.gguf",
    quant_type="Q8_0"
)

quantize_gguf()

Re-quantize an existing GGUF file to a different quantization type.

def quantize_gguf(
    input_path: str,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str | required | Input GGUF file path |
| output_path | str | required | Output GGUF file path |
| quant_type | str | "Q4_K_M" | Target quantization type |

Example

from quantllm import quantize_gguf

# Re-quantize F16 to Q4_K_M
quantize_gguf(
    "model.F16.gguf",
    "model.Q4_K_M.gguf",
    quant_type="Q4_K_M"
)
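
When one F16 file feeds several quantization levels, it helps to derive the output names up front. The sketch below follows the `model.<TYPE>.gguf` naming used in the examples above. `plan_outputs` and `requantize_all` are hypothetical helpers, not part of quantllm, and `requantize_all` assumes only the `quantize_gguf` signature documented above.

```python
from pathlib import Path


def plan_outputs(f16_path: str, quant_types: list[str]) -> list[tuple[str, str]]:
    """Return (output_path, quant_type) pairs placed next to the source file."""
    src = Path(f16_path)
    # "model.F16.gguf" -> "model"; names without those suffixes are left unchanged.
    base = src.name.removesuffix(".gguf").removesuffix(".F16")
    return [(str(src.with_name(f"{base}.{q}.gguf")), q) for q in quant_types]


def requantize_all(f16_path: str, quant_types: list[str]) -> list[str]:
    """Call quantize_gguf once per target type (requires quantllm)."""
    # Imported lazily so the planning step works without quantllm installed.
    from quantllm import quantize_gguf

    return [
        quantize_gguf(f16_path, out_path, quant_type=q)
        for out_path, q in plan_outputs(f16_path, quant_types)
    ]


# Planning only: nothing is converted here.
for out_path, q in plan_outputs("model.F16.gguf", ["Q4_K_M", "Q8_0"]):
    print(q, "->", out_path)
```

Keeping the F16 file as the single source avoids re-quantizing an already quantized file.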

GGUF_QUANT_TYPES

Available quantization types.

from quantllm import GGUF_QUANT_TYPES

print(GGUF_QUANT_TYPES)
# ['Q2_K', 'Q3_K_S', 'Q3_K_M', 'Q3_K_L', 'Q4_K_S', 'Q4_K_M', 
#  'Q5_K_S', 'Q5_K_M', 'Q6_K', 'Q8_0', 'F16', 'F32']
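
A mistyped quant_type is cheap to catch before a long conversion starts. The sketch below validates a requested type against the list printed above. The list is copied into the snippet so that it runs without quantllm installed, and `normalize_quant_type` is a hypothetical helper, not part of the library.

```python
# Copy of the list printed above, so the sketch runs without quantllm installed.
# In real code, prefer: from quantllm import GGUF_QUANT_TYPES
GGUF_QUANT_TYPES = [
    "Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M",
    "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0", "F16", "F32",
]


def normalize_quant_type(name: str) -> str:
    """Return the canonical upper-case name, or raise ValueError if unknown."""
    candidate = name.strip().upper()
    if candidate not in GGUF_QUANT_TYPES:
        raise ValueError(
            f"Unknown quantization type {name!r}; "
            f"expected one of: {', '.join(GGUF_QUANT_TYPES)}"
        )
    return candidate


print(normalize_quant_type("q4_k_m"))  # Q4_K_M
```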

Quantization Comparison

| Type | Bits | Quality | Size (7B) | Use Case |
|---|---|---|---|---|
| Q2_K | 2 | Low | ~2 GB | Extreme compression |
| Q3_K_S | 3 | Fair | ~2.5 GB | Small devices |
| Q3_K_M | 3 | Fair | ~3 GB | Constrained memory |
| Q4_K_S | 4 | Good | ~3.5 GB | Balanced (smaller) |
| Q4_K_M | 4 | Good | ~4 GB | Recommended ⭐ |
| Q5_K_S | 5 | High | ~4.5 GB | Quality focus |
| Q5_K_M | 5 | High | ~5 GB | Quality balance |
| Q6_K | 6 | Very High | ~5.5 GB | Near original |
| Q8_0 | 8 | Excellent | ~7 GB | Maximum quality |
| F16 | 16 | Original | ~14 GB | Unquantized (half precision) |
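
The sizes above follow roughly from parameters × bits per weight ÷ 8. The sketch below computes that figure from the nominal bit widths in the Bits column. Treat it as a lower bound: K-quants also store per-block scales and keep some tensors at higher precision, which is why the table lists about 4 GB for Q4_K_M rather than 3.5 GB. `estimate_size_gb` is a hypothetical helper, not part of quantllm.

```python
# Nominal bit widths taken from the "Bits" column of the table above.
NOMINAL_BITS = {
    "Q2_K": 2, "Q3_K_S": 3, "Q3_K_M": 3, "Q4_K_S": 4, "Q4_K_M": 4,
    "Q5_K_S": 5, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8, "F16": 16,
}


def estimate_size_gb(num_params: float, quant_type: str) -> float:
    """Lower-bound file size in GB (1 GB = 1e9 bytes) from nominal bits per weight."""
    bits = NOMINAL_BITS[quant_type]
    return num_params * bits / 8 / 1e9


for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"{quant}: at least {estimate_size_gb(7e9, quant):.1f} GB")
```

The same function scales to other model sizes, for example `estimate_size_gb(3e9, "Q4_K_M")` for a 3B model.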


QUANT_RECOMMENDATIONS

Recommended quantization types, keyed by hardware tier.

from quantllm import QUANT_RECOMMENDATIONS

print(QUANT_RECOMMENDATIONS)
# {
#     'low_memory': 'Q3_K_M',      # <6 GB VRAM
#     'balanced': 'Q4_K_M',        # 6-12 GB VRAM (recommended)
#     'quality': 'Q5_K_M',         # 12-24 GB VRAM
#     'high_quality': 'Q6_K',      # >24 GB VRAM
#     'maximum': 'Q8_0',           # Maximum quality
# }
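
The VRAM ranges in the comments can be turned into a lookup. The sketch below is one possible mapping, with the dictionary copied in so that it runs without quantllm installed. `recommend_quant` is a hypothetical helper, and the boundary handling (exactly 12 GB counts as the quality tier, exactly 24 GB stays in it) is an assumption, since the ranges above overlap at their edges. The 'maximum' entry has no VRAM range and is left out of the mapping.

```python
# Copy of the mapping printed above, so the sketch runs without quantllm installed.
QUANT_RECOMMENDATIONS = {
    "low_memory": "Q3_K_M",
    "balanced": "Q4_K_M",
    "quality": "Q5_K_M",
    "high_quality": "Q6_K",
    "maximum": "Q8_0",
}


def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM in GB to a quantization type using the tiers above."""
    if vram_gb < 6:
        tier = "low_memory"
    elif vram_gb < 12:
        tier = "balanced"
    elif vram_gb <= 24:
        tier = "quality"
    else:
        tier = "high_quality"
    return QUANT_RECOMMENDATIONS[tier]


print(recommend_quant(8))  # Q4_K_M
```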

check_llama_cpp()

Check whether llama.cpp is installed. Returns True if it is found, False otherwise.

def check_llama_cpp() -> bool

Example

from quantllm import check_llama_cpp

if check_llama_cpp():
    print("llama.cpp is ready!")
else:
    print("llama.cpp not found")
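
How check_llama_cpp() locates llama.cpp is not documented here. As an independent cross-check, the sketch below looks for llama.cpp's command-line tools on PATH using only the standard library. `find_llama_tools` is a hypothetical helper, and quantllm may search other locations (such as its own install directory), so the two results can differ.

```python
import shutil


def find_llama_tools(names=("llama-cli", "llama-quantize")) -> dict:
    """Return {tool name: absolute path} for each tool found on PATH."""
    found = {}
    for name in names:
        path = shutil.which(name)
        if path is not None:
            found[name] = path
    return found


tools = find_llama_tools()
if tools:
    for name, path in tools.items():
        print(f"{name}: {path}")
else:
    print("No llama.cpp tools on PATH")
```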

install_llama_cpp()

Install llama.cpp automatically.

def install_llama_cpp(
    install_dir: str = "./llama.cpp",
    force: bool = False,
) -> str

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| install_dir | str | "./llama.cpp" | Installation directory |
| force | bool | False | Force reinstall |

Example

from quantllm import install_llama_cpp

# Install to default location
install_llama_cpp()

# Install to custom location
install_llama_cpp("./tools/llama.cpp")

ensure_llama_cpp_installed()

Ensure llama.cpp is installed, installing it if needed. Returns the path to the llama.cpp installation.

def ensure_llama_cpp_installed() -> str

Example

from quantllm import ensure_llama_cpp_installed

# Automatically installs if not present
llama_path = ensure_llama_cpp_installed()
print(f"llama.cpp at: {llama_path}")

export_to_gguf()

Deprecated high-level export function. Use convert_to_gguf() instead.

def export_to_gguf(
    model,
    tokenizer,
    output_path: str,
    quant_type: str = "Q4_K_M",
) -> str

Using Exported Models

llama.cpp

./llama-cli -m model.Q4_K_M.gguf -p "Hello!" -n 100

Ollama

echo 'FROM ./model.Q4_K_M.gguf' > Modelfile
ollama create mymodel -f Modelfile
ollama run mymodel
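
The Modelfile can also be generated from Python, which is convenient when the export and the Ollama setup run in the same script. The sketch below writes the same FROM line as the echo command above and optionally appends PARAMETER lines. `write_modelfile` is a hypothetical helper, and the temperature value is only an illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional


def write_modelfile(gguf_path: str, directory: str,
                    parameters: Optional[dict] = None) -> Path:
    """Write a Modelfile that points at gguf_path and return its path."""
    lines = [f"FROM {gguf_path}"]
    # Optional sampling settings, e.g. {"temperature": 0.7}.
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    modelfile = Path(directory) / "Modelfile"
    modelfile.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return modelfile


# Demonstration in a temporary directory; use your project directory in practice.
with tempfile.TemporaryDirectory() as tmp:
    path = write_modelfile("./model.Q4_K_M.gguf", tmp, {"temperature": 0.7})
    print(path.read_text(encoding="utf-8"))
```

After writing the file, `ollama create mymodel -f Modelfile` works as shown above.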

LM Studio

  1. Import the .gguf file

  2. Start chatting

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(model_path="model.Q4_K_M.gguf")
output = llm("Hello!", max_tokens=100)
print(output["choices"][0]["text"])
