📥 Loading Models

QuantLLM provides flexible model loading with automatic optimization.

Basic Loading

The `turbo()` Function

The simplest way to load any model:

from quantllm import turbo

# Load from HuggingFace Hub
model = turbo("meta-llama/Llama-3.2-3B")

# Load from local path
model = turbo("./my-local-model/")

What Happens Automatically

When you call turbo(), QuantLLM:

Detects your hardware — GPU memory, CUDA version, capabilities
Chooses quantization — 4-bit for most GPUs, 8-bit for high-memory systems
Enables optimizations — Flash Attention 2, gradient checkpointing
Configures memory — Automatic offloading if needed
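The selection step above can be pictured as a small heuristic. The sketch below is not QuantLLM's actual logic: `choose_bits`, the 1.2 overhead factor, and the rule "prefer 8-bit when it fits, otherwise 4-bit" are assumptions made only to illustrate how free GPU memory might map to a bit-width.

```python
def choose_bits(params_billion: float, free_vram_gb: float, overhead: float = 1.2) -> int:
    """Pick 8-bit when the estimated footprint fits in GPU memory, else 4-bit.

    Hypothetical heuristic for illustration only; not QuantLLM's real rule.
    """
    for bits in (8, 4):
        # Weights take params * bits / 8 bytes, so 1B params at 8 bits is ~1 GB.
        needed_gb = params_billion * bits / 8 * overhead
        if needed_gb <= free_vram_gb:
            return bits
    # Nothing fits fully on the GPU: stay at 4-bit and rely on offloading.
    return 4


print(choose_bits(3, free_vram_gb=24))  # 8
print(choose_bits(7, free_vram_gb=8))   # 4
```

A 3B model needs roughly 3.6 GB at 8-bit under these assumptions, so it fits on a 24 GB card; a 7B model at 8-bit needs about 8.4 GB and drops to 4-bit on an 8 GB card.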

Quantization Options

Automatic (Recommended)

# Let QuantLLM choose the best quantization
model = turbo("meta-llama/Llama-3.2-3B")

Manual Bit-Width

# Force specific quantization
model = turbo("meta-llama/Llama-3.2-3B", bits=4)   # 4-bit (smallest)
model = turbo("meta-llama/Llama-3.2-3B", bits=8)   # 8-bit (balanced)
model = turbo("meta-llama/Llama-3.2-3B", bits=16)  # FP16 (highest quality)

Disable Quantization

# Load in full precision (requires more memory)
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)

Configuration Options

Common Options

model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,                      # Quantization bits (4, 8, 16)
    max_length=4096,             # Maximum context length
    device="cuda:0",             # Device (cuda, cpu, auto)
    dtype="bfloat16",            # Data type (float16, bfloat16)
    trust_remote_code=True,      # For custom model architectures
    verbose=True,                # Show loading progress
)

New Architecture Fallbacks (for very recent model releases)

QuantLLM ships a built-in fallback table covering common model-type suffixes — qwen3 → qwen2, llama4 → llama, phi4 → phi3, gemma3 → gemma2, and many others — so brand-new releases load with the same one-line API as established models:

from quantllm import turbo

# Works without manual registration: qwen3 falls back to qwen2 automatically
model = turbo("Qwen/Qwen3-8B", trust_remote_code=True)

When the built-in mapping does not cover your model, register an explicit fallback before loading:

from quantllm import turbo, register_architecture

# Map a brand-new architecture/model_type to a compatible base family
register_architecture("newmodel", base_model_type="llama")

# Optionally provide an explicit ``model_class`` (most useful for
# fine-tuned variants that ship their own modelling code):
from transformers import LlamaForCausalLM
register_architecture(
    "newmodel",
    base_model_type="llama",
    model_class=LlamaForCausalLM,
)

model = turbo(
    "new-model-org/NewModel-7B",
    model_type_override="llama",     # optional explicit override
    base_model_fallback=True,        # enabled by default; can be disabled
    trust_remote_code=True,
)
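To make the interaction between the built-in table, explicit registration, and the override concrete, here is a minimal sketch of one plausible resolution order. The names `BUILTIN_FALLBACKS`, `register`, and `resolve_model_type`, as well as the precedence itself (override, then user registration, then built-in table), are assumptions for illustration and do not reproduce QuantLLM's internals.

```python
from typing import Dict, Optional

# Illustrative subset of a fallback table, using the pairs named in the text.
BUILTIN_FALLBACKS: Dict[str, str] = {
    "qwen3": "qwen2",
    "llama4": "llama",
    "phi4": "phi3",
    "gemma3": "gemma2",
}

# User registrations, filled by the hypothetical register() helper below.
_registered: Dict[str, str] = {}


def register(model_type: str, base_model_type: str) -> None:
    """Record an explicit fallback (a stand-in for register_architecture)."""
    _registered[model_type] = base_model_type


def resolve_model_type(
    model_type: str,
    override: Optional[str] = None,
    fallback_enabled: bool = True,
) -> str:
    """Return the model type to load with, under an assumed precedence order."""
    if override is not None:
        return override                  # an explicit override wins
    if not fallback_enabled:
        return model_type                # fallbacks disabled: use the type as-is
    if model_type in _registered:
        return _registered[model_type]   # user registration comes next
    # Finally consult the built-in table, defaulting to the original type.
    return BUILTIN_FALLBACKS.get(model_type, model_type)


register("newmodel", "llama")
print(resolve_model_type("newmodel"))  # llama
print(resolve_model_type("qwen3"))     # qwen2
```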

⚠️ Security note: trust_remote_code=True executes model-provided code. Only enable it for trusted publishers, especially when loading unregistered or very new architectures.

Pre-quantized HuggingFace repos

QuantLLM detects pre-quantized repository names (Unsloth *-bnb-4bit / *-bnb-8bit, AWQ, GPTQ, AQLM, HQQ, FP8, EETQ, etc.) and lets the model’s own quantization_config win — so you don’t accidentally re-quantize a model that ships at-rest in 4-bit:

# Loaded as 4-bit BitsAndBytes from the repo's embedded config -- no
# additional dynamic quantization is applied on top.
model = turbo("unsloth/Llama-3.2-3B-Instruct-bnb-4bit")

# Verify what actually got loaded:
print(model.report())
# {'quant_method': 'bitsandbytes', 'is_quantized': True, ...}
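Name-based detection of this kind can be sketched with a few regular expressions. The patterns and the helper `guess_quant_method` below are hypothetical and cover only three of the listed formats; the real detection rules may differ, and per the text above the repo's embedded quantization_config remains the source of truth.

```python
import re
from typing import Optional

# Hypothetical name patterns, checked in order; the first match decides.
_PATTERNS = [
    (re.compile(r"bnb-(4|8)bit", re.IGNORECASE), "bitsandbytes"),
    (re.compile(r"(^|[-_/])awq($|[-_])", re.IGNORECASE), "awq"),
    (re.compile(r"(^|[-_/])gptq($|[-_])", re.IGNORECASE), "gptq"),
]


def guess_quant_method(repo_id: str) -> Optional[str]:
    """Guess the quantization method from a repository name, or return None."""
    for pattern, method in _PATTERNS:
        if pattern.search(repo_id):
            return method
    return None


print(guess_quant_method("unsloth/Llama-3.2-3B-Instruct-bnb-4bit"))  # bitsandbytes
print(guess_quant_method("meta-llama/Llama-3.2-3B"))                 # None
```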

`from_config_only` is for skeleton inspection only

# Loads a randomly-initialised model from the config -- useful for
# inspecting layer shapes or wiring up tests, NOT for inference.
model = turbo(
    "new-model-org/NewModel-7B",
    from_config_only=True,
    trust_remote_code=True,
)

# ``model.is_quantized`` will correctly report False here even when you
# also passed ``bits=4`` -- there are no real weights to quantize.

Fast contribution template for new architectures

1. Add a registration in your code or PR:
   - register_architecture("new-arch", base_model_type="llama")
2. Validate loading with:
   - turbo("org/model", base_model_fallback=True, trust_remote_code=True)
3. Add or extend a focused test in tests/test_architecture_fallback.py or tests/test_resolve_model_type.py.

Inspecting the loaded state

model = turbo("Qwen/Qwen3-8B", bits=4)

report = model.report()
# {
#   'model_id': 'Qwen/Qwen3-8B',
#   'params_billion': 8.0,
#   'requested_bits': 4,
#   'effective_loading_bits': 4,
#   'is_quantized': True,
#   'quant_method': 'bitsandbytes',
#   'device': 'cuda:0',
#   'dtype': 'torch.bfloat16',
#   'finetuned': False,
#   'lora_applied': False,
# }

model.is_quantized is derived from the actual loaded model state (config.quantization_config and BitsAndBytes layer types). It is not a cached snapshot of your load-time intent, so it correctly reports False when you load with from_config_only=True or when bitsandbytes is not installed.
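A state-derived check along these lines can be sketched without torch installed. Everything here is illustrative: `is_quantized` is a standalone function rather than the library's property, `FakeModel` and the stub layer classes are stand-ins, and treating either signal as sufficient is an assumption. `Linear4bit` and `Linear8bitLt` are the class names bitsandbytes uses for its quantized linear layers.

```python
from types import SimpleNamespace

# Class names of bitsandbytes quantized linear layers.
_BNB_LAYER_NAMES = {"Linear4bit", "Linear8bitLt"}


def is_quantized(model) -> bool:
    """Derive quantization state from the model itself, not from load arguments."""
    config = getattr(model, "config", None)
    if getattr(config, "quantization_config", None) is not None:
        return True
    # Otherwise look for quantized layer types among the submodules.
    return any(type(m).__name__ in _BNB_LAYER_NAMES for m in model.modules())


# Minimal stand-ins so the sketch runs without torch or bitsandbytes.
class Linear4bit:
    pass


class Linear:
    pass


class FakeModel:
    def __init__(self, layers, quantization_config=None):
        self.config = SimpleNamespace(quantization_config=quantization_config)
        self._layers = layers

    def modules(self):
        return iter(self._layers)


print(is_quantized(FakeModel([Linear()])))      # False
print(is_quantized(FakeModel([Linear4bit()])))  # True
```

Because the check inspects the model rather than the arguments passed at load time, a request for bits=4 that never produced quantized layers still yields False.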

Memory Options

model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,
    device_map="auto",           # Automatic device mapping
    low_cpu_mem_usage=True,      # Reduce CPU memory during loading
)

Using TurboModel Directly

For more control, use the TurboModel class:

from quantllm import TurboModel, SmartConfig

# Create custom config
config = SmartConfig.detect("meta-llama/Llama-3.2-3B", bits=4)

# Load with custom config
model = TurboModel.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    config=config,
)

Load GGUF Models

Load pre-quantized GGUF models from the HuggingFace Hub or from a local file:

from quantllm import TurboModel

# From HuggingFace Hub
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)

# From local file
model = TurboModel.from_gguf("./models/my-model.gguf")

List Available GGUF Files

files = TurboModel.list_gguf_files("TheBloke/Llama-2-7B-Chat-GGUF")
print(files)
# ['llama-2-7b-chat.Q2_K.gguf', 'llama-2-7b-chat.Q4_K_M.gguf', ...]
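Once you have the file list, you will usually want to pick one quantization variant. The helper below is a hypothetical convenience, not part of QuantLLM; it assumes the quantization tag sits between dots in the filename, as in the listing above, and the default preference order is arbitrary.

```python
from typing import Iterable, Optional, Sequence


def pick_gguf_file(
    files: Iterable[str],
    preferred: Sequence[str] = ("Q4_K_M", "Q5_K_M", "Q8_0"),
) -> Optional[str]:
    """Return the first file whose name carries a preferred quantization tag."""
    candidates = [f for f in files if f.lower().endswith(".gguf")]
    for tag in preferred:
        for name in candidates:
            # Tags appear between dots, e.g. "llama-2-7b-chat.Q4_K_M.gguf".
            if f".{tag.lower()}." in name.lower():
                return name
    return None


files = ["llama-2-7b-chat.Q2_K.gguf", "llama-2-7b-chat.Q4_K_M.gguf"]
print(pick_gguf_file(files))  # llama-2-7b-chat.Q4_K_M.gguf
```

The returned name can then be passed as the filename argument to TurboModel.from_gguf.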

Supported Models

QuantLLM supports 45+ model architectures:

Family	Models
Llama	Llama 2, Llama 3, Llama 3.1, Llama 3.2, CodeLlama
Mistral	Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
Qwen	Qwen, Qwen2, Qwen2.5, Qwen2-MoE
Microsoft	Phi-1, Phi-2, Phi-3
Google	Gemma, Gemma 2
Falcon	Falcon 7B, 40B, 180B
Code Models	StarCoder, StarCoder2, CodeGen
Chinese	ChatGLM, Yi, Baichuan, InternLM
Other	DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX

Memory Optimization

For Large Models

# turbo() enables gradient checkpointing automatically (relevant for training)
model = turbo("meta-llama/Meta-Llama-3-70B", bits=4)

# Use CPU offloading
model = turbo(
    "meta-llama/Meta-Llama-3-70B",
    bits=4,
    device_map="auto",  # Automatic CPU/GPU split
)
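To give a feel for what an automatic CPU/GPU split involves, the sketch below places layers on the GPU in order until a memory budget is exhausted and sends the rest to the CPU. It is a simplified illustration under assumed inputs; `split_layers` is hypothetical, and the placement strategy QuantLLM actually uses may differ.

```python
from typing import Dict, List


def split_layers(layer_sizes_gb: List[float], gpu_budget_gb: float) -> Dict[int, str]:
    """Assign layers in order to the GPU until the budget is used, then to the CPU."""
    placement: Dict[int, str] = {}
    used = 0.0
    on_gpu = True
    for index, size in enumerate(layer_sizes_gb):
        if on_gpu and used + size <= gpu_budget_gb:
            placement[index] = "cuda:0"
            used += size
        else:
            # Once one layer spills, keep all later layers on the CPU as well
            # so that execution crosses the device boundary only once.
            on_gpu = False
            placement[index] = "cpu"
    return placement


print(split_layers([1.0, 1.0, 1.0, 1.0], gpu_budget_gb=2.5))
# {0: 'cuda:0', 1: 'cuda:0', 2: 'cpu', 3: 'cpu'}
```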

Memory Usage Estimates

Model Size	4-bit	8-bit	FP16
3B	~2 GB	~4 GB	~6 GB
7B	~4 GB	~8 GB	~14 GB
13B	~8 GB	~14 GB	~26 GB
70B	~40 GB	~70 GB	~140 GB
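The estimates in the table follow from simple arithmetic: raw weight storage is the parameter count times the bit-width divided by 8. The helper below (`weight_memory_gb` is a name chosen for this example) computes that lower bound with 1 GB taken as 10^9 bytes. The table values sit somewhat above the raw figures, which is consistent with extra room for quantization metadata and runtime buffers.

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes): parameters * bits / 8."""
    return params_billion * bits / 8


print(weight_memory_gb(7, 4))    # 3.5
print(weight_memory_gb(7, 16))   # 14.0
print(weight_memory_gb(70, 4))   # 35.0
```

For example, a 70B model at 4-bit needs at least 35 GB for weights alone, which the table rounds up to about 40 GB.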

Best Practices

Start with automatic settings — Let QuantLLM detect your hardware
Use 4-bit for most cases — Best balance of quality and memory
Check memory first — turbo() shows memory stats before loading
Use GGUF for inference — Pre-quantized GGUF models load faster

Next Steps

Text Generation: Generate text with your model
Fine-tuning: Train with your own data
GGUF Export: Export for deployment