π₯ Loading Modelsο
QuantLLM provides flexible model loading with automatic optimization.
Basic Loadingο
The turbo() Functionο
The simplest way to load any model:
from quantllm import turbo
# Load from HuggingFace Hub
model = turbo("meta-llama/Llama-3.2-3B")
# Load from local path
model = turbo("./my-local-model/")
What Happens Automaticallyο
When you call turbo(), QuantLLM:
Detects your hardware β GPU memory, CUDA version, capabilities
Chooses quantization β 4-bit for most GPUs, 8-bit for high-memory systems
Enables optimizations β Flash Attention 2, gradient checkpointing
Configures memory β Automatic offloading if needed
Quantization Optionsο
Automatic (Recommended)ο
# Let QuantLLM choose the best quantization
model = turbo("meta-llama/Llama-3.2-3B")
Manual Bit-Widthο
# Force specific quantization
model = turbo("meta-llama/Llama-3.2-3B", bits=4) # 4-bit (smallest)
model = turbo("meta-llama/Llama-3.2-3B", bits=8) # 8-bit (balanced)
model = turbo("meta-llama/Llama-3.2-3B", bits=16) # FP16 (highest quality)
Disable Quantizationο
# Load in full precision (requires more memory)
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)
Configuration Optionsο
Common Optionsο
model = turbo(
"meta-llama/Llama-3.2-3B",
bits=4, # Quantization bits (4, 8, 16)
max_length=4096, # Maximum context length
device="cuda:0", # Device (cuda, cpu, auto)
dtype="bfloat16", # Data type (float16, bfloat16)
trust_remote_code=True, # For custom model architectures
verbose=True, # Show loading progress
)
New Architecture Fallbacks (for very recent model releases)ο
QuantLLM ships a built-in fallback table covering common model-type
suffixes β qwen3 β qwen2, llama4 β llama, phi4 β phi3,
gemma3 β gemma2, and many others β so brand-new releases load with
the same one-line API as established models:
from quantllm import turbo
# Works without manual registration: qwen3 falls back to qwen2 automatically
model = turbo("Qwen/Qwen3-8B", trust_remote_code=True)
When the built-in mapping does not cover your model, register an explicit fallback before loading:
from quantllm import turbo, register_architecture
# Map a brand-new architecture/model_type to a compatible base family
register_architecture("newmodel", base_model_type="llama")
# Optionally provide an explicit ``model_class`` (most useful for
# fine-tuned variants that ship their own modelling code):
from transformers import LlamaForCausalLM
register_architecture(
"newmodel",
base_model_type="llama",
model_class=LlamaForCausalLM,
)
model = turbo(
"new-model-org/NewModel-7B",
model_type_override="llama", # optional explicit override
base_model_fallback=True, # enabled by default; can be disabled
trust_remote_code=True,
)
β οΈ Security note:
trust_remote_code=Trueexecutes model-provided code. Only enable it for trusted publishers, especially when loading unregistered or very new architectures.
Pre-quantized HuggingFace reposο
QuantLLM detects pre-quantized repository names (Unsloth *-bnb-4bit /
*-bnb-8bit, AWQ, GPTQ, AQLM, HQQ, FP8, EETQ, etc.) and lets the modelβs
own quantization_config win β so you donβt accidentally re-quantize a
model that ships at-rest in 4-bit:
# Loaded as 4-bit BitsAndBytes from the repo's embedded config -- no
# additional dynamic quantization is applied on top.
model = turbo("unsloth/Llama-3.2-3B-Instruct-bnb-4bit")
# Verify what actually got loaded:
print(model.report())
# {'quant_method': 'bitsandbytes', 'is_quantized': True, ...}
from_config_only is for skeleton inspection onlyο
# Loads a randomly-initialised model from the config -- useful for
# inspecting layer shapes or wiring up tests, NOT for inference.
model = turbo(
"new-model-org/NewModel-7B",
from_config_only=True,
trust_remote_code=True,
)
# ``model.is_quantized`` will correctly report False here even when you
# also passed ``bits=4`` -- there are no real weights to quantize.
Fast contribution template for new architecturesο
Add a registration in your code or PR:
register_architecture("new-arch", base_model_type="llama")
Validate loading with:
turbo("org/model", base_model_fallback=True, trust_remote_code=True)
Add/extend a focused test in
tests/test_architecture_fallback.pyortests/test_resolve_model_type.py.
Inspecting the loaded stateο
model = turbo("Qwen/Qwen3-8B", bits=4)
report = model.report()
# {
# 'model_id': 'Qwen/Qwen3-8B',
# 'params_billion': 8.0,
# 'requested_bits': 4,
# 'effective_loading_bits': 4,
# 'is_quantized': True,
# 'quant_method': 'bitsandbytes',
# 'device': 'cuda:0',
# 'dtype': 'torch.bfloat16',
# 'finetuned': False,
# 'lora_applied': False,
# }
model.is_quantized is derived from the actual loaded model state
(config.quantization_config and BitsAndBytes layer types). It is not
a cached snapshot of your load-time intent, so from_config_only=True
or a missing bitsandbytes install will correctly report False.
Memory Optionsο
model = turbo(
"meta-llama/Llama-3.2-3B",
bits=4,
device_map="auto", # Automatic device mapping
low_cpu_mem_usage=True, # Reduce CPU memory during loading
)
Using TurboModel Directlyο
For more control, use the TurboModel class:
from quantllm import TurboModel, SmartConfig
# Create custom config
config = SmartConfig.detect("meta-llama/Llama-3.2-3B", bits=4)
# Load with custom config
model = TurboModel.from_pretrained(
"meta-llama/Llama-3.2-3B",
config=config,
)
Load GGUF Modelsο
Load pre-quantized GGUF models directly from HuggingFace:
from quantllm import TurboModel
# From HuggingFace Hub
model = TurboModel.from_gguf(
"TheBloke/Llama-2-7B-Chat-GGUF",
filename="llama-2-7b-chat.Q4_K_M.gguf"
)
# From local file
model = TurboModel.from_gguf("./models/my-model.gguf")
List Available GGUF Filesο
files = TurboModel.list_gguf_files("TheBloke/Llama-2-7B-Chat-GGUF")
print(files)
# ['llama-2-7b-chat.Q2_K.gguf', 'llama-2-7b-chat.Q4_K_M.gguf', ...]
Supported Modelsο
QuantLLM supports 45+ model architectures:
Family |
Models |
|---|---|
Llama |
Llama 2, Llama 3, Llama 3.1, Llama 3.2, CodeLlama |
Mistral |
Mistral 7B, Mixtral 8x7B, Mixtral 8x22B |
Qwen |
Qwen, Qwen2, Qwen2.5, Qwen2-MoE |
Microsoft |
Phi-1, Phi-2, Phi-3 |
Gemma, Gemma 2 |
|
Falcon |
Falcon 7B, 40B, 180B |
Code Models |
StarCoder, StarCoder2, CodeGen |
Chinese |
ChatGLM, Yi, Baichuan, InternLM |
Other |
DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |
Memory Optimizationο
For Large Modelsο
# Enable gradient checkpointing (for training)
model = turbo("meta-llama/Llama-3-70B", bits=4)
# Use CPU offloading
model = turbo(
"meta-llama/Llama-3-70B",
bits=4,
device_map="auto", # Automatic CPU/GPU split
)
Memory Usage Estimatesο
Model Size |
4-bit |
8-bit |
FP16 |
|---|---|---|---|
3B |
~2 GB |
~4 GB |
~6 GB |
7B |
~4 GB |
~8 GB |
~14 GB |
13B |
~8 GB |
~14 GB |
~26 GB |
70B |
~40 GB |
~70 GB |
~140 GB |
Best Practicesο
Start with automatic settings β Let QuantLLM detect your hardware
Use 4-bit for most cases β Best balance of quality and memory
Check memory first β
turbo()shows memory stats before loadingUse GGUF for inference β Pre-quantized GGUF models load faster
Next Stepsο
Text Generation β β Generate text with your model
Fine-tuning β β Train with your own data
GGUF Export β β Export for deployment