# 📥 Loading Models QuantLLM provides flexible model loading with automatic optimization. --- ## Basic Loading ### The `turbo()` Function The simplest way to load any model: ```python from quantllm import turbo # Load from HuggingFace Hub model = turbo("meta-llama/Llama-3.2-3B") # Load from local path model = turbo("./my-local-model/") ``` ### What Happens Automatically When you call `turbo()`, QuantLLM: 1. **Detects your hardware** — GPU memory, CUDA version, capabilities 2. **Chooses quantization** — 4-bit for most GPUs, 8-bit for high-memory systems 3. **Enables optimizations** — Flash Attention 2, gradient checkpointing 4. **Configures memory** — Automatic offloading if needed --- ## Quantization Options ### Automatic (Recommended) ```python # Let QuantLLM choose the best quantization model = turbo("meta-llama/Llama-3.2-3B") ``` ### Manual Bit-Width ```python # Force specific quantization model = turbo("meta-llama/Llama-3.2-3B", bits=4) # 4-bit (smallest) model = turbo("meta-llama/Llama-3.2-3B", bits=8) # 8-bit (balanced) model = turbo("meta-llama/Llama-3.2-3B", bits=16) # FP16 (highest quality) ``` ### Disable Quantization ```python # Load in full precision (requires more memory) model = turbo("meta-llama/Llama-3.2-3B", quantize=False) ``` --- ## Configuration Options ### Common Options ```python model = turbo( "meta-llama/Llama-3.2-3B", bits=4, # Quantization bits (4, 8, 16) max_length=4096, # Maximum context length device="cuda:0", # Device (cuda, cpu, auto) dtype="bfloat16", # Data type (float16, bfloat16) trust_remote_code=True, # For custom model architectures verbose=True, # Show loading progress ) ``` ### New Architecture Fallbacks (for very recent model releases) QuantLLM ships a built-in fallback table covering common model-type suffixes — `qwen3` → `qwen2`, `llama4` → `llama`, `phi4` → `phi3`, `gemma3` → `gemma2`, and many others — so brand-new releases load with the same one-line API as established models: ```python from quantllm import turbo # Works without manual registration: qwen3 falls back to qwen2 automatically model = turbo("Qwen/Qwen3-8B", trust_remote_code=True) ``` When the built-in mapping does not cover your model, register an explicit fallback before loading: ```python from quantllm import turbo, register_architecture # Map a brand-new architecture/model_type to a compatible base family register_architecture("newmodel", base_model_type="llama") # Optionally provide an explicit ``model_class`` (most useful for # fine-tuned variants that ship their own modelling code): from transformers import LlamaForCausalLM register_architecture( "newmodel", base_model_type="llama", model_class=LlamaForCausalLM, ) model = turbo( "new-model-org/NewModel-7B", model_type_override="llama", # optional explicit override base_model_fallback=True, # enabled by default; can be disabled trust_remote_code=True, ) ``` > ⚠️ **Security note:** `trust_remote_code=True` executes model-provided code. > Only enable it for trusted publishers, especially when loading unregistered or very new architectures. #### Pre-quantized HuggingFace repos QuantLLM detects pre-quantized repository names (Unsloth `*-bnb-4bit` / `*-bnb-8bit`, AWQ, GPTQ, AQLM, HQQ, FP8, EETQ, etc.) and lets the model's own `quantization_config` win — so you don't accidentally re-quantize a model that ships at-rest in 4-bit: ```python # Loaded as 4-bit BitsAndBytes from the repo's embedded config -- no # additional dynamic quantization is applied on top. model = turbo("unsloth/Llama-3.2-3B-Instruct-bnb-4bit") # Verify what actually got loaded: print(model.report()) # {'quant_method': 'bitsandbytes', 'is_quantized': True, ...} ``` #### `from_config_only` is for skeleton inspection only ```python # Loads a randomly-initialised model from the config -- useful for # inspecting layer shapes or wiring up tests, NOT for inference. model = turbo( "new-model-org/NewModel-7B", from_config_only=True, trust_remote_code=True, ) # ``model.is_quantized`` will correctly report False here even when you # also passed ``bits=4`` -- there are no real weights to quantize. ``` #### Fast contribution template for new architectures 1. Add a registration in your code or PR: - `register_architecture("new-arch", base_model_type="llama")` 2. Validate loading with: - `turbo("org/model", base_model_fallback=True, trust_remote_code=True)` 3. Add/extend a focused test in `tests/test_architecture_fallback.py` or `tests/test_resolve_model_type.py`. #### Inspecting the loaded state ```python model = turbo("Qwen/Qwen3-8B", bits=4) report = model.report() # { # 'model_id': 'Qwen/Qwen3-8B', # 'params_billion': 8.0, # 'requested_bits': 4, # 'effective_loading_bits': 4, # 'is_quantized': True, # 'quant_method': 'bitsandbytes', # 'device': 'cuda:0', # 'dtype': 'torch.bfloat16', # 'finetuned': False, # 'lora_applied': False, # } ``` `model.is_quantized` is derived from the actual loaded model state (`config.quantization_config` and BitsAndBytes layer types). It is not a cached snapshot of your load-time intent, so `from_config_only=True` or a missing `bitsandbytes` install will correctly report `False`. ### Memory Options ```python model = turbo( "meta-llama/Llama-3.2-3B", bits=4, device_map="auto", # Automatic device mapping low_cpu_mem_usage=True, # Reduce CPU memory during loading ) ``` --- ## Using TurboModel Directly For more control, use the `TurboModel` class: ```python from quantllm import TurboModel, SmartConfig # Create custom config config = SmartConfig.detect("meta-llama/Llama-3.2-3B", bits=4) # Load with custom config model = TurboModel.from_pretrained( "meta-llama/Llama-3.2-3B", config=config, ) ``` --- ## Load GGUF Models Load pre-quantized GGUF models directly from HuggingFace: ```python from quantllm import TurboModel # From HuggingFace Hub model = TurboModel.from_gguf( "TheBloke/Llama-2-7B-Chat-GGUF", filename="llama-2-7b-chat.Q4_K_M.gguf" ) # From local file model = TurboModel.from_gguf("./models/my-model.gguf") ``` ### List Available GGUF Files ```python files = TurboModel.list_gguf_files("TheBloke/Llama-2-7B-Chat-GGUF") print(files) # ['llama-2-7b-chat.Q2_K.gguf', 'llama-2-7b-chat.Q4_K_M.gguf', ...] ``` --- ## Supported Models QuantLLM supports **45+ model architectures**: | Family | Models | |--------|--------| | **Llama** | Llama 2, Llama 3, Llama 3.1, Llama 3.2, CodeLlama | | **Mistral** | Mistral 7B, Mixtral 8x7B, Mixtral 8x22B | | **Qwen** | Qwen, Qwen2, Qwen2.5, Qwen2-MoE | | **Microsoft** | Phi-1, Phi-2, Phi-3 | | **Google** | Gemma, Gemma 2 | | **Falcon** | Falcon 7B, 40B, 180B | | **Code Models** | StarCoder, StarCoder2, CodeGen | | **Chinese** | ChatGLM, Yi, Baichuan, InternLM | | **Other** | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX | --- ## Memory Optimization ### For Large Models ```python # Enable gradient checkpointing (for training) model = turbo("meta-llama/Llama-3-70B", bits=4) # Use CPU offloading model = turbo( "meta-llama/Llama-3-70B", bits=4, device_map="auto", # Automatic CPU/GPU split ) ``` ### Memory Usage Estimates | Model Size | 4-bit | 8-bit | FP16 | |------------|-------|-------|------| | 3B | ~2 GB | ~4 GB | ~6 GB | | 7B | ~4 GB | ~8 GB | ~14 GB | | 13B | ~8 GB | ~14 GB | ~26 GB | | 70B | ~40 GB | ~70 GB | ~140 GB | --- ## Best Practices 1. **Start with automatic settings** — Let QuantLLM detect your hardware 2. **Use 4-bit for most cases** — Best balance of quality and memory 3. **Check memory first** — `turbo()` shows memory stats before loading 4. **Use GGUF for inference** — Pre-quantized GGUF models load faster --- ## Next Steps - [Text Generation →](generation.md) — Generate text with your model - [Fine-tuning →](finetuning.md) — Train with your own data - [GGUF Export →](gguf-export.md) — Export for deployment