turbo()
The main entry point for QuantLLM: load any model in one line.
Signature
def turbo(
    model: str,
    *,
    bits: Optional[int] = None,
    max_length: Optional[int] = None,
    device: Optional[str] = None,
    dtype: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    quantize: bool = True,
    trust_remote_code: bool = False,
    verbose: bool = True,
    **kwargs
) -> TurboModel
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | HuggingFace model name or local path |
| bits | int | auto | Quantization bits (4, 8, 16) |
| max_length | int | auto | Maximum context length |
| device | str | auto | Device ("cuda", "cpu", "cuda:0", "auto") |
| dtype | str | auto | Data type ("float16", "bfloat16") |
| config | dict | None | Shared export/push defaults |
| quantize | bool | True | Whether to apply quantization |
| trust_remote_code | bool | False | Trust remote code in model |
| verbose | bool | True | Show loading progress and stats |
Returns
A TurboModel instance ready for generation, fine-tuning, and export.
Examples
Basic Usage
from quantllm import turbo
# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")
# Generate text
response = model.generate("What is machine learning?")
print(response)
With Custom Settings
model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,            # Force 4-bit quantization
    max_length=4096,   # Context length
    device="cuda:0",   # Specific GPU
    dtype="bfloat16",  # Use bfloat16
)
Without Quantization
# Load in full precision
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)
Local Model
model = turbo("./my-local-model/")
Silent Loading
model = turbo("meta-llama/Llama-3.2-3B", verbose=False)
Auto-Configuration
When parameters are not specified, turbo() automatically:
1. Detects hardware
   - GPU memory and CUDA version
   - CPU cores and available RAM
   - Flash Attention availability
2. Analyzes model
   - Parameter count and size
   - Architecture type
   - Optimal settings
3. Chooses quantization
   - 4-bit if GPU memory < 16 GB
   - 8-bit if GPU memory >= 16 GB
   - No quantization if explicitly disabled
4. Enables optimizations
   - Flash Attention 2 when available
   - torch.compile for training
   - Dynamic memory management
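The quantization rule above can be sketched as a small pure function. This is an illustration of the documented thresholds only, not QuantLLM's actual implementation: the name `choose_bits`, its arguments, and the precedence of `quantize=False` over an explicit `bits` value are assumptions made for the example.

```python
from typing import Optional

# Threshold taken from the rule above. Everything else in this block
# (function name, argument names, precedence order) is hypothetical.
GPU_MEMORY_THRESHOLD_GB = 16.0


def choose_bits(
    gpu_memory_gb: float,
    bits: Optional[int] = None,
    quantize: bool = True,
) -> Optional[int]:
    """Return the quantization width in bits, or None for no quantization."""
    if not quantize:
        # Assumption: quantize=False wins even if bits is also given.
        return None
    if bits is not None:
        # An explicit value overrides the automatic choice.
        if bits not in (4, 8, 16):
            raise ValueError(f"bits must be 4, 8, or 16, got {bits}")
        return bits
    # Automatic choice: 4-bit below the threshold, 8-bit at or above it.
    return 4 if gpu_memory_gb < GPU_MEMORY_THRESHOLD_GB else 8


print(choose_bits(8.0))                   # 4
print(choose_bits(24.0))                  # 8
print(choose_bits(24.0, quantize=False))  # None
```

Note that a GPU with exactly 16 GB falls on the 8-bit side, since the rule uses `>=` for the upper branch.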
Output
When verbose=True (default), you'll see:
╔════════════════════════════════════════════════════════════╗
║  QuantLLM v2.1.0rc1                                        ║
╚════════════════════════════════════════════════════════════╝
Loading: meta-llama/Llama-3.2-3B
   Parameters: 3.21B
   Original: 6.4 GB
   Quantized: 1.9 GB (70% saved)
✓ Model loaded successfully
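The figures in this sample output can be checked with simple arithmetic. The sketch below assumes the "Original" size corresponds to 16-bit weights and that 1 GB means 10^9 bytes; both helper names are hypothetical and not part of QuantLLM.

```python
def weight_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB, assuming 1 GB = 1e9 bytes."""
    return num_params * bits_per_param / 8 / 1e9


def percent_saved(original_gb: float, quantized_gb: float) -> float:
    """Percentage of storage saved relative to the original size."""
    return (1 - quantized_gb / original_gb) * 100


# 3.21B parameters at 16 bits each (assumed precision of the original weights).
fp16_gb = weight_size_gb(3.21e9, 16)
print(f"{fp16_gb:.1f} GB")                      # 6.4 GB
print(f"{percent_saved(6.4, 1.9):.0f}% saved")  # 70% saved
```

Pure 4-bit weights for the same parameter count would take about 1.6 GB by this estimate, so the reported 1.9 GB presumably includes extra overhead. What that overhead consists of is not stated in the output.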
See Also
TurboModel: Full class documentation
SmartConfig: Configuration details
Loading Models Guide: Detailed loading guide