🚀 turbo()

The main entry point for QuantLLM — load any model in one line.

Signature

def turbo(
    model: str,
    *,
    bits: Optional[int] = None,
    max_length: Optional[int] = None,
    device: Optional[str] = None,
    dtype: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    quantize: bool = True,
    trust_remote_code: bool = False,
    verbose: bool = True,
    **kwargs
) -> TurboModel

Parameters

Parameter	Type	Default	Description
`model`	str	required	HuggingFace model name or local path
`bits`	int	auto	Quantization bits (4, 8, 16)
`max_length`	int	auto	Maximum context length
`device`	str	auto	Device ("cuda", "cpu", "cuda:0", "auto")
`dtype`	str	auto	Data type ("float16", "bfloat16")
`config`	dict	None	Shared export/push defaults (`format`, `quantization`, `push_format`, `push_quantization`)
`quantize`	bool	True	Whether to apply quantization
`trust_remote_code`	bool	False	Allow custom code shipped with the model repository to run during loading
`verbose`	bool	True	Show loading progress and stats
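
The `config` dictionary is the only parameter above that takes structured input, so a minimal sketch of assembling one may help. The four key names come from the table; the `build_config` helper and the example values (`"gguf"`, `"Q4_K_M"`) are illustrative assumptions, not confirmed QuantLLM behavior.

```python
# Keys listed in the parameter table for `config`.
ALLOWED_CONFIG_KEYS = {"format", "quantization", "push_format", "push_quantization"}


def build_config(**defaults):
    """Hypothetical helper: reject keys the table does not list."""
    unknown = set(defaults) - ALLOWED_CONFIG_KEYS
    if unknown:
        raise ValueError(f"Unsupported config keys: {sorted(unknown)}")
    return dict(defaults)


# Example values are assumptions chosen for illustration only.
config = build_config(format="gguf", quantization="Q4_K_M")

# With quantllm installed, the dictionary would be passed like this:
# model = turbo("meta-llama/Llama-3.2-3B", config=config)
```

Validating the keys up front catches typos such as `push_fromat` before a long model download starts.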

Returns

A TurboModel instance ready for generation, fine-tuning, and export.

Examples

Basic Usage

from quantllm import turbo

# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("What is machine learning?")
print(response)

With Custom Settings

model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,                    # Force 4-bit quantization
    max_length=4096,           # Context length
    device="cuda:0",           # Specific GPU
    dtype="bfloat16",          # Use bfloat16
)

Without Quantization

# Load the original weights without quantization
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)

Local Model

model = turbo("./my-local-model/")

Silent Loading

model = turbo("meta-llama/Llama-3.2-3B", verbose=False)

Auto-Configuration

When parameters are not specified, turbo() automatically:

1. Detects hardware
   - GPU memory and CUDA version
   - CPU cores and available RAM
   - Flash Attention availability
2. Analyzes the model
   - Parameter count and size
   - Architecture type
   - Optimal settings
3. Chooses quantization
   - 4-bit if GPU memory < 16 GB
   - 8-bit if GPU memory >= 16 GB
   - No quantization if disabled with quantize=False
4. Enables optimizations
   - Flash Attention 2 when available
   - torch.compile for training
   - Dynamic memory management
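
The quantization rule above can be sketched in a few lines. This is not QuantLLM's internal code: `detect_gpu_memory_gb` and `choose_bits` are hypothetical names, the precedence of `quantize=False` over an explicit `bits` is an assumption, and the source does not say what happens on a CPU-only machine or whether "16GB" means GB or GiB (the probe below reports GiB).

```python
from typing import Optional


def detect_gpu_memory_gb() -> Optional[float]:
    """Return total memory of GPU 0 in GiB, or None if no CUDA GPU is usable."""
    try:
        import torch  # optional dependency; absence means no GPU info
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def choose_bits(gpu_memory_gb: float, quantize: bool = True,
                bits: Optional[int] = None) -> Optional[int]:
    """Apply the documented thresholds; None means 'do not quantize'."""
    if not quantize:          # assumption: disabling wins over an explicit bits
        return None
    if bits is not None:      # an explicit request skips auto-selection
        return bits
    return 4 if gpu_memory_gb < 16 else 8


print(choose_bits(8))                    # 4
print(choose_bits(16))                   # 8
print(choose_bits(24, quantize=False))   # None

memory = detect_gpu_memory_gb()
if memory is not None:
    print(f"Detected {memory:.1f} GiB, would pick {choose_bits(memory)}-bit")
```

Passing `bits=4` or `quantize=False` to `turbo()` bypasses this kind of automatic choice, which is useful when results must be reproducible across machines with different GPUs.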

Output

When verbose=True (default), you’ll see:

╔════════════════════════════════════════════════════════════╗
║  🚀 QuantLLM v2.1.0rc1                                        ║
╚════════════════════════════════════════════════════════════╝

📊 Loading: meta-llama/Llama-3.2-3B
   Parameters: 3.21B
   Original: 6.4 GB
   Quantized: 1.9 GB (70% saved)
   
✓ Model loaded successfully
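
The figures in this sample output can be checked with simple arithmetic. The sketch below assumes that "Original" refers to 16-bit weights (2 bytes per parameter) and that sizes are reported in decimal gigabytes; neither is stated in the output itself.

```python
params = 3.21e9               # "Parameters: 3.21B"
bytes_per_param = 2           # assumption: float16/bfloat16 weights

original_gb = params * bytes_per_param / 1e9
quantized_gb = 1.9            # "Quantized: 1.9 GB"

saved = 1 - quantized_gb / original_gb
bits_per_param = quantized_gb * 1e9 * 8 / params

print(f"Original: {original_gb:.1f} GB")          # Original: 6.4 GB
print(f"Saved: {saved:.0%}")                      # Saved: 70%
print(f"Effective bits/param: {bits_per_param:.1f}")  # Effective bits/param: 4.7
```

The effective width of roughly 4.7 bits per parameter is what one would expect from 4-bit weights plus some overhead, such as scaling factors or layers kept at higher precision, though the exact breakdown depends on the quantization scheme.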

See Also

TurboModel — Full class documentation
SmartConfig — Configuration details
Loading Models Guide — Detailed loading guide