🚀 turbo()

The main entry point for QuantLLM: load any model in one line.


Signature

def turbo(
    model: str,
    *,
    bits: Optional[int] = None,
    max_length: Optional[int] = None,
    device: Optional[str] = None,
    dtype: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    quantize: bool = True,
    trust_remote_code: bool = False,
    verbose: bool = True,
    **kwargs
) -> TurboModel
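
The bare `*` in the signature makes every argument after `model` keyword-only. A minimal sketch of what that means for callers, using a hypothetical stand-in named `turbo_stub` that only mirrors the signature above (it loads nothing and is not part of QuantLLM):

```python
from typing import Any, Dict, Optional


def turbo_stub(
    model: str,
    *,
    bits: Optional[int] = None,
    max_length: Optional[int] = None,
    device: Optional[str] = None,
    dtype: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    quantize: bool = True,
    trust_remote_code: bool = False,
    verbose: bool = True,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in: echoes back a few of the arguments it received."""
    return {"model": model, "bits": bits, "quantize": quantize, **kwargs}


# Keyword form is accepted.
print(turbo_stub("meta-llama/Llama-3.2-3B", bits=4))

# Positional form is rejected by Python itself, before any loading would start.
try:
    turbo_stub("meta-llama/Llama-3.2-3B", 4)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

In practice this means options such as `bits` or `device` always have to be spelled out by name, which keeps call sites readable.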

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | str | required | HuggingFace model name or local path |
| bits | int | auto | Quantization bits (4, 8, 16) |
| max_length | int | auto | Maximum context length |
| device | str | auto | Device ("cuda", "cpu", "cuda:0", "auto") |
| dtype | str | auto | Data type ("float16", "bfloat16") |
| config | dict | None | Shared export/push defaults (format, quantization, push_format, push_quantization) |
| quantize | bool | True | Whether to apply quantization |
| trust_remote_code | bool | False | Trust remote code in model |
| verbose | bool | True | Show loading progress and stats |


Returns

A TurboModel instance ready for generation, fine-tuning, and export.


Examples

Basic Usage

from quantllm import turbo

# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("What is machine learning?")
print(response)

With Custom Settings

model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,                    # Force 4-bit quantization
    max_length=4096,           # Context length
    device="cuda:0",           # Specific GPU
    dtype="bfloat16",          # Use bfloat16
)

Without Quantization

# Load in full precision
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)

Local Model

model = turbo("./my-local-model/")

Silent Loading

model = turbo("meta-llama/Llama-3.2-3B", verbose=False)

Auto-Configuration

When parameters are not specified, turbo() automatically:

  1. Detects hardware

    • GPU memory and CUDA version

    • CPU cores and available RAM

    • Flash Attention availability

  2. Analyzes model

    • Parameter count and size

    • Architecture type

    • Optimal settings

  3. Chooses quantization

    • 4-bit if GPU memory < 16GB

    • 8-bit if GPU memory >= 16GB

    • No quantization if explicitly disabled (quantize=False)

  4. Enables optimizations

    • Flash Attention 2 when available

    • torch.compile for training

    • Dynamic memory management
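
The detection and quantization steps above can be sketched roughly as follows. This is not QuantLLM's actual implementation: `detect_hardware` and `choose_bits` are hypothetical helpers, the `flash_attn` package name is an assumption, and only a subset of the listed checks is covered. The 16 GB threshold is the one stated in the list.

```python
import importlib.util
import os
from typing import Any, Dict, Optional


def detect_hardware() -> Dict[str, Any]:
    """Collect a subset of the hardware facts from step 1 (hypothetical helper)."""
    info: Dict[str, Any] = {
        "cpu_cores": os.cpu_count(),
        "gpu_memory_gb": None,
        "cuda_version": None,
        # Assumption: Flash Attention is installed as the `flash_attn` package.
        "flash_attention": importlib.util.find_spec("flash_attn") is not None,
    }
    try:
        import torch  # optional; without it the sketch reports no GPU
    except ImportError:
        return info
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["gpu_memory_gb"] = props.total_memory / 1024**3
        info["cuda_version"] = torch.version.cuda
    return info


def choose_bits(gpu_memory_gb: float, quantize: bool = True) -> Optional[int]:
    """Apply the rule from step 3: 4-bit below 16 GB, 8-bit otherwise."""
    if not quantize:
        return None  # quantization explicitly disabled
    return 4 if gpu_memory_gb < 16 else 8


hw = detect_hardware()
print(hw)
if hw["gpu_memory_gb"] is not None:
    print("bits:", choose_bits(hw["gpu_memory_gb"]))
```

The list does not say what happens when no GPU is detected, so the sketch leaves that case to the caller instead of guessing.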


Output

When verbose=True (default), you'll see:

╔════════════════════════════════════════════════════════════╗
║  🚀 QuantLLM v2.1.0rc1                                     ║
╚════════════════════════════════════════════════════════════╝

📊 Loading: meta-llama/Llama-3.2-3B
   Parameters: 3.21B
   Original: 6.4 GB
   Quantized: 1.9 GB (70% saved)

✓ Model loaded successfully
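
The figures in this sample output can be checked with a little arithmetic. The helpers below are hypothetical and count raw weight storage only, using decimal gigabytes (1 GB = 1e9 bytes); how QuantLLM itself computes these numbers is not stated here.

```python
def percent_saved(original_gb: float, quantized_gb: float) -> float:
    """Share of the original size removed by quantization, in percent."""
    return 100.0 * (1.0 - quantized_gb / original_gb)


def weights_size_gb(num_params: float, bits: int) -> float:
    """Raw weight storage in decimal GB, ignoring any metadata or overhead."""
    return num_params * bits / 8 / 1e9


print(f"{percent_saved(6.4, 1.9):.0f}% saved")             # 70% saved
print(f"{weights_size_gb(3.21e9, 16):.1f} GB at 16-bit")   # 6.4 GB at 16-bit
print(f"{weights_size_gb(3.21e9, 4):.1f} GB at 4-bit")     # 1.6 GB at 4-bit
```

The 16-bit estimate matches the reported original size. The pure 4-bit estimate (about 1.6 GB) is lower than the reported 1.9 GB, which would be consistent with some tensors staying in higher precision plus quantization metadata, though that explanation is an assumption.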

See Also