# 🚀 turbo()

The main entry point for QuantLLM: load a Hugging Face or local model in one line.

---

## Signature

```python
def turbo(
    model: str,
    *,
    bits: Optional[int] = None,
    max_length: Optional[int] = None,
    device: Optional[str] = None,
    dtype: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    quantize: bool = True,
    trust_remote_code: bool = False,
    verbose: bool = True,
    **kwargs
) -> TurboModel
```

---

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | str | required | Hugging Face model name or local path |
| `bits` | int | `None` (auto) | Quantization bit width (4, 8, or 16) |
| `max_length` | int | `None` (auto) | Maximum context length in tokens |
| `device` | str | `None` (auto) | Device ("cuda", "cpu", "cuda:0", "auto") |
| `dtype` | str | `None` (auto) | Data type ("float16", "bfloat16") |
| `config` | dict | `None` | Shared export/push defaults (`format`, `quantization`, `push_format`, `push_quantization`) |
| `quantize` | bool | `True` | Whether to apply quantization |
| `trust_remote_code` | bool | `False` | Allow custom code shipped with the model repository to run during loading |
| `verbose` | bool | `True` | Show loading progress and stats |
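
The `config` dictionary is the least self-explanatory parameter, so here is a minimal sketch of how one might assemble it. Only the four key names come from the table above. The helper `build_turbo_config` is hypothetical (not part of QuantLLM), and the values `"gguf"` and `"q4_k_m"` are placeholders assumed for illustration, since accepted values are not listed on this page.

```python
from typing import Any, Dict

# Key names taken from the parameter table; accepted values are not documented here.
ALLOWED_CONFIG_KEYS = {"format", "quantization", "push_format", "push_quantization"}


def build_turbo_config(**defaults: Any) -> Dict[str, Any]:
    """Hypothetical helper: collect export/push defaults and reject unknown keys."""
    unknown = set(defaults) - ALLOWED_CONFIG_KEYS
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return dict(defaults)


# Placeholder values, assumed for illustration only.
config = build_turbo_config(format="gguf", quantization="q4_k_m")

# The dictionary would then be passed through the documented keyword:
# model = turbo("meta-llama/Llama-3.2-3B", config=config)
```

Catching a misspelled key before the model loads is cheaper than discovering it at export time, which is the only reason for the extra helper.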

---

## Returns

A [`TurboModel`](turbomodel.md) instance ready for generation, fine-tuning, and export.

---

## Examples

### Basic Usage

```python
from quantllm import turbo

# Load with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("What is machine learning?")
print(response)
```

### With Custom Settings

```python
model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,                    # Force 4-bit quantization
    max_length=4096,           # Context length
    device="cuda:0",           # Specific GPU
    dtype="bfloat16",          # Use bfloat16
)
```

### Without Quantization

```python
# Skip quantization and load the unquantized weights
model = turbo("meta-llama/Llama-3.2-3B", quantize=False)
```

### Local Model

```python
model = turbo("./my-local-model/")
```

### Silent Loading

```python
model = turbo("meta-llama/Llama-3.2-3B", verbose=False)
```

---

## Auto-Configuration

When parameters are not specified, `turbo()` automatically:

1. **Detects hardware**
   - GPU memory and CUDA version
   - CPU cores and available RAM
   - Flash Attention availability

2. **Analyzes model**
   - Parameter count and size
   - Architecture type
   - Optimal settings

3. **Chooses quantization**
   - 4-bit if GPU memory is under 16 GB
   - 8-bit if GPU memory is 16 GB or more
   - No quantization if disabled with `quantize=False`

4. **Enables optimizations**
   - Flash Attention 2 when available
   - `torch.compile` for training
   - Dynamic memory management
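
The quantization rule in step 3 can be sketched as a small function. This is an illustration of the thresholds listed above, not QuantLLM's actual implementation: the name `choose_bits` is hypothetical, returning `None` to mean "no quantization" is an assumption, and so is the choice to let `quantize=False` take precedence over an explicit `bits` value.

```python
from typing import Optional


def choose_bits(
    gpu_memory_gb: float,
    quantize: bool = True,
    bits: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical sketch of the documented bit-width selection."""
    if not quantize:
        # Quantization explicitly disabled (assumed to override `bits`).
        return None
    if bits is not None:
        # An explicit bit width skips auto-selection.
        return bits
    # Documented thresholds: 4-bit under 16 GB, 8-bit at 16 GB or more.
    return 4 if gpu_memory_gb < 16 else 8


print(choose_bits(8))                    # 4
print(choose_bits(24))                   # 8
print(choose_bits(24, quantize=False))   # None
```

The steps above do not say what happens without a GPU, so the sketch leaves that case out.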

---

## Output

When `verbose=True` (default), you'll see:

```
╔════════════════════════════════════════════════════════════╗
║  🚀 QuantLLM v2.1.0rc1                                        ║
╚════════════════════════════════════════════════════════════╝

📊 Loading: meta-llama/Llama-3.2-3B
   Parameters: 3.21B
   Original: 6.4 GB
   Quantized: 1.9 GB (70% saved)
   
✓ Model loaded successfully
```
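
The figures in the sample output can be checked with a little arithmetic. The sketch below assumes the "Original" size corresponds to 16-bit weights (2 bytes per parameter) and that 1 GB means 10^9 bytes. Both are assumptions that happen to reproduce the numbers shown, and the function names are hypothetical.

```python
def fp16_size_gb(num_params: float) -> float:
    """Size of the weights at 2 bytes per parameter, in GB (10^9 bytes)."""
    return num_params * 2 / 1e9


def percent_saved(original_gb: float, quantized_gb: float) -> float:
    """Share of the original size removed by quantization, in percent."""
    return (1 - quantized_gb / original_gb) * 100


print(f"{fp16_size_gb(3.21e9):.1f} GB")          # 6.4 GB
print(f"{percent_saved(6.4, 1.9):.0f}% saved")   # 70% saved
```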

---

## See Also

- [TurboModel](turbomodel.md) — Full class documentation
- [SmartConfig](turbomodel.md#smartconfig) — Configuration details
- [Loading Models Guide](../guide/loading-models.md) — Detailed loading guide