# 🔥 TurboModel

The unified model class for loading, generating, fine-tuning, and exporting.

---

## Class Overview

```python
class TurboModel:
    """Ultra-fast LLM with auto-configuration."""
    
    model: PreTrainedModel           # The underlying HuggingFace model
    tokenizer: PreTrainedTokenizer   # The tokenizer
    config: SmartConfig              # Auto-detected configuration
```

---

## Class Methods

### from_pretrained()

Load a model from the HuggingFace Hub or a local path.

```python
@classmethod
def from_pretrained(
    cls,
    model_name: str,
    config: Optional[SmartConfig] = None,
    quantize: bool = True,
    verbose: bool = True,
    **kwargs
) -> "TurboModel"
```

**Example:**
```python
from quantllm import TurboModel, SmartConfig

# With auto-config
model = TurboModel.from_pretrained("meta-llama/Llama-3.2-3B")

# With custom config
config = SmartConfig.detect("meta-llama/Llama-3.2-3B", bits=4)
model = TurboModel.from_pretrained("meta-llama/Llama-3.2-3B", config=config)
```

### from_gguf()

Load a GGUF model from a HuggingFace repository or a local file.

```python
@classmethod
def from_gguf(
    cls,
    repo_id_or_path: str,
    filename: Optional[str] = None,
    **kwargs
) -> "TurboModel"
```

**Example:**
```python
# From HuggingFace
model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)

# From local file
model = TurboModel.from_gguf("./models/my-model.gguf")
```

### list_gguf_files()

List available GGUF files in a HuggingFace repository.

```python
@staticmethod
def list_gguf_files(repo_id: str) -> List[str]
```

**Example:**
```python
files = TurboModel.list_gguf_files("TheBloke/Llama-2-7B-Chat-GGUF")
print(files)
# ['llama-2-7b-chat.Q2_K.gguf', 'llama-2-7b-chat.Q4_K_M.gguf', ...]
```
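
The returned list can be filtered before calling `from_gguf()`. The sketch below picks a file by its quantization tag; `pick_gguf_file` is a hypothetical helper (not part of QuantLLM), and the hard-coded `files` list stands in for the real return value.

```python
from typing import List, Optional


def pick_gguf_file(files: List[str], quant: str = "Q4_K_M") -> Optional[str]:
    """Return the first .gguf filename containing the quantization tag, or None."""
    tag = quant.lower()
    for name in files:
        lowered = name.lower()
        if lowered.endswith(".gguf") and tag in lowered:
            return name
    return None


# Stand-in for TurboModel.list_gguf_files("TheBloke/Llama-2-7B-Chat-GGUF")
files = ["llama-2-7b-chat.Q2_K.gguf", "llama-2-7b-chat.Q4_K_M.gguf"]
chosen = pick_gguf_file(files)
print(chosen)
```

The chosen name can then be passed as `filename=` to `from_gguf()`. Matching is a plain case-insensitive substring test, so a tag such as `Q4_K` would also match `Q4_K_M` and `Q4_K_S` files.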

---

## Instance Methods

### generate()

Generate text from a prompt.

```python
def generate(
    self,
    prompt: str,
    max_new_tokens: int = 256,
    temperature: float = 0.7,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.0,
    do_sample: bool = True,
    stream: bool = False,
    stop_strings: Optional[List[str]] = None,
    **kwargs
) -> Union[str, Generator[str, None, None]]
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | str | required | Input text |
| `max_new_tokens` | int | 256 | Maximum tokens to generate |
| `temperature` | float | 0.7 | Sampling temperature (0.0-2.0) |
| `top_p` | float | 0.9 | Nucleus sampling threshold |
| `top_k` | int | 50 | Top-k sampling |
| `repetition_penalty` | float | 1.0 | Repetition penalty (1.0-1.5) |
| `do_sample` | bool | True | Sample from the distribution; if False, decode greedily |
| `stream` | bool | False | Stream tokens as generated |
| `stop_strings` | list | None | Stop generation at these strings |
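
To make the sampling parameters concrete, here is a minimal pure-Python sketch of how `temperature`, `top_k`, and `top_p` narrow a next-token distribution. It illustrates the general technique only: the function name, the toy logits, and the order of the three steps are assumptions, not a description of QuantLLM's internals.

```python
import math
from typing import Dict


def filter_distribution(
    logits: Dict[str, float],
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 0.9,
) -> Dict[str, float]:
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    # Temperature scaling (temperature must be > 0) and a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Toy logits for four candidate tokens (illustrative values).
logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1, "rock": -1.0}
print(filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution, so fewer tokens survive the `top_p` cut; higher temperatures flatten it and let more candidates through.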

**Example:**
```python
# Basic generation
response = model.generate("What is AI?")

# With parameters
response = model.generate(
    "Write a poem:",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
)

# Streaming
for token in model.generate("Count to 10:", stream=True):
    print(token, end="", flush=True)
```
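
When streaming, a stop string can arrive split across two chunks. The sketch below shows one way a consumer could cut a stream at the first stop string; `stream_until_stop` is a hypothetical helper and the list of chunks stands in for `model.generate(..., stream=True)`. How QuantLLM applies `stop_strings` internally is not shown here.

```python
from typing import Iterable, Iterator, List


def stream_until_stop(chunks: Iterable[str], stop_strings: List[str]) -> Iterator[str]:
    """Yield text from chunks, ending just before the first stop string."""
    stops = [s for s in stop_strings if s]  # ignore empty stop strings
    # Hold back enough trailing characters to catch a stop string split across chunks.
    holdback = max((len(s) for s in stops), default=1) - 1
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        hits = [i for i in (buffer.find(s) for s in stops) if i != -1]
        if hits:
            cut = min(hits)
            if cut > 0:
                yield buffer[:cut]
            return
        safe = len(buffer) - holdback
        if safe > 0:
            yield buffer[:safe]
            buffer = buffer[safe:]
    if buffer:
        yield buffer


# Stand-in for a token stream; the stop string "\nUser:" spans three chunks.
fake_stream = ["1, 2", ", 3\n", "Us", "er: hi"]
text = "".join(stream_until_stop(fake_stream, ["\nUser:"]))
print(repr(text))
```

The holdback delays output by at most the length of the longest stop string minus one character, which is the price of never emitting part of a stop string.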

### chat()

Chat with the model using a list of role-tagged messages.

```python
def chat(
    self,
    messages: List[Dict[str, str]],
    max_new_tokens: int = 256,
    stream: bool = False,
    **kwargs
) -> Union[str, Generator[str, None, None]]
```

**Messages format:**
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
]
```
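
A quick structural check before calling `chat()` can catch malformed input early. The sketch below is a hypothetical validator, not part of QuantLLM; the allowed roles and the rule that a system message must come first are assumptions based on the format shown above.

```python
from typing import Dict, List

# Assumed role set, taken from the example messages above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if messages does not follow the role/content layout."""
    if not messages:
        raise ValueError("messages must not be empty")
    for index, message in enumerate(messages):
        missing = {"role", "content"} - set(message)
        if missing:
            raise ValueError(f"message {index} is missing keys: {sorted(missing)}")
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {index} has unknown role {message['role']!r}")
        # Assumption: a system message is only meaningful as the first entry.
        if message["role"] == "system" and index != 0:
            raise ValueError("a system message must be the first message")


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
validate_messages(messages)  # passes silently for well-formed input
```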

**Example:**
```python
messages = [
    {"role": "system", "content": "You are a coding expert."},
    {"role": "user", "content": "How do I read a file in Python?"},
]

response = model.chat(messages)
print(response)
```

### finetune()

Fine-tune the model with LoRA.

```python
def finetune(
    self,
    data: Union[str, List[Dict], Dataset],
    epochs: int = 3,
    batch_size: int = 4,
    learning_rate: float = 2e-4,
    lora_r: int = 8,
    lora_alpha: int = 16,
    lora_dropout: float = 0.1,
    output_dir: Optional[str] = None,
    hub_manager: Optional[QuantLLMHubManager] = None,
    **kwargs
) -> Dict[str, Any]
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data` | str/list/Dataset | required | Training data |
| `epochs` | int | 3 | Training epochs |
| `batch_size` | int | 4 | Batch size |
| `learning_rate` | float | 2e-4 | Learning rate |
| `lora_r` | int | 8 | LoRA rank |
| `lora_alpha` | int | 16 | LoRA alpha |
| `lora_dropout` | float | 0.1 | LoRA dropout |
| `output_dir` | str | None | Save directory |

**Returns:** Dictionary with `train_loss`, `epochs`, `output_dir`.
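
To get a feel for `lora_r` and `lora_alpha`: in standard LoRA, a frozen weight matrix of shape `d_out × d_in` gains two small trainable matrices holding `r * (d_in + d_out)` parameters in total, and their product is scaled by `alpha / r`. The sketch below computes both quantities for a single layer. The 4096 × 4096 layer size is an illustrative assumption, and `lora_stats` is a hypothetical helper, not a QuantLLM function.

```python
from typing import Dict, Union


def lora_stats(
    d_in: int, d_out: int, lora_r: int = 8, lora_alpha: int = 16
) -> Dict[str, Union[int, float]]:
    """Trainable parameter count and scaling factor for one LoRA-adapted layer."""
    full = d_in * d_out                # parameters in the frozen weight matrix
    added = lora_r * (d_in + d_out)    # parameters in the low-rank pair A and B
    return {
        "trainable": added,
        "fraction": added / full,
        "scaling": lora_alpha / lora_r,
    }


# One 4096 x 4096 projection with the defaults lora_r=8, lora_alpha=16.
print(lora_stats(4096, 4096))
# {'trainable': 65536, 'fraction': 0.00390625, 'scaling': 2.0}
```

Doubling `lora_r` doubles the trainable parameters; keeping `lora_alpha` at twice `lora_r`, as both the defaults and the advanced example do, leaves the scaling factor at 2.0.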

**Example:**
```python
# Simple training
result = model.finetune("data.json", epochs=3)

# Advanced
result = model.finetune(
    "data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=16,
    lora_alpha=32,
    batch_size=4,
)
```

### export()

Export the model to various formats.

```python
def export(
    self,
    format: Optional[str] = None,
    output_path: Optional[str] = None,
    quantization: Optional[str] = None,
    **kwargs
) -> str
```

| Parameter | Type | Description |
|-----------|------|-------------|
| `format` | str | One of "gguf", "onnx", "mlx", "safetensors". Optional; falls back to the shared config when omitted |
| `output_path` | str | Output file or directory (optional) |
| `quantization` | str | Quantization type (format-specific) |

**Examples:**
```python
# GGUF: format and quantization come from the shared config passed to turbo()
from quantllm import turbo

model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.export()

# ONNX
model.export("onnx", "./model-onnx/")

# MLX
model.export("mlx", "./model-mlx/", quantization="4bit")

# SafeTensors
model.export("safetensors", "./model-hf/")
```

### push() / push_to_hub()

Push the model to the HuggingFace Hub.

```python
def push(
    self,
    repo_id: str,
    token: Optional[str] = None,
    format: Optional[str] = None,
    quantization: Optional[str] = None,
    license: str = "apache-2.0",
    commit_message: str = "Upload model via QuantLLM",
    **kwargs
)
```

**Example:**
```python
# Push using the format from the shared config (e.g. GGUF)
model.push(
    "your-username/my-model"
)

# Push as MLX
model.push(
    "your-username/my-model-mlx",
    format="mlx",
    quantization="4bit"
)
```

---

## SmartConfig

Auto-detected configuration for optimal performance.

```python
@dataclass
class SmartConfig:
    bits: int = 4
    quant_type: str = "nf4"
    use_flash_attention: bool = True
    gradient_checkpointing: bool = False
    cpu_offload: bool = False
    compile_model: bool = False
    batch_size: int = 4
    max_seq_length: int = 4096
    device: torch.device = "cuda"
    dtype: torch.dtype = torch.float16
```
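
The `bits` field has the largest effect on memory. As a rough rule of thumb, weight storage is about `parameters * bits / 8` bytes. The sketch below applies that estimate; it ignores quantization constants, activations, and the KV cache, and the 3-billion parameter count is a round illustrative figure. `estimate_weight_gib` is a hypothetical helper, not part of `SmartConfig`.

```python
def estimate_weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight memory in GiB: parameters * bits / 8 bytes."""
    return num_params * bits / 8 / 2**30


# Compare 16-bit and 4-bit storage for a model with roughly 3 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {estimate_weight_gib(3e9, bits):.2f} GiB")
```

Real usage will be higher than this estimate, so treat it as a lower bound when deciding whether a model fits on a given device.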

### SmartConfig.detect()

Auto-detect optimal configuration.

```python
@classmethod
def detect(
    cls,
    model_name: str,
    bits: Optional[int] = None,
    training: bool = False,
) -> SmartConfig
```

**Example:**
```python
from quantllm import SmartConfig

config = SmartConfig.detect("meta-llama/Llama-3.2-3B")
print(f"Bits: {config.bits}")
print(f"Flash Attention: {config.use_flash_attention}")
```

### print_summary()

Print configuration summary.

```python
config.print_summary()
```

**Output:**
```
╔════════════════════════════════════════════════════╗
║          QUANTLLM CONFIGURATION                    ║
╠════════════════════════════════════════════════════╣
║ 📦 Quantization: 4-bit (nf4)                       ║
║ 💾 Memory: CPU Offload Disabled                    ║
║ ⚡ Speed: Flash Attention Enabled                  ║
╚════════════════════════════════════════════════════╝
```

---

## See Also

- [turbo()](turbo.md) — Quick loading function
- [GGUF API](gguf.md) — GGUF export details
- [Hub API](hub.md) — HuggingFace integration