# 🚀 Quick Start

Get up and running with QuantLLM in 5 minutes.

---

## Your First Model

```python
from quantllm import turbo

# Load any HuggingFace model with automatic optimization
model = turbo("meta-llama/Llama-3.2-3B")

# Generate text
response = model.generate("Explain machine learning in simple terms")
print(response)
```

**That's it!** QuantLLM automatically:

- ✅ Detects your GPU and available memory
- ✅ Applies optimal 4-bit quantization
- ✅ Enables Flash Attention 2 when available
- ✅ Configures memory management

---

## Basic Usage

### Generate Text

```python
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
```

### Chat Mode

```python
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages, max_new_tokens=200)
print(response)
```

### Streaming Output

```python
for token in model.generate("Count to 10:", stream=True):
    print(token, end="", flush=True)
```

---

## Export to Different Formats

### GGUF (llama.cpp, Ollama, LM Studio)

```python
# Export with recommended Q4_K_M quantization
model.export("gguf", "model.Q4_K_M.gguf", quantization="Q4_K_M")

# Other quantization options
model.export("gguf", "model.Q8_0.gguf", quantization="Q8_0")  # Higher quality
model.export("gguf", "model.Q2_K.gguf", quantization="Q2_K")  # Smallest size
```

### ONNX (ONNX Runtime, TensorRT)

```python
model.export("onnx", "./model-onnx/")
```

### MLX (Apple Silicon)

```python
model.export("mlx", "./model-mlx/", quantization="4bit")
```

### SafeTensors (HuggingFace)

```python
model.export("safetensors", "./model-hf/")
```

---

## Fine-Tune Your Model

Train with your own data in one line:

```python
# Simple training
model.finetune("training_data.json", epochs=3)

# With more control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=16,
    batch_size=4,
)
```

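Before calling `finetune`, it can help to check the training file first. The sketch below uses only the Python standard library to write a small `training_data.json` and to check that every record carries the keys of at least one record shape shown under the supported data formats in this guide. The helper `validate_records` and the idea of treating those shapes as required-key sets are assumptions made for illustration; they are not part of QuantLLM.

```python
import json
from pathlib import Path

# Record shapes taken from the "Supported data formats" example in this guide.
# Treating them as required-key sets is an assumption made for this sketch.
SUPPORTED_SHAPES = [
    {"instruction", "output"},
    {"text"},
    {"prompt", "completion"},
]


def validate_records(records):
    """Return the indices of records that match none of the shapes above.

    Hypothetical helper for illustration; not part of QuantLLM.
    """
    bad = []
    for i, record in enumerate(records):
        keys = set(record) if isinstance(record, dict) else set()
        if not any(shape <= keys for shape in SUPPORTED_SHAPES):
            bad.append(i)
    return bad


records = [
    {"instruction": "What is Python?", "output": "Python is a programming language."},
    {"text": "Full text for language modeling"},
    {"prompt": "Question here", "completion": "Answer here"},
]

bad_indices = validate_records(records)
if bad_indices:
    raise ValueError(f"Records with unsupported keys: {bad_indices}")

# Write the file name used by the finetune() examples in this guide.
path = Path("training_data.json")
path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")
print(f"Wrote {len(records)} records to {path}")
```

Running a check like this up front tends to surface malformed records before any training time is spent.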
**Supported data formats:**

```json
[
  {"instruction": "What is Python?", "output": "Python is a programming language..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question here", "completion": "Answer here"}
]
```

---

## Push to HuggingFace

Share your model with the world:

```python
# Push with auto-generated model card
model = turbo(
    "meta-llama/Llama-3.2-3B",
    config={"format": "gguf", "quantization": "Q4_K_M", "push_format": "gguf"},
)
model.push(
    "your-username/my-awesome-model",
    license="apache-2.0"
)
```

The model card includes:

- ✅ Proper YAML frontmatter for HuggingFace
- ✅ Format-specific usage examples
- ✅ "Use this model" button compatibility
- ✅ Quantization details

---

## Configuration Options

### Override Auto-Detection

```python
model = turbo(
    "meta-llama/Llama-3.2-3B",
    bits=4,             # Force 4-bit quantization
    max_length=4096,    # Context length
    device="cuda:0",    # Specific GPU
    dtype="bfloat16",   # Data type
)
```

### View Current Configuration

```python
print(model.config)
```

---

## Load GGUF Models

Load pre-quantized GGUF models directly:

```python
from quantllm import TurboModel

model = TurboModel.from_gguf(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf"
)
print(model.generate("Hello!"))
```

---

## Show the Banner

Display the QuantLLM banner anytime:

```python
import quantllm

quantllm.show_banner()
```

```
╔════════════════════════════════════════════════════════════╗
║                                                            ║
║   🚀 QuantLLM v2.1.0rc1                                    ║
║   Ultra-fast LLM Quantization & Export                     ║
║                                                            ║
║   ✓ GGUF  ✓ ONNX  ✓ MLX  ✓ SafeTensors                     ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝
```

---

## Next Steps

Now that you know the basics, explore more:

- [Loading Models →](guide/loading-models.md): Advanced model loading options
- [Text Generation →](guide/generation.md): Generation parameters and modes
- [GGUF Export →](guide/gguf-export.md): All quantization types explained
- [Fine-tuning →](guide/finetuning.md): Training with LoRA
- [Hub Integration →](guide/hub-integration.md): Push and pull from HuggingFace
- [API Reference →](api/turbomodel.md): Full API documentation