QuantLLM
Getting Started
📦 Installation
Requirements
Quick Install
From GitHub (Recommended)
From PyPI
Installation Options
From Source (Development)
Verify Installation
Optional Dependencies
Flash Attention (Faster Inference)
Triton Kernels (GPU Optimization)
Troubleshooting
CUDA Not Available
Flash Attention Build Errors
Memory Issues
Windows Issues
Hardware Requirements
Next Steps
🚀 Quick Start
Your First Model
Basic Usage
Generate Text
Chat Mode
Streaming Output
Export to Different Formats
GGUF (llama.cpp, Ollama, LM Studio)
ONNX (ONNX Runtime, TensorRT)
MLX (Apple Silicon)
SafeTensors (HuggingFace)
Fine-Tune Your Model
Push to HuggingFace
Configuration Options
Override Auto-Detection
View Current Configuration
Load GGUF Models
Show the Banner
Next Steps
User Guide
📥 Loading Models
Basic Loading
The turbo() Function
What Happens Automatically
Quantization Options
Automatic (Recommended)
Manual Bit-Width
Disable Quantization
Configuration Options
Common Options
New Architecture Fallbacks (for very recent model releases)
Pre-quantized HuggingFace repos
from_config_only is for skeleton inspection only
Fast contribution template for new architectures
Inspecting the loaded state
Memory Options
Using TurboModel Directly
Load GGUF Models
List Available GGUF Files
Supported Models
Memory Optimization
For Large Models
Memory Usage Estimates
Best Practices
Next Steps
💬 Text Generation
Basic Generation
Generation Parameters
Temperature & Sampling
Controlling Output
Parameter Guide
Chat Mode
Multi-Turn Conversation
Streaming
Streaming with Chat
Stop Strings
Batch Generation
Common Use Cases
Factual Q&A
Creative Writing
Code Generation
Summarization
Best Practices
Next Steps
🎓 Fine-Tuning
Quick Start
Data Formats
Instruction Format (Recommended)
Simple Text Format
Prompt-Completion Format
HuggingFace Datasets
Training Parameters
Basic Training
Advanced Training
LoRA Configuration
Choosing LoRA Rank
Training with Hub Integration
After Training
Test Your Model
Export the Model
Save and Load
Tips & Best Practices
Data Quality
Training Settings
Memory Management
Avoiding Overfitting
Common Issues
Out of Memory
Training Loss Not Decreasing
Model Outputs Garbage
Next Steps
📦 GGUF Export
Quick Export
Quantization Types
Size Comparison (7B Model)
Export Examples
Different Quantization Types
Using Exported Models
With llama.cpp
With Ollama
With LM Studio
With Python (llama-cpp-python)
Push to HuggingFace
Direct Conversion
Quantize Existing GGUF
List Available Quantization Types
Troubleshooting
BitsAndBytes Models
Large Models
Windows Issues
Best Practices
Next Steps
🤗 Hub Integration
Quick Push
Setup
Get Your Token
Push Methods
Method 1: TurboModel.push() (Recommended)
Method 2: QuantLLMHubManager (Advanced)
Auto-Generated Model Cards
YAML Frontmatter
Format-Specific Usage Examples
Pull Models
List GGUF Files
Private Repositories
Fine-Tuning with Hub Tracking
Commit Messages
Multiple Formats
Best Practices
Troubleshooting
Authentication Error
Repository Already Exists
Large File Issues
Next Steps
API Reference
🚀 turbo()
Signature
Parameters
Returns
Examples
Basic Usage
With Custom Settings
Without Quantization
Local Model
Silent Loading
Auto-Configuration
Output
See Also
🔥 TurboModel
Class Overview
Class Methods
from_pretrained()
from_gguf()
list_gguf_files()
Instance Methods
generate()
chat()
finetune()
export()
push() / push_to_hub()
SmartConfig
SmartConfig.detect()
print_summary()
See Also
📦 GGUF API
Quick Reference
convert_to_gguf()
Parameters
Returns
Example
quantize_gguf()
Parameters
Example
GGUF_QUANT_TYPES
Quantization Comparison
QUANT_RECOMMENDATIONS
check_llama_cpp()
Example
install_llama_cpp()
Parameters
Example
ensure_llama_cpp_installed()
Example
export_to_gguf()
Using Exported Models
llama.cpp
Ollama
LM Studio
Python (llama-cpp-python)
See Also
🤗 Hub API
Quick Reference
TurboModel.push()
Parameters
Supported Formats
Examples
QuantLLMHubManager
Parameters
Methods
login()
track_hyperparameters()
save_final_model()
push()
Complete Workflow
Fine-Tune and Push
Export and Push
Auto-Generated Model Cards
YAML Frontmatter
Format-Specific Usage
ModelCardGenerator
Environment Variables
Best Practices
See Also
Index