QuantLLM

Getting Started

  • 📦 Installation
    • Requirements
    • Quick Install
      • From GitHub (Recommended)
      • From PyPI
    • Installation Options
    • From Source (Development)
    • Verify Installation
    • Optional Dependencies
      • Flash Attention (Faster Inference)
      • Triton Kernels (GPU Optimization)
    • Troubleshooting
      • CUDA Not Available
      • Flash Attention Build Errors
      • Memory Issues
      • Windows Issues
    • Hardware Requirements
    • Next Steps
  • 🚀 Quick Start
    • Your First Model
    • Basic Usage
      • Generate Text
      • Chat Mode
      • Streaming Output
    • Export to Different Formats
      • GGUF (llama.cpp, Ollama, LM Studio)
      • ONNX (ONNX Runtime, TensorRT)
      • MLX (Apple Silicon)
      • SafeTensors (HuggingFace)
    • Fine-Tune Your Model
    • Push to HuggingFace
    • Configuration Options
      • Override Auto-Detection
      • View Current Configuration
    • Load GGUF Models
    • Show the Banner
    • Next Steps

User Guide

  • 📥 Loading Models
    • Basic Loading
      • The turbo() Function
      • What Happens Automatically
    • Quantization Options
      • Automatic (Recommended)
      • Manual Bit-Width
      • Disable Quantization
    • Configuration Options
      • Common Options
      • New Architecture Fallbacks (for very recent model releases)
        • Pre-quantized HuggingFace repos
        • from_config_only is for skeleton inspection only
        • Fast contribution template for new architectures
        • Inspecting the loaded state
      • Memory Options
    • Using TurboModel Directly
    • Load GGUF Models
      • List Available GGUF Files
    • Supported Models
    • Memory Optimization
      • For Large Models
      • Memory Usage Estimates
    • Best Practices
    • Next Steps
  • 💬 Text Generation
    • Basic Generation
    • Generation Parameters
      • Temperature & Sampling
      • Controlling Output
      • Parameter Guide
    • Chat Mode
      • Multi-Turn Conversation
    • Streaming
      • Streaming with Chat
    • Stop Strings
    • Batch Generation
    • Common Use Cases
      • Factual Q&A
      • Creative Writing
      • Code Generation
      • Summarization
    • Best Practices
    • Next Steps
  • 🎓 Fine-Tuning
    • Quick Start
    • Data Formats
      • Instruction Format (Recommended)
      • Simple Text Format
      • Prompt-Completion Format
      • HuggingFace Datasets
    • Training Parameters
      • Basic Training
      • Advanced Training
    • LoRA Configuration
      • Choosing LoRA Rank
    • Training with Hub Integration
    • After Training
      • Test Your Model
      • Export the Model
      • Save and Load
    • Tips & Best Practices
      • Data Quality
      • Training Settings
      • Memory Management
      • Avoiding Overfitting
    • Common Issues
      • Out of Memory
      • Training Loss Not Decreasing
      • Model Outputs Garbage
    • Next Steps
  • 📦 GGUF Export
    • Quick Export
    • Quantization Types
      • Size Comparison (7B Model)
    • Export Examples
      • Different Quantization Types
    • Using Exported Models
      • With llama.cpp
      • With Ollama
      • With LM Studio
      • With Python (llama-cpp-python)
    • Push to HuggingFace
    • Direct Conversion
    • Quantize Existing GGUF
    • List Available Quantization Types
    • Troubleshooting
      • BitsAndBytes Models
      • Large Models
      • Windows Issues
    • Best Practices
    • Next Steps
  • 🤗 Hub Integration
    • Quick Push
    • Setup
      • Get Your Token
    • Push Methods
      • Method 1: TurboModel.push() (Recommended)
      • Method 2: QuantLLMHubManager (Advanced)
    • Auto-Generated Model Cards
      • YAML Frontmatter
      • Format-Specific Usage Examples
    • Pull Models
      • List GGUF Files
    • Private Repositories
    • Fine-Tuning with Hub Tracking
    • Commit Messages
    • Multiple Formats
    • Best Practices
    • Troubleshooting
      • Authentication Error
      • Repository Already Exists
      • Large File Issues
    • Next Steps

API Reference

  • 🚀 turbo()
    • Signature
    • Parameters
    • Returns
    • Examples
      • Basic Usage
      • With Custom Settings
      • Without Quantization
      • Local Model
      • Silent Loading
    • Auto-Configuration
    • Output
    • See Also
  • 🔥 TurboModel
    • Class Overview
    • Class Methods
      • from_pretrained()
      • from_gguf()
      • list_gguf_files()
    • Instance Methods
      • generate()
      • chat()
      • finetune()
      • export()
      • push() / push_to_hub()
    • SmartConfig
      • SmartConfig.detect()
      • print_summary()
    • See Also
  • 📦 GGUF API
    • Quick Reference
    • convert_to_gguf()
      • Parameters
      • Returns
      • Example
    • quantize_gguf()
      • Parameters
      • Example
    • GGUF_QUANT_TYPES
      • Quantization Comparison
    • QUANT_RECOMMENDATIONS
    • check_llama_cpp()
      • Example
    • install_llama_cpp()
      • Parameters
      • Example
    • ensure_llama_cpp_installed()
      • Example
    • export_to_gguf()
    • Using Exported Models
      • llama.cpp
      • Ollama
      • LM Studio
      • Python (llama-cpp-python)
    • See Also
  • 🤗 Hub API
    • Quick Reference
    • TurboModel.push()
      • Parameters
      • Supported Formats
      • Examples
    • QuantLLMHubManager
      • Parameters
      • Methods
        • login()
        • track_hyperparameters()
        • save_final_model()
        • push()
    • Complete Workflow
      • Fine-Tune and Push
      • Export and Push
    • Auto-Generated Model Cards
      • YAML Frontmatter
      • Format-Specific Usage
    • ModelCardGenerator
    • Environment Variables
    • Best Practices
    • See Also

© Copyright 2024, Dark Coder.