Dataset API

Dataset Loading

class quantllm.data.load_dataset.LoadDataset[source]

Bases: object

__init__()[source]: Initialize the dataset loader.

load_hf_dataset(dataset_name, split=None, streaming=False, **kwargs)[source]

Load a dataset from the Hugging Face Hub with a custom progress bar.

Parameters:

dataset_name (str) – Name of the dataset
split (str, optional) – Dataset split to load (for example, 'train')
streaming (bool) – Whether to stream the dataset instead of downloading it in full
**kwargs – Additional arguments for dataset loading

Returns:

Loaded dataset

Return type:

Dataset

load_local_dataset(file_path, file_type='auto', **kwargs)[source]

Load a dataset from a local file with a progress bar.

Parameters:

file_path (str) – Path to the dataset file
file_type (str) – Type of file (auto, csv, json, text, parquet)
**kwargs – Additional arguments for dataset loading

Return type:

Dataset
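
The reference does not state how file_type='auto' is resolved. One plausible reading is that the type is inferred from the file extension. The sketch below illustrates that idea only; infer_file_type and the extension table are hypothetical names, not part of QuantLLM, and the extensions the library actually recognizes may differ.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the file_type values listed above.
# This is an assumption for illustration, not QuantLLM's actual table.
_EXTENSION_TO_TYPE = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".parquet": "parquet",
}


def infer_file_type(file_path: str, file_type: str = "auto") -> str:
    """Return an explicit file type, inferring it from the suffix when 'auto'."""
    if file_type != "auto":
        # An explicit type always wins over the extension.
        return file_type
    suffix = Path(file_path).suffix.lower()
    if suffix not in _EXTENSION_TO_TYPE:
        raise ValueError(f"Cannot infer file type from {file_path!r}")
    return _EXTENSION_TO_TYPE[suffix]
```

If inference works along these lines, passing file_type explicitly is the safer choice for files with unusual or missing extensions.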

load_custom_dataset(data, **kwargs)[source]

Create a dataset from in-memory data, with progress indication.

Parameters:

data (Union[Dict, list]) – Dataset data
**kwargs – Additional arguments for dataset creation

Returns:

Created dataset

Return type:

Dataset
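
The data parameter accepts either a dict or a list. A common convention, assumed here rather than confirmed by this reference, is that a dict is column-oriented (column name to list of values) and a list is row-oriented (one dict per example). The sketch below normalizes both shapes to columns; to_columns is a hypothetical helper, not a QuantLLM function.

```python
from typing import Any, Dict, List, Union

Columns = Dict[str, List[Any]]
Rows = List[Dict[str, Any]]


def to_columns(data: Union[Columns, Rows]) -> Columns:
    """Normalize column-oriented or row-oriented data to a dict of columns."""
    if isinstance(data, dict):
        # Column-oriented input: every column must have the same length.
        lengths = {len(column) for column in data.values()}
        if len(lengths) > 1:
            raise ValueError("All columns must have the same length")
        return {name: list(column) for name, column in data.items()}
    if not data:
        return {}
    # Row-oriented input: every row must have the same keys as the first row.
    keys = list(data[0].keys())
    for row in data:
        if set(row) != set(keys):
            raise ValueError("All rows must have the same keys")
    return {key: [row[key] for row in data] for key in keys}
```

Checking the shape of the data up front like this makes mismatched columns or rows fail early with a clear message.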

Dataset Preprocessing

class quantllm.data.dataset_preprocessor.DatasetPreprocessor(tokenizer)[source]

Bases: object

__init__(tokenizer)[source]

validate_datasets(datasets)[source]: Validate input datasets.

preprocess_text(text)[source]

Apply basic text preprocessing.

Return type: str

tokenize_dataset(train_dataset, val_dataset=None, test_dataset=None, max_length=512, text_column='text', label_column=None, batch_size=1000)[source]

Tokenize datasets with preprocessing and progress bars.

Return type: Tuple[Dataset, Optional[Dataset], Optional[Dataset]]
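
To make the roles of max_length and batch_size concrete, the sketch below processes texts in batches and truncates each example to max_length tokens. Whitespace splitting stands in for a real tokenizer, and iter_batches and toy_tokenize are hypothetical names; this is not how DatasetPreprocessor is implemented.

```python
from typing import Dict, Iterator, List


def iter_batches(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def toy_tokenize(
    texts: List[str], max_length: int = 512, batch_size: int = 1000
) -> Dict[str, List[List[str]]]:
    """Split each text on whitespace and truncate it to max_length tokens."""
    input_tokens: List[List[str]] = []
    for batch in iter_batches(texts, batch_size):
        for text in batch:
            # Truncation: keep only the first max_length tokens.
            input_tokens.append(text.split()[:max_length])
    return {"input_tokens": input_tokens}
```

In this sketch batch_size only changes how much work is done per step, not the result, while max_length changes the output itself.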

Dataset Splitting

class quantllm.data.dataset_splitter.DatasetSplitter[source]

Bases: object

__init__()[source]: Initialize dataset splitter.

validate_split_params(train_size, val_size, test_size=None)[source]: Validate split parameters.
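
The reference does not list the checks that validate_split_params performs. A reasonable assumption is that each proportion must lie strictly between 0 and 1 and that the given proportions must sum to 1. The sketch below shows such a check; check_split_proportions is a hypothetical stand-in, not the library method.

```python
import math
from typing import Optional


def check_split_proportions(
    train_size: float, val_size: float, test_size: Optional[float] = None
) -> None:
    """Raise ValueError unless the proportions are valid and sum to 1."""
    sizes = {"train_size": train_size, "val_size": val_size}
    if test_size is not None:
        sizes["test_size"] = test_size
    for name, value in sizes.items():
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = sum(sizes.values())
    # Compare with a tolerance: 0.8 + 0.1 + 0.1 is not exactly 1.0 in floats.
    if not math.isclose(total, 1.0, rel_tol=1e-9):
        raise ValueError(f"Split proportions must sum to 1, got {total}")
```

The tolerance matters in practice: a strict equality test would reject the default proportions of 0.8, 0.1 and 0.1.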

train_test_split(dataset, test_size=0.2, shuffle=True, seed=42, **kwargs)[source]

Split dataset into train and test sets.

Parameters:

dataset (Dataset) – Dataset to split
test_size (float) – Size of test set
shuffle (bool) – Whether to shuffle
seed (int) – Random seed
**kwargs – Additional splitting arguments

Returns:

Train and test datasets

Return type:

Tuple[Dataset, Dataset]

train_val_test_split(dataset, train_size=0.8, val_size=0.1, test_size=0.1, shuffle=True, seed=42, split='train')[source]

Split dataset into train, validation and test sets with progress indication.

Parameters:

dataset (Dataset or DatasetDict) – Dataset to split
train_size (float) – Proportion of training set
val_size (float) – Proportion of validation set
test_size (float) – Proportion of test set
shuffle (bool) – Whether to shuffle the dataset
seed (int) – Random seed
split (str) – Which split to use if dataset is a DatasetDict

Returns:

Train, validation and test datasets

Return type:

Tuple[Dataset, Dataset, Dataset]
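
The effect of the proportions, shuffle and seed can be illustrated on a plain Python sequence. This is a minimal sketch assuming a shuffle-then-slice strategy; three_way_split is a hypothetical function, and the library may round sizes or shuffle differently.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    items: Sequence[T],
    train_size: float = 0.8,
    val_size: float = 0.1,
    shuffle: bool = True,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Split items into train, validation and test lists."""
    indices = list(range(len(items)))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible per seed.
        random.Random(seed).shuffle(indices)
    n_train = int(len(indices) * train_size)
    n_val = int(len(indices) * val_size)
    train = [items[i] for i in indices[:n_train]]
    val = [items[i] for i in indices[n_train:n_train + n_val]]
    # The test set takes whatever remains, so no example is dropped.
    test = [items[i] for i in indices[n_train + n_val:]]
    return train, val, test
```

Because the same seed produces the same permutation, repeated calls with identical arguments return identical splits.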

train_val_split(dataset, train_size=0.8, shuffle=True, seed=42, split='train')[source]

Split dataset into train and validation sets.

Return type: Tuple[Dataset, Dataset]

k_fold_split(dataset, n_splits=5, shuffle=True, seed=42)[source]: Create k-fold cross validation splits.
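
This reference does not document the return value of k_fold_split. In k-fold cross validation, the data is divided into n_splits folds, and each fold serves once as the validation set while the remaining folds form the training set. The sketch below computes such index pairs; k_fold_indices is a hypothetical function and its return shape is an assumption.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, n_splits: int = 5, shuffle: bool = True, seed: int = 42
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, val_indices) pairs, one per fold."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # The first `extra` folds get one additional sample each.
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        folds.append((train_idx, val_idx))
        start = stop
    return folds
```

Every sample appears in exactly one validation fold, so averaging a metric over the folds uses each example for validation once.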

DataLoader

class quantllm.data.dataloader.DataLoader[source]

Bases: object

Custom DataLoader class for QuantLLM that wraps torch.utils.data.DataLoader.

static validate_dataset(dataset, name)[source]: Validate dataset.

classmethod from_datasets(train_dataset, val_dataset=None, test_dataset=None, batch_size=8, shuffle=True, num_workers=0, pin_memory=True, **kwargs)[source]: Create DataLoader instances from datasets.

Example Usage

Loading a Dataset

from quantllm import LoadDataset

# Load from HuggingFace
dataset = LoadDataset().load_hf_dataset("imdb")

# Load local dataset
dataset = LoadDataset().load_local_dataset(
    file_path="path/to/data.csv",
    file_type="csv"
)

Preprocessing

from quantllm import DatasetPreprocessor

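# Assumes `tokenizer` is an already-loaded tokenizer and that train_dataset,
# val_dataset and test_dataset already exist (for example, from DatasetSplitter).
preprocessor = DatasetPreprocessor(tokenizer)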
train_processed, val_processed, test_processed = preprocessor.tokenize_dataset(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    test_dataset=test_dataset,
    max_length=512,
    text_column="text",
    label_column="label"
)

Dataset Splitting

from quantllm import DatasetSplitter

splitter = DatasetSplitter()
train, val, test = splitter.train_val_test_split(
    dataset,
    train_size=0.8,
    val_size=0.1,
    test_size=0.1
)

# Or just train-val split
train, val = splitter.train_val_split(
    dataset,
    train_size=0.8
)

Creating DataLoaders

from quantllm import DataLoader

train_loader, val_loader, test_loader = DataLoader.from_datasets(
    train_dataset=train_processed,
    val_dataset=val_processed,
    test_dataset=test_processed,
    batch_size=8
)