Dataset API
Dataset Loading
- class quantllm.data.load_dataset.LoadDataset
Bases: object
- load_hf_dataset(dataset_name, split=None, streaming=False, **kwargs)
Load a dataset from HuggingFace, displaying a custom progress bar while it loads.
Dataset Preprocessing
Dataset Splitting
- class quantllm.data.dataset_splitter.DatasetSplitter
Bases: object
- train_test_split(dataset, test_size=0.2, shuffle=True, seed=42, **kwargs)
Split dataset into train and test sets.
- train_val_test_split(dataset, train_size=0.8, val_size=0.1, test_size=0.1, shuffle=True, seed=42, split='train')
Split dataset into train, validation and test sets with progress indication.
- Parameters:
dataset (Dataset or DatasetDict) – Dataset to split
train_size (float) – Proportion of training set
val_size (float) – Proportion of validation set
test_size (float) – Proportion of test set
shuffle (bool) – Whether to shuffle the dataset
seed (int) – Random seed
split (str) – Which split to use if dataset is a DatasetDict
- Returns:
Train, validation and test datasets
- Return type:
Tuple[Dataset, Dataset, Dataset]
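To make the roles of train_size, val_size, test_size, shuffle and seed concrete, the following is a minimal pure-Python sketch of a three-way split over a plain list. It is not QuantLLM's implementation: the helper name simple_three_way_split is hypothetical, it ignores the split argument used for DatasetDict inputs, and the check that the proportions sum to 1.0 is an assumption about reasonable behavior rather than documented library behavior.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def simple_three_way_split(
    items: Sequence[T],
    train_size: float = 0.8,
    val_size: float = 0.1,
    test_size: float = 0.1,
    shuffle: bool = True,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Hypothetical helper: split items into (train, val, test) lists."""
    # Assumption: the three proportions must cover the whole dataset.
    if not math.isclose(train_size + val_size + test_size, 1.0):
        raise ValueError("train_size + val_size + test_size must equal 1.0")

    indices = list(range(len(items)))
    if shuffle:
        # A dedicated Random instance makes the split reproducible
        # without changing the global random state.
        random.Random(seed).shuffle(indices)

    n_train = round(len(items) * train_size)
    n_val = round(len(items) * val_size)

    train = [items[i] for i in indices[:n_train]]
    val = [items[i] for i in indices[n_train:n_train + n_val]]
    # The test set takes the remainder so that no example is dropped.
    test = [items[i] for i in indices[n_train + n_val:]]
    return train, val, test


train, val, test = simple_three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```

Calling the helper twice with the same seed yields identical splits, which is the usual reason for exposing a seed parameter.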
DataLoader
Example Usage
Loading a Dataset
from quantllm import LoadDataset

# Load from HuggingFace
dataset = LoadDataset().load_hf_dataset("imdb")

# Alternatively, load a local CSV file
dataset = LoadDataset().load_local_dataset(
    file_path="path/to/data.csv",
    file_type="csv"
)
Preprocessing
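from quantllm import DatasetPreprocessor

# `tokenizer` is assumed to be an already-loaded tokenizer, and
# train_dataset, val_dataset and test_dataset are assumed to be
# existing dataset splits.
preprocessor = DatasetPreprocessor(tokenizer)
train_processed, val_processed, test_processed = preprocessor.tokenize_dataset(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    test_dataset=test_dataset,
    max_length=512,
    text_column="text",
    label_column="label"
)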
Dataset Splitting
from quantllm import DatasetSplitter

splitter = DatasetSplitter()
train, val, test = splitter.train_val_test_split(
    dataset,
    train_size=0.8,
    val_size=0.1,
    test_size=0.1
)

# Or a two-way split, using the train_test_split method listed in the API reference
train, val = splitter.train_test_split(
    dataset,
    test_size=0.2
)
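For intuition about what a two-way split does with a held-out fraction, shuffling and a seed, here is a minimal pure-Python sketch that operates on a plain list. It is an illustration only, not QuantLLM's code: the helper name simple_train_test_split is hypothetical, and rounding the held-out count to the nearest integer is an assumption.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def simple_train_test_split(
    items: Sequence[T],
    test_size: float = 0.2,
    shuffle: bool = True,
    seed: int = 42,
) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: split items into (train, test) lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")

    indices = list(range(len(items)))
    if shuffle:
        # Seeded shuffle: the same seed always produces the same split.
        random.Random(seed).shuffle(indices)

    # Assumption: the held-out count is rounded to the nearest integer.
    n_test = round(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


train, test = simple_train_test_split(list(range(10)), test_size=0.2)
print(len(train), len(test))  # 8 2
```

With shuffle=False the first test_size fraction of the data becomes the held-out set, which is rarely desirable for data stored in a sorted order.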
Creating DataLoaders
from quantllm import DataLoader

train_loader, val_loader, test_loader = DataLoader.from_datasets(
    train_dataset=train_processed,
    val_dataset=val_processed,
    test_dataset=test_processed,
    batch_size=8
)
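The batch_size argument controls how many examples each iteration of a loader yields. As a rough illustration of that idea only, the sketch below batches a plain list in pure Python. The helper iter_batches is hypothetical and is not part of QuantLLM; a real loader typically also handles shuffling, collation and tensor conversion.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 8) -> Iterator[List[T]]:
    """Hypothetical helper: yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        # The final batch may be smaller when len(items) is not a
        # multiple of batch_size.
        yield list(items[start:start + batch_size])


batches = list(iter_batches(list(range(20)), batch_size=8))
print([len(b) for b in batches])  # [8, 8, 4]
```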