💬 Text Generation
Generate text with configurable sampling parameters, chat mode, streaming, stop strings, and multiple prompts.
Basic Generation
from quantllm import turbo
model = turbo("meta-llama/Llama-3.2-3B")
response = model.generate("What is machine learning?")
print(response)
Generation Parameters
Temperature & Sampling
response = model.generate(
    "Write a creative story about a robot.",
    max_new_tokens=200,  # Maximum tokens to generate
    temperature=0.7,     # Creativity (0.0 = deterministic, 1.0+ = creative)
    top_p=0.9,           # Nucleus sampling (higher = more diverse)
    top_k=50,            # Top-k sampling
    do_sample=True,      # Enable sampling (required for temperature > 0)
)
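To make the effect of these parameters concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly applied to a next-token distribution. It is an illustration of the general technique, not QuantLLM's internal code; the function names `softmax` and `filter_top_k_top_p` and the toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature.

    The temperature must be > 0 here; greedy decoding is a separate case.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Return {token_index: renormalized probability} for the kept tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # top-p: stop once the nucleus covers top_p
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.2)  # low temperature: peaked
flat = softmax(logits, temperature=1.5)   # high temperature: flatter
nucleus = filter_top_k_top_p(softmax(logits), top_k=3, top_p=0.9)
```

In this toy example the low temperature concentrates almost all probability on the first token, while the top-k and top-p filters drop the least likely token before sampling.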
Controlling Output
response = model.generate(
    "List 5 programming languages:",
    max_new_tokens=100,
    repetition_penalty=1.1,   # Prevent repetition (1.0 = off, 1.2 = strong)
    no_repeat_ngram_size=3,   # Prevent repeating n-grams
)
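The sketch below shows one common way these two controls work on token IDs: a CTRL-style penalty that lowers the score of tokens already generated, and an n-gram ban that blocks any token that would repeat an existing n-gram. Whether QuantLLM implements them exactly this way is an assumption; the helper names are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one lowers it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

def banned_next_tokens(generated_ids, ngram_size=3):
    """Tokens that would complete an n-gram already present in the output."""
    if len(generated_ids) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram being built.
    prefix = tuple(generated_ids[-(ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for start in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=1.2)
banned = banned_next_tokens([5, 6, 7, 5, 6], ngram_size=3)
```

Here the output already contains the trigram (5, 6, 7) and currently ends in (5, 6), so token 7 is banned as the next token.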
Parameter Guide
| Parameter | Range | Description |
|---|---|---|
| temperature | 0.0-2.0 | 0.1-0.3 for factual, 0.7-0.9 for creative |
| top_p | 0.0-1.0 | 0.9 is a good default |
| top_k | 1-100 | 50 is a good default |
| repetition_penalty | 1.0-1.5 | 1.1-1.2 prevents repetition |
| max_new_tokens | 1-4096+ | Depends on model context length |
Chat Mode
For conversational models with system prompts:
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages, max_new_tokens=200)
print(response)
Multi-Turn Conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
# First response
response = model.chat(messages)
print(f"Assistant: {response}")
# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "What about JavaScript?"})
response = model.chat(messages)
print(f"Assistant: {response}")
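The append-then-call pattern above can be wrapped in a small helper so the history is updated in one place. This is a sketch, assuming the chat function takes a list of messages and returns the reply as a string, as in the example above; the `Conversation` class and the `echo_chat` stand-in are hypothetical and not part of QuantLLM.

```python
class Conversation:
    """Keeps the message history for a multi-turn chat."""

    def __init__(self, chat_fn, system_prompt=None):
        self.chat_fn = chat_fn  # e.g. model.chat
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text, **kwargs):
        """Send one user turn and record both the question and the reply."""
        self.messages.append({"role": "user", "content": user_text})
        reply = self.chat_fn(self.messages, **kwargs)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

def echo_chat(messages, **kwargs):
    """Stand-in for model.chat so the sketch runs without a model."""
    return f"(reply to: {messages[-1]['content']})"

conv = Conversation(echo_chat, system_prompt="You are a helpful assistant.")
first = conv.ask("What is Python?")
second = conv.ask("What about JavaScript?")
```

With a loaded model, passing `model.chat` in place of `echo_chat` should give the same flow as the manual version.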
Streaming
Get tokens as they're generated for better UX:
# Streaming generation
for token in model.generate("Write a poem about the ocean:", stream=True):
    print(token, end="", flush=True)
print()  # Newline at end
Streaming with Chat
messages = [{"role": "user", "content": "Tell me a story."}]
for token in model.chat(messages, stream=True):
    print(token, end="", flush=True)
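When streaming, the full response is often still needed afterwards, for example to append it to the chat history. The sketch below prints tokens as they arrive and also returns the joined text. It assumes the streaming call yields string fragments, as the loops above suggest; `stream_and_collect` and `fake_stream` are hypothetical names.

```python
import io
import sys

def stream_and_collect(token_iter, out=sys.stdout):
    """Write tokens to `out` as they arrive and return the full text."""
    pieces = []
    for token in token_iter:
        out.write(token)
        out.flush()
        pieces.append(token)
    out.write("\n")  # newline once the stream ends
    return "".join(pieces)

def fake_stream():
    """Stand-in for model.generate(..., stream=True)."""
    yield from ["The ", "ocean ", "is ", "wide."]

buffer = io.StringIO()  # capture output here instead of the terminal
full_text = stream_and_collect(fake_stream(), out=buffer)
```

In real use, the first argument would be the iterator returned by the streaming call, and `out` can be left at its default to print to the terminal.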
Stop Strings
Stop generation at specific patterns:
response = model.generate(
    "Write a haiku:\n",
    max_new_tokens=100,
    stop_strings=["---", "\n\n\n"],  # Stop at these patterns
)
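Conceptually, a stop string cuts the output at the first place any of the patterns appears. The sketch below applies that rule to already-generated text. Whether the library keeps or drops the stop string itself is not stated here, so dropping it is an assumption; `truncate_at_stop` is a hypothetical helper.

```python
def truncate_at_stop(text, stop_strings):
    """Cut text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        if not stop:  # ignore empty patterns, which would match at index 0
            continue
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "An old silent pond\nA frog jumps in\nSplash!\n---\nAnother haiku"
haiku = truncate_at_stop(raw, ["---", "\n\n\n"])
```

The same function can serve as a post-processing fallback when a stop pattern needs to be enforced on text that was generated without one.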
Batch Generation
Generate responses for several prompts by calling generate once per prompt:
prompts = [
    "What is Python?",
    "What is JavaScript?",
    "What is Rust?",
]
for prompt in prompts:
    response = model.generate(prompt, max_new_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {response}\n")
Common Use Cases
Factual Q&A
response = model.generate(
    "What is the capital of France?",
    temperature=0.1,  # Low temperature for factual
    max_new_tokens=50,
)
Creative Writing
response = model.generate(
    "Write a short story about a dragon:",
    temperature=0.8,  # Higher temperature for creativity
    top_p=0.95,
    max_new_tokens=500,
)
Code Generation
response = model.generate(
    "Write a Python function to sort a list:",
    temperature=0.2,  # Low for accurate code
    max_new_tokens=200,
)
Summarization
text = "..."  # Long text to summarize
response = model.generate(
    f"Summarize the following text:\n\n{text}\n\nSummary:",
    temperature=0.3,
    max_new_tokens=150,
)
Best Practices
Temperature: Use 0.1-0.3 for factual, 0.7-0.9 for creative
Max tokens: Set reasonable limits to avoid runaway generation
Repetition penalty: Use 1.1-1.2 to reduce repetition
Streaming: Use for long responses to improve user experience
Stop strings: Define clear stopping points for structured output
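One way to apply these guidelines consistently is to keep the per-task settings in a single table and build keyword arguments from it. The values below are taken from the use cases on this page; the `PRESETS` table, the `generation_kwargs` helper, and the default repetition penalty of 1.1 are hypothetical choices for illustration.

```python
PRESETS = {
    "factual": {"temperature": 0.1, "max_new_tokens": 50},
    "creative": {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 500},
    "code": {"temperature": 0.2, "max_new_tokens": 200},
    "summary": {"temperature": 0.3, "max_new_tokens": 150},
}

def generation_kwargs(task, **overrides):
    """Build keyword arguments for a task, letting the caller override any."""
    if task not in PRESETS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(PRESETS)}")
    # Later entries win: defaults, then the preset, then explicit overrides.
    return {"repetition_penalty": 1.1, **PRESETS[task], **overrides}

creative = generation_kwargs("creative", max_new_tokens=300)
```

The result can then be unpacked into a call, for example `model.generate(prompt, **generation_kwargs("code"))`.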
Next Steps
Fine-tuning: Train the model on your data
GGUF Export: Export for deployment
API Reference: Full API documentation