πŸ’¬ Text Generation

Generate text with various options and modes.


Basic Generation

from quantllm import turbo

model = turbo("meta-llama/Llama-3.2-3B")

response = model.generate("What is machine learning?")
print(response)

Generation Parameters

Temperature & Sampling

response = model.generate(
    "Write a creative story about a robot.",
    max_new_tokens=200,        # Maximum tokens to generate
    temperature=0.7,           # Creativity (0.0 = deterministic, 1.0+ = creative)
    top_p=0.9,                 # Nucleus sampling (higher = more diverse)
    top_k=50,                  # Top-k sampling
    do_sample=True,            # Enable sampling (required for temperature > 0)
)

Controlling Output

response = model.generate(
    "List 5 programming languages:",
    max_new_tokens=100,
    repetition_penalty=1.1,    # Prevent repetition (1.0 = off, 1.2 = strong)
    no_repeat_ngram_size=3,    # Prevent repeating n-grams
)

Parameter Guide

Parameter

Range

Description

temperature

0.0-2.0

0.1-0.3 for factual, 0.7-0.9 for creative

top_p

0.0-1.0

0.9 is a good default

top_k

1-100

50 is a good default

repetition_penalty

1.0-1.5

1.1-1.2 prevents repetition

max_new_tokens

1-4096+

Depends on model context length


Chat Mode

For conversational models with system prompts:

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]

response = model.chat(messages, max_new_tokens=200)
print(response)

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

# First response
response = model.chat(messages)
print(f"Assistant: {response}")

# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "What about JavaScript?"})

response = model.chat(messages)
print(f"Assistant: {response}")

Streaming

Get tokens as they’re generated for better UX:

# Streaming generation
for token in model.generate("Write a poem about the ocean:", stream=True):
    print(token, end="", flush=True)
print()  # Newline at end

Streaming with Chat

messages = [{"role": "user", "content": "Tell me a story."}]

for token in model.chat(messages, stream=True):
    print(token, end="", flush=True)

Stop Strings

Stop generation at specific patterns:

response = model.generate(
    "Write a haiku:\n",
    max_new_tokens=100,
    stop_strings=["---", "\n\n\n"],  # Stop at these patterns
)

Batch Generation

Generate multiple responses efficiently:

prompts = [
    "What is Python?",
    "What is JavaScript?", 
    "What is Rust?",
]

for prompt in prompts:
    response = model.generate(prompt, max_new_tokens=100)
    print(f"Q: {prompt}")
    print(f"A: {response}\n")

Common Use Cases

Factual Q&A

response = model.generate(
    "What is the capital of France?",
    temperature=0.1,           # Low temperature for factual
    max_new_tokens=50,
)

Creative Writing

response = model.generate(
    "Write a short story about a dragon:",
    temperature=0.8,           # Higher temperature for creativity
    top_p=0.95,
    max_new_tokens=500,
)

Code Generation

response = model.generate(
    "Write a Python function to sort a list:",
    temperature=0.2,           # Low for accurate code
    max_new_tokens=200,
)

Summarization

text = "..." # Long text to summarize
response = model.generate(
    f"Summarize the following text:\n\n{text}\n\nSummary:",
    temperature=0.3,
    max_new_tokens=150,
)

Best Practices

  1. Temperature: Use 0.1-0.3 for factual, 0.7-0.9 for creative

  2. Max tokens: Set reasonable limits to avoid runaway generation

  3. Repetition penalty: Use 1.1-1.2 to reduce repetition

  4. Streaming: Use for long responses to improve user experience

  5. Stop strings: Define clear stopping points for structured output


Next Steps