CORE SAMPLING PARAMETERS
Min-P Sampling
What It Does
Filters out tokens whose probability is below a minimum threshold relative to the most probable token. It dynamically prunes very unlikely tokens while keeping a contextually appropriate number of candidates.
How It Works
For each generation step, the model identifies the highest probability token. Any token whose probability is less than (min_p × highest_probability) is discarded. For example, if the top token has probability 0.4 and min_p=0.1, any token with probability below 0.04 (i.e., 0.1 × 0.4) is removed from consideration. This adapts naturally: when the model is confident, fewer tokens pass; when uncertain, more tokens are eligible.
When To Use
- 0.05–0.1: Light filtering — removes only extremely unlikely tokens.
- 0.1–0.2: Moderate filtering — good balance for most tasks.
- 0.3+: Aggressive filtering — only high-probability tokens survive.
Examples
With min_p=0.1 and a top token at probability 0.5, the cutoff is 0.05. Results: Tokens with 0.5, 0.3, 0.1, 0.08 survive; tokens at 0.03, 0.01 are cut.
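As an illustrative sketch (toy probabilities and made-up tokens, not a real tokenizer or inference engine), the filtering step can be written in a few lines of Python:

```python
def min_p_filter(probs, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the top probability."""
    cutoff = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= cutoff}

# Top token at 0.5 with min_p=0.1 gives a cutoff of 0.05
probs = {"the": 0.5, "a": 0.3, "an": 0.1, "this": 0.08, "that": 0.03, "those": 0.01}
kept = min_p_filter(probs, min_p=0.1)   # "that" and "those" fall below the cutoff
```

Note how the cutoff scales with the model's confidence: a sharper distribution raises the bar, a flatter one lowers it.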
Pro Tip
Min-P is increasingly popular as an alternative to Top-K and Top-P because it scales naturally with the model's confidence level. It is supported in llama.cpp, vLLM, and several other frameworks.
Temperature
What It Does
Controls the randomness and creativity of the model's output by scaling the raw logit scores before they are converted into probabilities via the softmax function. It is the single most influential parameter for shaping output behavior.
How It Works
The formula is: adjusted_probability = softmax(logits / temperature). At temperature=1.0, the model uses its native probability distribution. Lower values (e.g., 0.2) sharpen the distribution, making the highest-probability token overwhelmingly likely — resulting in deterministic, focused, and consistent output. Higher values (e.g., 1.2) flatten the distribution, giving lower-probability tokens a better chance of being selected — producing more creative, diverse, and sometimes surprising text. At temperature=0, the model always picks the single most probable token (greedy decoding). Values above 1.5–2.0 can produce incoherent or nonsensical output.
When To Use
- Low (0.0–0.3): Factual Q&A, code generation, data extraction, classification, medical/legal applications where consistency is critical.
- Medium (0.4–0.7): Balanced tasks like email drafting, summarization, customer support, general conversation.
- High (0.8–1.2): Creative writing, brainstorming, poetry, storytelling, idea generation.
- Very High (1.3–2.0): Experimental use only — highly unpredictable output.
Examples
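The sharpening/flattening effect described above can be demonstrated with a self-contained softmax sketch (toy logits, not real model outputs):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize with softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=0.2)  # top token dominates
flat = softmax_with_temperature(logits, temperature=1.5)   # mass spread more evenly
```

At temperature 0.2 the top token takes nearly all of the probability mass; at 1.5 the same logits yield a much flatter distribution.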
Pro Tip
For production systems, start with temperature=0.3 and increase only if output feels too rigid. Never change temperature and top_p simultaneously — adjust one at a time.
Top-K Sampling
What It Does
Restricts the model's next-token selection to only the K most probable tokens, ignoring all others regardless of their probability. It provides a hard ceiling on vocabulary diversity at each generation step.
How It Works
The model computes probabilities for all tokens in its vocabulary, then keeps only the K tokens with the highest probabilities. All other tokens have their probabilities set to zero. The model then samples from this reduced set. With K=1, the model always picks the single most likely token (greedy decoding). With K=50, it chooses from 50 candidates. With K=100,000+, essentially all tokens are eligible.
When To Use
- K=1: Greedy decoding — most deterministic, always picks the top token.
- K=10–20: Focused but with slight variation. Good for code, formal text.
- K=40–50: Standard default. Balanced diversity for general tasks.
- K=100+: High diversity, approaching unrestricted sampling.
- K=0 or -1: Disabled (no Top-K filtering applied).
Examples
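A minimal sketch of the truncation step (toy probabilities, hypothetical tokens) — keep the K most probable tokens and renormalize:

```python
def top_k_filter(probs, k=3):
    """Keep only the k highest-probability tokens, renormalized to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {"cat": 0.4, "dog": 0.3, "bird": 0.2, "fish": 0.07, "ant": 0.03}
kept = top_k_filter(probs, k=3)   # "fish" and "ant" are zeroed out
```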
Pro Tip
Top-K is less flexible than Top-P because it uses a fixed number regardless of probability distribution. If the model is very confident (one token has 99% probability), K=50 still considers 50 tokens unnecessarily. Top-P handles this better by adapting dynamically.
Top-P (Nucleus Sampling)
What It Does
Controls output diversity by dynamically selecting the smallest set of tokens whose cumulative probability exceeds the threshold P. Unlike Top-K which uses a fixed count, Top-P adapts — sometimes considering 5 tokens, sometimes 500 — depending on the model's confidence at each position.
How It Works
After computing probabilities for all tokens, the model sorts them from highest to lowest probability and accumulates until the sum reaches the P threshold. Only tokens within this cumulative set are eligible for sampling; all others are discarded. For example, with top_p=0.9, the model considers enough top tokens to cover 90% of the total probability mass. If one token has 95% probability, only that token is considered. If probabilities are spread evenly, many tokens are included.
When To Use
- Low (0.1–0.5): Highly constrained output — only the very top predictions. Good for classification, structured output, and deterministic tasks.
- Medium (0.6–0.8): Balanced diversity. Good for most general-purpose tasks.
- High (0.9–1.0): Maximum diversity from the full probability distribution. Good for creative and exploratory tasks.
- Note: top_p=1.0 means no filtering (all tokens eligible).
Examples
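The cumulative-mass rule can be sketched as follows (toy probabilities; real implementations work on the full vocabulary and include the token that crosses the threshold, as here):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "fish": 0.05}
kept = top_p_filter(probs, top_p=0.9)  # 0.5 + 0.3 = 0.8 < 0.9, so "bird" is included too
```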
Pro Tip
Top-P is generally preferred over Top-K for most use cases because it adapts dynamically to the model's confidence. The general recommendation is to adjust either temperature OR top_p, but not both aggressively at the same time.
OUTPUT LENGTH & STRUCTURE CONTROLS
Context Window / Context Length
What It Does
Defines the total maximum number of tokens (input + output combined) that the model can process in a single interaction. It is the model's total working memory for the conversation — everything the model can 'see' at once.
How It Works
The context window is an architectural property of the model determined during training. Everything the model considers — system prompt, conversation history, retrieved documents, and generated output — must fit within this window. If the total exceeds the limit, earlier content is typically truncated or the request fails. Larger context windows enable longer conversations and bigger document inputs but consume more memory and compute.
When To Use
- 2K–4K: Simple Q&A, short conversations.
- 8K–32K: Standard applications, moderate documents.
- 64K–128K: Long documents, extended conversations, code repositories.
- 200K–1M+: Entire books, massive codebases, extensive research.
Examples
A 128K token context window can process ~96,000 words — enough for a full-length novel.
A 4K context window limits you to ~3,000 words total (prompt + response combined).
Pro Tip
Context window usage directly impacts cost and latency. Use context efficiently: summarize old conversation turns, use RAG to inject only relevant documents, and remove redundant information from prompts.
Max Tokens / Max Length
What It Does
Sets the maximum number of tokens the model can generate in a single response. This is a hard ceiling — the model will stop generating once it reaches this limit, even if its response is incomplete. It directly affects cost (more tokens = higher API cost) and latency (more tokens = longer response time).
How It Works
The model generates tokens one at a time until it either: (a) reaches the max_tokens limit, (b) produces a stop sequence, or (c) generates the special end-of-sequence (EOS) token. Max tokens includes only the output tokens — not the input/prompt tokens. However, the total (input + output) must fit within the model's context window.
When To Use
- Short (50–150): Quick answers, classifications, single-sentence responses.
- Medium (150–500): Paragraphs, summaries, standard Q&A.
- Long (500–2000): Articles, reports, detailed explanations.
- Very Long (2000–8000+): Full documents, book chapters, extensive code generation.
- Note: Setting this too low truncates responses mid-sentence; too high wastes cost on padding.
Examples
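One way to pick a value, sketched under the same ~0.75 words-per-token rule of thumb used elsewhere in this guide (≈1.33 tokens per English word — a rough heuristic that varies by tokenizer and language):

```python
def estimate_max_tokens(expected_words, buffer=0.25, tokens_per_word=1.33):
    """Estimate a max_tokens setting from expected output length plus a safety buffer."""
    return int(expected_words * tokens_per_word * (1 + buffer))

limit = estimate_max_tokens(300)   # budget for a ~300-word summary with a 25% buffer
```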
Pro Tip
Always set max_tokens explicitly rather than relying on defaults. For cost optimization, estimate the expected output length and add a 20–30% buffer. Remember: output tokens typically cost 4–6x more than input tokens on most API providers.
Min Tokens
What It Does
Sets the minimum number of tokens the model must generate before it is allowed to stop (by producing an EOS token or stop sequence). This prevents the model from giving overly brief or empty responses.
How It Works
When min_tokens is set, the model's EOS token and any stop sequences are suppressed until the minimum token count is reached. After that threshold, normal stopping behavior resumes. This ensures a minimum response length.
When To Use
- Useful when models tend to give very short or one-word answers. Set to a reasonable minimum to ensure substantive responses without forcing unnecessary padding.
Examples
Pro Tip
Use sparingly. Forcing minimum length can lead to padding and filler content. It's often better to address brevity issues through prompt engineering rather than this parameter.
Stop Sequences
What It Does
Defines one or more specific strings or tokens that, when generated by the model, immediately halt further text generation. The stop sequence itself may or may not be included in the output, depending on the provider's implementation.
How It Works
The model checks its output after each token generation. If the accumulated output ends with any of the defined stop sequences, generation stops immediately. This is useful for controlling output structure, preventing runaway generation, and creating clean boundaries in multi-turn or structured outputs.
When To Use
- Structured outputs: Use '\n\n' to stop after a single paragraph.
- Conversation: Use 'User:' to stop when the model would simulate the user's response.
- Lists: Use a number like '11.' to stop at 10 items.
- Code: Use '```' to stop after a code block.
- JSON: Use '}' or ']' to stop after the closing bracket.
Examples
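Providers apply stop sequences token by token during generation; as a client-side approximation, the truncation logic looks like this (illustrative text, not a real API response):

```python
def apply_stop_sequences(generated, stops):
    """Truncate generated text at the earliest occurrence of any stop sequence."""
    cut = len(generated)
    for stop in stops:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut]

text = "The capital of France is Paris.\nUser: What about Spain?"
clean = apply_stop_sequences(text, stops=["User:", "\n\n"])
# The simulated user turn after "User:" is cut away
```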
Pro Tip
Stop sequences are powerful for preventing 'model leakage' where the model continues to generate content beyond the intended boundary, such as simulating both sides of a conversation. Use them in all production systems.
REPETITION CONTROL PARAMETERS
Frequency Penalty
What It Does
Applies a penalty to tokens proportional to how many times they have already appeared in the generated text. The more frequently a word has appeared, the stronger the penalty. This reduces word-level repetition and encourages lexical diversity.
How It Works
During generation, the model tracks token counts. For each candidate token, the penalty is calculated as: adjusted_logit = logit - (frequency_penalty × count_of_token_in_output). Higher counts receive proportionally higher penalties. Positive values discourage repetition; negative values actually encourage it (useful for tasks requiring consistent terminology). A value of 0.0 means no penalty is applied.
When To Use
- 0.0: No penalty — default for most tasks.
- 0.1–0.5: Light to moderate discouragement of repetition. Good for essays, articles, and general writing.
- 0.5–1.0: Strong discouragement. Good for creative writing where variety is important.
- 1.0–2.0: Very aggressive. May produce unnatural text. Use with caution.
- Negative values (-0.1 to -1.0): Encourage repetition — useful for consistent technical documentation or code.
Examples
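The formula above can be applied directly (toy logits and tokens for illustration):

```python
from collections import Counter

def apply_frequency_penalty(logits, output_tokens, frequency_penalty=0.5):
    """adjusted_logit = logit - frequency_penalty * count_of_token_in_output"""
    counts = Counter(output_tokens)
    return {tok: logit - frequency_penalty * counts[tok]
            for tok, logit in logits.items()}

logits = {"great": 2.0, "good": 1.5, "nice": 1.0}
output = ["great", "great", "great", "good"]   # "great" already used 3 times
adjusted = apply_frequency_penalty(logits, output, frequency_penalty=0.5)
# "great" is penalized 3x as hard as "good", so "good" now outranks it
```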
Pro Tip
The key difference from presence penalty: frequency penalty scales with count (a word used 10 times gets 10x the penalty), while presence penalty applies the same flat penalty regardless of count. Generally, adjust frequency OR presence penalty, not both.
Presence Penalty
What It Does
Applies a flat, fixed penalty to any token that has appeared at least once in the output, regardless of how many times it appeared. A token used twice receives the same penalty as one used ten times. This encourages the model to introduce new topics and concepts.
How It Works
The model checks whether each candidate token has appeared anywhere in the generated text so far. If it has (even once), the presence penalty is subtracted from its logit score: adjusted_logit = logit - presence_penalty. Unlike frequency penalty, the penalty does not increase with multiple occurrences — the check is binary: present or absent. Positive values penalize previously used tokens; negative values encourage their reuse.
When To Use
- 0.0: Default — no topic diversity enforcement.
- 0.1–0.5: Gently nudges toward new topics. Good for diverse brainstorming.
- 0.5–1.0: Strong push to avoid returning to previously mentioned concepts.
- 1.0–2.0: Very aggressive. May cause the model to avoid important referencing.
- Negative values: Encourage staying on topic and reusing established terms.
Examples
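A sketch of the flat-penalty rule, using the same toy logits as the frequency penalty example for contrast:

```python
def apply_presence_penalty(logits, output_tokens, presence_penalty=0.5):
    """Flat penalty for any token seen at least once, regardless of count."""
    seen = set(output_tokens)
    return {tok: logit - (presence_penalty if tok in seen else 0.0)
            for tok, logit in logits.items()}

logits = {"great": 2.0, "good": 1.5, "nice": 1.0}
output = ["great", "great", "great", "good"]   # counts don't matter, only presence
adjusted = apply_presence_penalty(logits, output, presence_penalty=0.5)
# "great" (used 3x) and "good" (used 1x) receive the same 0.5 penalty
```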
Pro Tip
Use presence penalty when you want the model to explore new ground (brainstorming, diverse lists). Use frequency penalty when you just want to reduce word-level repetition while staying on topic.
Repeat Last N
What It Does
Defines the lookback window (in tokens) that the repetition penalty considers. Only tokens that appeared within the last N generated tokens are penalized — older tokens outside this window are treated as fresh.
How It Works
This parameter works in conjunction with repetition_penalty. Instead of checking the entire output, it only checks the most recent N tokens. If repeat_last_n=64, only the last 64 tokens are scanned for repetitions. In llama.cpp, a value of 0 disables the penalty entirely, while -1 extends the window to the full context size; exact semantics vary by implementation, so check your framework's documentation.
When To Use
- 64: Default — good for short to medium outputs.
- 128–512: Better for code (captures variable name patterns) and long-form prose.
- -1: Checks the full context — prevents repetition throughout (note: in llama.cpp, 0 disables the penalty rather than widening the window).
- Small values (16–32): Very local window — only prevents immediate echo/loops.
Examples
Pro Tip
For code generation, use 128–512 to maintain variable naming consistency. For creative prose, 64 is usually sufficient. Setting this too high with a strong repetition penalty can make the model 'run out' of natural words to use.
Repetition Penalty
What It Does
A multiplicative penalty applied to tokens that have recently appeared in the output. Unlike the additive frequency/presence penalties used by OpenAI-style APIs, repetition penalty is multiplicative and commonly used in open-source models (Hugging Face, llama.cpp, vLLM).
How It Works
For each candidate token that has appeared recently: if the logit is positive, it is divided by the penalty value; if negative, it is multiplied by the penalty value. A value of 1.0 means no penalty (disabled). Values above 1.0 reduce the probability of repetition. For example, at 1.1, a token with logit 5.0 becomes 5.0/1.1 ≈ 4.55. Because the penalty multiplies rather than adds, its effect scales with the magnitude of the logit.
When To Use
- 1.0: Disabled (default).
- 1.0–1.1: Light penalty — removes obvious loops.
- 1.1–1.2: Moderate — good default for most prose generation.
- 1.2–1.5: Strong — noticeably reduces repetition but may affect naturalness.
- 1.5+: Very aggressive — use only for severe repetition issues.
Examples
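The divide-if-positive, multiply-if-negative rule from above, sketched with toy logits:

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Positive logits are divided by the penalty; negative logits are multiplied."""
    recent = set(recent_tokens)
    adjusted = {}
    for tok, logit in logits.items():
        if tok in recent:
            logit = logit / penalty if logit > 0 else logit * penalty
        adjusted[tok] = logit
    return adjusted

logits = {"the": 5.0, "rare": -1.0, "new": 3.0}
adjusted = apply_repetition_penalty(logits, recent_tokens=["the", "rare"], penalty=1.1)
# "the": 5.0/1.1 ~ 4.55; "rare": -1.0*1.1 = -1.1; "new" is untouched
```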
Pro Tip
Never exceed 1.2 for most tasks. For code generation, use 1.0–1.05 because code naturally reuses variable names and syntax. Pair with repeat_last_n to define the lookback window.
REPRODUCIBILITY & ADVANCED CONTROL
Logit Bias
What It Does
Allows manual adjustment of the probability of specific tokens being generated. You can increase or decrease the likelihood of individual tokens by adding a bias value to their logit scores before sampling. This provides fine-grained control over which words appear in the output.
How It Works
You provide a JSON mapping of token IDs to bias values. Before sampling at each step, the specified bias is added to the corresponding token's logit score. Positive bias values increase the token's probability; negative values decrease it. A bias of -100 effectively bans a token entirely, while +100 virtually guarantees its selection. Values between -1 and 1 provide subtle adjustments.
When To Use
- Classification tasks: Bias toward valid class labels (e.g., boost 'positive', 'negative', 'neutral' tokens).
- Content filtering: Ban specific words or tokens (bias = -100).
- Brand consistency: Boost preferred terminology, suppress competitors.
- Language control: Suppress tokens from unwanted languages.
- Format control: Boost JSON structure tokens like '{', '}', ':'.
Examples
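A sketch of the bias mechanism (the token IDs and logit values here are invented for illustration — real APIs key the bias map by your provider's tokenizer IDs):

```python
def apply_logit_bias(logits, bias_map):
    """Add per-token bias values to logits before sampling."""
    return {tok: logit + bias_map.get(tok, 0.0) for tok, logit in logits.items()}

# Hypothetical token IDs, e.g. for 'positive', 'negative', 'neutral'
logits = {1234: 2.0, 5678: 1.8, 9012: 0.5}
biased = apply_logit_bias(logits, {9012: 5.0, 5678: -100.0})
# Token 9012 is boosted to the top; token 5678 is effectively banned
```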
Pro Tip
You need token IDs (not words) for most APIs. Use the provider's tokenizer to convert words to IDs first. Be cautious: aggressive biasing can produce grammatically incorrect or incoherent output.
Logprobs
What It Does
When enabled, returns the log-probabilities of the generated tokens (and optionally the top N alternative tokens) alongside the output text. This provides transparency into the model's confidence and decision-making process at each step.
How It Works
For each generated token, the API returns the natural log of its probability. A logprob of 0 means 100% confidence; more negative values indicate lower confidence. If top_logprobs=5, you also see the 5 most likely alternative tokens and their probabilities at each position. This is useful for understanding why the model chose particular words.
When To Use
- Debugging: See which tokens the model almost chose — helps diagnose issues.
- Confidence scoring: Use logprobs to assess how confident the model is in its answers.
- Classification: Compare logprobs of different label tokens to determine the most likely class.
- Research: Analyze model behavior and decision boundaries.
- Calibration: Identify when the model is uncertain and might hallucinate.
Examples
Token: 'Paris' → logprob: -0.05 (very confident, ~95%)
Token: 'Lyon' → logprob: -3.2 (low confidence, ~4%)
Token: 'Berlin' → logprob: -5.8 (very unlikely, ~0.3%)
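The percentages above come from exponentiating the natural-log probability, which is a one-liner:

```python
import math

def logprob_to_percent(logprob):
    """Convert a natural-log probability to a percentage."""
    return math.exp(logprob) * 100

paris = logprob_to_percent(-0.05)   # ~95%
lyon = logprob_to_percent(-3.2)     # ~4%
```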
Pro Tip
Logprobs are invaluable for building confidence-based routing systems: if the model's top logprob is very negative (low confidence), route the query to a human or a more capable model.
N (Number of Completions)
What It Does
Specifies how many independent completion responses the model should generate for a single prompt. This allows you to sample multiple outputs and select the best one, implement majority voting (self-consistency), or offer users multiple alternatives.
How It Works
The model runs the generation process N times independently, each potentially producing a different output (assuming temperature > 0). Some APIs also support 'best_of' which generates more candidates internally and returns only the N highest-scoring ones. Each completion is an independent sample from the same probability distribution.
When To Use
- n=1: Standard — one response per request (default, most cost-efficient).
- n=3–5: Self-consistency prompting — generate multiple reasoning paths, pick majority answer.
- n=5–10: Creative brainstorming — offer multiple alternatives to choose from.
- n>10: Research / evaluation — large sample for statistical analysis.
Examples
Response 1: '42' (correct)
Response 2: '42' (correct)
Response 3: '38' (incorrect)
Majority vote → '42' is the self-consistent answer.
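The vote itself is a simple frequency count over the sampled answers (the completions here are the toy strings from the example above):

```python
from collections import Counter

def majority_vote(completions):
    """Return the most common answer across n sampled completions."""
    return Counter(completions).most_common(1)[0][0]

answer = majority_vote(["42", "42", "38"])   # the self-consistent answer
```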
Pro Tip
Costs scale linearly with N — generating 5 completions costs 5x a single completion. For self-consistency, n=3–5 is usually sufficient. For creative tasks, n=3 provides good diversity without excessive cost.
Seed
What It Does
Sets the random number generator seed for the sampling process. When the same seed is used with identical parameters and input, the model should produce the same output — enabling reproducible results for testing, debugging, and audit purposes.
How It Works
LLM text generation involves random sampling from probability distributions. The seed initializes the random number generator that controls this sampling. With the same seed, temperature, top_p, and input, the sequence of random choices is identical, producing deterministic output. Note: reproducibility is 'best effort' — GPU hardware differences, batching, and model updates may cause slight variations even with the same seed.
When To Use
- Testing & debugging: Set a fixed seed to reproduce exact outputs across test runs.
- Audit & compliance: Financial and medical applications that require reproducible results.
- A/B testing: Compare prompt versions while controlling for randomness.
- Production: Leave as null/random for natural variety in user-facing applications.
Examples
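The reproducibility property can be sketched with Python's standard random module standing in for the sampler's RNG (a toy stand-in, not how providers implement seeding internally):

```python
import random

def sample_token(probs, seed=None):
    """Sample one token from a distribution, optionally seeded for reproducibility."""
    rng = random.Random(seed)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.2}
a = sample_token(probs, seed=42)
b = sample_token(probs, seed=42)   # same seed, same parameters -> same choice
```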
Pro Tip
For fully deterministic output, combine seed with temperature=0. Even with a seed, non-zero temperature introduces controlled randomness that the seed makes reproducible. Not all providers guarantee perfect reproducibility — OpenAI notes it as 'best effort'.
ADAPTIVE & SPECIALIZED SAMPLING
Mirostat
What It Does
An adaptive decoding algorithm that dynamically adjusts sampling constraints to maintain a target level of 'surprise' (perplexity) throughout generation. Unlike fixed parameters like temperature, Mirostat continuously adapts to keep output quality consistent — preventing both 'boredom traps' (too repetitive) and 'confusion traps' (too incoherent).
How It Works
Mirostat monitors the observed 'surprise' (cross-entropy) of each generated token and compares it to a target value (mirostat_tau). If the recent output is too predictable (low surprise), it loosens the sampling to allow more diversity. If it's too surprising (high surprise), it tightens sampling for more coherence. The learning rate (mirostat_eta) controls how quickly these adjustments happen. When enabled, Mirostat replaces Top-K, Top-P, and other truncation samplers — it takes full control of the sampling process.
When To Use
- Mode 0: Disabled (use standard sampling).
- Mode 1: Mirostat v1 — original algorithm, good for general use.
- Mode 2: Mirostat v2 — improved version, generally recommended when using Mirostat.
- Best for: Long-form generation where consistent quality matters (articles, stories, documentation).
Examples
→ Maintains moderate perplexity throughout a 2000-word article
→ Output stays coherent and varied without degrading over length
Pro Tip
When using Mirostat, disable other samplers: set top_p=1.0, top_k=0, min_p=0.0. Mirostat is designed to be the sole controller. Test both Mode 1 and Mode 2 for your use case — they produce noticeably different outputs.
Mirostat Eta (η)
What It Does
Controls the learning rate — how quickly the Mirostat algorithm adjusts its sampling parameters in response to deviations from the target perplexity (tau).
How It Works
After each token is generated, Mirostat compares the observed surprise to the target tau. If there's a deviation, eta determines how aggressively the algorithm corrects. A higher eta means faster corrections (more responsive but potentially unstable). A lower eta means gradual, smoother adjustments (more stable but slower to react).
When To Use
- Low (0.01–0.05): Slow adjustments — very stable output, good for long documents.
- Medium (0.1): Default — balanced responsiveness.
- High (0.2–0.5): Fast adjustments — quickly adapts to context changes.
- Very High (0.5+): Very responsive — may overcorrect and oscillate.
Examples
Pro Tip
For most use cases, the default of 0.1 works well. Only adjust if you notice Mirostat either reacting too slowly to context shifts (increase eta) or oscillating between too-creative and too-focused (decrease eta).
Mirostat Tau (τ)
What It Does
Sets the target perplexity (level of 'surprise') that Mirostat tries to maintain. Lower values produce more focused, predictable text; higher values produce more diverse, creative text.
How It Works
Tau represents the desired average cross-entropy (surprise) per token. Mirostat adjusts its sampling dynamically to keep the actual perplexity close to this target. A tau of 3.0 means low surprise — the model sticks to highly expected tokens. A tau of 7.0 allows much more variability and unexpected word choices.
When To Use
- Low (2.0–4.0): Technical writing, factual content, documentation.
- Medium (4.0–6.0): General purpose, balanced coherence and variety.
- High (6.0–8.0): Creative writing, storytelling, brainstorming.
- Very High (8.0+): Experimental — very high diversity, risk of incoherence.
Examples
Pro Tip
Think of tau as the Mirostat equivalent of temperature. Start at 5.0 (default) and adjust ±1.0 at a time. For most professional use cases, tau between 4.0 and 6.0 works well.
Tail-Free Sampling (TFS)
What It Does
Filters out tokens in the extreme low-probability tail of the distribution based on the curvature (second derivative) of the sorted probability curve. It removes 'tail' tokens that are statistically atypical, providing finer-grained pruning than Top-P or Top-K.
How It Works
TFS examines the sorted probability distribution and calculates the second derivative (rate of change of the rate of change). Tokens in the flat 'tail' of the distribution — where probabilities barely differ from each other — are identified and removed. The tfs_z parameter controls how aggressively this tail is cut. A value of 1.0 disables TFS; lower values cut more aggressively.
When To Use
- 1.0: Disabled (default).
- 0.9–0.95: Light tail removal — subtle quality improvement.
- 0.5–0.9: Moderate — removes clearly atypical tokens.
- 0.1–0.5: Aggressive — only keeps tokens well within the main distribution.
- Best used in combination with Top-P or Top-K for additional quality refinement.
Examples
Pro Tip
TFS is a refinement tool, not a primary sampler. Use it alongside Top-P for additional quality control. It's most useful when you notice the model occasionally producing bizarre or contextually inappropriate words.
Typical P (Locally Typical Sampling)
What It Does
Selects tokens whose information content (surprise) is close to the average expected surprise, filtering out both overly predictable tokens and overly surprising ones. The idea is to choose tokens that are 'typically' informative — not boring, not bizarre.
How It Works
For each candidate token, typical sampling calculates its 'local typicality' — how close its surprise (negative log probability) is to the expected surprise. Tokens that are either much more predictable or much less predictable than the average are filtered out. The parameter controls the cumulative threshold (similar to Top-P but over the typicality-sorted distribution).
When To Use
- 1.0: Disabled (default).
- 0.9–0.95: Light filtering — removes the most atypical tokens.
- 0.5–0.8: Moderate — keeps only reasonably typical tokens.
- 0.2–0.5: Strong — highly constrained to typical tokens only.
- Best for: Natural-sounding prose and dialogue where you want human-like word choices.
Examples
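A simplified sketch of the idea (toy distribution; the published algorithm operates on the full vocabulary, but the ranking-by-typicality step is the same): rank tokens by how far their surprise is from the distribution's entropy, then keep the closest ones until the cumulative mass reaches the threshold.

```python
import math

def typical_filter(probs, typical_p=0.9):
    """Keep tokens whose surprise is closest to the entropy, up to typical_p mass."""
    entropy = -sum(p * math.log(p) for p in probs.values())
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cum = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cum += p
        if cum >= typical_p:
            break
    return kept

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
kept = typical_filter(probs, typical_p=0.8)   # "d" is the most atypical and is cut
```

Note that the most probable token is not necessarily the most typical — "b" ranks first here because its surprise sits closest to the entropy.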
Pro Tip
Typical sampling was designed based on information theory — the idea that human language tends to be 'typically' informative rather than maximally predictable or maximally surprising. It's worth experimenting with for natural-sounding dialogue.
FORMAT, SYSTEM & API CONTROLS
End-of-Sequence Token (EOS)
What It Does
The EOS token is a special token that signals the model has completed its response. When generated, it normally causes the model to stop producing further tokens. The ignore_eos parameter can override this behavior, forcing the model to continue generating until max_tokens is reached.
How It Works
During training, models learn to produce the EOS token when they believe a complete response has been generated. In normal inference, producing EOS stops generation. Setting ignore_eos=true suppresses this signal, forcing continued generation — useful for benchmarking and specific research scenarios but generally not recommended for production use.
When To Use
- ignore_eos=false (default): Normal operation — let the model decide when it's done.
- ignore_eos=true: Benchmarking (to reach exact output token counts), specific research scenarios, or when the model prematurely stops and you need it to continue.
Examples
Normal: Model generates 'The answer is 42.' + EOS → Generation stops at the EOS token
With ignore_eos=true: the EOS signal is suppressed → The model keeps generating until max_tokens is reached
Pro Tip
In production, never set ignore_eos=true — it will cause the model to ramble past its natural completion point. This is primarily a debugging and benchmarking tool.
Model Selection
What It Does
Specifies which AI model to use for generation. Different models vary dramatically in capability, speed, cost, context window size, and specialization. Model selection is often the most impactful 'parameter' affecting output quality.
How It Works
Each model has been trained differently, with different data, architectures, and fine-tuning. Larger models (e.g., GPT-4, Claude Opus) generally produce higher-quality output but cost more and are slower. Smaller models (e.g., GPT-4o-mini, Claude Haiku) are faster and cheaper but may sacrifice quality on complex tasks.
When To Use
- Complex reasoning / analysis: Use the most capable model (GPT-4, Claude Opus).
- General tasks: Use mid-tier models (GPT-4o, Claude Sonnet) for good balance.
- Simple tasks / high volume: Use small models (GPT-4o-mini, Claude Haiku) for speed and cost.
- Specialized tasks: Consider domain-specific or fine-tuned models.
- Cost optimization: Route simple queries to small models, complex ones to large models.
Examples
Pro Tip
The best practice is to use model routing: classify incoming requests by complexity and route them to the appropriate model tier. This can reduce costs by 50–70% while maintaining quality where it matters.
Response Format / Structured Output
What It Does
Constrains the model's output to conform to a specific format, most commonly valid JSON. Some advanced implementations support JSON Schema enforcement, ensuring output matches an exact structure with required fields and data types.
How It Works
When response_format is set to 'json_object' or a JSON schema, the model's generation is constrained at the token level — at each step, only tokens that would produce valid JSON (or match the schema) are eligible for sampling. This is more reliable than prompting alone because it provides hard format guarantees rather than relying on the model's compliance.
When To Use
- json_object: When you need valid JSON output but flexible structure.
- json_schema: When you need output matching an exact schema (fields, types, nesting).
- text: Default — free-form text output.
- regex: Some frameworks support regex-constrained generation.
- Best for: API integrations, data extraction, classification, structured workflows.
Examples
response_format: {
  type: 'json_schema',
  json_schema: {
    name: 'sentiment',
    schema: {
      type: 'object',
      properties: {
        sentiment: { enum: ['positive', 'negative', 'neutral'] },
        confidence: { type: 'number' }
      },
      required: ['sentiment', 'confidence']
    }
  }
}
Pro Tip
Structured output via API is more reliable than prompt-only approaches ('Please respond in JSON'). Always prefer API-level format enforcement when available. Combine with a clear system prompt describing the expected structure.
Streaming
What It Does
When enabled, the model sends tokens to the client incrementally as they are generated, rather than waiting for the complete response. This provides a real-time 'typing' effect and significantly improves perceived latency for users.
How It Works
With streaming enabled, the API returns a stream of server-sent events (SSE), each containing one or a few tokens. The client can display these tokens immediately as they arrive. The total generation time is the same, but the time-to-first-token (TTFT) is much shorter, making the experience feel faster and more responsive.
When To Use
- Chatbots / UI: Almost always enable streaming for user-facing applications.
- API integrations: Disable if your pipeline processes the complete response at once.
- Batch processing: Disable for batch workloads where streaming adds overhead.
Examples
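The client-side pattern can be sketched with a toy generator standing in for the provider's SSE stream (real streaming arrives over the network as server-sent events, chunked by tokens rather than fixed character counts):

```python
def stream_chunks(text, size=4):
    """Toy generator standing in for an SSE stream of token chunks."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

received = []
for chunk in stream_chunks("Hello, world!"):
    received.append(chunk)   # a real UI would render each chunk as it arrives
full = "".join(received)     # the assembled response is identical either way
```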
Pro Tip
Streaming is almost mandatory for user-facing chat applications. Without it, users stare at a blank screen for several seconds. With it, they see the first word appear in under a second, even if the full response takes 10 seconds.
System Prompt / System Message
What It Does
A special message set at the beginning of the conversation that defines the model's persona, behavior, constraints, and operational rules. System prompts are typically hidden from end users and persist throughout the entire conversation, establishing the foundational 'character' and ground rules for the AI.
How It Works
In the messages API format, the system prompt is sent with role='system' before any user messages. The model treats it as high-priority instruction that frames all subsequent interactions. System prompts can define: persona/role, tone/style, capabilities/limitations, safety guardrails, output format requirements, domain-specific knowledge, and behavioral constraints.
When To Use
- Always — every production application should use a system prompt.
- Define persona: 'You are a helpful financial advisor...'
- Set guardrails: 'Never provide medical diagnoses...'
- Control format: 'Always respond in JSON format...'
- Establish tone: 'Use a professional, concise tone...'
Examples
system: 'You are a senior Python developer. Respond with clean, well-commented code following PEP 8. Always explain your reasoning before writing code. If a question is ambiguous, ask for clarification.'
Pro Tip
System prompts are the most powerful tool for controlling model behavior in production. Invest significant time crafting them. Include: role definition, behavioral rules, output format, safety constraints, and edge case handling.