CORE SAMPLING PARAMETERS
Min-P Sampling
What It Does
Filters out tokens whose probability is below a minimum threshold relative to the most probable token. It dynamically prunes very unlikely tokens while keeping a contextually appropriate number of candidates.
How It Works
For each generation step, the model identifies the highest probability token. Any token whose probability is less than (min_p × highest_probability) is discarded. For example, if the top token has probability 0.4 and min_p=0.1, any token with probability below 0.04 (i.e., 0.1 × 0.4) is removed from consideration. This adapts naturally: when the model is confident, fewer tokens pass; when uncertain, more tokens are eligible.
When To Use
- 0.05–0.1: Light filtering — removes only extremely unlikely tokens.
- 0.1–0.2: Moderate filtering — good balance for most tasks.
- 0.3+: Aggressive filtering — only high-probability tokens survive.
Examples
With min_p=0.1 and a top token at probability 0.5, the cutoff is 0.05. Results: Tokens with 0.5, 0.3, 0.1, 0.08 survive; tokens at 0.03, 0.01 are cut.
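As an illustrative sketch (toy probabilities and made-up tokens, not a real tokenizer or inference engine), the filtering step can be written in a few lines of Python:

```python
def min_p_filter(probs, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the top probability."""
    cutoff = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= cutoff}

# Top token at 0.5 with min_p=0.1 gives a cutoff of 0.05
probs = {"the": 0.5, "a": 0.3, "an": 0.1, "this": 0.08, "that": 0.03, "those": 0.01}
kept = min_p_filter(probs, min_p=0.1)   # "that" and "those" fall below the cutoff
```

Note how the cutoff scales with the model's confidence: a sharper distribution raises the bar, a flatter one lowers it.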
Pro Tip
Min-P is increasingly popular as an alternative to Top-K and Top-P because it scales naturally with the model's confidence level. It is supported in llama.cpp, vLLM, and several other frameworks.
Temperature
What It Does
Controls the randomness and creativity of the model's output by scaling the raw logit scores before they are converted into probabilities via the softmax function. It is the single most influential parameter for shaping output behavior.
How It Works
The formula is: adjusted_probability = softmax(logits / temperature). At temperature=1.0, the model uses its native probability distribution. Lower values (e.g., 0.2) sharpen the distribution, making the highest-probability token overwhelmingly likely — resulting in deterministic, focused, and consistent output. Higher values (e.g., 1.2) flatten the distribution, giving lower-probability tokens a better chance of being selected — producing more creative, diverse, and sometimes surprising text. At temperature=0, the model always picks the single most probable token (greedy decoding). Values above 1.5–2.0 can produce incoherent or nonsensical output.
When To Use
- Low (0.0–0.3): Factual Q&A, code generation, data extraction, classification, medical/legal applications where consistency is critical.
- Medium (0.4–0.7): Balanced tasks like email drafting, summarization, customer support, general conversation.
- High (0.8–1.2): Creative writing, brainstorming, poetry, storytelling, idea generation.
- Very High (1.3–2.0): Experimental use only — highly unpredictable output.
Examples
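The sharpening/flattening effect described above can be demonstrated with a self-contained softmax sketch (toy logits, not real model outputs):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize with softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=0.2)  # top token dominates
flat = softmax_with_temperature(logits, temperature=1.5)   # mass spread more evenly
```

At temperature 0.2 the top token takes nearly all of the probability mass; at 1.5 the same logits yield a much flatter distribution.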
Pro Tip
For production systems, start with temperature=0.3 and increase only if output feels too rigid. Never change temperature and top_p simultaneously — adjust one at a time.
Top-K Sampling
What It Does
Restricts the model's next-token selection to only the K most probable tokens, ignoring all others regardless of their probability. It provides a hard ceiling on vocabulary diversity at each generation step.
How It Works
The model computes probabilities for all tokens in its vocabulary, then keeps only the K tokens with the highest probabilities. All other tokens have their probabilities set to zero. The model then samples from this reduced set. With K=1, the model always picks the single most likely token (greedy decoding). With K=50, it chooses from 50 candidates. With K=100,000+, essentially all tokens are eligible.
When To Use
- K=1: Greedy decoding — most deterministic, always picks the top token.
- K=10–20: Focused but with slight variation. Good for code, formal text.
- K=40–50: Standard default. Balanced diversity for general tasks.
- K=100+: High diversity, approaching unrestricted sampling.
- K=0 or -1: Disabled (no Top-K filtering applied).
Examples
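A minimal sketch of the truncation step (toy probabilities, hypothetical tokens) — keep the K most probable tokens and renormalize:

```python
def top_k_filter(probs, k=3):
    """Keep only the k highest-probability tokens, renormalized to sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {"cat": 0.4, "dog": 0.3, "bird": 0.2, "fish": 0.07, "ant": 0.03}
kept = top_k_filter(probs, k=3)   # "fish" and "ant" are zeroed out
```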
Pro Tip
Top-K is less flexible than Top-P because it uses a fixed number regardless of probability distribution. If the model is very confident (one token has 99% probability), K=50 still considers 50 tokens unnecessarily. Top-P handles this better by adapting dynamically.
Top-P (Nucleus Sampling)
What It Does
Controls output diversity by dynamically selecting the smallest set of tokens whose cumulative probability exceeds the threshold P. Unlike Top-K which uses a fixed count, Top-P adapts — sometimes considering 5 tokens, sometimes 500 — depending on the model's confidence at each position.
How It Works
After computing probabilities for all tokens, the model sorts them from highest to lowest probability and accumulates until the sum reaches the P threshold. Only tokens within this cumulative set are eligible for sampling; all others are discarded. For example, with top_p=0.9, the model considers enough top tokens to cover 90% of the total probability mass. If one token has 95% probability, only that token is considered. If probabilities are spread evenly, many tokens are included.
When To Use
- Low (0.1–0.5): Highly constrained output — only the very top predictions. Good for classification, structured output, and deterministic tasks.
- Medium (0.6–0.8): Balanced diversity. Good for most general-purpose tasks.
- High (0.9–1.0): Maximum diversity from the full probability distribution. Good for creative and exploratory tasks.
- Note: top_p=1.0 means no filtering (all tokens eligible).
Examples
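The cumulative-mass rule can be sketched as follows (toy probabilities; real implementations work on the full vocabulary and include the token that crosses the threshold, as here):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.15, "fish": 0.05}
kept = top_p_filter(probs, top_p=0.9)  # 0.5 + 0.3 = 0.8 < 0.9, so "bird" is included too
```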
Pro Tip
Top-P is generally preferred over Top-K for most use cases because it adapts dynamically to the model's confidence. The general recommendation is to adjust either temperature OR top_p, but not both aggressively at the same time.
OUTPUT LENGTH & STRUCTURE CONTROLS
Context Window / Context Length
What It Does
Defines the total maximum number of tokens (input + output combined) that the model can process in a single interaction. It is the model's total working memory for the conversation — everything the model can 'see' at once.
How It Works
The context window is an architectural property of the model determined during training. Everything the model considers — system prompt, conversation history, retrieved documents, and generated output — must fit within this window. If the total exceeds the limit, earlier content is typically truncated or the request fails. Larger context windows enable longer conversations and bigger document inputs but consume more memory and compute.
When To Use
- 2K–4K: Simple Q&A, short conversations.
- 8K–32K: Standard applications, moderate documents.
- 64K–128K: Long documents, extended conversations, code repositories.
- 200K–1M+: Entire books, massive codebases, extensive research.
Examples
A 128K token context window can process ~96,000 words — enough for a full-length novel.
A 4K context window limits you to ~3,000 words total (prompt + response combined).
Pro Tip
Context window usage directly impacts cost and latency. Use context efficiently: summarize old conversation turns, use RAG to inject only relevant documents, and remove redundant information from prompts.
Max Tokens / Max Length
What It Does
Sets the maximum number of tokens the model can generate in a single response. This is a hard ceiling — the model will stop generating once it reaches this limit, even if its response is incomplete. It directly affects cost (more tokens = higher API cost) and latency (more tokens = longer response time).
How It Works
The model generates tokens one at a time until it either: (a) reaches the max_tokens limit, (b) produces a stop sequence, or (c) generates the special end-of-sequence (EOS) token. Max tokens includes only the output tokens — not the input/prompt tokens. However, the total (input + output) must fit within the model's context window.
When To Use
- Short (50–150): Quick answers, classifications, single-sentence responses.
- Medium (150–500): Paragraphs, summaries, standard Q&A.
- Long (500–2000): Articles, reports, detailed explanations.
- Very Long (2000–8000+): Full documents, book chapters, extensive code generation.
- Note: Setting this too low truncates responses mid-sentence; too high wastes cost on padding.
Examples
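One way to pick a value, sketched under the same ~0.75 words-per-token rule of thumb used elsewhere in this guide (≈1.33 tokens per English word — a rough heuristic that varies by tokenizer and language):

```python
def estimate_max_tokens(expected_words, buffer=0.25, tokens_per_word=1.33):
    """Estimate a max_tokens setting from expected output length plus a safety buffer."""
    return int(expected_words * tokens_per_word * (1 + buffer))

limit = estimate_max_tokens(300)   # budget for a ~300-word summary with a 25% buffer
```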
Pro Tip
Always set max_tokens explicitly rather than relying on defaults. For cost optimization, estimate the expected output length and add a 20–30% buffer. Remember: output tokens typically cost 4–6x more than input tokens on most API providers.
Min Tokens
What It Does
Sets the minimum number of tokens the model must generate before it is allowed to stop (by producing an EOS token or stop sequence). This prevents the model from giving overly brief or empty responses.
How It Works
When min_tokens is set, the model's EOS token and any stop sequences are suppressed until the minimum token count is reached. After that threshold, normal stopping behavior resumes. This ensures a minimum response length.
When To Use
- Useful when models tend to give very short or one-word answers. Set to a reasonable minimum to ensure substantive responses without forcing unnecessary padding.
Examples
Pro Tip
Use sparingly. Forcing minimum length can lead to padding and filler content. It's often better to address brevity issues through prompt engineering rather than this parameter.
Stop Sequences
What It Does
Defines one or more specific strings or tokens that, when generated by the model, immediately halt further text generation. The stop sequence itself may or may not be included in the output, depending on the provider's implementation.
How It Works
The model checks its output after each token generation. If the accumulated output ends with any of the defined stop sequences, generation stops immediately. This is useful for controlling output structure, preventing runaway generation, and creating clean boundaries in multi-turn or structured outputs.
When To Use
- Structured outputs: Use '\n\n' to stop after a single paragraph.
- Conversation: Use 'User:' to stop when the model would simulate the user's response.
- Lists: Use a number like '11.' to stop at 10 items.
- Code: Use '```' to stop after a code block.
- JSON: Use '}' or ']' to stop after the closing bracket.
Examples
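Providers apply stop sequences token by token during generation; as a client-side approximation, the truncation logic looks like this (illustrative text, not a real API response):

```python
def apply_stop_sequences(generated, stops):
    """Truncate generated text at the earliest occurrence of any stop sequence."""
    cut = len(generated)
    for stop in stops:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut]

text = "The capital of France is Paris.\nUser: What about Spain?"
clean = apply_stop_sequences(text, stops=["User:", "\n\n"])
# The simulated user turn after "User:" is cut away
```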
Pro Tip
Stop sequences are powerful for preventing 'model leakage' where the model continues to generate content beyond the intended boundary, such as simulating both sides of a conversation. Use them in all production systems.
REPETITION CONTROL PARAMETERS
Frequency Penalty
What It Does
Applies a penalty to tokens proportional to how many times they have already appeared in the generated text. The more frequently a word has appeared, the stronger the penalty. This reduces word-level repetition and encourages lexical diversity.
How It Works
During generation, the model tracks token counts. For each candidate token, the penalty is calculated as: adjusted_logit = logit - (frequency_penalty × count_of_token_in_output). Higher counts receive proportionally higher penalties. Positive values discourage repetition; negative values actually encourage it (useful for tasks requiring consistent terminology). A value of 0.0 means no penalty is applied.
When To Use
- 0.0: No penalty — default for most tasks.
- 0.1–0.5: Light to moderate discouragement of repetition. Good for essays, articles, and general writing.
- 0.5–1.0: Strong discouragement. Good for creative writing where variety is important.
- 1.0–2.0: Very aggressive. May produce unnatural text. Use with caution.
- Negative values (-0.1 to -1.0): Encourage repetition — useful for consistent technical documentation or code.
Examples
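The formula above can be applied directly (toy logits and tokens for illustration):

```python
from collections import Counter

def apply_frequency_penalty(logits, output_tokens, frequency_penalty=0.5):
    """adjusted_logit = logit - frequency_penalty * count_of_token_in_output"""
    counts = Counter(output_tokens)
    return {tok: logit - frequency_penalty * counts[tok]
            for tok, logit in logits.items()}

logits = {"great": 2.0, "good": 1.5, "nice": 1.0}
output = ["great", "great", "great", "good"]   # "great" already used 3 times
adjusted = apply_frequency_penalty(logits, output, frequency_penalty=0.5)
# "great" is penalized 3x as hard as "good", so "good" now outranks it
```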
Pro Tip
The key difference from presence penalty: frequency penalty scales with count (a word used 10 times gets 10x the penalty), while presence penalty applies the same flat penalty regardless of count. Generally, adjust frequency OR presence penalty, not both.
Presence Penalty
What It Does
Applies a flat, fixed penalty to any token that has appeared at least once in the output, regardless of how many times it appeared. A token used twice receives the same penalty as one used ten times. This encourages the model to introduce new topics and concepts.
How It Works
The model checks whether each candidate token has appeared anywhere in the generated text so far. If it has (even once), the presence penalty is subtracted from its logit score: adjusted_logit = logit - presence_penalty. Unlike frequency penalty, the penalty does not increase with multiple occurrences — the check is binary: present or absent. Positive values penalize previously used tokens; negative values encourage their reuse.
When To Use
- 0.0: Default — no topic diversity enforcement.
- 0.1–0.5: Gently nudges toward new topics. Good for diverse brainstorming.
- 0.5–1.0: Strong push to avoid returning to previously mentioned concepts.
- 1.0–2.0: Very aggressive. May cause the model to avoid important referencing.
- Negative values: Encourage staying on topic and reusing established terms.
Examples
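A sketch of the flat-penalty rule, using the same toy logits as the frequency penalty example for contrast:

```python
def apply_presence_penalty(logits, output_tokens, presence_penalty=0.5):
    """Flat penalty for any token seen at least once, regardless of count."""
    seen = set(output_tokens)
    return {tok: logit - (presence_penalty if tok in seen else 0.0)
            for tok, logit in logits.items()}

logits = {"great": 2.0, "good": 1.5, "nice": 1.0}
output = ["great", "great", "great", "good"]   # counts don't matter, only presence
adjusted = apply_presence_penalty(logits, output, presence_penalty=0.5)
# "great" (used 3x) and "good" (used 1x) receive the same 0.5 penalty
```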
Pro Tip
Use presence penalty when you want the model to explore new ground (brainstorming, diverse lists). Use frequency penalty when you just want to reduce word-level repetition while staying on topic.
Repeat Last N
What It Does
Defines the lookback window (in tokens) that the repetition penalty considers. Only tokens that appeared within the last N generated tokens are penalized — older tokens outside this window are treated as fresh.
How It Works
This parameter works in conjunction with repetition_penalty. Instead of checking the entire output, it only checks the most recent N tokens. If repeat_last_n=64, only the last 64 tokens are scanned for repetitions. In llama.cpp, a value of 0 disables the penalty entirely, while -1 extends the window to the full context size; exact semantics vary by implementation, so check your framework's documentation.
When To Use
- 64: Default — good for short to medium outputs.
- 128–512: Better for code (captures variable name patterns) and long-form prose.
- -1: Checks the full context — prevents repetition throughout (note: in llama.cpp, 0 disables the penalty rather than widening the window).
- Small values (16–32): Very local window — only prevents immediate echo/loops.
Examples
Pro Tip
For code generation, use 128–512 to maintain variable naming consistency. For creative prose, 64 is usually sufficient. Setting this too high with a strong repetition penalty can make the model 'run out' of natural words to use.
Repetition Penalty
What It Does
A multiplicative penalty applied to tokens that have recently appeared in the output. Unlike the additive frequency/presence penalties used by OpenAI-style APIs, repetition penalty is multiplicative and commonly used in open-source models (Hugging Face, llama.cpp, vLLM).
How It Works
For each candidate token that has appeared recently: if the logit is positive, it is divided by the penalty value; if negative, it is multiplied by the penalty value. A value of 1.0 means no penalty (disabled). Values above 1.0 reduce the probability of repetition. For example, at 1.1, a token with logit 5.0 becomes 5.0/1.1 ≈ 4.55. Because the penalty multiplies rather than adds, its effect scales with the magnitude of the logit.
When To Use
- 1.0: Disabled (default).
- 1.0–1.1: Light penalty — removes obvious loops.
- 1.1–1.2: Moderate — good default for most prose generation.
- 1.2–1.5: Strong — noticeably reduces repetition but may affect naturalness.
- 1.5+: Very aggressive — use only for severe repetition issues.
Examples
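The divide-if-positive, multiply-if-negative rule from above, sketched with toy logits:

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Positive logits are divided by the penalty; negative logits are multiplied."""
    recent = set(recent_tokens)
    adjusted = {}
    for tok, logit in logits.items():
        if tok in recent:
            logit = logit / penalty if logit > 0 else logit * penalty
        adjusted[tok] = logit
    return adjusted

logits = {"the": 5.0, "rare": -1.0, "new": 3.0}
adjusted = apply_repetition_penalty(logits, recent_tokens=["the", "rare"], penalty=1.1)
# "the": 5.0/1.1 ~ 4.55; "rare": -1.0*1.1 = -1.1; "new" is untouched
```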
Pro Tip
Never exceed 1.2 for most tasks. For code generation, use 1.0–1.05 because code naturally reuses variable names and syntax. Pair with repeat_last_n to define the lookback window.
REPRODUCIBILITY & ADVANCED CONTROL
Logit Bias
What It Does
Allows manual adjustment of the probability of specific tokens being generated. You can increase or decrease the likelihood of individual tokens by adding a bias value to their logit scores before sampling. This provides fine-grained control over which words appear in the output.
How It Works
You provide a JSON mapping of token IDs to bias values. Before sampling at each step, the specified bias is added to the corresponding token's logit score. Positive bias values increase the token's probability; negative values decrease it. A bias of -100 effectively bans a token entirely, while +100 virtually guarantees its selection. Values between -1 and 1 provide subtle adjustments.
When To Use
- Classification tasks: Bias toward valid class labels (e.g., boost 'positive', 'negative', 'neutral' tokens).
- Content filtering: Ban specific words or tokens (bias = -100).
- Brand consistency: Boost preferred terminology, suppress competitors.
- Language control: Suppress tokens from unwanted languages.
- Format control: Boost JSON structure tokens like '{', '}', ':'.
Examples
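A sketch of the bias mechanism (the token IDs and logit values here are invented for illustration — real APIs key the bias map by your provider's tokenizer IDs):

```python
def apply_logit_bias(logits, bias_map):
    """Add per-token bias values to logits before sampling."""
    return {tok: logit + bias_map.get(tok, 0.0) for tok, logit in logits.items()}

# Hypothetical token IDs, e.g. for 'positive', 'negative', 'neutral'
logits = {1234: 2.0, 5678: 1.8, 9012: 0.5}
biased = apply_logit_bias(logits, {9012: 5.0, 5678: -100.0})
# Token 9012 is boosted to the top; token 5678 is effectively banned
```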
Pro Tip
You need token IDs (not words) for most APIs. Use the provider's tokenizer to convert words to IDs first. Be cautious: aggressive biasing can produce grammatically incorrect or incoherent output.
Logprobs
What It Does
When enabled, returns the log-probabilities of the generated tokens (and optionally the top N alternative tokens) alongside the output text. This provides transparency into the model's confidence and decision-making process at each step.
How It Works
For each generated token, the API returns the natural log of its probability. A logprob of 0 means 100% confidence; more negative values indicate lower confidence. If top_logprobs=5, you also see the 5 most likely alternative tokens and their probabilities at each position. This is useful for understanding why the model chose particular words.
When To Use
- Debugging: See which tokens the model almost chose — helps diagnose issues.
- Confidence scoring: Use logprobs to assess how confident the model is in its answers.
- Classification: Compare logprobs of different label tokens to determine the most likely class.
- Research: Analyze model behavior and decision boundaries.
- Calibration: Identify when the model is uncertain and might hallucinate.
Examples
Token: 'Paris' → logprob: -0.05 (very confident, ~95%)
Token: 'Lyon' → logprob: -3.2 (low confidence, ~4%)
Token: 'Berlin' → logprob: -5.8 (very unlikely, ~0.3%)
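The percentages above come from exponentiating the natural-log probability, which is a one-liner:

```python
import math

def logprob_to_percent(logprob):
    """Convert a natural-log probability to a percentage."""
    return math.exp(logprob) * 100

paris = logprob_to_percent(-0.05)   # ~95%
lyon = logprob_to_percent(-3.2)     # ~4%
```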
Pro Tip
Logprobs are invaluable for building confidence-based routing systems: if the model's top logprob is very negative (low confidence), route the query to a human or a more capable model.
N (Number of Completions)
What It Does
Specifies how many independent completion responses the model should generate for a single prompt. This allows you to sample multiple outputs and select the best one, implement majority voting (self-consistency), or offer users multiple alternatives.
How It Works
The model runs the generation process N times independently, each potentially producing a different output (assuming temperature > 0). Some APIs also support 'best_of' which generates more candidates internally and returns only the N highest-scoring ones. Each completion is an independent sample from the same probability distribution.
When To Use
- n=1: Standard — one response per request (default, most cost-efficient).
- n=3–5: Self-consistency prompting — generate multiple reasoning paths, pick majority answer.
- n=5–10: Creative brainstorming — offer multiple alternatives to choose from.
- n>10: Research / evaluation — large sample for statistical analysis.
Examples
Response 1: '42' (correct)
Response 2: '42' (correct)
Response 3: '38' (incorrect)
Majority vote → '42' is the self-consistent answer.
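The vote itself is a simple frequency count over the sampled answers (the completions here are the toy strings from the example above):

```python
from collections import Counter

def majority_vote(completions):
    """Return the most common answer across n sampled completions."""
    return Counter(completions).most_common(1)[0][0]

answer = majority_vote(["42", "42", "38"])   # the self-consistent answer
```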
Pro Tip
Costs scale linearly with N — generating 5 completions costs 5x a single completion. For self-consistency, n=3–5 is usually sufficient. For creative tasks, n=3 provides good diversity without excessive cost.
Seed
What It Does
Sets the random number generator seed for the sampling process. When the same seed is used with identical parameters and input, the model should produce the same output — enabling reproducible results for testing, debugging, and audit purposes.
How It Works
LLM text generation involves random sampling from probability distributions. The seed initializes the random number generator that controls this sampling. With the same seed, temperature, top_p, and input, the sequence of random choices is identical, producing deterministic output. Note: reproducibility is 'best effort' — GPU hardware differences, batching, and model updates may cause slight variations even with the same seed.
When To Use
- Testing & debugging: Set a fixed seed to reproduce exact outputs across test runs.
- Audit & compliance: Financial and medical applications that require reproducible results.
- A/B testing: Compare prompt versions while controlling for randomness.
- Production: Leave as null/random for natural variety in user-facing applications.
Examples
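The reproducibility property can be sketched with Python's standard random module standing in for the sampler's RNG (a toy stand-in, not how providers implement seeding internally):

```python
import random

def sample_token(probs, seed=None):
    """Sample one token from a distribution, optionally seeded for reproducibility."""
    rng = random.Random(seed)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.2}
a = sample_token(probs, seed=42)
b = sample_token(probs, seed=42)   # same seed, same parameters -> same choice
```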
Pro Tip
For fully deterministic output, combine seed with temperature=0. Even with a seed, non-zero temperature introduces controlled randomness that the seed makes reproducible. Not all providers guarantee perfect reproducibility — OpenAI notes it as 'best effort'.
ADAPTIVE & SPECIALIZED SAMPLING
Mirostat
What It Does
An adaptive decoding algorithm that dynamically adjusts sampling constraints to maintain a target level of 'surprise' (perplexity) throughout generation. Unlike fixed parameters like temperature, Mirostat continuously adapts to keep output quality consistent — preventing both 'boredom traps' (too repetitive) and 'confusion traps' (too incoherent).
How It Works
Mirostat monitors the observed 'surprise' (cross-entropy) of each generated token and compares it to a target value (mirostat_tau). If the recent output is too predictable (low surprise), it loosens the sampling to allow more diversity. If it's too surprising (high surprise), it tightens sampling for more coherence. The learning rate (mirostat_eta) controls how quickly these adjustments happen. When enabled, Mirostat replaces Top-K, Top-P, and other truncation samplers — it takes full control of the sampling process.
When To Use
- Mode 0: Disabled (use standard sampling).
- Mode 1: Mirostat v1 — original algorithm, good for general use.
- Mode 2: Mirostat v2 — improved version, generally recommended when using Mirostat.
- Best for: Long-form generation where consistent quality matters (articles, stories, documentation).
Examples
→ Maintains moderate perplexity throughout a 2000-word article
→ Output stays coherent and varied without degrading over length
Pro Tip
When using Mirostat, disable other samplers: set top_p=1.0, top_k=0, min_p=0.0. Mirostat is designed to be the sole controller. Test both Mode 1 and Mode 2 for your use case — they produce noticeably different outputs.
Mirostat Eta (η)
What It Does
Controls the learning rate — how quickly the Mirostat algorithm adjusts its sampling parameters in response to deviations from the target perplexity (tau).
How It Works
After each token is generated, Mirostat compares the observed surprise to the target tau. If there's a deviation, eta determines how aggressively the algorithm corrects. A higher eta means faster corrections (more responsive but potentially unstable). A lower eta means gradual, smoother adjustments (more stable but slower to react).
When To Use
- Low (0.01–0.05): Slow adjustments — very stable output, good for long documents.
- Medium (0.1): Default — balanced responsiveness.
- High (0.2–0.5): Fast adjustments — quickly adapts to context changes.
- Very High (0.5+): Very responsive — may overcorrect and oscillate.
Examples
Pro Tip
For most use cases, the default of 0.1 works well. Only adjust if you notice Mirostat either reacting too slowly to context shifts (increase eta) or oscillating between too-creative and too-focused (decrease eta).
Mirostat Tau (τ)
What It Does
Sets the target perplexity (level of 'surprise') that Mirostat tries to maintain. Lower values produce more focused, predictable text; higher values produce more diverse, creative text.
How It Works
Tau represents the desired average cross-entropy (surprise) per token. Mirostat adjusts its sampling dynamically to keep the actual perplexity close to this target. A tau of 3.0 means low surprise — the model sticks to highly expected tokens. A tau of 7.0 allows much more variability and unexpected word choices.
When To Use
- Low (2.0–4.0): Technical writing, factual content, documentation.
- Medium (4.0–6.0): General purpose, balanced coherence and variety.
- High (6.0–8.0): Creative writing, storytelling, brainstorming.
- Very High (8.0+): Experimental — very high diversity, risk of incoherence.
Examples
Pro Tip
Think of tau as the Mirostat equivalent of temperature. Start at 5.0 (default) and adjust ±1.0 at a time. For most professional use cases, tau between 4.0 and 6.0 works well.
Tail-Free Sampling (TFS)
What It Does
Filters out tokens in the extreme low-probability tail of the distribution based on the curvature (second derivative) of the sorted probability curve. It removes 'tail' tokens that are statistically atypical, providing finer-grained pruning than Top-P or Top-K.
How It Works
TFS examines the sorted probability distribution and calculates the second derivative (rate of change of the rate of change). Tokens in the flat 'tail' of the distribution — where probabilities barely differ from each other — are identified and removed. The tfs_z parameter controls how aggressively this tail is cut. A value of 1.0 disables TFS; lower values cut more aggressively.
When To Use
- 1.0: Disabled (default).
- 0.9–0.95: Light tail removal — subtle quality improvement.
- 0.5–0.9: Moderate — removes clearly atypical tokens.
- 0.1–0.5: Aggressive — only keeps tokens well within the main distribution.
- Best used in combination with Top-P or Top-K for additional quality refinement.
Examples
Pro Tip
TFS is a refinement tool, not a primary sampler. Use it alongside Top-P for additional quality control. It's most useful when you notice the model occasionally producing bizarre or contextually inappropriate words.
Typical P (Locally Typical Sampling)
What It Does
Selects tokens whose information content (surprise) is close to the average expected surprise, filtering out both overly predictable tokens and overly surprising ones. The idea is to choose tokens that are 'typically' informative — not boring, not bizarre.
How It Works
For each candidate token, typical sampling calculates its 'local typicality' — how close its surprise (negative log probability) is to the expected surprise. Tokens that are either much more predictable or much less predictable than the average are filtered out. The parameter controls the cumulative threshold (similar to Top-P but over the typicality-sorted distribution).
When To Use
- 1.0: Disabled (default).
- 0.9–0.95: Light filtering — removes the most atypical tokens.
- 0.5–0.8: Moderate — keeps only reasonably typical tokens.
- 0.2–0.5: Strong — highly constrained to typical tokens only.
- Best for: Natural-sounding prose and dialogue where you want human-like word choices.
Examples
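A simplified sketch of the idea (toy distribution; the published algorithm operates on the full vocabulary, but the ranking-by-typicality step is the same): rank tokens by how far their surprise is from the distribution's entropy, then keep the closest ones until the cumulative mass reaches the threshold.

```python
import math

def typical_filter(probs, typical_p=0.9):
    """Keep tokens whose surprise is closest to the entropy, up to typical_p mass."""
    entropy = -sum(p * math.log(p) for p in probs.values())
    ranked = sorted(probs.items(),
                    key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cum = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cum += p
        if cum >= typical_p:
            break
    return kept

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
kept = typical_filter(probs, typical_p=0.8)   # "d" is the most atypical and is cut
```

Note that the most probable token is not necessarily the most typical — "b" ranks first here because its surprise sits closest to the entropy.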
Pro Tip
Typical sampling was designed based on information theory — the idea that human language tends to be 'typically' informative rather than maximally predictable or maximally surprising. It's worth experimenting with for natural-sounding dialogue.
FORMAT, SYSTEM & API CONTROLS
End-of-Sequence Token (EOS)
What It Does
The EOS token is a special token that signals the model has completed its response. When generated, it normally causes the model to stop producing further tokens. The ignore_eos parameter can override this behavior, forcing the model to continue generating until max_tokens is reached.
How It Works
During training, models learn to produce the EOS token when they believe a complete response has been generated. In normal inference, producing EOS stops generation. Setting ignore_eos=true suppresses this signal, forcing continued generation — useful for benchmarking and specific research scenarios but generally not recommended for production use.
When To Use
- ignore_eos=false (default): Normal operation — let the model decide when it's done.
- ignore_eos=true: Benchmarking (to reach exact output token counts), specific research scenarios, or when the model prematurely stops and you need it to continue.
Examples
Normal: Model generates 'The answer is 42.' + EOS → Generation stops at the EOS token
With ignore_eos=true: the EOS signal is suppressed → The model keeps generating until max_tokens is reached
Pro Tip
In production, never set ignore_eos=true — it will cause the model to ramble past its natural completion point. This is primarily a debugging and benchmarking tool.
Model Selection
What It Does
Specifies which AI model to use for generation. Different models vary dramatically in capability, speed, cost, context window size, and specialization. Model selection is often the most impactful 'parameter' affecting output quality.
How It Works
Each model has been trained differently, with different data, architectures, and fine-tuning. Larger models (e.g., GPT-4, Claude Opus) generally produce higher-quality output but cost more and are slower. Smaller models (e.g., GPT-4o-mini, Claude Haiku) are faster and cheaper but may sacrifice quality on complex tasks.
When To Use
- Complex reasoning / analysis: Use the most capable model (GPT-4, Claude Opus).
- General tasks: Use mid-tier models (GPT-4o, Claude Sonnet) for good balance.
- Simple tasks / high volume: Use small models (GPT-4o-mini, Claude Haiku) for speed and cost.
- Specialized tasks: Consider domain-specific or fine-tuned models.
- Cost optimization: Route simple queries to small models, complex ones to large models.
Examples
Pro Tip
The best practice is to use model routing: classify incoming requests by complexity and route them to the appropriate model tier. This can reduce costs by 50–70% while maintaining quality where it matters.
Response Format / Structured Output
What It Does
Constrains the model's output to conform to a specific format, most commonly valid JSON. Some advanced implementations support JSON Schema enforcement, ensuring output matches an exact structure with required fields and data types.
How It Works
When response_format is set to 'json_object' or a JSON schema, the model's generation is constrained at the token level — at each step, only tokens that would produce valid JSON (or match the schema) are eligible for sampling. This is more reliable than prompting alone because it provides hard format guarantees rather than relying on the model's compliance.
When To Use
- json_object: When you need valid JSON output but flexible structure.
- json_schema: When you need output matching an exact schema (fields, types, nesting).
- text: Default — free-form text output.
- regex: Some frameworks support regex-constrained generation.
- Best for: API integrations, data extraction, classification, structured workflows.
Examples
response_format: {
  type: 'json_schema',
  json_schema: {
    name: 'sentiment',
    schema: {
      type: 'object',
      properties: {
        sentiment: { enum: ['positive', 'negative', 'neutral'] },
        confidence: { type: 'number' }
      },
      required: ['sentiment', 'confidence']
    }
  }
}
Pro Tip
Structured output via API is more reliable than prompt-only approaches ('Please respond in JSON'). Always prefer API-level format enforcement when available. Combine with a clear system prompt describing the expected structure.
Streaming
What It Does
When enabled, the model sends tokens to the client incrementally as they are generated, rather than waiting for the complete response. This provides a real-time 'typing' effect and significantly improves perceived latency for users.
How It Works
With streaming enabled, the API returns a stream of server-sent events (SSE), each containing one or a few tokens. The client can display these tokens immediately as they arrive. The total generation time is the same, but the time-to-first-token (TTFT) is much shorter, making the experience feel faster and more responsive.
When To Use
- Chatbots / UI: Almost always enable streaming for user-facing applications.
- API integrations: Disable if your pipeline processes the complete response at once.
- Batch processing: Disable for batch workloads where streaming adds overhead.
Examples
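The client-side pattern can be sketched with a toy generator standing in for the provider's SSE stream (real streaming arrives over the network as server-sent events, chunked by tokens rather than fixed character counts):

```python
def stream_chunks(text, size=4):
    """Toy generator standing in for an SSE stream of token chunks."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

received = []
for chunk in stream_chunks("Hello, world!"):
    received.append(chunk)   # a real UI would render each chunk as it arrives
full = "".join(received)     # the assembled response is identical either way
```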
Pro Tip
Streaming is almost mandatory for user-facing chat applications. Without it, users stare at a blank screen for several seconds. With it, they see the first word appear in under a second, even if the full response takes 10 seconds.
System Prompt / System Message
What It Does
A special message set at the beginning of the conversation that defines the model's persona, behavior, constraints, and operational rules. System prompts are typically hidden from end users and persist throughout the entire conversation, establishing the foundational 'character' and ground rules for the AI.
How It Works
In the messages API format, the system prompt is sent with role='system' before any user messages. The model treats it as high-priority instruction that frames all subsequent interactions. System prompts can define: persona/role, tone/style, capabilities/limitations, safety guardrails, output format requirements, domain-specific knowledge, and behavioral constraints.
When To Use
- Always — every production application should use a system prompt.
- Define persona: 'You are a helpful financial advisor...'
- Set guardrails: 'Never provide medical diagnoses...'
- Control format: 'Always respond in JSON format...'
- Establish tone: 'Use a professional, concise tone...'
Examples
system: 'You are a senior Python developer. Respond with clean, well-commented code following PEP 8. Always explain your reasoning before writing code. If a question is ambiguous, ask for clarification.'
Pro Tip
System prompts are the most powerful tool for controlling model behavior in production. Invest significant time crafting them. Include: role definition, behavioral rules, output format, safety constraints, and edge case handling.