├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 Taylor Wilsdon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# OpenAI, Anthropic, Qwen, Mistral, DeepSeek, Llama, Phi, Gemini & More - API Max Context, Output Token Limits & Feature Compatibility

## The missing context limit & parameter support guide for local and hosted LLMs

Since OpenAI won't just be cool and expose max context and max output parameters in the OpenAI API-compatible models endpoint spec, I put together a quick reference for my own use that perhaps others can benefit from. The tables below list the current max context window, max input token, and max output token limits for models accessed via the API; they do not apply to ChatGPT through the UI. If anything looks wrong, please flag it or cut a PR to update, and I'll happily merge once confirmed accurate.

> [!TIP]
> Are you using [open-webui](https://github.com/open-webui/open-webui)? You can configure the max context window persistently under **Settings -> Models**, in the advanced parameters.

> [!WARNING]
> Editor's note - if you don't utilize a [k/v cache](https://www.microsoft.com/en-us/research/blog/llm-profiling-guides-kv-cache-optimization/), setting the max context (even if you're not filling it up) will use a **ton** of VRAM and can degrade performance. I strongly encourage running Ollama with Flash Attention enabled via `OLLAMA_FLASH_ATTENTION=1` and setting `OLLAMA_KV_CACHE_TYPE=q8_0` (a q4_0 quant also works, but quality degrades more).

This table provides a quick reference to the key parameters of OpenAI's API-accessible models. These values apply to OpenAI's officially hosted API and may not match 3rd-party providers.

### OpenAI API Model Reference

| Model | Context Window | Max Output Tokens | Supports Temperature? | Supports Streaming? |
|--------------|---------------|-------------------|----------------------|---------------------|
| **GPT-4.1** | 1048k tokens | 32k tokens | ✅ Yes | ✅ Yes |
| **GPT-4.1-mini** | 1048k tokens | 32k tokens | ✅ Yes | ✅ Yes |
| **GPT-4.1-nano** | 1048k tokens | 32k tokens | ✅ Yes | ✅ Yes |
| **GPT-4o** | 128k tokens | 16k tokens | ✅ Yes | ✅ Yes |
| **GPT-4o-mini** | 128k tokens | 16k tokens | ✅ Yes | ✅ Yes |
| **GPT-4** | 128k tokens | 16k tokens | ✅ Yes | ✅ Yes |
| **GPT-3.5-turbo** | 16k tokens | 4k tokens | ✅ Yes | ✅ Yes |
| **o3-mini** | 200k tokens | 100k tokens | ❌ No | ✅ Yes |
| **o4-mini** | 128k tokens | 100k tokens | ❌ No | ✅ Yes |
| **o1** | 200k tokens | 100k tokens | ✅ Yes | ✅ Yes |
| **o1-mini** | 128k tokens | 65,536 tokens | ✅ Yes | ✅ Yes |
| **o1-preview** | 128k tokens | 32k tokens | ❌ No | ✅ Yes |
| **o1-pro** | 200k tokens | 100k tokens | ❌ No | ✅ Yes |
| **o3** | 128k tokens | 100k tokens | ❌ No | ✅ Yes |
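As a minimal sketch of how these limits surface in practice with the official `openai` Python SDK: GPT-4o-class models accept `temperature` and cap output via `max_tokens`, while the o-series reasoning models reject non-default `temperature` and take `max_completion_tokens` instead. Model names and token values below are illustrative picks from the table above, not an exhaustive demo.

```python
# Minimal sketch using the official `openai` Python SDK (pip install openai).
# Values are illustrative; see the table above for each model's actual limits.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# GPT-4o-class models: temperature is supported and the output cap is
# requested via `max_tokens` (16k max for gpt-4o per the table above).
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the MIT license in one line."}],
    max_tokens=1024,
    temperature=0.7,
)
print(chat.choices[0].message.content)

# o-series reasoning models: temperature is not supported, and the output cap
# (which includes reasoning tokens) is requested via `max_completion_tokens`.
reasoned = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    max_completion_tokens=4096,
)
print(reasoned.choices[0].message.content)
```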
---

### Anthropic API Model Reference

| Model | Context Window | Max Output Tokens | Supports Temperature? | Supports Streaming? | Vision Support? |
|------------------------|---------------|-------------------|----------------------|---------------------|-----------------|
| **Claude 3.7 Sonnet** | 200k tokens | 8k tokens (128k extended w/ `output-128k-2025-02-19` header) | ✅ Yes | ✅ Yes | ✅ Yes |
| **Claude 3.5 Sonnet** | 200k tokens | 8k tokens | ✅ Yes | ✅ Yes | ✅ Yes |
| **Claude 3.5 Haiku** | 200k tokens | 8k tokens | ✅ Yes | ✅ Yes | ❌ No |
| **Claude 3 Opus** | 200k tokens | 4k tokens | ✅ Yes | ✅ Yes | ✅ Yes |
| **Claude 3 Sonnet** | 200k tokens | 4k tokens | ✅ Yes | ✅ Yes | ✅ Yes |
| **Claude 3 Haiku** | 200k tokens | 4k tokens | ✅ Yes | ✅ Yes | ✅ Yes |

#### Training Data Cut-off
- Claude 3.7 Sonnet: October 2024
- Claude 3.5 Sonnet: April 2024
- Claude 3.5 Haiku: July 2024
- Claude 3 Opus: August 2023
- Claude 3 Sonnet: August 2023
- Claude 3 Haiku: August 2023

---

### DeepSeek API Model Reference

These limits apply to the official DeepSeek API; self-hosted deployments support 128k context.

| Model | Context Window | Max CoT Tokens | Max Output Tokens | Supports Streaming? | Vision Support? |
|------------------------------|---------------|---------------|-------------------|---------------------|-----------------|
| **deepseek-chat (DeepSeek V3)** | 64k tokens | - | 8k tokens | ✅ Yes | ❌ No |
| **deepseek-reasoner (DeepSeek R1)** | 64k tokens | 32k tokens | 8k tokens | ✅ Yes | ❌ No |

---

### Qwen API Model Reference

Self-hosted maximums. Please note that you must configure your inference engine to these maximums yourself - the default (e.g., Ollama @ 2048 tokens) is generally much lower than the model maximum (see the sketch after the table below).

| Model | Context Window | Max Output Tokens | Supports Streaming? | Vision Support? |
|---------------------------|---------------|-------------------|---------------------|-----------------|
| **qwen2.5-coder-32b** | 131,072 tokens | 8k tokens | ✅ Yes | ❌ No |
| **qwen2.5-72b-instruct** | 131,072 tokens | 8k tokens | ✅ Yes | ❌ No |
| **qwen2.5-3b** | 32k tokens (default, 128k possible) | 8k tokens | ✅ Yes | ❌ No |
| **qwq** | 32k tokens | 8k tokens | ✅ Yes | ❌ No |
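Here's a minimal sketch of raising the context window per-request through Ollama's native `/api/chat` endpoint. It assumes a local Ollama server on the default port 11434 and that a `qwen2.5-coder:32b` tag is already pulled - swap in whatever model and limits you actually run.

```python
# Minimal sketch: overriding Ollama's default 2048-token context per request.
# Assumes a local Ollama server on the default port and that the
# qwen2.5-coder:32b tag is pulled; adjust both to your setup.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "Write a haiku about VRAM."}],
        "stream": False,
        "options": {
            "num_ctx": 32768,     # context window - Ollama's default is only 2048
            "num_predict": 8192,  # max output tokens
        },
    },
    timeout=600,
)
print(response.json()["message"]["content"])
```

For a persistent change you can instead bake `PARAMETER num_ctx 32768` into a Modelfile, and if you push the context this high, remember the `OLLAMA_FLASH_ATTENTION=1` / `OLLAMA_KV_CACHE_TYPE=q8_0` advice from the warning above.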
---

### Mistral API Model Reference

Self-hosted maximums.

| Model | Context Window | Max Output Tokens | Supports Streaming? | Vision Support? |
|------------------------------|---------------|-------------------|---------------------|-----------------|
| **Mistral-7B-Instruct-v0** | 32k tokens | 4k tokens | ✅ Yes | ❌ No |
| **Mistral Medium** | 32k tokens | 4k tokens | ✅ Yes | ❌ No |
| **Mistral Small** | 32k tokens | 4k tokens | ✅ Yes | ❌ No |
| **Mistral Large** | 32k tokens | 4k tokens | ✅ Yes | ❌ No |
| **Mistral Nemo** | 128k tokens | 4k tokens | ✅ Yes | ❌ No |

---

### Gemini API Model Reference

Includes Gemini (hosted) and Gemma (self-hosted).

| Model | Context Window | Max Output Tokens | Supports Streaming? | Vision Support? |
|------------------------------|---------------|-------------------|---------------------|-----------------|
| **gemini-2.0-flash** | 1,048k tokens | 8k tokens | ✅ Yes | ❌ No |
| **gemini-2.5-pro** | 1,048k tokens | 64k tokens | ✅ Yes | ❌ No |
| **gemma-3** | 128k tokens | Unclear | ✅ Yes | ❌ No |

---

### Other Model Reference

| Model | Context Window | Max Output Tokens | Supports Streaming? | Vision Support? |
|------------------------------|---------------|-------------------|---------------------|-----------------|
| **Llama3.3:70b** | 131,072 tokens | 2k tokens | ✅ Yes | ❌ No |
| **Phi4** | 16k tokens | 16k tokens (combined window - 16k total split between input & output) | ✅ Yes | ❌ No |

### OpenAI API Model Endpoint Compatibility

This table provides a reference for which models are compatible with the various OpenAI API endpoints.

| Endpoint | Compatible Models |
|------------------------------|------------------|
| **`/v1/assistants`** | All o-series, all GPT-4o (except `chatgpt-4o-latest`), GPT-4o-mini, GPT-4, and GPT-3.5 Turbo models. The `retrieval` tool requires `gpt-4-turbo-preview` (and subsequent dated model releases) or `gpt-3.5-turbo-1106` (and subsequent versions). |
| **`/v1/audio/transcriptions`** | `whisper-1` |
| **`/v1/audio/translations`** | `whisper-1` |
| **`/v1/audio/speech`** | `tts-1`, `tts-1-hd` |
| **`/v1/chat/completions`** | All o-series, GPT-4o (except the Realtime previews), GPT-4o-mini, GPT-4, and GPT-3.5 Turbo models and their dated releases, plus the `chatgpt-4o-latest` dynamic model and fine-tuned versions of `gpt-4o`, `gpt-4o-mini`, `gpt-4`, and `gpt-3.5-turbo`. |
| **`/v1/completions`** (Legacy) | `gpt-3.5-turbo-instruct`, `babbage-002`, `davinci-002` |
| **`/v1/embeddings`** | `text-embedding-3-small`, `text-embedding-3-large`, `text-embedding-ada-002` |
| **`/v1/fine_tuning/jobs`** | `gpt-4o`, `gpt-4o-mini`, `gpt-4`, `gpt-3.5-turbo` |
| **`/v1/moderations`** | `text-moderation-stable`, `text-moderation-latest` |
| **`/v1/images/generations`** | `dall-e-2`, `dall-e-3` |
| **`/v1/realtime`** (beta) | `gpt-4o-realtime-preview`, `gpt-4o-realtime-preview-2024-10-01` |
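Finally, a quick sketch of the gap that motivated this whole table: the `/v1/models` endpoint lists which models you can hit, but its response objects carry no context or output limit fields, so the numbers above have to be maintained by hand. Assumes `OPENAI_API_KEY` is set in the environment.

```python
# Minimal sketch: /v1/models does not expose context or output limits,
# which is why the reference tables above exist.
from openai import OpenAI

client = OpenAI()

for model in client.models.list():
    # Each entry only carries id, object, created, and owned_by -
    # no max context window or max output token fields.
    print(model.id, model.owned_by)
```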