├── Dockerfile
├── README.MD
├── __pycache__
│   └── util.cpython-311.pyc
├── imgs
│   ├── code-output.png
│   └── promptlayer_logging.png
├── litellm_uuid.txt
├── main.py
├── requirements.txt
└── util.py

/Dockerfile:
--------------------------------------------------------------------------------
# Use a slim Python 3.8 image as the base
FROM python:3.8-slim-buster

# Set the working directory to /app
WORKDIR /app

# Copy the requirements.txt file to the image
COPY requirements.txt .

# Install the required Python packages using pip
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application files to the image
COPY . .

# Expose port 4000 for the Flask app to listen on
EXPOSE 4000

# Run the main.py file when the container is started
CMD ["python", "main.py"]
--------------------------------------------------------------------------------

/README.MD:
--------------------------------------------------------------------------------
# CodeLlama Server: Streaming, Caching, Model Fallbacks (OpenAI + Anthropic), Prompt-tracking

Works with: CodeLlama, Starcoder, Replit-Code-v1, Phind-CodeLlama, GPT-4, Claude-2, etc.

[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)

[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/HuDPw-?referralCode=jch2ME)

**LIVE DEMO** - https://litellm.ai/playground

## What does CodeLlama Server do?

- **Default model**: Uses Together AI's CodeLlama to answer coding questions, with GPT-4 + Claude-2 as backups (you can easily switch this to any model from Huggingface, Replicate, Cohere, AI21, Azure, OpenAI, etc.)
  - Together AI model + keys - https://together.ai/
- **Guardrail prompts**: `system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."`

![Code Output](imgs/code-output.png)

- **Consistent Input/Output Format** (see the sketch after this list)
  - Call all models using the OpenAI format - `completion(model, messages)`
  - Text responses will always be available at `['choices'][0]['message']['content']`
  - Stream responses will always be available at `['choices'][0]['delta']['content']`
- **Error Handling** - Uses model fallbacks (if `CodeLlama` fails, try `GPT-4`) with cooldowns and retries
- **Prompt Tracking** - Integrates with PromptLayer for model + prompt tracking

**Example: Logs sent to PromptLayer**

![Prompt Logging](imgs/promptlayer_logging.png)

- **Token Usage & Spend** - Track input + completion tokens used and spend per model - https://docs.litellm.ai/docs/token_usage
- **Caching** - Provides an in-memory cache plus GPT-Cache integration for more advanced usage - https://docs.litellm.ai/docs/caching/gpt_cache
- **Streaming & Async Support** - Return generators to stream text responses - TEST IT 👉 https://litellm.ai/
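As a minimal sketch of the input/output convention above, here is what calling `litellm.completion` directly looks like. The model name is a placeholder, and the snippet assumes the matching provider API key is already set in your environment:

```python
# Sketch of the consistent I/O convention, calling litellm directly.
# Assumes the provider API key for the chosen model is set as an environment variable.
from litellm import completion

messages = [{"role": "user", "content": "write me a function to print hello world"}]

# Non-streaming: the text is always at ['choices'][0]['message']['content']
response = completion(model="gpt-4", messages=messages)
print(response['choices'][0]['message']['content'])

# Streaming: each chunk exposes new text at ['choices'][0]['delta']['content']
for chunk in completion(model="gpt-4", messages=messages, stream=True):
    print(chunk['choices'][0]['delta'].get('content') or "", end="")
```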
## API Endpoints

### `/chat/completions` (POST)

This endpoint generates chat completions from 50+ supported LLM API models. Use Llama 2, GPT-4, Claude 2, etc.

#### Input

This API endpoint accepts all inputs as raw JSON and expects the following:

- `prompt` (string, required): The user's coding-related question
- Additional optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/

#### Example JSON body

```json
{
  "prompt": "write me a function to print hello world"
}
```

### Making an API request to the Code-Gen Server

```python
import requests
import json

url = "http://localhost:4000/chat/completions"

payload = json.dumps({
    "prompt": "write me a function to print hello world"
})
headers = {
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
```
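To stream tokens back instead of waiting for the full completion, set `stream` in the request body. The server then replies with server-sent events, one `data: {...}` line per chunk (see `data_generator` in `main.py`). Below is a rough client sketch; it assumes each chunk follows the `['choices'][0]['delta']['content']` convention described above:

```python
import json
import requests

url = "http://localhost:4000/chat/completions"
payload = {
    "prompt": "write me a function to print hello world",
    "stream": True
}

# stream=True keeps the connection open so chunks can be read as they arrive
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip blank keep-alive lines
        chunk = json.loads(line[len(b"data: "):])
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
```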
### Output [Response Format]

All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": ".\n\n```\ndef print_hello_world():\n print(\"hello world\")\n",
        "role": "assistant"
      }
    }
  ],
  "created": 1693279694.6474009,
  "model": "togethercomputer/CodeLlama-34b-Instruct",
  "usage": {
    "completion_tokens": 14,
    "prompt_tokens": 28,
    "total_tokens": 42
  }
}
```

## Installation & Usage

### Running Locally

1. Clone the liteLLM repository to your local machine:
   ```
   git clone https://github.com/BerriAI/litellm-CodeLlama-server
   ```
2. Install the required dependencies using pip:
   ```
   pip install -r requirements.txt
   ```
3. Set your LLM API keys, either in code:
   ```
   os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
   ```
   or by setting `OPENAI_API_KEY` in your .env file
4. Run the server:
   ```
   python main.py
   ```

## Deploying

1. Quick Start: Deploy on Railway

   [![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/HuDPw-?referralCode=jch2ME)

2. `GCP`, `AWS`, `Azure`

   This project includes a `Dockerfile`, so you can build and deploy a Docker image on the provider of your choice.
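A rough sketch of building and running that image locally (the image name is arbitrary; pass whichever provider keys you actually use):

```
docker build -t codellama-server .
docker run -p 4000:4000 \
  -e OPENAI_API_KEY="YOUR_API_KEY" \
  -e TOGETHER_AI_API_KEY="YOUR_API_KEY" \
  codellama-server
```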
# Support / Talk with founders

- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai

## Roadmap

- [ ] Implement user-based rate-limiting
- [ ] Spending controls per project - expose key creation endpoint
- [ ] Need to store a keys db -> mapping created keys to their alias (i.e. project name)
- [ ] Easily add new models as backups / as the entry-point (add this to the available model list)
--------------------------------------------------------------------------------

/__pycache__/util.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/__pycache__/util.cpython-311.pyc
--------------------------------------------------------------------------------

/imgs/code-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/code-output.png
--------------------------------------------------------------------------------

/imgs/promptlayer_logging.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/promptlayer_logging.png
--------------------------------------------------------------------------------

/litellm_uuid.txt:
--------------------------------------------------------------------------------
7f3ef576-2106-4051-9c06-3a27f16f1f1a
--------------------------------------------------------------------------------

/main.py:
--------------------------------------------------------------------------------
import traceback
from flask import Flask, request, jsonify, abort, Response
from flask_cors import CORS
import litellm
from util import handle_error
from litellm import completion
import os, dotenv, time
import json

dotenv.load_dotenv()

# TODO: set your keys in .env or here:
# os.environ["OPENAI_API_KEY"] = "" # set your openai key here
# os.environ["ANTHROPIC_API_KEY"] = "" # set your anthropic key here
# os.environ["TOGETHER_AI_API_KEY"] = "" # set your together ai key here
# see supported models / keys here: https://litellm.readthedocs.io/en/latest/supported/
######### ENVIRONMENT VARIABLES ##########
verbose = True

# litellm.caching_with_models = True # CACHING: cache keys are messages + model - to learn more: https://docs.litellm.ai/docs/caching/

######### PROMPT LOGGING ##########
os.environ["PROMPTLAYER_API_KEY"] = "" # set your promptlayer key here - https://promptlayer.com/

# set callbacks
litellm.success_callback = ["promptlayer"]

############ HELPER FUNCTIONS ###################################

def print_verbose(print_statement):
    if verbose:
        print(print_statement)

app = Flask(__name__)
CORS(app)

@app.route('/')
def index():
    return 'received!', 200

def data_generator(response):
    # format each streamed chunk as a server-sent event
    for chunk in response:
        yield f"data: {json.dumps(chunk)}\n\n"

@app.route('/chat/completions', methods=["POST"])
def api_completion():
    data = request.json
    start_time = time.time()
    if data.get('stream') == "True":
        data['stream'] = True # convert the string "True" to a boolean
    try:
        if "prompt" not in data:
            raise ValueError("data needs to have prompt")
        data["model"] = "togethercomputer/CodeLlama-34b-Instruct" # by default use Together AI's CodeLlama model - https://api.together.xyz/playground/chat?model=togethercomputer%2FCodeLlama-34b-Instruct
        # COMPLETION CALL
        system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
        messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": data.pop("prompt")}]
        data["messages"] = messages
        print(f"data: {data}")
        response = completion(**data)
        ## LOG SUCCESS
        end_time = time.time()
        if 'stream' in data and data['stream'] == True: # use data_generator to stream responses
            return Response(data_generator(response), mimetype='text/event-stream')
    except Exception as e:
        # fall back to other models via handle_error
        print_verbose(f"Got Error api_completion(): {traceback.format_exc()}")
        ## LOG FAILURE
        end_time = time.time()
        traceback_exception = traceback.format_exc()
        return handle_error(data=data)
    return response

@app.route('/get_models', methods=["POST"])
def get_models():
    try:
        return jsonify(litellm.model_list)
    except Exception as e:
        traceback.print_exc()
        response = {"error": str(e)}
        return response, 200

if __name__ == "__main__":
    from waitress import serve
    serve(app, host="0.0.0.0", port=4000, threads=500)
--------------------------------------------------------------------------------
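The commented-out `litellm.caching_with_models` flag in `main.py` above turns on litellm's in-memory cache, keyed on messages + model. A minimal sketch of what enabling it looks like, assuming the flag behaves as described in the linked caching docs:

```python
# Sketch: enable litellm's in-memory cache keyed on (messages, model).
# Assumes the provider API key for the chosen model is set in the environment.
import litellm
from litellm import completion

litellm.caching_with_models = True  # cache keys are messages + model

messages = [{"role": "user", "content": "write me a function to print hello world"}]
first = completion(model="gpt-4", messages=messages)   # hits the API
second = completion(model="gpt-4", messages=messages)  # identical call, served from the cache
```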
/requirements.txt:
--------------------------------------------------------------------------------
flask
flask_cors
waitress
litellm==0.1.381
--------------------------------------------------------------------------------

/util.py:
--------------------------------------------------------------------------------
import time
from litellm import completion

################ ERROR HANDLING #####################
# implement model fallbacks, cooldowns, and retries
# if a model fails, assume it was rate limited and let it cool down for 60s
def handle_error(data):
    # retry the completion() request with fallback models
    response = None
    start_time = time.time()
    rate_limited_models = set()
    model_expiration_times = {}
    fallback_strategy = ['gpt-4', 'claude-2']
    while response is None and time.time() - start_time < 45: # retry for up to 45s
        for model in fallback_strategy:
            try:
                if model in rate_limited_models: # check if model is currently cooling down
                    if model_expiration_times.get(model) and time.time() >= model_expiration_times[model]:
                        rate_limited_models.remove(model) # 60s cooldown has elapsed, put the model back in rotation
                    else:
                        continue # skip model
                print(f"calling model {model}")
                data["model"] = model # swap in the fallback model before retrying
                response = completion(**data)
                if response is not None:
                    return response
            except Exception:
                rate_limited_models.add(model)
                model_expiration_times[model] = time.time() + 60 # cool down this model for 60s
    return response
--------------------------------------------------------------------------------
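For reference, here is a hypothetical standalone use of `handle_error`: it takes the same keyword dict that `completion()` receives and walks the fallback list until one model answers or the 45-second retry budget runs out.

```python
# Hypothetical standalone call to the fallback handler in util.py.
# Assumes OPENAI_API_KEY / ANTHROPIC_API_KEY are set for the fallback models.
from util import handle_error

data = {
    "messages": [{"role": "user", "content": "write me a function to print hello world"}],
}
response = handle_error(data)  # tries gpt-4, then claude-2, honoring 60s cooldowns
if response is not None:
    print(response['choices'][0]['message']['content'])
```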