├── Dockerfile
├── README.MD
├── __pycache__
└── util.cpython-311.pyc
├── imgs
├── code-output.png
└── promptlayer_logging.png
├── litellm_uuid.txt
├── main.py
├── requirements.txt
└── util.py
/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use a slim Python 3.8 base image
2 | FROM python:3.8-slim-buster
3 |
4 | # Set the working directory to /app
5 | WORKDIR /app
6 |
7 | # Copy the requirements.txt file to the image
8 | COPY requirements.txt .
9 |
10 |
11 |
12 | # Install the required Python packages using pip
13 | RUN pip install --no-cache-dir -r requirements.txt
14 |
15 | # Copy the rest of the application files to the image
16 | COPY . .
17 |
18 | # Expose port 4000 for the Flask app to listen on
19 | EXPOSE 4000
20 |
21 | # Run the main.py file when the container is started
22 | CMD ["python", "main.py"]
--------------------------------------------------------------------------------
/README.MD:
--------------------------------------------------------------------------------
1 | # CodeLlama Server: Streaming, Caching, Model Fallbacks (OpenAI + Anthropic), Prompt-tracking
2 |
3 | Works with: CodeLlama, Starcoder, Replit-Code-v1, Phind-CodeLlama, GPT-4, Claude-2, etc.
4 |
5 | [PyPI](https://pypi.org/project/litellm/)
6 | [litellm 0.1.1](https://pypi.org/project/litellm/0.1.1/)
7 |
8 |
9 | [Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
10 |
11 | **LIVE DEMO** - https://litellm.ai/playground
12 |
13 | ## What does CodeLlama Server do
14 | - **Default model**: Uses Together AI's CodeLlama to answer coding questions, with GPT-4 + Claude-2 as backups (you can easily switch this to any model from Huggingface, Replicate, Cohere, AI21, Azure, OpenAI, etc.)
15 | - Together AI model + keys - https://together.ai/
16 | - **Guardrail prompts**: `system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."`
17 |
18 |
19 |
20 |
21 | - **Consistent Input/Output Format** (see the sketch after this list)
22 | - Call all models using the OpenAI format - `completion(model, messages)`
23 | - Text responses will always be available at `['choices'][0]['message']['content']`
24 | - Stream responses will always be available at `['choices'][0]['delta']['content']`
25 | - **Error Handling** - uses model fallbacks (if `CodeLlama` fails, try `GPT-4`) with cooldowns and retries
26 | - **Prompt Tracking** - Integrates with Promptlayer for model + prompt tracking
27 |
28 | **Example: Logs sent to PromptLayer**
29 |
30 | ![Logs in PromptLayer](imgs/promptlayer_logging.png)
31 |
32 |
33 | - **Token Usage & Spend** - Track input + completion tokens used and spend per model - https://docs.litellm.ai/docs/token_usage
34 | - **Caching** - Provides in-memory cache + GPT-Cache integration for more advanced usage - https://docs.litellm.ai/docs/caching/gpt_cache
35 |
36 | - **Streaming & Async Support** - Return generators to stream text responses - TEST IT 👉 https://litellm.ai/
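
A minimal sketch of the points above (not part of the server code), assuming `litellm` is installed and the relevant provider API keys are set; the commented-out toggles mirror the ones in `main.py`:

```python
from litellm import completion
import litellm

# optional toggles, mirroring main.py:
# litellm.caching_with_models = True          # cache responses keyed on messages + model
# litellm.success_callback = ["promptlayer"]  # prompt/model tracking (needs PROMPTLAYER_API_KEY)

system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "write me a function to print hello world"},
]

# same call signature for every supported model (CodeLlama, GPT-4, Claude-2, ...)
response = completion(model="togethercomputer/CodeLlama-34b-Instruct", messages=messages)
print(response['choices'][0]['message']['content'])  # text responses always live here

# streaming returns a generator; chunks expose text at ['choices'][0]['delta']['content']
for chunk in completion(model="gpt-4", messages=messages, stream=True):
    print(chunk['choices'][0]['delta']['content'], end="")
```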
37 |
38 | ## API Endpoints
39 |
40 | ### `/chat/completions` (POST)
41 |
42 | This endpoint generates chat completions using any of the 50+ supported LLM APIs - e.g. Llama2, GPT-4, Claude-2, etc.
43 |
44 | #### Input
45 |
46 | This endpoint accepts a raw JSON body with the following inputs:
47 |
48 | - `prompt` (string, required): The user's coding-related question
49 | - Additional optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
50 |
51 | #### Example JSON body
52 |
53 | ```json
54 | {
55 | "prompt": "write me a function to print hello world"
56 | }
57 | ```
58 |
59 | ### Making an API request to the Code-Gen Server
60 |
61 | ```python
62 | import requests
63 | import json
64 |
65 | url = "http://localhost:4000/chat/completions"
66 |
67 | payload = json.dumps({
68 | "prompt": "write me a function to print hello world"
69 | })
70 | headers = {
71 | 'Content-Type': 'application/json'
72 | }
73 |
74 | response = requests.request("POST", url, headers=headers, data=payload)
75 |
76 | print(response.text)
77 |
78 | ```
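
If you set `stream` to the string `"True"` (which `main.py` converts to a boolean), the server responds with a `text/event-stream`. A rough client sketch, assuming the server is running locally on port 4000:

```python
import requests
import json

url = "http://localhost:4000/chat/completions"

payload = json.dumps({
    "prompt": "write me a function to print hello world",
    "stream": "True"  # main.py converts this string into a boolean
})
headers = {'Content-Type': 'application/json'}

with requests.post(url, headers=headers, data=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode())  # each chunk arrives as an SSE "data: {...}" line
```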
79 |
80 | ### Output [Response Format]
81 |
82 | All responses from the server are returned in the following format, regardless of the underlying LLM.
83 | More info on the output format here: https://litellm.readthedocs.io/en/latest/output/
84 |
85 | ```json
86 | {
87 | "choices": [
88 | {
89 | "finish_reason": "stop",
90 | "index": 0,
91 | "message": {
92 | "content": ".\n\n```\ndef print_hello_world():\n print(\"hello world\")\n",
93 | "role": "assistant"
94 | }
95 | }
96 | ],
97 | "created": 1693279694.6474009,
98 | "model": "togethercomputer/CodeLlama-34b-Instruct",
99 | "usage": {
100 | "completion_tokens": 14,
101 | "prompt_tokens": 28,
102 | "total_tokens": 42
103 | }
104 | }
105 | ```
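
For example, extracting the generated code and token counts from that response (continuing the `requests` example above):

```python
result = response.json()

generated_code = result["choices"][0]["message"]["content"]
total_tokens = result["usage"]["total_tokens"]

print(generated_code)
print(f"model: {result['model']}, total tokens: {total_tokens}")
```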
106 |
107 | ## Installation & Usage
108 |
109 | ### Running Locally
110 |
111 | 1. Clone the liteLLM CodeLlama server repository to your local machine:
112 | ```
113 | git clone https://github.com/BerriAI/litellm-CodeLlama-server
114 | ```
115 | 2. Install the required dependencies using pip
116 | ```
117 | pip install -r requirements.txt
118 | ```
119 | 3. Set your LLM API keys
120 | ```
121 | os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
122 | or
123 | set OPENAI_API_KEY in your .env file
124 | ```
125 | 4. Run the server (a quick smoke test follows below):
126 | ```
127 | python main.py
128 | ```
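
Once the server is up on port 4000, a quick smoke test against the routes defined in `main.py` (`/` and `/get_models`):

```python
import requests

# the index route just confirms the server is up
print(requests.get("http://localhost:4000/").text)  # -> "received!"

# /get_models is a POST route that returns litellm.model_list
print(requests.post("http://localhost:4000/get_models").text)
```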
129 |
130 | ## Deploying
131 |
132 | 1. Quick Start: Deploy on Railway
133 |
134 | [Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
135 |
136 | 2. `GCP`, `AWS`, `Azure`
137 | This project includes a `Dockerfile`, so you can build a Docker image and deploy it to any of these providers
138 |
139 | # Support / Talk with founders
140 |
141 | - [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
142 | - [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
143 | - Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
144 | - Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
145 |
146 | ## Roadmap
147 |
148 | - [ ] Implement user-based rate-limiting
149 | - [ ] Spending controls per project - expose key creation endpoint
150 | - [ ] Need to store a keys db -> mapping created keys to their alias (i.e. project name)
151 | - [ ] Easily add new models as backups / as the entry-point (add this to the available model list)
152 |
--------------------------------------------------------------------------------
/__pycache__/util.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/__pycache__/util.cpython-311.pyc
--------------------------------------------------------------------------------
/imgs/code-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/code-output.png
--------------------------------------------------------------------------------
/imgs/promptlayer_logging.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/promptlayer_logging.png
--------------------------------------------------------------------------------
/litellm_uuid.txt:
--------------------------------------------------------------------------------
1 | 7f3ef576-2106-4051-9c06-3a27f16f1f1a
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import traceback
2 | from flask import Flask, request, jsonify, abort, Response
3 | from flask_cors import CORS
4 | import litellm
5 | from util import handle_error
6 | from litellm import completion
7 | import os, dotenv, time
8 | import json
9 |
10 | dotenv.load_dotenv()
11 |
12 | # TODO: set your keys in .env or here:
13 | # os.environ["OPENAI_API_KEY"] = "" # set your openai key here
14 | # os.environ["ANTHROPIC_API_KEY"] = "" # set your anthropic key here
15 | # os.environ["TOGETHER_AI_API_KEY"] = "" # set your together ai key here
16 | # see supported models / keys here: https://litellm.readthedocs.io/en/latest/supported/
17 | ######### ENVIRONMENT VARIABLES ##########
18 | verbose = True
19 |
20 | # litellm.caching_with_models = True # CACHING: caching_with_models Keys in the cache are messages + model. - to learn more: https://docs.litellm.ai/docs/caching/
21 | ######### PROMPT LOGGING ##########
22 | os.environ["PROMPTLAYER_API_KEY"] = "" # set your promptlayer key here - https://promptlayer.com/
23 |
24 | # set callbacks
25 | litellm.success_callback = ["promptlayer"]
26 | ############ HELPER FUNCTIONS ###################################
27 |
28 | def print_verbose(print_statement):
29 | if verbose:
30 | print(print_statement)
31 |
32 | app = Flask(__name__)
33 | CORS(app)
34 |
35 | @app.route('/')
36 | def index():
37 | return 'received!', 200
38 |
39 | def data_generator(response):
40 | for chunk in response:
41 | yield f"data: {json.dumps(chunk)}\n\n"
42 |
43 | @app.route('/chat/completions', methods=["POST"])
44 | def api_completion():
45 | data = request.json
46 | start_time = time.time()
47 | if data.get('stream') == "True":
48 | data['stream'] = True # convert to boolean
49 | try:
50 | if "prompt" not in data:
51 | raise ValueError("data needs to have prompt")
52 | data["model"] = "togethercomputer/CodeLlama-34b-Instruct" # by default use Together AI's CodeLlama model - https://api.together.xyz/playground/chat?model=togethercomputer%2FCodeLlama-34b-Instruct
53 | # COMPLETION CALL
54 | system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
55 | messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": data.pop("prompt")}]
56 | data["messages"] = messages
57 | print(f"data: {data}")
58 | response = completion(**data)
59 | ## LOG SUCCESS
60 | end_time = time.time()
61 |         if 'stream' in data and data['stream'] == True: # stream the response back via data_generator (SSE)
62 | return Response(data_generator(response), mimetype='text/event-stream')
63 | except Exception as e:
64 | # call handle_error function
65 | print_verbose(f"Got Error api_completion(): {traceback.format_exc()}")
66 | ## LOG FAILURE
67 | end_time = time.time()
68 | traceback_exception = traceback.format_exc()
69 | return handle_error(data=data)
70 | return response
71 |
72 | @app.route('/get_models', methods=["POST"])
73 | def get_models():
74 | try:
75 | return litellm.model_list
76 | except Exception as e:
77 | traceback.print_exc()
78 | response = {"error": str(e)}
79 | return response, 200
80 |
81 | if __name__ == "__main__":
82 | from waitress import serve
83 | serve(app, host="0.0.0.0", port=4000, threads=500)
84 |
85 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | flask_cors
3 | waitress
4 | litellm==0.1.381
--------------------------------------------------------------------------------
/util.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | from litellm import completion
4 |
5 | ################ ERROR HANDLING #####################
6 | # implement model fallbacks, cooldowns, and retries
7 | # if a model fails assume it was rate limited and let it cooldown for 60s
8 | def handle_error(data):
9 |     # retry the completion() request with fallback models
10 |     response = None
11 |     start_time = time.time()
12 |     rate_limited_models = set()
13 |     model_expiration_times = {}
14 |     fallback_strategy = ['gpt-4', 'claude-2']
15 |     # keep retrying for up to 45s, skipping models that are cooling down
16 |     while response is None and time.time() - start_time < 45:
17 | for model in fallback_strategy:
18 | try:
19 | if model in rate_limited_models: # check if model is currently cooling down
20 | if model_expiration_times.get(model) and time.time() >= model_expiration_times[model]:
21 | rate_limited_models.remove(model) # check if it's been 60s of cool down and remove model
22 | else:
23 | continue # skip model
24 |                 print(f"calling model {model}")
25 |                 data["model"] = model # actually switch the request to the fallback model
26 |                 response = completion(**data)
27 |                 if response is not None:
28 |                     return response
29 |             except Exception as e:
30 |                 rate_limited_models.add(model)
31 |                 model_expiration_times[model] = time.time() + 60 # cool this model down for 60s
32 | return response
33 |
--------------------------------------------------------------------------------