├── Dockerfile
├── README.MD
├── __pycache__
└── util.cpython-311.pyc
├── imgs
├── code-output.png
└── promptlayer_logging.png
├── litellm_uuid.txt
├── main.py
├── requirements.txt
└── util.py
/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use a slim Python 3.8 base image
2 | FROM python:3.8-slim-buster
3 |
4 | # Set the working directory to /app
5 | WORKDIR /app
6 |
7 | # Copy the requirements.txt file to the image
8 | COPY requirements.txt .
9 |
10 |
11 |
12 | # Install the required Python packages using pip
13 | RUN pip install --no-cache-dir -r requirements.txt
14 |
15 | # Copy the rest of the application files to the image
16 | COPY . .
17 |
18 | # Expose port 4000 for the Flask app to listen on
19 | EXPOSE 4000
20 |
21 | # Run the main.py file when the container is started
22 | CMD ["python", "main.py"]
--------------------------------------------------------------------------------
/README.MD:
--------------------------------------------------------------------------------
1 | # CodeLlama Server: Streaming, Caching, Model Fallbacks (OpenAI + Anthropic), Prompt-tracking
2 |
3 | Works with: CodeLlama, Starcoder, Replit-Code-v1, Phind-CodeLlama, GPT-4, Claude-2, etc.
4 |
5 | [PyPI](https://pypi.org/project/litellm/)
6 | [litellm 0.1.1](https://pypi.org/project/litellm/0.1.1/)
7 |
8 |
9 | [Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
10 |
11 | **LIVE DEMO** - https://litellm.ai/playground
12 |
13 | ## What does CodeLlama Server do
14 | - **Default model**: Uses Together AI's CodeLlama to answer coding questions, with GPT-4 + Claude-2 as backups (you can easily switch this to any model from Huggingface, Replicate, Cohere, AI21, Azure, OpenAI, etc.)
15 | - Together AI model + keys - https://together.ai/
16 | - **Guardrail prompts**: `system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."`
17 |
18 |
19 |
20 |
21 | - **Consistent Input/Output Format** (see the sketch after this list)
22 | - Call all models using the OpenAI format - `completion(model, messages)`
23 | - Text responses will always be available at `['choices'][0]['message']['content']`
24 | - Stream responses will always be available at `['choices'][0]['delta']['content']`
25 | - **Error Handling** - uses model fallbacks (if `CodeLlama` fails, try `GPT-4`) with cooldowns and retries
26 | - **Prompt Tracking** - Integrates with Promptlayer for model + prompt tracking
27 |
28 | **Example: Logs sent to PromptLayer**
29 |
30 | ![Logs in PromptLayer](imgs/promptlayer_logging.png)
31 |
32 |
33 | - **Token Usage & Spend** - Track input + completion tokens used and spend per model - https://docs.litellm.ai/docs/token_usage
34 | - **Caching** - Provides in-memory cache + GPT-Cache integration for more advanced usage - https://docs.litellm.ai/docs/caching/gpt_cache
35 |
36 | - **Streaming & Async Support** - Return generators to stream text responses - TEST IT 👉 https://litellm.ai/
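
A minimal sketch of the points above (not part of the server code), assuming `litellm` is installed and the relevant provider API keys are set; the commented-out toggles mirror the ones in `main.py`:

```python
from litellm import completion
import litellm

# optional toggles, mirroring main.py:
# litellm.caching_with_models = True          # cache responses keyed on messages + model
# litellm.success_callback = ["promptlayer"]  # prompt/model tracking (needs PROMPTLAYER_API_KEY)

system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "write me a function to print hello world"},
]

# same call signature for every supported model (CodeLlama, GPT-4, Claude-2, ...)
response = completion(model="togethercomputer/CodeLlama-34b-Instruct", messages=messages)
print(response['choices'][0]['message']['content'])  # text responses always live here

# streaming returns a generator; chunks expose text at ['choices'][0]['delta']['content']
for chunk in completion(model="gpt-4", messages=messages, stream=True):
    print(chunk['choices'][0]['delta']['content'], end="")
```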
37 |
38 | ## API Endpoints
39 |
40 | ### `/chat/completions` (POST)
41 |
42 | This endpoint generates chat completions using any of the 50+ supported LLM APIs - e.g. Llama2, GPT-4, Claude-2, etc.
43 |
44 | #### Input
45 |
46 | This endpoint accepts a raw JSON body with the following inputs:
47 |
48 | - `prompt` (string, required): The user's coding-related question
49 | - Additional optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
50 |
51 | #### Example JSON body
52 |
53 | ```json
54 | {
55 | "prompt": "write me a function to print hello world"
56 | }
57 | ```
58 |
59 | ### Making an API request to the Code-Gen Server
60 |
61 | ```python
62 | import requests
63 | import json
64 |
65 | url = "http://localhost:4000/chat/completions"
66 |
67 | payload = json.dumps({
68 | "prompt": "write me a function to print hello world"
69 | })
70 | headers = {
71 | 'Content-Type': 'application/json'
72 | }
73 |
74 | response = requests.request("POST", url, headers=headers, data=payload)
75 |
76 | print(response.text)
77 |
78 | ```
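
If you set `stream` to the string `"True"` (which `main.py` converts to a boolean), the server responds with a `text/event-stream`. A rough client sketch, assuming the server is running locally on port 4000:

```python
import requests
import json

url = "http://localhost:4000/chat/completions"

payload = json.dumps({
    "prompt": "write me a function to print hello world",
    "stream": "True"  # main.py converts this string into a boolean
})
headers = {'Content-Type': 'application/json'}

with requests.post(url, headers=headers, data=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode())  # each chunk arrives as an SSE "data: {...}" line
```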
79 |
80 | ### Output [Response Format]
81 |
82 | All responses from the server are returned in the following format, regardless of the underlying LLM.
83 | More info on the output format here: https://litellm.readthedocs.io/en/latest/output/
84 |
85 | ```json
86 | {
87 | "choices": [
88 | {
89 | "finish_reason": "stop",
90 | "index": 0,
91 | "message": {
92 | "content": ".\n\n```\ndef print_hello_world():\n print(\"hello world\")\n",
93 | "role": "assistant"
94 | }
95 | }
96 | ],
97 | "created": 1693279694.6474009,
98 | "model": "togethercomputer/CodeLlama-34b-Instruct",
99 | "usage": {
100 | "completion_tokens": 14,
101 | "prompt_tokens": 28,
102 | "total_tokens": 42
103 | }
104 | }
105 | ```
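
For example, extracting the generated code and token counts from that response (continuing the `requests` example above):

```python
result = response.json()

generated_code = result["choices"][0]["message"]["content"]
total_tokens = result["usage"]["total_tokens"]

print(generated_code)
print(f"model: {result['model']}, total tokens: {total_tokens}")
```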
106 |
107 | ## Installation & Usage
108 |
109 | ### Running Locally
110 |
111 | 1. Clone the liteLLM CodeLlama server repository to your local machine:
112 | ```
113 | git clone https://github.com/BerriAI/litellm-CodeLlama-server
114 | ```
115 | 2. Install the required dependencies using pip
116 | ```
117 | pip install -r requirements.txt
118 | ```
119 | 3. Set your LLM API keys
120 | ```
121 | os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
122 | or
123 | set OPENAI_API_KEY in your .env file
124 | ```
125 | 4. Run the server (a quick smoke test follows below):
126 | ```
127 | python main.py
128 | ```
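
Once the server is up on port 4000, a quick smoke test against the routes defined in `main.py` (`/` and `/get_models`):

```python
import requests

# the index route just confirms the server is up
print(requests.get("http://localhost:4000/").text)  # -> "received!"

# /get_models is a POST route that returns litellm.model_list
print(requests.post("http://localhost:4000/get_models").text)
```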
129 |
130 | ## Deploying
131 |
132 | 1. Quick Start: Deploy on Railway
133 |
134 | [Deploy on Railway](https://railway.app/template/HuDPw-?referralCode=jch2ME)
135 |
136 | 2. `GCP`, `AWS`, `Azure`
137 | This project includes a `Dockerfile`, so you can build a Docker image and deploy it to any of these providers
138 |
139 | # Support / Talk with founders
140 |
141 | - [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
142 | - [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
143 | - Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
144 | - Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
145 |
146 | ## Roadmap
147 |
148 | - [ ] Implement user-based rate-limiting
149 | - [ ] Spending controls per project - expose key creation endpoint
150 | - [ ] Need to store a keys db -> mapping created keys to their alias (i.e. project name)
151 | - [ ] Easily add new models as backups / as the entry-point (add this to the available model list)
152 |
--------------------------------------------------------------------------------
/__pycache__/util.cpython-311.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/__pycache__/util.cpython-311.pyc
--------------------------------------------------------------------------------
/imgs/code-output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/code-output.png
--------------------------------------------------------------------------------
/imgs/promptlayer_logging.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BerriAI/litellm-CodeLlama-server/f725adb7bc5c9de0b1cb59aaf967fa50d44c81a3/imgs/promptlayer_logging.png
--------------------------------------------------------------------------------
/litellm_uuid.txt:
--------------------------------------------------------------------------------
1 | 7f3ef576-2106-4051-9c06-3a27f16f1f1a
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import traceback
2 | from flask import Flask, request, jsonify, abort, Response
3 | from flask_cors import CORS
4 | import litellm
5 | from util import handle_error
6 | from litellm import completion
7 | import os, dotenv, time
8 | import json
9 |
10 | dotenv.load_dotenv()
11 |
12 | # TODO: set your keys in .env or here:
13 | # os.environ["OPENAI_API_KEY"] = "" # set your openai key here
14 | # os.environ["ANTHROPIC_API_KEY"] = "" # set your anthropic key here
15 | # os.environ["TOGETHER_AI_API_KEY"] = "" # set your together ai key here
16 | # see supported models / keys here: https://litellm.readthedocs.io/en/latest/supported/
17 | ######### ENVIRONMENT VARIABLES ##########
18 | verbose = True
19 |
20 | # litellm.caching_with_models = True # CACHING: caching_with_models Keys in the cache are messages + model. - to learn more: https://docs.litellm.ai/docs/caching/
21 | ######### PROMPT LOGGING ##########
22 | os.environ["PROMPTLAYER_API_KEY"] = "" # set your promptlayer key here - https://promptlayer.com/
23 |
24 | # set callbacks
25 | litellm.success_callback = ["promptlayer"]
26 | ############ HELPER FUNCTIONS ###################################
27 |
28 | def print_verbose(print_statement):
29 | if verbose:
30 | print(print_statement)
31 |
32 | app = Flask(__name__)
33 | CORS(app)
34 |
35 | @app.route('/')
36 | def index():
37 | return 'received!', 200
38 |
39 | def data_generator(response):
40 | for chunk in response:
41 | yield f"data: {json.dumps(chunk)}\n\n"
42 |
43 | @app.route('/chat/completions', methods=["POST"])
44 | def api_completion():
45 | data = request.json
46 | start_time = time.time()
47 | if data.get('stream') == "True":
48 | data['stream'] = True # convert to boolean
49 | try:
50 | if "prompt" not in data:
51 | raise ValueError("data needs to have prompt")
52 | data["model"] = "togethercomputer/CodeLlama-34b-Instruct" # by default use Together AI's CodeLlama model - https://api.together.xyz/playground/chat?model=togethercomputer%2FCodeLlama-34b-Instruct
53 | # COMPLETION CALL
54 | system_prompt = "Only respond to questions about code. Say 'I don't know' to anything outside of that."
55 | messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": data.pop("prompt")}]
56 | data["messages"] = messages
57 | print(f"data: {data}")
58 | response = completion(**data)
59 | ## LOG SUCCESS
60 | end_time = time.time()
61 |         if 'stream' in data and data['stream'] == True: # stream the response back via data_generator (SSE)
62 | return Response(data_generator(response), mimetype='text/event-stream')
63 | except Exception as e:
64 | # call handle_error function
65 | print_verbose(f"Got Error api_completion(): {traceback.format_exc()}")
66 | ## LOG FAILURE
67 | end_time = time.time()
68 | traceback_exception = traceback.format_exc()
69 | return handle_error(data=data)
70 | return response
71 |
72 | @app.route('/get_models', methods=["POST"])
73 | def get_models():
74 | try:
75 | return litellm.model_list
76 | except Exception as e:
77 | traceback.print_exc()
78 | response = {"error": str(e)}
79 | return response, 200
80 |
81 | if __name__ == "__main__":
82 | from waitress import serve
83 | serve(app, host="0.0.0.0", port=4000, threads=500)
84 |
85 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | flask
2 | flask_cors
3 | waitress
4 | litellm==0.1.381
--------------------------------------------------------------------------------
/util.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | from litellm import completion
4 |
5 | ################ ERROR HANDLING #####################
6 | # implement model fallbacks, cooldowns, and retries
7 | # if a model fails assume it was rate limited and let it cooldown for 60s
8 | def handle_error(data):
9 |     # retry the completion() request with fallback models
10 |     response = None
11 |     start_time = time.time()
12 |     rate_limited_models = set()
13 |     model_expiration_times = {}
14 |     fallback_strategy = ['gpt-4', 'claude-2']
15 |     # keep retrying for up to 45s, skipping models that are cooling down
16 |     while response is None and time.time() - start_time < 45:
17 | for model in fallback_strategy:
18 | try:
19 | if model in rate_limited_models: # check if model is currently cooling down
20 | if model_expiration_times.get(model) and time.time() >= model_expiration_times[model]:
21 | rate_limited_models.remove(model) # check if it's been 60s of cool down and remove model
22 | else:
23 | continue # skip model
24 |                 print(f"calling model {model}")
25 |                 data["model"] = model # actually switch the request to the fallback model
26 |                 response = completion(**data)
27 |                 if response is not None:
28 |                     return response
29 |             except Exception as e:
30 |                 rate_limited_models.add(model)
31 |                 model_expiration_times[model] = time.time() + 60 # cool this model down for 60s
32 | return response
33 |
--------------------------------------------------------------------------------