11 |
12 |
13 |
14 | ## 1. Understand Marimo Notebook
15 | > This is a simple demo of the Marimo Reactive Notebook
16 | - Install hyper modern [UV Python Package and Project](https://docs.astral.sh/uv/getting-started/installation/)
17 | - Install dependencies `uv sync`
18 | - Install marimo `uv pip install marimo`
19 | - To Edit, Run `uv run marimo edit marimo_is_awesome_demo.py`
20 | - To View, Run `uv run marimo run marimo_is_awesome_demo.py`
21 | - Then use your favorite IDE & AI Coding Assistant to edit `marimo_is_awesome_demo.py` directly or via the UI (a minimal notebook sketch follows below).
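
For orientation, here is a minimal sketch of what a marimo notebook file looks like (a hypothetical `minimal_example.py`, patterned after `marimo_is_awesome_demo.py` in this repo; cells are plain functions that re-run automatically when their inputs change):

```python
import marimo

app = marimo.App()


@app.cell
def __():
    import marimo as mo
    return (mo,)


@app.cell
def __(mo):
    slider = mo.ui.slider(1, 10, value=5, label="Stars")
    slider  # the last expression in a cell is rendered as its output
    return (slider,)


@app.cell
def __(mo, slider):
    # Re-runs automatically whenever the slider value changes
    mo.md(f"{slider.value * '⭐️'}")
    return


if __name__ == "__main__":
    app.run()
```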
22 |
23 | ## 2. Ad-hoc Prompt Notebook
24 | > Quickly run and test prompts across models
25 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`; see the example `.env` below)
26 | - Add other keys and update the notebook to add support for additional SOTA LLMs
27 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
28 | - Update the notebook to use Ollama models you have installed
29 | - To Edit, Run `uv run marimo edit adhoc_prompting.py`
30 | - To View, Run `uv run marimo run adhoc_prompting.py`
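
A minimal `.env` might look like the following; the key names match what `src/marimo_notebook/modules/llm_module.py` reads, and the values are placeholders:

```
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GEMINI_API_KEY=your-gemini-key
```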
31 |
32 | ## 3. ⭐️ Prompt Library Notebook
33 | > Build, Manage, Reuse, Version, and Iterate on your Prompt Library
34 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`)
35 | - Add other keys and update the notebook to add support for additional SOTA LLMs
36 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use (example pulls below)
37 | - Update the notebook to use Ollama models you have installed
38 | - To Edit, Run `uv run marimo edit prompt_library.py`
39 | - To View, Run `uv run marimo run prompt_library.py`
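
If you want to try the local models referenced in `src/marimo_notebook/modules/llm_module.py`, the pulls look roughly like this (adjust to whichever models you actually plan to run):

```sh
ollama pull llama3.2
ollama pull llama3.2:1b
ollama pull phi3.5
ollama pull qwen2.5
```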
40 |
41 | ## 4. Multi-LLM Prompt
42 | > Quickly test a single prompt across multiple language models
43 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to use
44 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
45 | - Update the notebook to use Ollama models you have installed
46 | - To Edit, Run `uv run marimo edit multi_llm_prompting.py`
47 | - To View, Run `uv run marimo run multi_llm_prompting.py`
48 |
49 | ## 5. Multi Language Model Ranker
50 | > Compare and rank multiple language models across various prompts
51 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to compare
52 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
53 | - Update the notebook to use Ollama models you have installed
54 | - To Edit, Run `uv run marimo edit multi_language_model_ranker.py`
55 | - To View, Run `uv run marimo run multi_language_model_ranker.py`
56 |
57 | ## General Usage
58 | > See the [Marimo Docs](https://docs.marimo.io/index.html) for general usage details
59 |
60 | ## Personal Prompt Library Use-Cases
61 | - Ad-hoc prompting
62 | - Prompt reuse
63 | - Prompt versioning
64 | - Interactive prompts
65 | - Prompt testing & benchmarking
66 | - LLM comparison
67 | - Prompt templating
68 | - Run a single prompt against multiple LLMs & SLMs
69 | - Compare multiple prompts across multiple LLMs & SLMs
70 | - Anything you can imagine!
71 |
72 | ## Advantages of Marimo
73 |
74 | ### Key Advantages
75 | > Rapid Prototyping: Seamlessly toggle between user (consumer, UI) and builder (producer, code) mode with `cmd+.`.
76 |
77 | > Interactivity: Built-in reactive UI elements enable intuitive data exploration and visualization.
78 |
79 | > Reactivity: Cells automatically update when dependencies change, ensuring a smooth and efficient workflow.
80 |
81 | > Out of the box: Use sliders, textareas, buttons, images, dataframe GUIs, plotting, and other interactive elements to quickly iterate on ideas.
82 |
83 | > It's 'just' Python: Pure Python scripts for easy version control and AI coding.
84 |
85 |
86 | - **Reactive Execution**: Run one cell, and marimo automatically updates all affected cells. This eliminates the need to manually manage notebook state.
87 | - **Interactive Elements**: Provides reactive UI elements like dataframe GUIs and plots, making data exploration fast and intuitive.
88 | - **Python-First Design**: Notebooks are pure Python scripts stored as `.py` files. They can be versioned with git, run as scripts, and imported into other Python code.
89 | - **Reproducible by Default**: Deterministic execution order with no hidden state ensures consistent and reproducible results.
90 | - **Built for Collaboration**: Git-friendly notebooks where small changes yield small diffs, facilitating collaboration.
91 | - **Developer-Friendly Features**: Includes GitHub Copilot, autocomplete, hover tooltips, vim keybindings, code formatting, debugging panels, and extensive hotkeys.
92 | - **Seamless Transition to Production**: Notebooks can be run as scripts or deployed as read-only web apps.
93 | - **Versatile Use Cases**: Ideal for experimenting with data and models, building internal tools, communicating research, education, and creating interactive dashboards.
94 |
95 | ### Advantages Over Jupyter Notebooks
96 |
97 | - **Reactive Notebook**: Automatically updates dependent cells when code or values change, unlike Jupyter where cells must be manually re-executed.
98 | - **Pure Python Notebooks**: Stored as `.py` files instead of JSON, making them easier to version control, lint, and integrate with Python tooling.
99 | - **No Hidden State**: Deleting a cell removes its variables and updates affected cells, reducing errors from stale variables.
100 | - **Better Git Integration**: Plain Python scripts result in smaller diffs and more manageable version control compared to Jupyter's JSON format.
101 | - **Import Symbols**: Allows importing symbols from notebooks into other notebooks or Python files.
102 | - **Enhanced Interactivity**: Built-in reactive UI elements provide a more interactive experience than standard Jupyter widgets.
103 | - **App Deployment**: Notebooks can be served as web apps or exported to static HTML for easier sharing and deployment.
104 | - **Advanced Developer Tools**: Features like code formatting, GitHub Copilot integration, and debugging panels enhance the development experience.
105 | - **Script Execution**: Can be executed as standard Python scripts, making them easy to integrate into pipelines without additional tools (see the commands below).
106 |
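Because each notebook is a plain Python module with an `app.run()` entry point, moving to production is just the standard CLI. A sketch using the demo notebook in this repo (the export step assumes a marimo version that ships `marimo export html`):

```sh
# Execute the notebook top to bottom as a plain script
uv run python marimo_is_awesome_demo.py

# Serve it as a read-only web app
uv run marimo run marimo_is_awesome_demo.py

# Export a static HTML snapshot
uv run marimo export html marimo_is_awesome_demo.py -o marimo_is_awesome_demo.html
```
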
107 | ## Resources
108 | - https://docs.astral.sh/uv/
109 | - https://docs.marimo.io/index.html
110 | - https://youtu.be/PcLkBkQujMI
111 | - https://github.com/BuilderIO/gpt-crawler
112 | - https://github.com/simonw/llm
113 | - https://ollama.com/
114 | - https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
115 | - https://qwenlm.github.io/
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/llm_module.py:
--------------------------------------------------------------------------------
1 | import llm
2 | from dotenv import load_dotenv
3 | import os
4 | from mako.template import Template
5 |
6 | # Load environment variables from .env file
7 | load_dotenv()
8 |
9 |
10 | def conditional_render(prompt, context, start_delim="% if", end_delim="% endif"):
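    """Render `prompt` as a Mako template against `context`.

    Note: `start_delim` and `end_delim` are currently unused; conditionals are
    handled by Mako's own `% if` / `% endif` control lines.
    """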
11 | template = Template(prompt)
12 | return template.render(**context)
13 |
14 |
15 | def parse_markdown_backticks(text: str) -> str:
16 |     if "```" not in text:
17 |         return text.strip()
18 |     # Remove opening backticks and language identifier
19 |     text = text.split("```", 1)[-1].split("\n", 1)[-1]
20 |     # Remove closing backticks
21 |     text = text.rsplit("```", 1)[0]
22 |     # Remove any leading or trailing whitespace
23 |     return text.strip()
24 |
25 |
26 | def prompt(model: llm.Model, prompt: str):
27 | res = model.prompt(prompt, stream=False)
28 | return res.text()
29 |
30 |
31 | def prompt_with_temp(model: llm.Model, prompt: str, temperature: float = 0.7):
32 | """
33 | Send a prompt to the model with a specified temperature.
34 |
35 | Args:
36 | model (llm.Model): The LLM model to use.
37 | prompt (str): The prompt to send to the model.
38 | temperature (float): The temperature setting for the model's response. Default is 0.7.
39 |
40 | Returns:
41 | str: The model's response text.
42 | """
43 |
44 | model_id = model.model_id
45 | if "o1" in model_id or "gemini" in model_id:
46 |         # o1 and gemini models are prompted without an explicit temperature (the provider default is used)
47 | res = model.prompt(prompt, stream=False)
48 | return res.text()
49 |
50 | res = model.prompt(prompt, stream=False, temperature=temperature)
51 | return res.text()
52 |
53 |
54 | def get_model_name(model: llm.Model):
55 | return model.model_id
56 |
57 |
58 | def build_sonnet_3_5():
59 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
60 |
61 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
62 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
63 |
64 | return sonnet_3_5_model
65 |
66 |
67 | def build_mini_model():
68 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
69 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
70 | gpt4_o_mini_model.key = OPENAI_API_KEY
71 | return gpt4_o_mini_model
72 |
73 |
74 | def build_big_3_models():
75 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
76 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
77 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
78 |
79 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
80 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
81 |
82 | gpt4_o_model: llm.Model = llm.get_model("4o")
83 | gpt4_o_model.key = OPENAI_API_KEY
84 |
85 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest")
86 | gemini_1_5_pro_model.key = GEMINI_API_KEY
87 |
88 | return sonnet_3_5_model, gpt4_o_model, gemini_1_5_pro_model
89 |
90 |
91 | def build_latest_openai():
92 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
93 |
94 | # chatgpt_4o_latest_model: llm.Model = llm.get_model("chatgpt-4o-latest") - experimental
95 | chatgpt_4o_latest_model: llm.Model = llm.get_model("gpt-4o")
96 | chatgpt_4o_latest_model.key = OPENAI_API_KEY
97 | return chatgpt_4o_latest_model
98 |
99 |
100 | def build_big_3_plus_mini_models():
101 |
102 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
103 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
104 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
105 |
106 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
107 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
108 |
109 | gpt4_o_model: llm.Model = llm.get_model("4o")
110 | gpt4_o_model.key = OPENAI_API_KEY
111 |
112 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest")
113 | gemini_1_5_pro_model.key = GEMINI_API_KEY
114 |
115 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
116 | gpt4_o_mini_model.key = OPENAI_API_KEY
117 |
120 | return (
121 | sonnet_3_5_model,
122 | gpt4_o_model,
123 | gemini_1_5_pro_model,
124 | gpt4_o_mini_model,
125 | )
126 |
127 |
128 | def build_gemini_duo():
129 | gemini_1_5_pro: llm.Model = llm.get_model("gemini-1.5-pro-latest")
130 | gemini_1_5_flash: llm.Model = llm.get_model("gemini-1.5-flash-latest")
131 |
132 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
133 |
134 | gemini_1_5_pro.key = GEMINI_API_KEY
135 | gemini_1_5_flash.key = GEMINI_API_KEY
136 |
137 | return gemini_1_5_pro, gemini_1_5_flash
138 |
139 |
140 | def build_ollama_models():
141 |
142 | llama3_2_model: llm.Model = llm.get_model("llama3.2")
143 | llama_3_2_1b_model: llm.Model = llm.get_model("llama3.2:1b")
144 |
145 | return llama3_2_model, llama_3_2_1b_model
146 |
147 |
148 | def build_ollama_slm_models():
149 |
150 | llama3_2_model: llm.Model = llm.get_model("llama3.2")
151 | phi3_5_model: llm.Model = llm.get_model("phi3.5:latest")
152 | qwen2_5_model: llm.Model = llm.get_model("qwen2.5:latest")
153 |
154 | return llama3_2_model, phi3_5_model, qwen2_5_model
155 |
156 |
157 | def build_openai_model_stack():
158 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
159 |
160 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
161 | gpt4_o_2024_08_06_model: llm.Model = llm.get_model("gpt-4o")
162 | o1_preview_model: llm.Model = llm.get_model("o1-preview")
163 | o1_mini_model: llm.Model = llm.get_model("o1-mini")
164 |
165 | models = [
166 | gpt4_o_mini_model,
167 | gpt4_o_2024_08_06_model,
168 | o1_preview_model,
169 | o1_mini_model,
170 | ]
171 |
172 | for model in models:
173 | model.key = OPENAI_API_KEY
174 |
175 | return models
176 |
177 |
178 | def build_openai_latest_and_fastest():
179 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
180 |
181 | gpt_4o_latest: llm.Model = llm.get_model("gpt-4o")
182 | gpt_4o_latest.key = OPENAI_API_KEY
183 |
184 | gpt_4o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
185 | gpt_4o_mini_model.key = OPENAI_API_KEY
186 |
187 | return gpt_4o_latest, gpt_4o_mini_model
188 |
189 |
190 | def build_o1_series():
191 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
192 |
193 | o1_mini_model: llm.Model = llm.get_model("o1-mini")
194 | o1_mini_model.key = OPENAI_API_KEY
195 |
196 | o1_preview_model: llm.Model = llm.get_model("o1-preview")
197 | o1_preview_model.key = OPENAI_API_KEY
198 |
199 | return o1_mini_model, o1_preview_model
200 |
201 |
202 | def build_small_cheap_and_fast():
203 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
204 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
205 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
206 | gpt4_o_mini_model.key = OPENAI_API_KEY
207 |
208 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002")
209 | gemini_1_5_flash_002.key = GEMINI_API_KEY
210 |
211 | return gpt4_o_mini_model, gemini_1_5_flash_002
212 |
213 |
226 | def build_gemini_1_2_002():
227 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
228 |
229 | gemini_1_5_pro_002: llm.Model = llm.get_model("gemini-1.5-pro-002")
230 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002")
231 |
232 | gemini_1_5_pro_002.key = GEMINI_API_KEY
233 | gemini_1_5_flash_002.key = GEMINI_API_KEY
234 |
235 | return gemini_1_5_pro_002, gemini_1_5_flash_002
236 |
--------------------------------------------------------------------------------
/marimo_is_awesome_demo.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="full")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import random
10 | import marimo as mo
11 | import pandas as pd
12 | import matplotlib.pyplot as plt
13 | from vega_datasets import data
14 | import io
15 | import altair as alt
16 | return alt, data, io, mo, pd, plt, random
17 |
18 |
19 | @app.cell
20 | def __(mo):
21 | mo.md(
22 | """
23 | # Marimo Awesome Examples
24 |
25 | This notebook demonstrates various features and capabilities of Marimo. Explore the different sections to see how Marimo can be used for interactive data analysis, visualization, and more!
26 |
27 | ---
28 | """
29 | )
30 | return
31 |
32 |
33 | @app.cell
34 | def __(mo):
35 | mo.md(
36 | """
37 | ## 1. Basic UI Elements
38 |
39 | ---
40 | """
41 | )
42 | return
43 |
44 |
45 | @app.cell
46 | def __(mo):
47 | slider = mo.ui.slider(1, 10, value=5, label="Slider Example")
48 | checkbox = mo.ui.checkbox(label="Checkbox Example")
49 | text_input = mo.ui.text(placeholder="Enter text here", label="Text Input Example")
50 |
51 | mo.vstack([slider, checkbox, text_input])
52 | return checkbox, slider, text_input
53 |
54 |
55 | @app.cell
56 | def __(checkbox, mo, slider, text_input):
57 | mo.md(
58 | f"""
59 | Slider value: {slider.value}
60 | Checkbox state: {checkbox.value}
61 | Text input: {text_input.value}
62 | Slider * Text input: {slider.value * "⭐️"}
63 | """
64 | )
65 | return
66 |
67 |
68 | @app.cell
69 | def __(mo):
70 | mo.md(
71 | """
72 | ## 2. Reactive Data Visualization
73 | ---
74 | """
75 | )
76 | return
77 |
78 |
79 | @app.cell
80 | def __(mo, pd):
81 | # Create a sample dataset
82 | sample_df = pd.DataFrame(
83 | {"x": range(1, 11), "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
84 | )
85 |
86 | plot_type = mo.ui.dropdown(
87 | options=["scatter", "line", "bar"], value="scatter", label="Select Plot Type"
88 | )
89 |
90 | mo.vstack(
91 | [
92 | plot_type,
93 | # mo.ui.table(sample_df, selection=None)
94 | ]
95 | )
96 | return plot_type, sample_df
97 |
98 |
99 | @app.cell
100 | def __(mo, plot_type, plt, sample_df):
101 | plt.figure(figsize=(10, 6))
102 |
103 | if plot_type.value == "scatter":
104 | plt.scatter(sample_df["x"], sample_df["y"])
105 | elif plot_type.value == "line":
106 | plt.plot(sample_df["x"], sample_df["y"])
107 | else:
108 | plt.bar(sample_df["x"], sample_df["y"])
109 |
110 | plt.xlabel("X")
111 | plt.ylabel("Y")
112 | plt.title(f"{plot_type.value.capitalize()} Plot")
113 | mo.mpl.interactive(plt.gcf())
114 | return
115 |
116 |
117 | @app.cell
118 | def __(mo):
119 | mo.md("""## 3. Conditional Output and Control Flow""")
120 | return
121 |
122 |
123 | @app.cell
124 | def __(mo):
125 | show_secret = mo.ui.checkbox(label="Show Secret Message")
126 | show_secret
127 | return (show_secret,)
128 |
129 |
130 | @app.cell
131 | def __(mo, show_secret):
132 | mo.stop(not show_secret.value, mo.md("Check the box to reveal the secret message!"))
133 | mo.md(
134 | "🎉 Congratulations! You've unlocked the secret message: Marimo is awesome! 🎉"
135 | )
136 | return
137 |
138 |
139 | @app.cell
140 | def __(mo):
141 | mo.md("""## 4. File Handling and Data Processing""")
142 | return
143 |
144 |
145 | @app.cell
146 | def __(mo):
147 | file_upload = mo.ui.file(label="Upload a CSV file")
148 | file_upload
149 | return (file_upload,)
150 |
151 |
152 | @app.cell
153 | def __(file_upload, io, mo, pd):
154 | mo.stop(
155 | not file_upload.value, mo.md("Please upload a CSV file to see the preview.")
156 | )
157 |
158 | uploaded_df = pd.read_csv(io.BytesIO(file_upload.value[0].contents))
159 |     # Only the last expression in a cell is displayed, so stack the heading and the table
160 |     mo.vstack([mo.md("### Uploaded File Preview"), mo.ui.table(uploaded_df)])
161 | return (uploaded_df,)
162 |
163 |
164 | @app.cell
165 | def __(mo):
166 | mo.md("""## 5. Advanced UI Components""")
167 | return
168 |
169 |
170 | @app.cell
171 | def __(mo, pd):
172 | accordion = mo.accordion(
173 | {
174 | "Section 1": mo.md("This is the content of section 1."),
175 | "Section 2": mo.ui.slider(0, 100, value=50, label="Nested Slider"),
176 | "Section 3": mo.ui.table(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})),
177 | }
178 | )
179 | accordion
180 | return (accordion,)
181 |
182 |
183 | @app.cell
184 | def __(mo):
185 | tabs = mo.ui.tabs(
186 | {
187 | "Tab 1": mo.md("Content of Tab 1"),
188 | "Tab 2": mo.ui.button(label="Click me!"),
189 | "Tab 3": mo.mermaid(
190 | """
191 | graph TD
192 | A[Start] --> B{Decision}
193 | B -->|Yes| C[Do Something]
194 | B -->|No| D[Do Nothing]
195 | C --> E[End]
196 | D --> E
197 | """
198 | ),
199 | }
200 | )
201 | tabs
202 | return (tabs,)
203 |
204 |
205 | @app.cell
206 | def __(mo):
207 | mo.md("""## 6. Batch Operations and Forms""")
208 | return
209 |
210 |
211 | @app.cell
212 | def __(mo):
213 | user_form = (
214 | mo.md(
215 | """
216 | ### User Information Form
217 |
218 | First Name: {first_name}
219 | Last Name: {last_name}
220 | Age: {age}
221 | Email: {email}
222 | """
223 | )
224 | .batch(
225 | first_name=mo.ui.text(label="First Name"),
226 | last_name=mo.ui.text(label="Last Name"),
227 | age=mo.ui.number(start=0, stop=120, label="Age"),
228 | email=mo.ui.text(label="Email"),
229 | )
230 | .form()
231 | )
232 |
233 | user_form
234 | return (user_form,)
235 |
236 |
237 | @app.cell
238 | def __(mo, user_form):
239 | mo.stop(
240 |         not user_form.value or not user_form.value.get("first_name"),
241 | mo.md("Please submit the form to see the results."),
242 | )
243 |
244 | mo.md(
245 | f"""
246 | ### Submitted Information
247 |
248 | - **First Name:** {user_form.value['first_name']}
249 | - **Last Name:** {user_form.value['last_name']}
250 | - **Age:** {user_form.value['age']}
251 | - **Email:** {user_form.value['email']}
252 | """
253 | )
254 | return
255 |
256 |
257 | @app.cell
258 | def __(mo):
259 | mo.md("""## 7. Embedding External Content""")
260 | return
261 |
262 |
263 | @app.cell
264 | def __(mo):
265 | mo.image("https://marimo.io/logo.png", width=200, alt="Marimo Logo")
266 | return
267 |
268 |
269 | @app.cell
270 | def __(mo):
271 | mo.video(
272 | "https://v3.cdnpk.net/videvo_files/video/free/2013-08/large_watermarked/hd0992_preview.mp4",
273 | width=560,
274 | height=315,
275 | )
276 | return
277 |
278 |
279 | @app.cell
280 | def __(mo):
281 | mo.md("""## 8. Custom Styling and Layouts""")
282 | return
283 |
284 |
285 | @app.cell
286 | def __(mo):
287 | styled_text = mo.md(
288 | """
289 | # Custom Styled Header
290 |
291 | This text has custom styling applied.
292 | """
293 | ).style(
294 | {
295 | "font-style": "italic",
296 | "background-color": "#aaa",
297 | "padding": "10px",
298 | "border-radius": "5px",
299 | },
300 | )
301 |
302 | styled_text
303 | return (styled_text,)
304 |
305 |
306 | @app.cell
307 | def __(mo):
308 | layout = mo.vstack(
309 | [
310 | mo.hstack(
311 | [
312 | mo.md("Left Column").style(
313 | {
314 | "background-color": "#e0e0e0",
315 | "padding": "10px",
316 | }
317 | ),
318 | mo.md("Right Column").style(
319 | {
320 | "background-color": "#d0d0d0",
321 | "padding": "10px",
322 | }
323 | ),
324 | ]
325 | ),
326 | mo.md("Bottom Row").style(
327 | {"background-color": "#c0c0c0", "padding": "10px"}
328 | ),
329 | ]
330 | )
331 |
332 | layout
333 | return (layout,)
334 |
335 |
336 | @app.cell
337 | def __(mo):
338 | mo.md(
339 | """
340 | ## 9. Interactive Data Exploration
341 | ---
342 | """
343 | )
344 | return
345 |
346 |
347 | @app.cell
348 | def __(data, mo):
349 | cars = data.cars()
350 | mo.ui.data_explorer(cars)
351 | return (cars,)
352 |
353 |
354 | @app.cell
355 | def __(alt, data, mo):
356 | chart = (
357 | alt.Chart(data.cars())
358 | .mark_circle()
359 | .encode(
360 | x="Horsepower",
361 | y="Miles_per_Gallon",
362 | color="Origin",
363 | tooltip=["Name", "Origin", "Horsepower", "Miles_per_Gallon"],
364 | )
365 | .interactive()
366 | )
367 |
368 | mo.ui.altair_chart(chart)
369 | return (chart,)
370 |
371 |
372 | @app.cell
373 | def __(mo):
374 | mo.md(
375 | """
376 | ## Conclusion
377 |
378 | This notebook has demonstrated various features and capabilities of Marimo. From basic UI elements to advanced data visualization and interactive components, Marimo provides a powerful toolkit for creating dynamic and engaging notebooks.
379 |
380 | Explore the code in each cell to learn more about how these examples were created!
381 | """
382 | )
383 | return
384 |
385 |
386 | if __name__ == "__main__":
387 | app.run()
388 |
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/chain.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | from typing import List, Dict, Callable, Any, Tuple, Union
4 | from .typings import FusionChainResult
5 | import concurrent.futures
6 |
7 |
8 | class FusionChain:
9 |
10 | @staticmethod
11 | def run(
12 | context: Dict[str, Any],
13 | models: List[Any],
14 | callable: Callable,
15 | prompts: List[str],
16 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]],
17 | get_model_name: Callable[[Any], str],
18 | ) -> FusionChainResult:
19 | """
20 | Run a competition between models on a list of prompts.
21 |
22 | Runs the MinimalChainable.run method for each model for each prompt and evaluates the results.
23 |
24 | The evaluator runs on the last output of each model at the end of the chain of prompts.
25 |
26 | The eval method returns a performance score for each model from 0 to 1, giving priority to models earlier in the list.
27 |
28 | Args:
29 | context (Dict[str, Any]): The context for the prompts.
30 | models (List[Any]): List of models to compete.
31 | callable (Callable): The function to call for each prompt.
32 | prompts (List[str]): List of prompts to process.
33 |             evaluator (Callable[[List[Any]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores.
34 |             get_model_name (Callable[[Any], str]): Function that returns a display name for a model.
35 |
36 | Returns:
37 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names.
38 | """
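        # Illustrative evaluator (an assumption, not part of this module): score each
        # model's last output by relative length and pick the longest as the top response.
        #   def evaluator(last_outputs):
        #       lengths = [len(str(o)) for o in last_outputs]
        #       top = last_outputs[lengths.index(max(lengths))]
        #       return top, [length / max(lengths) for length in lengths]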
39 | all_outputs = []
40 | all_context_filled_prompts = []
41 |
42 | for model in models:
43 | outputs, context_filled_prompts = MinimalChainable.run(
44 | context, model, callable, prompts
45 | )
46 | all_outputs.append(outputs)
47 | all_context_filled_prompts.append(context_filled_prompts)
48 |
49 | # Evaluate the last output of each model
50 | last_outputs = [outputs[-1] for outputs in all_outputs]
51 | top_response, performance_scores = evaluator(last_outputs)
52 |
53 | model_names = [get_model_name(model) for model in models]
54 |
55 | return FusionChainResult(
56 | top_response=top_response,
57 | all_prompt_responses=all_outputs,
58 | all_context_filled_prompts=all_context_filled_prompts,
59 | performance_scores=performance_scores,
60 | llm_model_names=model_names,
61 | )
62 |
63 | @staticmethod
64 | def run_parallel(
65 | context: Dict[str, Any],
66 | models: List[Any],
67 | callable: Callable,
68 | prompts: List[str],
69 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]],
70 | get_model_name: Callable[[Any], str],
71 | num_workers: int = 4,
72 | ) -> FusionChainResult:
73 | """
74 | Run a competition between models on a list of prompts in parallel.
75 |
76 | This method is similar to the 'run' method but utilizes parallel processing
77 | to improve performance when dealing with multiple models.
78 |
79 | Args:
80 | context (Dict[str, Any]): The context for the prompts.
81 | models (List[Any]): List of models to compete.
82 | callable (Callable): The function to call for each prompt.
83 | prompts (List[str]): List of prompts to process.
84 |             evaluator (Callable[[List[Any]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores.
85 |             get_model_name (Callable[[Any], str]): Function that returns a display name for a model.
86 |             num_workers (int): Number of parallel workers to use. Defaults to 4.
87 |
88 | Returns:
89 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names.
90 | """
91 |
92 | def process_model(model):
93 | outputs, context_filled_prompts = MinimalChainable.run(
94 | context, model, callable, prompts
95 | )
96 | return outputs, context_filled_prompts
97 |
98 | all_outputs = []
99 | all_context_filled_prompts = []
100 |
101 | with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
102 | future_to_model = {
103 | executor.submit(process_model, model): model for model in models
104 | }
105 | for future in concurrent.futures.as_completed(future_to_model):
106 | outputs, context_filled_prompts = future.result()
107 | all_outputs.append(outputs)
108 | all_context_filled_prompts.append(context_filled_prompts)
109 |
110 | # Evaluate the last output of each model
111 | last_outputs = [outputs[-1] for outputs in all_outputs]
112 | top_response, performance_scores = evaluator(last_outputs)
113 |
114 | model_names = [get_model_name(model) for model in models]
115 |
116 | return FusionChainResult(
117 | top_response=top_response,
118 | all_prompt_responses=all_outputs,
119 | all_context_filled_prompts=all_context_filled_prompts,
120 | performance_scores=performance_scores,
121 | llm_model_names=model_names,
122 | )
123 |
124 |
125 | class MinimalChainable:
126 | """
127 | Sequential prompt chaining with context and output back-references.
128 | """
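
    # Prompt placeholders handled by `run` (illustrative):
    #   {{key}}              -> replaced with context["key"]
    #   {{output[-1]}}       -> replaced with the previous prompt's output
    #   {{output[-2]}}       -> the output two steps back, and so on
    #   {{output[-1].field}} -> a field of a previous output that parsed as JSON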
129 |
130 | @staticmethod
131 | def run(
132 | context: Dict[str, Any], model: Any, callable: Callable, prompts: List[str]
133 | ) -> Tuple[List[Any], List[str]]:
134 | # Initialize an empty list to store the outputs
135 | output = []
136 | context_filled_prompts = []
137 |
138 | # Iterate over each prompt with its index
139 | for i, prompt in enumerate(prompts):
140 | # Iterate over each key-value pair in the context
141 | for key, value in context.items():
142 | # Check if the key is in the prompt
143 | if "{{" + key + "}}" in prompt:
144 | # Replace the key with its value
145 | prompt = prompt.replace("{{" + key + "}}", str(value))
146 |
147 | # Replace references to previous outputs
148 | # Iterate from the current index down to 1
149 | for j in range(i, 0, -1):
150 | # Get the previous output
151 | previous_output = output[i - j]
152 |
153 | # Handle JSON (dict) output references
154 | # Check if the previous output is a dictionary
155 | if isinstance(previous_output, dict):
156 | # Check if the reference is in the prompt
157 | if f"{{{{output[-{j}]}}}}" in prompt:
158 | # Replace the reference with the JSON string
159 | prompt = prompt.replace(
160 | f"{{{{output[-{j}]}}}}", json.dumps(previous_output)
161 | )
162 | # Iterate over each key-value pair in the previous output
163 | for key, value in previous_output.items():
164 | # Check if the key reference is in the prompt
165 | if f"{{{{output[-{j}].{key}}}}}" in prompt:
166 | # Replace the key reference with its value
167 | prompt = prompt.replace(
168 | f"{{{{output[-{j}].{key}}}}}", str(value)
169 | )
170 | # If not a dict, use the original string
171 | else:
172 | # Check if the reference is in the prompt
173 | if f"{{{{output[-{j}]}}}}" in prompt:
174 | # Replace the reference with the previous output
175 | prompt = prompt.replace(
176 | f"{{{{output[-{j}]}}}}", str(previous_output)
177 | )
178 |
179 | # Append the context filled prompt to the list
180 | context_filled_prompts.append(prompt)
181 |
182 | # Call the provided callable with the processed prompt
183 | # Get the result by calling the callable with the model and prompt
184 | result = callable(model, prompt)
185 |
186 | print("result", result)
187 |
188 | # Try to parse the result as JSON, handling markdown-wrapped JSON
189 | try:
190 | # First, attempt to extract JSON from markdown code blocks
191 | # Search for JSON in markdown code blocks
192 | json_match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", result)
193 | # If a match is found
194 | if json_match:
195 | # Parse the JSON from the match
196 | result = json.loads(json_match.group(1))
197 | else:
198 | # If no markdown block found, try parsing the entire result
199 | # Parse the entire result as JSON
200 | result = json.loads(result)
201 | except json.JSONDecodeError:
202 | # Not JSON, keep as is
203 | pass
204 |
205 | # Append the result to the output list
206 | output.append(result)
207 |
208 | # Return the list of outputs
209 | return output, context_filled_prompts
210 |
211 | @staticmethod
212 | def to_delim_text_file(name: str, content: List[Union[str, dict, list]]) -> str:
213 | result_string = ""
214 | with open(f"{name}.txt", "w") as outfile:
215 | for i, item in enumerate(content, 1):
216 | if isinstance(item, (dict, list)):
217 | item = json.dumps(item)
218 | elif not isinstance(item, str):
219 | item = str(item)
220 | chain_text_delim = (
221 | f"{'🔗' * i} -------- Prompt Chain Result #{i} -------------\n\n"
222 | )
223 | outfile.write(chain_text_delim)
224 | outfile.write(item)
225 | outfile.write("\n\n")
226 |
227 | result_string += chain_text_delim + item + "\n\n"
228 |
229 | return result_string
230 |
--------------------------------------------------------------------------------
/multi_language_model_ranker.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="full")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import marimo as mo
10 | import src.marimo_notebook.modules.llm_module as llm_module
11 | import src.marimo_notebook.modules.prompt_library_module as prompt_library_module
12 | import json
13 | import pyperclip
14 | return json, llm_module, mo, prompt_library_module, pyperclip
15 |
16 |
17 | @app.cell
18 | def __(prompt_library_module):
19 | map_testable_prompts: dict = prompt_library_module.pull_in_testable_prompts()
20 | return (map_testable_prompts,)
21 |
22 |
23 | @app.cell
24 | def __(llm_module):
25 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series()
26 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest()
27 | # llm_sonnet = llm_module.build_sonnet_3_5()
28 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo()
29 | # gemini_1_5_pro_2, gemini_1_5_flash_2 = llm_module.build_gemini_1_2_002()
30 | # llama3_2_model, llama3_2_1b_model = llm_module.build_ollama_models()
31 | # _, phi3_5_model, qwen2_5_model = llm_module.build_ollama_slm_models()
32 |
33 | models = {
34 | "o1-mini": llm_o1_mini,
35 | "o1-preview": llm_o1_preview,
36 | "gpt-4o-latest": llm_gpt_4o_latest,
37 | "gpt-4o-mini": llm_gpt_4o_mini,
38 | # "sonnet-3.5": llm_sonnet,
39 | # "gemini-1-5-pro": gemini_1_5_pro,
40 | # "gemini-1-5-flash": gemini_1_5_flash,
41 | # "gemini-1-5-pro-002": gemini_1_5_pro_2,
42 | # "gemini-1-5-flash-002": gemini_1_5_flash_2,
43 | # "llama3-2": llama3_2_model,
44 | # "llama3-2-1b": llama3_2_1b_model,
45 | # "phi3-5": phi3_5_model,
46 | # "qwen2-5": qwen2_5_model,
47 | }
48 | return (
49 | llm_gpt_4o_latest,
50 | llm_gpt_4o_mini,
51 | llm_o1_mini,
52 | llm_o1_preview,
53 | models,
54 | )
55 |
56 |
57 | @app.cell
58 | def __(map_testable_prompts, mo, models):
59 | prompt_multiselect = mo.ui.multiselect(
60 | options=list(map_testable_prompts.keys()),
61 | label="Select Prompts",
62 | )
63 | prompt_temp_slider = mo.ui.slider(
64 | start=0, stop=1, value=0.5, step=0.05, label="Temp"
65 | )
66 | model_multiselect = mo.ui.multiselect(
67 | options=models.copy(),
68 | label="Models",
69 | value=["gpt-4o-mini",],
70 | )
71 | return model_multiselect, prompt_multiselect, prompt_temp_slider
72 |
73 |
74 | @app.cell
75 | def __():
76 | prompt_style = {
77 | "background": "#eee",
78 | "padding": "10px",
79 | "border-radius": "10px",
80 | "margin-bottom": "20px",
81 | }
82 | return (prompt_style,)
83 |
84 |
85 | @app.cell
86 | def __(mo, model_multiselect, prompt_multiselect, prompt_temp_slider):
87 | form = (
88 | mo.md(
89 | r"""
90 | # Multi Language Model Ranker 📊
91 | {prompts}
92 | {temp}
93 | {models}
94 | """
95 | )
96 | .batch(
97 | prompts=prompt_multiselect,
98 | temp=prompt_temp_slider,
99 | models=model_multiselect,
100 | )
101 | .form()
102 | )
103 | form
104 | return (form,)
105 |
106 |
107 | @app.cell
108 | def __(form, map_testable_prompts, mo, prompt_style):
109 | mo.stop(not form.value)
110 |
111 | selected_models_string = mo.ui.array(
112 | [mo.ui.text(value=m.model_id, disabled=True) for m in form.value["models"]]
113 | )
114 |
115 | selected_prompts_accordion = mo.accordion(
116 | {
117 | prompt: mo.md(f"```xml\n{map_testable_prompts[prompt]}\n```")
118 | for prompt in form.value["prompts"]
119 | }
120 | )
121 |
122 | mo.vstack(
123 | [
124 | mo.md("## Selected Models"),
125 | mo.hstack(selected_models_string, align="start", justify="start"),
126 | mo.md("## Selected Prompts"),
127 | selected_prompts_accordion,
128 | ]
129 | ).style(prompt_style)
130 | return selected_models_string, selected_prompts_accordion
131 |
132 |
133 | @app.cell
134 | def __(form, llm_module, map_testable_prompts, mo, prompt_library_module):
135 | mo.stop(not form.value, "")
136 |
137 | all_prompt_responses = []
138 |
139 | total_executions = len(form.value["prompts"]) * len(form.value["models"])
140 |
141 | with mo.status.progress_bar(
142 | title="Running prompts on selected models...",
143 | total=total_executions,
144 | remove_on_exit=True,
145 | ) as prog_bar:
146 | for selected_prompt_name in form.value["prompts"]:
147 | selected_prompt = map_testable_prompts[selected_prompt_name]
148 | prompt_responses = []
149 |
150 | for model in form.value["models"]:
151 | model_name = model.model_id
152 | prog_bar.update(
153 | title=f"Prompting '{model_name}' with '{selected_prompt_name}'",
154 | increment=1,
155 | )
156 | raw_prompt_response = llm_module.prompt_with_temp(
157 | model, selected_prompt, form.value["temp"]
158 | )
159 | prompt_responses.append(
160 | {
161 | "model_id": model_name,
162 | "model": model,
163 | "output": raw_prompt_response,
164 | }
165 | )
166 |
167 | # Create a new list without the 'model' key for each response
168 | list_model_execution_dict = [
169 | {k: v for k, v in response.items() if k != "model"}
170 | for response in prompt_responses
171 | ]
172 |
173 | # Record the execution
174 | execution_filepath = prompt_library_module.record_llm_execution(
175 | prompt=selected_prompt,
176 | list_model_execution_dict=list_model_execution_dict,
177 | prompt_template=selected_prompt_name,
178 | )
179 | print(f"Execution record saved to: {execution_filepath}")
180 |
181 | all_prompt_responses.append(
182 | {
183 | "prompt_name": selected_prompt_name,
184 | "prompt": selected_prompt,
185 | "responses": prompt_responses,
186 | "execution_filepath": execution_filepath,
187 | }
188 | )
189 | return (
190 | all_prompt_responses,
191 | execution_filepath,
192 | list_model_execution_dict,
193 | model,
194 | model_name,
195 | prog_bar,
196 | prompt_responses,
197 | raw_prompt_response,
198 | selected_prompt,
199 | selected_prompt_name,
200 | total_executions,
201 | )
202 |
203 |
204 | @app.cell
205 | def __(all_prompt_responses, mo, pyperclip):
206 | mo.stop(not all_prompt_responses, mo.md(""))
207 |
208 | def copy_to_clipboard(text):
209 | print("copying: ", text)
210 | pyperclip.copy(text)
211 | return 1
212 |
213 | all_prompt_elements = []
214 |
215 | output_prompt_style = {
216 | "background": "#eee",
217 | "padding": "10px",
218 | "border-radius": "10px",
219 | "margin-bottom": "20px",
220 | "min-width": "200px",
221 | "box-shadow": "2px 2px 2px #ccc",
222 | }
223 |
224 | for loop_prompt_data in all_prompt_responses:
225 | prompt_output_elements = [
226 | mo.vstack(
227 | [
228 | mo.md(f"#### {response['model_id']}").style(
229 | {"font-weight": "bold"}
230 | ),
231 | mo.md(response["output"]),
232 | ]
233 | ).style(output_prompt_style)
234 | for response in loop_prompt_data["responses"]
235 | ]
236 |
237 | prompt_element = mo.vstack(
238 | [
239 | mo.md(f"### Prompt: {loop_prompt_data['prompt_name']}"),
240 | mo.hstack(prompt_output_elements, wrap=True, justify="start"),
241 | ]
242 | ).style(
243 | {
244 | "border-left": "4px solid #CCC",
245 | "padding": "2px 10px",
246 | "background": "#ffffee",
247 | }
248 | )
249 |
250 | all_prompt_elements.append(prompt_element)
251 |
252 | mo.vstack(all_prompt_elements)
253 | return (
254 | all_prompt_elements,
255 | copy_to_clipboard,
256 | loop_prompt_data,
257 | output_prompt_style,
258 | prompt_element,
259 | prompt_output_elements,
260 | )
261 |
262 |
263 | @app.cell
264 | def __(all_prompt_responses, copy_to_clipboard, form, mo):
265 | mo.stop(not all_prompt_responses, mo.md(""))
266 | mo.stop(not form.value, mo.md(""))
267 |
268 | # Prepare data for the table
269 | table_data = []
270 | for prompt_data in all_prompt_responses:
271 | for response in prompt_data["responses"]:
272 | table_data.append(
273 | {
274 | "Prompt": prompt_data["prompt_name"],
275 | "Model": response["model_id"],
276 | "Output": response["output"],
277 | }
278 | )
279 |
280 | # Create the table
281 | results_table = mo.ui.table(
282 | data=table_data,
283 | pagination=True,
284 | selection="multi",
285 | page_size=30,
286 | label="Model Responses",
287 | format_mapping={
288 | "Output": lambda val: "(trimmed) " + val[:15],
289 | # "Output": lambda val: val,
290 | },
291 | )
292 |
293 | # Function to copy selected outputs to clipboard
294 | def copy_selected_outputs():
295 | selected_rows = results_table.value
296 | if selected_rows:
297 | outputs = [row["Output"] for row in selected_rows]
298 | combined_output = "\n\n".join(outputs)
299 | copy_to_clipboard(combined_output)
300 | return f"Copied {len(outputs)} response(s) to clipboard"
301 | return "No rows selected"
302 |
303 | # Create the run buttons
304 | copy_button = mo.ui.run_button(label="🔗 Copy Selected Outputs")
305 | score_button = mo.ui.run_button(label="👍 Vote Selected Outputs")
306 |
307 | # Display the table and run buttons
308 | mo.vstack(
309 | [
310 | results_table,
311 | mo.hstack(
312 | [
313 | score_button,
314 | copy_button,
315 | ],
316 | justify="start",
317 | ),
318 | ]
319 | )
320 | return (
321 | copy_button,
322 | copy_selected_outputs,
323 | prompt_data,
324 | response,
325 | results_table,
326 | score_button,
327 | table_data,
328 | )
329 |
330 |
331 | @app.cell
332 | def __(
333 | copy_to_clipboard,
334 | get_rankings,
335 | mo,
336 | prompt_library_module,
337 | results_table,
338 | score_button,
339 | set_rankings,
340 | ):
341 | mo.stop(not results_table.value, "")
342 |
343 | selected_rows = results_table.value
344 | outputs = [row["Output"] for row in selected_rows]
345 | combined_output = "\n\n".join(outputs)
346 |
347 | if score_button.value:
348 | # Increment scores for selected models
349 | current_rankings = get_rankings()
350 | for row in selected_rows:
351 | model_id = row["Model"]
352 | for ranking in current_rankings:
353 | if ranking.llm_model_id == model_id:
354 | ranking.score += 1
355 | break
356 |
357 | # Save updated rankings
358 | set_rankings(current_rankings)
359 | prompt_library_module.save_rankings(current_rankings)
360 |
361 | mo.md(f"Scored {len(selected_rows)} model(s)")
362 | else:
363 | copy_to_clipboard(combined_output)
364 | mo.md(f"Copied {len(outputs)} response(s) to clipboard")
365 | return (
366 | combined_output,
367 | current_rankings,
368 | model_id,
369 | outputs,
370 | ranking,
371 | row,
372 | selected_rows,
373 | )
374 |
375 |
376 | @app.cell
377 | def __(all_prompt_responses, form, mo, prompt_library_module):
378 | mo.stop(not form.value, mo.md(""))
379 | mo.stop(not all_prompt_responses, mo.md(""))
380 |
381 | # Create buttons for resetting and loading rankings
382 | reset_ranking_button = mo.ui.run_button(label="❌ Reset Rankings")
383 | load_ranking_button = mo.ui.run_button(label="🔐 Load Rankings")
384 |
385 | # Load existing rankings
386 | get_rankings, set_rankings = mo.state(prompt_library_module.get_rankings())
387 |
388 | mo.hstack(
389 | [
390 | load_ranking_button,
391 | reset_ranking_button,
392 | ],
393 | justify="start",
394 | )
395 | return (
396 | get_rankings,
397 | load_ranking_button,
398 | reset_ranking_button,
399 | set_rankings,
400 | )
401 |
402 |
403 | @app.cell
404 | def __():
405 | # get_rankings()
406 | return
407 |
408 |
409 | @app.cell
410 | def __(
411 | form,
412 | mo,
413 | prompt_library_module,
414 | reset_ranking_button,
415 | set_rankings,
416 | ):
417 | mo.stop(not form.value, mo.md(""))
418 | mo.stop(not reset_ranking_button.value, mo.md(""))
419 |
420 | set_rankings(
421 | prompt_library_module.reset_rankings(
422 | [model.model_id for model in form.value["models"]]
423 | )
424 | )
425 |
426 | # mo.md("Rankings reset successfully")
427 | return
428 |
429 |
430 | @app.cell
431 | def __(form, load_ranking_button, mo, prompt_library_module, set_rankings):
432 | mo.stop(not form.value, mo.md(""))
433 | mo.stop(not load_ranking_button.value, mo.md(""))
434 |
435 | set_rankings(prompt_library_module.get_rankings())
436 | return
437 |
438 |
439 | @app.cell
440 | def __(get_rankings, mo):
441 | # Create UI elements for each model
442 | model_elements = []
443 |
444 | model_score_style = {
445 | "background": "#eeF",
446 | "padding": "10px",
447 | "border-radius": "10px",
448 | "margin-bottom": "20px",
449 | "min-width": "150px",
450 | "box-shadow": "2px 2px 2px #ccc",
451 | }
452 |
453 | for model_ranking in get_rankings():
454 | llm_model_id = model_ranking.llm_model_id
455 | score = model_ranking.score
456 | model_elements.append(
457 | mo.vstack(
458 | [
459 | mo.md(f"**{llm_model_id}** "),
460 | mo.hstack([mo.md(f""), mo.md(f"# {score}")]),
461 | ],
462 | justify="space-between",
463 | gap="2",
464 | ).style(model_score_style)
465 | )
466 |
467 | mo.hstack(model_elements, justify="start", wrap=True)
468 | return (
469 | llm_model_id,
470 | model_elements,
471 | model_ranking,
472 | model_score_style,
473 | score,
474 | )
475 |
476 |
477 | if __name__ == "__main__":
478 | app.run()
479 |
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_1.md:
--------------------------------------------------------------------------------
1 | What was the end of year prediction made in the SCRIPT below?
2 |
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. 
We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. 
Then we have Llama 3, so close to my 80% standard; we'll dig into where it went wrong in just a second. Phi 3 failed half of the 12 test cases, and Gemma did one better at 7 out of 12. This is why it's important to have a control group specifically for testing ELMs: it's really useful to compare against a high-performing model. GPT-3.5 Turbo isn't really even high performing anymore, but it's a good benchmark for testing against local models, because if we used Opus or GPT-4 here, the local models wouldn't even come close. That's why I like to compare against something like GPT-3.5; you can also use Claude 3 Haiku here. This right away gives you a great benchmark on how local models are performing. Let's go ahead and look at one of these tests: what happened, where did things go wrong? Let's look at our text classification. This is a simple test. The prompt is: "Is the following block of text a SQL natural language query (NLQ)? Respond exclusively with yes or no." So this test looks at how well the model can both answer correctly and answer precisely; it needs to say yes or no. The block of text is "Select 10 users over the age of 21 with a gmail address," and then we have the assertion type equals "yes." So our test case validates this test only if the model returns exclusively "yes," and we can look at the prompt test to see exactly what that looks like. If you go to test.yaml, we can see we're looking for just "yes." This is one of our text classification tests, and it has the assertion type equals "yes." Equals is used when you know exactly what you want the response to be. A lot of the time you'll want something like icontains-all (case-insensitive contains all) or icontains-any (case-insensitive contains any), and there are lots of different assertions you can make. You can easily dive into that; I've linked it in the README. You'll want to look at the assertions documentation in Promptfoo; they have a whole list of different assertions you can make to improve and strengthen your prompt tests. So that's what that test looks like, and you can go through the results line by line for each model to see exactly what went right, what went wrong, and so on. Feel free to check out the other test cases. The long story short here is that by running the ITV benchmark, by running your personal benchmarks against local models, you can have higher confidence and a first mover's advantage on getting your hands on these local models and truly utilizing them. As you can see, Llama 3 is nearly within my standard of what I need an ELM to do based on these 12 test cases, and I'll keep expanding the suite with more of the use cases I rely on. Out of these 12 test cases, Llama 3 is performing really, really well, and this is the 8B model. If we look at Ollama, you can see the default version that ships is the 8 billion parameter model with 4-bit quantization. Pretty good stuff here. I don't need to talk about how great Llama 3 is; the rest of the internet is doing that. But it is really awesome to see how it performs on your specific use cases. The closer you get to the metal here, the better you understand how these models perform next to each other, and the faster you're going to be able to take these models and productionize them in your tools and products. I also just want to shout out how incredible it is to actually run these tests over and over and over
again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. 
You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
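For example, once OpenELM (or any new ELM) lands on Ollama, swapping it into the benchmark as described above would be a one-line provider addition along these lines. The `openelm:3b` tag is hypothetical, since the model is not published on Ollama at the time of the video.

```yaml
providers:
  - openai:gpt-3.5-turbo       # control
  - ollama:chat:llama3
  - ollama:chat:phi3
  - ollama:chat:gemma
  - ollama:chat:openelm:3b     # hypothetical tag; add it once the model is available, then re-run the suite
```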
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_3.md:
--------------------------------------------------------------------------------
1 | What was the end of year prediction made in the SCRIPT below?
2 |
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. 
We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. 
Then we have Llama 3, so close to my 80% standard; we'll dig into where it went wrong in just a second. Phi 3 failed half of the 12 test cases, and Gemma did one better at 7 out of 12. This is why it's important to have a control group specifically for testing ELMs: it's really useful to compare against a high-performing model. GPT-3.5 Turbo isn't really even high performing anymore, but it's a good benchmark for testing against local models, because if we used Opus or GPT-4 here, the local models wouldn't even come close. That's why I like to compare against something like GPT-3.5; you can also use Claude 3 Haiku here. This right away gives you a great benchmark on how local models are performing. Let's go ahead and look at one of these tests: what happened, where did things go wrong? Let's look at our text classification. This is a simple test. The prompt is: "Is the following block of text a SQL natural language query (NLQ)? Respond exclusively with yes or no." So this test looks at how well the model can both answer correctly and answer precisely; it needs to say yes or no. The block of text is "Select 10 users over the age of 21 with a gmail address," and then we have the assertion type equals "yes." So our test case validates this test only if the model returns exclusively "yes," and we can look at the prompt test to see exactly what that looks like. If you go to test.yaml, we can see we're looking for just "yes." This is one of our text classification tests, and it has the assertion type equals "yes." Equals is used when you know exactly what you want the response to be. A lot of the time you'll want something like icontains-all (case-insensitive contains all) or icontains-any (case-insensitive contains any), and there are lots of different assertions you can make. You can easily dive into that; I've linked it in the README. You'll want to look at the assertions documentation in Promptfoo; they have a whole list of different assertions you can make to improve and strengthen your prompt tests. So that's what that test looks like, and you can go through the results line by line for each model to see exactly what went right, what went wrong, and so on. Feel free to check out the other test cases. The long story short here is that by running the ITV benchmark, by running your personal benchmarks against local models, you can have higher confidence and a first mover's advantage on getting your hands on these local models and truly utilizing them. As you can see, Llama 3 is nearly within my standard of what I need an ELM to do based on these 12 test cases, and I'll keep expanding the suite with more of the use cases I rely on. Out of these 12 test cases, Llama 3 is performing really, really well, and this is the 8B model. If we look at Ollama, you can see the default version that ships is the 8 billion parameter model with 4-bit quantization. Pretty good stuff here. I don't need to talk about how great Llama 3 is; the rest of the internet is doing that. But it is really awesome to see how it performs on your specific use cases. The closer you get to the metal here, the better you understand how these models perform next to each other, and the faster you're going to be able to take these models and productionize them in your tools and products. I also just want to shout out how incredible it is to actually run these tests over and over and over
again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. 
You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
--------------------------------------------------------------------------------
/AI_DOCS/marimo_compressed.md:
--------------------------------------------------------------------------------
1 | import marimo as mo
2 | import random
3 | import pandas as pd
4 | import plotly.express as px
5 | import altair as alt
6 | from vega_datasets import data
7 | import matplotlib.pyplot as plt
8 |
9 | # Markdown
10 | mo.md("## This is a markdown heading")
11 |
12 | # Inputs
13 |
14 | # 1. Array
15 | sliders = mo.ui.array([mo.ui.slider(1, 100) for _ in range(3)])
16 | mo.md(f"Array of sliders: {sliders}")
17 |
18 | # 2. Batch
19 | user_info = mo.md(
20 | """
21 | - **Name:** {name}
22 | - **Birthday:** {birthday}
23 | """
24 | ).batch(name=mo.ui.text(), birthday=mo.ui.date())
25 | user_info
26 |
27 | # 3. Button
28 | def on_click(value):
29 | print("Button clicked!", value)
30 | return value + 1
31 |
32 | button = mo.ui.button(on_click=on_click, value=0, label="Click Me")
33 | button
34 |
35 | # 4. Checkbox
36 | checkbox = mo.ui.checkbox(label="Agree to terms")
37 | mo.md(f"Checkbox value: {checkbox.value}")
38 |
39 | # 5. Code Editor
40 | code = """
41 | def my_function():
42 | print("Hello from code editor!")
43 | """
44 | code_editor = mo.ui.code_editor(value=code, language="python")
45 | code_editor
46 |
47 | # 6. Dataframe
48 | df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
49 | dataframe_ui = mo.ui.dataframe(df)
50 | dataframe_ui
51 |
52 | # 7. Data Explorer
53 | data_explorer = mo.ui.data_explorer(data.cars())
54 | data_explorer
55 |
56 | # 8. Dates
57 |
58 | # Single date
59 | date_picker = mo.ui.date(label="Select a date")
60 | date_picker
61 |
62 | # Date and time
63 | datetime_picker = mo.ui.datetime(label="Select a date and time")
64 | datetime_picker
65 |
66 | # Date range
67 | date_range_picker = mo.ui.date_range(label="Select a date range")
68 | date_range_picker
69 |
70 | # 9. Dictionary
71 | elements = mo.ui.dictionary({
72 | 'slider': mo.ui.slider(1, 10),
73 | 'text': mo.ui.text(placeholder="Enter text")
74 | })
75 | mo.md(f"Dictionary of elements: {elements}")
76 |
77 | # 10. Dropdown
78 | dropdown = mo.ui.dropdown(options=['Option 1', 'Option 2', 'Option 3'], label="Select an option")
79 | dropdown
80 |
81 | # 11. File
82 | file_upload = mo.ui.file(label="Upload a file")
83 | file_upload
84 |
85 | # 12. File Browser
86 | file_browser = mo.ui.file_browser(label="Browse files")
87 | file_browser
88 |
89 | # 13. Form
90 | form = mo.ui.text(label="Enter your name").form()
91 | form
92 |
93 | # 14. Microphone
94 | microphone = mo.ui.microphone(label="Record audio")
95 | microphone
96 |
97 | # 15. Multiselect
98 | multiselect = mo.ui.multiselect(options=['A', 'B', 'C', 'D'], label="Select multiple options")
99 | multiselect
100 |
101 | # 16. Number
102 | number_picker = mo.ui.number(0, 10, step=0.5, label="Select a number")
103 | number_picker
104 |
105 | # 17. Radio
106 | radio_group = mo.ui.radio(options=['Red', 'Green', 'Blue'], label="Select a color")
107 | radio_group
108 |
109 | # 18. Range Slider
110 | range_slider = mo.ui.range_slider(0, 100, step=5, value=[20, 80], label="Select a range")
111 | range_slider
112 |
113 | # 19. Refresh
114 | refresh_button = mo.ui.refresh(default_interval="5s", label="Refresh")
115 | refresh_button
116 |
117 | # 20. Run Button
118 | run_button = mo.ui.run_button(label="Run")
119 | run_button
120 |
121 | # 21. Slider
122 | slider = mo.ui.slider(0, 100, step=1, label="Adjust value")
123 | slider
124 |
125 | # 22. Switch
126 | switch = mo.ui.switch(label="Enable feature")
127 | switch
128 |
129 | # 23. Table
130 | table_data = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
131 | table = mo.ui.table(data=table_data, label="User Table")
132 | table
133 |
134 | # 24. Tabs
135 | tab1_content = mo.md("Content for Tab 1")
136 | tab2_content = mo.ui.slider(0, 10)
137 | tabs = mo.ui.tabs({'Tab 1': tab1_content, 'Tab 2': tab2_content})
138 | tabs
139 |
140 | # 25. Text
141 | text_input = mo.ui.text(placeholder="Enter some text", label="Text Input")
142 | text_input
143 |
144 | # 26. Text Area
145 | text_area = mo.ui.text_area(placeholder="Enter a long text", label="Text Area")
146 | text_area
147 |
148 | # 27. Custom UI elements (Anywidget)
149 | # See the documentation on Anywidget for examples.
150 |
151 | # Layouts
152 |
153 | # 1. Accordion
154 | accordion = mo.ui.accordion({'Section 1': mo.md("This is section 1"), 'Section 2': mo.ui.slider(0, 10)})
155 | accordion
156 |
157 | # 2. Carousel
158 | carousel = mo.carousel([mo.md("Item 1"), mo.ui.slider(0, 10), mo.md("Item 3")])
159 | carousel
160 |
161 | # 3. Callout
162 | callout = mo.md("Important message!").callout(kind="warn")
163 | callout
164 |
165 | # 4. Justify
166 |
167 | # Center
168 | centered_text = mo.md("This text is centered").center()
169 | centered_text
170 |
171 | # Left
172 | left_aligned_text = mo.md("This text is left aligned").left()
173 | left_aligned_text
174 |
175 | # Right
176 | right_aligned_text = mo.md("This text is right aligned").right()
177 | right_aligned_text
178 |
179 | # 5. Lazy
180 | def lazy_content():
181 | mo.md("This content loaded lazily!")
182 |
183 | lazy_element = mo.lazy(lazy_content)
184 | lazy_element
185 |
186 | # 6. Plain
187 | plain_dataframe = mo.plain(df)
188 | plain_dataframe
189 |
190 | # 7. Routes
191 | def home_page():
192 | return mo.md("# Home Page")
193 |
194 | def about_page():
195 | return mo.md("# About Page")
196 |
197 | mo.routes({
198 | "#/": home_page,
199 | "#/about": about_page
200 | })
201 |
202 | # 8. Sidebar
203 | sidebar_content = mo.vstack([mo.md("## Menu"), mo.ui.button(label="Home"), mo.ui.button(label="About")])
204 | mo.sidebar(sidebar_content)
205 |
206 | # 9. Stacks
207 |
208 | # Horizontal Stack
209 | hstack_layout = mo.hstack([mo.md("Left"), mo.ui.slider(0, 10), mo.md("Right")])
210 | hstack_layout
211 |
212 | # Vertical Stack
213 | vstack_layout = mo.vstack([mo.md("Top"), mo.ui.slider(0, 10), mo.md("Bottom")])
214 | vstack_layout
215 |
216 | # 10. Tree
217 | tree_data = ['Item 1', ['Subitem 1.1', 'Subitem 1.2'], {'Key': 'Value'}]
218 | tree = mo.tree(tree_data)
219 | tree
220 |
221 | # Plotting
222 |
223 | # Reactive charts with Altair
224 | altair_chart = mo.ui.altair_chart(alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon', color='Origin'))
225 | altair_chart
226 |
227 | # Reactive plots with Plotly
228 | plotly_chart = mo.ui.plotly(px.scatter(data.cars(), x="Horsepower", y="Miles_per_Gallon", color="Origin"))
229 | plotly_chart
230 |
231 | # Interactive matplotlib
232 | plt.plot([1, 2, 3, 4])
233 | interactive_mpl_chart = mo.mpl.interactive(plt.gcf())
234 | interactive_mpl_chart
235 |
236 | # Media
237 |
238 | # 1. Image
239 | image = mo.image("https://marimo.io/logo.png", width=100, alt="Marimo Logo")
240 | image
241 |
242 | # 2. Audio
243 | audio = mo.audio("https://www.zedge.net/find/ringtones/ocean%20waves")
244 | audio
245 |
246 | # 3. Video
247 | video = mo.video("https://www.youtube.com/watch?v=dQw4w9WgXcQ", controls=True)
248 | video
249 |
250 | # 4. PDF
251 | pdf = mo.pdf("https://www.africau.edu/images/default/sample.pdf", width="50%")
252 | pdf
253 |
254 | # 5. Download Media
255 | download_button = mo.download(data="This is the content of the file", filename="download.txt")
256 | download_button
257 |
258 | # 6. Plain text
259 | plain_text = mo.plain_text("This is plain text")
260 | plain_text
261 |
262 | # Diagrams
263 |
264 | # 1. Mermaid diagrams
265 | mermaid_code = """
266 | graph LR
267 | A[Square Rect] -- Link text --> B((Circle))
268 | A --> C(Round Rect)
269 | B --> D{Rhombus}
270 | C --> D
271 | """
272 | mermaid_diagram = mo.mermaid(mermaid_code)
273 | mermaid_diagram
274 |
275 | # 2. Statistic cards
276 | stat_card = mo.stat(value=100, label="Users", caption="Total users this month", direction="increase")
277 | stat_card
278 |
279 | # Status
280 |
281 | # 1. Progress bar
282 | for i in mo.status.progress_bar(range(10), title="Processing"):
283 | # Simulate some work
284 | pass
285 |
286 | # 2. Spinner
287 | with mo.status.spinner(title="Loading...", subtitle="Please wait"):
288 | # Simulate a long-running task
289 | pass
290 |
291 | # Outputs
292 |
293 | # 1. Replace output
294 | mo.output.replace(mo.md("This is the new output"))
295 |
296 | # 2. Append output
297 | mo.output.append(mo.md("This is appended output"))
298 |
299 | # 3. Clear output
300 | mo.output.clear()
301 |
302 | # 4. Replace output at index
303 | mo.output.replace_at_index(mo.md("Replaced output"), 0)
304 |
305 | # Display cell code
306 | mo.show_code(mo.md("This output has code displayed"))
307 |
308 | # Control Flow
309 |
310 | # Stop execution
311 | user_age = mo.ui.number(0, 100, label="Enter your age")
312 | mo.stop(user_age.value < 18, mo.md("You must be 18 or older"))
313 | mo.md(f"Your age is: {user_age.value}")
314 |
315 | # HTML
316 |
317 | # 1. Convert to HTML
318 | html_object = mo.as_html(mo.md("This is markdown converted to HTML"))
319 | html_object
320 |
321 | # 2. Html object
322 | custom_html = mo.Html("This is custom HTML")
323 | custom_html
324 | 
325 | # Other API components
326 | 
327 | # Query Parameters
328 | params = mo.query_params()
329 | params['name'] = 'John'
330 | 
331 | # Command Line Arguments
332 | args = mo.cli_args()
333 | 
334 | # State
335 | get_count, set_count = mo.state(0)
336 | mo.ui.button(on_click=lambda: set_count(get_count() + 1), label="Increment")
337 | mo.md(f"Count: {get_count()}")
338 | 
339 | # App
340 | # See documentation for embedding notebooks
341 | 
342 | # Cell
343 | # See documentation for running cells from other notebooks
344 | 
345 | # Miscellaneous
346 | is_running_in_notebook = mo.running_in_notebook()
347 | 
348 | --- Guides
349 | 
350 | ## Marimo Guides: Concise Examples
351 | 
352 | Here are concise examples for each guide in the Marimo documentation:
353 | 
354 | ### 1. Overview
355 | 
356 | ```python
357 | import marimo as mo
358 | 
359 | # Define a variable
360 | x = 10
361 | 
362 | # Display markdown with variable interpolation
363 | mo.md(f"The value of x is {x}")
364 | 
365 | # Create a slider and display its value reactively
366 | slider = mo.ui.slider(0, 100, value=50)
367 | mo.md(f"Slider value: {slider.value}")
368 | ```
369 | 
370 | ### 2. Reactivity
371 | 
372 | ```python
373 | import marimo as mo
374 | 
375 | # Define a variable in one cell
376 | data = [1, 2, 3, 4, 5]
377 | 
378 | # Use the variable in another cell - this cell will rerun when `data` changes
379 | mo.md(f"The sum of the data is {sum(data)}")
380 | ```
381 | 
382 | ### 3. Interactivity
383 | 
384 | ```python
385 | import marimo as mo
386 | 
387 | # Create a slider
388 | slider = mo.ui.slider(0, 10, label="Select a value")
389 | 
390 | # Display the slider's value reactively
391 | mo.md(f"You selected: {slider.value}")
392 | ```
393 | 
394 | ### 4. SQL
395 | 
396 | ```python
397 | import marimo as mo
398 | 
399 | # Create a dataframe
400 | df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})
401 | 
402 | # Query the dataframe using SQL
403 | mo.sql("SELECT * FROM df WHERE age > 30")
404 | ```
405 | 
406 | ### 5. Run as an app
407 | 
408 | ```bash
409 | # Run a notebook as an interactive web app
410 | marimo run my_notebook.py
411 | ```
412 | 
413 | ### 6. Run as a script
414 | 
415 | ```bash
416 | # Execute a notebook as a Python script
417 | python my_notebook.py
418 | ```
419 | 
420 | ### 7. Outputs
421 | 
422 | ```python
423 | import marimo as mo
424 | 
425 | # Display markdown
426 | mo.md("This is **markdown** output")
427 | 
428 | # Display a matplotlib plot
429 | import matplotlib.pyplot as plt
430 | plt.plot([1, 2, 3, 4, 5])
431 | plt.show()
432 | ```
433 | 
434 | ### 8. Dataframes
435 | 
436 | ```python
437 | import marimo as mo
438 | import pandas as pd
439 | 
440 | # Create a Pandas dataframe
441 | df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
442 | 
443 | # Display the dataframe in an interactive table
444 | mo.ui.table(df)
445 | ```
446 | 
447 | ### 9. Plotting
448 | 
449 | ```python
450 | import marimo as mo
451 | import altair as alt
452 | from vega_datasets import data
453 | 
454 | # Create a reactive Altair chart
455 | chart = alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon')
456 | chart = mo.ui.altair_chart(chart)
457 | 
458 | # Display the chart and selected data
459 | mo.hstack([chart, chart.value])
460 | ```
461 | 
462 | ### 10. Editor Features
463 | 
464 | - Explore variable values and their definitions in the **Variables Panel**.
465 | - Visualize cell dependencies in the **Dependency Graph**.
466 | - Use **Go-to-Definition** to jump to variable declarations.
467 | - Enable **GitHub Copilot** for AI-powered code suggestions.
468 | - Customize **Hotkeys** and **Theming** in the settings.
469 | 
470 | ### 11. Theming
471 | 
472 | ```python
473 | # In your notebook.py file:
474 | 
475 | app = marimo.App(css_file="custom.css")
476 | ```
477 | 
478 | ### 12. Best Practices
479 | 
480 | - Use global variables sparingly.
481 | - Encapsulate logic in functions and modules.
482 | - Minimize mutations.
483 | - Write idempotent cells.
484 | - Use caching for expensive computations.
485 | 
486 | ### 13. Coming from other Tools
487 | 
488 | - Refer to guides for specific tools like Jupyter, Jupytext, Papermill, and Streamlit to understand the transition to Marimo.
489 | 
490 | ### 14. Integrating with Marimo
491 | 
492 | ```python
493 | import marimo as mo
494 | 
495 | # Check if running in a Marimo notebook
496 | if mo.running_in_notebook():
497 |     # Execute Marimo-specific code
498 |     pass
499 | ```
500 | 
501 | ### 15. Reactive State
502 | 
503 | ```python
504 | import marimo as mo
505 | 
506 | # Create reactive state
507 | get_count, set_count = mo.state(0)
508 | 
509 | # Increment the counter on button click
510 | mo.ui.button(on_click=lambda: set_count(get_count() + 1), label="Increment")
511 | 
512 | # Display the counter value reactively
513 | mo.md(f"Count: {get_count()}")
514 | ```
515 | 
516 | ### 16. Online Playground
517 | 
518 | - Create and share Marimo notebooks online at [https://marimo.new](https://marimo.new).
519 | 
520 | ### 17. Exporting
521 | 
522 | ```bash
523 | # Export to HTML
524 | marimo export html my_notebook.py -o my_notebook.html
525 | 
526 | # Export to Python script
527 | marimo export script my_notebook.py -o my_script.py
528 | ```
529 | 
530 | ### 18. Configuration
531 | 
532 | - Customize user-wide settings in `~/.marimo.toml`.
533 | - Configure notebook-specific settings in the `notebook.py` file.
534 | 
535 | ### 19. Troubleshooting
536 | 
537 | - Use the **Variables Panel** and **Dependency Graph** to debug cell execution issues.
538 | - Add `print` statements for debugging.
539 | - Try the "Lazy" runtime configuration for identifying stale cells.
540 | 
541 | ### 20. Deploying
542 | 
543 | ```bash
544 | # Deploy a Marimo notebook as an interactive web app
545 | marimo run my_notebook.py
546 | ```
547 | 
548 | --- recipes
549 | 
550 | ## Marimo Recipes: Concise Examples
551 | 
552 | Here are concise examples of common tasks and concepts from the Marimo Recipes section:
553 | 
554 | ### Control Flow
555 | 
556 | #### 1. Show an output conditionally
557 | 
558 | ```python
559 | import marimo as mo
560 | 
561 | show_output = mo.ui.checkbox(label="Show output")
562 | mo.md("This output is visible!") if show_output.value else None
563 | ```
564 | 
565 | #### 2. Run a cell on a timer
566 | 
567 | ```python
568 | import marimo as mo
569 | import time
570 | 
571 | refresh = mo.ui.refresh(default_interval="1s")
572 | refresh
573 | 
574 | # This cell will run every second
575 | refresh
576 | mo.md(f"Current time: {time.time()}")
577 | ```
578 | 
579 | #### 3. Require form submission before sending UI value
580 | 
581 | ```python
582 | import marimo as mo
583 | 
584 | form = mo.ui.text(label="Your name").form()
585 | form
586 | 
587 | # This cell will only run after form submission
588 | mo.stop(form.value is None, mo.md("Please submit the form."))
589 | mo.md(f"Hello, {form.value}!")
590 | ```
591 | 
592 | #### 4. Stop execution of a cell and its descendants
593 | 
594 | ```python
595 | import marimo as mo
596 | 
597 | should_continue = mo.ui.checkbox(label="Continue?")
598 | 
599 | # Stop execution if the checkbox is not checked
600 | mo.stop(not should_continue.value, mo.md("Execution stopped."))
601 | 
602 | # This code will only run if the checkbox is checked
603 | mo.md("Continuing execution...")
604 | ```
605 | 
606 | ### Grouping UI Elements Together
607 | 
608 | #### 1. Create an array of UI elements
609 | 
610 | ```python
611 | import marimo as mo
612 | 
613 | n_sliders = mo.ui.number(1, 5, value=3, label="Number of sliders")
614 | sliders = mo.ui.array([mo.ui.slider(0, 100) for _ in range(n_sliders.value)])
615 | mo.hstack(sliders)
616 | 
617 | # Access slider values
618 | mo.md(f"Slider values: {sliders.value}")
619 | ```
620 | 
621 | #### 2. Create a dictionary of UI elements
622 | 
623 | ```python
624 | import marimo as mo
625 | 
626 | elements = mo.ui.dictionary({
627 |     'name': mo.ui.text(label="Name"),
628 |     'age': mo.ui.number(0, 100, label="Age")
629 | })
630 | 
631 | # Access element values
632 | mo.md(f"Name: {elements['name'].value}, Age: {elements['age'].value}")
633 | ```
634 | 
635 | #### 3. Embed a dynamic number of UI elements in another output
636 | 
637 | ```python
638 | import marimo as mo
639 | 
640 | n_items = mo.ui.number(1, 5, value=3, label="Number of items")
641 | items = mo.ui.array([mo.ui.text(placeholder=f"Item {i+1}") for i in range(n_items.value)])
642 | 
643 | mo.md(f"""
644 | **My List:**
645 | 
646 | * {items[0]}
647 | * {items[1]}
648 | * {items[2]}
649 | """)
650 | ```
651 | 
652 | #### 4. Create a `hstack` (or `vstack`) of UI elements with `on_change` handlers
653 | 
654 | ```python
655 | import marimo as mo
656 | 
657 | def handle_click(value, index):
658 |     mo.md(f"Button {index} clicked!")
659 | 
660 | buttons = mo.ui.array(
661 |     [mo.ui.button(label=f"Button {i}", on_change=lambda v, i=i: handle_click(v, i))
662 |      for i in range(3)]
663 | )
664 | mo.hstack(buttons)
665 | ```
666 | 
667 | #### 5. Create a table column of buttons with `on_change` handlers
668 | 
669 | ```python
670 | import marimo as mo
671 | 
672 | def handle_click(value, row_index):
673 |     mo.md(f"Button clicked for row {row_index}")
674 | 
675 | buttons = mo.ui.array(
676 |     [mo.ui.button(label="Click me", on_change=lambda v, i=i: handle_click(v, i))
677 |      for i in range(3)]
678 | )
679 | 
680 | mo.ui.table({
681 |     'Name': ['Alice', 'Bob', 'Charlie'],
682 |     'Action': buttons
683 | })
684 | ```
685 | 
686 | #### 6. Create a form with multiple UI elements
687 | 
688 | ```python
689 | import marimo as mo
690 | 
691 | form = mo.md(
692 |     """
693 |     **User Details**
694 | 
695 |     Name: {name}
696 |     Age: {age}
697 |     """
698 | ).batch(
699 |     name=mo.ui.text(label="Name"),
700 |     age=mo.ui.number(0, 100, label="Age")
701 | ).form()
702 | form
703 | 
704 | # Access form values after submission
705 | mo.md(f"Name: {form.value['name']}, Age: {form.value['age']}")
706 | ```
707 | 
708 | ### Working with Buttons
709 | 
710 | #### 1. Create a button that triggers computation when clicked
711 | 
712 | ```python
713 | import marimo as mo
714 | import random
715 | 
716 | run_button = mo.ui.run_button(label="Generate Random Number")
717 | run_button
718 | 
719 | # This cell only runs when the button is clicked
720 | mo.stop(not run_button.value, "Click 'Generate' to get a random number")
721 | mo.md(f"Random number: {random.randint(0, 100)}")
722 | ```
723 | 
724 | #### 2. Create a counter button
725 | 
726 | ```python
727 | import marimo as mo
728 | 
729 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Count")
730 | counter_button
731 | 
732 | # Display the count
733 | mo.md(f"Count: {counter_button.value}")
734 | ```
735 | 
736 | #### 3. Create a toggle button
737 | 
738 | ```python
739 | import marimo as mo
740 | 
741 | toggle_button = mo.ui.button(value=False, on_click=lambda state: not state, label="Toggle")
742 | toggle_button
743 | 
744 | # Display the toggle state
745 | mo.md(f"State: {'On' if toggle_button.value else 'Off'}")
746 | ```
747 | 
748 | #### 4. Re-run a cell when a button is pressed
749 | 
750 | ```python
751 | import marimo as mo
752 | import random
753 | 
754 | refresh_button = mo.ui.button(label="Refresh")
755 | refresh_button
756 | 
757 | # This cell reruns when the button is clicked
758 | refresh_button
759 | mo.md(f"Random number: {random.randint(0, 100)}")
760 | ```
761 | 
762 | #### 5. Run a cell when a button is pressed, but not before
763 | 
764 | ```python
765 | import marimo as mo
766 | 
767 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Click to Continue")
768 | counter_button
769 | 
770 | # Only run this cell after the button is clicked
771 | mo.stop(counter_button.value == 0, "Click the button to continue.")
772 | mo.md("You clicked the button!")
773 | ```
774 | 
775 | #### 6. Reveal an output when a button is pressed
776 | 
777 | ```python
778 | import marimo as mo
779 | 
780 | show_button = mo.ui.button(label="Show Output")
781 | show_button
782 | 
783 | # Reveal output only after button click
784 | mo.md("This is the hidden output!") if show_button.value else None
785 | ```
786 | 
787 | ### Caching
788 | 
789 | #### 1. Cache expensive computations
790 | 
791 | ```python
792 | import marimo as mo
793 | import functools
794 | import time
795 | 
796 | @functools.cache
797 | def expensive_function(x):
798 |     time.sleep(2) # Simulate a long computation
799 |     return x * 2
800 | 
801 | # Call the function multiple times with the same argument
802 | result1 = expensive_function(5)
803 | result2 = expensive_function(5) # This will be retrieved from the cache
804 | 
805 | mo.md(f"Result 1: {result1}, Result 2: {result2}")
806 | ```
807 | 
808 | These concise examples provide practical illustrations of various recipes, showcasing how Marimo can be used to create interactive, dynamic, and efficient notebooks.
809 | 
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_2.md:
--------------------------------------------------------------------------------
1 | What was the speakers personal accuracy requirement for the benchmark made in the SCRIPT below?
2 | 
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here.
The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. 
First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six, and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. First things first, the accuracy for the ITV benchmark, which we're about to get to, must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second: I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course, we'll accept. So keep in mind, when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second, at a very minimum, is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU, however it ends up getting sliced. On my 64 gigabyte machine, I have several Docker instances and other applications that are basically running 24/7 and constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window: for me, the sweet spot is 32K and above. Llama 3 released with 8K. I said, cool, benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model, running on your device. So, JSON response and vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice-to-have. There are image models that can run in isolation, and that does the trick for me. I'm not super concerned about having local, on-device multimodal models, at least right now. JSON response support is a must-have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here: 80% accuracy on the ITV benchmark, which we'll talk about in just a second. We have the speed: I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32 gigabytes. And then, of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs, so that when they come around, you're ready to start using them for your personal tools and products. Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITV benchmark. What is this? It's simple. ITV is just: is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on-device language models, good enough? The code repository we're about to dive into is a personalized, use case specific benchmark to quickly swap ELMs, AKA on-device language models, in and out, to know if they're ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description.
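To make the standards above concrete, here is a minimal sketch of that checklist written down as a config. This is purely illustrative: it is not a file from the repo, and the field names are invented; only the thresholds come from the narration above.

```yaml
# Hypothetical personal ELM standards (field names invented; thresholds from the narration)
accuracy_min: 0.80          # pass rate required on the ITV benchmark
tokens_per_second_min: 20   # anything slower is disqualified
memory_gb_max: 32           # RAM/GPU/CPU budget on a 64 GB machine
context_window_min: 32768   # 32K tokens minimum
json_response: required
vision_support: nice-to-have
```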
Let's go ahead and crack open VS Code, and let's just start with the README. So let's preview this, and it's simple. This uses Bun, PromptFu, and Ollama for a minimalist, cross-platform, local LLM prompt testing and benchmarking experience. So before we dive into this any more, I'm just going to go ahead and open up the terminal. I'm going to type bun run elm, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT-3.5 as a control model to test against. And then you can see here, we have Llama 3 running through Ollama, we have Phi, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. All the setup details are going to be in the README; you'll be able to get set up with this in less than a minute. This code base was designed specifically for you, to help you benchmark local models for your use cases, so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing: bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. The prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of, what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough, though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, the PromptFu configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as, or near the performance of. Right now I'm using GPT-3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. In here, you can see we have Llama 3, we have Phi 3, and we have Gemma. Again, you can tweak these. This is all built on top of Ollama. Let's go ahead and run through our tool set quickly. We're using Bun, which is an all-in-one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up.
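As a rough sketch, the kind of configuration just described, a cloud control model plus experimental local models served by Ollama, might look something like this. The file name, paths, and exact provider IDs are assumptions based on promptfoo's documented config format, not copied from the repo:

```yaml
# promptfooconfig.yaml (illustrative sketch, not the repo's actual file)
description: Efficient Language Models benchmark
prompts:
  - file://bench__efficient_language_models/prompt.txt    # hypothetical path
providers:
  - openai:gpt-3.5-turbo    # control cloud LLM
  - ollama:chat:llama3      # experimental local ELMs served by Ollama
  - ollama:chat:phi3
  - ollama:chat:gemma
tests: file://bench__efficient_language_models/tests.yaml # hypothetical path
```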
This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. You have your prompts that you're going to test. This is what you would normally type in a chat input field. And then you can go ahead and click test. And then you have your individual models. Let's say you want to test OpenAI, Claude, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then, most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here we're running this type, contains: we need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simple experience. So you have your prompt configuration, where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying: create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting, case insensitively, that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 successful and 15 failed tests. Each model ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. After you run that (that was bun run elm), you can run bun view. And if we open up package.json, you can see bun view just runs promptfoo view. This is going to kick off a local promptfoo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. We have our control model failing only one test, right, so it passed with 91% accuracy.
And then we have Llama 3, so close to my 80% standard. We'll dig into where it went wrong in just a second here. We then have Phi 3, which failed half of the 12 test cases, and then we have Gemma, which looks like it did one better: 7 out of 12. So you can see here, this is why it's important to have a control group specifically for testing ELMs. It's really good to compare against a kind of high-performing model, and, you know, GPT-3.5 Turbo is not really even high-performing anymore, but it's a good benchmark for testing against local models, because really, if we used Opus or GPT-4 here, the local models wouldn't even come close. So that's why I like to compare to something like GPT-3.5. You can also use Claude 3 Haiku here. This right away gives you a great benchmark on how local models are performing. Let's go ahead and look at one of these tests. What happened? Where did things go wrong? Let's look at our text classification. This is a simple test. The prompt is: is the following block of text a SQL natural language query (NLQ)? Respond exclusively with yes or no. So this test here is going to look at how well the model can both answer correctly and answer precisely, right? It needs to say yes or no. And then the block of text is: select 10 users over the age of 21 with a Gmail address. And then we have the assertion type equals yes. So our test case validates this test if it returns exclusively yes, and we can look at the prompt test to see exactly what that looks like. So if you go to test.yaml, we can see we're looking for just yes. This is what that test looks like, right? This is one of our text classification tests, and we have this assertion type, equals, with the value yes. Equals is used when you know exactly what you want the response to be. A lot of the time you'll want something like an icontains-all, so a case-insensitive contains-everything, or a case-insensitive contains-any, and there are lots of different assertions you can make. You can easily dive into that; I've linked that in the README. You'll want to look at the assertions documentation in PromptFu. They have a whole list of different assertions you can make to improve and strengthen your prompt tests. So that's what that test looks like, and you can go through the results line by line for each model to see exactly what went right, what went wrong, etc. So feel free to check out the other test cases.
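For reference, the text classification test just described might be written roughly like this in the tests file. The structure follows promptfoo's documented test case format (vars plus assert), but the variable names and file path are assumptions, and the icontains-all values are placeholders rather than the repo's actual expectations:

```yaml
# tests.yaml (illustrative sketch)
- description: text classification - is this an NLQ?
  vars:
    block_of_text: Select 10 users over the age of 21 with a gmail address
  assert:
    - type: equals
      value: "yes"              # response must be exactly "yes"

- description: bullet summary (icontains-all example)
  vars:
    script: file://some_video_script.txt   # hypothetical variable and path
  assert:
    - type: icontains-all       # case-insensitive; every listed item must appear
      value:
        - expected phrase one   # placeholder values
        - expected phrase two
```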
The long story short here is that by running the ITV benchmark, by running your personal benchmarks against local models, you can have higher confidence, and you can have first mover's advantage on getting your hands on these local models and truly utilizing them. As you can see here, Llama 3 is nearly within my standard of what I need an ELM to do. Based on these 12 test cases (and I'll increase this to add a lot more of the use cases that I use), Llama 3 is performing really, really well, and this is the 8B model, right? If we look at Ollama, you can see here the default version that comes in is the 8 billion parameter model with 4-bit quantization. So pretty good stuff here. I don't need to talk about how great Llama 3 is, the rest of the internet is doing that, but it is really awesome to see how it performs on your specific use cases. The closer you get to the metal here, the closer you understand how these models are performing next to each other, the better and the faster you're going to be able to take these models and productionize them in your tools and products. I also just want to shout out how incredible it is to actually run these tests over and over again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still, everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm going to be keeping a pretty close eye on this stuff, so definitely like and subscribe if you're interested in the best-performing local models. I feel like we're going to have a few different classes of models, right? If we break this down: fastest, cheapest, and then best, slowest. And now what I think we need to do is take this and add a nesting to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local: fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3.5. But now we're getting into this interesting place where now we have things like this, right? Now we have Phi 3, we have Llama 3 at seven or eight billion, and we also have Gemma. And then in the slowest, we have our bigger models, right? So this is where something like Llama 3 at 70 billion goes. And then, you know, whatever other big models come out that are, you know, going to really chew up your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel, and I'll likely use, you know, this class system to separate them and keep an eye on these, right? The first thing that needs to happen is we need anything at all to run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? That's what we're looking for right now, and this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bun, PromptFu, and Ollama, this should be completely cross-platform, and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple of other sections, so look forward to that. I hope all of this makes sense. Hopefully you're feeling the speed of the open source community building toward usable, viable ELMs. I think this is something that we've all been really excited about, and it's finally starting to happen. I'm going to predict that by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well, and that's one of the highlights of using the ITV benchmark inside of this code base. You'll be able to quickly and seamlessly get that up and running by just updating the model name and adding a new configuration, something like OpenELM plus whatever the size is going to be, say it's the 3B, and that's it.
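A minimal sketch of that change, assuming OpenELM eventually ships in Ollama under a tag roughly like the one below (the exact tag is a guess until it is actually released):

```yaml
providers:
  - openai:gpt-3.5-turbo     # control
  - ollama:chat:llama3
  - ollama:chat:phi3
  - ollama:chat:gemma
  - ollama:chat:openelm:3b   # hypothetical new entry; real tag unknown until release
```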
Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, and tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistants, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELMs, efficient language models, will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video: to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products: is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
--------------------------------------------------------------------------------