├── .env.sample
├── .gitignore
├── AI_DOCS
│   ├── marimo_cheatsheet.md
│   ├── marimo_compressed.md
│   └── marimo_documentation.json
├── README.md
├── adhoc_prompting.py
├── images
│   ├── marimo_prompt_library.png
│   └── multi_slm_llm_prompt_and_model.png
├── language_model_rankings
│   └── rankings.json
├── layouts
│   ├── adhoc_prompting.grid.json
│   ├── adhoc_prompting.slides.json
│   ├── multi_language_model_ranker.grid.json
│   ├── multi_llm_prompting.grid.json
│   ├── prompt_library.grid.json
│   └── prompt_library.slides.json
├── marimo_is_awesome_demo.py
├── multi_language_model_ranker.py
├── multi_llm_prompting.py
├── prompt_library.py
├── prompt_library
│   ├── ai-coding-meta-review.xml
│   ├── bullet-knowledge-compression.xml
│   ├── chapter-gen.xml
│   └── hn-sentiment-analysis.xml
├── pyproject.toml
├── src
│   └── marimo_notebook
│       ├── __init__.py
│       ├── modules
│       │   ├── __init__.py
│       │   ├── chain.py
│       │   ├── llm_module.py
│       │   ├── prompt_library_module.py
│       │   ├── typings.py
│       │   └── utils.py
│       └── temp.py
├── testable_prompts
│   ├── bash_commands
│   │   ├── command_generation_1.md
│   │   ├── command_generation_2.md
│   │   └── command_generation_3.md
│   ├── basics
│   │   ├── hello.md
│   │   ├── mult_lang_counting.xml
│   │   ├── ping.xml
│   │   └── python_count_to_ten.xml
│   ├── code_debugging
│   │   ├── code_debugging_1.md
│   │   ├── code_debugging_2.md
│   │   └── code_debugging_3.md
│   ├── code_explanation
│   │   ├── code_explanation_1.md
│   │   ├── code_explanation_2.md
│   │   └── code_explanation_3.md
│   ├── code_generation
│   │   ├── code_generation_1.md
│   │   ├── code_generation_2.md
│   │   ├── code_generation_3.md
│   │   └── code_generation_4.md
│   ├── context_window
│   │   ├── context_window_1.md
│   │   ├── context_window_2.md
│   │   └── context_window_3.md
│   ├── email_management
│   │   ├── email_management_1.md
│   │   ├── email_management_2.md
│   │   ├── email_management_3.md
│   │   ├── email_management_4.md
│   │   ├── email_management_5.md
│   │   └── email_management_6.md
│   ├── personal_ai_assistant_responses
│   │   ├── personal_ai_assistant_responses_1.md
│   │   ├── personal_ai_assistant_responses_2.md
│   │   └── personal_ai_assistant_responses_3.md
│   ├── sql
│   │   ├── nlq1.md
│   │   ├── nlq2.md
│   │   ├── nlq3.md
│   │   ├── nlq4.md
│   │   └── nlq5.md
│   ├── string_manipulation
│   │   ├── string_manipulation_1.md
│   │   ├── string_manipulation_2.md
│   │   └── string_manipulation_3.md
│   └── text_classification
│       ├── text_classification_1.md
│       └── text_classification_2.md
└── uv.lock
/.env.sample:
--------------------------------------------------------------------------------
1 | ANTHROPIC_API_KEY=
2 | OPENAI_API_KEY=
3 | GROQ_API_KEY=
4 | GEMINI_API_KEY=
5 | PROMPT_LIBRARY_DIR=./prompt_library
6 | PROMPT_EXECUTIONS_DIR=./prompt_executions
7 | TESTABLE_PROMPTS_DIR=./testable_prompts
8 | LANGUAGE_MODEL_RANKINGS_FILE=./language_model_rankings/rankings.json
9 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Based on https://raw.githubusercontent.com/github/gitignore/main/Node.gitignore
2 |
3 | # Logs
4 |
5 | logs
6 | *.log
7 | npm-debug.log*
8 | yarn-debug.log*
9 | yarn-error.log*
10 | lerna-debug.log*
11 | .pnpm-debug.log*
12 |
13 | # Caches
14 |
15 | .cache
16 |
17 | # Diagnostic reports (https://nodejs.org/api/report.html)
18 |
19 | report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
20 |
21 | # Runtime data
22 |
23 | pids
24 | *.pid
25 | *.seed
26 | *.pid.lock
27 |
28 | # Directory for instrumented libs generated by jscoverage/JSCover
29 |
30 | lib-cov
31 |
32 | # Coverage directory used by tools like istanbul
33 |
34 | coverage
35 | *.lcov
36 |
37 | # nyc test coverage
38 |
39 | .nyc_output
40 |
41 | # Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
42 |
43 | .grunt
44 |
45 | # Bower dependency directory (https://bower.io/)
46 |
47 | bower_components
48 |
49 | # node-waf configuration
50 |
51 | .lock-wscript
52 |
53 | # Compiled binary addons (https://nodejs.org/api/addons.html)
54 |
55 | build/Release
56 |
57 | # Dependency directories
58 |
59 | node_modules/
60 | jspm_packages/
61 |
62 | # Snowpack dependency directory (https://snowpack.dev/)
63 |
64 | web_modules/
65 |
66 | # TypeScript cache
67 |
68 | *.tsbuildinfo
69 |
70 | # Optional npm cache directory
71 |
72 | .npm
73 |
74 | # Optional eslint cache
75 |
76 | .eslintcache
77 |
78 | # Optional stylelint cache
79 |
80 | .stylelintcache
81 |
82 | # Microbundle cache
83 |
84 | .rpt2_cache/
85 | .rts2_cache_cjs/
86 | .rts2_cache_es/
87 | .rts2_cache_umd/
88 |
89 | # Optional REPL history
90 |
91 | .node_repl_history
92 |
93 | # Output of 'npm pack'
94 |
95 | *.tgz
96 |
97 | # Yarn Integrity file
98 |
99 | .yarn-integrity
100 |
101 | # dotenv environment variable files
102 |
103 | .env
104 | .env.development.local
105 | .env.test.local
106 | .env.production.local
107 | .env.local
108 |
109 | # parcel-bundler cache (https://parceljs.org/)
110 |
111 | .parcel-cache
112 |
113 | # Next.js build output
114 |
115 | .next
116 | out
117 |
118 | # Nuxt.js build / generate output
119 |
120 | .nuxt
121 | dist
122 |
123 | # Gatsby files
124 |
125 | # Comment in the public line in if your project uses Gatsby and not Next.js
126 |
127 | # https://nextjs.org/blog/next-9-1#public-directory-support
128 |
129 | # public
130 |
131 | # vuepress build output
132 |
133 | .vuepress/dist
134 |
135 | # vuepress v2.x temp and cache directory
136 |
137 | .temp
138 |
139 | # Docusaurus cache and generated files
140 |
141 | .docusaurus
142 |
143 | # Serverless directories
144 |
145 | .serverless/
146 |
147 | # FuseBox cache
148 |
149 | .fusebox/
150 |
151 | # DynamoDB Local files
152 |
153 | .dynamodb/
154 |
155 | # TernJS port file
156 |
157 | .tern-port
158 |
159 | # Stores VSCode versions used for testing VSCode extensions
160 |
161 | .vscode-test
162 |
163 | # yarn v2
164 |
165 | .yarn/cache
166 | .yarn/unplugged
167 | .yarn/build-state.yml
168 | .yarn/install-state.gz
169 | .pnp.*
170 |
171 | # IntelliJ based IDEs
172 | .idea
173 |
174 | # Finder (MacOS) folder config
175 | .DS_Store
176 | .aider*
177 |
178 | __pycache__/
179 |
180 | .venv/
181 |
182 | prompt_executions/
--------------------------------------------------------------------------------
/AI_DOCS/marimo_cheatsheet.md:
--------------------------------------------------------------------------------
1 | # Marimo Cheat Sheet 0.2.5
2 |
3 | ## Install and Import
4 |
5 | ### Install
6 | ```bash
7 | pip install marimo
8 | ```
9 | ### Create or edit a notebook
10 | ```bash
11 | marimo edit my_notebook.py
12 | ```
13 | Then open the URL printed in the terminal in your browser.
14 | ### Open Tutorials
15 | ```bash
16 | marimo tutorial intro
17 | ```
18 | List the available tutorials:
19 | ```bash
20 | marimo tutorial --help
21 | ```
22 | ### Run notebook as a script
23 | ```bash
24 | python my_notebook.py
25 | ```
26 | ### Serve notebook as an app
27 | ```bash
28 | marimo run my_notebook.py
29 | ```
30 | ### Convert a Jupyter notebook
31 | ```bash
32 | marimo convert my_notebook.ipynb > my_notebook.py
33 | ```
34 | ### Export
35 | ```bash
36 | marimo export html my_notebook.py -o my_notebook.html
37 | ```
38 | ### CLI Commands (MARIMO CLI)
39 | ```bash
40 | marimo edit -p {PORT} my_notebook.py   # attach the server to a specific port
41 | marimo edit --headless my_notebook.py  # start the server without opening a browser
42 | marimo --help                          # list all commands and options
43 | ```
44 | **Server Port Tips**: if the default port is busy, choose another one with `-p/--port` and open the URL the server prints.
45 |
46 | [GitHub](https://github.com/tithyhs/marimo-cheat-sheet)
47 | [Docs](http://docs.marimo.io)
48 |
49 | ## Inputs
50 | ```python
51 | import marimo as mo
52 |
53 | # Array of UI elements
54 | sliders = mo.ui.array([mo.ui.slider(1, 100) for _ in range(3)])
55 |
56 | # Button with an optional on-click handler (receives the current value)
57 | button = mo.ui.button(on_click=lambda value: value + 1, value=0, label="Click Me")
58 |
59 | # Checkbox
60 | checkbox = mo.ui.checkbox(label="Check me")
61 |
62 | # Dataframe GUI (df is a pandas DataFrame)
63 | dataframe_ui = mo.ui.dataframe(df)
64 |
65 | # Dictionary of elements
66 | elements = mo.ui.dictionary({"text": mo.ui.text(), "slider": mo.ui.slider(1, 10)})
67 | ```
68 | #### Dropdown
69 | ```python
70 | dropdown = mo.ui.dropdown(options=["Choice 1", "Choice 2"], label="Select one")
71 | ```
72 | #### Multi Dropdown
73 | ```python
74 | multiselect = mo.ui.multiselect(options=["A", "B", "C"], label="Select many")
75 | ```
76 | ### Table output
77 | ```python
78 | table = mo.ui.table(data=[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}], label="User Table")
79 | ```
80 | ## Markdown
81 | ```python
82 | mo.md("## A markdown heading")
83 |
84 | # Interpolate values and UI elements
85 | mo.md(f"Slider value: {sliders.value}")
86 |
87 | # Links
88 | mo.md("[Docs](https://docs.marimo.io)")
89 | ```
90 | ## Outputs
91 | ```python
92 | # Replace a cell's output
93 | mo.output.replace(mo.md("New output"))
94 |
95 | # Append to a cell's output
96 | mo.output.append(mo.md("Appended output"))
97 | ```
98 | ## Plotting
99 | ```python
100 | # Reactive Altair chart (import altair as alt; from vega_datasets import data)
101 | chart = mo.ui.altair_chart(
102 |     alt.Chart(data.cars()).mark_point().encode(x="Horsepower", y="Miles_per_Gallon")
103 | )
104 |
105 | # Reactive Plotly chart (import plotly.express as px)
106 | plotly_chart = mo.ui.plotly(px.scatter(data.cars(), x="Horsepower", y="Miles_per_Gallon"))
107 |
108 | # Interactive matplotlib (import matplotlib.pyplot as plt)
109 | plt.plot([1, 2, 3, 4])
110 | mo.mpl.interactive(plt.gcf())
111 | ```
112 | ## Media
113 | ```python
114 | # Image
115 | mo.image("https://marimo.io/logo.png", width=100, alt="Marimo Logo")
116 |
117 | # Video and audio
118 | mo.video("https://example.com/clip.mp4", controls=True)
119 | mo.audio("path.mp3")
120 |
121 | # PDF and downloads
122 | mo.pdf("https://example.com/sample.pdf", width="50%")
123 | mo.download(data="This is the content of the file", filename="download.txt")
124 | ```
125 | ## Diagrams
126 | ```python
127 | # Mermaid diagram
128 | mo.mermaid("""
129 | graph LR
130 |     A[Square Rect] -- Link text --> B((Circle))
131 | """)
132 |
133 | # Statistic card
134 | mo.stat(value=100, label="Users", caption="Total users this month", direction="increase")
135 | ```
136 | ## Status
137 | ```python
138 | # Progress bar
139 | for i in mo.status.progress_bar(range(10), title="Processing"):
140 |     pass
141 |
142 | # Spinner
143 | with mo.status.spinner(title="Fetching data...", subtitle="Please wait"):
144 |     pass
145 | ```
146 | ## Control Flow
147 | ```python
148 | # Stop execution of a cell (and its descendants) when a condition holds
149 | mo.stop(not checkbox.value, mo.md("Check the box to continue"))
150 | ```
151 | ## State
152 | ```python
153 | # Reactive state: a getter and a setter
154 | get_count, set_count = mo.state(0)
155 | mo.ui.button(on_click=lambda _: set_count(get_count() + 1), label="Increment")
156 | mo.md(f"Count: {get_count()}")
157 | ```
158 | ## HTML
159 | ### Convert Python objects to HTML
160 | ```python
161 | html_object = mo.as_html(mo.md("Markdown converted to HTML"))
162 | custom_html = mo.Html("This is custom HTML")
163 | ```
164 | ### Set Justify
165 | ```python
166 | mo.md("Centered text").center()
167 | mo.md("Left aligned").left()
168 | mo.md("Right aligned").right()
169 | ```
170 | ## Debug
171 | - Inspect values and their definitions in the **Variables Panel**.
172 | - Trace which cells re-run and why in the **Dependency Graph**.
173 | - Fall back to `print` statements inside cells when needed.
174 |
175 | [GitHub](https://github.com/tithyhs/marimo-cheat-sheet)
176 | [Docs](http://docs.marimo.io)
--------------------------------------------------------------------------------
/AI_DOCS/marimo_compressed.md:
--------------------------------------------------------------------------------
1 | import marimo as mo
2 | import random
3 | import pandas as pd
4 | import plotly.express as px
5 | import altair as alt
6 | from vega_datasets import data
7 | import matplotlib.pyplot as plt
8 |
9 | # Markdown
10 | mo.md("## This is a markdown heading")
11 |
12 | # Inputs
13 |
14 | # 1. Array
15 | sliders = mo.ui.array([mo.ui.slider(1, 100) for _ in range(3)])
16 | mo.md(f"Array of sliders: {sliders}")
17 |
18 | # 2. Batch
19 | user_info = mo.md(
20 | """
21 | - **Name:** {name}
22 | - **Birthday:** {birthday}
23 | """
24 | ).batch(name=mo.ui.text(), birthday=mo.ui.date())
25 | user_info
26 |
27 | # 3. Button
28 | def on_click(value):
29 | print("Button clicked!", value)
30 | return value + 1
31 |
32 | button = mo.ui.button(on_click=on_click, value=0, label="Click Me")
33 | button
34 |
35 | # 4. Checkbox
36 | checkbox = mo.ui.checkbox(label="Agree to terms")
37 | mo.md(f"Checkbox value: {checkbox.value}")
38 |
39 | # 5. Code Editor
40 | code = """
41 | def my_function():
42 | print("Hello from code editor!")
43 | """
44 | code_editor = mo.ui.code_editor(value=code, language="python")
45 | code_editor
46 |
47 | # 6. Dataframe
48 | df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
49 | dataframe_ui = mo.ui.dataframe(df)
50 | dataframe_ui
51 |
52 | # 7. Data Explorer
53 | data_explorer = mo.ui.data_explorer(data.cars())
54 | data_explorer
55 |
56 | # 8. Dates
57 |
58 | # Single date
59 | date_picker = mo.ui.date(label="Select a date")
60 | date_picker
61 |
62 | # Date and time
63 | datetime_picker = mo.ui.datetime(label="Select a date and time")
64 | datetime_picker
65 |
66 | # Date range
67 | date_range_picker = mo.ui.date_range(label="Select a date range")
68 | date_range_picker
69 |
70 | # 9. Dictionary
71 | elements = mo.ui.dictionary({
72 | 'slider': mo.ui.slider(1, 10),
73 | 'text': mo.ui.text(placeholder="Enter text")
74 | })
75 | mo.md(f"Dictionary of elements: {elements}")
76 |
77 | # 10. Dropdown
78 | dropdown = mo.ui.dropdown(options=['Option 1', 'Option 2', 'Option 3'], label="Select an option")
79 | dropdown
80 |
81 | # 11. File
82 | file_upload = mo.ui.file(label="Upload a file")
83 | file_upload
84 |
85 | # 12. File Browser
86 | file_browser = mo.ui.file_browser(label="Browse files")
87 | file_browser
88 |
89 | # 13. Form
90 | form = mo.ui.text(label="Enter your name").form()
91 | form
92 |
93 | # 14. Microphone
94 | microphone = mo.ui.microphone(label="Record audio")
95 | microphone
96 |
97 | # 15. Multiselect
98 | multiselect = mo.ui.multiselect(options=['A', 'B', 'C', 'D'], label="Select multiple options")
99 | multiselect
100 |
101 | # 16. Number
102 | number_picker = mo.ui.number(0, 10, step=0.5, label="Select a number")
103 | number_picker
104 |
105 | # 17. Radio
106 | radio_group = mo.ui.radio(options=['Red', 'Green', 'Blue'], label="Select a color")
107 | radio_group
108 |
109 | # 18. Range Slider
110 | range_slider = mo.ui.range_slider(0, 100, step=5, value=[20, 80], label="Select a range")
111 | range_slider
112 |
113 | # 19. Refresh
114 | refresh_button = mo.ui.refresh(default_interval="5s", label="Refresh")
115 | refresh_button
116 |
117 | # 20. Run Button
118 | run_button = mo.ui.run_button(label="Run")
119 | run_button
120 |
121 | # 21. Slider
122 | slider = mo.ui.slider(0, 100, step=1, label="Adjust value")
123 | slider
124 |
125 | # 22. Switch
126 | switch = mo.ui.switch(label="Enable feature")
127 | switch
128 |
129 | # 23. Table
130 | table_data = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]
131 | table = mo.ui.table(data=table_data, label="User Table")
132 | table
133 |
134 | # 24. Tabs
135 | tab1_content = mo.md("Content for Tab 1")
136 | tab2_content = mo.ui.slider(0, 10)
137 | tabs = mo.ui.tabs({'Tab 1': tab1_content, 'Tab 2': tab2_content})
138 | tabs
139 |
140 | # 25. Text
141 | text_input = mo.ui.text(placeholder="Enter some text", label="Text Input")
142 | text_input
143 |
144 | # 26. Text Area
145 | text_area = mo.ui.text_area(placeholder="Enter a long text", label="Text Area")
146 | text_area
147 |
148 | # 27. Custom UI elements (Anywidget)
149 | # See the documentation on Anywidget for examples.
150 |
151 | # Layouts
152 |
153 | # 1. Accordion
154 | accordion = mo.ui.accordion({'Section 1': mo.md("This is section 1"), 'Section 2': mo.ui.slider(0, 10)})
155 | accordion
156 |
157 | # 2. Carousel
158 | carousel = mo.carousel([mo.md("Item 1"), mo.ui.slider(0, 10), mo.md("Item 3")])
159 | carousel
160 |
161 | # 3. Callout
162 | callout = mo.md("Important message!").callout(kind="warn")
163 | callout
164 |
165 | # 4. Justify
166 |
167 | # Center
168 | centered_text = mo.md("This text is centered").center()
169 | centered_text
170 |
171 | # Left
172 | left_aligned_text = mo.md("This text is left aligned").left()
173 | left_aligned_text
174 |
175 | # Right
176 | right_aligned_text = mo.md("This text is right aligned").right()
177 | right_aligned_text
178 |
179 | # 5. Lazy
180 | def lazy_content():
181 | mo.md("This content loaded lazily!")
182 |
183 | lazy_element = mo.lazy(lazy_content)
184 | lazy_element
185 |
186 | # 6. Plain
187 | plain_dataframe = mo.plain(df)
188 | plain_dataframe
189 |
190 | # 7. Routes
191 | def home_page():
192 | return mo.md("# Home Page")
193 |
194 | def about_page():
195 | return mo.md("# About Page")
196 |
197 | mo.routes({
198 | "#/": home_page,
199 | "#/about": about_page
200 | })
201 |
202 | # 8. Sidebar
203 | sidebar_content = mo.vstack([mo.md("## Menu"), mo.ui.button(label="Home"), mo.ui.button(label="About")])
204 | mo.sidebar(sidebar_content)
205 |
206 | # 9. Stacks
207 |
208 | # Horizontal Stack
209 | hstack_layout = mo.hstack([mo.md("Left"), mo.ui.slider(0, 10), mo.md("Right")])
210 | hstack_layout
211 |
212 | # Vertical Stack
213 | vstack_layout = mo.vstack([mo.md("Top"), mo.ui.slider(0, 10), mo.md("Bottom")])
214 | vstack_layout
215 |
216 | # 10. Tree
217 | tree_data = ['Item 1', ['Subitem 1.1', 'Subitem 1.2'], {'Key': 'Value'}]
218 | tree = mo.tree(tree_data)
219 | tree
220 |
221 | # Plotting
222 |
223 | # Reactive charts with Altair
224 | altair_chart = mo.ui.altair_chart(alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon', color='Origin'))
225 | altair_chart
226 |
227 | # Reactive plots with Plotly
228 | plotly_chart = mo.ui.plotly(px.scatter(data.cars(), x="Horsepower", y="Miles_per_Gallon", color="Origin"))
229 | plotly_chart
230 |
231 | # Interactive matplotlib
232 | plt.plot([1, 2, 3, 4])
233 | interactive_mpl_chart = mo.mpl.interactive(plt.gcf())
234 | interactive_mpl_chart
235 |
236 | # Media
237 |
238 | # 1. Image
239 | image = mo.image("https://marimo.io/logo.png", width=100, alt="Marimo Logo")
240 | image
241 |
242 | # 2. Audio
243 | audio = mo.audio("https://www.zedge.net/find/ringtones/ocean%20waves")
244 | audio
245 |
246 | # 3. Video
247 | video = mo.video("https://www.youtube.com/watch?v=dQw4w9WgXcQ", controls=True)
248 | video
249 |
250 | # 4. PDF
251 | pdf = mo.pdf("https://www.africau.edu/images/default/sample.pdf", width="50%")
252 | pdf
253 |
254 | # 5. Download Media
255 | download_button = mo.download(data="This is the content of the file", filename="download.txt")
256 | download_button
257 |
258 | # 6. Plain text
259 | plain_text = mo.plain_text("This is plain text")
260 | plain_text
261 |
262 | # Diagrams
263 |
264 | # 1. Mermaid diagrams
265 | mermaid_code = """
266 | graph LR
267 | A[Square Rect] -- Link text --> B((Circle))
268 | A --> C(Round Rect)
269 | B --> D{Rhombus}
270 | C --> D
271 | """
272 | mermaid_diagram = mo.mermaid(mermaid_code)
273 | mermaid_diagram
274 |
275 | # 2. Statistic cards
276 | stat_card = mo.stat(value=100, label="Users", caption="Total users this month", direction="increase")
277 | stat_card
278 |
279 | # Status
280 |
281 | # 1. Progress bar
282 | for i in mo.status.progress_bar(range(10), title="Processing"):
283 | # Simulate some work
284 | pass
285 |
286 | # 2. Spinner
287 | with mo.status.spinner(title="Loading...", subtitle="Please wait"):
288 | # Simulate a long-running task
289 | pass
290 |
291 | # Outputs
292 |
293 | # 1. Replace output
294 | mo.output.replace(mo.md("This is the new output"))
295 |
296 | # 2. Append output
297 | mo.output.append(mo.md("This is appended output"))
298 |
299 | # 3. Clear output
300 | mo.output.clear()
301 |
302 | # 4. Replace output at index
303 | mo.output.replace_at_index(mo.md("Replaced output"), 0)
304 |
305 | # Display cell code
306 | mo.show_code(mo.md("This output has code displayed"))
307 |
308 | # Control Flow
309 |
310 | # Stop execution
311 | user_age = mo.ui.number(0, 100, label="Enter your age")
312 | mo.stop(user_age.value < 18, mo.md("You must be 18 or older"))
313 | mo.md(f"Your age is: {user_age.value}")
314 |
315 | # HTML
316 |
317 | # 1. Convert to HTML
318 | html_object = mo.as_html(mo.md("This is markdown converted to HTML"))
319 | html_object
320 |
321 | # 2. Html object
322 | custom_html = mo.Html("This is custom HTML")
323 | custom_html
324 |
325 | # Other API components
326 |
327 | # Query Parameters
328 | params = mo.query_params()
329 | params['name'] = 'John'
330 |
331 | # Command Line Arguments
332 | args = mo.cli_args()
333 |
334 | # State
335 | get_count, set_count = mo.state(0)
336 | mo.ui.button(on_click=lambda _: set_count(get_count() + 1), label="Increment")
337 | mo.md(f"Count: {get_count()}")
338 |
339 | # App
340 | # See documentation for embedding notebooks
341 |
342 | # Cell
343 | # See documentation for running cells from other notebooks
344 |
345 | # Miscellaneous
346 | is_running_in_notebook = mo.running_in_notebook()
347 |
348 | --- Guides
349 |
350 | ## Marimo Guides: Concise Examples
351 |
352 | Here are concise examples for each guide in the Marimo documentation:
353 |
354 | ### 1. Overview
355 |
356 | ```python
357 | import marimo as mo
358 |
359 | # Define a variable
360 | x = 10
361 |
362 | # Display markdown with variable interpolation
363 | mo.md(f"The value of x is {x}")
364 |
365 | # Create a slider and display its value reactively
366 | slider = mo.ui.slider(0, 100, value=50)
367 | mo.md(f"Slider value: {slider.value}")
368 | ```
369 |
370 | ### 2. Reactivity
371 |
372 | ```python
373 | import marimo as mo
374 |
375 | # Define a variable in one cell
376 | data = [1, 2, 3, 4, 5]
377 |
378 | # Use the variable in another cell - this cell will rerun when `data` changes
379 | mo.md(f"The sum of the data is {sum(data)}")
380 | ```
381 |
382 | ### 3. Interactivity
383 |
384 | ```python
385 | import marimo as mo
386 |
387 | # Create a slider
388 | slider = mo.ui.slider(0, 10, label="Select a value")
389 |
390 | # Display the slider's value reactively
391 | mo.md(f"You selected: {slider.value}")
392 | ```
393 |
394 | ### 4. SQL
395 |
396 | ```python
397 | import marimo as mo
398 |
399 | # Create a dataframe
400 | df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})
401 |
402 | # Query the dataframe using SQL
403 | mo.sql("SELECT * FROM df WHERE age > 30")
404 | ```
405 |
406 | ### 5. Run as an app
407 |
408 | ```bash
409 | # Run a notebook as an interactive web app
410 | marimo run my_notebook.py
411 | ```
412 |
413 | ### 6. Run as a script
414 |
415 | ```bash
416 | # Execute a notebook as a Python script
417 | python my_notebook.py
418 | ```
419 |
420 | ### 7. Outputs
421 |
422 | ```python
423 | import marimo as mo
424 |
425 | # Display markdown
426 | mo.md("This is **markdown** output")
427 |
428 | # Display a matplotlib plot
429 | import matplotlib.pyplot as plt
430 | plt.plot([1, 2, 3, 4, 5])
431 | plt.show()
432 | ```
433 |
434 | ### 8. Dataframes
435 |
436 | ```python
437 | import marimo as mo
438 | import pandas as pd
439 |
440 | # Create a Pandas dataframe
441 | df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
442 |
443 | # Display the dataframe in an interactive table
444 | mo.ui.table(df)
445 | ```
446 |
447 | ### 9. Plotting
448 |
449 | ```python
450 | import marimo as mo
451 | import altair as alt
452 | from vega_datasets import data
453 |
454 | # Create a reactive Altair chart
455 | chart = alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon')
456 | chart = mo.ui.altair_chart(chart)
457 |
458 | # Display the chart and selected data
459 | mo.hstack([chart, chart.value])
460 | ```
461 |
462 | ### 10. Editor Features
463 |
464 | - Explore variable values and their definitions in the **Variables Panel**.
465 | - Visualize cell dependencies in the **Dependency Graph**.
466 | - Use **Go-to-Definition** to jump to variable declarations.
467 | - Enable **GitHub Copilot** for AI-powered code suggestions.
468 | - Customize **Hotkeys** and **Theming** in the settings.
469 |
470 | ### 11. Theming
471 |
472 | ```python
473 | # In your notebook.py file:
474 |
475 | app = marimo.App(css_file="custom.css")
476 | ```
477 |
478 | ### 12. Best Practices
479 |
480 | - Use global variables sparingly.
481 | - Encapsulate logic in functions and modules (see the sketch after this list).
482 | - Minimize mutations.
483 | - Write idempotent cells.
484 | - Use caching for expensive computations.
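
A minimal sketch of the first few practices in a single cell (the `summarize` helper is purely illustrative, not part of marimo): logic lives in a function, no globals are mutated, and re-running the cell always yields the same output.

```python
import marimo as mo

def summarize(values: list[int]) -> dict:
    # Pure function: nothing global is read or mutated, so the cell stays idempotent
    return {"count": len(values), "total": sum(values)}

stats = summarize([1, 2, 3, 4, 5])
mo.md(f"Count: {stats['count']}, Total: {stats['total']}")
```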
485 |
486 | ### 13. Coming from other Tools
487 |
488 | - Refer to guides for specific tools like Jupyter, Jupytext, Papermill, and Streamlit to understand the transition to Marimo.
489 |
490 | ### 14. Integrating with Marimo
491 |
492 | ```python
493 | import marimo as mo
494 |
495 | # Check if running in a Marimo notebook
496 | if mo.running_in_notebook():
497 | # Execute Marimo-specific code
498 | pass
499 | ```
500 |
501 | ### 15. Reactive State
502 |
503 | ```python
504 | import marimo as mo
505 |
506 | # Create reactive state
507 | get_count, set_count = mo.state(0)
508 |
509 | # Increment the counter on button click
510 | mo.ui.button(on_click=lambda _: set_count(get_count() + 1), label="Increment")
511 |
512 | # Display the counter value reactively
513 | mo.md(f"Count: {get_count()}")
514 | ```
515 |
516 | ### 16. Online Playground
517 |
518 | - Create and share Marimo notebooks online at [https://marimo.new](https://marimo.new).
519 |
520 | ### 17. Exporting
521 |
522 | ```bash
523 | # Export to HTML
524 | marimo export html my_notebook.py -o my_notebook.html
525 |
526 | # Export to Python script
527 | marimo export script my_notebook.py -o my_script.py
528 | ```
529 |
530 | ### 18. Configuration
531 |
532 | - Customize user-wide settings in `~/.marimo.toml`.
533 | - Configure notebook-specific settings in the `notebook.py` file.
534 |
535 | ### 19. Troubleshooting
536 |
537 | - Use the **Variables Panel** and **Dependency Graph** to debug cell execution issues.
538 | - Add `print` statements for debugging.
539 | - Try the "Lazy" runtime configuration for identifying stale cells.
540 |
541 | ### 20. Deploying
542 |
543 | ```bash
544 | # Deploy a Marimo notebook as an interactive web app
545 | marimo run my_notebook.py
546 | ```
547 |
548 | --- recipes
549 |
550 | ## Marimo Recipes: Concise Examples
551 |
552 | Here are concise examples of common tasks and concepts from the Marimo Recipes section:
553 |
554 | ### Control Flow
555 |
556 | #### 1. Show an output conditionally
557 |
558 | ```python
559 | import marimo as mo
560 |
561 | show_output = mo.ui.checkbox(label="Show output")
562 | mo.md("This output is visible!") if show_output.value else None
563 | ```
564 |
565 | #### 2. Run a cell on a timer
566 |
567 | ```python
568 | import marimo as mo
569 | import time
570 |
571 | refresh = mo.ui.refresh(default_interval="1s")
572 | refresh
573 |
574 | # This cell will run every second
575 | refresh
576 | mo.md(f"Current time: {time.time()}")
577 | ```
578 |
579 | #### 3. Require form submission before sending UI value
580 |
581 | ```python
582 | import marimo as mo
583 |
584 | form = mo.ui.text(label="Your name").form()
585 | form
586 |
587 | # This cell will only run after form submission
588 | mo.stop(form.value is None, mo.md("Please submit the form."))
589 | mo.md(f"Hello, {form.value}!")
590 | ```
591 |
592 | #### 4. Stop execution of a cell and its descendants
593 |
594 | ```python
595 | import marimo as mo
596 |
597 | should_continue = mo.ui.checkbox(label="Continue?")
598 |
599 | # Stop execution if the checkbox is not checked
600 | mo.stop(not should_continue.value, mo.md("Execution stopped."))
601 |
602 | # This code will only run if the checkbox is checked
603 | mo.md("Continuing execution...")
604 | ```
605 |
606 | ### Grouping UI Elements Together
607 |
608 | #### 1. Create an array of UI elements
609 |
610 | ```python
611 | import marimo as mo
612 |
613 | n_sliders = mo.ui.number(1, 5, value=3, label="Number of sliders")
614 | sliders = mo.ui.array([mo.ui.slider(0, 100) for _ in range(n_sliders.value)])
615 | mo.hstack(sliders)
616 |
617 | # Access slider values
618 | mo.md(f"Slider values: {sliders.value}")
619 | ```
620 |
621 | #### 2. Create a dictionary of UI elements
622 |
623 | ```python
624 | import marimo as mo
625 |
626 | elements = mo.ui.dictionary({
627 |     'name': mo.ui.text(label="Name"),
628 | 'age': mo.ui.number(0, 100, label="Age")
629 | })
630 |
631 | # Access element values
632 | mo.md(f"Name: {elements['name'].value}, Age: {elements['age'].value}")
633 | ```
634 |
635 | #### 3. Embed a dynamic number of UI elements in another output
636 |
637 | ```python
638 | import marimo as mo
639 |
640 | n_items = mo.ui.number(1, 5, value=3, label="Number of items")
641 | items = mo.ui.array([mo.ui.text(placeholder=f"Item {i+1}") for i in range(n_items.value)])
642 |
643 | mo.md(f"""
644 | **My List:**
645 |
646 | * {items[0]}
647 | * {items[1]}
648 | * {items[2]}
649 | """)
650 | ```
651 |
652 | #### 4. Create a `hstack` (or `vstack`) of UI elements with `on_change` handlers
653 |
654 | ```python
655 | import marimo as mo
656 |
657 | def handle_click(value, index):
658 | mo.md(f"Button {index} clicked!")
659 |
660 | buttons = mo.ui.array(
661 | [mo.ui.button(label=f"Button {i}", on_change=lambda v, i=i: handle_click(v, i))
662 | for i in range(3)]
663 | )
664 | mo.hstack(buttons)
665 | ```
666 |
667 | #### 5. Create a table column of buttons with `on_change` handlers
668 |
669 | ```python
670 | import marimo as mo
671 |
672 | def handle_click(value, row_index):
673 | mo.md(f"Button clicked for row {row_index}")
674 |
675 | buttons = mo.ui.array(
676 | [mo.ui.button(label="Click me", on_change=lambda v, i=i: handle_click(v, i))
677 | for i in range(3)]
678 | )
679 |
680 | mo.ui.table({
681 | 'Name': ['Alice', 'Bob', 'Charlie'],
682 | 'Action': buttons
683 | })
684 | ```
685 |
686 | #### 6. Create a form with multiple UI elements
687 |
688 | ```python
689 | import marimo as mo
690 |
691 | form = mo.md(
692 | """
693 | **User Details**
694 |
695 | Name: {name}
696 | Age: {age}
697 | """
698 | ).batch(
699 | name=mo.ui.text(label="Name"),
700 | age=mo.ui.number(0, 100, label="Age")
701 | ).form()
702 | form
703 |
704 | # Access form values after submission
705 | mo.md(f"Name: {form.value['name']}, Age: {form.value['age']}")
706 | ```
707 |
708 | ### Working with Buttons
709 |
710 | #### 1. Create a button that triggers computation when clicked
711 |
712 | ```python
713 | import marimo as mo
714 | import random
715 |
716 | run_button = mo.ui.run_button(label="Generate Random Number")
717 | run_button
718 |
719 | # This cell only runs when the button is clicked
720 | mo.stop(not run_button.value, "Click 'Generate' to get a random number")
721 | mo.md(f"Random number: {random.randint(0, 100)}")
722 | ```
723 |
724 | #### 2. Create a counter button
725 |
726 | ```python
727 | import marimo as mo
728 |
729 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Count")
730 | counter_button
731 |
732 | # Display the count
733 | mo.md(f"Count: {counter_button.value}")
734 | ```
735 |
736 | #### 3. Create a toggle button
737 |
738 | ```python
739 | import marimo as mo
740 |
741 | toggle_button = mo.ui.button(value=False, on_click=lambda state: not state, label="Toggle")
742 | toggle_button
743 |
744 | # Display the toggle state
745 | mo.md(f"State: {'On' if toggle_button.value else 'Off'}")
746 | ```
747 |
748 | #### 4. Re-run a cell when a button is pressed
749 |
750 | ```python
751 | import marimo as mo
752 | import random
753 |
754 | refresh_button = mo.ui.button(label="Refresh")
755 | refresh_button
756 |
757 | # This cell reruns when the button is clicked
758 | refresh_button
759 | mo.md(f"Random number: {random.randint(0, 100)}")
760 | ```
761 |
762 | #### 5. Run a cell when a button is pressed, but not before
763 |
764 | ```python
765 | import marimo as mo
766 |
767 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Click to Continue")
768 | counter_button
769 |
770 | # Only run this cell after the button is clicked
771 | mo.stop(counter_button.value == 0, "Click the button to continue.")
772 | mo.md("You clicked the button!")
773 | ```
774 |
775 | #### 6. Reveal an output when a button is pressed
776 |
777 | ```python
778 | import marimo as mo
779 |
780 | show_button = mo.ui.button(label="Show Output")
781 | show_button
782 |
783 | # Reveal output only after button click
784 | mo.md("This is the hidden output!") if show_button.value else None
785 | ```
786 |
787 | ### Caching
788 |
789 | #### 1. Cache expensive computations
790 |
791 | ```python
792 | import marimo as mo
793 | import functools
794 | import time
795 |
796 | @functools.cache
797 | def expensive_function(x):
798 | time.sleep(2) # Simulate a long computation
799 | return x * 2
800 |
801 | # Call the function multiple times with the same argument
802 | result1 = expensive_function(5)
803 | result2 = expensive_function(5) # This will be retrieved from the cache
804 |
805 | mo.md(f"Result 1: {result1}, Result 2: {result2}")
806 | ```
807 |
808 | These concise examples provide practical illustrations of various recipes, showcasing how Marimo can be used to create interactive, dynamic, and efficient notebooks.
809 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Marimo Reactive Notebook Prompt Library
2 | > Starter codebase to use Marimo reactive notebooks to build a reusable, customizable, Prompt Library.
3 | >
4 | > Take this codebase and use it as a starter codebase to build your own personal prompt library.
5 | >
6 | > Marimo reactive notebooks & Prompt Library [walkthrough](https://youtu.be/PcLkBkQujMI)
7 | >
8 | > Run multiple prompts against multiple models (SLMs & LLMs) [walkthrough](https://youtu.be/VC6QCEXERpU)
9 |
10 |
11 |
12 |
13 |
14 | ## 1. Understand Marimo Notebook
15 | > This is a simple demo of the Marimo Reactive Notebook
16 | - Install hyper modern [UV Python Package and Project](https://docs.astral.sh/uv/getting-started/installation/)
17 | - Install dependencies `uv sync`
18 | - Install marimo `uv pip install marimo`
19 | - To Edit, Run `uv run marimo edit marimo_is_awesome_demo.py`
20 | - To View, Run `uv run marimo run marimo_is_awesome_demo.py`
21 | - Then use your favorite IDE & AI Coding Assistant to edit the `marimo_is_awesome_demo.py` directly or via the UI.
22 |
23 | ## 2. Ad-hoc Prompt Notebook
24 | > Quickly run and test prompts across models
25 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`)
26 | - Add other keys and update the notebook to add support for additional SOTA LLMs
27 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
28 | - Update the notebook to use Ollama models you have installed
29 | - To Edit, Run `uv run marimo edit adhoc_prompting.py`
30 | - To View, Run `uv run marimo run adhoc_prompting.py`
31 |
32 | ## 3. ⭐️ Prompt Library Notebook
33 | > Build, Manage, Reuse, Version, and Iterate on your Prompt Library
34 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`)
35 | - Add other keys and update the notebook to add support for additional SOTA LLMs
36 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
37 | - Update the notebook to use Ollama models you have installed
38 | - To Edit, Run `uv run marimo edit prompt_library.py`
39 | - To View, Run `uv run marimo run prompt_library.py`
40 |
41 | ## 4. Multi-LLM Prompt
42 | > Quickly test a single prompt across multiple language models
43 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to use
44 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
45 | - Update the notebook to use Ollama models you have installed
46 | - To Edit, Run `uv run marimo edit multi_llm_prompting.py`
47 | - To View, Run `uv run marimo run multi_llm_prompting.py`
48 |
49 | ## 5. Multi Language Model Ranker
50 | > Compare and rank multiple language models across various prompts
51 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to compare
52 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use
53 | - Update the notebook to use Ollama models you have installed
54 | - To Edit, Run `uv run marimo edit multi_language_model_ranker.py`
55 | - To View, Run `uv run marimo run multi_language_model_ranker.py`
56 |
57 | ## General Usage
58 | > See the [Marimo Docs](https://docs.marimo.io/index.html) for general usage details
59 |
60 | ## Personal Prompt Library Use-Cases
61 | - Ad-hoc prompting
62 | - Prompt reuse
63 | - Prompt versioning
64 | - Interactive prompts
65 | - Prompt testing & Benchmarking
66 | - LLM comparison
67 | - Prompt templating
68 | - Run a single prompt against multiple LLMs & SLMs
69 | - Compare multiple prompts across multiple LLMs & SLMs
70 | - Anything you can imagine!
71 |
72 | ## Advantages of Marimo
73 |
74 | ### Key Advantages
75 | > Rapid Prototyping: Seamlessly transition between user and builder mode with `cmd+.` to toggle. Consumer vs Producer. UI vs Code.
76 |
77 | > Interactivity: Built-in reactive UI elements enable intuitive data exploration and visualization.
78 |
79 | > Reactivity: Cells automatically update when dependencies change, ensuring a smooth and efficient workflow.
80 |
81 | > Out of the box: Use sliders, textareas, buttons, images, dataframe GUIs, plotting, and other interactive elements to quickly iterate on ideas.
82 |
83 | > It's 'just' Python: Pure Python scripts for easy version control and AI coding.
84 |
85 |
86 | - **Reactive Execution**: Run one cell, and marimo automatically updates all affected cells. This eliminates the need to manually manage notebook state (see the sketch after this list).
87 | - **Interactive Elements**: Provides reactive UI elements like dataframe GUIs and plots, making data exploration fast and intuitive.
88 | - **Python-First Design**: Notebooks are pure Python scripts stored as `.py` files. They can be versioned with git, run as scripts, and imported into other Python code.
89 | - **Reproducible by Default**: Deterministic execution order with no hidden state ensures consistent and reproducible results.
90 | - **Built for Collaboration**: Git-friendly notebooks where small changes yield small diffs, facilitating collaboration.
91 | - **Developer-Friendly Features**: Includes GitHub Copilot, autocomplete, hover tooltips, vim keybindings, code formatting, debugging panels, and extensive hotkeys.
92 | - **Seamless Transition to Production**: Notebooks can be run as scripts or deployed as read-only web apps.
93 | - **Versatile Use Cases**: Ideal for experimenting with data and models, building internal tools, communicating research, education, and creating interactive dashboards.
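
A minimal sketch of that reactive flow (cell layout and values are illustrative, not taken from this repo): the second cell reads `x`, so marimo re-runs it automatically whenever the first cell changes.

```python
import marimo

app = marimo.App()


@app.cell
def __():
    x = 10  # edit this value; dependent cells update automatically
    return (x,)


@app.cell
def __(x):
    import marimo as mo

    mo.md(f"x squared is {x ** 2}")  # re-runs whenever `x` changes
    return (mo,)


if __name__ == "__main__":
    app.run()
```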
94 |
95 | ### Advantages Over Jupyter Notebooks
96 |
97 | - **Reactive Notebook**: Automatically updates dependent cells when code or values change, unlike Jupyter where cells must be manually re-executed.
98 | - **Pure Python Notebooks**: Stored as `.py` files instead of JSON, making them easier to version control, lint, and integrate with Python tooling.
99 | - **No Hidden State**: Deleting a cell removes its variables and updates affected cells, reducing errors from stale variables.
100 | - **Better Git Integration**: Plain Python scripts result in smaller diffs and more manageable version control compared to Jupyter's JSON format.
101 | - **Import Symbols**: Allows importing symbols from notebooks into other notebooks or Python files.
102 | - **Enhanced Interactivity**: Built-in reactive UI elements provide a more interactive experience than standard Jupyter widgets.
103 | - **App Deployment**: Notebooks can be served as web apps or exported to static HTML for easier sharing and deployment.
104 | - **Advanced Developer Tools**: Features like code formatting, GitHub Copilot integration, and debugging panels enhance the development experience.
105 | - **Script Execution**: Can be executed as standard Python scripts, facilitating integration into pipelines and scripts without additional tools (see the sketch below).
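
As a concrete illustration (a minimal sketch; `my_notebook.py` is a placeholder name), the `app.run()` footer that marimo writes at the bottom of every notebook, including the notebooks in this repo, is what makes plain `python` execution work:

```python
import marimo

app = marimo.App()


@app.cell
def __():
    print("Running as a plain Python script")
    return


if __name__ == "__main__":
    app.run()  # runs all cells in dependency order: `python my_notebook.py`
```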
106 |
107 | ## Resources
108 | - https://docs.astral.sh/uv/
109 | - https://docs.marimo.io/index.html
110 | - https://youtu.be/PcLkBkQujMI
111 | - https://github.com/BuilderIO/gpt-crawler
112 | - https://github.com/simonw/llm
113 | - https://ollama.com/
114 | - https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
115 | - https://qwenlm.github.io/
--------------------------------------------------------------------------------
/adhoc_prompting.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="medium")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import marimo as mo
10 | from src.marimo_notebook.modules import llm_module
11 | import json
12 | return json, llm_module, mo
13 |
14 |
15 | @app.cell
16 | def __(llm_module):
17 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series()
18 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest()
19 | # llm_sonnet = llm_module.build_sonnet_3_5()
20 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo()
21 |
22 | models = {
23 | "o1-mini": llm_o1_mini,
24 | "o1-preview": llm_o1_preview,
25 | "gpt-4o-latest": llm_gpt_4o_latest,
26 | "gpt-4o-mini": llm_gpt_4o_mini,
27 | # "sonnet-3.5": llm_sonnet,
28 | # "gemini-1-5-pro": gemini_1_5_pro,
29 | # "gemini-1-5-flash": gemini_1_5_flash,
30 | }
31 | return (
32 | llm_gpt_4o_latest,
33 | llm_gpt_4o_mini,
34 | llm_o1_mini,
35 | llm_o1_preview,
36 | models,
37 | )
38 |
39 |
40 | @app.cell
41 | def __(mo, models):
42 | prompt_text_area = mo.ui.text_area(label="Prompt", full_width=True)
43 | prompt_temp_slider = mo.ui.slider(
44 | start=0, stop=1, value=0.5, step=0.05, label="Temp"
45 | )
46 | model_dropdown = mo.ui.dropdown(
47 | options=models.copy(),
48 | label="Model",
49 | value="gpt-4o-mini",
50 | )
51 | multi_model_checkbox = mo.ui.checkbox(label="Run on All Models", value=False)
52 |
53 | form = (
54 | mo.md(
55 | r"""
56 | # Ad-hoc Prompt
57 | {prompt}
58 | {temp}
59 | {model}
60 | {multi_model}
61 | """
62 | )
63 | .batch(
64 | prompt=prompt_text_area,
65 | temp=prompt_temp_slider,
66 | model=model_dropdown,
67 | multi_model=multi_model_checkbox,
68 | )
69 | .form()
70 | )
71 | form
72 | return (
73 | form,
74 | model_dropdown,
75 | multi_model_checkbox,
76 | prompt_temp_slider,
77 | prompt_text_area,
78 | )
79 |
80 |
81 | @app.cell
82 | def __(form, mo):
83 | mo.stop(not form.value or not len(form.value), "")
84 |
85 | # Format the form data for the table
86 | formatted_data = {}
87 | for key, value in form.value.items():
88 | if key == "model":
89 | formatted_data[key] = value.model_id
90 | elif key == "multi_model":
91 | formatted_data[key] = value
92 | else:
93 | formatted_data[key] = value
94 |
95 | # Create and display the table
96 | table = mo.ui.table(
97 | [formatted_data], # Wrap in a list to create a single-row table
98 | label="",
99 | selection=None,
100 | )
101 |
102 | mo.md(f"# Form Values\n\n{table}")
103 | return formatted_data, key, table, value
104 |
105 |
106 | @app.cell
107 | def __(form, llm_module, mo):
108 | mo.stop(not form.value or form.value["multi_model"], "")
109 |
110 | prompt_response = None
111 |
112 | with mo.status.spinner(title="Loading..."):
113 | prompt_response = llm_module.prompt_with_temp(
114 | form.value["model"], form.value["prompt"], form.value["temp"]
115 | )
116 |
117 | mo.md(f"# Prompt Output\n\n{prompt_response}").style(
118 | {"background": "#eee", "padding": "10px", "border-radius": "10px"}
119 | )
120 | return (prompt_response,)
121 |
122 |
123 | @app.cell
124 | def __(form, llm_module, mo, models):
125 | prompt_responses = []
126 |
127 | mo.stop(not form.value or not form.value["multi_model"], "")
128 |
129 | with mo.status.spinner(title="Running prompts on all models..."):
130 | for model_name, model in models.items():
131 | response = llm_module.prompt_with_temp(
132 | model, form.value["prompt"], form.value["temp"]
133 | )
134 | prompt_responses.append(
135 | {
136 | "model_id": model_name,
137 | "output": response,
138 | }
139 | )
140 | return model, model_name, prompt_responses, response
141 |
142 |
143 | @app.cell
144 | def __(mo, prompt_responses):
145 | mo.stop(not len(prompt_responses), "")
146 |
147 | # Create a table using mo.ui.table
148 | multi_model_table = mo.ui.table(
149 | prompt_responses, label="Multi-Model Prompt Outputs", selection=None
150 | )
151 |
152 | mo.vstack(
153 | [
154 | mo.md("# Multi-Model Prompt Outputs"),
155 | mo.ui.table(prompt_responses, selection=None),
156 | ]
157 | )
158 | return (multi_model_table,)
159 |
160 |
161 | if __name__ == "__main__":
162 | app.run()
163 |
--------------------------------------------------------------------------------
/images/marimo_prompt_library.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/images/marimo_prompt_library.png
--------------------------------------------------------------------------------
/images/multi_slm_llm_prompt_and_model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/images/multi_slm_llm_prompt_and_model.png
--------------------------------------------------------------------------------
/language_model_rankings/rankings.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "llm_model_id": "gpt-4o-mini",
4 | "score": 2
5 | },
6 | {
7 | "llm_model_id": "llama3.2:latest",
8 | "score": 2
9 | },
10 | {
11 | "llm_model_id": "gemini-1.5-flash-002",
12 | "score": 2
13 | }
14 | ]
--------------------------------------------------------------------------------
/layouts/adhoc_prompting.grid.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "grid",
3 | "data": {
4 | "columns": 24,
5 | "rowHeight": 20,
6 | "maxWidth": 1400,
7 | "bordered": true,
8 | "cells": [
9 | {
10 | "position": null
11 | },
12 | {
13 | "position": null
14 | },
15 | {
16 | "position": [
17 | 1,
18 | 6,
19 | 11,
20 | 18
21 | ]
22 | },
23 | {
24 | "position": [
25 | 1,
26 | 25,
27 | 11,
28 | 10
29 | ]
30 | },
31 | {
32 | "position": [
33 | 13,
34 | 1,
35 | 10,
36 | 82
37 | ]
38 | },
39 | {
40 | "position": null
41 | },
42 | {
43 | "position": null
44 | }
45 | ]
46 | }
47 | }
--------------------------------------------------------------------------------
/layouts/adhoc_prompting.slides.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "slides",
3 | "data": {}
4 | }
--------------------------------------------------------------------------------
/layouts/multi_language_model_ranker.grid.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "grid",
3 | "data": {
4 | "columns": 22,
5 | "rowHeight": 20,
6 | "maxWidth": 2000,
7 | "bordered": true,
8 | "cells": [
9 | {
10 | "position": null
11 | },
12 | {
13 | "position": null
14 | },
15 | {
16 | "position": null
17 | },
18 | {
19 | "position": null
20 | },
21 | {
22 | "position": null
23 | },
24 | {
25 | "position": [
26 | 0,
27 | 0,
28 | 10,
29 | 13
30 | ]
31 | },
32 | {
33 | "position": null
34 | },
35 | {
36 | "position": null
37 | },
38 | {
39 | "position": [
40 | 10,
41 | 0,
42 | 11,
43 | 31
44 | ]
45 | },
46 | {
47 | "position": null
48 | },
49 | {
50 | "position": [
51 | 0,
52 | 13,
53 | 10,
54 | 18
55 | ]
56 | },
57 | {
58 | "position": null
59 | },
60 | {
61 | "position": null
62 | }
63 | ]
64 | }
65 | }
--------------------------------------------------------------------------------
/layouts/multi_llm_prompting.grid.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "grid",
3 | "data": {
4 | "columns": 24,
5 | "rowHeight": 20,
6 | "maxWidth": 1400,
7 | "bordered": true,
8 | "cells": [
9 | {
10 | "position": null
11 | },
12 | {
13 | "position": null
14 | },
15 | {
16 | "position": [
17 | 0,
18 | 0,
19 | 10,
20 | 16
21 | ]
22 | },
23 | {
24 | "position": null
25 | },
26 | {
27 | "position": [
28 | 10,
29 | 0,
30 | 14,
31 | 25
32 | ]
33 | },
34 | {
35 | "position": [
36 | 2,
37 | 17,
38 | 6,
39 | 6
40 | ]
41 | }
42 | ]
43 | }
44 | }
--------------------------------------------------------------------------------
/layouts/prompt_library.grid.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "grid",
3 | "data": {
4 | "columns": 24,
5 | "rowHeight": 15,
6 | "maxWidth": 1200,
7 | "bordered": true,
8 | "cells": [
9 | {
10 | "position": null
11 | },
12 | {
13 | "position": null
14 | },
15 | {
16 | "position": [
17 | 1,
18 | 2,
19 | 6,
20 | 30
21 | ]
22 | },
23 | {
24 | "position": [
25 | 1,
26 | 34,
27 | 6,
28 | 13
29 | ]
30 | },
31 | {
32 | "position": [
33 | 8,
34 | 2,
35 | 11,
36 | 51
37 | ],
38 | "scrollable": true,
39 | "side": "left"
40 | }
41 | ]
42 | }
43 | }
--------------------------------------------------------------------------------
/layouts/prompt_library.slides.json:
--------------------------------------------------------------------------------
1 | {
2 | "type": "slides",
3 | "data": {}
4 | }
--------------------------------------------------------------------------------
/marimo_is_awesome_demo.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="full")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import random
10 | import marimo as mo
11 | import pandas as pd
12 | import matplotlib.pyplot as plt
13 | from vega_datasets import data
14 | import io
15 | import altair as alt
16 | return alt, data, io, mo, pd, plt, random
17 |
18 |
19 | @app.cell
20 | def __(mo):
21 | mo.md(
22 | """
23 | # Marimo Awesome Examples
24 |
25 | This notebook demonstrates various features and capabilities of Marimo. Explore the different sections to see how Marimo can be used for interactive data analysis, visualization, and more!
26 |
27 | ---
28 | """
29 | )
30 | return
31 |
32 |
33 | @app.cell
34 | def __(mo):
35 | mo.md(
36 | """
37 | ## 1. Basic UI Elements
38 |
39 | ---
40 | """
41 | )
42 | return
43 |
44 |
45 | @app.cell
46 | def __(mo):
47 | slider = mo.ui.slider(1, 10, value=5, label="Slider Example")
48 | checkbox = mo.ui.checkbox(label="Checkbox Example")
49 | text_input = mo.ui.text(placeholder="Enter text here", label="Text Input Example")
50 |
51 | mo.vstack([slider, checkbox, text_input])
52 | return checkbox, slider, text_input
53 |
54 |
55 | @app.cell
56 | def __(checkbox, mo, slider, text_input):
57 | mo.md(
58 | f"""
59 | Slider value: {slider.value}
60 | Checkbox state: {checkbox.value}
61 | Text input: {text_input.value}
62 | Slider * Text input: {slider.value * "⭐️"}
63 | """
64 | )
65 | return
66 |
67 |
68 | @app.cell
69 | def __(mo):
70 | mo.md(
71 | """
72 | ## 2. Reactive Data Visualization
73 | ---
74 | """
75 | )
76 | return
77 |
78 |
79 | @app.cell
80 | def __(mo, pd):
81 | # Create a sample dataset
82 | sample_df = pd.DataFrame(
83 | {"x": range(1, 11), "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
84 | )
85 |
86 | plot_type = mo.ui.dropdown(
87 | options=["scatter", "line", "bar"], value="scatter", label="Select Plot Type"
88 | )
89 |
90 | mo.vstack(
91 | [
92 | plot_type,
93 | # mo.ui.table(sample_df, selection=None)
94 | ]
95 | )
96 | return plot_type, sample_df
97 |
98 |
99 | @app.cell
100 | def __(mo, plot_type, plt, sample_df):
101 | plt.figure(figsize=(10, 6))
102 |
103 | if plot_type.value == "scatter":
104 | plt.scatter(sample_df["x"], sample_df["y"])
105 | elif plot_type.value == "line":
106 | plt.plot(sample_df["x"], sample_df["y"])
107 | else:
108 | plt.bar(sample_df["x"], sample_df["y"])
109 |
110 | plt.xlabel("X")
111 | plt.ylabel("Y")
112 | plt.title(f"{plot_type.value.capitalize()} Plot")
113 | mo.mpl.interactive(plt.gcf())
114 | return
115 |
116 |
117 | @app.cell
118 | def __(mo):
119 | mo.md("""## 3. Conditional Output and Control Flow""")
120 | return
121 |
122 |
123 | @app.cell
124 | def __(mo):
125 | show_secret = mo.ui.checkbox(label="Show Secret Message")
126 | show_secret
127 | return (show_secret,)
128 |
129 |
130 | @app.cell
131 | def __(mo, show_secret):
132 | mo.stop(not show_secret.value, mo.md("Check the box to reveal the secret message!"))
133 | mo.md(
134 | "🎉 Congratulations! You've unlocked the secret message: Marimo is awesome! 🎉"
135 | )
136 | return
137 |
138 |
139 | @app.cell
140 | def __(mo):
141 | mo.md("""## 4. File Handling and Data Processing""")
142 | return
143 |
144 |
145 | @app.cell
146 | def __(mo):
147 | file_upload = mo.ui.file(label="Upload a CSV file")
148 | file_upload
149 | return (file_upload,)
150 |
151 |
152 | @app.cell
153 | def __(file_upload, io, mo, pd):
154 | mo.stop(
155 | not file_upload.value, mo.md("Please upload a CSV file to see the preview.")
156 | )
157 |
158 | uploaded_df = pd.read_csv(io.BytesIO(file_upload.value[0].contents))
159 | mo.md(f"### Uploaded File Preview")
160 | mo.ui.table(uploaded_df)
161 | return (uploaded_df,)
162 |
163 |
164 | @app.cell
165 | def __(mo):
166 | mo.md("""## 5. Advanced UI Components""")
167 | return
168 |
169 |
170 | @app.cell
171 | def __(mo, pd):
172 | accordion = mo.accordion(
173 | {
174 | "Section 1": mo.md("This is the content of section 1."),
175 | "Section 2": mo.ui.slider(0, 100, value=50, label="Nested Slider"),
176 | "Section 3": mo.ui.table(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})),
177 | }
178 | )
179 | accordion
180 | return (accordion,)
181 |
182 |
183 | @app.cell
184 | def __(mo):
185 | tabs = mo.ui.tabs(
186 | {
187 | "Tab 1": mo.md("Content of Tab 1"),
188 | "Tab 2": mo.ui.button(label="Click me!"),
189 | "Tab 3": mo.mermaid(
190 | """
191 | graph TD
192 | A[Start] --> B{Decision}
193 | B -->|Yes| C[Do Something]
194 | B -->|No| D[Do Nothing]
195 | C --> E[End]
196 | D --> E
197 | """
198 | ),
199 | }
200 | )
201 | tabs
202 | return (tabs,)
203 |
204 |
205 | @app.cell
206 | def __(mo):
207 | mo.md("""## 6. Batch Operations and Forms""")
208 | return
209 |
210 |
211 | @app.cell
212 | def __(mo):
213 | user_form = (
214 | mo.md(
215 | """
216 | ### User Information Form
217 |
218 | First Name: {first_name}
219 | Last Name: {last_name}
220 | Age: {age}
221 | Email: {email}
222 | """
223 | )
224 | .batch(
225 | first_name=mo.ui.text(label="First Name"),
226 | last_name=mo.ui.text(label="Last Name"),
227 | age=mo.ui.number(start=0, stop=120, label="Age"),
228 | email=mo.ui.text(label="Email"),
229 | )
230 | .form()
231 | )
232 |
233 | user_form
234 | return (user_form,)
235 |
236 |
237 | @app.cell
238 | def __(mo, user_form):
239 | mo.stop(
240 | not user_form.value.get("first_name"),
241 | mo.md("Please submit the form to see the results."),
242 | )
243 |
244 | mo.md(
245 | f"""
246 | ### Submitted Information
247 |
248 | - **First Name:** {user_form.value['first_name']}
249 | - **Last Name:** {user_form.value['last_name']}
250 | - **Age:** {user_form.value['age']}
251 | - **Email:** {user_form.value['email']}
252 | """
253 | )
254 | return
255 |
256 |
257 | @app.cell
258 | def __(mo):
259 | mo.md("""## 7. Embedding External Content""")
260 | return
261 |
262 |
263 | @app.cell
264 | def __(mo):
265 | mo.image("https://marimo.io/logo.png", width=200, alt="Marimo Logo")
266 | return
267 |
268 |
269 | @app.cell
270 | def __(mo):
271 | mo.video(
272 | "https://v3.cdnpk.net/videvo_files/video/free/2013-08/large_watermarked/hd0992_preview.mp4",
273 | width=560,
274 | height=315,
275 | )
276 | return
277 |
278 |
279 | @app.cell
280 | def __(mo):
281 | mo.md("""## 8. Custom Styling and Layouts""")
282 | return
283 |
284 |
285 | @app.cell
286 | def __(mo):
287 | styled_text = mo.md(
288 | """
289 | # Custom Styled Header
290 |
291 | This text has custom styling applied.
292 | """
293 | ).style(
294 | {
295 | "font-style": "italic",
296 | "background-color": "#aaa",
297 | "padding": "10px",
298 | "border-radius": "5px",
299 | },
300 | )
301 |
302 | styled_text
303 | return (styled_text,)
304 |
305 |
306 | @app.cell
307 | def __(mo):
308 | layout = mo.vstack(
309 | [
310 | mo.hstack(
311 | [
312 | mo.md("Left Column").style(
313 | {
314 | "background-color": "#e0e0e0",
315 | "padding": "10px",
316 | }
317 | ),
318 | mo.md("Right Column").style(
319 | {
320 | "background-color": "#d0d0d0",
321 | "padding": "10px",
322 | }
323 | ),
324 | ]
325 | ),
326 | mo.md("Bottom Row").style(
327 | {"background-color": "#c0c0c0", "padding": "10px"}
328 | ),
329 | ]
330 | )
331 |
332 | layout
333 | return (layout,)
334 |
335 |
336 | @app.cell
337 | def __(mo):
338 | mo.md(
339 | """
340 | ## 9. Interactive Data Exploration
341 | ---
342 | """
343 | )
344 | return
345 |
346 |
347 | @app.cell
348 | def __(data, mo):
349 | cars = data.cars()
350 | mo.ui.data_explorer(cars)
351 | return (cars,)
352 |
353 |
354 | @app.cell
355 | def __(alt, data, mo):
356 | chart = (
357 | alt.Chart(data.cars())
358 | .mark_circle()
359 | .encode(
360 | x="Horsepower",
361 | y="Miles_per_Gallon",
362 | color="Origin",
363 | tooltip=["Name", "Origin", "Horsepower", "Miles_per_Gallon"],
364 | )
365 | .interactive()
366 | )
367 |
368 | mo.ui.altair_chart(chart)
369 | return (chart,)
370 |
371 |
372 | @app.cell
373 | def __(mo):
374 | mo.md(
375 | """
376 | ## Conclusion
377 |
378 | This notebook has demonstrated various features and capabilities of Marimo. From basic UI elements to advanced data visualization and interactive components, Marimo provides a powerful toolkit for creating dynamic and engaging notebooks.
379 |
380 | Explore the code in each cell to learn more about how these examples were created!
381 | """
382 | )
383 | return
384 |
385 |
386 | if __name__ == "__main__":
387 | app.run()
388 |
--------------------------------------------------------------------------------
/multi_language_model_ranker.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="full")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import marimo as mo
10 | import src.marimo_notebook.modules.llm_module as llm_module
11 | import src.marimo_notebook.modules.prompt_library_module as prompt_library_module
12 | import json
13 | import pyperclip
14 | return json, llm_module, mo, prompt_library_module, pyperclip
15 |
16 |
17 | @app.cell
18 | def __(prompt_library_module):
19 | map_testable_prompts: dict = prompt_library_module.pull_in_testable_prompts()
20 | return (map_testable_prompts,)
21 |
22 |
23 | @app.cell
24 | def __(llm_module):
25 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series()
26 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest()
27 | # llm_sonnet = llm_module.build_sonnet_3_5()
28 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo()
29 | # gemini_1_5_pro_2, gemini_1_5_flash_2 = llm_module.build_gemini_1_2_002()
30 | # llama3_2_model, llama3_2_1b_model = llm_module.build_ollama_models()
31 | # _, phi3_5_model, qwen2_5_model = llm_module.build_ollama_slm_models()
32 |
33 | models = {
34 | "o1-mini": llm_o1_mini,
35 | "o1-preview": llm_o1_preview,
36 | "gpt-4o-latest": llm_gpt_4o_latest,
37 | "gpt-4o-mini": llm_gpt_4o_mini,
38 | # "sonnet-3.5": llm_sonnet,
39 | # "gemini-1-5-pro": gemini_1_5_pro,
40 | # "gemini-1-5-flash": gemini_1_5_flash,
41 | # "gemini-1-5-pro-002": gemini_1_5_pro_2,
42 | # "gemini-1-5-flash-002": gemini_1_5_flash_2,
43 | # "llama3-2": llama3_2_model,
44 | # "llama3-2-1b": llama3_2_1b_model,
45 | # "phi3-5": phi3_5_model,
46 | # "qwen2-5": qwen2_5_model,
47 | }
48 | return (
49 | llm_gpt_4o_latest,
50 | llm_gpt_4o_mini,
51 | llm_o1_mini,
52 | llm_o1_preview,
53 | models,
54 | )
55 |
56 |
57 | @app.cell
58 | def __(map_testable_prompts, mo, models):
59 | prompt_multiselect = mo.ui.multiselect(
60 | options=list(map_testable_prompts.keys()),
61 | label="Select Prompts",
62 | )
63 | prompt_temp_slider = mo.ui.slider(
64 | start=0, stop=1, value=0.5, step=0.05, label="Temp"
65 | )
66 | model_multiselect = mo.ui.multiselect(
67 | options=models.copy(),
68 | label="Models",
69 |         value=["gpt-4o-mini"],
70 | )
71 | return model_multiselect, prompt_multiselect, prompt_temp_slider
72 |
73 |
74 | @app.cell
75 | def __():
76 | prompt_style = {
77 | "background": "#eee",
78 | "padding": "10px",
79 | "border-radius": "10px",
80 | "margin-bottom": "20px",
81 | }
82 | return (prompt_style,)
83 |
84 |
85 | @app.cell
86 | def __(mo, model_multiselect, prompt_multiselect, prompt_temp_slider):
87 | form = (
88 | mo.md(
89 | r"""
90 | # Multi Language Model Ranker 📊
91 | {prompts}
92 | {temp}
93 | {models}
94 | """
95 | )
96 | .batch(
97 | prompts=prompt_multiselect,
98 | temp=prompt_temp_slider,
99 | models=model_multiselect,
100 | )
101 | .form()
102 | )
103 | form
104 | return (form,)
105 |
106 |
107 | @app.cell
108 | def __(form, map_testable_prompts, mo, prompt_style):
109 | mo.stop(not form.value)
110 |
111 | selected_models_string = mo.ui.array(
112 | [mo.ui.text(value=m.model_id, disabled=True) for m in form.value["models"]]
113 | )
114 |
115 | selected_prompts_accordion = mo.accordion(
116 | {
117 | prompt: mo.md(f"```xml\n{map_testable_prompts[prompt]}\n```")
118 | for prompt in form.value["prompts"]
119 | }
120 | )
121 |
122 | mo.vstack(
123 | [
124 | mo.md("## Selected Models"),
125 | mo.hstack(selected_models_string, align="start", justify="start"),
126 | mo.md("## Selected Prompts"),
127 | selected_prompts_accordion,
128 | ]
129 | ).style(prompt_style)
130 | return selected_models_string, selected_prompts_accordion
131 |
132 |
133 | @app.cell
134 | def __(form, llm_module, map_testable_prompts, mo, prompt_library_module):
135 | mo.stop(not form.value, "")
136 |
137 | all_prompt_responses = []
138 |
139 | total_executions = len(form.value["prompts"]) * len(form.value["models"])
140 |
141 | with mo.status.progress_bar(
142 | title="Running prompts on selected models...",
143 | total=total_executions,
144 | remove_on_exit=True,
145 | ) as prog_bar:
146 | for selected_prompt_name in form.value["prompts"]:
147 | selected_prompt = map_testable_prompts[selected_prompt_name]
148 | prompt_responses = []
149 |
150 | for model in form.value["models"]:
151 | model_name = model.model_id
152 | prog_bar.update(
153 | title=f"Prompting '{model_name}' with '{selected_prompt_name}'",
154 | increment=1,
155 | )
156 | raw_prompt_response = llm_module.prompt_with_temp(
157 | model, selected_prompt, form.value["temp"]
158 | )
159 | prompt_responses.append(
160 | {
161 | "model_id": model_name,
162 | "model": model,
163 | "output": raw_prompt_response,
164 | }
165 | )
166 |
167 | # Create a new list without the 'model' key for each response
168 | list_model_execution_dict = [
169 | {k: v for k, v in response.items() if k != "model"}
170 | for response in prompt_responses
171 | ]
172 |
173 | # Record the execution
174 | execution_filepath = prompt_library_module.record_llm_execution(
175 | prompt=selected_prompt,
176 | list_model_execution_dict=list_model_execution_dict,
177 | prompt_template=selected_prompt_name,
178 | )
179 | print(f"Execution record saved to: {execution_filepath}")
180 |
181 | all_prompt_responses.append(
182 | {
183 | "prompt_name": selected_prompt_name,
184 | "prompt": selected_prompt,
185 | "responses": prompt_responses,
186 | "execution_filepath": execution_filepath,
187 | }
188 | )
189 | return (
190 | all_prompt_responses,
191 | execution_filepath,
192 | list_model_execution_dict,
193 | model,
194 | model_name,
195 | prog_bar,
196 | prompt_responses,
197 | raw_prompt_response,
198 | selected_prompt,
199 | selected_prompt_name,
200 | total_executions,
201 | )
202 |
203 |
204 | @app.cell
205 | def __(all_prompt_responses, mo, pyperclip):
206 | mo.stop(not all_prompt_responses, mo.md(""))
207 |
208 | def copy_to_clipboard(text):
209 | print("copying: ", text)
210 | pyperclip.copy(text)
211 | return 1
212 |
213 | all_prompt_elements = []
214 |
215 | output_prompt_style = {
216 | "background": "#eee",
217 | "padding": "10px",
218 | "border-radius": "10px",
219 | "margin-bottom": "20px",
220 | "min-width": "200px",
221 | "box-shadow": "2px 2px 2px #ccc",
222 | }
223 |
224 | for loop_prompt_data in all_prompt_responses:
225 | prompt_output_elements = [
226 | mo.vstack(
227 | [
228 | mo.md(f"#### {response['model_id']}").style(
229 | {"font-weight": "bold"}
230 | ),
231 | mo.md(response["output"]),
232 | ]
233 | ).style(output_prompt_style)
234 | for response in loop_prompt_data["responses"]
235 | ]
236 |
237 | prompt_element = mo.vstack(
238 | [
239 | mo.md(f"### Prompt: {loop_prompt_data['prompt_name']}"),
240 | mo.hstack(prompt_output_elements, wrap=True, justify="start"),
241 | ]
242 | ).style(
243 | {
244 | "border-left": "4px solid #CCC",
245 | "padding": "2px 10px",
246 | "background": "#ffffee",
247 | }
248 | )
249 |
250 | all_prompt_elements.append(prompt_element)
251 |
252 | mo.vstack(all_prompt_elements)
253 | return (
254 | all_prompt_elements,
255 | copy_to_clipboard,
256 | loop_prompt_data,
257 | output_prompt_style,
258 | prompt_element,
259 | prompt_output_elements,
260 | )
261 |
262 |
263 | @app.cell
264 | def __(all_prompt_responses, copy_to_clipboard, form, mo):
265 | mo.stop(not all_prompt_responses, mo.md(""))
266 | mo.stop(not form.value, mo.md(""))
267 |
268 | # Prepare data for the table
269 | table_data = []
270 | for prompt_data in all_prompt_responses:
271 | for response in prompt_data["responses"]:
272 | table_data.append(
273 | {
274 | "Prompt": prompt_data["prompt_name"],
275 | "Model": response["model_id"],
276 | "Output": response["output"],
277 | }
278 | )
279 |
280 | # Create the table
281 | results_table = mo.ui.table(
282 | data=table_data,
283 | pagination=True,
284 | selection="multi",
285 | page_size=30,
286 | label="Model Responses",
287 | format_mapping={
288 | "Output": lambda val: "(trimmed) " + val[:15],
289 | # "Output": lambda val: val,
290 | },
291 | )
292 |
293 | # Function to copy selected outputs to clipboard
294 | def copy_selected_outputs():
295 | selected_rows = results_table.value
296 | if selected_rows:
297 | outputs = [row["Output"] for row in selected_rows]
298 | combined_output = "\n\n".join(outputs)
299 | copy_to_clipboard(combined_output)
300 | return f"Copied {len(outputs)} response(s) to clipboard"
301 | return "No rows selected"
302 |
303 | # Create the run buttons
304 | copy_button = mo.ui.run_button(label="🔗 Copy Selected Outputs")
305 | score_button = mo.ui.run_button(label="👍 Vote Selected Outputs")
306 |
307 | # Display the table and run buttons
308 | mo.vstack(
309 | [
310 | results_table,
311 | mo.hstack(
312 | [
313 | score_button,
314 | copy_button,
315 | ],
316 | justify="start",
317 | ),
318 | ]
319 | )
320 | return (
321 | copy_button,
322 | copy_selected_outputs,
323 | prompt_data,
324 | response,
325 | results_table,
326 | score_button,
327 | table_data,
328 | )
329 |
330 |
331 | @app.cell
332 | def __(
333 | copy_to_clipboard,
334 | get_rankings,
335 | mo,
336 | prompt_library_module,
337 | results_table,
338 | score_button,
339 | set_rankings,
340 | ):
341 | mo.stop(not results_table.value, "")
342 |
343 | selected_rows = results_table.value
344 | outputs = [row["Output"] for row in selected_rows]
345 | combined_output = "\n\n".join(outputs)
346 |
347 | if score_button.value:
348 | # Increment scores for selected models
349 | current_rankings = get_rankings()
350 | for row in selected_rows:
351 | model_id = row["Model"]
352 | for ranking in current_rankings:
353 | if ranking.llm_model_id == model_id:
354 | ranking.score += 1
355 | break
356 |
357 | # Save updated rankings
358 | set_rankings(current_rankings)
359 | prompt_library_module.save_rankings(current_rankings)
360 |
361 | mo.md(f"Scored {len(selected_rows)} model(s)")
362 | else:
363 | copy_to_clipboard(combined_output)
364 | mo.md(f"Copied {len(outputs)} response(s) to clipboard")
365 | return (
366 | combined_output,
367 | current_rankings,
368 | model_id,
369 | outputs,
370 | ranking,
371 | row,
372 | selected_rows,
373 | )
374 |
375 |
376 | @app.cell
377 | def __(all_prompt_responses, form, mo, prompt_library_module):
378 | mo.stop(not form.value, mo.md(""))
379 | mo.stop(not all_prompt_responses, mo.md(""))
380 |
381 | # Create buttons for resetting and loading rankings
382 | reset_ranking_button = mo.ui.run_button(label="❌ Reset Rankings")
383 | load_ranking_button = mo.ui.run_button(label="🔐 Load Rankings")
384 |
385 | # Load existing rankings
386 | get_rankings, set_rankings = mo.state(prompt_library_module.get_rankings())
387 |
388 | mo.hstack(
389 | [
390 | load_ranking_button,
391 | reset_ranking_button,
392 | ],
393 | justify="start",
394 | )
395 | return (
396 | get_rankings,
397 | load_ranking_button,
398 | reset_ranking_button,
399 | set_rankings,
400 | )
401 |
402 |
403 | @app.cell
404 | def __():
405 | # get_rankings()
406 | return
407 |
408 |
409 | @app.cell
410 | def __(
411 | form,
412 | mo,
413 | prompt_library_module,
414 | reset_ranking_button,
415 | set_rankings,
416 | ):
417 | mo.stop(not form.value, mo.md(""))
418 | mo.stop(not reset_ranking_button.value, mo.md(""))
419 |
420 | set_rankings(
421 | prompt_library_module.reset_rankings(
422 | [model.model_id for model in form.value["models"]]
423 | )
424 | )
425 |
426 | # mo.md("Rankings reset successfully")
427 | return
428 |
429 |
430 | @app.cell
431 | def __(form, load_ranking_button, mo, prompt_library_module, set_rankings):
432 | mo.stop(not form.value, mo.md(""))
433 | mo.stop(not load_ranking_button.value, mo.md(""))
434 |
435 | set_rankings(prompt_library_module.get_rankings())
436 | return
437 |
438 |
439 | @app.cell
440 | def __(get_rankings, mo):
441 | # Create UI elements for each model
442 | model_elements = []
443 |
444 | model_score_style = {
445 | "background": "#eeF",
446 | "padding": "10px",
447 | "border-radius": "10px",
448 | "margin-bottom": "20px",
449 | "min-width": "150px",
450 | "box-shadow": "2px 2px 2px #ccc",
451 | }
452 |
453 | for model_ranking in get_rankings():
454 | llm_model_id = model_ranking.llm_model_id
455 | score = model_ranking.score
456 | model_elements.append(
457 | mo.vstack(
458 | [
459 |                     mo.md(f"**{llm_model_id}**"),
460 |                     mo.hstack([mo.md(""), mo.md(f"# {score}")]),
461 | ],
462 | justify="space-between",
463 | gap="2",
464 | ).style(model_score_style)
465 | )
466 |
467 | mo.hstack(model_elements, justify="start", wrap=True)
468 | return (
469 | llm_model_id,
470 | model_elements,
471 | model_ranking,
472 | model_score_style,
473 | score,
474 | )
475 |
476 |
477 | if __name__ == "__main__":
478 | app.run()
479 |
--------------------------------------------------------------------------------
/multi_llm_prompting.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="full")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import marimo as mo
10 | import src.marimo_notebook.modules.llm_module as llm_module
11 | import src.marimo_notebook.modules.prompt_library_module as prompt_library_module
12 | import json
13 | import pyperclip
14 | return json, llm_module, mo, prompt_library_module, pyperclip
15 |
16 |
17 | @app.cell
18 | def __(llm_module):
19 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series()
20 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest()
21 | # llm_sonnet = llm_module.build_sonnet_3_5()
22 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo()
23 | # gemini_1_5_pro_2, gemini_1_5_flash_2 = llm_module.build_gemini_1_2_002()
24 | # llama3_2_model, llama3_2_1b_model = llm_module.build_ollama_models()
25 |
26 | models = {
27 | "o1-mini": llm_o1_mini,
28 | "o1-preview": llm_o1_preview,
29 | "gpt-4o-latest": llm_gpt_4o_latest,
30 | "gpt-4o-mini": llm_gpt_4o_mini,
31 | # "sonnet-3.5": llm_sonnet,
32 | # "gemini-1-5-pro": gemini_1_5_pro,
33 | # "gemini-1-5-flash": gemini_1_5_flash,
34 | # "gemini-1-5-pro-002": gemini_1_5_pro_2,
35 | # "gemini-1-5-flash-002": gemini_1_5_flash_2,
36 | # "llama3-2": llama3_2_model,
37 | # "llama3-2-1b": llama3_2_1b_model,
38 | }
39 | return (
40 | llm_gpt_4o_latest,
41 | llm_gpt_4o_mini,
42 | llm_o1_mini,
43 | llm_o1_preview,
44 | models,
45 | )
46 |
47 |
48 | @app.cell
49 | def __(mo, models):
50 | prompt_text_area = mo.ui.text_area(label="Prompt", full_width=True)
51 | prompt_temp_slider = mo.ui.slider(
52 | start=0, stop=1, value=0.5, step=0.05, label="Temp"
53 | )
54 | model_multiselect = mo.ui.multiselect(
55 | options=models.copy(),
56 | label="Models",
57 | value=["gpt-4o-mini"],
58 | )
59 |
60 | form = (
61 | mo.md(
62 | r"""
63 | # Multi-LLM Prompt
64 | {prompt}
65 | {temp}
66 | {models}
67 | """
68 | )
69 | .batch(
70 | prompt=prompt_text_area,
71 | temp=prompt_temp_slider,
72 | models=model_multiselect,
73 | )
74 | .form()
75 | )
76 | form
77 | return form, model_multiselect, prompt_temp_slider, prompt_text_area
78 |
79 |
80 | @app.cell
81 | def __(form, llm_module, mo, prompt_library_module):
82 | mo.stop(not form.value, "")
83 |
84 | prompt_responses = []
85 |
86 | with mo.status.progress_bar(
87 | title="Running prompts on selected models...",
88 | total=len(form.value["models"]),
89 | remove_on_exit=True,
90 | ) as prog_bar:
91 | # with mo.status.spinner(title="Running prompts on selected models...") as _spinner:
92 | for model in form.value["models"]:
93 | model_name = model.model_id
94 | prog_bar.update(title=f"Prompting '{model_name}'", increment=1)
95 | response = llm_module.prompt_with_temp(
96 | model, form.value["prompt"], form.value["temp"]
97 | )
98 | prompt_responses.append(
99 | {
100 | "model_id": model_name,
101 | "model": model,
102 | "output": response,
103 | }
104 | )
105 |
106 | # Create a new list without the 'model' key for each response
107 | list_model_execution_dict = [
108 | {k: v for k, v in response.items() if k != "model"}
109 | for response in prompt_responses
110 | ]
111 |
112 | # Record the execution
113 | execution_filepath = prompt_library_module.record_llm_execution(
114 | prompt=form.value["prompt"],
115 | list_model_execution_dict=list_model_execution_dict,
116 | prompt_template=None, # You can add a prompt template if you have one
117 | )
118 | print(f"Execution record saved to: {execution_filepath}")
119 | return (
120 | execution_filepath,
121 | list_model_execution_dict,
122 | model,
123 | model_name,
124 | prog_bar,
125 | prompt_responses,
126 | response,
127 | )
128 |
129 |
130 | @app.cell
131 | def __(mo, prompt_responses, pyperclip):
132 | def copy_to_clipboard(text):
133 | print("copying: ", text)
134 | pyperclip.copy(text)
135 | return mo.md("**Copied to clipboard!**").callout(kind="success")
136 |
137 | output_elements = [
138 | mo.vstack(
139 | [
140 | mo.md(f"# Prompt Output ({response['model_id']})"),
141 | mo.md(response["output"]),
142 | ]
143 | ).style(
144 | {
145 | "background": "#eee",
146 | "padding": "10px",
147 | "border-radius": "10px",
148 | "margin-bottom": "20px",
149 | }
150 | )
151 | for (idx, response) in enumerate(prompt_responses)
152 | ]
153 |
154 | mo.vstack(
155 | [
156 | mo.hstack(output_elements),
157 | # mo.hstack(output_elements, wrap=True),
158 | # mo.vstack(output_elements),
159 | # mo.carousel(output_elements),
160 | # mo.hstack(copy_buttons)
161 | # copy_buttons,
162 | ]
163 | )
164 | return copy_to_clipboard, output_elements
165 |
166 |
167 | @app.cell
168 | def __(copy_to_clipboard, mo, prompt_responses):
169 | copy_buttons = mo.ui.array(
170 | [
171 | mo.ui.button(
172 | label=f"Copy {response['model_id']} response",
173 | on_click=lambda v: copy_to_clipboard(prompt_responses[v]["output"]),
174 | value=idx,
175 | )
176 | for (idx, response) in enumerate(prompt_responses)
177 | ]
178 | )
179 |
180 | mo.vstack(copy_buttons, align="center")
181 | return (copy_buttons,)
182 |
183 |
184 | if __name__ == "__main__":
185 | app.run()
186 |
--------------------------------------------------------------------------------
/prompt_library.py:
--------------------------------------------------------------------------------
1 | import marimo
2 |
3 | __generated_with = "0.8.18"
4 | app = marimo.App(width="medium")
5 |
6 |
7 | @app.cell
8 | def __():
9 | import marimo as mo
10 | from src.marimo_notebook.modules import prompt_library_module, llm_module
11 | import re # For regex to extract placeholders
12 | return llm_module, mo, prompt_library_module, re
13 |
14 |
15 | @app.cell
16 | def __(prompt_library_module):
17 | map_prompt_library: dict = prompt_library_module.pull_in_prompt_library()
18 | return (map_prompt_library,)
19 |
20 |
21 | @app.cell
22 | def __(llm_module):
23 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series()
24 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest()
25 | # llm_sonnet = llm_module.build_sonnet_3_5()
26 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo()
27 |
28 | models = {
29 | "o1-mini": llm_o1_mini,
30 | "o1-preview": llm_o1_preview,
31 | "gpt-4o-latest": llm_gpt_4o_latest,
32 | "gpt-4o-mini": llm_gpt_4o_mini,
33 | # "sonnet-3.5": llm_sonnet,
34 | # "gemini-1-5-pro": gemini_1_5_pro,
35 | # "gemini-1-5-flash": gemini_1_5_flash,
36 | }
37 | return (
38 | llm_gpt_4o_latest,
39 | llm_gpt_4o_mini,
40 | llm_o1_mini,
41 | llm_o1_preview,
42 | models,
43 | )
44 |
45 |
46 | @app.cell
47 | def __():
48 | prompt_styles = {"background": "#eee", "padding": "10px", "border-radius": "10px"}
49 | return (prompt_styles,)
50 |
51 |
52 | @app.cell
53 | def __(map_prompt_library, mo, models):
54 | prompt_keys = list(map_prompt_library.keys())
55 | prompt_dropdown = mo.ui.dropdown(
56 | options=prompt_keys,
57 | label="Select a Prompt",
58 | )
59 | model_dropdown = mo.ui.dropdown(
60 | options=models,
61 | label="Select an LLM Model",
62 | value="gpt-4o-mini",
63 | )
64 | form = (
65 | mo.md(
66 | r"""
67 | # Prompt Library
68 | {prompt_dropdown}
69 | {model_dropdown}
70 | """
71 | )
72 | .batch(
73 | prompt_dropdown=prompt_dropdown,
74 | model_dropdown=model_dropdown,
75 | )
76 | .form()
77 | )
78 | form
79 | return form, model_dropdown, prompt_dropdown, prompt_keys
80 |
81 |
82 | @app.cell
83 | def __(form, map_prompt_library, mo, prompt_styles):
84 | selected_prompt_name = None
85 | selected_prompt = None
86 |
87 | mo.stop(not form.value or not len(form.value), "")
88 | selected_prompt_name = form.value["prompt_dropdown"]
89 | selected_prompt = map_prompt_library[selected_prompt_name]
90 | mo.vstack(
91 | [
92 | mo.md("# Selected Prompt"),
93 | mo.accordion(
94 | {
95 | "### Click to show": mo.md(f"```xml\n{selected_prompt}\n```").style(
96 | prompt_styles
97 | )
98 | }
99 | ),
100 | ]
101 | )
102 | return selected_prompt, selected_prompt_name
103 |
104 |
105 | @app.cell
106 | def __(mo, re, selected_prompt, selected_prompt_name):
107 | mo.stop(not selected_prompt_name or not selected_prompt, "")
108 |
109 | # Extract placeholders from the prompt
110 | placeholders = re.findall(r"\{\{(.*?)\}\}", selected_prompt)
111 | placeholders = list(set(placeholders)) # Remove duplicates
112 |
113 | # Create text areas for placeholders, using the placeholder text as the label
114 | placeholder_inputs = [
115 | mo.ui.text_area(label=ph, placeholder=f"Enter {ph}", full_width=True)
116 | for ph in placeholders
117 | ]
118 |
119 | # Create an array of placeholder inputs
120 | placeholder_array = mo.ui.array(
121 | placeholder_inputs,
122 | label="Fill in the Placeholders",
123 | )
124 |
125 | # Create a 'Proceed' button
126 | proceed_button = mo.ui.run_button(label="Prompt")
127 |
128 | # Display the placeholders and the 'Proceed' button in a vertical stack
129 | vstack = mo.vstack([mo.md("# Prompt Variables"), placeholder_array, proceed_button])
130 | vstack
131 | return (
132 | placeholder_array,
133 | placeholder_inputs,
134 | placeholders,
135 | proceed_button,
136 | vstack,
137 | )
138 |
139 |
140 | @app.cell
141 | def __(mo, placeholder_array, placeholders, proceed_button):
142 | mo.stop(not placeholder_array.value or not len(placeholder_array.value), "")
143 |
144 | # Check if any values are missing
145 | if any(not value.strip() for value in placeholder_array.value):
146 | mo.stop(True, mo.md("**Please fill in all placeholders.**"))
147 |
148 | # Ensure the 'Proceed' button has been pressed
149 | mo.stop(
150 | not proceed_button.value,
151 |         mo.md("**Please press the 'Prompt' button to continue.**"),
152 | )
153 |
154 | # Map the placeholder names to the values
155 | filled_values = dict(zip(placeholders, placeholder_array.value))
156 | return (filled_values,)
157 |
158 |
159 | @app.cell
160 | def __(filled_values, selected_prompt):
161 | # Replace placeholders in the prompt
162 | final_prompt = selected_prompt
163 | for key, value in filled_values.items():
164 | final_prompt = final_prompt.replace(f"{{{{{key}}}}}", value)
165 |
166 | # Create context_filled_prompt
167 | context_filled_prompt = final_prompt
168 | return context_filled_prompt, final_prompt, key, value
169 |
170 |
171 | @app.cell
172 | def __(context_filled_prompt, mo, prompt_styles):
173 | mo.vstack(
174 | [
175 | mo.md("# Context Filled Prompt"),
176 | mo.accordion(
177 | {
178 | "### Click to Show Context Filled Prompt": mo.md(
179 | f"```xml\n{context_filled_prompt}\n```"
180 | ).style(prompt_styles)
181 | }
182 | ),
183 | ]
184 | )
185 | return
186 |
187 |
188 | @app.cell
189 | def __(context_filled_prompt, form, llm_module, mo):
190 | # Get the selected model
191 | model = form.value["model_dropdown"]
192 | # Run the prompt through the model using context_filled_prompt
193 | with mo.status.spinner(title="Running prompt..."):
194 | prompt_response = llm_module.prompt(model, context_filled_prompt)
195 |
196 | mo.md(f"# Prompt Output\n\n{prompt_response}").style(
197 | {"background": "#eee", "padding": "10px", "border-radius": "10px"}
198 | )
199 | return model, prompt_response
200 |
201 |
202 | if __name__ == "__main__":
203 | app.run()
204 |
--------------------------------------------------------------------------------
/prompt_library/ai-coding-meta-review.xml:
--------------------------------------------------------------------------------
1 |
2 | Given a diff of code, look for bugs and provide solutions that an AI coding assistant LLM can
3 | use to automatically fix the code.
4 |
5 |
6 |
7 | The diff-of-code will be provided in the diff-of-code block.
8 | The code-files will be provided in the code-files block.
9 | Respond in the output-format provided. Do not include any other text.
10 | Respond with valid JSON that can be parsed with JSON.parse.
11 | For each fix, specify the file path, startingLine, aiCodingResolutionPrompt, bugDescription, and severity.
12 | First search for critical bugs. Then search for major bugs. Then minor bugs. Then readability.
13 | The aiCodingResolutionPrompt should be a prompt detailing the issue, the solution in natural language, and the code to fix the issue.
14 | Return an AIAssistantResolutions object, which contains an array of fixes: AIAssistantFix.
15 |
16 |
17 |
18 |
21 |
22 |
23 |
24 |
27 |
28 |
29 |
30 |
47 |
--------------------------------------------------------------------------------
/prompt_library/bullet-knowledge-compression.xml:
--------------------------------------------------------------------------------
1 |
2 | Create sections to generate a document that describes the key takeaways, principles, and ideas based on the provided content block.
3 |
4 |
5 |
6 |
7 | Use a simple 'title', 'theme', 'bullet points' structure.
8 |
9 |
10 | Cover as much of the content as you can, but don't overload bullet points; be concise, and nest bullet points when you have more to add that would be valuable for understanding and utilizing the content.
11 |
12 |
13 |
14 |
15 | {{content}}
16 |
--------------------------------------------------------------------------------
/prompt_library/chapter-gen.xml:
--------------------------------------------------------------------------------
1 |
2 | We're generating YouTube video chapters. Generate chapters in the specified format detailed by
3 | the examples, ensuring that each chapter title is short, engaging, SEO-friendly, and aligned
4 | with the corresponding timestamp. Follow the instructions to generate the best, most interesting
5 | chapters.
6 |
7 |
8 |
9 | The timestamps are in the format [MM:SS] and you should use them to create the
10 | chapter titles.
11 | The timestamp should represent the beginning of the chapter.
12 | Collect what you think will be the most interesting and engaging parts of the video
13 | to represent each chapter based on the transcript.
14 | Use the transcript-with-timestamps to generate the chapter title.
15 | Use the examples to properly structure the chapter titles.
16 | Use the seo-keywords-to-hit to generate the chapter title.
17 | Generate 8 chapters spread throughout the duration of the video.
18 | Respond exclusively with the chapters in the specified format detailed in the
19 | examples.
20 |
21 |
22 |
23 |
24 | 📖 Chapters
25 | 00:00 Increase your earnings potential
26 | 00:38 Omnicomplete - the autocomplete for everything
27 | 01:16 LLM Autocompletes can self improve
28 | 02:00 Reveal Actionable Information from your users
29 | 03:20 Client - Server - Prompt Architecture
30 | 05:30 LLM Autocomplete DEMO
31 | 06:45 Autocomplete PROMPT
32 | 08:45 Auto Improve LLM / Self Improve LLM
33 | 10:25 Break down codebase
34 | 12:28 Direct prompt testing integration
35 | 14:10 Domain Knowledge Example
36 | 16:00 Interesting Use Case For LLMs in 2024, 2025
37 |
38 |
39 |
40 | 📖 Chapters
41 | 00:00 The 100x LLM is coming
42 | 01:30 A 100x on opus and gpt4 is insane
43 | 01:57 Sam Altman's winning startup strategy
44 | 03:16 BAPs, Expand your problem set, 100 P/D
45 | 03:35 BAPs
46 | 06:35 Expand your problem set
47 | 08:45 The prompt is the new fundamental unit of programming
48 | 10:40 100 P/D
49 | 14:00 Recap 3 ways to prepare for 100x SOTA LLM
50 |
51 |
52 |
53 | 📖 Chapters
54 | 00:00 Best way to build AI Agents?
55 | 00:39 Agent OS
56 | 01:58 Big Ideas (Summary)
57 | 02:48 Breakdown Agent OS: LPU, RAM, I/O
58 | 04:03 Language Processing Unit (LPU)
59 | 05:42 Is this over engineering?
60 | 07:30 Memory, Context, State (RAM)
61 | 08:20 Tools, Function Calling, Spyware (I/O)
62 | 10:22 How do you know your Architecture is good?
63 | 13:27 Agent Composability
64 | 16:40 What's missing from Agent OS?
65 | 18:53 The Prompt is the...
66 |
67 |
68 |
69 |
70 | {{seo-keywords-to-hit}}
71 |
72 |
73 |
74 | {{transcript-with-timestamps}}
75 |
--------------------------------------------------------------------------------
/prompt_library/hn-sentiment-analysis.xml:
--------------------------------------------------------------------------------
1 |
2 | Analyze the aggregate sentiment of a list of comments from hn-json.
3 |
4 |
5 |
6 |
7 | Respond in the response-format TwoFormatSentiment structure.
8 |
9 | Analyze the sentiment of each comment and aggregate the results.
10 | Do not return anything else. Results must be JSON.parseable.
11 | Group your analysis into positive, nuanced, and negative categories.
12 | Include the entire comment in the 'standOutComments' field.
13 | Positive and negative are self-explanatory. Nuanced sentiment is for comments that
14 | border on positive and negative, or are even mixed.
15 | In the 'markdown' response, respond with your results from AggregateSentiment but
16 | in human-readable markdown format.
17 |
18 | Create at least 3 themes for each sentiment category but expand to as many novel themes as needed.
19 |
20 |
21 | Include at least 3 standOutComments for each theme but expand to as many as you need.
22 |
23 |
24 |
25 |
26 |
29 |
30 |
31 |
32 |
60 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [project]
2 | name = "marimo-notebook"
3 | version = "0.1.0"
4 | description = "Starter codebase that uses Marimo reactive notebooks to build a reusable, customizable prompt library."
5 | readme = "README.md"
6 | requires-python = ">=3.12"
7 | dependencies = [
8 | "openai>=1.47.0",
9 | "python-dotenv>=1.0.1",
10 | "pydantic>=2.9.2",
11 | "llm>=0.16",
12 | "pytest>=8.3.3",
13 | "llm-claude>=0.4.0",
14 | "mako>=1.3.5",
15 | "pandas>=2.2.3",
16 | "matplotlib>=3.9.2",
17 | "altair>=5.4.1",
18 | "vega-datasets>=0.9.0",
19 | "tornado>=6.4.1",
20 | "llm-claude-3>=0.4.1",
21 | "marimo>=0.8.18",
22 | "llm-ollama>=0.5.0",
23 | "pyperclip>=1.9.0",
24 | "llm-gemini>=0.1a5",
25 | ]
26 |
27 | [build-system]
28 | requires = ["hatchling"]
29 | build-backend = "hatchling.build"
30 |
--------------------------------------------------------------------------------
/src/marimo_notebook/__init__.py:
--------------------------------------------------------------------------------
1 | def hello() -> str:
2 | return "Hello from marimo-notebook!"
3 |
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/src/marimo_notebook/modules/__init__.py
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/chain.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | from typing import List, Dict, Callable, Any, Tuple, Union
4 | from .typings import FusionChainResult
5 | import concurrent.futures
6 |
7 |
8 | class FusionChain:
9 |
10 | @staticmethod
11 | def run(
12 | context: Dict[str, Any],
13 | models: List[Any],
14 | callable: Callable,
15 | prompts: List[str],
16 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]],
17 | get_model_name: Callable[[Any], str],
18 | ) -> FusionChainResult:
19 | """
20 | Run a competition between models on a list of prompts.
21 |
22 | Runs the MinimalChainable.run method for each model for each prompt and evaluates the results.
23 |
24 | The evaluator runs on the last output of each model at the end of the chain of prompts.
25 |
26 | The eval method returns a performance score for each model from 0 to 1, giving priority to models earlier in the list.
27 |
28 | Args:
29 | context (Dict[str, Any]): The context for the prompts.
30 | models (List[Any]): List of models to compete.
31 | callable (Callable): The function to call for each prompt.
32 | prompts (List[str]): List of prompts to process.
33 |             evaluator (Callable[[List[Any]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores.
34 |             get_model_name (Callable[[Any], str]): Function to get the name of a model.
35 |
36 | Returns:
37 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names.
38 | """
39 | all_outputs = []
40 | all_context_filled_prompts = []
41 |
42 | for model in models:
43 | outputs, context_filled_prompts = MinimalChainable.run(
44 | context, model, callable, prompts
45 | )
46 | all_outputs.append(outputs)
47 | all_context_filled_prompts.append(context_filled_prompts)
48 |
49 | # Evaluate the last output of each model
50 | last_outputs = [outputs[-1] for outputs in all_outputs]
51 | top_response, performance_scores = evaluator(last_outputs)
52 |
53 | model_names = [get_model_name(model) for model in models]
54 |
55 | return FusionChainResult(
56 | top_response=top_response,
57 | all_prompt_responses=all_outputs,
58 | all_context_filled_prompts=all_context_filled_prompts,
59 | performance_scores=performance_scores,
60 | llm_model_names=model_names,
61 | )
62 |
63 | @staticmethod
64 | def run_parallel(
65 | context: Dict[str, Any],
66 | models: List[Any],
67 | callable: Callable,
68 | prompts: List[str],
69 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]],
70 | get_model_name: Callable[[Any], str],
71 | num_workers: int = 4,
72 | ) -> FusionChainResult:
73 | """
74 | Run a competition between models on a list of prompts in parallel.
75 |
76 | This method is similar to the 'run' method but utilizes parallel processing
77 | to improve performance when dealing with multiple models.
78 |
79 | Args:
80 | context (Dict[str, Any]): The context for the prompts.
81 | models (List[Any]): List of models to compete.
82 | callable (Callable): The function to call for each prompt.
83 | prompts (List[str]): List of prompts to process.
84 |             evaluator (Callable[[List[Any]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores.
85 |             get_model_name (Callable[[Any], str]): Function to get the name of a model.
86 |             num_workers (int): Number of parallel workers to use. Defaults to 4.
87 |
88 | Returns:
89 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names.
90 | """
91 |
92 | def process_model(model):
93 | outputs, context_filled_prompts = MinimalChainable.run(
94 | context, model, callable, prompts
95 | )
96 | return outputs, context_filled_prompts
97 |
98 | all_outputs = []
99 | all_context_filled_prompts = []
100 |
101 | with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
102 | future_to_model = {
103 | executor.submit(process_model, model): model for model in models
104 | }
105 | for future in concurrent.futures.as_completed(future_to_model):
106 | outputs, context_filled_prompts = future.result()
107 | all_outputs.append(outputs)
108 | all_context_filled_prompts.append(context_filled_prompts)
109 |
110 | # Evaluate the last output of each model
111 | last_outputs = [outputs[-1] for outputs in all_outputs]
112 | top_response, performance_scores = evaluator(last_outputs)
113 |
114 | model_names = [get_model_name(model) for model in models]
115 |
116 | return FusionChainResult(
117 | top_response=top_response,
118 | all_prompt_responses=all_outputs,
119 | all_context_filled_prompts=all_context_filled_prompts,
120 | performance_scores=performance_scores,
121 | llm_model_names=model_names,
122 | )
123 |
124 |
125 | class MinimalChainable:
126 | """
127 | Sequential prompt chaining with context and output back-references.
128 | """
129 |
130 | @staticmethod
131 | def run(
132 | context: Dict[str, Any], model: Any, callable: Callable, prompts: List[str]
133 | ) -> Tuple[List[Any], List[str]]:
134 | # Initialize an empty list to store the outputs
135 | output = []
136 | context_filled_prompts = []
137 |
138 | # Iterate over each prompt with its index
139 | for i, prompt in enumerate(prompts):
140 | # Iterate over each key-value pair in the context
141 | for key, value in context.items():
142 | # Check if the key is in the prompt
143 | if "{{" + key + "}}" in prompt:
144 | # Replace the key with its value
145 | prompt = prompt.replace("{{" + key + "}}", str(value))
146 |
147 | # Replace references to previous outputs
148 | # Iterate from the current index down to 1
149 | for j in range(i, 0, -1):
150 | # Get the previous output
151 | previous_output = output[i - j]
152 |
153 | # Handle JSON (dict) output references
154 | # Check if the previous output is a dictionary
155 | if isinstance(previous_output, dict):
156 | # Check if the reference is in the prompt
157 | if f"{{{{output[-{j}]}}}}" in prompt:
158 | # Replace the reference with the JSON string
159 | prompt = prompt.replace(
160 | f"{{{{output[-{j}]}}}}", json.dumps(previous_output)
161 | )
162 | # Iterate over each key-value pair in the previous output
163 | for key, value in previous_output.items():
164 | # Check if the key reference is in the prompt
165 | if f"{{{{output[-{j}].{key}}}}}" in prompt:
166 | # Replace the key reference with its value
167 | prompt = prompt.replace(
168 | f"{{{{output[-{j}].{key}}}}}", str(value)
169 | )
170 | # If not a dict, use the original string
171 | else:
172 | # Check if the reference is in the prompt
173 | if f"{{{{output[-{j}]}}}}" in prompt:
174 | # Replace the reference with the previous output
175 | prompt = prompt.replace(
176 | f"{{{{output[-{j}]}}}}", str(previous_output)
177 | )
178 |
179 | # Append the context filled prompt to the list
180 | context_filled_prompts.append(prompt)
181 |
182 | # Call the provided callable with the processed prompt
183 | # Get the result by calling the callable with the model and prompt
184 | result = callable(model, prompt)
185 |
186 | print("result", result)
187 |
188 | # Try to parse the result as JSON, handling markdown-wrapped JSON
189 | try:
190 | # First, attempt to extract JSON from markdown code blocks
191 | # Search for JSON in markdown code blocks
192 | json_match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", result)
193 | # If a match is found
194 | if json_match:
195 | # Parse the JSON from the match
196 | result = json.loads(json_match.group(1))
197 | else:
198 | # If no markdown block found, try parsing the entire result
199 | # Parse the entire result as JSON
200 | result = json.loads(result)
201 | except json.JSONDecodeError:
202 | # Not JSON, keep as is
203 | pass
204 |
205 | # Append the result to the output list
206 | output.append(result)
207 |
208 | # Return the list of outputs
209 | return output, context_filled_prompts
210 |
211 | @staticmethod
212 | def to_delim_text_file(name: str, content: List[Union[str, dict, list]]) -> str:
213 | result_string = ""
214 | with open(f"{name}.txt", "w") as outfile:
215 | for i, item in enumerate(content, 1):
216 | if isinstance(item, (dict, list)):
217 | item = json.dumps(item)
218 | elif not isinstance(item, str):
219 | item = str(item)
220 | chain_text_delim = (
221 | f"{'🔗' * i} -------- Prompt Chain Result #{i} -------------\n\n"
222 | )
223 | outfile.write(chain_text_delim)
224 | outfile.write(item)
225 | outfile.write("\n\n")
226 |
227 | result_string += chain_text_delim + item + "\n\n"
228 |
229 | return result_string
230 |
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/llm_module.py:
--------------------------------------------------------------------------------
1 | import llm
2 | from dotenv import load_dotenv
3 | import os
4 | from mako.template import Template
5 |
6 | # Load environment variables from .env file
7 | load_dotenv()
8 |
9 |
10 | def conditional_render(prompt, context, start_delim="% if", end_delim="% endif"):
11 | template = Template(prompt)
12 | return template.render(**context)
13 |
14 |
15 | def parse_markdown_backticks(text: str) -> str:
16 |     if "```" not in text:
17 |         return text.strip()
18 |     # Remove opening backticks and language identifier
19 |     text = text.split("```", 1)[-1].split("\n", 1)[-1]
20 |     # Remove closing backticks
21 |     text = text.rsplit("```", 1)[0]
22 |     # Remove any leading or trailing whitespace
23 |     return text.strip()
24 |
25 |
26 | def prompt(model: llm.Model, prompt: str):
27 | res = model.prompt(prompt, stream=False)
28 | return res.text()
29 |
30 |
31 | def prompt_with_temp(model: llm.Model, prompt: str, temperature: float = 0.7):
32 | """
33 | Send a prompt to the model with a specified temperature.
34 |
35 | Args:
36 | model (llm.Model): The LLM model to use.
37 | prompt (str): The prompt to send to the model.
38 | temperature (float): The temperature setting for the model's response. Default is 0.7.
39 |
40 | Returns:
41 | str: The model's response text.
42 | """
43 |
44 | model_id = model.model_id
45 | if "o1" in model_id or "gemini" in model_id:
46 | temperature = 1
47 | res = model.prompt(prompt, stream=False)
48 | return res.text()
49 |
50 | res = model.prompt(prompt, stream=False, temperature=temperature)
51 | return res.text()
52 |
53 |
54 | def get_model_name(model: llm.Model):
55 | return model.model_id
56 |
57 |
58 | def build_sonnet_3_5():
59 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
60 |
61 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
62 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
63 |
64 | return sonnet_3_5_model
65 |
66 |
67 | def build_mini_model():
68 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
69 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
70 | gpt4_o_mini_model.key = OPENAI_API_KEY
71 | return gpt4_o_mini_model
72 |
73 |
74 | def build_big_3_models():
75 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
76 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
77 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
78 |
79 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
80 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
81 |
82 | gpt4_o_model: llm.Model = llm.get_model("4o")
83 | gpt4_o_model.key = OPENAI_API_KEY
84 |
85 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest")
86 | gemini_1_5_pro_model.key = GEMINI_API_KEY
87 |
88 | return sonnet_3_5_model, gpt4_o_model, gemini_1_5_pro_model
89 |
90 |
91 | def build_latest_openai():
92 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
93 |
94 | # chatgpt_4o_latest_model: llm.Model = llm.get_model("chatgpt-4o-latest") - experimental
95 | chatgpt_4o_latest_model: llm.Model = llm.get_model("gpt-4o")
96 | chatgpt_4o_latest_model.key = OPENAI_API_KEY
97 | return chatgpt_4o_latest_model
98 |
99 |
100 | def build_big_3_plus_mini_models():
101 |
102 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
103 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
104 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
105 |
106 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet")
107 | sonnet_3_5_model.key = ANTHROPIC_API_KEY
108 |
109 | gpt4_o_model: llm.Model = llm.get_model("4o")
110 | gpt4_o_model.key = OPENAI_API_KEY
111 |
112 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest")
113 | gemini_1_5_pro_model.key = GEMINI_API_KEY
114 |
115 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
116 | gpt4_o_mini_model.key = OPENAI_API_KEY
117 |
118 | chatgpt_4o_latest_model = build_latest_openai()
119 |
120 | return (
121 | sonnet_3_5_model,
122 | gpt4_o_model,
123 | gemini_1_5_pro_model,
124 | gpt4_o_mini_model,
125 | )
126 |
127 |
128 | def build_gemini_duo():
129 | gemini_1_5_pro: llm.Model = llm.get_model("gemini-1.5-pro-latest")
130 | gemini_1_5_flash: llm.Model = llm.get_model("gemini-1.5-flash-latest")
131 |
132 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
133 |
134 | gemini_1_5_pro.key = GEMINI_API_KEY
135 | gemini_1_5_flash.key = GEMINI_API_KEY
136 |
137 | return gemini_1_5_pro, gemini_1_5_flash
138 |
139 |
140 | def build_ollama_models():
141 |
142 | llama3_2_model: llm.Model = llm.get_model("llama3.2")
143 | llama_3_2_1b_model: llm.Model = llm.get_model("llama3.2:1b")
144 |
145 | return llama3_2_model, llama_3_2_1b_model
146 |
147 |
148 | def build_ollama_slm_models():
149 |
150 | llama3_2_model: llm.Model = llm.get_model("llama3.2")
151 | phi3_5_model: llm.Model = llm.get_model("phi3.5:latest")
152 | qwen2_5_model: llm.Model = llm.get_model("qwen2.5:latest")
153 |
154 | return llama3_2_model, phi3_5_model, qwen2_5_model
155 |
156 |
157 | def build_openai_model_stack():
158 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
159 |
160 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
161 | gpt4_o_2024_08_06_model: llm.Model = llm.get_model("gpt-4o")
162 | o1_preview_model: llm.Model = llm.get_model("o1-preview")
163 | o1_mini_model: llm.Model = llm.get_model("o1-mini")
164 |
165 | models = [
166 | gpt4_o_mini_model,
167 | gpt4_o_2024_08_06_model,
168 | o1_preview_model,
169 | o1_mini_model,
170 | ]
171 |
172 | for model in models:
173 | model.key = OPENAI_API_KEY
174 |
175 | return models
176 |
177 |
178 | def build_openai_latest_and_fastest():
179 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
180 |
181 | gpt_4o_latest: llm.Model = llm.get_model("gpt-4o")
182 | gpt_4o_latest.key = OPENAI_API_KEY
183 |
184 | gpt_4o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
185 | gpt_4o_mini_model.key = OPENAI_API_KEY
186 |
187 | return gpt_4o_latest, gpt_4o_mini_model
188 |
189 |
190 | def build_o1_series():
191 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
192 |
193 | o1_mini_model: llm.Model = llm.get_model("o1-mini")
194 | o1_mini_model.key = OPENAI_API_KEY
195 |
196 | o1_preview_model: llm.Model = llm.get_model("o1-preview")
197 | o1_preview_model.key = OPENAI_API_KEY
198 |
199 | return o1_mini_model, o1_preview_model
200 |
201 |
202 | def build_small_cheap_and_fast():
203 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
204 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
205 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini")
206 | gpt4_o_mini_model.key = OPENAI_API_KEY
207 |
208 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002")
209 | gemini_1_5_flash_002.key = GEMINI_API_KEY
210 |
211 | return gpt4_o_mini_model, gemini_1_5_flash_002
212 |
213 |
226 | def build_gemini_1_2_002():
227 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
228 |
229 | gemini_1_5_pro_002: llm.Model = llm.get_model("gemini-1.5-pro-002")
230 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002")
231 |
232 | gemini_1_5_pro_002.key = GEMINI_API_KEY
233 | gemini_1_5_flash_002.key = GEMINI_API_KEY
234 |
235 | return gemini_1_5_pro_002, gemini_1_5_flash_002
236 |
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/prompt_library_module.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | from datetime import datetime
4 | from typing import List
5 | from dotenv import load_dotenv
6 | from src.marimo_notebook.modules.typings import ModelRanking, MultiLLMPromptExecution
7 |
8 | load_dotenv()
9 |
10 |
11 | def pull_in_dir_recursively(dir: str) -> dict:
12 | if not os.path.exists(dir):
13 | return {}
14 |
15 | result = {}
16 |
17 | def recursive_read(current_dir):
18 | for item in os.listdir(current_dir):
19 | item_path = os.path.join(current_dir, item)
20 | if os.path.isfile(item_path):
21 | relative_path = os.path.relpath(item_path, dir)
22 | with open(item_path, "r") as f:
23 | result[relative_path] = f.read()
24 | elif os.path.isdir(item_path):
25 | recursive_read(item_path)
26 |
27 | recursive_read(dir)
28 | return result
29 |
30 |
31 | def pull_in_prompt_library():
32 | prompt_library_dir = os.getenv("PROMPT_LIBRARY_DIR", "./prompt_library")
33 | return pull_in_dir_recursively(prompt_library_dir)
34 |
35 |
36 | def pull_in_testable_prompts():
37 | testable_prompts_dir = os.getenv("TESTABLE_PROMPTS_DIR", "./testable_prompts")
38 | return pull_in_dir_recursively(testable_prompts_dir)
39 |
40 |
41 | def record_llm_execution(
42 |     prompt: str, list_model_execution_dict: list, prompt_template: str | None = None
43 | ):
44 | execution_dir = os.getenv("PROMPT_EXECUTIONS_DIR", "./prompt_executions")
45 | os.makedirs(execution_dir, exist_ok=True)
46 |
47 | if prompt_template:
48 | filename_base = prompt_template.replace(" ", "_").lower()
49 | else:
50 | filename_base = prompt[:50].replace(" ", "_").lower()
51 |
52 | # Clean up filename_base to ensure it's alphanumeric only
53 | filename_base = "".join(
54 | char for char in filename_base if char.isalnum() or char == "_"
55 | )
56 |
57 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
58 | filename = f"{filename_base}_{timestamp}.json"
59 | filepath = os.path.join(execution_dir, filename)
60 |
61 | execution_record = MultiLLMPromptExecution(
62 | prompt=prompt,
63 | prompt_template=prompt_template,
64 | prompt_responses=list_model_execution_dict,
65 | )
66 |
67 | with open(filepath, "w") as f:
68 | json.dump(execution_record.model_dump(), f, indent=2)
69 |
70 | return filepath
71 |
72 |
73 | def get_rankings():
74 | rankings_file = os.getenv(
75 | "LANGUAGE_MODEL_RANKINGS_FILE", "./language_model_rankings/rankings.json"
76 | )
77 | if not os.path.exists(rankings_file):
78 | return []
79 | with open(rankings_file, "r") as f:
80 | rankings_data = json.load(f)
81 | return [ModelRanking(**ranking) for ranking in rankings_data]
82 |
83 |
84 | def save_rankings(rankings: List[ModelRanking]):
85 | rankings_file = os.getenv(
86 | "LANGUAGE_MODEL_RANKINGS_FILE", "./language_model_rankings/rankings.json"
87 | )
88 | os.makedirs(os.path.dirname(rankings_file), exist_ok=True)
89 | rankings_dict = [ranking.model_dump() for ranking in rankings]
90 | with open(rankings_file, "w") as f:
91 | json.dump(rankings_dict, f, indent=2)
92 |
93 |
94 | def reset_rankings(model_ids: List[str]):
95 | new_rankings = [
96 | ModelRanking(llm_model_id=model_id, score=0) for model_id in model_ids
97 | ]
98 | save_rankings(new_rankings)
99 | return new_rankings
100 |
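
if __name__ == "__main__":
    # Minimal usage sketch of the persistence helpers above. Paths come from the
    # PROMPT_EXECUTIONS_DIR and LANGUAGE_MODEL_RANKINGS_FILE env vars (or their
    # ./prompt_executions and ./language_model_rankings defaults), so this writes
    # small JSON files. Run from the repo root so the absolute import resolves:
    # python -m src.marimo_notebook.modules.prompt_library_module
    filepath = record_llm_execution(
        prompt="ping",
        list_model_execution_dict=[{"model_id": "gpt-4o-mini", "output": "pong"}],
        prompt_template="ping",
    )
    print(f"Execution saved to: {filepath}")

    rankings = reset_rankings(["gpt-4o-mini", "o1-mini"])
    rankings[0].score += 1
    save_rankings(rankings)
    print([r.model_dump() for r in get_rankings()])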
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/typings.py:
--------------------------------------------------------------------------------
1 | from pydantic import BaseModel, ConfigDict
2 | from typing import List, Dict, Optional, Union, Any
3 |
4 |
5 | class FusionChainResult(BaseModel):
6 | top_response: Union[str, Dict[str, Any]]
7 | all_prompt_responses: List[List[Any]]
8 | all_context_filled_prompts: List[List[str]]
9 | performance_scores: List[float]
10 | llm_model_names: List[str]
11 |
12 |
13 | class MultiLLMPromptExecution(BaseModel):
14 | prompt_responses: List[Dict[str, Any]]
15 | prompt: str
16 | prompt_template: Optional[str] = None
17 |
18 |
19 | class ModelRanking(BaseModel):
20 | llm_model_id: str
21 | score: int
22 |
--------------------------------------------------------------------------------
/src/marimo_notebook/modules/utils.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import json
3 | import os
4 | from typing import Union, Dict, List
5 |
6 | OUTPUT_DIR = "output"
7 |
8 |
9 | def build_file_path(name: str):
10 | session_dir = f"{OUTPUT_DIR}"
11 | os.makedirs(session_dir, exist_ok=True)
12 | return os.path.join(session_dir, f"{name}")
13 |
14 |
15 | def build_file_name_session(name: str, session_id: str):
16 | session_dir = f"{OUTPUT_DIR}/{session_id}"
17 | os.makedirs(session_dir, exist_ok=True)
18 | return os.path.join(session_dir, f"{name}")
19 |
20 |
21 | def to_json_file_pretty(name: str, content: Union[Dict, List]):
22 | def default_serializer(obj):
23 | if hasattr(obj, "model_dump"):
24 | return obj.model_dump()
25 | raise TypeError(
26 | f"Object of type {obj.__class__.__name__} is not JSON serializable"
27 | )
28 |
29 | with open(f"{name}.json", "w") as outfile:
30 | json.dump(content, outfile, indent=2, default=default_serializer)
31 |
32 |
33 | def current_date_time_str() -> str:
34 | return datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
35 |
36 |
37 | def current_date_str() -> str:
38 | return datetime.datetime.now().strftime("%Y-%m-%d")
39 |
40 |
41 | def dict_item_diff_by_set(
42 | previous_list: List[Dict], current_list: List[Dict], set_key: str
43 | ) -> List[str]:
44 | previous_set = {item[set_key] for item in previous_list}
45 | current_set = {item[set_key] for item in current_list}
46 | return list(current_set - previous_set)
47 |
--------------------------------------------------------------------------------
/src/marimo_notebook/temp.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/src/marimo_notebook/temp.py
--------------------------------------------------------------------------------
/testable_prompts/bash_commands/command_generation_1.md:
--------------------------------------------------------------------------------
1 | Mac: Bash: Concise: How do I list all hidden files in a directory?
--------------------------------------------------------------------------------
/testable_prompts/bash_commands/command_generation_2.md:
--------------------------------------------------------------------------------
1 | Mac: Bash: Concise: How do I recursively search a directory for a file by name?
--------------------------------------------------------------------------------
/testable_prompts/bash_commands/command_generation_3.md:
--------------------------------------------------------------------------------
1 | Mac: Bash: Concise: How do I resolve merge conflicts in Git when trying to merge two branches?
--------------------------------------------------------------------------------
/testable_prompts/basics/hello.md:
--------------------------------------------------------------------------------
1 | Hey my name is Dan, are you ready to build?
--------------------------------------------------------------------------------
/testable_prompts/basics/mult_lang_counting.xml:
--------------------------------------------------------------------------------
1 |
2 | Count to ten then zero in python, typescript, bash, sql, and rust.
3 |
4 |
5 |
6 | Respond with only runnable code
7 | Use a while loop
8 | Use the print function
9 | Group the code by language
10 | Use markdown to format the languages (h1) and code (backticks)
11 |
--------------------------------------------------------------------------------
/testable_prompts/basics/ping.xml:
--------------------------------------------------------------------------------
1 | ping
--------------------------------------------------------------------------------
/testable_prompts/basics/python_count_to_ten.xml:
--------------------------------------------------------------------------------
1 |
2 | Count to ten in python
3 |
4 |
5 |
6 | Respond with only runnable code
7 | Use a while loop
8 | Use the print function
9 |
--------------------------------------------------------------------------------
/testable_prompts/code_debugging/code_debugging_1.md:
--------------------------------------------------------------------------------
1 | Find the bug in this code:
2 |
3 | def mult_and_sum_array(arr, multiple):
4 | multi_arr = [x * multiple for x in arr]
5 | sum = 0
6 | sum = sum(multi_arr)
7 | return sum
--------------------------------------------------------------------------------
/testable_prompts/code_debugging/code_debugging_2.md:
--------------------------------------------------------------------------------
1 | Find the bug in this code:
2 |
3 | def find_max(nums):
4 | max_num = float('-inf')
5 | for num in nums:
6 | if num < max_num:
7 | max_num = num
8 | return max_num
--------------------------------------------------------------------------------
/testable_prompts/code_debugging/code_debugging_3.md:
--------------------------------------------------------------------------------
1 | Identify the programming language used in the following CODE_SNIPPET:
2 |
3 | CODE_SNIPPET
4 |
5 | def example_function():
6 | print("Hello, World!")
--------------------------------------------------------------------------------
/testable_prompts/code_explanation/code_explanation_1.md:
--------------------------------------------------------------------------------
1 | Concisely explain how I can use generator functions in Python in less than 100 words.
--------------------------------------------------------------------------------
/testable_prompts/code_explanation/code_explanation_2.md:
--------------------------------------------------------------------------------
1 | Explain what the PYTHON_CODE does in 100 words or less.
2 |
3 | PYTHON_CODE
4 | def get_first_keyword_in_prompt(prompt: str):
5 | map_keywords_to_agents = {
6 | "bash,browser": run_bash_command_workflow,
7 | "question": question_answer_workflow,
8 | "hello,hey,hi": soft_talk_workflow,
9 | "exit": end_conversation_workflow,
10 | }
11 | for keyword_group, agent in map_keywords_to_agents.items():
12 | keywords = keyword_group.split(",")
13 | for keyword in keywords:
14 | if keyword in prompt.lower():
15 | return agent, keyword
16 | return None, None
--------------------------------------------------------------------------------
/testable_prompts/code_explanation/code_explanation_3.md:
--------------------------------------------------------------------------------
1 | Concisely explain how I can use list comprehensions in Python in less than 100 words.
--------------------------------------------------------------------------------
/testable_prompts/code_generation/code_generation_1.md:
--------------------------------------------------------------------------------
1 | Implement the following python function prefix_string("abc", 2) -> "abcabc"
--------------------------------------------------------------------------------
/testable_prompts/code_generation/code_generation_2.md:
--------------------------------------------------------------------------------
1 | Implement a Python function is_palindrome(s) that takes a string s and returns True if it is a palindrome, False otherwise.
--------------------------------------------------------------------------------
/testable_prompts/code_generation/code_generation_3.md:
--------------------------------------------------------------------------------
1 | Implement a Python function longest_common_subsequence(s1, s2) that takes two strings s1 and s2 and returns the longest common subsequence between them.
--------------------------------------------------------------------------------
/testable_prompts/code_generation/code_generation_4.md:
--------------------------------------------------------------------------------
1 | Implement a Python class Stack that represents a stack using a singly linked list. It should support push, pop, and is_empty operations.
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_1.md:
--------------------------------------------------------------------------------
1 | What was the end-of-year prediction made in the SCRIPT below?
2 |
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. 
We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. 
This and then we have llama 3 so close to my 80 standard we'll dig into where it went wrong in just a second here we then have phi 3 failed half of the 12 test cases and then we have gemma looks like it did one better 7 out of 12 so you can see here this is why it's important to have a control group specifically for testing elms it's really good to compare against a kind of high performing model and you know gpg 3.5 turbo it's not really even high performing anymore but it's a good benchmark for testing against local models because really if we use opus or gpt4 here the local models won't even come close so that's why i like to compare to something like gpg 3.5 you can also use cloud 3 haiku here this right away gives you a great benchmark on how local models are performing let's go ahead and look at one of these tests what happened where did things go wrong let's look at our text classification this is a simple test the prompt is is the following block of text a sql natural language query nlq respond exclusively with yes or no so this test here is going to look at how well the model can both answer correctly and answer precisely right it needs to say yes or no and then the block of text is select 10 users over the age of 21 with a gmail address and then we have the assertion type equals yes so our test case validates this test if it returns exclusively yes and we can look at the prompt test to see exactly what that looks like so if you go to test.yaml we can see we're looking for just yes this is what that test looks like right so this is our one of our text classification tests and and we have this assertion type equals yes so equals is used when you know exactly what you want the response to be a lot of the times you'll want something like a i contains all so case insensitive contains everything or a case insensitive contains any and there are lots of different assertions you can make you can easily dive into that i've linked that in the readme you'll want to look at the assertions documentation in prompt foo they have a whole list here of different assertions you can make to improve and strengthen your prompt test so that's what that test looks like and and you can kind of go through the line over each model to see exactly what went right what went wrong etc so feel free to check out the other test cases the long story short here is that by running the itv benchmark by running your personal benchmarks against local models you can have higher confidence and you can have first movers advantage on getting your hands on these local models and truly utilizing them as you can see here llama 3 is nearly within my standard of what i need an elm to do based on these 12 test cases i'll increase this to add a lot more of the use cases that i use out of these 12 test cases llama 3 is performing really really well and this is the 8b model right so if we look at a llama you can see here the default version that comes in here is the 8 billion parameter model that's the 4b quantization so pretty good stuff here i don't need to talk about how great llama 3 is the rest of the internet is doing that but it is really awesome to see how it performs on your specific use cases the closer you get to the metal here the closer you understand how these models are performing next to each other the better and the faster you're going to be able to take these models and productionize them in your tools and products i also just want to shout out how incredible it is to actually run these tests over and over and over 
again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. 
You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_2.md:
--------------------------------------------------------------------------------
1 | What was the speaker's personal accuracy requirement for the benchmark made in the SCRIPT below?
2 |
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. 
We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. 
This and then we have llama 3 so close to my 80 standard we'll dig into where it went wrong in just a second here we then have phi 3 failed half of the 12 test cases and then we have gemma looks like it did one better 7 out of 12 so you can see here this is why it's important to have a control group specifically for testing elms it's really good to compare against a kind of high performing model and you know gpg 3.5 turbo it's not really even high performing anymore but it's a good benchmark for testing against local models because really if we use opus or gpt4 here the local models won't even come close so that's why i like to compare to something like gpg 3.5 you can also use cloud 3 haiku here this right away gives you a great benchmark on how local models are performing let's go ahead and look at one of these tests what happened where did things go wrong let's look at our text classification this is a simple test the prompt is is the following block of text a sql natural language query nlq respond exclusively with yes or no so this test here is going to look at how well the model can both answer correctly and answer precisely right it needs to say yes or no and then the block of text is select 10 users over the age of 21 with a gmail address and then we have the assertion type equals yes so our test case validates this test if it returns exclusively yes and we can look at the prompt test to see exactly what that looks like so if you go to test.yaml we can see we're looking for just yes this is what that test looks like right so this is our one of our text classification tests and and we have this assertion type equals yes so equals is used when you know exactly what you want the response to be a lot of the times you'll want something like a i contains all so case insensitive contains everything or a case insensitive contains any and there are lots of different assertions you can make you can easily dive into that i've linked that in the readme you'll want to look at the assertions documentation in prompt foo they have a whole list here of different assertions you can make to improve and strengthen your prompt test so that's what that test looks like and and you can kind of go through the line over each model to see exactly what went right what went wrong etc so feel free to check out the other test cases the long story short here is that by running the itv benchmark by running your personal benchmarks against local models you can have higher confidence and you can have first movers advantage on getting your hands on these local models and truly utilizing them as you can see here llama 3 is nearly within my standard of what i need an elm to do based on these 12 test cases i'll increase this to add a lot more of the use cases that i use out of these 12 test cases llama 3 is performing really really well and this is the 8b model right so if we look at a llama you can see here the default version that comes in here is the 8 billion parameter model that's the 4b quantization so pretty good stuff here i don't need to talk about how great llama 3 is the rest of the internet is doing that but it is really awesome to see how it performs on your specific use cases the closer you get to the metal here the closer you understand how these models are performing next to each other the better and the faster you're going to be able to take these models and productionize them in your tools and products i also just want to shout out how incredible it is to actually run these tests over and over and over 
again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. 
You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
--------------------------------------------------------------------------------
/testable_prompts/context_window/context_window_3.md:
--------------------------------------------------------------------------------
1 | What was the end-of-year prediction made in the SCRIPT below?
2 |
3 | SCRIPT
4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITV benchmark. What is it? It's simple, nothing fancy: ITV just asks, is this viable? That's what the test is all about. I just want to know: is this ELM, this efficient language model, AKA on-device language model, good enough? The code repository we're about to dive into is a personalized, use-case-specific benchmark that lets you quickly swap ELMs in and out to know whether they're ready for your tools and applications. So let's take a quick look at the codebase; the link will be in the description. Let's crack open VS Code and start with the README. Previewing it, it's simple: this uses Bun, promptfoo, and Ollama for a minimalist, cross-platform local LLM prompt testing and benchmarking experience. Before we dive in any further, I'm going to open up the terminal and type bun run elm, and that kicks off the test. You can see right away I have four models running, starting with GPT-3.5 as a control model to test against, and then we have Llama 3 running through Ollama chat, we have Phi 3, and we have Gemma as well. While this runs through our 12 test cases, let's look at what the codebase looks like. All the setup details are in the README, and you should be able to get set up in less than a minute. This codebase was designed specifically to help you benchmark local models for your use cases, so that when they're ready, you can start saving time and money immediately. The structure is very simple: there's some setup, some minor scripts, and then the most important thing, a bench__ directory followed by the test suite name; this one is called Efficient Language Models. Let's look at the prompt. The prompt is just a simple template that gets filled in with each individual test run. If we open up the test files and collapse everything, you can see we have a list of, what do we have here, 12 tests. They're sectioned off: string manipulation, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark; by the time you're watching this, there will likely be a few additional tests here. They'll be generic enough, though, that you can come in, understand them, and tweak them to fit your own specific use cases. We'll look at the test file in more detail in just a second, but if you go to the most important file, the promptfoo configuration, and collapse it, you can see we have our control cloud LLM. I like to have a control group and an experimental group. The control group is the cloud LLM that we want to prove our local models are as good as, or near the performance of; right now I'm using GPT-3.5. And then we have our experimental local ELMs: Llama 3, Phi 3, and Gemma. Again, you can tweak these; they're all served through Ollama.
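To give you a rough idea of the shape of that configuration, a promptfoo config along these lines might look something like the sketch below. This is an illustrative reconstruction, not the repo's actual file; the file names and model tags are assumptions, and the provider IDs just follow promptfoo's openai: and ollama:chat: conventions.

```yaml
# Sketch of a promptfoo configuration: one cloud control model plus local ELMs.
# File names and model tags are illustrative, not copied from the repo.
prompts:
  - file://prompt.txt

providers:
  # Control group: the cloud LLM the local models are measured against
  - openai:gpt-3.5-turbo
  # Experimental group: local ELMs served by Ollama
  - ollama:chat:llama3
  - ollama:chat:phi3
  - ollama:chat:gemma
  # When a new model like OpenELM lands on Ollama, it would just be one more
  # line here, e.g. ollama:chat:openelm (hypothetical tag)

tests: file://tests.yaml
```

Swapping models in and out is then just a matter of editing the providers list and re-running the test command.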
Let's quickly run through the toolset. We're using Bun, which is an all-in-one JavaScript runtime. Over the past year the engineers have really matured the ecosystem, and it's my go-to tool for all things JavaScript and TypeScript. They recently launched Windows support, which means this codebase works out of the box for Mac, Linux, and Windows users. You can click through and browse their work; huge shout-out to the Bun developers for everything they've built here. We're using Ollama to serve our local language models; I probably don't need to introduce it. And last but not least, we're using promptfoo. I've talked about promptfoo in a few past videos, but it's important to bring up again: this is how you test your individual prompts against expectations. What does that look like? If we scroll down to the hero section, you can see exactly what a test case looks like. You have the prompts you want to test, which is what you would normally type into a chat input field. Then you have your individual providers; say you want to test OpenAI, Claude, and Mistral Large, you put those all here, and for each provider, promptfoo runs every single prompt. At the bottom you have your test cases. Test cases can pass variables into your prompts, as you can see here, and most importantly, they can assert specific expectations on the output of your LLM. So here you can see an assertion of type contains, where we make sure the output has a given string in it; we're also making sure the cost is below a certain amount, the latency below a certain threshold, and so on. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really simple experience: you have your prompt configuration, where you specify which models you want to use, and you have your tests, which specify the details. Let's look at one of those tests. This is a simple bullet summary test: I'm saying, create a summary of the following text in bullet points, and then passing in the script from one of our previous videos ("Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows..."). We're asserting, case-insensitively, that all of the expected items are in the response to the prompt.
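Here's roughly what a test case like that bullet summary check could look like in promptfoo's tests file. This is a sketch, not the repo's actual test; the variable name, expected keywords, and thresholds are placeholders.

```yaml
# Sketch of a promptfoo test case in the spirit of the bullet-summary test.
# Variable name, expected keywords, and thresholds are placeholders.
- description: bullet summary of a video script
  vars:
    user_prompt: |
      Create a summary of the following text in bullet points.

      Here's a simple yet powerful idea that can help you take a large step
      toward useful and valuable agentic workflows...
  assert:
    - type: icontains-all          # case-insensitive "contains all of these"
      value:
        - two-way prompt
        - agentic
    - type: cost
      threshold: 0.01              # illustrative cost ceiling in dollars
    - type: latency
      threshold: 10000             # illustrative latency ceiling in milliseconds
```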
Now let's look at our output and see if our prompts completed. Okay, so we have 33 successful and 15 failed tests. Llama 3 ran every single one of these test cases and reported its results, so let's take a look at what that looks like. After you run the tests with bun elm, you can run bun view; if we open up package.json, you can see that bun view just runs promptfoo view, which kicks off a local promptfoo server that shows us exactly what happened in the test runs. Right away you get a great summary of the results. Our control model failed only one test, passing with 91% accuracy. Then we have Llama 3, so close to my 80% standard; we'll dig into where it went wrong in just a second. Phi 3 failed half of the 12 test cases, and Gemma looks like it did one better, at 7 out of 12. This is why it's important to have a control group when testing ELMs: it's really useful to compare against a high-performing model. GPT-3.5 Turbo isn't really even high-performing anymore, but it's a good benchmark for local models, because if we used Opus or GPT-4 here, the local models wouldn't even come close. That's why I like to compare to something like GPT-3.5; you could also use Claude 3 Haiku here. This immediately gives you a great read on how local models are performing. Let's look at one of these tests and see where things went wrong. Take our text classification test. It's a simple one; the prompt is: is the following block of text a SQL natural language query (NLQ)? Respond exclusively with yes or no. This test looks at how well the model can both answer correctly and answer precisely: it needs to say yes or no, nothing else. The block of text is "select 10 users over the age of 21 with gmail address", and we have an assertion of type equals with the value yes, so the test case passes only if the model returns exclusively "yes". We can look at the prompt test to see exactly what that looks like: if you go to test.yaml, you can see we're looking for just "yes". So this is one of our text classification tests, and it has this assertion, type equals, value yes. Equals is used when you know exactly what you want the response to be. A lot of the time you'll want something like icontains-all, a case-insensitive "contains everything", or icontains-any, a case-insensitive "contains any". There are lots of different assertions you can make; I've linked the assertions documentation in the README, and promptfoo has a whole list of assertion types you can use to improve and strengthen your prompt tests.
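For the yes/no classification, here's a sketch of that exact-match check as a promptfoo test case. The prompt text mirrors the testable prompt in text_classification_1.md; the variable name is a placeholder.

```yaml
# Sketch of the NLQ yes/no classification test with an exact-match assertion.
# The variable name is a placeholder; the prompt mirrors text_classification_1.md.
- description: is this block of text a SQL natural language query?
  vars:
    user_prompt: |
      Is the following BLOCK_OF_TEXT a SQL Natural Language Query (NLQ)?
      Respond Exclusively with 'yes' or 'no'.

      BLOCK_OF_TEXT

      select 10 users over the age of 21 with gmail address
  assert:
    - type: equals       # anything other than exactly "yes" fails the test
      value: "yes"
```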
That's what that test looks like, and you can go line by line over each model to see exactly what went right and what went wrong; feel free to check out the other test cases. The long story short is that by running the ITV benchmark, by running your personal benchmarks against local models, you can have higher confidence and a first mover's advantage in getting your hands on these local models and truly utilizing them. As you can see, Llama 3 is nearly within my standard of what I need an ELM to do based on these 12 test cases, and I'll expand the suite to cover more of my real use cases. Llama 3 is performing really, really well, and this is the 8B model: if we look at Ollama, the default version it pulls is the 8-billion-parameter model with 4-bit quantization. Pretty good stuff. I don't need to tell you how great Llama 3 is, the rest of the internet is doing that, but it is really awesome to see how it performs on your specific use cases. The closer you get to the metal here, the better you understand how these models perform next to each other, and the faster you'll be able to take them and productionize them in your tools and products. I also just want to shout out how incredible it is to run these tests over and over again with the same model without thinking about the cost for a single second. You can see we're getting about 12 tokens per second across the board. Not ideal, not super great, but everything completed fine, and if you walk through the examples, a lot of these test cases are passing, which is really great. I'm going to be keeping a pretty close eye on this stuff, so definitely like and subscribe if you're interested in the best-performing local models. I feel like we're going to have a few different classes of models. We used to break things down into "fastest, cheapest" versus "best, slowest." What I think we need to do now is nest that: at the cloud level, we have "fastest, cheapest, lower accuracy" and "best, slowest, most expensive"; at the local level, we have "fastest, lower accuracy" and "best, slowest." Things change locally: now we're just trading off speed and accuracy, which simplifies things a lot. In the cloud bucket, "best, slowest, most expensive" is your Opus and your GPT-4, and "fastest, cheapest" is your Haiku and GPT-3.5. On the local side we're getting into an interesting place: the fast class now has Phi 3, Llama 3 at 8 billion parameters, and Gemma, and the slow class has the bigger models; this is where Llama 3 at 70 billion goes, along with whatever other big models come out that will really tax your RAM. They'll run slower, but they'll give you the best performance you can possibly have locally. So I'm keeping an eye on this; hit the like and the sub if you want to stay up to date with how cloud versus local models progress. We'll be covering these on the channel, and I'll likely use this class system to separate them. The first thing that needs to happen is that we need anything at all to run locally well enough: decent accuracy, any speed. That's what we're looking for right now; the rest will come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. The link for the code is in the description. I built this to be ultra simple; just follow the README to get started. Thanks to Bun, promptfoo, and Ollama, this should be completely cross-platform, and I'll be updating it with additional test cases. By the time you watch this, I'll likely have added several more; I'm missing some things like code generation, context window length testing, and a couple of other sections, so look forward to that. I hope all of this makes sense and that you're feeling the speed of the open source community building toward usable, viable ELMs. This is something we've all been really excited about, and it's finally starting to happen. I'm going to predict that by the end of the year, we'll have an on-device Haiku-to-GPT-4-level model running while consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test it as well, and that's one of the highlights of using the ITV benchmark inside this codebase.
You'll be able to quickly and seamlessly get it up and running by just updating the model name and adding a new provider entry to the configuration; it'll look something like OpenELM plus whatever the size ends up being, say the 3B, and that's it. Then you just run the tests again. That's the beauty of having a test suite like this set up and ready to go. You can, of course, come in and customize it: add Opus, add Haiku, add other models, tweak it to your liking. That's what this is all about, and I highly recommend you get in here and test it. This was important enough for me to take a break from the personal AI assistant, HSE, and all of that other work. Between new hardware like the MacBook Pro M4 chip being released and the LLM community rolling out permutations of Llama 3, I think very soon, possibly before mid-2024, ELMs, efficient language models, will be ready for on-device use. Again, this is use case specific, which is really the whole point of creating this video: to share this codebase with you so that you can know exactly what your use-case-specific standards are. Because once you have standards set and a great prompt testing framework like promptfoo, you can answer the question for yourself, for your tools, and for your products: is this efficient language model ready for my device? For me personally, the answer is: very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building.
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_1.md:
--------------------------------------------------------------------------------
1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name.
2 |
3 | EMAIL
4 |
5 | Subject:
6 | Action Items for next week
7 | From:
8 | john@workhard.com
9 | Body:
10 | Hey can you send over the action items for the week?
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_2.md:
--------------------------------------------------------------------------------
1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name.
2 |
3 | EMAIL
4 |
5 | Subject:
6 | Dinner plans this weekend
7 | From:
8 | sarah@gmail.com
9 | Body:
10 | Hey! Just wanted to see if you're free for dinner this Saturday? Let me know!
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_3.md:
--------------------------------------------------------------------------------
1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name.
2 |
3 | EMAIL
4 |
5 | Subject:
6 | Your weekly tech newsletter
7 | From:
8 | newsletter@techdigest.com
9 | Body:
10 | Here are the top tech stories for this week...
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_4.md:
--------------------------------------------------------------------------------
1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name.
2 |
3 | EMAIL
4 |
5 | Subject:
6 | Your order has shipped!
7 | From:
8 | orders@onlinestore.com
9 | Body:
10 | Good news! Your recent order has shipped and is on its way to you.
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_5.md:
--------------------------------------------------------------------------------
1 | Create a concise response to the USER_EMAIL. Follow the RESPONSE_STRUCTURE.
2 |
3 | RESPONSE_STRUCTURE:
4 | - Hey
5 | - Appreciate reach out
6 | - Reframe email to confirm details
7 | - Next steps, schedule meeting next week
8 | - Thanks for your time. Stay Focused, Keep building. - Dan.
9 |
10 | USER_EMAIL:
11 | Subject:
12 | Let's move forward on the project
13 | From:
14 | john@aiconsultingco.com
15 | Body:
16 | Hey Dan, I was thinking more about the product requirements for the idea we brainstormed last week.
17 | Let's discuss pricing, timeline and move forward with the proof of concept. Can we sync next week?
18 | Thanks for your time.
--------------------------------------------------------------------------------
/testable_prompts/email_management/email_management_6.md:
--------------------------------------------------------------------------------
1 | Create a concise summary of the DENSE_EMAIL. Extract the most important information into bullet points and then summarize the email in the SUMMARY_FORMAT.
2 |
3 | SUMMARY_FORMAT:
4 | Oneline Summary
5 | ...
6 | Bullet points
7 | - a
8 | - b
9 | - c
10 |
11 | DENSE_EMAIL:
12 | Subject:
13 | Project Update - New Requirements and Timeline Changes
14 | From:
15 | sarah@techconsultingfirm.com
16 | Body:
17 | Hi Dan,
18 |
19 | I wanted to provide an update on the ERP system implementation project. After our last meeting with the client, they have requested some additional features and changes to the original requirements. This includes integrating with their existing CRM system, adding advanced reporting capabilities, and supporting multi-currency transactions.
20 |
21 | Due to these new requirements, we will need to adjust our project timeline and milestones. I estimate that these changes will add approximately 3-4 weeks to our original schedule. We should also plan for additional testing and quality assurance to ensure the new features are working as expected.
22 |
23 | Please let me know if you have any concerns or questions about these changes. I think it's important that we communicate this to the client as soon as possible and set expectations around the revised timeline.
--------------------------------------------------------------------------------
/testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_1.md:
--------------------------------------------------------------------------------
1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'.
2 |
3 | You work with your human companion 'Dan' to build valuable experience through software.
4 |
5 | We both like short, concise, back-and-forth conversations.
6 |
7 | Concisely communicate the following message to your human companion: 'Select an image to generate a Vue component from'.
--------------------------------------------------------------------------------
/testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_2.md:
--------------------------------------------------------------------------------
1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'.
2 |
3 | You work with your human companion 'Dan' to build valuable experience through software.
4 |
5 | We both like short, concise, back-and-forth conversations.
6 |
7 | Communicate the following message to your human companion: 'I've found the URL in your clipboard. I'll scrape the URL and generate example code for you. But first, what about the example code would you like me to focus on?'.
--------------------------------------------------------------------------------
/testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_3.md:
--------------------------------------------------------------------------------
1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'.
2 |
3 | You work with your human companion 'Dan' to build valuable experience through software.
4 |
5 | We both like short, concise, back-and-forth conversations.
6 |
7 | Communicate the following message to your human companion: 'Code has been written to the working directory'.
--------------------------------------------------------------------------------
/testable_prompts/sql/nlq1.md:
--------------------------------------------------------------------------------
1 | Given this natural language query: "select all authed users with 'Premium' plans", generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS.
2 |
3 | INSTRUCTIONS:
4 |
5 | - ENSURE THE SQL IS VALID FOR THE DIALECT
6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL
7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query
8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else
9 | - Prefer * for SELECT statements unless the user specifies a column
10 |
11 | TABLE_DEFINITIONS:
12 |
13 | CREATE TABLE users (
14 | id INT,
15 | created TIMESTAMP,
16 | updated TIMESTAMP,
17 | authed BOOLEAN,
18 | PLAN TEXT,
19 | name TEXT,
20 | email TEXT
21 | );
22 |
23 | SQL Statement:
24 |
--------------------------------------------------------------------------------
/testable_prompts/sql/nlq2.md:
--------------------------------------------------------------------------------
1 | Given this natural language query: 'select users created after January 1, 2022', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS.
2 |
3 | INSTRUCTIONS:
4 |
5 | - ENSURE THE SQL IS VALID FOR THE DIALECT
6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL
7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query
8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else
9 | - Prefer * for SELECT statements unless the user specifies a column
10 |
11 | TABLE_DEFINITIONS:
12 |
13 | CREATE TABLE users (
14 | id INT,
15 | created TIMESTAMP,
16 | updated TIMESTAMP,
17 | authed BOOLEAN,
18 | PLAN TEXT,
19 | name TEXT,
20 | email TEXT
21 | );
22 |
23 | SQL Statement:
24 |
--------------------------------------------------------------------------------
/testable_prompts/sql/nlq3.md:
--------------------------------------------------------------------------------
1 | Given this natural language query: 'select the top 3 users by most recent update', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS.
2 |
3 | INSTRUCTIONS:
4 |
5 | - ENSURE THE SQL IS VALID FOR THE DIALECT
6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL
7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query
8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else
9 | - Prefer * for SELECT statements unless the user specifies a column
10 |
11 | TABLE_DEFINITIONS:
12 |
13 | CREATE TABLE users (
14 | id INT,
15 | created TIMESTAMP,
16 | updated TIMESTAMP,
17 | authed BOOLEAN,
18 | PLAN TEXT,
19 | name TEXT,
20 | email TEXT
21 | );
22 |
23 | SQL Statement:
24 |
--------------------------------------------------------------------------------
/testable_prompts/sql/nlq4.md:
--------------------------------------------------------------------------------
1 | Given this natural language query: 'select user names and emails of non-authed users', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS.
2 |
3 | INSTRUCTIONS:
4 |
5 | - ENSURE THE SQL IS VALID FOR THE DIALECT
6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL
7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query
8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else
9 | - Prefer * for SELECT statements unless the user specifies a column
10 |
11 | TABLE_DEFINITIONS:
12 |
13 | CREATE TABLE users (
14 | id INT,
15 | created TIMESTAMP,
16 | updated TIMESTAMP,
17 | authed BOOLEAN,
18 | PLAN TEXT,
19 | name TEXT,
20 | email TEXT
21 | );
22 |
23 | SQL Statement:
24 |
--------------------------------------------------------------------------------
/testable_prompts/sql/nlq5.md:
--------------------------------------------------------------------------------
1 | Given this natural language query: "select users with 'Basic' plan who were created before 2021", generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS.
2 |
3 | INSTRUCTIONS:
4 |
5 | - ENSURE THE SQL IS VALID FOR THE DIALECT
6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL
7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query
8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else
9 | - Prefer * for SELECT statements unless the user specifies a column
10 |
11 | TABLE_DEFINITIONS:
12 |
13 | CREATE TABLE users (
14 | id INT,
15 | created TIMESTAMP,
16 | updated TIMESTAMP,
17 | authed BOOLEAN,
18 | PLAN TEXT,
19 | name TEXT,
20 | email TEXT
21 | );
22 |
23 | SQL Statement:
24 |
--------------------------------------------------------------------------------
/testable_prompts/string_manipulation/string_manipulation_1.md:
--------------------------------------------------------------------------------
1 | Create a summary of the following text in bullet points.
2 |
3 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes. Two-way prompting happens all the time when you're at work, with friends, with family, online, in comment sections, on PR reviews. You ask a question; your co-worker responds. They ask a question; you respond. Now, let's double click into what this looks like for your agentic tools. Right in agentic workflows, you are the critical communication process between you and your AI agents that are aiming to drive outcomes. In most agentic workflows, we fire off one prompt or configure some system prompt, and that's it. But we're missing a ton of opportunity here that we can unlock using two-way prompts. Let me show you a concrete example with Ada. So, Ada, of course, is our proof of concept personal AI assistant. And let me just go ahead and kick off this workflow so I can show you exactly how useful the two-way prompt can be. Ada, let's create some example code.
--------------------------------------------------------------------------------
/testable_prompts/string_manipulation/string_manipulation_2.md:
--------------------------------------------------------------------------------
1 | Take the following text and for each sentence, convert it into a bullet point. Do not change the sentence. Maintain the punctuation. The punctuations '.!?' should trigger bullet points. Use - for the bullet point.
2 |
3 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes.
--------------------------------------------------------------------------------
/testable_prompts/string_manipulation/string_manipulation_3.md:
--------------------------------------------------------------------------------
1 | Convert the following SCRIPT to markdown, follow the SCRIPTING_RULES.
2 |
3 | SCRIPTING_RULES
4 | - Create 1 h1 header with an interesting title.
5 | - Create 2 h2 sub headers, one for the summary and one for the details.
6 | - Each section should contain bullet points.
7 | - Start each section with a hook.
8 | - Use short paragraphs.
9 | - Use emojis to indicate the hooks.
10 |
11 | SCRIPT
12 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes.
--------------------------------------------------------------------------------
/testable_prompts/text_classification/text_classification_1.md:
--------------------------------------------------------------------------------
1 | Is the following BLOCK_OF_TEXT a SQL Natural Language Query (NLQ)? Respond Exclusively with 'yes' or 'no'.
2 |
3 | BLOCK_OF_TEXT
4 |
5 | select 10 users over the age of 21 with gmail address
--------------------------------------------------------------------------------
/testable_prompts/text_classification/text_classification_2.md:
--------------------------------------------------------------------------------
1 | Determine if the sentiment of the following TEXT is positive or negative. Respond exclusively with 'positive' or 'negative'.
2 |
3 | TEXT
4 |
5 | I love sunny days there's nothing like getting out in nature
--------------------------------------------------------------------------------