├── .env.sample ├── .gitignore ├── AI_DOCS ├── marimo_cheatsheet.md ├── marimo_compressed.md └── marimo_documentation.json ├── README.md ├── adhoc_prompting.py ├── images ├── marimo_prompt_library.png └── multi_slm_llm_prompt_and_model.png ├── language_model_rankings └── rankings.json ├── layouts ├── adhoc_prompting.grid.json ├── adhoc_prompting.slides.json ├── multi_language_model_ranker.grid.json ├── multi_llm_prompting.grid.json ├── prompt_library.grid.json └── prompt_library.slides.json ├── marimo_is_awesome_demo.py ├── multi_language_model_ranker.py ├── multi_llm_prompting.py ├── prompt_library.py ├── prompt_library ├── ai-coding-meta-review.xml ├── bullet-knowledge-compression.xml ├── chapter-gen.xml └── hn-sentiment-analysis.xml ├── pyproject.toml ├── src └── marimo_notebook │ ├── __init__.py │ ├── modules │ ├── __init__.py │ ├── chain.py │ ├── llm_module.py │ ├── prompt_library_module.py │ ├── typings.py │ └── utils.py │ └── temp.py ├── testable_prompts ├── bash_commands │ ├── command_generation_1.md │ ├── command_generation_2.md │ └── command_generation_3.md ├── basics │ ├── hello.md │ ├── mult_lang_counting.xml │ ├── ping.xml │ └── python_count_to_ten.xml ├── code_debugging │ ├── code_debugging_1.md │ ├── code_debugging_2.md │ └── code_debugging_3.md ├── code_explanation │ ├── code_explanation_1.md │ ├── code_explanation_2.md │ └── code_explanation_3.md ├── code_generation │ ├── code_generation_1.md │ ├── code_generation_2.md │ ├── code_generation_3.md │ └── code_generation_4.md ├── context_window │ ├── context_window_1.md │ ├── context_window_2.md │ └── context_window_3.md ├── email_management │ ├── email_management_1.md │ ├── email_management_2.md │ ├── email_management_3.md │ ├── email_management_4.md │ ├── email_management_5.md │ └── email_management_6.md ├── personal_ai_assistant_responses │ ├── personal_ai_assistant_responses_1.md │ ├── personal_ai_assistant_responses_2.md │ └── personal_ai_assistant_responses_3.md ├── sql │ ├── nlq1.md │ ├── nlq2.md │ ├── nlq3.md │ ├── nlq4.md │ └── nlq5.md ├── string_manipulation │ ├── string_manipulation_1.md │ ├── string_manipulation_2.md │ └── string_manipulation_3.md └── text_classification │ ├── text_classification_1.md │ └── text_classification_2.md └── uv.lock /.env.sample: -------------------------------------------------------------------------------- 1 | ANTHROPIC_API_KEY= 2 | OPENAI_API_KEY= 3 | GROQ_API_KEY= 4 | GEMINI_API_KEY= 5 | PROMPT_LIBRARY_DIR=./prompt_library 6 | PROMPT_EXECUTIONS_DIR=./prompt_executions 7 | TESTABLE_PROMPTS_DIR=./testable_prompts 8 | LANGUAGE_MODEL_RANKINGS_FILE=./language_model_rankings/rankings.json 9 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Based on https://raw.githubusercontent.com/github/gitignore/main/Node.gitignore 2 | 3 | # Logs 4 | 5 | logs 6 | _.log 7 | npm-debug.log_ 8 | yarn-debug.log* 9 | yarn-error.log* 10 | lerna-debug.log* 11 | .pnpm-debug.log* 12 | 13 | # Caches 14 | 15 | .cache 16 | 17 | # Diagnostic reports (https://nodejs.org/api/report.html) 18 | 19 | report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json 20 | 21 | # Runtime data 22 | 23 | pids 24 | _.pid 25 | _.seed 26 | *.pid.lock 27 | 28 | # Directory for instrumented libs generated by jscoverage/JSCover 29 | 30 | lib-cov 31 | 32 | # Coverage directory used by tools like istanbul 33 | 34 | coverage 35 | *.lcov 36 | 37 | # nyc test coverage 38 | 39 | .nyc_output 40 | 41 | # Grunt intermediate storage 
(https://gruntjs.com/creating-plugins#storing-task-files) 42 | 43 | .grunt 44 | 45 | # Bower dependency directory (https://bower.io/) 46 | 47 | bower_components 48 | 49 | # node-waf configuration 50 | 51 | .lock-wscript 52 | 53 | # Compiled binary addons (https://nodejs.org/api/addons.html) 54 | 55 | build/Release 56 | 57 | # Dependency directories 58 | 59 | node_modules/ 60 | jspm_packages/ 61 | 62 | # Snowpack dependency directory (https://snowpack.dev/) 63 | 64 | web_modules/ 65 | 66 | # TypeScript cache 67 | 68 | *.tsbuildinfo 69 | 70 | # Optional npm cache directory 71 | 72 | .npm 73 | 74 | # Optional eslint cache 75 | 76 | .eslintcache 77 | 78 | # Optional stylelint cache 79 | 80 | .stylelintcache 81 | 82 | # Microbundle cache 83 | 84 | .rpt2_cache/ 85 | .rts2_cache_cjs/ 86 | .rts2_cache_es/ 87 | .rts2_cache_umd/ 88 | 89 | # Optional REPL history 90 | 91 | .node_repl_history 92 | 93 | # Output of 'npm pack' 94 | 95 | *.tgz 96 | 97 | # Yarn Integrity file 98 | 99 | .yarn-integrity 100 | 101 | # dotenv environment variable files 102 | 103 | .env 104 | .env.development.local 105 | .env.test.local 106 | .env.production.local 107 | .env.local 108 | 109 | # parcel-bundler cache (https://parceljs.org/) 110 | 111 | .parcel-cache 112 | 113 | # Next.js build output 114 | 115 | .next 116 | out 117 | 118 | # Nuxt.js build / generate output 119 | 120 | .nuxt 121 | dist 122 | 123 | # Gatsby files 124 | 125 | # Comment in the public line in if your project uses Gatsby and not Next.js 126 | 127 | # https://nextjs.org/blog/next-9-1#public-directory-support 128 | 129 | # public 130 | 131 | # vuepress build output 132 | 133 | .vuepress/dist 134 | 135 | # vuepress v2.x temp and cache directory 136 | 137 | .temp 138 | 139 | # Docusaurus cache and generated files 140 | 141 | .docusaurus 142 | 143 | # Serverless directories 144 | 145 | .serverless/ 146 | 147 | # FuseBox cache 148 | 149 | .fusebox/ 150 | 151 | # DynamoDB Local files 152 | 153 | .dynamodb/ 154 | 155 | # TernJS port file 156 | 157 | .tern-port 158 | 159 | # Stores VSCode versions used for testing VSCode extensions 160 | 161 | .vscode-test 162 | 163 | # yarn v2 164 | 165 | .yarn/cache 166 | .yarn/unplugged 167 | .yarn/build-state.yml 168 | .yarn/install-state.gz 169 | .pnp.* 170 | 171 | # IntelliJ based IDEs 172 | .idea 173 | 174 | # Finder (MacOS) folder config 175 | .DS_Store 176 | .aider* 177 | 178 | __pycache__/ 179 | 180 | .venv/ 181 | 182 | prompt_executions/ -------------------------------------------------------------------------------- /AI_DOCS/marimo_cheatsheet.md: -------------------------------------------------------------------------------- 1 | # Marimo Cheat Sheet 0.2.5 2 | 3 | ## Install and Import 4 | 5 | ### Install 6 | ```bash 7 | pip install marimo 8 | ``` 9 | ```bash 10 | marimo (open Marimo app) 11 | ``` 12 | Open Safari 13 | ```bash 14 | http://localhost:8888 15 | ``` 16 | ### Open Tutorials 17 | ```bash 18 | marimo tutorial DITTS 19 | ``` 20 | ### View Server Info 21 | ```bash 22 | marimo tutorial --help 23 | ``` 24 | ```bash 25 | Create new notebook 26 | ``` 27 | ```bash 28 | > create notebook 29 | ``` 30 | ```bash 31 | ls lists directories 32 | ``` 33 | ```bash 34 | marimo export my_notebook.py 35 | ``` 36 | ### Serve notebook as script 37 | ```bash 38 | marimo serve my_notebook.py 39 | ``` 40 | ### Serve notebook as app 41 | ```bash 42 | marimo export my_notebook.json > your_notebook.py 43 | ``` 44 | ```bash 45 | Run jupyter server 46 | ``` 47 | ```bash 48 | jupyter notebook 49 | ``` 50 | ```bash 51 | marimo export 
my_notebook.json > your_notebook.py 52 | ``` 53 | ### CLI Commands (MARIMO CLI) 54 | ```bash 55 | marimo -p {PORT} NAME 56 | ``` 57 | ```bash 58 | --p {PORT} SERVER to attach to. 59 | ``` 60 | ```bash 61 | --h --show home screen in the app. 62 | ``` 63 | ```bash 64 | --h displays Home screen in app. 65 | ``` 66 | ```bash 67 | marimo export my_notebook.json > your_notebook.py 68 | ``` 69 | **Server Port Tips**: 70 | ```bash 71 | If a port is busy use --port option. Server should start with /, /app subfolder. Use CLI or URL to access. 72 | ``` 73 | ### Run server and exit. 74 | [GitHub](https://github.com/tithyhs/marimo-cheat-sheet) 75 | [Docs](http://docs.marimo.io) 76 | 77 | ## Inputs 78 | ```python 79 | # Array of ID elements 80 | ctlA.get_value(df.id), ctlC.set() 81 | 82 | # Add new elements with sample label labels[] 83 | new_labels.add('example_label'),['label'],['tag',C.init()]) 84 | 85 | # Buttons with optional on-click 86 | m.ui_buttons(labels=[‘Ok’, 9], Labeled=‘Click Me’, m.ui_onclick('on_click')) 87 | 88 | # Basic checkbox layout 89 | ctl_inputs.add(['label':('Check me')]) 90 | 91 | # Combo box code 92 | def_code_box[selector='dropdown',['shown']]) 93 | 94 | # Dataframe code: column_names df.columns.labels df.cf.head()] 95 | df_render_data(df,'render_data','visualizations'] 96 | 97 | # Dictionary 98 | m.ui_dictionary([‘text’:‘No.1 Column’, ‘data’: m.ui.set(df)]) 99 | ``` 100 | #### Set Dropdown 101 | 102 | ```python 103 | # Slider 104 | m.ui_checkbox(['id_range=[‘1’,‘Choice 2’]', ‘Choice 2']) 105 | ``` 106 | #### Multi Dropdown 107 | 108 | ```python 109 | m.ui_multiselect(options=['Basic',3,'Row 5']) 110 | ``` 111 | ### Table output 112 | 113 | ```python 114 | table().rows=['Header',['Example 1'5,‘Item 2’]) 115 | ``` 116 | #### Expand 117 | ```python 118 | rows(), show_folded_value() 119 | ``` 120 | ## MarkDown 121 | ```markdown 122 | ## Music markdown 123 | 124 | - # 'Markdown Text' 125 | - ## Integrative Playlist - Start 126 | ``` 127 | ```bash 128 | m.link('https://spotify' }]) 129 | ``` 130 | **Text Positioning** 131 | ```python 132 | m.md_text('Hello world') ]) 133 | ``` 134 | ```python 135 | m.md_link('Playlist URL ') 136 | ``` 137 | Zoom in 138 | ```python 139 | m_md_zoom ] 140 | ``` 141 | ```python 142 | m.md_add_input_code(`music-rocker`)` 143 | ``` 144 | ```markdown 145 | - **Use Syntax:** 146 | '|data_content',['m.md_render('Tooltip-Hello')]) 147 | ``` 148 | ```markdown 149 | Code Syntax: 150 | ‘’` 151 | ``` 152 | ## Outputs 153 | ```python 154 | # Replace cell's output 155 | m.output_replace(‘cell_output2’) 156 | 157 | # Append data cell 158 | m_output(append_cell('cell-output3’) 159 | ``` 160 | ## Plotting 161 | ```python 162 | # Create Axis chart 163 | chart.plot(df_chart_data()).axis(['figure_color='],‘Width_px,G.title['Origin']) 164 | 165 | # Add chart (after show function) 166 | m.axis_chart([],function_completed) 167 | 168 | # Chart as Plotly interactive 169 | axis_chart_loader['function_return_chart']) 170 | ``` 171 | ```python 172 | m.set.axis.plot(figure_data['canvas']).plot.chart.[sizes=s,parameter=plots]) 173 | 174 | m.plot_data(plt.D3,[],[interactive() ]) 175 | ``` 176 | ### Plot 177 | 178 | ```python 179 | plt.axis(['Start Plot']) 180 | ``` 181 | ## Media 182 | ### Render an image 183 | 184 | ```python 185 | m.md_image({‘File.jpg’}) 186 | ‘Image description’=[m.md_title_image-].add-Image id:to_800) 187 | ``` 188 | Add Markdown Media: 189 | ```python 190 | m.media_add(video‘video_display ]]) 191 | m_media.stream({'Source_load-[‘MP4')}) 192 | ``` 193 | 
```python 194 | img_embed=[m.youtube_link].render.{MP4’],embed[call]).allow('toggle_toolbar()’)) 195 | ``` 196 | Embed Audio 197 | ```python 198 | md.audio(‘audio_name', 'path.mp3') 199 | 200 | m.md_figure.img_reference.link' 201 | ``` 202 | ## Diagrams 203 | ### Define Diagram 204 | ```python 205 | m_diag_define(m.diagram_structure].create() 206 | 207 | # Label 208 | diagram_diagram('Line path',['Node Arrow']) 209 | ``` 210 | ### Showing simple path 211 | ```python 212 | simple_showchart='Example Path', ‘label','Diagonal_top']) 213 | ``` 214 | ```python 215 | path_node]-(vertical_direction)] 216 | ``` 217 | ## Status 218 | ```python 219 | # Progress bars 220 | progress_text(‘task-subtitle-updating.progress’) 221 | 222 | [Fetching Data] 223 | [please_processing_title_loadtext_spinner.png] 224 | ``` 225 | #### Time Display Progress 226 | ```python 227 | time.sleep(['progress_loader']) 228 | ``` 229 | ## Control Flow 230 | ```python 231 | # If/Else Looping State Variables 232 | # Call cell_state.log (['iteration_num=0) 233 | ``` 234 | ```python 235 | # Ends current loop iteration/executes cell 236 | ``` 237 | ```python 238 | def.goto(‘state=cell_suspended’) 239 | ctl_current({state}).wait() 240 | ``` 241 | ## State 242 | ### State Management Code 243 | 244 | ```python 245 | ctl_set.state=[]['C.State'] 246 | ctl_render_type(['current_frame_reset()]) 247 | # toggle state 248 | ctl_state] 249 | ``` 250 | ## HTML 251 | ### Convert Python to HTML 252 | ```python 253 | html_cell.add(html.tags()``` 254 | html_justified 255 | ``` 256 | ```python 257 | m.html_create_wrapper 258 | def(fixalignment_item]) 259 | 260 | # Apply single batch to cell 261 | html.md_wrap.applyAlign(color) 262 | ``` 263 | ### Set Justify 264 | ```python 265 | html_tag.set='center] 266 | ``` 267 | ## Debug 268 | ### Debugging cell output: 269 | ```python 270 | ctl_debug().output 271 | m.debug.retrieve_debug().info 272 | ``` 273 | ### Inspect Execution Code 274 | ```python 275 | ctl_last.debug() 276 | ``` 277 | [GitHub](https://github.com/tithyhs/marimo-cheat-sheet) 278 | [Docs](http://docs.marimo.io) -------------------------------------------------------------------------------- /AI_DOCS/marimo_compressed.md: -------------------------------------------------------------------------------- 1 | import marimo as mo 2 | import random 3 | import pandas as pd 4 | import plotly.express as px 5 | import altair as alt 6 | from vega_datasets import data 7 | import matplotlib.pyplot as plt 8 | 9 | # Markdown 10 | mo.md("## This is a markdown heading") 11 | 12 | # Inputs 13 | 14 | # 1. Array 15 | sliders = mo.ui.array([mo.ui.slider(1, 100) for _ in range(3)]) 16 | mo.md(f"Array of sliders: {sliders}") 17 | 18 | # 2. Batch 19 | user_info = mo.md( 20 | """ 21 | - **Name:** {name} 22 | - **Birthday:** {birthday} 23 | """ 24 | ).batch(name=mo.ui.text(), birthday=mo.ui.date()) 25 | user_info 26 | 27 | # 3. Button 28 | def on_click(value): 29 | print("Button clicked!", value) 30 | return value + 1 31 | 32 | button = mo.ui.button(on_click=on_click, value=0, label="Click Me") 33 | button 34 | 35 | # 4. Checkbox 36 | checkbox = mo.ui.checkbox(label="Agree to terms") 37 | mo.md(f"Checkbox value: {checkbox.value}") 38 | 39 | # 5. Code Editor 40 | code = """ 41 | def my_function(): 42 | print("Hello from code editor!") 43 | """ 44 | code_editor = mo.ui.code_editor(value=code, language="python") 45 | code_editor 46 | 47 | # 6. 
Dataframe 48 | df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) 49 | dataframe_ui = mo.ui.dataframe(df) 50 | dataframe_ui 51 | 52 | # 7. Data Explorer 53 | data_explorer = mo.ui.data_explorer(data.cars()) 54 | data_explorer 55 | 56 | # 8. Dates 57 | 58 | # Single date 59 | date_picker = mo.ui.date(label="Select a date") 60 | date_picker 61 | 62 | # Date and time 63 | datetime_picker = mo.ui.datetime(label="Select a date and time") 64 | datetime_picker 65 | 66 | # Date range 67 | date_range_picker = mo.ui.date_range(label="Select a date range") 68 | date_range_picker 69 | 70 | # 9. Dictionary 71 | elements = mo.ui.dictionary({ 72 | 'slider': mo.ui.slider(1, 10), 73 | 'text': mo.ui.text(placeholder="Enter text") 74 | }) 75 | mo.md(f"Dictionary of elements: {elements}") 76 | 77 | # 10. Dropdown 78 | dropdown = mo.ui.dropdown(options=['Option 1', 'Option 2', 'Option 3'], label="Select an option") 79 | dropdown 80 | 81 | # 11. File 82 | file_upload = mo.ui.file(label="Upload a file") 83 | file_upload 84 | 85 | # 12. File Browser 86 | file_browser = mo.ui.file_browser(label="Browse files") 87 | file_browser 88 | 89 | # 13. Form 90 | form = mo.ui.text(label="Enter your name").form() 91 | form 92 | 93 | # 14. Microphone 94 | microphone = mo.ui.microphone(label="Record audio") 95 | microphone 96 | 97 | # 15. Multiselect 98 | multiselect = mo.ui.multiselect(options=['A', 'B', 'C', 'D'], label="Select multiple options") 99 | multiselect 100 | 101 | # 16. Number 102 | number_picker = mo.ui.number(0, 10, step=0.5, label="Select a number") 103 | number_picker 104 | 105 | # 17. Radio 106 | radio_group = mo.ui.radio(options=['Red', 'Green', 'Blue'], label="Select a color") 107 | radio_group 108 | 109 | # 18. Range Slider 110 | range_slider = mo.ui.range_slider(0, 100, step=5, value=[20, 80], label="Select a range") 111 | range_slider 112 | 113 | # 19. Refresh 114 | refresh_button = mo.ui.refresh(default_interval="5s", label="Refresh") 115 | refresh_button 116 | 117 | # 20. Run Button 118 | run_button = mo.ui.run_button(label="Run") 119 | run_button 120 | 121 | # 21. Slider 122 | slider = mo.ui.slider(0, 100, step=1, label="Adjust value") 123 | slider 124 | 125 | # 22. Switch 126 | switch = mo.ui.switch(label="Enable feature") 127 | switch 128 | 129 | # 23. Table 130 | table_data = [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}] 131 | table = mo.ui.table(data=table_data, label="User Table") 132 | table 133 | 134 | # 24. Tabs 135 | tab1_content = mo.md("Content for Tab 1") 136 | tab2_content = mo.ui.slider(0, 10) 137 | tabs = mo.ui.tabs({'Tab 1': tab1_content, 'Tab 2': tab2_content}) 138 | tabs 139 | 140 | # 25. Text 141 | text_input = mo.ui.text(placeholder="Enter some text", label="Text Input") 142 | text_input 143 | 144 | # 26. Text Area 145 | text_area = mo.ui.text_area(placeholder="Enter a long text", label="Text Area") 146 | text_area 147 | 148 | # 27. Custom UI elements (Anywidget) 149 | # See the documentation on Anywidget for examples. 150 | 151 | # Layouts 152 | 153 | # 1. Accordion 154 | accordion = mo.ui.accordion({'Section 1': mo.md("This is section 1"), 'Section 2': mo.ui.slider(0, 10)}) 155 | accordion 156 | 157 | # 2. Carousel 158 | carousel = mo.carousel([mo.md("Item 1"), mo.ui.slider(0, 10), mo.md("Item 3")]) 159 | carousel 160 | 161 | # 3. Callout 162 | callout = mo.md("Important message!").callout(kind="warn") 163 | callout 164 | 165 | # 4. 
Justify 166 | 167 | # Center 168 | centered_text = mo.md("This text is centered").center() 169 | centered_text 170 | 171 | # Left 172 | left_aligned_text = mo.md("This text is left aligned").left() 173 | left_aligned_text 174 | 175 | # Right 176 | right_aligned_text = mo.md("This text is right aligned").right() 177 | right_aligned_text 178 | 179 | # 5. Lazy 180 | def lazy_content(): 181 | mo.md("This content loaded lazily!") 182 | 183 | lazy_element = mo.lazy(lazy_content) 184 | lazy_element 185 | 186 | # 6. Plain 187 | plain_dataframe = mo.plain(df) 188 | plain_dataframe 189 | 190 | # 7. Routes 191 | def home_page(): 192 | return mo.md("# Home Page") 193 | 194 | def about_page(): 195 | return mo.md("# About Page") 196 | 197 | mo.routes({ 198 | "#/": home_page, 199 | "#/about": about_page 200 | }) 201 | 202 | # 8. Sidebar 203 | sidebar_content = mo.vstack([mo.md("## Menu"), mo.ui.button(label="Home"), mo.ui.button(label="About")]) 204 | mo.sidebar(sidebar_content) 205 | 206 | # 9. Stacks 207 | 208 | # Horizontal Stack 209 | hstack_layout = mo.hstack([mo.md("Left"), mo.ui.slider(0, 10), mo.md("Right")]) 210 | hstack_layout 211 | 212 | # Vertical Stack 213 | vstack_layout = mo.vstack([mo.md("Top"), mo.ui.slider(0, 10), mo.md("Bottom")]) 214 | vstack_layout 215 | 216 | # 10. Tree 217 | tree_data = ['Item 1', ['Subitem 1.1', 'Subitem 1.2'], {'Key': 'Value'}] 218 | tree = mo.tree(tree_data) 219 | tree 220 | 221 | # Plotting 222 | 223 | # Reactive charts with Altair 224 | altair_chart = mo.ui.altair_chart(alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon', color='Origin')) 225 | altair_chart 226 | 227 | # Reactive plots with Plotly 228 | plotly_chart = mo.ui.plotly(px.scatter(data.cars(), x="Horsepower", y="Miles_per_Gallon", color="Origin")) 229 | plotly_chart 230 | 231 | # Interactive matplotlib 232 | plt.plot([1, 2, 3, 4]) 233 | interactive_mpl_chart = mo.mpl.interactive(plt.gcf()) 234 | interactive_mpl_chart 235 | 236 | # Media 237 | 238 | # 1. Image 239 | image = mo.image("https://marimo.io/logo.png", width=100, alt="Marimo Logo") 240 | image 241 | 242 | # 2. Audio 243 | audio = mo.audio("https://www.zedge.net/find/ringtones/ocean%20waves") 244 | audio 245 | 246 | # 3. Video 247 | video = mo.video("https://www.youtube.com/watch?v=dQw4w9WgXcQ", controls=True) 248 | video 249 | 250 | # 4. PDF 251 | pdf = mo.pdf("https://www.africau.edu/images/default/sample.pdf", width="50%") 252 | pdf 253 | 254 | # 5. Download Media 255 | download_button = mo.download(data="This is the content of the file", filename="download.txt") 256 | download_button 257 | 258 | # 6. Plain text 259 | plain_text = mo.plain_text("This is plain text") 260 | plain_text 261 | 262 | # Diagrams 263 | 264 | # 1. Mermaid diagrams 265 | mermaid_code = """ 266 | graph LR 267 | A[Square Rect] -- Link text --> B((Circle)) 268 | A --> C(Round Rect) 269 | B --> D{Rhombus} 270 | C --> D 271 | """ 272 | mermaid_diagram = mo.mermaid(mermaid_code) 273 | mermaid_diagram 274 | 275 | # 2. Statistic cards 276 | stat_card = mo.stat(value=100, label="Users", caption="Total users this month", direction="increase") 277 | stat_card 278 | 279 | # Status 280 | 281 | # 1. Progress bar 282 | for i in mo.status.progress_bar(range(10), title="Processing"): 283 | # Simulate some work 284 | pass 285 | 286 | # 2. Spinner 287 | with mo.status.spinner(title="Loading...", subtitle="Please wait"): 288 | # Simulate a long-running task 289 | pass 290 | 291 | # Outputs 292 | 293 | # 1. 
Replace output 294 | mo.output.replace(mo.md("This is the new output")) 295 | 296 | # 2. Append output 297 | mo.output.append(mo.md("This is appended output")) 298 | 299 | # 3. Clear output 300 | mo.output.clear() 301 | 302 | # 4. Replace output at index 303 | mo.output.replace_at_index(mo.md("Replaced output"), 0) 304 | 305 | # Display cell code 306 | mo.show_code(mo.md("This output has code displayed")) 307 | 308 | # Control Flow 309 | 310 | # Stop execution 311 | user_age = mo.ui.number(0, 100, label="Enter your age") 312 | mo.stop(user_age.value < 18, mo.md("You must be 18 or older")) 313 | mo.md(f"Your age is: {user_age.value}") 314 | 315 | # HTML 316 | 317 | # 1. Convert to HTML 318 | html_object = mo.as_html(mo.md("This is markdown converted to HTML")) 319 | html_object 320 | 321 | # 2. Html object 322 | custom_html = mo.Html("

<p>This is custom HTML</p>

") 323 | custom_html 324 | 325 | # Other API components 326 | 327 | # Query Parameters 328 | params = mo.query_params() 329 | params['name'] = 'John' 330 | 331 | # Command Line Arguments 332 | args = mo.cli_args() 333 | 334 | # State 335 | get_count, set_count = mo.state(0) 336 | mo.ui.button(on_click=lambda: set_count(get_count() + 1), label="Increment") 337 | mo.md(f"Count: {get_count()}") 338 | 339 | # App 340 | # See documentation for embedding notebooks 341 | 342 | # Cell 343 | # See documentation for running cells from other notebooks 344 | 345 | # Miscellaneous 346 | is_running_in_notebook = mo.running_in_notebook() 347 | 348 | --- Guides 349 | 350 | ## Marimo Guides: Concise Examples 351 | 352 | Here are concise examples for each guide in the Marimo documentation: 353 | 354 | ### 1. Overview 355 | 356 | ```python 357 | import marimo as mo 358 | 359 | # Define a variable 360 | x = 10 361 | 362 | # Display markdown with variable interpolation 363 | mo.md(f"The value of x is {x}") 364 | 365 | # Create a slider and display its value reactively 366 | slider = mo.ui.slider(0, 100, value=50) 367 | mo.md(f"Slider value: {slider.value}") 368 | ``` 369 | 370 | ### 2. Reactivity 371 | 372 | ```python 373 | import marimo as mo 374 | 375 | # Define a variable in one cell 376 | data = [1, 2, 3, 4, 5] 377 | 378 | # Use the variable in another cell - this cell will rerun when `data` changes 379 | mo.md(f"The sum of the data is {sum(data)}") 380 | ``` 381 | 382 | ### 3. Interactivity 383 | 384 | ```python 385 | import marimo as mo 386 | 387 | # Create a slider 388 | slider = mo.ui.slider(0, 10, label="Select a value") 389 | 390 | # Display the slider's value reactively 391 | mo.md(f"You selected: {slider.value}") 392 | ``` 393 | 394 | ### 4. SQL 395 | 396 | ```python 397 | import marimo as mo 398 | 399 | # Create a dataframe 400 | df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]}) 401 | 402 | # Query the dataframe using SQL 403 | mo.sql("SELECT * FROM df WHERE age > 30") 404 | ``` 405 | 406 | ### 5. Run as an app 407 | 408 | ```bash 409 | # Run a notebook as an interactive web app 410 | marimo run my_notebook.py 411 | ``` 412 | 413 | ### 6. Run as a script 414 | 415 | ```bash 416 | # Execute a notebook as a Python script 417 | python my_notebook.py 418 | ``` 419 | 420 | ### 7. Outputs 421 | 422 | ```python 423 | import marimo as mo 424 | 425 | # Display markdown 426 | mo.md("This is **markdown** output") 427 | 428 | # Display a matplotlib plot 429 | import matplotlib.pyplot as plt 430 | plt.plot([1, 2, 3, 4, 5]) 431 | plt.show() 432 | ``` 433 | 434 | ### 8. Dataframes 435 | 436 | ```python 437 | import marimo as mo 438 | import pandas as pd 439 | 440 | # Create a Pandas dataframe 441 | df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) 442 | 443 | # Display the dataframe in an interactive table 444 | mo.ui.table(df) 445 | ``` 446 | 447 | ### 9. Plotting 448 | 449 | ```python 450 | import marimo as mo 451 | import altair as alt 452 | from vega_datasets import data 453 | 454 | # Create a reactive Altair chart 455 | chart = alt.Chart(data.cars()).mark_point().encode(x='Horsepower', y='Miles_per_Gallon') 456 | chart = mo.ui.altair_chart(chart) 457 | 458 | # Display the chart and selected data 459 | mo.hstack([chart, chart.value]) 460 | ``` 461 | 462 | ### 10. Editor Features 463 | 464 | - Explore variable values and their definitions in the **Variables Panel**. 465 | - Visualize cell dependencies in the **Dependency Graph**. 
466 | - Use **Go-to-Definition** to jump to variable declarations. 467 | - Enable **GitHub Copilot** for AI-powered code suggestions. 468 | - Customize **Hotkeys** and **Theming** in the settings. 469 | 470 | ### 11. Theming 471 | 472 | ```python 473 | # In your notebook.py file: 474 | 475 | app = marimo.App(css_file="custom.css") 476 | ``` 477 | 478 | ### 12. Best Practices 479 | 480 | - Use global variables sparingly. 481 | - Encapsulate logic in functions and modules. 482 | - Minimize mutations. 483 | - Write idempotent cells. 484 | - Use caching for expensive computations. 485 | 486 | ### 13. Coming from other Tools 487 | 488 | - Refer to guides for specific tools like Jupyter, Jupytext, Papermill, and Streamlit to understand the transition to Marimo. 489 | 490 | ### 14. Integrating with Marimo 491 | 492 | ```python 493 | import marimo as mo 494 | 495 | # Check if running in a Marimo notebook 496 | if mo.running_in_notebook(): 497 | # Execute Marimo-specific code 498 | pass 499 | ``` 500 | 501 | ### 15. Reactive State 502 | 503 | ```python 504 | import marimo as mo 505 | 506 | # Create reactive state 507 | get_count, set_count = mo.state(0) 508 | 509 | # Increment the counter on button click 510 | mo.ui.button(on_click=lambda: set_count(get_count() + 1), label="Increment") 511 | 512 | # Display the counter value reactively 513 | mo.md(f"Count: {get_count()}") 514 | ``` 515 | 516 | ### 16. Online Playground 517 | 518 | - Create and share Marimo notebooks online at [https://marimo.new](https://marimo.new). 519 | 520 | ### 17. Exporting 521 | 522 | ```bash 523 | # Export to HTML 524 | marimo export html my_notebook.py -o my_notebook.html 525 | 526 | # Export to Python script 527 | marimo export script my_notebook.py -o my_script.py 528 | ``` 529 | 530 | ### 18. Configuration 531 | 532 | - Customize user-wide settings in `~/.marimo.toml`. 533 | - Configure notebook-specific settings in the `notebook.py` file. 534 | 535 | ### 19. Troubleshooting 536 | 537 | - Use the **Variables Panel** and **Dependency Graph** to debug cell execution issues. 538 | - Add `print` statements for debugging. 539 | - Try the "Lazy" runtime configuration for identifying stale cells. 540 | 541 | ### 20. Deploying 542 | 543 | ```bash 544 | # Deploy a Marimo notebook as an interactive web app 545 | marimo run my_notebook.py 546 | ``` 547 | 548 | --- recipes 549 | 550 | ## Marimo Recipes: Concise Examples 551 | 552 | Here are concise examples of common tasks and concepts from the Marimo Recipes section: 553 | 554 | ### Control Flow 555 | 556 | #### 1. Show an output conditionally 557 | 558 | ```python 559 | import marimo as mo 560 | 561 | show_output = mo.ui.checkbox(label="Show output") 562 | mo.md("This output is visible!") if show_output.value else None 563 | ``` 564 | 565 | #### 2. Run a cell on a timer 566 | 567 | ```python 568 | import marimo as mo 569 | import time 570 | 571 | refresh = mo.ui.refresh(default_interval="1s") 572 | refresh 573 | 574 | # This cell will run every second 575 | refresh 576 | mo.md(f"Current time: {time.time()}") 577 | ``` 578 | 579 | #### 3. Require form submission before sending UI value 580 | 581 | ```python 582 | import marimo as mo 583 | 584 | form = mo.ui.text(label="Your name").form() 585 | form 586 | 587 | # This cell will only run after form submission 588 | mo.stop(form.value is None, mo.md("Please submit the form.")) 589 | mo.md(f"Hello, {form.value}!") 590 | ``` 591 | 592 | #### 4. 
Stop execution of a cell and its descendants 593 | 594 | ```python 595 | import marimo as mo 596 | 597 | should_continue = mo.ui.checkbox(label="Continue?") 598 | 599 | # Stop execution if the checkbox is not checked 600 | mo.stop(not should_continue.value, mo.md("Execution stopped.")) 601 | 602 | # This code will only run if the checkbox is checked 603 | mo.md("Continuing execution...") 604 | ``` 605 | 606 | ### Grouping UI Elements Together 607 | 608 | #### 1. Create an array of UI elements 609 | 610 | ```python 611 | import marimo as mo 612 | 613 | n_sliders = mo.ui.number(1, 5, value=3, label="Number of sliders") 614 | sliders = mo.ui.array([mo.ui.slider(0, 100) for _ in range(n_sliders.value)]) 615 | mo.hstack(sliders) 616 | 617 | # Access slider values 618 | mo.md(f"Slider values: {sliders.value}") 619 | ``` 620 | 621 | #### 2. Create a dictionary of UI elements 622 | 623 | ```python 624 | import marimo as mo 625 | 626 | elements = mo.ui.dictionary({ 627 | 'name': mo.ui.text(label="Name"),**** 628 | 'age': mo.ui.number(0, 100, label="Age") 629 | }) 630 | 631 | # Access element values 632 | mo.md(f"Name: {elements['name'].value}, Age: {elements['age'].value}") 633 | ``` 634 | 635 | #### 3. Embed a dynamic number of UI elements in another output 636 | 637 | ```python 638 | import marimo as mo 639 | 640 | n_items = mo.ui.number(1, 5, value=3, label="Number of items") 641 | items = mo.ui.array([mo.ui.text(placeholder=f"Item {i+1}") for i in range(n_items.value)]) 642 | 643 | mo.md(f""" 644 | **My List:** 645 | 646 | * {items[0]} 647 | * {items[1]} 648 | * {items[2]} 649 | """) 650 | ``` 651 | 652 | #### 4. Create a `hstack` (or `vstack`) of UI elements with `on_change` handlers 653 | 654 | ```python 655 | import marimo as mo 656 | 657 | def handle_click(value, index): 658 | mo.md(f"Button {index} clicked!") 659 | 660 | buttons = mo.ui.array( 661 | [mo.ui.button(label=f"Button {i}", on_change=lambda v, i=i: handle_click(v, i)) 662 | for i in range(3)] 663 | ) 664 | mo.hstack(buttons) 665 | ``` 666 | 667 | #### 5. Create a table column of buttons with `on_change` handlers 668 | 669 | ```python 670 | import marimo as mo 671 | 672 | def handle_click(value, row_index): 673 | mo.md(f"Button clicked for row {row_index}") 674 | 675 | buttons = mo.ui.array( 676 | [mo.ui.button(label="Click me", on_change=lambda v, i=i: handle_click(v, i)) 677 | for i in range(3)] 678 | ) 679 | 680 | mo.ui.table({ 681 | 'Name': ['Alice', 'Bob', 'Charlie'], 682 | 'Action': buttons 683 | }) 684 | ``` 685 | 686 | #### 6. Create a form with multiple UI elements 687 | 688 | ```python 689 | import marimo as mo 690 | 691 | form = mo.md( 692 | """ 693 | **User Details** 694 | 695 | Name: {name} 696 | Age: {age} 697 | """ 698 | ).batch( 699 | name=mo.ui.text(label="Name"), 700 | age=mo.ui.number(0, 100, label="Age") 701 | ).form() 702 | form 703 | 704 | # Access form values after submission 705 | mo.md(f"Name: {form.value['name']}, Age: {form.value['age']}") 706 | ``` 707 | 708 | ### Working with Buttons 709 | 710 | #### 1. Create a button that triggers computation when clicked 711 | 712 | ```python 713 | import marimo as mo 714 | import random 715 | 716 | run_button = mo.ui.run_button(label="Generate Random Number") 717 | run_button 718 | 719 | # This cell only runs when the button is clicked 720 | mo.stop(not run_button.value, "Click 'Generate' to get a random number") 721 | mo.md(f"Random number: {random.randint(0, 100)}") 722 | ``` 723 | 724 | #### 2. 
Create a counter button 725 | 726 | ```python 727 | import marimo as mo 728 | 729 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Count") 730 | counter_button 731 | 732 | # Display the count 733 | mo.md(f"Count: {counter_button.value}") 734 | ``` 735 | 736 | #### 3. Create a toggle button 737 | 738 | ```python 739 | import marimo as mo 740 | 741 | toggle_button = mo.ui.button(value=False, on_click=lambda state: not state, label="Toggle") 742 | toggle_button 743 | 744 | # Display the toggle state 745 | mo.md(f"State: {'On' if toggle_button.value else 'Off'}") 746 | ``` 747 | 748 | #### 4. Re-run a cell when a button is pressed 749 | 750 | ```python 751 | import marimo as mo 752 | import random 753 | 754 | refresh_button = mo.ui.button(label="Refresh") 755 | refresh_button 756 | 757 | # This cell reruns when the button is clicked 758 | refresh_button 759 | mo.md(f"Random number: {random.randint(0, 100)}") 760 | ``` 761 | 762 | #### 5. Run a cell when a button is pressed, but not before 763 | 764 | ```python 765 | import marimo as mo 766 | 767 | counter_button = mo.ui.button(value=0, on_click=lambda count: count + 1, label="Click to Continue") 768 | counter_button 769 | 770 | # Only run this cell after the button is clicked 771 | mo.stop(counter_button.value == 0, "Click the button to continue.") 772 | mo.md("You clicked the button!") 773 | ``` 774 | 775 | #### 6. Reveal an output when a button is pressed 776 | 777 | ```python 778 | import marimo as mo 779 | 780 | show_button = mo.ui.button(label="Show Output") 781 | show_button 782 | 783 | # Reveal output only after button click 784 | mo.md("This is the hidden output!") if show_button.value else None 785 | ``` 786 | 787 | ### Caching 788 | 789 | #### 1. Cache expensive computations 790 | 791 | ```python 792 | import marimo as mo 793 | import functools 794 | import time 795 | 796 | @functools.cache 797 | def expensive_function(x): 798 | time.sleep(2) # Simulate a long computation 799 | return x * 2 800 | 801 | # Call the function multiple times with the same argument 802 | result1 = expensive_function(5) 803 | result2 = expensive_function(5) # This will be retrieved from the cache 804 | 805 | mo.md(f"Result 1: {result1}, Result 2: {result2}") 806 | ``` 807 | 808 | These concise examples provide practical illustrations of various recipes, showcasing how Marimo can be used to create interactive, dynamic, and efficient notebooks. 809 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Marimo Reactive Notebook Prompt Library 2 | > Starter codebase to use Marimo reactive notebooks to build a reusable, customizable, Prompt Library. 3 | > 4 | > Take this codebase and use it as a starter codebase to build your own personal prompt library. 5 | > 6 | > Marimo reactive notebooks & Prompt Library [walkthrough](https://youtu.be/PcLkBkQujMI) 7 | > 8 | > Run multiple prompts against multiple models (SLMs & LLMs) [walkthrough](https://youtu.be/VC6QCEXERpU) 9 | 10 | multi llm prompting 11 | 12 | marimo promptlibrary 13 | 14 | ## 1. 
Understand Marimo Notebook 15 | > This is a simple demo of the Marimo Reactive Notebook 16 | - Install the hypermodern [uv Python package and project manager](https://docs.astral.sh/uv/getting-started/installation/) 17 | - Install dependencies `uv sync` 18 | - Install marimo `uv pip install marimo` 19 | - To Edit, Run `uv run marimo edit marimo_is_awesome_demo.py` 20 | - To View, Run `uv run marimo run marimo_is_awesome_demo.py` 21 | - Then use your favorite IDE & AI coding assistant to edit `marimo_is_awesome_demo.py` directly or via the UI. 22 | 23 | ## 2. Ad-hoc Prompt Notebook 24 | > Quickly run and test prompts across models 25 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`) 26 | - Add other keys and update the notebook to add support for additional SOTA LLMs 27 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use 28 | - Update the notebook to use the Ollama models you have installed 29 | - To Edit, Run `uv run marimo edit adhoc_prompting.py` 30 | - To View, Run `uv run marimo run adhoc_prompting.py` 31 | 32 | ## 3. ⭐️ Prompt Library Notebook 33 | > Build, Manage, Reuse, Version, and Iterate on your Prompt Library 34 | - 🟡 Copy `.env.sample` to `.env` and set your keys (minimally set `OPENAI_API_KEY`) 35 | - Add other keys and update the notebook to add support for additional SOTA LLMs 36 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use 37 | - Update the notebook to use the Ollama models you have installed 38 | - To Edit, Run `uv run marimo edit prompt_library.py` 39 | - To View, Run `uv run marimo run prompt_library.py` 40 | 41 | ## 4. Multi-LLM Prompt 42 | > Quickly test a single prompt across multiple language models 43 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to use 44 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use 45 | - Update the notebook to use the Ollama models you have installed 46 | - To Edit, Run `uv run marimo edit multi_llm_prompting.py` 47 | - To View, Run `uv run marimo run multi_llm_prompting.py` 48 | 49 | ## 5. Multi Language Model Ranker 50 | > Compare and rank multiple language models across various prompts 51 | - 🟡 Ensure your `.env` file is set up with the necessary API keys for the models you want to compare 52 | - 🟡 Install Ollama (https://ollama.ai/) and pull the models you want to use 53 | - Update the notebook to use the Ollama models you have installed 54 | - To Edit, Run `uv run marimo edit multi_language_model_ranker.py` 55 | - To View, Run `uv run marimo run multi_language_model_ranker.py` 56 | 57 | ## General Usage 58 | > See the [Marimo Docs](https://docs.marimo.io/index.html) for general usage details 59 | 60 | ## Personal Prompt Library Use-Cases 61 | - Ad-hoc prompting 62 | - Prompt reuse 63 | - Prompt versioning 64 | - Interactive prompts 65 | - Prompt testing & benchmarking 66 | - LLM comparison 67 | - Prompt templating 68 | - Run a single prompt against multiple LLMs & SLMs 69 | - Compare multiple prompts against multiple LLMs & SLMs 70 | - Anything you can imagine! 71 | 72 | ## Advantages of Marimo 73 | 74 | ### Key Advantages 75 | > Rapid Prototyping: Seamlessly transition between user and builder mode with `cmd+.` to toggle. Consumer vs Producer. UI vs Code. 76 | 77 | > Interactivity: Built-in reactive UI elements enable intuitive data exploration and visualization. 78 | 79 | > Reactivity: Cells automatically update when dependencies change, ensuring a smooth and efficient workflow.
80 | 81 | > Out of the box: Use sliders, textareas, buttons, images, dataframe GUIs, plotting, and other interactive elements to quickly iterate on ideas. 82 | 83 | > It's 'just' Python: Pure Python scripts for easy version control and AI coding. 84 | 85 | 86 | - **Reactive Execution**: Run one cell, and marimo automatically updates all affected cells. This eliminates the need to manually manage notebook state. 87 | - **Interactive Elements**: Provides reactive UI elements like dataframe GUIs and plots, making data exploration fast and intuitive. 88 | - **Python-First Design**: Notebooks are pure Python scripts stored as `.py` files. They can be versioned with git, run as scripts, and imported into other Python code. 89 | - **Reproducible by Default**: Deterministic execution order with no hidden state ensures consistent and reproducible results. 90 | - **Built for Collaboration**: Git-friendly notebooks where small changes yield small diffs, facilitating collaboration. 91 | - **Developer-Friendly Features**: Includes GitHub Copilot, autocomplete, hover tooltips, vim keybindings, code formatting, debugging panels, and extensive hotkeys. 92 | - **Seamless Transition to Production**: Notebooks can be run as scripts or deployed as read-only web apps. 93 | - **Versatile Use Cases**: Ideal for experimenting with data and models, building internal tools, communicating research, education, and creating interactive dashboards. 94 | 95 | ### Advantages Over Jupyter Notebooks 96 | 97 | - **Reactive Notebook**: Automatically updates dependent cells when code or values change, unlike Jupyter where cells must be manually re-executed. 98 | - **Pure Python Notebooks**: Stored as `.py` files instead of JSON, making them easier to version control, lint, and integrate with Python tooling. 99 | - **No Hidden State**: Deleting a cell removes its variables and updates affected cells, reducing errors from stale variables. 100 | - **Better Git Integration**: Plain Python scripts result in smaller diffs and more manageable version control compared to Jupyter's JSON format. 101 | - **Import Symbols**: Allows importing symbols from notebooks into other notebooks or Python files. 102 | - **Enhanced Interactivity**: Built-in reactive UI elements provide a more interactive experience than standard Jupyter widgets. 103 | - **App Deployment**: Notebooks can be served as web apps or exported to static HTML for easier sharing and deployment. 104 | - **Advanced Developer Tools**: Features like code formatting, GitHub Copilot integration, and debugging panels enhance the development experience. 105 | - **Script Execution**: Can be executed as standard Python scripts, facilitating integration into pipelines and scripts without additional tools. 
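To make these advantages concrete, here is a minimal illustrative sketch (not one of the notebooks in this repo) of what a marimo notebook looks like on disk. Cells are plain Python functions whose parameters declare the variables they depend on, which is what lets marimo rerun affected cells automatically and lets the same `.py` file be executed as a script:

```python
import marimo

app = marimo.App()


@app.cell
def __():
    import marimo as mo
    return (mo,)


@app.cell
def __(mo):
    # A reactive UI element; any cell that reads `slider.value` reruns when it changes.
    slider = mo.ui.slider(0, 100, value=25, label="Value")
    slider
    return (slider,)


@app.cell
def __(mo, slider):
    # This cell depends on `slider`, so marimo re-executes it automatically on change.
    mo.md(f"The slider is set to {slider.value}")
    return


if __name__ == "__main__":
    # Because the notebook is plain Python, it can also run as a script:
    #   python my_notebook.py
    app.run()
```

See `adhoc_prompting.py` or `marimo_is_awesome_demo.py` in this repo for complete, working notebooks with the same structure.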
106 | 107 | ## Resources 108 | - https://docs.astral.sh/uv/ 109 | - https://docs.marimo.io/index.html 110 | - https://youtu.be/PcLkBkQujMI 111 | - https://github.com/BuilderIO/gpt-crawler 112 | - https://github.com/simonw/llm 113 | - https://ollama.com/ 114 | - https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ 115 | - https://qwenlm.github.io/ -------------------------------------------------------------------------------- /adhoc_prompting.py: -------------------------------------------------------------------------------- 1 | import marimo 2 | 3 | __generated_with = "0.8.18" 4 | app = marimo.App(width="medium") 5 | 6 | 7 | @app.cell 8 | def __(): 9 | import marimo as mo 10 | from src.marimo_notebook.modules import llm_module 11 | import json 12 | return json, llm_module, mo 13 | 14 | 15 | @app.cell 16 | def __(llm_module): 17 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series() 18 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest() 19 | # llm_sonnet = llm_module.build_sonnet_3_5() 20 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo() 21 | 22 | models = { 23 | "o1-mini": llm_o1_mini, 24 | "o1-preview": llm_o1_preview, 25 | "gpt-4o-latest": llm_gpt_4o_latest, 26 | "gpt-4o-mini": llm_gpt_4o_mini, 27 | # "sonnet-3.5": llm_sonnet, 28 | # "gemini-1-5-pro": gemini_1_5_pro, 29 | # "gemini-1-5-flash": gemini_1_5_flash, 30 | } 31 | return ( 32 | llm_gpt_4o_latest, 33 | llm_gpt_4o_mini, 34 | llm_o1_mini, 35 | llm_o1_preview, 36 | models, 37 | ) 38 | 39 | 40 | @app.cell 41 | def __(mo, models): 42 | prompt_text_area = mo.ui.text_area(label="Prompt", full_width=True) 43 | prompt_temp_slider = mo.ui.slider( 44 | start=0, stop=1, value=0.5, step=0.05, label="Temp" 45 | ) 46 | model_dropdown = mo.ui.dropdown( 47 | options=models.copy(), 48 | label="Model", 49 | value="gpt-4o-mini", 50 | ) 51 | multi_model_checkbox = mo.ui.checkbox(label="Run on All Models", value=False) 52 | 53 | form = ( 54 | mo.md( 55 | r""" 56 | # Ad-hoc Prompt 57 | {prompt} 58 | {temp} 59 | {model} 60 | {multi_model} 61 | """ 62 | ) 63 | .batch( 64 | prompt=prompt_text_area, 65 | temp=prompt_temp_slider, 66 | model=model_dropdown, 67 | multi_model=multi_model_checkbox, 68 | ) 69 | .form() 70 | ) 71 | form 72 | return ( 73 | form, 74 | model_dropdown, 75 | multi_model_checkbox, 76 | prompt_temp_slider, 77 | prompt_text_area, 78 | ) 79 | 80 | 81 | @app.cell 82 | def __(form, mo): 83 | mo.stop(not form.value or not len(form.value), "") 84 | 85 | # Format the form data for the table 86 | formatted_data = {} 87 | for key, value in form.value.items(): 88 | if key == "model": 89 | formatted_data[key] = value.model_id 90 | elif key == "multi_model": 91 | formatted_data[key] = value 92 | else: 93 | formatted_data[key] = value 94 | 95 | # Create and display the table 96 | table = mo.ui.table( 97 | [formatted_data], # Wrap in a list to create a single-row table 98 | label="", 99 | selection=None, 100 | ) 101 | 102 | mo.md(f"# Form Values\n\n{table}") 103 | return formatted_data, key, table, value 104 | 105 | 106 | @app.cell 107 | def __(form, llm_module, mo): 108 | mo.stop(not form.value or form.value["multi_model"], "") 109 | 110 | prompt_response = None 111 | 112 | with mo.status.spinner(title="Loading..."): 113 | prompt_response = llm_module.prompt_with_temp( 114 | form.value["model"], form.value["prompt"], form.value["temp"] 115 | ) 116 | 117 | mo.md(f"# Prompt Output\n\n{prompt_response}").style( 118 | {"background": "#eee", "padding": "10px", 
"border-radius": "10px"} 119 | ) 120 | return (prompt_response,) 121 | 122 | 123 | @app.cell 124 | def __(form, llm_module, mo, models): 125 | prompt_responses = [] 126 | 127 | mo.stop(not form.value or not form.value["multi_model"], "") 128 | 129 | with mo.status.spinner(title="Running prompts on all models..."): 130 | for model_name, model in models.items(): 131 | response = llm_module.prompt_with_temp( 132 | model, form.value["prompt"], form.value["temp"] 133 | ) 134 | prompt_responses.append( 135 | { 136 | "model_id": model_name, 137 | "output": response, 138 | } 139 | ) 140 | return model, model_name, prompt_responses, response 141 | 142 | 143 | @app.cell 144 | def __(mo, prompt_responses): 145 | mo.stop(not len(prompt_responses), "") 146 | 147 | # Create a table using mo.ui.table 148 | multi_model_table = mo.ui.table( 149 | prompt_responses, label="Multi-Model Prompt Outputs", selection=None 150 | ) 151 | 152 | mo.vstack( 153 | [ 154 | mo.md("# Multi-Model Prompt Outputs"), 155 | mo.ui.table(prompt_responses, selection=None), 156 | ] 157 | ) 158 | return (multi_model_table,) 159 | 160 | 161 | if __name__ == "__main__": 162 | app.run() 163 | -------------------------------------------------------------------------------- /images/marimo_prompt_library.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/images/marimo_prompt_library.png -------------------------------------------------------------------------------- /images/multi_slm_llm_prompt_and_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/images/multi_slm_llm_prompt_and_model.png -------------------------------------------------------------------------------- /language_model_rankings/rankings.json: -------------------------------------------------------------------------------- 1 | [ 2 | { 3 | "llm_model_id": "gpt-4o-mini", 4 | "score": 2 5 | }, 6 | { 7 | "llm_model_id": "llama3.2:latest", 8 | "score": 2 9 | }, 10 | { 11 | "llm_model_id": "gemini-1.5-flash-002", 12 | "score": 2 13 | } 14 | ] -------------------------------------------------------------------------------- /layouts/adhoc_prompting.grid.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "grid", 3 | "data": { 4 | "columns": 24, 5 | "rowHeight": 20, 6 | "maxWidth": 1400, 7 | "bordered": true, 8 | "cells": [ 9 | { 10 | "position": null 11 | }, 12 | { 13 | "position": null 14 | }, 15 | { 16 | "position": [ 17 | 1, 18 | 6, 19 | 11, 20 | 18 21 | ] 22 | }, 23 | { 24 | "position": [ 25 | 1, 26 | 25, 27 | 11, 28 | 10 29 | ] 30 | }, 31 | { 32 | "position": [ 33 | 13, 34 | 1, 35 | 10, 36 | 82 37 | ] 38 | }, 39 | { 40 | "position": null 41 | }, 42 | { 43 | "position": null 44 | } 45 | ] 46 | } 47 | } -------------------------------------------------------------------------------- /layouts/adhoc_prompting.slides.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "slides", 3 | "data": {} 4 | } -------------------------------------------------------------------------------- /layouts/multi_language_model_ranker.grid.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "grid", 3 | "data": { 4 | "columns": 22, 5 | "rowHeight": 20, 6 | "maxWidth": 
2000, 7 | "bordered": true, 8 | "cells": [ 9 | { 10 | "position": null 11 | }, 12 | { 13 | "position": null 14 | }, 15 | { 16 | "position": null 17 | }, 18 | { 19 | "position": null 20 | }, 21 | { 22 | "position": null 23 | }, 24 | { 25 | "position": [ 26 | 0, 27 | 0, 28 | 10, 29 | 13 30 | ] 31 | }, 32 | { 33 | "position": null 34 | }, 35 | { 36 | "position": null 37 | }, 38 | { 39 | "position": [ 40 | 10, 41 | 0, 42 | 11, 43 | 31 44 | ] 45 | }, 46 | { 47 | "position": null 48 | }, 49 | { 50 | "position": [ 51 | 0, 52 | 13, 53 | 10, 54 | 18 55 | ] 56 | }, 57 | { 58 | "position": null 59 | }, 60 | { 61 | "position": null 62 | } 63 | ] 64 | } 65 | } -------------------------------------------------------------------------------- /layouts/multi_llm_prompting.grid.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "grid", 3 | "data": { 4 | "columns": 24, 5 | "rowHeight": 20, 6 | "maxWidth": 1400, 7 | "bordered": true, 8 | "cells": [ 9 | { 10 | "position": null 11 | }, 12 | { 13 | "position": null 14 | }, 15 | { 16 | "position": [ 17 | 0, 18 | 0, 19 | 10, 20 | 16 21 | ] 22 | }, 23 | { 24 | "position": null 25 | }, 26 | { 27 | "position": [ 28 | 10, 29 | 0, 30 | 14, 31 | 25 32 | ] 33 | }, 34 | { 35 | "position": [ 36 | 2, 37 | 17, 38 | 6, 39 | 6 40 | ] 41 | } 42 | ] 43 | } 44 | } -------------------------------------------------------------------------------- /layouts/prompt_library.grid.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "grid", 3 | "data": { 4 | "columns": 24, 5 | "rowHeight": 15, 6 | "maxWidth": 1200, 7 | "bordered": true, 8 | "cells": [ 9 | { 10 | "position": null 11 | }, 12 | { 13 | "position": null 14 | }, 15 | { 16 | "position": [ 17 | 1, 18 | 2, 19 | 6, 20 | 30 21 | ] 22 | }, 23 | { 24 | "position": [ 25 | 1, 26 | 34, 27 | 6, 28 | 13 29 | ] 30 | }, 31 | { 32 | "position": [ 33 | 8, 34 | 2, 35 | 11, 36 | 51 37 | ], 38 | "scrollable": true, 39 | "side": "left" 40 | } 41 | ] 42 | } 43 | } -------------------------------------------------------------------------------- /layouts/prompt_library.slides.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "slides", 3 | "data": {} 4 | } -------------------------------------------------------------------------------- /marimo_is_awesome_demo.py: -------------------------------------------------------------------------------- 1 | import marimo 2 | 3 | __generated_with = "0.8.18" 4 | app = marimo.App(width="full") 5 | 6 | 7 | @app.cell 8 | def __(): 9 | import random 10 | import marimo as mo 11 | import pandas as pd 12 | import matplotlib.pyplot as plt 13 | from vega_datasets import data 14 | import io 15 | import altair as alt 16 | return alt, data, io, mo, pd, plt, random 17 | 18 | 19 | @app.cell 20 | def __(mo): 21 | mo.md( 22 | """ 23 | # Marimo Awesome Examples 24 | 25 | This notebook demonstrates various features and capabilities of Marimo. Explore the different sections to see how Marimo can be used for interactive data analysis, visualization, and more! 26 | 27 | --- 28 | """ 29 | ) 30 | return 31 | 32 | 33 | @app.cell 34 | def __(mo): 35 | mo.md( 36 | """ 37 | ## 1. 
Basic UI Elements 38 | 39 | --- 40 | """ 41 | ) 42 | return 43 | 44 | 45 | @app.cell 46 | def __(mo): 47 | slider = mo.ui.slider(1, 10, value=5, label="Slider Example") 48 | checkbox = mo.ui.checkbox(label="Checkbox Example") 49 | text_input = mo.ui.text(placeholder="Enter text here", label="Text Input Example") 50 | 51 | mo.vstack([slider, checkbox, text_input]) 52 | return checkbox, slider, text_input 53 | 54 | 55 | @app.cell 56 | def __(checkbox, mo, slider, text_input): 57 | mo.md( 58 | f""" 59 | Slider value: {slider.value} 60 | Checkbox state: {checkbox.value} 61 | Text input: {text_input.value} 62 | Slider * Text input: {slider.value * "⭐️"} 63 | """ 64 | ) 65 | return 66 | 67 | 68 | @app.cell 69 | def __(mo): 70 | mo.md( 71 | """ 72 | ## 2. Reactive Data Visualization 73 | --- 74 | """ 75 | ) 76 | return 77 | 78 | 79 | @app.cell 80 | def __(mo, pd): 81 | # Create a sample dataset 82 | sample_df = pd.DataFrame( 83 | {"x": range(1, 11), "y": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]} 84 | ) 85 | 86 | plot_type = mo.ui.dropdown( 87 | options=["scatter", "line", "bar"], value="scatter", label="Select Plot Type" 88 | ) 89 | 90 | mo.vstack( 91 | [ 92 | plot_type, 93 | # mo.ui.table(sample_df, selection=None) 94 | ] 95 | ) 96 | return plot_type, sample_df 97 | 98 | 99 | @app.cell 100 | def __(mo, plot_type, plt, sample_df): 101 | plt.figure(figsize=(10, 6)) 102 | 103 | if plot_type.value == "scatter": 104 | plt.scatter(sample_df["x"], sample_df["y"]) 105 | elif plot_type.value == "line": 106 | plt.plot(sample_df["x"], sample_df["y"]) 107 | else: 108 | plt.bar(sample_df["x"], sample_df["y"]) 109 | 110 | plt.xlabel("X") 111 | plt.ylabel("Y") 112 | plt.title(f"{plot_type.value.capitalize()} Plot") 113 | mo.mpl.interactive(plt.gcf()) 114 | return 115 | 116 | 117 | @app.cell 118 | def __(mo): 119 | mo.md("""## 3. Conditional Output and Control Flow""") 120 | return 121 | 122 | 123 | @app.cell 124 | def __(mo): 125 | show_secret = mo.ui.checkbox(label="Show Secret Message") 126 | show_secret 127 | return (show_secret,) 128 | 129 | 130 | @app.cell 131 | def __(mo, show_secret): 132 | mo.stop(not show_secret.value, mo.md("Check the box to reveal the secret message!")) 133 | mo.md( 134 | "🎉 Congratulations! You've unlocked the secret message: Marimo is awesome! 🎉" 135 | ) 136 | return 137 | 138 | 139 | @app.cell 140 | def __(mo): 141 | mo.md("""## 4. File Handling and Data Processing""") 142 | return 143 | 144 | 145 | @app.cell 146 | def __(mo): 147 | file_upload = mo.ui.file(label="Upload a CSV file") 148 | file_upload 149 | return (file_upload,) 150 | 151 | 152 | @app.cell 153 | def __(file_upload, io, mo, pd): 154 | mo.stop( 155 | not file_upload.value, mo.md("Please upload a CSV file to see the preview.") 156 | ) 157 | 158 | uploaded_df = pd.read_csv(io.BytesIO(file_upload.value[0].contents)) 159 | mo.md(f"### Uploaded File Preview") 160 | mo.ui.table(uploaded_df) 161 | return (uploaded_df,) 162 | 163 | 164 | @app.cell 165 | def __(mo): 166 | mo.md("""## 5. 
Advanced UI Components""") 167 | return 168 | 169 | 170 | @app.cell 171 | def __(mo, pd): 172 | accordion = mo.accordion( 173 | { 174 | "Section 1": mo.md("This is the content of section 1."), 175 | "Section 2": mo.ui.slider(0, 100, value=50, label="Nested Slider"), 176 | "Section 3": mo.ui.table(pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})), 177 | } 178 | ) 179 | accordion 180 | return (accordion,) 181 | 182 | 183 | @app.cell 184 | def __(mo): 185 | tabs = mo.ui.tabs( 186 | { 187 | "Tab 1": mo.md("Content of Tab 1"), 188 | "Tab 2": mo.ui.button(label="Click me!"), 189 | "Tab 3": mo.mermaid( 190 | """ 191 | graph TD 192 | A[Start] --> B{Decision} 193 | B -->|Yes| C[Do Something] 194 | B -->|No| D[Do Nothing] 195 | C --> E[End] 196 | D --> E 197 | """ 198 | ), 199 | } 200 | ) 201 | tabs 202 | return (tabs,) 203 | 204 | 205 | @app.cell 206 | def __(mo): 207 | mo.md("""## 6. Batch Operations and Forms""") 208 | return 209 | 210 | 211 | @app.cell 212 | def __(mo): 213 | user_form = ( 214 | mo.md( 215 | """ 216 | ### User Information Form 217 | 218 | First Name: {first_name} 219 | Last Name: {last_name} 220 | Age: {age} 221 | Email: {email} 222 | """ 223 | ) 224 | .batch( 225 | first_name=mo.ui.text(label="First Name"), 226 | last_name=mo.ui.text(label="Last Name"), 227 | age=mo.ui.number(start=0, stop=120, label="Age"), 228 | email=mo.ui.text(label="Email"), 229 | ) 230 | .form() 231 | ) 232 | 233 | user_form 234 | return (user_form,) 235 | 236 | 237 | @app.cell 238 | def __(mo, user_form): 239 | mo.stop( 240 | not user_form.value.get("first_name"), 241 | mo.md("Please submit the form to see the results."), 242 | ) 243 | 244 | mo.md( 245 | f""" 246 | ### Submitted Information 247 | 248 | - **First Name:** {user_form.value['first_name']} 249 | - **Last Name:** {user_form.value['last_name']} 250 | - **Age:** {user_form.value['age']} 251 | - **Email:** {user_form.value['email']} 252 | """ 253 | ) 254 | return 255 | 256 | 257 | @app.cell 258 | def __(mo): 259 | mo.md("""## 7. Embedding External Content""") 260 | return 261 | 262 | 263 | @app.cell 264 | def __(mo): 265 | mo.image("https://marimo.io/logo.png", width=200, alt="Marimo Logo") 266 | return 267 | 268 | 269 | @app.cell 270 | def __(mo): 271 | mo.video( 272 | "https://v3.cdnpk.net/videvo_files/video/free/2013-08/large_watermarked/hd0992_preview.mp4", 273 | width=560, 274 | height=315, 275 | ) 276 | return 277 | 278 | 279 | @app.cell 280 | def __(mo): 281 | mo.md("""## 8. Custom Styling and Layouts""") 282 | return 283 | 284 | 285 | @app.cell 286 | def __(mo): 287 | styled_text = mo.md( 288 | """ 289 | # Custom Styled Header 290 | 291 | This text has custom styling applied. 292 | """ 293 | ).style( 294 | { 295 | "font-style": "italic", 296 | "background-color": "#aaa", 297 | "padding": "10px", 298 | "border-radius": "5px", 299 | }, 300 | ) 301 | 302 | styled_text 303 | return (styled_text,) 304 | 305 | 306 | @app.cell 307 | def __(mo): 308 | layout = mo.vstack( 309 | [ 310 | mo.hstack( 311 | [ 312 | mo.md("Left Column").style( 313 | { 314 | "background-color": "#e0e0e0", 315 | "padding": "10px", 316 | } 317 | ), 318 | mo.md("Right Column").style( 319 | { 320 | "background-color": "#d0d0d0", 321 | "padding": "10px", 322 | } 323 | ), 324 | ] 325 | ), 326 | mo.md("Bottom Row").style( 327 | {"background-color": "#c0c0c0", "padding": "10px"} 328 | ), 329 | ] 330 | ) 331 | 332 | layout 333 | return (layout,) 334 | 335 | 336 | @app.cell 337 | def __(mo): 338 | mo.md( 339 | """ 340 | ## 9. 
Interactive Data Exploration 341 | --- 342 | """ 343 | ) 344 | return 345 | 346 | 347 | @app.cell 348 | def __(data, mo): 349 | cars = data.cars() 350 | mo.ui.data_explorer(cars) 351 | return (cars,) 352 | 353 | 354 | @app.cell 355 | def __(alt, data, mo): 356 | chart = ( 357 | alt.Chart(data.cars()) 358 | .mark_circle() 359 | .encode( 360 | x="Horsepower", 361 | y="Miles_per_Gallon", 362 | color="Origin", 363 | tooltip=["Name", "Origin", "Horsepower", "Miles_per_Gallon"], 364 | ) 365 | .interactive() 366 | ) 367 | 368 | mo.ui.altair_chart(chart) 369 | return (chart,) 370 | 371 | 372 | @app.cell 373 | def __(mo): 374 | mo.md( 375 | """ 376 | ## Conclusion 377 | 378 | This notebook has demonstrated various features and capabilities of Marimo. From basic UI elements to advanced data visualization and interactive components, Marimo provides a powerful toolkit for creating dynamic and engaging notebooks. 379 | 380 | Explore the code in each cell to learn more about how these examples were created! 381 | """ 382 | ) 383 | return 384 | 385 | 386 | if __name__ == "__main__": 387 | app.run() 388 | -------------------------------------------------------------------------------- /multi_language_model_ranker.py: -------------------------------------------------------------------------------- 1 | import marimo 2 | 3 | __generated_with = "0.8.18" 4 | app = marimo.App(width="full") 5 | 6 | 7 | @app.cell 8 | def __(): 9 | import marimo as mo 10 | import src.marimo_notebook.modules.llm_module as llm_module 11 | import src.marimo_notebook.modules.prompt_library_module as prompt_library_module 12 | import json 13 | import pyperclip 14 | return json, llm_module, mo, prompt_library_module, pyperclip 15 | 16 | 17 | @app.cell 18 | def __(prompt_library_module): 19 | map_testable_prompts: dict = prompt_library_module.pull_in_testable_prompts() 20 | return (map_testable_prompts,) 21 | 22 | 23 | @app.cell 24 | def __(llm_module): 25 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series() 26 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest() 27 | # llm_sonnet = llm_module.build_sonnet_3_5() 28 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo() 29 | # gemini_1_5_pro_2, gemini_1_5_flash_2 = llm_module.build_gemini_1_2_002() 30 | # llama3_2_model, llama3_2_1b_model = llm_module.build_ollama_models() 31 | # _, phi3_5_model, qwen2_5_model = llm_module.build_ollama_slm_models() 32 | 33 | models = { 34 | "o1-mini": llm_o1_mini, 35 | "o1-preview": llm_o1_preview, 36 | "gpt-4o-latest": llm_gpt_4o_latest, 37 | "gpt-4o-mini": llm_gpt_4o_mini, 38 | # "sonnet-3.5": llm_sonnet, 39 | # "gemini-1-5-pro": gemini_1_5_pro, 40 | # "gemini-1-5-flash": gemini_1_5_flash, 41 | # "gemini-1-5-pro-002": gemini_1_5_pro_2, 42 | # "gemini-1-5-flash-002": gemini_1_5_flash_2, 43 | # "llama3-2": llama3_2_model, 44 | # "llama3-2-1b": llama3_2_1b_model, 45 | # "phi3-5": phi3_5_model, 46 | # "qwen2-5": qwen2_5_model, 47 | } 48 | return ( 49 | llm_gpt_4o_latest, 50 | llm_gpt_4o_mini, 51 | llm_o1_mini, 52 | llm_o1_preview, 53 | models, 54 | ) 55 | 56 | 57 | @app.cell 58 | def __(map_testable_prompts, mo, models): 59 | prompt_multiselect = mo.ui.multiselect( 60 | options=list(map_testable_prompts.keys()), 61 | label="Select Prompts", 62 | ) 63 | prompt_temp_slider = mo.ui.slider( 64 | start=0, stop=1, value=0.5, step=0.05, label="Temp" 65 | ) 66 | model_multiselect = mo.ui.multiselect( 67 | options=models.copy(), 68 | label="Models", 69 | value=["gpt-4o-mini",], 70 | ) 71 | return 
model_multiselect, prompt_multiselect, prompt_temp_slider 72 | 73 | 74 | @app.cell 75 | def __(): 76 | prompt_style = { 77 | "background": "#eee", 78 | "padding": "10px", 79 | "border-radius": "10px", 80 | "margin-bottom": "20px", 81 | } 82 | return (prompt_style,) 83 | 84 | 85 | @app.cell 86 | def __(mo, model_multiselect, prompt_multiselect, prompt_temp_slider): 87 | form = ( 88 | mo.md( 89 | r""" 90 | # Multi Language Model Ranker 📊 91 | {prompts} 92 | {temp} 93 | {models} 94 | """ 95 | ) 96 | .batch( 97 | prompts=prompt_multiselect, 98 | temp=prompt_temp_slider, 99 | models=model_multiselect, 100 | ) 101 | .form() 102 | ) 103 | form 104 | return (form,) 105 | 106 | 107 | @app.cell 108 | def __(form, map_testable_prompts, mo, prompt_style): 109 | mo.stop(not form.value) 110 | 111 | selected_models_string = mo.ui.array( 112 | [mo.ui.text(value=m.model_id, disabled=True) for m in form.value["models"]] 113 | ) 114 | 115 | selected_prompts_accordion = mo.accordion( 116 | { 117 | prompt: mo.md(f"```xml\n{map_testable_prompts[prompt]}\n```") 118 | for prompt in form.value["prompts"] 119 | } 120 | ) 121 | 122 | mo.vstack( 123 | [ 124 | mo.md("## Selected Models"), 125 | mo.hstack(selected_models_string, align="start", justify="start"), 126 | mo.md("## Selected Prompts"), 127 | selected_prompts_accordion, 128 | ] 129 | ).style(prompt_style) 130 | return selected_models_string, selected_prompts_accordion 131 | 132 | 133 | @app.cell 134 | def __(form, llm_module, map_testable_prompts, mo, prompt_library_module): 135 | mo.stop(not form.value, "") 136 | 137 | all_prompt_responses = [] 138 | 139 | total_executions = len(form.value["prompts"]) * len(form.value["models"]) 140 | 141 | with mo.status.progress_bar( 142 | title="Running prompts on selected models...", 143 | total=total_executions, 144 | remove_on_exit=True, 145 | ) as prog_bar: 146 | for selected_prompt_name in form.value["prompts"]: 147 | selected_prompt = map_testable_prompts[selected_prompt_name] 148 | prompt_responses = [] 149 | 150 | for model in form.value["models"]: 151 | model_name = model.model_id 152 | prog_bar.update( 153 | title=f"Prompting '{model_name}' with '{selected_prompt_name}'", 154 | increment=1, 155 | ) 156 | raw_prompt_response = llm_module.prompt_with_temp( 157 | model, selected_prompt, form.value["temp"] 158 | ) 159 | prompt_responses.append( 160 | { 161 | "model_id": model_name, 162 | "model": model, 163 | "output": raw_prompt_response, 164 | } 165 | ) 166 | 167 | # Create a new list without the 'model' key for each response 168 | list_model_execution_dict = [ 169 | {k: v for k, v in response.items() if k != "model"} 170 | for response in prompt_responses 171 | ] 172 | 173 | # Record the execution 174 | execution_filepath = prompt_library_module.record_llm_execution( 175 | prompt=selected_prompt, 176 | list_model_execution_dict=list_model_execution_dict, 177 | prompt_template=selected_prompt_name, 178 | ) 179 | print(f"Execution record saved to: {execution_filepath}") 180 | 181 | all_prompt_responses.append( 182 | { 183 | "prompt_name": selected_prompt_name, 184 | "prompt": selected_prompt, 185 | "responses": prompt_responses, 186 | "execution_filepath": execution_filepath, 187 | } 188 | ) 189 | return ( 190 | all_prompt_responses, 191 | execution_filepath, 192 | list_model_execution_dict, 193 | model, 194 | model_name, 195 | prog_bar, 196 | prompt_responses, 197 | raw_prompt_response, 198 | selected_prompt, 199 | selected_prompt_name, 200 | total_executions, 201 | ) 202 | 203 | 204 | @app.cell 205 | def 
__(all_prompt_responses, mo, pyperclip): 206 | mo.stop(not all_prompt_responses, mo.md("")) 207 | 208 | def copy_to_clipboard(text): 209 | print("copying: ", text) 210 | pyperclip.copy(text) 211 | return 1 212 | 213 | all_prompt_elements = [] 214 | 215 | output_prompt_style = { 216 | "background": "#eee", 217 | "padding": "10px", 218 | "border-radius": "10px", 219 | "margin-bottom": "20px", 220 | "min-width": "200px", 221 | "box-shadow": "2px 2px 2px #ccc", 222 | } 223 | 224 | for loop_prompt_data in all_prompt_responses: 225 | prompt_output_elements = [ 226 | mo.vstack( 227 | [ 228 | mo.md(f"#### {response['model_id']}").style( 229 | {"font-weight": "bold"} 230 | ), 231 | mo.md(response["output"]), 232 | ] 233 | ).style(output_prompt_style) 234 | for response in loop_prompt_data["responses"] 235 | ] 236 | 237 | prompt_element = mo.vstack( 238 | [ 239 | mo.md(f"### Prompt: {loop_prompt_data['prompt_name']}"), 240 | mo.hstack(prompt_output_elements, wrap=True, justify="start"), 241 | ] 242 | ).style( 243 | { 244 | "border-left": "4px solid #CCC", 245 | "padding": "2px 10px", 246 | "background": "#ffffee", 247 | } 248 | ) 249 | 250 | all_prompt_elements.append(prompt_element) 251 | 252 | mo.vstack(all_prompt_elements) 253 | return ( 254 | all_prompt_elements, 255 | copy_to_clipboard, 256 | loop_prompt_data, 257 | output_prompt_style, 258 | prompt_element, 259 | prompt_output_elements, 260 | ) 261 | 262 | 263 | @app.cell 264 | def __(all_prompt_responses, copy_to_clipboard, form, mo): 265 | mo.stop(not all_prompt_responses, mo.md("")) 266 | mo.stop(not form.value, mo.md("")) 267 | 268 | # Prepare data for the table 269 | table_data = [] 270 | for prompt_data in all_prompt_responses: 271 | for response in prompt_data["responses"]: 272 | table_data.append( 273 | { 274 | "Prompt": prompt_data["prompt_name"], 275 | "Model": response["model_id"], 276 | "Output": response["output"], 277 | } 278 | ) 279 | 280 | # Create the table 281 | results_table = mo.ui.table( 282 | data=table_data, 283 | pagination=True, 284 | selection="multi", 285 | page_size=30, 286 | label="Model Responses", 287 | format_mapping={ 288 | "Output": lambda val: "(trimmed) " + val[:15], 289 | # "Output": lambda val: val, 290 | }, 291 | ) 292 | 293 | # Function to copy selected outputs to clipboard 294 | def copy_selected_outputs(): 295 | selected_rows = results_table.value 296 | if selected_rows: 297 | outputs = [row["Output"] for row in selected_rows] 298 | combined_output = "\n\n".join(outputs) 299 | copy_to_clipboard(combined_output) 300 | return f"Copied {len(outputs)} response(s) to clipboard" 301 | return "No rows selected" 302 | 303 | # Create the run buttons 304 | copy_button = mo.ui.run_button(label="🔗 Copy Selected Outputs") 305 | score_button = mo.ui.run_button(label="👍 Vote Selected Outputs") 306 | 307 | # Display the table and run buttons 308 | mo.vstack( 309 | [ 310 | results_table, 311 | mo.hstack( 312 | [ 313 | score_button, 314 | copy_button, 315 | ], 316 | justify="start", 317 | ), 318 | ] 319 | ) 320 | return ( 321 | copy_button, 322 | copy_selected_outputs, 323 | prompt_data, 324 | response, 325 | results_table, 326 | score_button, 327 | table_data, 328 | ) 329 | 330 | 331 | @app.cell 332 | def __( 333 | copy_to_clipboard, 334 | get_rankings, 335 | mo, 336 | prompt_library_module, 337 | results_table, 338 | score_button, 339 | set_rankings, 340 | ): 341 | mo.stop(not results_table.value, "") 342 | 343 | selected_rows = results_table.value 344 | outputs = [row["Output"] for row in selected_rows] 345 | 
combined_output = "\n\n".join(outputs) 346 | 347 | if score_button.value: 348 | # Increment scores for selected models 349 | current_rankings = get_rankings() 350 | for row in selected_rows: 351 | model_id = row["Model"] 352 | for ranking in current_rankings: 353 | if ranking.llm_model_id == model_id: 354 | ranking.score += 1 355 | break 356 | 357 | # Save updated rankings 358 | set_rankings(current_rankings) 359 | prompt_library_module.save_rankings(current_rankings) 360 | 361 | mo.md(f"Scored {len(selected_rows)} model(s)") 362 | else: 363 | copy_to_clipboard(combined_output) 364 | mo.md(f"Copied {len(outputs)} response(s) to clipboard") 365 | return ( 366 | combined_output, 367 | current_rankings, 368 | model_id, 369 | outputs, 370 | ranking, 371 | row, 372 | selected_rows, 373 | ) 374 | 375 | 376 | @app.cell 377 | def __(all_prompt_responses, form, mo, prompt_library_module): 378 | mo.stop(not form.value, mo.md("")) 379 | mo.stop(not all_prompt_responses, mo.md("")) 380 | 381 | # Create buttons for resetting and loading rankings 382 | reset_ranking_button = mo.ui.run_button(label="❌ Reset Rankings") 383 | load_ranking_button = mo.ui.run_button(label="🔐 Load Rankings") 384 | 385 | # Load existing rankings 386 | get_rankings, set_rankings = mo.state(prompt_library_module.get_rankings()) 387 | 388 | mo.hstack( 389 | [ 390 | load_ranking_button, 391 | reset_ranking_button, 392 | ], 393 | justify="start", 394 | ) 395 | return ( 396 | get_rankings, 397 | load_ranking_button, 398 | reset_ranking_button, 399 | set_rankings, 400 | ) 401 | 402 | 403 | @app.cell 404 | def __(): 405 | # get_rankings() 406 | return 407 | 408 | 409 | @app.cell 410 | def __( 411 | form, 412 | mo, 413 | prompt_library_module, 414 | reset_ranking_button, 415 | set_rankings, 416 | ): 417 | mo.stop(not form.value, mo.md("")) 418 | mo.stop(not reset_ranking_button.value, mo.md("")) 419 | 420 | set_rankings( 421 | prompt_library_module.reset_rankings( 422 | [model.model_id for model in form.value["models"]] 423 | ) 424 | ) 425 | 426 | # mo.md("Rankings reset successfully") 427 | return 428 | 429 | 430 | @app.cell 431 | def __(form, load_ranking_button, mo, prompt_library_module, set_rankings): 432 | mo.stop(not form.value, mo.md("")) 433 | mo.stop(not load_ranking_button.value, mo.md("")) 434 | 435 | set_rankings(prompt_library_module.get_rankings()) 436 | return 437 | 438 | 439 | @app.cell 440 | def __(get_rankings, mo): 441 | # Create UI elements for each model 442 | model_elements = [] 443 | 444 | model_score_style = { 445 | "background": "#eeF", 446 | "padding": "10px", 447 | "border-radius": "10px", 448 | "margin-bottom": "20px", 449 | "min-width": "150px", 450 | "box-shadow": "2px 2px 2px #ccc", 451 | } 452 | 453 | for model_ranking in get_rankings(): 454 | llm_model_id = model_ranking.llm_model_id 455 | score = model_ranking.score 456 | model_elements.append( 457 | mo.vstack( 458 | [ 459 | mo.md(f"**{llm_model_id}** "), 460 | mo.hstack([mo.md(f""), mo.md(f"# {score}")]), 461 | ], 462 | justify="space-between", 463 | gap="2", 464 | ).style(model_score_style) 465 | ) 466 | 467 | mo.hstack(model_elements, justify="start", wrap=True) 468 | return ( 469 | llm_model_id, 470 | model_elements, 471 | model_ranking, 472 | model_score_style, 473 | score, 474 | ) 475 | 476 | 477 | if __name__ == "__main__": 478 | app.run() 479 | -------------------------------------------------------------------------------- /multi_llm_prompting.py: -------------------------------------------------------------------------------- 1 | import 
marimo 2 | 3 | __generated_with = "0.8.18" 4 | app = marimo.App(width="full") 5 | 6 | 7 | @app.cell 8 | def __(): 9 | import marimo as mo 10 | import src.marimo_notebook.modules.llm_module as llm_module 11 | import src.marimo_notebook.modules.prompt_library_module as prompt_library_module 12 | import json 13 | import pyperclip 14 | return json, llm_module, mo, prompt_library_module, pyperclip 15 | 16 | 17 | @app.cell 18 | def __(llm_module): 19 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series() 20 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest() 21 | # llm_sonnet = llm_module.build_sonnet_3_5() 22 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo() 23 | # gemini_1_5_pro_2, gemini_1_5_flash_2 = llm_module.build_gemini_1_2_002() 24 | # llama3_2_model, llama3_2_1b_model = llm_module.build_ollama_models() 25 | 26 | models = { 27 | "o1-mini": llm_o1_mini, 28 | "o1-preview": llm_o1_preview, 29 | "gpt-4o-latest": llm_gpt_4o_latest, 30 | "gpt-4o-mini": llm_gpt_4o_mini, 31 | # "sonnet-3.5": llm_sonnet, 32 | # "gemini-1-5-pro": gemini_1_5_pro, 33 | # "gemini-1-5-flash": gemini_1_5_flash, 34 | # "gemini-1-5-pro-002": gemini_1_5_pro_2, 35 | # "gemini-1-5-flash-002": gemini_1_5_flash_2, 36 | # "llama3-2": llama3_2_model, 37 | # "llama3-2-1b": llama3_2_1b_model, 38 | } 39 | return ( 40 | llm_gpt_4o_latest, 41 | llm_gpt_4o_mini, 42 | llm_o1_mini, 43 | llm_o1_preview, 44 | models, 45 | ) 46 | 47 | 48 | @app.cell 49 | def __(mo, models): 50 | prompt_text_area = mo.ui.text_area(label="Prompt", full_width=True) 51 | prompt_temp_slider = mo.ui.slider( 52 | start=0, stop=1, value=0.5, step=0.05, label="Temp" 53 | ) 54 | model_multiselect = mo.ui.multiselect( 55 | options=models.copy(), 56 | label="Models", 57 | value=["gpt-4o-mini"], 58 | ) 59 | 60 | form = ( 61 | mo.md( 62 | r""" 63 | # Multi-LLM Prompt 64 | {prompt} 65 | {temp} 66 | {models} 67 | """ 68 | ) 69 | .batch( 70 | prompt=prompt_text_area, 71 | temp=prompt_temp_slider, 72 | models=model_multiselect, 73 | ) 74 | .form() 75 | ) 76 | form 77 | return form, model_multiselect, prompt_temp_slider, prompt_text_area 78 | 79 | 80 | @app.cell 81 | def __(form, llm_module, mo, prompt_library_module): 82 | mo.stop(not form.value, "") 83 | 84 | prompt_responses = [] 85 | 86 | with mo.status.progress_bar( 87 | title="Running prompts on selected models...", 88 | total=len(form.value["models"]), 89 | remove_on_exit=True, 90 | ) as prog_bar: 91 | # with mo.status.spinner(title="Running prompts on selected models...") as _spinner: 92 | for model in form.value["models"]: 93 | model_name = model.model_id 94 | prog_bar.update(title=f"Prompting '{model_name}'", increment=1) 95 | response = llm_module.prompt_with_temp( 96 | model, form.value["prompt"], form.value["temp"] 97 | ) 98 | prompt_responses.append( 99 | { 100 | "model_id": model_name, 101 | "model": model, 102 | "output": response, 103 | } 104 | ) 105 | 106 | # Create a new list without the 'model' key for each response 107 | list_model_execution_dict = [ 108 | {k: v for k, v in response.items() if k != "model"} 109 | for response in prompt_responses 110 | ] 111 | 112 | # Record the execution 113 | execution_filepath = prompt_library_module.record_llm_execution( 114 | prompt=form.value["prompt"], 115 | list_model_execution_dict=list_model_execution_dict, 116 | prompt_template=None, # You can add a prompt template if you have one 117 | ) 118 | print(f"Execution record saved to: {execution_filepath}") 119 | return ( 120 | execution_filepath, 121 | 
list_model_execution_dict, 122 | model, 123 | model_name, 124 | prog_bar, 125 | prompt_responses, 126 | response, 127 | ) 128 | 129 | 130 | @app.cell 131 | def __(mo, prompt_responses, pyperclip): 132 | def copy_to_clipboard(text): 133 | print("copying: ", text) 134 | pyperclip.copy(text) 135 | return mo.md("**Copied to clipboard!**").callout(kind="success") 136 | 137 | output_elements = [ 138 | mo.vstack( 139 | [ 140 | mo.md(f"# Prompt Output ({response['model_id']})"), 141 | mo.md(response["output"]), 142 | ] 143 | ).style( 144 | { 145 | "background": "#eee", 146 | "padding": "10px", 147 | "border-radius": "10px", 148 | "margin-bottom": "20px", 149 | } 150 | ) 151 | for (idx, response) in enumerate(prompt_responses) 152 | ] 153 | 154 | mo.vstack( 155 | [ 156 | mo.hstack(output_elements), 157 | # mo.hstack(output_elements, wrap=True), 158 | # mo.vstack(output_elements), 159 | # mo.carousel(output_elements), 160 | # mo.hstack(copy_buttons) 161 | # copy_buttons, 162 | ] 163 | ) 164 | return copy_to_clipboard, output_elements 165 | 166 | 167 | @app.cell 168 | def __(copy_to_clipboard, mo, prompt_responses): 169 | copy_buttons = mo.ui.array( 170 | [ 171 | mo.ui.button( 172 | label=f"Copy {response['model_id']} response", 173 | on_click=lambda v: copy_to_clipboard(prompt_responses[v]["output"]), 174 | value=idx, 175 | ) 176 | for (idx, response) in enumerate(prompt_responses) 177 | ] 178 | ) 179 | 180 | mo.vstack(copy_buttons, align="center") 181 | return (copy_buttons,) 182 | 183 | 184 | if __name__ == "__main__": 185 | app.run() 186 | -------------------------------------------------------------------------------- /prompt_library.py: -------------------------------------------------------------------------------- 1 | import marimo 2 | 3 | __generated_with = "0.8.18" 4 | app = marimo.App(width="medium") 5 | 6 | 7 | @app.cell 8 | def __(): 9 | import marimo as mo 10 | from src.marimo_notebook.modules import prompt_library_module, llm_module 11 | import re # For regex to extract placeholders 12 | return llm_module, mo, prompt_library_module, re 13 | 14 | 15 | @app.cell 16 | def __(prompt_library_module): 17 | map_prompt_library: dict = prompt_library_module.pull_in_prompt_library() 18 | return (map_prompt_library,) 19 | 20 | 21 | @app.cell 22 | def __(llm_module): 23 | llm_o1_mini, llm_o1_preview = llm_module.build_o1_series() 24 | llm_gpt_4o_latest, llm_gpt_4o_mini = llm_module.build_openai_latest_and_fastest() 25 | # llm_sonnet = llm_module.build_sonnet_3_5() 26 | # gemini_1_5_pro, gemini_1_5_flash = llm_module.build_gemini_duo() 27 | 28 | models = { 29 | "o1-mini": llm_o1_mini, 30 | "o1-preview": llm_o1_preview, 31 | "gpt-4o-latest": llm_gpt_4o_latest, 32 | "gpt-4o-mini": llm_gpt_4o_mini, 33 | # "sonnet-3.5": llm_sonnet, 34 | # "gemini-1-5-pro": gemini_1_5_pro, 35 | # "gemini-1-5-flash": gemini_1_5_flash, 36 | } 37 | return ( 38 | llm_gpt_4o_latest, 39 | llm_gpt_4o_mini, 40 | llm_o1_mini, 41 | llm_o1_preview, 42 | models, 43 | ) 44 | 45 | 46 | @app.cell 47 | def __(): 48 | prompt_styles = {"background": "#eee", "padding": "10px", "border-radius": "10px"} 49 | return (prompt_styles,) 50 | 51 | 52 | @app.cell 53 | def __(map_prompt_library, mo, models): 54 | prompt_keys = list(map_prompt_library.keys()) 55 | prompt_dropdown = mo.ui.dropdown( 56 | options=prompt_keys, 57 | label="Select a Prompt", 58 | ) 59 | model_dropdown = mo.ui.dropdown( 60 | options=models, 61 | label="Select an LLM Model", 62 | value="gpt-4o-mini", 63 | ) 64 | form = ( 65 | mo.md( 66 | r""" 67 | # Prompt Library 68 | 
{prompt_dropdown} 69 | {model_dropdown} 70 | """ 71 | ) 72 | .batch( 73 | prompt_dropdown=prompt_dropdown, 74 | model_dropdown=model_dropdown, 75 | ) 76 | .form() 77 | ) 78 | form 79 | return form, model_dropdown, prompt_dropdown, prompt_keys 80 | 81 | 82 | @app.cell 83 | def __(form, map_prompt_library, mo, prompt_styles): 84 | selected_prompt_name = None 85 | selected_prompt = None 86 | 87 | mo.stop(not form.value or not len(form.value), "") 88 | selected_prompt_name = form.value["prompt_dropdown"] 89 | selected_prompt = map_prompt_library[selected_prompt_name] 90 | mo.vstack( 91 | [ 92 | mo.md("# Selected Prompt"), 93 | mo.accordion( 94 | { 95 | "### Click to show": mo.md(f"```xml\n{selected_prompt}\n```").style( 96 | prompt_styles 97 | ) 98 | } 99 | ), 100 | ] 101 | ) 102 | return selected_prompt, selected_prompt_name 103 | 104 | 105 | @app.cell 106 | def __(mo, re, selected_prompt, selected_prompt_name): 107 | mo.stop(not selected_prompt_name or not selected_prompt, "") 108 | 109 | # Extract placeholders from the prompt 110 | placeholders = re.findall(r"\{\{(.*?)\}\}", selected_prompt) 111 | placeholders = list(set(placeholders)) # Remove duplicates 112 | 113 | # Create text areas for placeholders, using the placeholder text as the label 114 | placeholder_inputs = [ 115 | mo.ui.text_area(label=ph, placeholder=f"Enter {ph}", full_width=True) 116 | for ph in placeholders 117 | ] 118 | 119 | # Create an array of placeholder inputs 120 | placeholder_array = mo.ui.array( 121 | placeholder_inputs, 122 | label="Fill in the Placeholders", 123 | ) 124 | 125 | # Create a 'Proceed' button 126 | proceed_button = mo.ui.run_button(label="Prompt") 127 | 128 | # Display the placeholders and the 'Proceed' button in a vertical stack 129 | vstack = mo.vstack([mo.md("# Prompt Variables"), placeholder_array, proceed_button]) 130 | vstack 131 | return ( 132 | placeholder_array, 133 | placeholder_inputs, 134 | placeholders, 135 | proceed_button, 136 | vstack, 137 | ) 138 | 139 | 140 | @app.cell 141 | def __(mo, placeholder_array, placeholders, proceed_button): 142 | mo.stop(not placeholder_array.value or not len(placeholder_array.value), "") 143 | 144 | # Check if any values are missing 145 | if any(not value.strip() for value in placeholder_array.value): 146 | mo.stop(True, mo.md("**Please fill in all placeholders.**")) 147 | 148 | # Ensure the 'Proceed' button has been pressed 149 | mo.stop( 150 | not proceed_button.value, 151 | mo.md("**Please press the 'Proceed' button to continue.**"), 152 | ) 153 | 154 | # Map the placeholder names to the values 155 | filled_values = dict(zip(placeholders, placeholder_array.value)) 156 | return (filled_values,) 157 | 158 | 159 | @app.cell 160 | def __(filled_values, selected_prompt): 161 | # Replace placeholders in the prompt 162 | final_prompt = selected_prompt 163 | for key, value in filled_values.items(): 164 | final_prompt = final_prompt.replace(f"{{{{{key}}}}}", value) 165 | 166 | # Create context_filled_prompt 167 | context_filled_prompt = final_prompt 168 | return context_filled_prompt, final_prompt, key, value 169 | 170 | 171 | @app.cell 172 | def __(context_filled_prompt, mo, prompt_styles): 173 | mo.vstack( 174 | [ 175 | mo.md("# Context Filled Prompt"), 176 | mo.accordion( 177 | { 178 | "### Click to Show Context Filled Prompt": mo.md( 179 | f"```xml\n{context_filled_prompt}\n```" 180 | ).style(prompt_styles) 181 | } 182 | ), 183 | ] 184 | ) 185 | return 186 | 187 | 188 | @app.cell 189 | def __(context_filled_prompt, form, llm_module, mo): 190 | # Get the 
selected model 191 | model = form.value["model_dropdown"] 192 | # Run the prompt through the model using context_filled_prompt 193 | with mo.status.spinner(title="Running prompt..."): 194 | prompt_response = llm_module.prompt(model, context_filled_prompt) 195 | 196 | mo.md(f"# Prompt Output\n\n{prompt_response}").style( 197 | {"background": "#eee", "padding": "10px", "border-radius": "10px"} 198 | ) 199 | return model, prompt_response 200 | 201 | 202 | if __name__ == "__main__": 203 | app.run() 204 | -------------------------------------------------------------------------------- /prompt_library/ai-coding-meta-review.xml: -------------------------------------------------------------------------------- 1 | 2 | Given a diff of code, look for bugs and provide solutions to an AI coding assistant LLM to 3 | automatically fix the code. 4 | 5 | 6 | 7 | The diff-of-code will be provided in the diff-of-code block. 8 | The code-files will be provided in the code-files block. 9 | Respond in the output-format provided. Do not include any other text. 10 | Respond with valid JSON.parseable JSON. 11 | For each issue specify the file path, startingLine, aiCodingResolutionPrompt, bugDescription, and severity for each fix. 12 | First search for critical bugs. Then search for major bugs. Then minor bugs. Then readability. 13 | The aiCodingResolutionPrompt should be a prompt detailing the issue, the solution in natural language, and the code to fix the issue. 14 | Return an AIAssistantResolutions object, which contains an array of fixes: AIAssistantFix. 15 | 16 | 17 | 18 | 21 | 22 | 23 | 24 | 27 | 28 | 29 | 30 | 47 | -------------------------------------------------------------------------------- /prompt_library/bullet-knowledge-compression.xml: -------------------------------------------------------------------------------- 1 | 2 | Create sections to generate a document that describes the key takeaways, principles, and ideas based on the provided content block. 3 | 4 | 5 | 6 | 7 | Use a simple 'title', 'theme', 'bullet points' structure. 8 | 9 | 10 | Cover as much content as you can, but don't overload bullet points, be concise and nest bullet points when you have more to add that would be valuable for understanding and utilizing the content. 11 | 12 | 13 | 14 | 15 | {{content}} 16 | -------------------------------------------------------------------------------- /prompt_library/chapter-gen.xml: -------------------------------------------------------------------------------- 1 | 2 | We're generating YouTube video chapters. Generate chapters in the specified format detailed by 3 | the examples, ensuring that each chapter title is short, engaging, SEO-friendly, and aligned 4 | with the corresponding timestamp. Follow the instructions to generate the best, most interesting 5 | chapters. 6 | 7 | 8 | 9 | The time stamps are in the format [MM:SS] and you should use them to create the 10 | chapter titles. 11 | The timestamp should represent the beginning of the chapter. 12 | Collect what you think will be the most interesting and engaging parts of the video 13 | to represent each chapter based on the transcript. 14 | Use the transcript-with-timestamps to generate the chapter title. 15 | Use the examples to properly structure the chapter titles. 16 | Use the seo-keywords-to-hit to generate the chapter title. 17 | Generate 8 chapters for the video throughout the duration of the video. 18 | Respond exclusively with the chapters in the specified format detailed in the 19 | examples.
20 | 21 | 22 | 23 | 24 | 📖 Chapters 25 | 00:00 Increase your earnings potential 26 | 00:38 Omnicomplete - the autocomplete for everything 27 | 01:16 LLM Autocompletes can self improve 28 | 02:00 Reveal Actionable Information from your users 29 | 03:20 Client - Server - Prompt Architecture 30 | 05:30 LLM Autocomplete DEMO 31 | 06:45 Autocomplete PROMPT 32 | 08:45 Auto Improve LLM / Self Improve LLM 33 | 10:25 Break down codebase 34 | 12:28 Direct prompt testing integration 35 | 14:10 Domain Knowledge Example 36 | 16:00 Interesting Use Case For LLMs in 2024, 2025 37 | 38 | 39 | 40 | 📖 Chapters 41 | 00:00 The 100x LLM is coming 42 | 01:30 A 100x on opus and gpt4 is insane 43 | 01:57 Sam Altman's winning startup strategy 44 | 03:16 BAPs, Expand your problem set, 100 P/D 45 | 03:35 BAPs 46 | 06:35 Expand your problem set 47 | 08:45 The prompt is the new fundamental unit of programming 48 | 10:40 100 P/D 49 | 14:00 Recap 3 ways to prepare for 100x SOTA LLM 50 | 51 | 52 | 53 | 📖 Chapters 54 | 00:00 Best way to build AI Agents? 55 | 00:39 Agent OS 56 | 01:58 Big Ideas (Summary) 57 | 02:48 Breakdown Agent OS: LPU, RAM, I/O 58 | 04:03 Language Processing Unit (LPU) 59 | 05:42 Is this over engineering? 60 | 07:30 Memory, Context, State (RAM) 61 | 08:20 Tools, Function Calling, Spyware (I/O) 62 | 10:22 How do you know your Architecture is good? 63 | 13:27 Agent Composability 64 | 16:40 What's missing from Agent OS? 65 | 18:53 The Prompt is the... 66 | 67 | 68 | 69 | 70 | {{seo-keywords-to-hit}} 71 | 72 | 73 | 74 | {{transcript-with-timestamps}} 75 | -------------------------------------------------------------------------------- /prompt_library/hn-sentiment-analysis.xml: -------------------------------------------------------------------------------- 1 | 2 | Analyze the aggregate sentiment of a list of comments from hn-json. 3 | 4 | 5 | 6 | 7 | Respond in the response-format TwoFormatSentiment structure. 8 | 9 | Analyze the sentiment of each comment and aggregate the results. 10 | Do not return anything else. Results must be JSON.parseable. 11 | Group your analysis into positive, nuanced, and negative categories. 12 | Include the entire comment in the 'standOutComments' field. 13 | Positive and negative are self-explanatory. Nuanced sentiment is for comments that 14 | border on positive and negative or are even mixed. 15 | In the 'markdown' response, respond with your results from AggregateSentiment but 16 | in human readable markdown format. 17 | 18 | Create at least 3 themes for each sentiment category but expand to as many novel themes as needed. 19 | 20 | 21 | Include at least 3 standOutComments for each theme but expand to as many as you need. 22 | 23 | 24 | 25 | 26 | 29 | 30 | 31 | 32 | 60 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "marimo-notebook" 3 | version = "0.1.0" 4 | description = "Starter codebase to use Marimo reactive notebooks to build a reusable, customizable, Prompt Library."
5 | readme = "README.md" 6 | requires-python = ">=3.12" 7 | dependencies = [ 8 | "openai>=1.47.0", 9 | "python-dotenv>=1.0.1", 10 | "pydantic>=2.9.2", 11 | "llm>=0.16", 12 | "pytest>=8.3.3", 13 | "llm-claude>=0.4.0", 14 | "mako>=1.3.5", 15 | "pandas>=2.2.3", 16 | "matplotlib>=3.9.2", 17 | "altair>=5.4.1", 18 | "vega-datasets>=0.9.0", 19 | "tornado>=6.4.1", 20 | "llm-claude-3>=0.4.1", 21 | "marimo>=0.8.18", 22 | "llm-ollama>=0.5.0", 23 | "pyperclip>=1.9.0", 24 | "llm-gemini>=0.1a5", 25 | ] 26 | 27 | [build-system] 28 | requires = ["hatchling"] 29 | build-backend = "hatchling.build" 30 | -------------------------------------------------------------------------------- /src/marimo_notebook/__init__.py: -------------------------------------------------------------------------------- 1 | def hello() -> str: 2 | return "Hello from marimo-notebook!" 3 | -------------------------------------------------------------------------------- /src/marimo_notebook/modules/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/src/marimo_notebook/modules/__init__.py -------------------------------------------------------------------------------- /src/marimo_notebook/modules/chain.py: -------------------------------------------------------------------------------- 1 | import json 2 | import re 3 | from typing import List, Dict, Callable, Any, Tuple, Union 4 | from .typings import FusionChainResult 5 | import concurrent.futures 6 | 7 | 8 | class FusionChain: 9 | 10 | @staticmethod 11 | def run( 12 | context: Dict[str, Any], 13 | models: List[Any], 14 | callable: Callable, 15 | prompts: List[str], 16 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]], 17 | get_model_name: Callable[[Any], str], 18 | ) -> FusionChainResult: 19 | """ 20 | Run a competition between models on a list of prompts. 21 | 22 | Runs the MinimalChainable.run method for each model for each prompt and evaluates the results. 23 | 24 | The evaluator runs on the last output of each model at the end of the chain of prompts. 25 | 26 | The eval method returns a performance score for each model from 0 to 1, giving priority to models earlier in the list. 27 | 28 | Args: 29 | context (Dict[str, Any]): The context for the prompts. 30 | models (List[Any]): List of models to compete. 31 | callable (Callable): The function to call for each prompt. 32 | prompts (List[str]): List of prompts to process. 33 | evaluator (Callable[[List[str]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores. 34 | get_model_name (Callable[[Any], str]): Function to get the name of a model. Defaults to str(model). 35 | 36 | Returns: 37 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names. 
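        Example (illustrative sketch only; ``echo_call``, ``pick_longest``, and the
        string model names below are hypothetical stand-ins for real LLM objects,
        not part of this codebase):

            def echo_call(model, prompt):
                # Stand-in for a real LLM call: just echo the model and prompt.
                return f"{model} says: {prompt}"

            def pick_longest(last_outputs):
                # Toy evaluator: the longest final answer wins; scores scaled to 0..1.
                longest = max(len(o) for o in last_outputs)
                scores = [len(o) / longest for o in last_outputs]
                return last_outputs[scores.index(max(scores))], scores

            result = FusionChain.run(
                context={"topic": "reactive notebooks"},
                models=["echo-model-a", "echo-model-b"],
                callable=echo_call,
                prompts=[
                    "Summarize {{topic}} in one sentence.",
                    "Refine this summary: {{output[-1]}}",
                ],
                evaluator=pick_longest,
                get_model_name=lambda m: str(m),
            )
            # result.top_response holds the winning final output;
            # result.performance_scores holds one 0..1 score per model.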
38 | """ 39 | all_outputs = [] 40 | all_context_filled_prompts = [] 41 | 42 | for model in models: 43 | outputs, context_filled_prompts = MinimalChainable.run( 44 | context, model, callable, prompts 45 | ) 46 | all_outputs.append(outputs) 47 | all_context_filled_prompts.append(context_filled_prompts) 48 | 49 | # Evaluate the last output of each model 50 | last_outputs = [outputs[-1] for outputs in all_outputs] 51 | top_response, performance_scores = evaluator(last_outputs) 52 | 53 | model_names = [get_model_name(model) for model in models] 54 | 55 | return FusionChainResult( 56 | top_response=top_response, 57 | all_prompt_responses=all_outputs, 58 | all_context_filled_prompts=all_context_filled_prompts, 59 | performance_scores=performance_scores, 60 | llm_model_names=model_names, 61 | ) 62 | 63 | @staticmethod 64 | def run_parallel( 65 | context: Dict[str, Any], 66 | models: List[Any], 67 | callable: Callable, 68 | prompts: List[str], 69 | evaluator: Callable[[List[Any]], Tuple[Any, List[float]]], 70 | get_model_name: Callable[[Any], str], 71 | num_workers: int = 4, 72 | ) -> FusionChainResult: 73 | """ 74 | Run a competition between models on a list of prompts in parallel. 75 | 76 | This method is similar to the 'run' method but utilizes parallel processing 77 | to improve performance when dealing with multiple models. 78 | 79 | Args: 80 | context (Dict[str, Any]): The context for the prompts. 81 | models (List[Any]): List of models to compete. 82 | callable (Callable): The function to call for each prompt. 83 | prompts (List[str]): List of prompts to process. 84 | evaluator (Callable[[List[str]], Tuple[Any, List[float]]]): Function to evaluate model outputs, returning the top response and the scores. 85 | num_workers (int): Number of parallel workers to use. Defaults to 4. 86 | get_model_name (Callable[[Any], str]): Function to get the name of a model. Defaults to str(model). 87 | 88 | Returns: 89 | FusionChainResult: A FusionChainResult object containing the top response, all outputs, all context-filled prompts, performance scores, and model names. 90 | """ 91 | 92 | def process_model(model): 93 | outputs, context_filled_prompts = MinimalChainable.run( 94 | context, model, callable, prompts 95 | ) 96 | return outputs, context_filled_prompts 97 | 98 | all_outputs = [] 99 | all_context_filled_prompts = [] 100 | 101 | with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor: 102 | future_to_model = { 103 | executor.submit(process_model, model): model for model in models 104 | } 105 | for future in concurrent.futures.as_completed(future_to_model): 106 | outputs, context_filled_prompts = future.result() 107 | all_outputs.append(outputs) 108 | all_context_filled_prompts.append(context_filled_prompts) 109 | 110 | # Evaluate the last output of each model 111 | last_outputs = [outputs[-1] for outputs in all_outputs] 112 | top_response, performance_scores = evaluator(last_outputs) 113 | 114 | model_names = [get_model_name(model) for model in models] 115 | 116 | return FusionChainResult( 117 | top_response=top_response, 118 | all_prompt_responses=all_outputs, 119 | all_context_filled_prompts=all_context_filled_prompts, 120 | performance_scores=performance_scores, 121 | llm_model_names=model_names, 122 | ) 123 | 124 | 125 | class MinimalChainable: 126 | """ 127 | Sequential prompt chaining with context and output back-references. 
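    Prompts may reference context values with ``{{key}}`` and earlier outputs with
    ``{{output[-1]}}`` (or ``{{output[-1].field}}`` when the previous output parsed
    as JSON).

    Minimal usage sketch (``fake_llm`` and the model name below are hypothetical
    stand-ins, not objects from this codebase):

        def fake_llm(model, prompt):
            # Stand-in for a real LLM call.
            return f"[{model}] {prompt}"

        outputs, filled_prompts = MinimalChainable.run(
            context={"name": "Marimo"},
            model="demo-model",
            callable=fake_llm,
            prompts=[
                "Write a tagline for {{name}}.",
                "Shorten this tagline: {{output[-1]}}",
            ],
        )
        # outputs[1] was produced from a prompt that embedded outputs[0] via the
        # {{output[-1]}} back-reference; filled_prompts contains the prompts after
        # substitution.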
128 | """ 129 | 130 | @staticmethod 131 | def run( 132 | context: Dict[str, Any], model: Any, callable: Callable, prompts: List[str] 133 | ) -> Tuple[List[Any], List[str]]: 134 | # Initialize an empty list to store the outputs 135 | output = [] 136 | context_filled_prompts = [] 137 | 138 | # Iterate over each prompt with its index 139 | for i, prompt in enumerate(prompts): 140 | # Iterate over each key-value pair in the context 141 | for key, value in context.items(): 142 | # Check if the key is in the prompt 143 | if "{{" + key + "}}" in prompt: 144 | # Replace the key with its value 145 | prompt = prompt.replace("{{" + key + "}}", str(value)) 146 | 147 | # Replace references to previous outputs 148 | # Iterate from the current index down to 1 149 | for j in range(i, 0, -1): 150 | # Get the previous output 151 | previous_output = output[i - j] 152 | 153 | # Handle JSON (dict) output references 154 | # Check if the previous output is a dictionary 155 | if isinstance(previous_output, dict): 156 | # Check if the reference is in the prompt 157 | if f"{{{{output[-{j}]}}}}" in prompt: 158 | # Replace the reference with the JSON string 159 | prompt = prompt.replace( 160 | f"{{{{output[-{j}]}}}}", json.dumps(previous_output) 161 | ) 162 | # Iterate over each key-value pair in the previous output 163 | for key, value in previous_output.items(): 164 | # Check if the key reference is in the prompt 165 | if f"{{{{output[-{j}].{key}}}}}" in prompt: 166 | # Replace the key reference with its value 167 | prompt = prompt.replace( 168 | f"{{{{output[-{j}].{key}}}}}", str(value) 169 | ) 170 | # If not a dict, use the original string 171 | else: 172 | # Check if the reference is in the prompt 173 | if f"{{{{output[-{j}]}}}}" in prompt: 174 | # Replace the reference with the previous output 175 | prompt = prompt.replace( 176 | f"{{{{output[-{j}]}}}}", str(previous_output) 177 | ) 178 | 179 | # Append the context filled prompt to the list 180 | context_filled_prompts.append(prompt) 181 | 182 | # Call the provided callable with the processed prompt 183 | # Get the result by calling the callable with the model and prompt 184 | result = callable(model, prompt) 185 | 186 | print("result", result) 187 | 188 | # Try to parse the result as JSON, handling markdown-wrapped JSON 189 | try: 190 | # First, attempt to extract JSON from markdown code blocks 191 | # Search for JSON in markdown code blocks 192 | json_match = re.search(r"```(?:json)?\s*([\s\S]*?)\s*```", result) 193 | # If a match is found 194 | if json_match: 195 | # Parse the JSON from the match 196 | result = json.loads(json_match.group(1)) 197 | else: 198 | # If no markdown block found, try parsing the entire result 199 | # Parse the entire result as JSON 200 | result = json.loads(result) 201 | except json.JSONDecodeError: 202 | # Not JSON, keep as is 203 | pass 204 | 205 | # Append the result to the output list 206 | output.append(result) 207 | 208 | # Return the list of outputs 209 | return output, context_filled_prompts 210 | 211 | @staticmethod 212 | def to_delim_text_file(name: str, content: List[Union[str, dict, list]]) -> str: 213 | result_string = "" 214 | with open(f"{name}.txt", "w") as outfile: 215 | for i, item in enumerate(content, 1): 216 | if isinstance(item, (dict, list)): 217 | item = json.dumps(item) 218 | elif not isinstance(item, str): 219 | item = str(item) 220 | chain_text_delim = ( 221 | f"{'🔗' * i} -------- Prompt Chain Result #{i} -------------\n\n" 222 | ) 223 | outfile.write(chain_text_delim) 224 | outfile.write(item) 225 | 
outfile.write("\n\n") 226 | 227 | result_string += chain_text_delim + item + "\n\n" 228 | 229 | return result_string 230 | -------------------------------------------------------------------------------- /src/marimo_notebook/modules/llm_module.py: -------------------------------------------------------------------------------- 1 | import llm 2 | from dotenv import load_dotenv 3 | import os 4 | from mako.template import Template 5 | 6 | # Load environment variables from .env file 7 | load_dotenv() 8 | 9 | 10 | def conditional_render(prompt, context, start_delim="% if", end_delim="% endif"): 11 | template = Template(prompt) 12 | return template.render(**context) 13 | 14 | 15 | def parse_markdown_backticks(str) -> str: 16 | if "```" not in str: 17 | return str.strip() 18 | # Remove opening backticks and language identifier 19 | str = str.split("```", 1)[-1].split("\n", 1)[-1] 20 | # Remove closing backticks 21 | str = str.rsplit("```", 1)[0] 22 | # Remove any leading or trailing whitespace 23 | return str.strip() 24 | 25 | 26 | def prompt(model: llm.Model, prompt: str): 27 | res = model.prompt(prompt, stream=False) 28 | return res.text() 29 | 30 | 31 | def prompt_with_temp(model: llm.Model, prompt: str, temperature: float = 0.7): 32 | """ 33 | Send a prompt to the model with a specified temperature. 34 | 35 | Args: 36 | model (llm.Model): The LLM model to use. 37 | prompt (str): The prompt to send to the model. 38 | temperature (float): The temperature setting for the model's response. Default is 0.7. 39 | 40 | Returns: 41 | str: The model's response text. 42 | """ 43 | 44 | model_id = model.model_id 45 | if "o1" in model_id or "gemini" in model_id: 46 | temperature = 1 47 | res = model.prompt(prompt, stream=False) 48 | return res.text() 49 | 50 | res = model.prompt(prompt, stream=False, temperature=temperature) 51 | return res.text() 52 | 53 | 54 | def get_model_name(model: llm.Model): 55 | return model.model_id 56 | 57 | 58 | def build_sonnet_3_5(): 59 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") 60 | 61 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet") 62 | sonnet_3_5_model.key = ANTHROPIC_API_KEY 63 | 64 | return sonnet_3_5_model 65 | 66 | 67 | def build_mini_model(): 68 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 69 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 70 | gpt4_o_mini_model.key = OPENAI_API_KEY 71 | return gpt4_o_mini_model 72 | 73 | 74 | def build_big_3_models(): 75 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") 76 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 77 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 78 | 79 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet") 80 | sonnet_3_5_model.key = ANTHROPIC_API_KEY 81 | 82 | gpt4_o_model: llm.Model = llm.get_model("4o") 83 | gpt4_o_model.key = OPENAI_API_KEY 84 | 85 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest") 86 | gemini_1_5_pro_model.key = GEMINI_API_KEY 87 | 88 | return sonnet_3_5_model, gpt4_o_model, gemini_1_5_pro_model 89 | 90 | 91 | def build_latest_openai(): 92 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 93 | 94 | # chatgpt_4o_latest_model: llm.Model = llm.get_model("chatgpt-4o-latest") - experimental 95 | chatgpt_4o_latest_model: llm.Model = llm.get_model("gpt-4o") 96 | chatgpt_4o_latest_model.key = OPENAI_API_KEY 97 | return chatgpt_4o_latest_model 98 | 99 | 100 | def build_big_3_plus_mini_models(): 101 | 102 | ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY") 103 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 104 | 
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 105 | 106 | sonnet_3_5_model: llm.Model = llm.get_model("claude-3.5-sonnet") 107 | sonnet_3_5_model.key = ANTHROPIC_API_KEY 108 | 109 | gpt4_o_model: llm.Model = llm.get_model("4o") 110 | gpt4_o_model.key = OPENAI_API_KEY 111 | 112 | gemini_1_5_pro_model: llm.Model = llm.get_model("gemini-1.5-pro-latest") 113 | gemini_1_5_pro_model.key = GEMINI_API_KEY 114 | 115 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 116 | gpt4_o_mini_model.key = OPENAI_API_KEY 117 | 118 | chatgpt_4o_latest_model = build_latest_openai() 119 | 120 | return ( 121 | sonnet_3_5_model, 122 | gpt4_o_model, 123 | gemini_1_5_pro_model, 124 | gpt4_o_mini_model, 125 | ) 126 | 127 | 128 | def build_gemini_duo(): 129 | gemini_1_5_pro: llm.Model = llm.get_model("gemini-1.5-pro-latest") 130 | gemini_1_5_flash: llm.Model = llm.get_model("gemini-1.5-flash-latest") 131 | 132 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 133 | 134 | gemini_1_5_pro.key = GEMINI_API_KEY 135 | gemini_1_5_flash.key = GEMINI_API_KEY 136 | 137 | return gemini_1_5_pro, gemini_1_5_flash 138 | 139 | 140 | def build_ollama_models(): 141 | 142 | llama3_2_model: llm.Model = llm.get_model("llama3.2") 143 | llama_3_2_1b_model: llm.Model = llm.get_model("llama3.2:1b") 144 | 145 | return llama3_2_model, llama_3_2_1b_model 146 | 147 | 148 | def build_ollama_slm_models(): 149 | 150 | llama3_2_model: llm.Model = llm.get_model("llama3.2") 151 | phi3_5_model: llm.Model = llm.get_model("phi3.5:latest") 152 | qwen2_5_model: llm.Model = llm.get_model("qwen2.5:latest") 153 | 154 | return llama3_2_model, phi3_5_model, qwen2_5_model 155 | 156 | 157 | def build_openai_model_stack(): 158 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 159 | 160 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 161 | gpt4_o_2024_08_06_model: llm.Model = llm.get_model("gpt-4o") 162 | o1_preview_model: llm.Model = llm.get_model("o1-preview") 163 | o1_mini_model: llm.Model = llm.get_model("o1-mini") 164 | 165 | models = [ 166 | gpt4_o_mini_model, 167 | gpt4_o_2024_08_06_model, 168 | o1_preview_model, 169 | o1_mini_model, 170 | ] 171 | 172 | for model in models: 173 | model.key = OPENAI_API_KEY 174 | 175 | return models 176 | 177 | 178 | def build_openai_latest_and_fastest(): 179 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 180 | 181 | gpt_4o_latest: llm.Model = llm.get_model("gpt-4o") 182 | gpt_4o_latest.key = OPENAI_API_KEY 183 | 184 | gpt_4o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 185 | gpt_4o_mini_model.key = OPENAI_API_KEY 186 | 187 | return gpt_4o_latest, gpt_4o_mini_model 188 | 189 | 190 | def build_o1_series(): 191 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 192 | 193 | o1_mini_model: llm.Model = llm.get_model("o1-mini") 194 | o1_mini_model.key = OPENAI_API_KEY 195 | 196 | o1_preview_model: llm.Model = llm.get_model("o1-preview") 197 | o1_preview_model.key = OPENAI_API_KEY 198 | 199 | return o1_mini_model, o1_preview_model 200 | 201 | 202 | def build_small_cheap_and_fast(): 203 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 204 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 205 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 206 | gpt4_o_mini_model.key = OPENAI_API_KEY 207 | 208 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002") 209 | gemini_1_5_flash_002.key = GEMINI_API_KEY 210 | 211 | return gpt4_o_mini_model, gemini_1_5_flash_002 212 | 213 | 214 | def build_small_cheap_and_fast(): 215 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 216 | GEMINI_API_KEY = 
os.getenv("GEMINI_API_KEY") 217 | gpt4_o_mini_model: llm.Model = llm.get_model("gpt-4o-mini") 218 | gpt4_o_mini_model.key = OPENAI_API_KEY 219 | 220 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002") 221 | gemini_1_5_flash_002.key = GEMINI_API_KEY 222 | 223 | return gpt4_o_mini_model, gemini_1_5_flash_002 224 | 225 | 226 | def build_gemini_1_2_002(): 227 | GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") 228 | 229 | gemini_1_5_pro_002: llm.Model = llm.get_model("gemini-1.5-pro-002") 230 | gemini_1_5_flash_002: llm.Model = llm.get_model("gemini-1.5-flash-002") 231 | 232 | gemini_1_5_pro_002.key = GEMINI_API_KEY 233 | gemini_1_5_flash_002.key = GEMINI_API_KEY 234 | 235 | return gemini_1_5_pro_002, gemini_1_5_flash_002 236 | -------------------------------------------------------------------------------- /src/marimo_notebook/modules/prompt_library_module.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from datetime import datetime 4 | from typing import List 5 | from dotenv import load_dotenv 6 | from src.marimo_notebook.modules.typings import ModelRanking, MultiLLMPromptExecution 7 | 8 | load_dotenv() 9 | 10 | 11 | def pull_in_dir_recursively(dir: str) -> dict: 12 | if not os.path.exists(dir): 13 | return {} 14 | 15 | result = {} 16 | 17 | def recursive_read(current_dir): 18 | for item in os.listdir(current_dir): 19 | item_path = os.path.join(current_dir, item) 20 | if os.path.isfile(item_path): 21 | relative_path = os.path.relpath(item_path, dir) 22 | with open(item_path, "r") as f: 23 | result[relative_path] = f.read() 24 | elif os.path.isdir(item_path): 25 | recursive_read(item_path) 26 | 27 | recursive_read(dir) 28 | return result 29 | 30 | 31 | def pull_in_prompt_library(): 32 | prompt_library_dir = os.getenv("PROMPT_LIBRARY_DIR", "./prompt_library") 33 | return pull_in_dir_recursively(prompt_library_dir) 34 | 35 | 36 | def pull_in_testable_prompts(): 37 | testable_prompts_dir = os.getenv("TESTABLE_PROMPTS_DIR", "./testable_prompts") 38 | return pull_in_dir_recursively(testable_prompts_dir) 39 | 40 | 41 | def record_llm_execution( 42 | prompt: str, list_model_execution_dict: list, prompt_template: str = None 43 | ): 44 | execution_dir = os.getenv("PROMPT_EXECUTIONS_DIR", "./prompt_executions") 45 | os.makedirs(execution_dir, exist_ok=True) 46 | 47 | if prompt_template: 48 | filename_base = prompt_template.replace(" ", "_").lower() 49 | else: 50 | filename_base = prompt[:50].replace(" ", "_").lower() 51 | 52 | # Clean up filename_base to ensure it's alphanumeric only 53 | filename_base = "".join( 54 | char for char in filename_base if char.isalnum() or char == "_" 55 | ) 56 | 57 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") 58 | filename = f"{filename_base}_{timestamp}.json" 59 | filepath = os.path.join(execution_dir, filename) 60 | 61 | execution_record = MultiLLMPromptExecution( 62 | prompt=prompt, 63 | prompt_template=prompt_template, 64 | prompt_responses=list_model_execution_dict, 65 | ) 66 | 67 | with open(filepath, "w") as f: 68 | json.dump(execution_record.model_dump(), f, indent=2) 69 | 70 | return filepath 71 | 72 | 73 | def get_rankings(): 74 | rankings_file = os.getenv( 75 | "LANGUAGE_MODEL_RANKINGS_FILE", "./language_model_rankings/rankings.json" 76 | ) 77 | if not os.path.exists(rankings_file): 78 | return [] 79 | with open(rankings_file, "r") as f: 80 | rankings_data = json.load(f) 81 | return [ModelRanking(**ranking) for ranking in rankings_data] 82 | 83 | 84 | def 
save_rankings(rankings: List[ModelRanking]): 85 | rankings_file = os.getenv( 86 | "LANGUAGE_MODEL_RANKINGS_FILE", "./language_model_rankings/rankings.json" 87 | ) 88 | os.makedirs(os.path.dirname(rankings_file), exist_ok=True) 89 | rankings_dict = [ranking.model_dump() for ranking in rankings] 90 | with open(rankings_file, "w") as f: 91 | json.dump(rankings_dict, f, indent=2) 92 | 93 | 94 | def reset_rankings(model_ids: List[str]): 95 | new_rankings = [ 96 | ModelRanking(llm_model_id=model_id, score=0) for model_id in model_ids 97 | ] 98 | save_rankings(new_rankings) 99 | return new_rankings 100 | -------------------------------------------------------------------------------- /src/marimo_notebook/modules/typings.py: -------------------------------------------------------------------------------- 1 | from pydantic import BaseModel, ConfigDict 2 | from typing import List, Dict, Optional, Union, Any 3 | 4 | 5 | class FusionChainResult(BaseModel): 6 | top_response: Union[str, Dict[str, Any]] 7 | all_prompt_responses: List[List[Any]] 8 | all_context_filled_prompts: List[List[str]] 9 | performance_scores: List[float] 10 | llm_model_names: List[str] 11 | 12 | 13 | class MultiLLMPromptExecution(BaseModel): 14 | prompt_responses: List[Dict[str, Any]] 15 | prompt: str 16 | prompt_template: Optional[str] = None 17 | 18 | 19 | class ModelRanking(BaseModel): 20 | llm_model_id: str 21 | score: int 22 | -------------------------------------------------------------------------------- /src/marimo_notebook/modules/utils.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import json 3 | import os 4 | from typing import Union, Dict, List 5 | 6 | OUTPUT_DIR = "output" 7 | 8 | 9 | def build_file_path(name: str): 10 | session_dir = f"{OUTPUT_DIR}" 11 | os.makedirs(session_dir, exist_ok=True) 12 | return os.path.join(session_dir, f"{name}") 13 | 14 | 15 | def build_file_name_session(name: str, session_id: str): 16 | session_dir = f"{OUTPUT_DIR}/{session_id}" 17 | os.makedirs(session_dir, exist_ok=True) 18 | return os.path.join(session_dir, f"{name}") 19 | 20 | 21 | def to_json_file_pretty(name: str, content: Union[Dict, List]): 22 | def default_serializer(obj): 23 | if hasattr(obj, "model_dump"): 24 | return obj.model_dump() 25 | raise TypeError( 26 | f"Object of type {obj.__class__.__name__} is not JSON serializable" 27 | ) 28 | 29 | with open(f"{name}.json", "w") as outfile: 30 | json.dump(content, outfile, indent=2, default=default_serializer) 31 | 32 | 33 | def current_date_time_str() -> str: 34 | return datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") 35 | 36 | 37 | def current_date_str() -> str: 38 | return datetime.datetime.now().strftime("%Y-%m-%d") 39 | 40 | 41 | def dict_item_diff_by_set( 42 | previous_list: List[Dict], current_list: List[Dict], set_key: str 43 | ) -> List[str]: 44 | previous_set = {item[set_key] for item in previous_list} 45 | current_set = {item[set_key] for item in current_list} 46 | return list(current_set - previous_set) 47 | -------------------------------------------------------------------------------- /src/marimo_notebook/temp.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/disler/marimo-prompt-library/cec10b5e57cf1c5e3e11806c47a667cb38bd32b0/src/marimo_notebook/temp.py -------------------------------------------------------------------------------- /testable_prompts/bash_commands/command_generation_1.md: 
-------------------------------------------------------------------------------- 1 | Mac: Bash: Concise: How do I list all hidden files in a directory? -------------------------------------------------------------------------------- /testable_prompts/bash_commands/command_generation_2.md: -------------------------------------------------------------------------------- 1 | Mac: Bash: Concise: How do I recursively search a directory for a file by name? -------------------------------------------------------------------------------- /testable_prompts/bash_commands/command_generation_3.md: -------------------------------------------------------------------------------- 1 | Mac: Bash: Concise: How do I resolve merge conflicts in Git when trying to merge two branches? -------------------------------------------------------------------------------- /testable_prompts/basics/hello.md: -------------------------------------------------------------------------------- 1 | Hey my name is Dan, are you ready to build? -------------------------------------------------------------------------------- /testable_prompts/basics/mult_lang_counting.xml: -------------------------------------------------------------------------------- 1 | 2 | Count to ten then zero in python, typescript, bash, sql, and rust. 3 | 4 | 5 | 6 | Respond with only runnable code 7 | Use a while loop 8 | Use the print function 9 | Group the code by language 10 | Use markdown to format the languages (h1) and code (backticks) 11 | -------------------------------------------------------------------------------- /testable_prompts/basics/ping.xml: -------------------------------------------------------------------------------- 1 | ping -------------------------------------------------------------------------------- /testable_prompts/basics/python_count_to_ten.xml: -------------------------------------------------------------------------------- 1 | 2 | Count to ten in python 3 | 4 | 5 | 6 | Respond with only runnable code 7 | Use a while loop 8 | Use the print function 9 | -------------------------------------------------------------------------------- /testable_prompts/code_debugging/code_debugging_1.md: -------------------------------------------------------------------------------- 1 | Find the bug in this code: 2 | 3 | def mult_and_sum_array(arr, multiple): 4 | multi_arr = [x * multiple for x in arr] 5 | sum = 0 6 | sum = sum(multi_arr) 7 | return sum -------------------------------------------------------------------------------- /testable_prompts/code_debugging/code_debugging_2.md: -------------------------------------------------------------------------------- 1 | Find the bug in this code: 2 | 3 | def find_max(nums): 4 | max_num = float('-inf') 5 | for num in nums: 6 | if num < max_num: 7 | max_num = num 8 | return max_num -------------------------------------------------------------------------------- /testable_prompts/code_debugging/code_debugging_3.md: -------------------------------------------------------------------------------- 1 | Identify the programming language used in the following CODE_SNIPPET: 2 | 3 | CODE_SNIPPET 4 | 5 | def example_function(): 6 | print("Hello, World!") -------------------------------------------------------------------------------- /testable_prompts/code_explanation/code_explanation_1.md: -------------------------------------------------------------------------------- 1 | Concisely explain how I can use generator functions in Python in less than 100 words. 
-------------------------------------------------------------------------------- /testable_prompts/code_explanation/code_explanation_2.md: -------------------------------------------------------------------------------- 1 | Explain what the PYTHON_CODE does in 100 words or less. 2 | 3 | PYTHON_CODE 4 | def get_first_keyword_in_prompt(prompt: str): 5 | map_keywords_to_agents = { 6 | "bash,browser": run_bash_command_workflow, 7 | "question": question_answer_workflow, 8 | "hello,hey,hi": soft_talk_workflow, 9 | "exit": end_conversation_workflow, 10 | } 11 | for keyword_group, agent in map_keywords_to_agents.items(): 12 | keywords = keyword_group.split(",") 13 | for keyword in keywords: 14 | if keyword in prompt.lower(): 15 | return agent, keyword 16 | return None, None -------------------------------------------------------------------------------- /testable_prompts/code_explanation/code_explanation_3.md: -------------------------------------------------------------------------------- 1 | Concisely explain how I can use list comprehensions in Python in less than 100 words. -------------------------------------------------------------------------------- /testable_prompts/code_generation/code_generation_1.md: -------------------------------------------------------------------------------- 1 | Implement the following python function prefix_string("abc", 2) -> "abcabc" -------------------------------------------------------------------------------- /testable_prompts/code_generation/code_generation_2.md: -------------------------------------------------------------------------------- 1 | Implement a Python function is_palindrome(s) that takes a string s and returns True if it is a palindrome, False otherwise. -------------------------------------------------------------------------------- /testable_prompts/code_generation/code_generation_3.md: -------------------------------------------------------------------------------- 1 | Implement a Python function longest_common_subsequence(s1, s2) that takes two strings s1 and s2 and returns the longest common subsequence between them. -------------------------------------------------------------------------------- /testable_prompts/code_generation/code_generation_4.md: -------------------------------------------------------------------------------- 1 | Implement a Python class Stack that represents a stack using a singly linked list. It should support push, pop, and is_empty operations. -------------------------------------------------------------------------------- /testable_prompts/context_window/context_window_1.md: -------------------------------------------------------------------------------- 1 | What was the end of year prediction made in the SCRIPT below? 2 | 3 | SCRIPT 4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. 
The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. 
First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. 
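The personal standards listed above (roughly 80% benchmark accuracy, at least 20 tokens per second, at most 32 GB of memory, a 32K-plus context window, JSON support required, vision optional) amount to a simple threshold gate. The sketch below is a hypothetical illustration in Python; the names and structure are assumptions and do not come from the benchmark repository discussed in the script:

```python
# Hypothetical encoding of the speaker's personal ELM standards as thresholds.
# Names and structure are assumptions for illustration only.
ELM_STANDARDS = {
    "min_accuracy": 0.80,          # share of benchmark tests that must pass
    "min_tokens_per_second": 20,   # anything slower is disqualified
    "max_memory_gb": 32,           # RAM/GPU budget the model may consume
    "min_context_window": 32_000,  # tokens
    "requires_json_mode": True,
    "requires_vision": False,
}

def meets_standards(result: dict) -> bool:
    """Return True if a benchmark result clears every threshold above."""
    return (
        result["accuracy"] >= ELM_STANDARDS["min_accuracy"]
        and result["tokens_per_second"] >= ELM_STANDARDS["min_tokens_per_second"]
        and result["memory_gb"] <= ELM_STANDARDS["max_memory_gb"]
        and result["context_window"] >= ELM_STANDARDS["min_context_window"]
        and (result["supports_json"] or not ELM_STANDARDS["requires_json_mode"])
    )
```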
Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. 
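The configuration just described pairs one cloud "control" model against several local "experimental" models served through Ollama. A hypothetical way to express that grouping is sketched below; the provider identifier strings are assumptions for illustration and may not match the repo's actual configuration file:

```python
# Hypothetical grouping of benchmark providers, mirroring the transcript:
# one cloud control model plus local ELMs served through Ollama.
# The identifier strings are assumptions, not copied from the repo's config.
BENCHMARK_PROVIDERS = {
    "control": ["openai:gpt-3.5-turbo"],
    "experimental_local": [
        "ollama:chat:llama3",
        "ollama:chat:phi3",
        "ollama:chat:gemma",
    ],
}

ALL_PROVIDERS = (
    BENCHMARK_PROVIDERS["control"] + BENCHMARK_PROVIDERS["experimental_local"]
)
```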
This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. 
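The assertions described here (exact-match equals, case-insensitive contains-all and contains-any, plus cost and latency thresholds) boil down to simple string and numeric checks. The string checks can be re-implemented in a few lines of plain Python; this is illustrative logic only, assuming nothing about the testing framework's actual internals:

```python
# Rough re-implementation of the assertion ideas described in the script.
# This is not the testing framework's code, only the underlying checks.
def assert_equals(output: str, expected: str) -> bool:
    # Exact match, e.g. the text-classification test that must answer "Yes".
    return output.strip() == expected

def assert_icontains_all(output: str, required: list[str]) -> bool:
    # Case-insensitive: every required phrase must appear in the response.
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required)

def assert_icontains_any(output: str, candidates: list[str]) -> bool:
    # Case-insensitive: at least one candidate phrase must appear.
    lowered = output.lower()
    return any(phrase.lower() in lowered for phrase in candidates)

# Example: a bullet-summary test passes only if every key phrase shows up.
summary = "- Agentic workflows benefit from simple, testable prompts"
print(assert_icontains_all(summary, ["agentic workflows", "prompts"]))  # True
```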
This and then we have llama 3 so close to my 80 standard we'll dig into where it went wrong in just a second here we then have phi 3 failed half of the 12 test cases and then we have gemma looks like it did one better 7 out of 12 so you can see here this is why it's important to have a control group specifically for testing elms it's really good to compare against a kind of high performing model and you know gpg 3.5 turbo it's not really even high performing anymore but it's a good benchmark for testing against local models because really if we use opus or gpt4 here the local models won't even come close so that's why i like to compare to something like gpg 3.5 you can also use cloud 3 haiku here this right away gives you a great benchmark on how local models are performing let's go ahead and look at one of these tests what happened where did things go wrong let's look at our text classification this is a simple test the prompt is is the following block of text a sql natural language query nlq respond exclusively with yes or no so this test here is going to look at how well the model can both answer correctly and answer precisely right it needs to say yes or no and then the block of text is select 10 users over the age of 21 with a gmail address and then we have the assertion type equals yes so our test case validates this test if it returns exclusively yes and we can look at the prompt test to see exactly what that looks like so if you go to test.yaml we can see we're looking for just yes this is what that test looks like right so this is our one of our text classification tests and and we have this assertion type equals yes so equals is used when you know exactly what you want the response to be a lot of the times you'll want something like a i contains all so case insensitive contains everything or a case insensitive contains any and there are lots of different assertions you can make you can easily dive into that i've linked that in the readme you'll want to look at the assertions documentation in prompt foo they have a whole list here of different assertions you can make to improve and strengthen your prompt test so that's what that test looks like and and you can kind of go through the line over each model to see exactly what went right what went wrong etc so feel free to check out the other test cases the long story short here is that by running the itv benchmark by running your personal benchmarks against local models you can have higher confidence and you can have first movers advantage on getting your hands on these local models and truly utilizing them as you can see here llama 3 is nearly within my standard of what i need an elm to do based on these 12 test cases i'll increase this to add a lot more of the use cases that i use out of these 12 test cases llama 3 is performing really really well and this is the 8b model right so if we look at a llama you can see here the default version that comes in here is the 8 billion parameter model that's the 4b quantization so pretty good stuff here i don't need to talk about how great llama 3 is the rest of the internet is doing that but it is really awesome to see how it performs on your specific use cases the closer you get to the metal here the closer you understand how these models are performing next to each other the better and the faster you're going to be able to take these models and productionize them in your tools and products i also just want to shout out how incredible it is to actually run these tests over and over and over 
again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. 
You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building. -------------------------------------------------------------------------------- /testable_prompts/context_window/context_window_2.md: -------------------------------------------------------------------------------- 1 | What was the speakers personal accuracy requirement for the benchmark made in the SCRIPT below? 2 | 3 | SCRIPT 4 | Gemma Phi 3, OpenELM, and Llama 3. Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? 
How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. 
For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. 
Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. We're using Bun, which is an all in one JavaScript runtime. Over the past year, the engineers have really matured the ecosystem. This is my go-to tool for all things JavaScript and TypeScript related. They recently just launched Windows support, which means that this code base will work out of the box for Mac, Linux, and Windows users. You can go ahead and click on this, and you'll be able to see the code base. Huge shout out to the Bun developers on all the great work here. We're using Ollama to serve our local language models. I probably don't need to introduce them. And last but not least, we're using PromptFu. I've talked about PromptFu in a few videos in the past, but it's super, super important to bring back up. This is how you can test your individual prompts against expectations. So what does that look like? If we scroll down to the hero here, you can see exactly what a test case looks like. So you have your prompts that you're going to test. So this is what you would normally type in a chat input field. And then you can go ahead and click test. And then you can go ahead and you have your individual models. Let's say you want to test OpenAI, Plod, and Mistral Large. You would put those all here. So for each provider, it's going to run every single prompt. And then at the bottom, you have your test cases. Your test cases can pass in variables to your prompts, as you can see here. And then most importantly, your test cases can assert specific expectations on the output of your LLM. 
So you can see here where you're running this type contains. We need to make sure that it has this string in it. We're making sure that the cost is below this amount, latency below this, etc. There are many different assertion types. The ITV benchmark repo uses these three key pieces of technology for a really, really simplistic experience. So you have your prompt configuration where you specify what models you want to use. You have your tests, which specify the details. So let's go ahead and look at one of these tests. You can see here, this is a simple bullet summary test. So I'm saying create a summary of the following text in bullet points. And then here's the script to one of our previous videos. So, you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting case insensitively that all of these items are in the response of the prompt. So let's go ahead and look at our output. Let's see if our prompts completed. Okay, so we have 33 success and 15 failed tests. So LLM3 ran every single one of these test cases here and reported its results. So let's go ahead and take a look at what that looks like. So after you run that was Bon ELM, after you run that you can run Bon View and if we open up package.json, and you can see Bon view just runs prompt foo view Bon view. This is going to kick off a local prompt foo server that shows us exactly what happened in the test runs. So right away, you can see we have a great summary of the results. So we have our control test failing at only one test, right. So it passed 91% accuracy. This and then we have llama 3 so close to my 80 standard we'll dig into where it went wrong in just a second here we then have phi 3 failed half of the 12 test cases and then we have gemma looks like it did one better 7 out of 12 so you can see here this is why it's important to have a control group specifically for testing elms it's really good to compare against a kind of high performing model and you know gpg 3.5 turbo it's not really even high performing anymore but it's a good benchmark for testing against local models because really if we use opus or gpt4 here the local models won't even come close so that's why i like to compare to something like gpg 3.5 you can also use cloud 3 haiku here this right away gives you a great benchmark on how local models are performing let's go ahead and look at one of these tests what happened where did things go wrong let's look at our text classification this is a simple test the prompt is is the following block of text a sql natural language query nlq respond exclusively with yes or no so this test here is going to look at how well the model can both answer correctly and answer precisely right it needs to say yes or no and then the block of text is select 10 users over the age of 21 with a gmail address and then we have the assertion type equals yes so our test case validates this test if it returns exclusively yes and we can look at the prompt test to see exactly what that looks like so if you go to test.yaml we can see we're looking for just yes this is what that test looks like right so this is our one of our text classification tests and and we have this assertion type equals yes so equals is used when you know exactly what you want the response to be a lot of the times you'll want something like a i contains all so case insensitive contains everything or a case insensitive contains any and there are lots of different assertions you can make you 
can easily dive into that i've linked that in the readme you'll want to look at the assertions documentation in prompt foo they have a whole list here of different assertions you can make to improve and strengthen your prompt test so that's what that test looks like and and you can kind of go through the line over each model to see exactly what went right what went wrong etc so feel free to check out the other test cases the long story short here is that by running the itv benchmark by running your personal benchmarks against local models you can have higher confidence and you can have first movers advantage on getting your hands on these local models and truly utilizing them as you can see here llama 3 is nearly within my standard of what i need an elm to do based on these 12 test cases i'll increase this to add a lot more of the use cases that i use out of these 12 test cases llama 3 is performing really really well and this is the 8b model right so if we look at a llama you can see here the default version that comes in here is the 8 billion parameter model that's the 4b quantization so pretty good stuff here i don't need to talk about how great llama 3 is the rest of the internet is doing that but it is really awesome to see how it performs on your specific use cases the closer you get to the metal here the closer you understand how these models are performing next to each other the better and the faster you're going to be able to take these models and productionize them in your tools and products i also just want to shout out how incredible it is to actually run these tests over and over and over again with the same model without thinking about the cost for a single second. You can see here, we're getting about 12 tokens per second across the board. So not ideal, not super great, but still everything completed fine. You can walk through the examples. A lot of these test cases are passing. This is really great. I'm gonna be keeping a pretty close eye on this stuff. So definitely like and subscribe if you're interested in the best local performing models. I feel like we're gonna have a few different classes of models, right? If we break this down, fastest, cheapest, and then it was best, slowest. And now what I think we need to do is take this and add a nest to it. So we basically say something like this, right? We say cloud, right? And then we say the slowest, most expensive. And then we say local, fastest, lower accuracy, and best, slowest, right? So things kind of change when you're at the local level. Now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically we were doing this where we had the fastest, cheapest, and we had lower accuracy. And then we had best, slowest, most expensive, right? So this is your Opus, this is your GPT-4, and this is your Haiku, GPT-3. But now we're getting into this interesting place where now we have things like this, right? Now we have PHY-3, we have LAMA-3, LAMA-3 is seven or eight billion. We also have Gemma. And then in the slowest, we have our bigger models, right? So this is where like LAMA-3 was at 70 billion, that's where this goes. And then, you know, whatever other big models that come out that are, you know, going to really trip your RAM, they're going to run slower, but they will give you the best performance that you can possibly have locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. 
We're going to be covering these on the channel and I'll likely use, you know, this class system to separate them to keep an eye on these, right? First thing that needs to happen is we need anything at all. To run locally, right? So this is kind of, you know, in the future, same with this. Right now we need just anything to run well enough. So, you know, we need decent accuracy, any speed, right? So this is what we're looking for right now. And this stuff is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. Link for the code is going to be in the description. I built this to be ultra simple. Just follow the README to get started. Thanks to Bunn. Pramphu and Ollama. This should be completely cross-platform and I'll be updating this with some additional test cases. By the time you watch this, I'll likely have added several additional tests. I'm missing some things in here like code generation, context window length testing, and a couple other sections. So look forward to that. I hope all of this makes sense. Up your feeling, the speed of the open source community building toward usable viable ELMs. I think this is something that we've all been really excited about. And it's finally starting to happen. I'm going to predict by the end of the year, we're going to have an on-device Haiku to GPT-4 level model running, consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test this as well. And that's one of the highlights of using the ITV benchmark inside of this code base. You'll be able to quickly and seamlessly get that up and running by just updating the model name, adding a new configuration here like this. And then it'll look something like this, OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the test again, right? So that's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize this. You can add Opus, you can add Haiku, you can add other models, tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistance, and HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. MacBook Pro M4 chip is released. And as the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELM's efficient language models will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video is to share this code base with you so that you can know exactly what your use case specific standards are. Because after you have standards set and a great prompting framework like PromptFu, you can then answer the question for yourself, for your tools, and for your products, is this efficient language model ready for my device? For me personally, the answer to this question is very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building. -------------------------------------------------------------------------------- /testable_prompts/context_window/context_window_3.md: -------------------------------------------------------------------------------- 1 | What was the end of year prediction made in the SCRIPT below? 2 | 3 | SCRIPT 4 | Gemma Phi 3, OpenELM, and Llama 3. 
Open source language models are becoming more viable with every single release. The terminology from Apple's new OpenELM model is spot on. These efficient language models are taking center stage in the LLM ecosystem. Why are ELMs so important? Because they reshape the business model of your agentic tools and products. When you can run a prompt directly on your device, the cost of building goes to zero. The pace of innovation has been incredible, especially with the release of Llama 3. But every time a new model drops, I'm always asking the same question. Are efficient language models truly ready for on-device use? And how do you know your ELM meets your standards? I'm going to give you a couple of examples here. The first one is that you need to know your ELM. Everyone has different standards for their prompts, prompt chains, AI agents, and agentic workflows. How do you know your personal standards are being met by Phi 3, by Llama 3, and whatever's coming next? This is something that we stress on the channel a lot. Always look at where the ball is going, not where it is. If this trend of incredible local models continue, how soon will it be until we can do what GPT-4 does right on our device? With Llama 3, it's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. It's looking like this. That time is coming very soon. In this video, we're going to answer the question, are efficient language models ready for on-device use? How do you know if they're ready for your specific use cases? Here are all the big ideas. We're going to set some standards for what ELM attributes we actually care about. There are things like RAM consumption, tokens per second, accuracy. We're going to look at some specific attributes of ELMs and talk about where they need to be for them to work on-device for us. We're going to break down the IT V-Benchmark. We'll explain exactly what that is. That's going to help us answer the question, is this model good enough for your specific use cases? And then we're going to actually run the IT V-Benchmark on Gemma 5.3 and Llama 3 for real on-device use. So we're going to look at a concrete example of the IT V-Benchmark running on my M2 MacBook Pro with 64 gigabytes of RAM and really try to answer the question in a concrete way. Is this ready for prime time? Are these ELMs, are these efficient language models ready for prime time? Let's first walk through some standards and then I'll share some of my personal standards for ELMs. So we'll look at it through the lens of how I'm approaching this as I'm building out agentic tools and products. How do we know we're ready for on-device use? First two most important metrics we need to look at, accuracy and speed. Given your test suite that validates that this model works for your use case, what accuracy do you need? Is it okay if it fails a couple of tests giving you 90% or are you okay with, you know, 60, 70 or 80%? I think accuracy is the most important benchmark we should all be paying attention to. Something like speed is also a complete blocker if it's too low. So we'll be measuring speed and TPS, tokens per second. We'll look at a range from one token per second, all the way up to grok levels, right? Of something like 500 plus, you know, 1000 tokens per second level. What else do we need to pay attention to? Memory and context window. 
So memory coupled with speed are the big two constraints for ELMs right now. Efficient language model, models that can run on your device. They chew up anywhere from four gigabytes of RAM, of GPU, of CPU, all the way up to 128 and beyond. To run Lama 3, 70 billion parameter on my MacBook, it will chew up something like half of all my available RAM. We also have context window. This is a classic one. Then we have JSON response and vision support. We're not gonna focus on these too much. These are more yes, no, do they have it or do they not? Is it multimodal or not? There are a couple other things that we need to pay attention to. First of all, we need to pay attention to these other attributes that we're missing here, but I don't think they matter as much as these six and specifically these four at the top here. So let's go ahead and walk through this through the lens of my personal standards for efficient language models. Let's break it down. So first things first, the accuracy for the ITV benchmark, which we're about to get to must hit 80%. So if a model is not passing about 80% here, I automatically disqualify it. Tokens per second. I require at least 20 tokens per second minimum. If it's below this, it's honestly just not worth it. It's too slow. There's not enough happening. Anything above this, of course we'll accept. So keep in mind when you're setting your personal standards, you're really looking for ranges, right? Anything above 80% for me is golden. Anything above 20 tokens per second at a very minimum is what we're looking for. So let's look at memory. For me, I am only willing to consume up to about 32 gigabytes of RAM, GPU, CPU. However, it ends up getting sliced. On my 64 gigabyte, I have several Docker instances and other applications that are basically running 24 seven that constrain my dev environment. Regardless, I'm looking for ELMs that consume less than 32 gigabytes of memory. Context window, for me, the sweet spot is 32K and above. Lama 3 released with 8K. I said, cool. Benchmarks look great, but it's a little too small. For some of the larger prompts and prompt chains that I'm building up, I'm looking for 32K minimum context. I highly recommend you go through and set your personal standard for each one of these metrics, as they're likely to be the most important for getting your ELM, for getting a model running on your device. So JSON response, vision support. I don't really care about vision support. This is not a high priority for me. Of course, it's a nice to have. There are image models that can run in isolation. That does the trick for me. I'm not super concerned about having local on device multimodal models, at least right now. JSON response support is a must have. For me, this is built into a lot of the model providers, and it's typically not a problem anymore. So these are my personal standards. The most important ones are up here. 80% accuracy on the ITP benchmark, which we'll talk about in just a second. We have the speed. I'm looking for 20 tokens per second at a minimum. I'm looking for a memory consumption maximum of 32. And then of course, the context window. I am simplifying a lot of the details here, especially around the memory usage. I just want to give you a high level of how to think about what your standards are for ELMs. So that when they come around, you're ready to start using it for your personal tools and products. 
Having this ready to go as soon as these models are ready will save you time and money, especially as you scale up your usage of language models. So let's talk about the ITP benchmark. What is this? It's simple. It's nothing fancy. ITP is just, is this viable? That's what the test is all about. I just want to know, is this ELM viable? Are these efficient language models, AKA on device language models good enough? This code repository we're about to dive into. It's a personalized use case specific benchmark to quickly swap in and out ELMs, AKA on device language models to know if it's ready for your tools and applications. So let's go ahead and take a quick look at this code base. Link for this is going to be in the description. Let's go ahead and crack open VS code and let's just start with the README. So let's preview this and it's simple. This uses Bunn, PromptFu, and Alama for a minimalist cross-platform local LLM prompt testing and benchmarking experience. So before we dive into this anymore, I'm just going to go ahead, open up the terminal. I'm going to type Bunn run ELM, and that's going to kick off the test. So you can see right away, I have four models running, starting with GPT 3.5 as a control model to test against. And then you can see here, we have Alama Chat, Alama 3, we have PHY, and we have Gemma running as well. So while this is running through our 12 test cases, let's go ahead and take a look at what this code base looks like. So all the details that get set up are going to be in the README. Once you're able to get set up with this in less than a minute, this code base was designed specifically for you to help you benchmark local models for your use cases so that when they're ready, you can start saving time and saving money immediately. If we look at the structure, it's very simple. We have some setup, some minor scripts, and then we have the most important thing, bench, underscore, underscore, and then whatever the test suite name is. This one's called Efficient Language Models. So let's go ahead and look at the prompt. So the prompt is just a simple template. This gets filled in with each individual test run. And if we open up our test files, you can see here, let's go ahead and collapse everything. You can see here we have a list of what do we have here, 12 tests. They're sectioned off. You can see we have string manipulation here, command generation, code explanation, text classification. This is a work in progress of my personal ELM accuracy benchmark. By the time you're watching this, there'll likely be a few additional tests here. They'll be generic enough though, so that you can come in, understand them, and tweak them to fit your own specific use case. So let's go ahead and take a look at this. So this is the test file, and we'll look into this in more detail in just a second here. But if you go to the most important file, prompt through configuration, you can see here, let's go ahead and collapse this. We have our control cloud LLM. So I like to have a kind of control and an experimental group. The control group is going to be our cloud LLM that we want to prove our local models are as good as or near the performance of. Right now I'm using dbt 3.5. And then we have our experimental local ELMs. So we're going to go ahead and take a look at this. So in here, you can see we have LLM 3, we have 5.3, and we have Gemma. Again, you can tweak these. This is all built on top of LLM. Let's go ahead and run through our tool set quickly. 
Let's run through our tool set quickly. We're using Bun, which is an all-in-one JavaScript runtime. Over the past year the engineers have really matured the ecosystem, and it's my go-to tool for all things JavaScript and TypeScript related. They recently launched Windows support, which means this code base works out of the box for Mac, Linux, and Windows users. You can click through on this and check it out; huge shout out to the Bun developers for all the great work here. We're using Ollama to serve our local language models, which probably needs no introduction. And last but not least, we're using PromptFoo. I've talked about PromptFoo in a few past videos, but it's important to bring it back up: this is how you test your individual prompts against expectations. So what does that look like? If we scroll down to the hero example, you can see exactly what a test case looks like. You have the prompts you're going to test, which is what you would normally type into a chat input field. Then you have your individual providers; say you want to test OpenAI, Claude, and Mistral Large, you would put those all here, and for each provider it's going to run every single prompt. And at the bottom, you have your test cases. Test cases can pass variables into your prompts, and most importantly, they can assert specific expectations on the output of your LLM. You can see here we're running the contains type, so we need to make sure the output has this string in it; we're making sure the cost is below a certain amount, the latency is below a certain threshold, and so on. There are many different assertion types (a rough sketch of a test case like this follows below). The ITV benchmark repo uses these three key pieces of technology for a really, really simple experience: you have your promptfoo configuration, where you specify what models you want to use, and you have your tests, which specify the details. So let's look at one of these tests. This is a simple bullet summary test. I'm saying create a summary of the following text in bullet points, and then here's the script from one of our previous videos: you know, here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. We're asserting, case-insensitively, that all of these items are in the response to the prompt. So let's look at our output and see if our prompts completed. Okay, we have 33 successes and 15 failed tests; each model ran every one of these 12 test cases and reported its results. Let's take a look at what that looks like. After you run bun elm, you can run bun view, and if we open up package.json, you can see bun view just runs promptfoo view. This kicks off a local promptfoo server that shows us exactly what happened in the test runs. Right away, you can see we have a great summary of the results. Our control model failed only one test, so it passed with 91% accuracy.
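Before digging into the per-model results, here's a minimal sketch of what a test case like the ones described above might look like. The variable name, expected strings, and thresholds are illustrative assumptions rather than the repo's real values.

```yaml
# tests file (sketch): one test case with a variable and several output assertions.
- description: bullet-point summary of a video script
  vars:
    script: "Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows..."
  assert:
    - type: icontains-all    # case-insensitive: every listed string must appear in the output
      value:
        - two-way prompt
        - agentic workflows
    - type: cost             # fail if a single call costs more than this many dollars
      threshold: 0.002
    - type: latency          # fail if the response takes longer than this many milliseconds
      threshold: 10000
```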
Then we have Llama 3, so close to my 80% standard; we'll dig into where it went wrong in just a second. We then have Phi-3, which failed half of the 12 test cases, and Gemma, which looks like it did one better at 7 out of 12. This is why it's important to have a control group specifically for testing ELMs: it's really good to compare against a high-performing model. And, you know, GPT-3.5 Turbo isn't really even high performing anymore, but it's a good benchmark for testing local models against, because if we used Opus or GPT-4 here, the local models wouldn't even come close. That's why I like to compare against something like GPT-3.5; you could also use Claude 3 Haiku here. This right away gives you a great benchmark on how local models are performing. Let's look at one of these tests: what happened, where did things go wrong? Let's look at our text classification. This is a simple test. The prompt is: is the following block of text a SQL natural language query (NLQ)? Respond exclusively with yes or no. So this test looks at how well the model can both answer correctly and answer precisely; it needs to say yes or no. The block of text is "select 10 users over the age of 21 with a gmail address", and then we have the assertion type equals yes. So our test case passes only if the model returns exclusively "yes", and we can look at the prompt test to see exactly what that looks like. If you go to test.yaml, you can see we're looking for just "yes". This is one of our text classification tests, and we have this assertion type equals yes. Equals is used when you know exactly what you want the response to be. A lot of the time you'll want something looser, like a case-insensitive contains-all or a case-insensitive contains-any, and there are lots of other assertions you can make. I've linked the assertions documentation for PromptFoo in the README; they have a whole list of different assertions you can use to improve and strengthen your prompt tests (see the sketch below). So that's what that test looks like, and you can go line by line over each model to see exactly what went right, what went wrong, and so on. Feel free to check out the other test cases. The long story short here is that by running the ITV benchmark, by running your personal benchmarks against local models, you can have higher confidence and a first-mover advantage on getting your hands on these local models and truly utilizing them. As you can see, Llama 3 is nearly within my standard of what I need an ELM to do based on these 12 test cases, and I'll be expanding the suite to cover a lot more of my use cases. Out of these 12 test cases, Llama 3 is performing really, really well, and this is the 8B model, right? If we look at Ollama, you can see the default version that comes down is the 8 billion parameter model with the 4-bit quantization. So pretty good stuff here. I don't need to talk about how great Llama 3 is; the rest of the internet is doing that. But it is really awesome to see how it performs on your specific use cases. The closer you get to the metal here, the better you understand how these models perform next to each other, and the faster you'll be able to take these models and productionize them in your tools and products. I also just want to shout out how incredible it is to run these tests over and over and over again with the same model without thinking about the cost for a single second.
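As a rough sketch of the exact-match assertion just described (my own approximation of the repo's text classification test, not a verbatim copy):

```yaml
# tests file (sketch): exact-match assertion for the NLQ classification test.
- description: is this block of text a SQL natural language query?
  vars:
    block_of_text: select 10 users over the age of 21 with gmail address
  assert:
    - type: equals      # exact match: the model must answer precisely "yes"
      value: "yes"
    # When the exact wording can vary, swap in looser assertions such as
    # icontains-all (every listed string must appear, case-insensitive) or
    # icontains-any (at least one listed string must appear).
```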
You can see here we're getting about 12 tokens per second across the board. So not ideal, not super great, but everything still completed fine. You can walk through the examples; a lot of these test cases are passing, and this is really great. I'm going to be keeping a pretty close eye on this stuff, so definitely like and subscribe if you're interested in the best-performing local models. I feel like we're going to have a few different classes of models, right? Previously, if we broke this down, it was fastest, cheapest versus best, slowest. What I think we need to do now is take this and add a level of nesting to it. So we say cloud, with its fastest, cheapest tier and its best, slowest, most expensive tier, and then we say local, with fastest, lower accuracy on one side and best, slowest on the other. Things change a bit at the local level: now we're just trading off speed and accuracy, which simplifies things a lot, right? Because basically, before, we had fastest, cheapest, lower accuracy on one side and best, slowest, most expensive on the other; that's your Opus and GPT-4 versus your Haiku and GPT-3.5. But now we're getting into this interesting place where, on the fast local side, we have Phi-3, we have Llama 3 at around 8 billion parameters, and we also have Gemma. And then in the slowest local tier we have our bigger models; this is where something like Llama 3 at 70 billion goes, along with whatever other big models come out that are really going to tax your RAM. They'll run slower, but they'll give you the best performance you can possibly get locally. So I'm keeping an eye on this. Hit the like and hit the sub if you want to stay up to date with how cloud versus local models progress. We'll be covering these on the channel, and I'll likely use this class system to separate them and keep an eye on them. The first thing that needs to happen is that we need anything at all to run well locally, right? This other stuff is kind of in the future, same with this. Right now we just need anything to run well enough: decent accuracy, any speed. That's what we're looking for right now, and the rest is going to come in the future. So that's the way I'm looking at this. The ITV benchmark can help you gain confidence in your prompts. The link for the code is in the description. I built this to be ultra simple; just follow the README to get started. Thanks to Bun, PromptFoo, and Ollama, this should be completely cross-platform, and I'll be updating it with additional test cases. By the time you watch this, I'll likely have added several more tests; I'm missing some things in here like code generation, context window length testing, and a couple of other sections, so look forward to that. I hope all of this makes sense and that you're feeling the speed of the open source community building toward usable, viable ELMs. I think this is something we've all been really excited about, and it's finally starting to happen. I'm going to predict that by the end of the year, we'll have an on-device, Haiku-to-GPT-4-level model running while consuming less than 8 gigabytes of RAM. As soon as OpenELM hits Ollama, we'll be able to test it as well (a rough sketch of adding a model like that follows below), and that's one of the highlights of using the ITV benchmark inside of this code base.
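Here's a rough sketch of what adding a new local model could look like once it lands in Ollama. The openelm tag and the 3b size below are hypothetical placeholders; use whatever name Ollama actually publishes.

```yaml
# promptfooconfig.yaml (sketch): swapping a new local model into the benchmark.
providers:
  - openai:gpt-3.5-turbo      # keep the cloud control model for comparison
  - ollama:chat:llama3
  - ollama:chat:phi3
  - ollama:chat:gemma
  - ollama:chat:openelm:3b    # hypothetical new entry; rerun the suite and it gets picked up
```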
You'll be able to quickly and seamlessly get that up and running by just updating the model name and adding a new configuration entry along the lines of the sketch above: OpenELM, and then whatever the size is going to be, say it's the 3B, and that's it. Then you just run the tests again. That's the beauty of having a test suite like this set up and ready to go. You can, of course, come in here and customize it: add Opus, add Haiku, add other models, and tweak it to your liking. That's what this is all about. I highly recommend you get in here and test this. This was important enough for me to take a break from personal AI assistants, HSE, and all of that stuff. And I'll see you guys in the next video. Bye-bye. As new hardware like the MacBook Pro M4 chip is released and the LLM community rolls out permutations of Llama 3, I think very soon, possibly before mid-2024, ELMs, efficient language models, will be ready for on-device use. Again, this is use case specific, which is really the whole point of me creating this video: to share this code base with you so that you can know exactly what your use-case-specific standards are. Because once you have standards set and a great prompting framework like PromptFoo, you can answer the question for yourself, for your tools, and for your products: is this efficient language model ready for my device? For me personally, the answer is: very soon. If you enjoyed this video, you know what to do. Thanks so much for watching. Stay focused and keep building. -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_1.md: -------------------------------------------------------------------------------- 1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name. 2 | 3 | EMAIL 4 | 5 | Subject: 6 | Action Items for next week 7 | From: 8 | john@workhard.com 9 | Body: 10 | Hey can you send over the action items for the week? -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_2.md: -------------------------------------------------------------------------------- 1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name. 2 | 3 | EMAIL 4 | 5 | Subject: 6 | Dinner plans this weekend 7 | From: 8 | sarah@gmail.com 9 | Body: 10 | Hey! Just wanted to see if you're free for dinner this Saturday? Let me know! -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_3.md: -------------------------------------------------------------------------------- 1 | Categorize the following email into one of the following categories: work, personal, newsletter, other. Respond exclusively with the category name. 2 | 3 | EMAIL 4 | 5 | Subject: 6 | Your weekly tech newsletter 7 | From: 8 | newsletter@techdigest.com 9 | Body: 10 | Here are the top tech stories for this week... -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_4.md: -------------------------------------------------------------------------------- 1 | Categorize the following email into one of the following categories: work, personal, newsletter, other.
Respond exclusively with the category name. 2 | 3 | EMAIL 4 | 5 | Subject: 6 | Your order has shipped! 7 | From: 8 | orders@onlinestore.com 9 | Body: 10 | Good news! Your recent order has shipped and is on its way to you. -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_5.md: -------------------------------------------------------------------------------- 1 | Create a concise response to the USER_EMAIL. Follow the RESPONSE_STRUCTURE. 2 | 3 | RESPONSE_STRUCTURE: 4 | - Hey 5 | - Appreciate reach out 6 | - Reframe email to confirm details 7 | - Next steps, schedule meeting next week 8 | - Thanks for your time. Stay Focused, Keep building. - Dan. 9 | 10 | USER_EMAIL: 11 | Subject: 12 | Let's move forward on the project 13 | From: 14 | john@aiconsultingco.com 15 | Body: 16 | Hey Dan, I was thinking more about the product requirements for the idea we brainstormed last week. 17 | Let's discuss pricing, timeline and move forward with the proof of concept. Can we sync next week? 18 | Thanks for your time. -------------------------------------------------------------------------------- /testable_prompts/email_management/email_management_6.md: -------------------------------------------------------------------------------- 1 | Create a concise summary of the DENSE_EMAIL. Extract the most important information into bullet points and then summarize the email in the SUMMARY_FORMAT. 2 | 3 | SUMMARY_FORMAT: 4 | Oneline Summary 5 | ... 6 | Bullet points 7 | - a 8 | - b 9 | - c 10 | 11 | DENSE_EMAIL: 12 | Subject: 13 | Project Update - New Requirements and Timeline Changes 14 | From: 15 | sarah@techconsultingfirm.com 16 | Body: 17 | Hi Dan, 18 | 19 | I wanted to provide an update on the ERP system implementation project. After our last meeting with the client, they have requested some additional features and changes to the original requirements. This includes integrating with their existing CRM system, adding advanced reporting capabilities, and supporting multi-currency transactions. 20 | 21 | Due to these new requirements, we will need to adjust our project timeline and milestones. I estimate that these changes will add approximately 3-4 weeks to our original schedule. We should also plan for additional testing and quality assurance to ensure the new features are working as expected. 22 | 23 | Please let me know if you have any concerns or questions about these changes. I think it's important that we communicate this to the client as soon as possible and set expectations around the revised timeline. -------------------------------------------------------------------------------- /testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_1.md: -------------------------------------------------------------------------------- 1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'. 2 | 3 | You work with your human companion 'Dan' to build valuable experience through software. 4 | 5 | We both like short, concise, back-and-forth conversations. 6 | 7 | Concisely communicate the following message to your human companion: 'Select an image to generate a Vue component from'. 
-------------------------------------------------------------------------------- /testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_2.md: -------------------------------------------------------------------------------- 1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'. 2 | 3 | You work with your human companion 'Dan' to build valuable experience through software. 4 | 5 | We both like short, concise, back-and-forth conversations. 6 | 7 | Communicate the following message to your human companion: 'I've found the URL in your clipboard. I'll scrape the URL and example generate code for you. But first, what about the example code would you like me to focus on?'. -------------------------------------------------------------------------------- /testable_prompts/personal_ai_assistant_responses/personal_ai_assistant_responses_3.md: -------------------------------------------------------------------------------- 1 | You are a friendly, ultra helpful, attentive, concise AI assistant named 'Ada'. 2 | 3 | You work with your human companion 'Dan' to build valuable experience through software. 4 | 5 | We both like short, concise, back-and-forth conversations. 6 | 7 | Communicate the following message to your human companion: 'Code has been written to the working directory'. -------------------------------------------------------------------------------- /testable_prompts/sql/nlq1.md: -------------------------------------------------------------------------------- 1 | Given this natural language query: "select all authed users with 'Premium' plans", generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS. 2 | 3 | INSTRUCTIONS: 4 | 5 | - ENSURE THE SQL IS VALID FOR THE DIALECT 6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL 7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query 8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else 9 | - Prefer * for SELECT statements unless the user specifies a column 10 | 11 | TABLE_DEFINITIONS: 12 | 13 | CREATE TABLE users ( 14 | id INT, 15 | created TIMESTAMP, 16 | updated TIMESTAMP, 17 | authed BOOLEAN, 18 | PLAN TEXT, 19 | name TEXT, 20 | email TEXT 21 | ); 22 | 23 | SQL Statement: 24 | -------------------------------------------------------------------------------- /testable_prompts/sql/nlq2.md: -------------------------------------------------------------------------------- 1 | Given this natural language query: 'select users created after January 1, 2022', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS. 
2 | 3 | INSTRUCTIONS: 4 | 5 | - ENSURE THE SQL IS VALID FOR THE DIALECT 6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL 7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query 8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else 9 | - Prefer * for SELECT statements unless the user specifies a column 10 | 11 | TABLE_DEFINITIONS: 12 | 13 | CREATE TABLE users ( 14 | id INT, 15 | created TIMESTAMP, 16 | updated TIMESTAMP, 17 | authed BOOLEAN, 18 | PLAN TEXT, 19 | name TEXT, 20 | email TEXT 21 | ); 22 | 23 | SQL Statement: 24 | -------------------------------------------------------------------------------- /testable_prompts/sql/nlq3.md: -------------------------------------------------------------------------------- 1 | Given this natural language query: 'select the top 3 users by most recent update', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS. 2 | 3 | INSTRUCTIONS: 4 | 5 | - ENSURE THE SQL IS VALID FOR THE DIALECT 6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL 7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query 8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else 9 | - Prefer * for SELECT statements unless the user specifies a column 10 | 11 | TABLE_DEFINITIONS: 12 | 13 | CREATE TABLE users ( 14 | id INT, 15 | created TIMESTAMP, 16 | updated TIMESTAMP, 17 | authed BOOLEAN, 18 | PLAN TEXT, 19 | name TEXT, 20 | email TEXT 21 | ); 22 | 23 | SQL Statement: 24 | -------------------------------------------------------------------------------- /testable_prompts/sql/nlq4.md: -------------------------------------------------------------------------------- 1 | Given this natural language query: 'select user names and emails of non-authed users', generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS. 2 | 3 | INSTRUCTIONS: 4 | 5 | - ENSURE THE SQL IS VALID FOR THE DIALECT 6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL 7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query 8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else 9 | - Prefer * for SELECT statements unless the user specifies a column 10 | 11 | TABLE_DEFINITIONS: 12 | 13 | CREATE TABLE users ( 14 | id INT, 15 | created TIMESTAMP, 16 | updated TIMESTAMP, 17 | authed BOOLEAN, 18 | PLAN TEXT, 19 | name TEXT, 20 | email TEXT 21 | ); 22 | 23 | SQL Statement: 24 | -------------------------------------------------------------------------------- /testable_prompts/sql/nlq5.md: -------------------------------------------------------------------------------- 1 | Given this natural language query: "select users with 'Basic' plan who were created before 2021", generate the SQL using postgres dialect that satisfies the request. Use the TABLE_DEFINITIONS below to satisfy the database query. Follow the INSTRUCTIONS. 
2 | 3 | INSTRUCTIONS: 4 | 5 | - ENSURE THE SQL IS VALID FOR THE DIALECT 6 | - USE THE TABLE DEFINITIONS TO GENERATE THE SQL 7 | - DO NOT CHANGE ANY CONTENT WITHIN STRINGS OF THE Natural Language Query 8 | - Exclusively respond with the SQL query needed to satisfy the request and nothing else 9 | - Prefer * for SELECT statements unless the user specifies a column 10 | 11 | TABLE_DEFINITIONS: 12 | 13 | CREATE TABLE users ( 14 | id INT, 15 | created TIMESTAMP, 16 | updated TIMESTAMP, 17 | authed BOOLEAN, 18 | PLAN TEXT, 19 | name TEXT, 20 | email TEXT 21 | ); 22 | 23 | SQL Statement: 24 | -------------------------------------------------------------------------------- /testable_prompts/string_manipulation/string_manipulation_1.md: -------------------------------------------------------------------------------- 1 | Create a summary of the following text in bullet points. 2 | 3 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes. Two-way prompting happens all the time when you're at work, with friends, with family, online, in comment sections, on PR reviews. You ask a question; your co-worker responds. They ask a question; you respond. Now, let's double click into what this looks like for your agentic tools. Right in agentic workflows, you are the critical communication process between you and your AI agents that are aiming to drive outcomes. In most agentic workflows, we fire off one prompt or configure some system prompt, and that's it. But we're missing a ton of opportunity here that we can unlock using two-way prompts. Let me show you a concrete example with Ada. So, Ada, of course, is our proof of concept personal AI assistant. And let me just go ahead and kick off this workflow so I can show you exactly how useful the two-way prompt can be. Ada, let's create some example code. -------------------------------------------------------------------------------- /testable_prompts/string_manipulation/string_manipulation_2.md: -------------------------------------------------------------------------------- 1 | Take the following text and for each sentence, convert it into a bullet point. Do not change the sentence. Maintain the punctuation. The punctuations '.!?' should trigger bullet points. Use - for the bullet point. 2 | 3 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes. 
-------------------------------------------------------------------------------- /testable_prompts/string_manipulation/string_manipulation_3.md: -------------------------------------------------------------------------------- 1 | Convert the following SCRIPT to markdown, follow the SCRIPTING_RULES. 2 | 3 | SCRIPTING_RULES 4 | - Create 1 h1 header with an interesting title. 5 | - Create 2 h2 sub headers, one for the summary and one for the details. 6 | - Each section should contain bullet points. 7 | - Start each section with a hook. 8 | - Use short paragraphs. 9 | - Use emojis to indicate the hooks. 10 | 11 | SCRIPT 12 | Here's a simple yet powerful idea that can help you take a large step toward useful and valuable agentic workflows. I have to warn you, this is one of those things that sounds really obvious after you hear it because it's hiding in plain sight. The idea is simple; it's called the two-way prompt. So, what is this? Why is it useful? And how can it help you build better AI agent workflows? Two-way prompting happens all the time in real collaborative workspaces. You are effectively two or more agents prompting each other to drive outcomes. -------------------------------------------------------------------------------- /testable_prompts/text_classification/text_classification_1.md: -------------------------------------------------------------------------------- 1 | Is the following BLOCK_OF_TEXT a SQL Natural Language Query (NLQ)? Respond Exclusively with 'yes' or 'no'. 2 | 3 | BLOCK_OF_TEXT 4 | 5 | select 10 users over the age of 21 with gmail address -------------------------------------------------------------------------------- /testable_prompts/text_classification/text_classification_2.md: -------------------------------------------------------------------------------- 1 | Determine if the sentiment of the following TEXT is positive or negative. Respond exclusively with 'positive' or 'negative'. 2 | 3 | TEXT 4 | 5 | I love sunny days there's nothing like getting out in nature --------------------------------------------------------------------------------