├── 2024
│   ├── MLX-day-one.ipynb
│   ├── MLX-day-one.md
│   ├── README.md
│   ├── conversion-etc.ipynb
│   ├── conversion-etc.md
│   ├── rag-basics1.ipynb
│   ├── rag-basics1.md
│   ├── rag-basics2.ipynb
│   └── rag-basics2.md
├── .gitignore
├── LICENSE
├── README.md
└── assets
    ├── images
    │   └── 2024
    │       ├── Apple-MLX-GeminiGen-cropped.jpg
    │       ├── Apple-MLX-GeminiGen.jpeg
    │       ├── RAG-basics-1-cover.jpg
    │       ├── RAG-basics-2-cover.jpg
    │       ├── apple-mlx-ail-llm-day-one.gif
    │       ├── black-panther-hulk-cover-open-source-llm-800x500.jpg
    │       ├── construction-joke-meme.jpg
    │       ├── meme-arrested_dev_why_you_use_open_source.jpg
    │       ├── nasa-eclipse-diamong-ring.png
    │       ├── ogbuji-kids-eclipse-2017.jpg
    │       ├── oranges-to-apples.png
    │       ├── rag-process-gao-et-al.png
    │       ├── rmaiig-engineering-202404.png
    │       ├── tokenizer-examples.png
    │       ├── vlite-install-errors.png
    │       └── vlite-perf-claims.png
    └── resources
        └── 2024
            ├── ragbasics
            │   ├── files
            │   │   ├── MLX-day-one.md
            │   │   ├── conversion-etc.md
            │   │   ├── rag-basics1.md
            │   │   └── rag-basics2.md
            │   └── listings
            │       ├── qdrant_build_db.py
            │       ├── qdrant_rag_101.py
            │       ├── vlite_build_db.py
            │       ├── vlite_custom_split_build_db.py
            │       └── vlite_retrieve.py
            └── rmiug-pres-april
                ├── README.md
                └── test.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Embargo section ------------
2 |
3 | TEMPLATE.md
4 | UPCOMING.md
5 | 2024/integrator-view.md
6 | 2024/playing-with-network-integrations.md
7 | 2024/fun-with-graf.md
8 | 2024/rag-basics.md
9 | 2024/working-with-agents.md
10 |
11 | # ----------------------------
12 |
13 | scratch
14 |
15 | env.sh
16 | **/.DS_Store
17 |
--------------------------------------------------------------------------------
/2024/MLX-day-one.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "2c6807f7",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "\n",
10 | "# Apple MLX for AI/Large Language Models—Day One\n",
11 | "\n",
12 | "_Author: Uche Ogbuji_\n",
13 | "\n",
14 | "I've been using llama.cpp on Mac Silicon for months now, and [my brother, Chimezie](https://huggingface.co/cogbuji) has been nudging me to give [MLX](https://github.com/ml-explore/mlx) a go.\n",
15 | "I finally set aside time today to get started, with an eventual goal of adding support for MLX model loading & usage in [OgbujiPT](https://github.com/OoriData/OgbujiPT). I'd been warned it's rough around the edges, but it's been stimulating to play with. I thought I'd capture some of my notes, including some pitfalls I ran into, which might help anyone else trying to get into MLX in its current state.\n",
16 | "\n",
17 | "As a quick bit of background I'll mention that MLX is very interesting because honestly, Apple has the most coherently engineered consumer and small-business-level hardware for AI workloads, with Apple Silicon and its unified memory. The news lately is all about Apple's AI fumbles, but I suspect their clever plan is to empower a community of developers to take the arrows in their back and build things out for them. The MLX community is already an absolute machine, a fact Chimezie spotted early on. If like me you're trying to develop products on this new frontier without abdicating the engineering to separate, black-box providers, MLX is a compelling avenue.\n",
18 | "\n",
19 | "\n",
20 | "\n",
21 | "My initial forays will just be into inferencing, which should complement the large amount of solid community work in MLX fine-tuning and other more advanced topics. There's plenty of nuance to dig into just on the inference side, though.\n",
22 | "As I was warned, it's clear that MLX is developing with great velocity, even by contemporary AI standards, so just as some resources I found from six weeks ago were already out of date, this could also well be by the time you come across it. I'll try to update and continue taking notes on developments as I go along, though.\n",
23 | "\n",
24 | "First of all, I installed the mlx_lm package for Python, following the [instructions from HuggingFace](https://huggingface.co/docs/hub/en/mlx). After switching to a suitable Python virtual environment:"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "id": "4883f0fd",
31 | "metadata": {},
32 | "outputs": [
33 | {
34 | "name": "stdout",
35 | "output_type": "stream",
36 | "text": [
37 | "Requirement already satisfied: mlx-lm in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (0.3.0)\n",
38 | "Requirement already satisfied: mlx>=0.6 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from mlx-lm) (0.6.0)\n",
39 | "Requirement already satisfied: numpy in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from mlx-lm) (1.26.2)\n",
40 | "Requirement already satisfied: transformers>=4.38.0 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from mlx-lm) (4.38.2)\n",
41 | "Requirement already satisfied: protobuf in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from mlx-lm) (4.25.1)\n",
42 | "Requirement already satisfied: pyyaml in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from mlx-lm) (6.0.1)\n",
43 | "Requirement already satisfied: filelock in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (3.13.1)\n",
44 | "Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (0.19.4)\n",
45 | "Requirement already satisfied: packaging>=20.0 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (23.2)\n",
46 | "Requirement already satisfied: regex!=2019.12.17 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (2023.10.3)\n",
47 | "Requirement already satisfied: requests in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (2.31.0)\n",
48 | "Requirement already satisfied: tokenizers<0.19,>=0.14 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (0.15.0)\n",
49 | "Requirement already satisfied: safetensors>=0.4.1 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (0.4.1)\n",
50 | "Requirement already satisfied: tqdm>=4.27 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from transformers>=4.38.0->mlx-lm) (4.66.1)\n",
51 | "Requirement already satisfied: fsspec>=2023.5.0 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers>=4.38.0->mlx-lm) (2023.12.1)\n",
52 | "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from huggingface-hub<1.0,>=0.19.3->transformers>=4.38.0->mlx-lm) (4.8.0)\n",
53 | "Requirement already satisfied: charset-normalizer<4,>=2 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from requests->transformers>=4.38.0->mlx-lm) (3.3.2)\n",
54 | "Requirement already satisfied: idna<4,>=2.5 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from requests->transformers>=4.38.0->mlx-lm) (3.6)\n",
55 | "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from requests->transformers>=4.38.0->mlx-lm) (1.26.18)\n",
56 | "Requirement already satisfied: certifi>=2017.4.17 in /Users/uche/.local/venv/ogpt/lib/python3.11/site-packages (from requests->transformers>=4.38.0->mlx-lm) (2023.11.17)\n",
57 | "Note: you may need to restart the kernel to use updated packages.\n"
58 | ]
59 | }
60 | ],
61 | "source": [
62 | "%pip install mlx-lm"
63 | ]
64 | },
65 | {
66 | "cell_type": "markdown",
67 | "id": "e573fb5e",
68 | "metadata": {},
69 | "source": [
70 | "Later on, it became clear that I probably wanted to keep closer to the cutting edge, so I pulled from github instead:\n",
71 | "\n",
72 | "```sh\n",
73 | "git clone https://github.com/ml-explore/mlx-examples.git\n",
74 | "cd mlx-examples/llms\n",
75 | "pip install -U .\n",
76 | "```"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "id": "bf40c1df",
82 | "metadata": {},
83 | "source": [
84 | "All I needed was a model to try out. On llama.cpp my go-to has been [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), so my first move was to try to run that on MLX. I had read MLX gained limited GGUF model file format support recently, with limited support for quantization outputs. If this sentence has been gobbledygook to you,\n",
85 | "I recommend you pause, [read this useful, llama.cpp-centered tutorial](https://christophergs.com/blog/running-open-source-llms-in-python), and come back. These concepts will be useful to you no matter what AI/LLM framework you end up using.\n",
86 | "\n",
87 | "I naively just tried to load my already downloaded GGUF using `mlx_lm.load()`, but clearly wanted a `safetensors` distribution. I looked around some more and found the [GGUF](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm) examples, but it was clear this was off the beaten path, and Chimezie soon told me the usual approach is to use MLX-specific models, which I can easily convert myself from regular model weights, or I can find pre-converted weights in the [mlx-community space](https://huggingface.co/mlx-community).\n",
88 | "The first/obvious such repository I found matching OpenHermes-2.5-Mistral-7B was `mlx-community/OpenHermes-2.5-Mistral-7B`, but MLX refused to load it, and indeed it's an outdated model without `safetensors`. It used the `.NPZ` format, which seems to be out of date and [yet is still referenced in the docs](https://ml-explore.github.io/mlx/build/html/examples/llama-inference.html#converting-the-weights).\n",
89 | "\n",
90 | "\n",
92 | "\n",
93 | "A better choice turned out to be [`mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx`](https://huggingface.co/mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx)."
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 2,
99 | "id": "1a102628",
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "data": {
104 | "application/vnd.jupyter.widget-view+json": {
105 | "model_id": "a3215b3afd724e4e9124efa6abc3b31d",
106 | "version_major": 2,
107 | "version_minor": 0
108 | },
109 | "text/plain": [
110 | "Fetching 8 files: 0%| | 0/8 [00:00, ?it/s]"
111 | ]
112 | },
113 | "metadata": {},
114 | "output_type": "display_data"
115 | },
116 | {
117 | "name": "stderr",
118 | "output_type": "stream",
119 | "text": [
120 | "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
121 | ]
122 | }
123 | ],
124 | "source": [
125 | "from mlx_lm import load, generate\n",
126 | "\n",
127 | "model, tokenizer = load('mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx')"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "id": "b645a65a",
133 | "metadata": {},
134 | "source": [
135 | "\n",
136 | "The first time you run this load it will download from HuggingFace. The repository will be cached, by default in `~/.cache/huggingface/hub`, so subsequent loads will be much faster. Quick completion/generation example:"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 3,
142 | "id": "e0bdbad7",
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "==========\n",
150 | "Prompt: A fun limerick about four-leaf clovers is:\n",
151 | "\n",
152 | "\n",
153 | "There once was a leprechaun named Joe\n",
154 | "Who found a four-leaf clover, you know\n",
155 | "He rubbed it and wished\n",
156 | "For a pot of gold, oh my!\n",
157 | "And now he's the luckiest leprechaun in town.\n",
158 | "\n",
159 | "This limerick is a playful and lighthearted way to celebrate the luck of the Irish and the mythical four-leaf clover. The rhyme scheme and rhythm make it easy to remember and\n",
160 | "==========\n",
161 | "Prompt: 13.374 tokens-per-sec\n",
162 | "Generation: 37.459 tokens-per-sec\n",
163 | "\n",
164 | "\n",
165 | "There once was a leprechaun named Joe\n",
166 | "Who found a four-leaf clover, you know\n",
167 | "He rubbed it and wished\n",
168 | "For a pot of gold, oh my!\n",
169 | "And now he's the luckiest leprechaun in town.\n",
170 | "\n",
171 | "This limerick is a playful and lighthearted way to celebrate the luck of the Irish and the mythical four-leaf clover. The rhyme scheme and rhythm make it easy to remember and\n"
172 | ]
173 | }
174 | ],
175 | "source": [
176 | "response = generate(model, tokenizer, prompt=\"A fun limerick about four-leaf clovers is:\", verbose=True)\n",
177 | "print(response)"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "id": "6d1d7869",
183 | "metadata": {},
184 | "source": [
185 | "\n",
186 | "You should see the completion response being streamed. I got a truly terrible limerick. Your mileage may very.\n",
187 | "\n",
188 | "You can also use [ChatML-style interaction](https://huggingface.co/docs/transformers/main/en/chat_templating):"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": 4,
194 | "id": "5e44937a",
195 | "metadata": {},
196 | "outputs": [
197 | {
198 | "name": "stdout",
199 | "output_type": "stream",
200 | "text": [
201 | "==========\n",
202 | "Prompt: <|im_start|>system\n",
203 | "You are a friendly chatbot who always responds in the style of a talk show host<|im_end|>\n",
204 | "<|im_start|>user\n",
205 | "Do you have any advice for a fresh graduate?<|im_end|>\n",
206 | "\n",
207 | "\n",
208 | "Chatbot: Welcome to the real world, my friend! It's a big, beautiful, and sometimes scary place. But don't worry, I've got some advice that'll help you navigate these waters.\n",
209 | "\n",
210 | "First things first, don't be afraid to ask for help. Whether it's from a mentor, a colleague, or even me, your friendly chatbot, don't be shy to reach out. We've all been there\n",
211 | "==========\n",
212 | "Prompt: 162.485 tokens-per-sec\n",
213 | "Generation: 38.943 tokens-per-sec\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "messages = [\n",
219 | " {'role': 'system', 'content': 'You are a friendly chatbot who always responds in the style of a talk show host'},\n",
220 | " {'role': 'user', 'content': 'Do you have any advice for a fresh graduate?'}]\n",
221 | "chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)\n",
222 | "response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "id": "79703aad",
228 | "metadata": {},
229 | "source": [
230 | "`response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>`. Having the system message in the chat prompting and all that definitely, by my quick impressions, made the interactions far more coherent.\n",
231 | "\n",
232 | "\n",
233 | "\n",
234 | "That's as far as I got in a few hours of probing yesterday, but as I said, I'll keep the notes coming as I learn more. Next I plan to start thinking about how to incorporate what I've learned into OgbujiPT.\n",
235 | "\n",
236 | "Plug: As I've suggested, Chimezie has blazed this trail before me, and was quite helpful. You can check out the work he's already shared with the MLX community, such as his [Mr. Grammatology medical/clinical LLM fine-tune](https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5), and [mlx-tuning-fork](https://github.com/chimezie/mlx-tuning-fork), his framework for (Q)LoRa fine-tuning with MLX. [His work is featured in the brand new Oori Data HuggingFace organization page.](https://huggingface.co/OoriData)."
237 | ]
238 | }
239 | ],
240 | "metadata": {
241 | "jupytext": {
242 | "cell_metadata_filter": "-all",
243 | "main_language": "sh",
244 | "notebook_metadata_filter": "-all"
245 | },
246 | "kernelspec": {
247 | "display_name": "ogpt",
248 | "language": "python",
249 | "name": "python3"
250 | },
251 | "language_info": {
252 | "codemirror_mode": {
253 | "name": "ipython",
254 | "version": 3
255 | },
256 | "file_extension": ".py",
257 | "mimetype": "text/x-python",
258 | "name": "python",
259 | "nbconvert_exporter": "python",
260 | "pygments_lexer": "ipython3",
261 | "version": "3.11.6"
262 | }
263 | },
264 | "nbformat": 4,
265 | "nbformat_minor": 5
266 | }
267 |
--------------------------------------------------------------------------------
/2024/MLX-day-one.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Apple MLX for AI/Large Language Models—Day One
4 |
5 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
6 |
7 | I've been using llama.cpp on Mac Silicon for months now, and [my brother, Chimezie](https://huggingface.co/cogbuji) has been nudging me to give [MLX](https://github.com/ml-explore/mlx) a go.
8 | I finally set aside time today to get started, with an eventual goal of adding support for MLX model loading & usage in [OgbujiPT](https://github.com/OoriData/OgbujiPT). I'd been warned it's rough around the edges, but it's been stimulating to play with. I thought I'd capture some of my notes, including some pitfalls I ran into, which might help anyone else trying to get into MLX in its current state.
9 |
10 | As a quick bit of background I'll mention that MLX is very interesting because honestly, Apple has the most coherently engineered consumer and small-business-level hardware for AI workloads, with Apple Silicon and its unified memory. The news lately is all about Apple's AI fumbles, but I suspect their clever plan is to empower a community of developers to take the arrows in their back and build things out for them. The MLX community is already an absolute machine, a fact Chimezie spotted early on. If like me you're trying to develop products on this new frontier without abdicating the engineering to separate, black-box providers, MLX is a compelling avenue.
11 |
12 | 
13 |
14 | My initial forays will just be into inferencing, which should complement the large amount of solid community work in MLX fine-tuning and other more advanced topics. There's plenty of nuance to dig into just on the inference side, though.
15 | As I was warned, it's clear that MLX is developing with great velocity, even by contemporary AI standards, so just as some resources I found from six weeks ago were already out of date, this article could well be out of date by the time you come across it. I'll try to update and continue taking notes on developments as I go along, though.
16 |
17 | First of all, I installed the mlx_lm package for Python, following the [instructions from HuggingFace](https://huggingface.co/docs/hub/en/mlx). After switching to a suitable Python virtual environment:
18 |
19 | ```sh
20 | pip install mlx-lm
21 | ```
22 |
23 | Later on, it became clear that I probably wanted to keep closer to the cutting edge, so I pulled from github instead:
24 |
25 | ```sh
26 | git clone https://github.com/ml-explore/mlx-examples.git
27 | cd mlx-examples/llms
28 | pip install -U .
29 | ```
30 |
31 | All I needed was a model to try out. On llama.cpp my go-to has been [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), so my first move was to try to run that on MLX. I had read MLX gained limited GGUF model file format support recently, with limited support for quantization outputs. If this sentence has been gobbledygook to you,
32 | I recommend you pause, [read this useful, llama.cpp-centered tutorial](https://christophergs.com/blog/running-open-source-llms-in-python), and come back. These concepts will be useful to you no matter what AI/LLM framework you end up using.
33 |
34 | I naively just tried to load my already downloaded GGUF using `mlx_lm.load()`, but it clearly wanted a `safetensors` distribution. I looked around some more and found the [GGUF](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm) examples, but it was clear this was off the beaten path, and Chimezie soon told me the usual approach is to use MLX-specific models, which I can easily convert myself from regular model weights, or I can find pre-converted weights in the [mlx-community space](https://huggingface.co/mlx-community).
35 | The first/obvious such repository I found matching OpenHermes-2.5-Mistral-7B was `mlx-community/OpenHermes-2.5-Mistral-7B`, but MLX refused to load it, and indeed it's an outdated model without `safetensors`. It used the `.NPZ` format, which seems to be out of date and [yet is still referenced in the docs](https://ml-explore.github.io/mlx/build/html/examples/llama-inference.html#converting-the-weights).
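
If you want to check what weight formats a repository actually ships before attempting a load, the `huggingface_hub` library (already installed as a dependency of `mlx-lm`) can list a repo's files. A minimal sketch, comparing the outdated repository with the one I ended up using:

```py
from huggingface_hub import list_repo_files

for repo in ('mlx-community/OpenHermes-2.5-Mistral-7B',
             'mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx'):
    # Keep only the weight files, to see which format each repo provides
    weights = [f for f in list_repo_files(repo) if f.endswith(('.safetensors', '.npz'))]
    print(repo, '->', weights)
```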
36 |
37 | 
39 |
40 | A better choice turned out to be [`mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx`](https://huggingface.co/mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx).
41 |
42 | ```py
43 | from mlx_lm import load, generate
44 |
45 | model, tokenizer = load('mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx')
46 | ```
47 |
48 | The first time you run this load, it will download from HuggingFace. The repository will be cached, by default in `~/.cache/huggingface/hub`, so subsequent loads will be much faster. Quick completion/generation example:
49 |
50 | ```py
51 | response = generate(model, tokenizer, prompt="A fun limerick about four-leaf clovers is:", verbose=True)
52 | ```
53 |
54 | You should see the completion response being streamed. I got a truly terrible limerick. Your mileage may vary.
55 |
56 | You can also use [ChatML-style interaction](https://huggingface.co/docs/transformers/main/en/chat_templating):
57 |
58 | ```py
59 | messages = [
60 | {'role': 'system', 'content': 'You are a friendly chatbot who always responds in the style of a talk show host'},
61 | {'role': 'user', 'content': 'Do you have any advice for a fresh graduate?'}]
62 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
63 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
64 | ```
65 |
66 | `response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>` (more on LLM tokens in a future article). Having the system message and the rest of the chat structure in place definitely made the interactions far more coherent, by my quick impressions.
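
For the messages above, the converted prompt came out like this in my run (the exact rendering is governed by the model's chat template, so treat this as illustrative):

```
<|im_start|>system
You are a friendly chatbot who always responds in the style of a talk show host<|im_end|>
<|im_start|>user
Do you have any advice for a fresh graduate?<|im_end|>
```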
67 |
68 | 
69 |
70 | That's as far as I got in a few hours of probing yesterday, but as I said, I'll keep the notes coming as I learn more. Next I plan to start thinking about how to incorporate what I've learned into OgbujiPT.
71 |
72 | Plug: As I've suggested, Chimezie has blazed this trail before me, and was quite helpful. You can check out the work he's already shared with the MLX community, such as his [Mr. Grammatology medical/clinical LLM fine-tune](https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5), and [mlx-tuning-fork](https://github.com/chimezie/mlx-tuning-fork), his framework for (Q)LoRA fine-tuning with MLX. [His work is featured in the brand new Oori Data HuggingFace organization page](https://huggingface.co/OoriData).
--------------------------------------------------------------------------------
/2024/README.md:
--------------------------------------------------------------------------------
1 | # Notes on the Apple MLX machine learning framework
2 |
3 | ## Apple MLX for AI/Large Language Models—Day One
4 |
5 |
13 |
14 | ## Converting models from Hugging Face to MLX format, and sharing
15 |
16 |
24 |
25 | ## Retrieval augmentation with MLX: A bag full of RAG, part 1
26 |
27 |
35 |
36 | ## Retrieval augmentation with MLX: A bag full of RAG, part 2
37 |
38 |
49 |
--------------------------------------------------------------------------------
/2024/conversion-etc.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "b610f3c8",
6 | "metadata": {},
7 | "source": [
8 | "\n",
10 | "\n",
11 | "# Converting models from Hugging Face to MLX format, and sharing\n",
12 | "\n",
13 | "_Author: [Uche Ogbuji](https://ucheog.carrd.co/)_\n",
14 | "\n",
15 | "Since my [first article on dipping my toes into MLX](https://github.com/uogbuji/mlx-notes/blob/main/2024/MLX-day-one.md) I've had several attention swaps, but a trip to Cleveland for prime eclipse viewing in totality gave me a chance get back to the framework. Of course the MLX team and community keep marching on, and there have been several exciting releases, and performance boosts, since my last look-in.\n",
16 | "\n",
17 | "At the same time, coincidentally, a new small model was released which I wanted to try out. [H2O-Danube2-1.8b, and in particular the chat version](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat) debuts at #2 in the \"~1.5B parameter\" category on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which is promising. It was then only available in Hugging Face weight format, so I needed to convert it for MLX use.\n",
18 | "\n",
19 | "I'll use this new model to explore how easy it is to convert Hugging Face weights to MLX format, and to share the results, if one chooses to do so.\n",
20 | "\n",
21 | " [source](https://science.nasa.gov/eclipses/future-eclipses/eclipse-2024/what-to-expect/)\n",
22 | "\n",
23 | "## Preparation and conversion\n",
24 | "\n",
25 | "First of all I upgraded the MLX versions in the virtual environment I was using"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "id": "23aa283e",
32 | "metadata": {
33 | "vscode": {
34 | "languageId": "shellscript"
35 | }
36 | },
37 | "outputs": [],
38 | "source": [
39 | "pip install -U mlx mlx-lm"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "id": "9a91f43c",
45 | "metadata": {},
46 | "source": [
47 | "If you do so, and in the unlikely event that you don't end up with the latest versions of packages after this, you might want to add the ` --force-reinstall` flag.\n",
48 | "\n",
49 | "I created a directory to hold the converted model"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "id": "e4bdfb5e",
56 | "metadata": {
57 | "vscode": {
58 | "languageId": "shellscript"
59 | }
60 | },
61 | "outputs": [],
62 | "source": [
63 | "mkdir -p ~/.local/share/models/mlx\n",
64 | "mkdir ~/.local/share/models/mlx/h2o-danube2-1.8b-chat"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "id": "8edc99cc",
70 | "metadata": {},
71 | "source": [
72 | "Used the command line for the actual conversion"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "id": "1ae20d98",
79 | "metadata": {
80 | "vscode": {
81 | "languageId": "shellscript"
82 | }
83 | },
84 | "outputs": [],
85 | "source": [
86 | "python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat -q"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "id": "1067039e",
92 | "metadata": {},
93 | "source": [
94 | "This took around ten and a half minutes of wall clock time—downloading and converting. The `-q` option quantizes the weights while converting them. The default quantization is to 4 bits (from the standard Hugging Face weight format of 16-bit floating point), but you can choose a different result bits per weight, and other quantization parameters with other command line options.\n",
95 | "\n",
96 | "It's good to be aware of the model type and architecture you're dealing with, which doesn't change when you convert weights to MLX. Eyeballing the [h2o-danube2-1.8b-chat config.json](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/blob/main/config.json), I found the following useful bits:\n",
97 | "\n",
98 | "```json\n",
99 | "\"architectures\": [\n",
100 | " \"MistralForCausalLM\"\n",
101 | "],\n",
102 | "…\n",
103 | "\"model_type\": \"mistral\",\n",
104 | "```\n",
105 | "\n",
106 | "Luckily Mistral-style models are well supported, thanks to their popularity.\n",
107 | "\n",
108 | "# Loading and using the converted model from Python\n",
109 | "\n",
110 | "I loaded the model from local directory\n"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "id": "b15de53f",
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "from mlx_lm import load, generate\n",
121 | "from pathlib import Path\n",
122 | "\n",
123 | "model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')\n",
124 | "model, tokenizer = load(model_path)"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "id": "d84130d1",
130 | "metadata": {},
131 | "source": [
132 | "\n",
133 | "This led to a warning\n",
134 | "\n",
135 | "> You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers\n",
136 | "\n",
137 | "Some digging into the Hugging Face transformers library, from whence it originates, yielded no easy answers as to how seriously to take this warning.\n",
138 | "\n",
139 | "First ran the model using a similar pattern to the example in the last article, but it tripped up on the chat format."
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "id": "15a9da09",
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "messages = [\n",
150 | " {'role': 'system', 'content': 'You are a friendly and informative chatbot'},\n",
151 | " {'role': 'user', 'content': 'There\\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]\n",
152 | "chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)\n",
153 | "response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)"
154 | ]
155 | },
156 | {
157 | "cell_type": "markdown",
158 | "id": "d1af3f0c",
159 | "metadata": {},
160 | "source": [
161 | "I got `TemplateError: System role not supported`. Not all chat models are trained/fine-tuned with the system role. If you try to set system messages but the Hugging Face tokenizer doesn't recognize system role support, it sends a strong signal through this exception. Much better than silently confusing the model. There is no universal workaround for this—it all comes down to details of how the model was trained. I didn't do a lot of investigation of the H2O Danube 2 chat template. Instead I just basically slammed the system prompt into the user role."
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "id": "eac2b048",
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "from mlx_lm import load, generate\n",
172 | "from pathlib import Path\n",
173 | "\n",
174 | "model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')\n",
175 | "model, tokenizer = load(model_path) # Issues a slow tokenizer warning\n",
176 | "\n",
177 | "SYSTEM_ROLE = 'user'\n",
178 | "messages = [\n",
179 | " {'role': SYSTEM_ROLE, 'content': 'You are a friendly and informative chatbot'},\n",
180 | " {'role': 'user', 'content': 'There\\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]\n",
181 | "chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)\n",
182 | "response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "id": "1ab58b00",
188 | "metadata": {},
189 | "source": [
190 | "The generation felt faster than last article's run on OpenHermes/Mistral7b, and indeed the reported numbers were impressive—running on a 2021 Apple M1 Max MacBook Pro (64GB RAM):\n",
191 | "\n",
192 | "```\n",
193 | "Prompt: 84.037 tokens-per-sec\n",
194 | "Generation: 104.326 tokens-per-sec\n",
195 | "```\n",
196 | "\n",
197 | "Back of the envelope says that's 3-4X faster prompt processing and 2-3X faster generation. Some of that is the fact that H2O-Danube2 is around 4X smaller, but some of it is down to improvements in the MLX code.\n",
198 | "\n",
199 | "\n",
200 | "\n",
201 | "# Uploading converted models to Hugging Face\n",
202 | "\n",
203 | "Unlike the example above, you'll often find that models have been converted to MLX weights for you already. This is of course the beauty of an open-source community. If you do convert a model yourself, you can be part of the sharing spree.\n",
204 | "\n",
205 | "## Preparing for upload\n",
206 | "\n",
207 | "You'll need an account on Hugging Face, then an access token with write permissions. Copy one from [your tokens settings](https://huggingface.co/settings/tokens) into your clipboard (and password manager). Install the Hugging Face tools"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "id": "bef3124e",
214 | "metadata": {
215 | "vscode": {
216 | "languageId": "shellscript"
217 | }
218 | },
219 | "outputs": [],
220 | "source": [
221 | "pip install -U huggingface_hub"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "id": "103e3529",
227 | "metadata": {},
228 | "source": [
229 | "Run `huggingface-cli login` and paste the token you copied earlier. You're now ready for the upload. It will push everything in your local folder with the converted weights, so you should probably check that it's ready for the public. At a minimum add a `README.md` (more on this below) and look over the `config.json`, making sure there is at least a `\"model_type\"` key. In this case, it's unchanged from the original: `\"model_type\": \"mistral\"`. Browsing other, recent model repositories for the [`mlx-community`](https://huggingface.co/mlx-community) is a good way to get a sense of what your upload should contain.\n",
230 | "\n",
231 | "### README.md\n",
232 | "\n",
233 | "You'll want to have a README.md file, from which the Hugging Face model card and some metadata are extracted. I started with the metadata from the original model. It's MDX format, which is markdown with metadata headers and optional inline instructions. The [original model's metadata headers are as follows](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/raw/main/README.md):\n",
234 | "\n",
235 | "```\n",
236 | "---\n",
237 | "language:\n",
238 | "- en\n",
239 | "library_name: transformers\n",
240 | "license: apache-2.0\n",
241 | "tags:\n",
242 | "- gpt\n",
243 | "- llm\n",
244 | "- large language model\n",
245 | "- h2o-llmstudio\n",
246 | "thumbnail: >-\n",
247 | " https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico\n",
248 | "pipeline_tag: text-generation\n",
249 | "---\n",
250 | "```\n",
251 | "\n",
252 | "I added some descriptive information about the model and how to use it in MLX. This information becomes the Hugging Face model card for the upload.\n",
253 | "\n",
254 | "## Upload\n",
255 | "\n",
256 | "[Hugging Face repositories are basically git and git-LFS](https://huggingface.co/docs/huggingface_hub/guides/upload), so you have many ways of interacting with them. In my case I ran a Python script:"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": null,
262 | "id": "d93edb57",
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "from huggingface_hub import HfApi, create_repo\n",
267 | "from pathlib import Path\n",
268 | "\n",
269 | "model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')\n",
270 | "\n",
271 | "repo_id = create_repo('h2o-danube2-1.8b-chat-MLX-4bit').repo_id\n",
272 | "api = HfApi()\n",
273 | "api.upload_folder(folder_path=model_path,\n",
274 | " repo_id=repo_id,\n",
275 | " repo_type='model',\n",
276 | " multi_commits=True,\n",
277 | " multi_commits_verbose=True)"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "id": "eb3cec4f",
283 | "metadata": {},
284 | "source": [
285 | "Notice how I set the destination `repo_id` within my own account `ucheog`. Eventually I may want to share models I convert within the MLX community space, where others can more readily find it. In such a case I'd set `repo_id` to something like `mlx-community/h2o-danube2-1.8b-chat`. Since this is my first go-round, however, I'd rather start under my own auspices. To be frank, models available on `mlx-community` are a bit of a wild west grab-bag. This is the yin and yang of open source, of course, and we each navigate the bazaar in our own way.\n",
286 | "\n",
287 | "The `multi_commits` flags use a pull request & stage the upload piece-meal, which e.g. allows better recovery from interruption.\n",
288 | "\n",
289 | "\n",
290 | "\n",
291 | "# Wrap up\n",
292 | "\n",
293 | "You've had a quick overview on how to convert Hugging Face weights to MLX format, and how to share such converted models with the public. As it happens, [another MLX community member converted and shared h2o-danube2-1.8b-chat](https://huggingface.co/mlx-community/h2o-danube2-1.8b-chat-4bit) a few days after I posted my own version, and you should probably use that one, if you're looking to use the model seriously. Nevertheless, there are innumerable models out there, a very small proportion of which has been converted for MLX, so it's very useful to learn how to do so for yourself.\n",
294 | "\n",
295 | "# Additional resources\n",
296 | "\n",
297 | "* [Hugging Face hub docs on uploading models](https://huggingface.co/docs/hub/en/models-uploading)\n",
298 | "* [Hugging Face/Transformers docs on sharing models](https://huggingface.co/docs/transformers/model_sharing) - more relevant to notebook & in-Python use\n",
299 | "* Chat templating is a very fiddly topic, but [this Hugging Face post](https://huggingface.co/blog/chat-templates) is a useful intro. They do push Jinja2 hard, and there's nothng wrong with Jinja2, but as with any tool I'd say use it if it's the right one, and not out of reflex."
300 | ]
301 | }
302 | ],
303 | "metadata": {
304 | "jupytext": {
305 | "cell_metadata_filter": "-all",
306 | "main_language": "sh",
307 | "notebook_metadata_filter": "-all"
308 | },
309 | "language_info": {
310 | "name": "python"
311 | }
312 | },
313 | "nbformat": 4,
314 | "nbformat_minor": 5
315 | }
316 |
--------------------------------------------------------------------------------
/2024/conversion-etc.md:
--------------------------------------------------------------------------------
1 | 
3 |
4 | # Converting models from Hugging Face to MLX format, and sharing
5 |
6 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
7 |
8 | Since my [first article on dipping my toes into MLX](https://github.com/uogbuji/mlx-notes/blob/main/2024/MLX-day-one.md) I've had several attention swaps, but a trip to Cleveland for prime eclipse viewing in totality gave me a chance to get back to the framework. Of course the MLX team and community keep marching on, and there have been several exciting releases, and performance boosts, since my last look-in.
9 |
10 | At the same time, coincidentally, a new small model was released which I wanted to try out. [H2O-Danube2-1.8b, and in particular the chat version](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat) debuts at #2 in the "~1.5B parameter" category on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which is promising. It was then only available in Hugging Face weight format, so I needed to convert it for MLX use.
11 |
12 | I'll use this new model to explore how easy it is to convert Hugging Face weights to MLX format, and to share the results, if one chooses to do so.
13 |
14 |  [source](https://science.nasa.gov/eclipses/future-eclipses/eclipse-2024/what-to-expect/)
15 |
16 | ## Preparation and conversion
17 |
18 | First of all I upgraded the MLX versions in the virtual environment I was using
19 |
20 | ```sh
21 | pip install -U mlx mlx-lm
22 | ```
23 |
24 | In the unlikely event that you don't end up with the latest versions of the packages after this, you might want to add the `--force-reinstall` flag.
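
That would look something like this (the same upgrade command, just forcing a clean reinstall):

```sh
pip install -U --force-reinstall mlx mlx-lm
```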
25 |
26 | ~~I created a directory to hold the converted model~~ _New versions of MLX make this step obsolete_
27 |
28 |
34 |
35 | I used the command line for the actual conversion
36 |
37 | ```sh
38 | python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat -q
39 | ```
40 |
41 | This took around ten and a half minutes of wall clock time—downloading and converting. The `-q` option quantizes the weights while converting them. The default quantization is to 4 bits (from the standard Hugging Face weight format of 16-bit floating point), but you can choose a different number of bits per weight, and other quantization parameters, with other command line options.
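
For instance, to quantize to 8 bits with a smaller group size, the converter takes `--q-bits` and `--q-group-size` options (flag names as of the mlx-lm version I used; run `python -m mlx_lm.convert --help` to see the full, current set):

```sh
python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat \
  --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat-8bit \
  -q --q-bits 8 --q-group-size 32
```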
42 |
43 | It's good to be aware of the model type and architecture you're dealing with, which doesn't change when you convert weights to MLX. Eyeballing the [h2o-danube2-1.8b-chat config.json](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/blob/main/config.json), I found the following useful bits:
44 |
45 | ```json
46 | "architectures": [
47 | "MistralForCausalLM"
48 | ],
49 | …
50 | "model_type": "mistral",
51 | ```
52 |
53 | Luckily Mistral-style models are well supported, thanks to their popularity.
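
If you'd rather check these details programmatically than eyeball the repository page, `huggingface_hub` (which the packages above already pull in) can fetch just the config file. A quick sketch:

```py
import json
from huggingface_hub import hf_hub_download

# Downloads (or reuses from cache) only config.json, not the full weights
cfg_path = hf_hub_download('h2oai/h2o-danube2-1.8b-chat', 'config.json')
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg['model_type'], cfg['architectures'])
```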
54 |
55 | # Loading and using the converted model from Python
56 |
57 | I loaded the model from local directory
58 |
59 | ```py
60 | from mlx_lm import load, generate
61 | from pathlib import Path
62 |
63 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
64 | model, tokenizer = load(model_path)
65 | ```
66 |
67 | This led to a warning
68 |
69 | > You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
70 |
71 | Some digging into the Hugging Face transformers library, from whence it originates, yielded no easy answers as to how seriously to take this warning.
72 |
73 | I first ran the model using a similar pattern to the example in the last article, but it tripped up on the chat format.
74 |
75 | ```py
76 | messages = [
77 | {'role': 'system', 'content': 'You are a friendly and informative chatbot'},
78 | {'role': 'user', 'content': 'There\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]
79 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
80 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
81 | ```
82 |
83 | I got `TemplateError: System role not supported`. Not all chat models are trained/fine-tuned with the system role. If you try to set system messages but the Hugging Face tokenizer doesn't recognize system role support, it sends a strong signal through this exception. Much better than silently confusing the model. There is no universal workaround for this—it all comes down to details of how the model was trained. I didn't do a lot of investigation of the H2O Danube 2 chat template. Instead I just basically slammed the system prompt into the user role.
84 |
85 | ```py
86 | from mlx_lm import load, generate
87 | from pathlib import Path
88 |
89 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
90 | model, tokenizer = load(model_path) # Issues a slow tokenizer warning
91 |
92 | SYSTEM_ROLE = 'user'
93 | messages = [
94 | {'role': SYSTEM_ROLE, 'content': 'You are a friendly and informative chatbot'},
95 | {'role': 'user', 'content': 'There\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]
96 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
97 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
98 | ```
99 |
100 | The generation felt faster than last article's run on OpenHermes/Mistral7b, and indeed the reported numbers were impressive—running on a 2021 Apple M1 Max MacBook Pro (64GB RAM):
101 |
102 | ```
103 | Prompt: 84.037 tokens-per-sec
104 | Generation: 104.326 tokens-per-sec
105 | ```
106 |
107 | Back of the envelope says that's 3-4X faster prompt processing and 2-3X faster generation. Some of that is the fact that H2O-Danube2 is around 4X smaller, but some of it is down to improvements in the MLX code.
108 |
109 | 
110 |
111 | # Uploading converted models to Hugging Face
112 |
113 | Unlike the example above, you'll often find that models have been converted to MLX weights for you already. This is of course the beauty of an open-source community. If you do convert a model yourself, you can be part of the sharing spree.
114 |
115 | ## Preparing for upload
116 |
117 | You'll need an account on Hugging Face, then an access token with write permissions. Copy one from [your tokens settings](https://huggingface.co/settings/tokens) into your clipboard (and password manager). Install the Hugging Face tools
118 |
119 | ```sh
120 | pip install -U huggingface_hub
121 | ```
122 |
123 | Run `huggingface-cli login` and paste the token you copied earlier. You're now ready for the upload. It will push everything in your local folder with the converted weights, so you should probably check that it's ready for the public. At a minimum add a `README.md` (more on this below) and look over the `config.json`, making sure there is at least a `"model_type"` key. In this case, it's unchanged from the original: `"model_type": "mistral"`. Browsing other recent model repositories in the [`mlx-community`](https://huggingface.co/mlx-community) space is a good way to get a sense of what your upload should contain.
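
On the command line that's just the following (the `whoami` check is optional, but a handy way to confirm the token took):

```sh
huggingface-cli login   # paste the write-enabled token at the prompt
huggingface-cli whoami  # should print your Hugging Face username
```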
124 |
125 | ### README.md
126 |
127 | You'll want to have a README.md file, from which the Hugging Face model card and some metadata are extracted. I started with the metadata from the original model. It's MDX format, which is markdown with metadata headers and optional inline instructions. The [original model's metadata headers are as follows](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/raw/main/README.md):
128 |
129 | ```markdown
130 | ---
131 | language:
132 | - en
133 | library_name: transformers
134 | license: apache-2.0
135 | tags:
136 | - gpt
137 | - llm
138 | - large language model
139 | - h2o-llmstudio
140 | thumbnail: >-
141 | https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
142 | pipeline_tag: text-generation
143 | ---
144 | ```
145 |
146 | I added some descriptive information about the model and how to use it in MLX. This information becomes the Hugging Face model card for the upload.
147 |
148 | ## Upload
149 |
150 | [Hugging Face repositories are basically git and git-LFS](https://huggingface.co/docs/huggingface_hub/guides/upload), so you have many ways of interacting with them.
151 |
152 | **UPDATE August 2024:** You can use the same `mlx_lm.convert` module with the `--upload-repo` option to upload.
153 |
154 | ```sh
155 | python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat -q --upload-repo ucheog/h2o-danube2-1.8b-chat-MLX-4bit
156 | ```
157 |
158 | Alternatively, use the `huggingface_hub` library directly.
159 |
160 | ```py
161 | from huggingface_hub import HfApi, create_repo
162 | from pathlib import Path
163 |
164 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
165 |
166 | repo_id = create_repo('h2o-danube2-1.8b-chat-MLX-4bit').repo_id
167 | api = HfApi()
168 | api.upload_folder(folder_path=model_path,
169 | repo_id=repo_id,
170 | repo_type='model',
171 | multi_commits=True,
172 | multi_commits_verbose=True)
173 | ```
174 |
175 | Notice how I set the destination `repo_id` within my own account `ucheog`. Eventually I may want to share models I convert within the MLX community space, where others can more readily find it. In such a case I'd set `repo_id` to something like `mlx-community/h2o-danube2-1.8b-chat`. Since this is my first go-round, however, I'd rather start under my own auspices. To be frank, models available on `mlx-community` are a bit of a wild west grab-bag. This is the yin and yang of open source, of course, and we each navigate the bazaar in our own way.
176 |
177 | The `multi_commits` flags use a pull request & stage the upload piecemeal, which, for example, allows better recovery from interruption.
178 |
179 | 
180 |
181 | # Wrap up
182 |
183 | You've had a quick overview on how to convert Hugging Face weights to MLX format, and how to share such converted models with the public. As it happens, [another MLX community member converted and shared h2o-danube2-1.8b-chat](https://huggingface.co/mlx-community/h2o-danube2-1.8b-chat-4bit) a few days after I posted my own version, and you should probably use that one, if you're looking to use the model seriously. Nevertheless, there are innumerable models out there, a very small proportion of which has been converted for MLX, so it's very useful to learn how to do so for yourself.
184 |
185 | # Additional resources
186 |
187 | * [Hugging Face hub docs on uploading models](https://huggingface.co/docs/hub/en/models-uploading)
188 | * [Hugging Face/Transformers docs on sharing models](https://huggingface.co/docs/transformers/model_sharing) - more relevant to notebook & in-Python use
189 | * Chat templating is a very fiddly topic, but [this Hugging Face post](https://huggingface.co/blog/chat-templates) is a useful intro. They do push Jinja2 hard, and there's nothing wrong with Jinja2, but as with any tool I'd say use it if it's the right one, and not out of reflex.
190 |
--------------------------------------------------------------------------------
/2024/rag-basics1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "268c1bac",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "\n",
10 | "# Retrieval augmentation with MLX: A bag full of RAG, part 1\n",
11 | "\n",
12 | "_Author: [Uche Ogbuji](https://ucheog.carrd.co/)_\n",
13 | "\n",
14 | "After the intial fun with LLMs, asking the capital of random countries and getting them to tell lousy jokes or write wack poems, the next area I've found people trying to figure out is how to \"chat their documents\". In other words, can you give an LLM access to some documents, databases, web pages, etc. and get them to use that context for more specialized discussion and applications? This is more formally called Retrieval Augmented Generation (RAG).\n",
15 | "\n",
16 | "As usual I shan't spend too much time explaining fundamental AI principles in these articles which are focused on the MLX framework. For a very high level view of RAG, see [this synopsis from the Prompt Engineering Guide](https://www.promptingguide.ai/techniques/rag), or better yet, see [this full article from the same source](https://www.promptingguide.ai/research/rag). The latter is a long read, but really important if you're trying to advance from baby steps to almost any sophisticated use of LLMs. In any case, you'll need to understand at least the basics of RAG to get the most of this article.\n",
17 | "\n",
18 | "### RAG application workflow (Source: [Gao et al.](https://arxiv.org/abs/2312.10997))\n",
19 | "\n",
20 | "\n",
21 | "\n",
22 | "In this article I'll show through code examples how you can start to build RAG apps to work with LLM generation on MLX. It's a big subtopic, even just to get through the basics, so I'll break it into two parts, the first of which focuses on the retrieval portion.\n",
23 | "\n",
24 | "# Trying out a vector DBMS\n",
25 | "\n",
26 | "So far in these articles the main benefit of MLX has been GenAI accelerated on Apple Silicon's Metal architecture. That's all about the \"G\" in RAG. It would be great to have the \"R\" part also taking some advantage of Metal, but that proves a bit tougher than I'd expected. Many of the best-known vector DBs (faiss, qdrant, etc.) use various techniques to accelerate embedding and perhaps lookup via GPU, but they focus on Nvidia (CUDA) and in some cases AMD (ROCm), with nothing for Metal.\n",
27 | "\n",
28 | "Following a hint from [Prince Canuma](https://huggingface.co/prince-canuma) I found [vlite—\"a simple and blazing fast vector database\"](https://github.com/sdan/vlite) whose docs include the promising snippet:\n",
29 | "\n",
30 | "> - `device` (optional): The device to use for embedding ('cpu', 'mps', or 'cuda'). Default is 'cpu'. 'mps' uses PyTorch's Metal Performance Shaders on M1 macs, 'cuda' uses a NVIDIA GPU for embedding generation.\n",
31 | "\n",
32 | "_Spoiler: I ended up abandoning vlite for reasons I'll cover, so feel free to not botgher trying any of the code examples until you get to the section \"Using the Qdrant vector DBMS via OgbujiPT\"._\n",
33 | "\n",
34 | "Installing vlite"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "id": "bda1ca9d",
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "%pip install -Ur https://raw.githubusercontent.com/sdan/vlite/master/requirements.txt\n",
45 | "%pip install \"vlite[ocr]\""
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "id": "ac185cf6",
51 | "metadata": {},
52 | "source": [
53 | "With the `[ocr]` modifier the vlite package is installed as well as helper packages for pulling text from PDF documents in the database. In my case I ran into pip errors such as the following.\n",
54 | "\n",
55 | "\n",
56 | "\n",
57 | "These sorts of conflicting dependencies are a common annoyance in the AI space, especially with widely-used and fast-evolving packages such as transformers, tokenizers, pytorch, pydantics and such.\n",
58 | "\n",
59 | "If you don't need all the added PDF tools you can install just vlite by taking out the `[ocr]` modifier from the `pip` command.\n",
60 | "\n",
61 | "## Ever more resources\n",
62 | "\n",
63 | "We need content to add to the database. I've made it easy by providing the markdown of articles in thei MLX notes series (including this article). You can [download them from Github](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/files), the whole directory, or just the contained files, and put them in a location you can refer to in the code later.\n",
64 | "\n",
65 | "Build your vlite vector database from those files by running the code in listing 1, which I've also [provided as a download from Github](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings). Make sure you first give it a look and perhaps update the `CONTENT_FOLDER` and `COLLECTION_FPATH` values.\n",
66 | "\n",
67 | "### Listing 1 (vlite_build_db.py): Building a vlite vector database from markdown files on disk\n",
68 | "\n",
69 | "_Note: [You can find all code listings on GitHub.](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings)_"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "id": "4e270da1",
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "# vlite_build_db.py\n",
80 | "import os\n",
81 | "from pathlib import Path\n",
82 | "\n",
83 | "from vlite import VLite\n",
84 | "from vlite.utils import process_txt\n",
85 | "\n",
86 | "# Needed to silence a Hugging Face tokenizers library warning\n",
87 | "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
88 | "\n",
89 | "TEXT_SUFFIXES = ['.md', '.txt']\n",
90 | "# \n",
91 | "CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')\n",
92 | "# Path to a \"CTX\" file which is basically the vector DB in binary form\n",
93 | "COLLECTION_FPATH = Path('/tmp/ragbasics')\n",
94 | "\n",
95 | "def setup_db(files, collection):\n",
96 | " # Create database\n",
97 | " # If you don't specify a \"collection\" (basically a filename), vlite will create\n",
98 | " # a file, using the current timestamp. under \"contexts\" in the current dir\n",
99 | " # device='mps' uses Apple Metal acceleration for the embedding,\n",
100 | " # which is typically the most expensive stage\n",
101 | " vdb = VLite(collection=collection, device='mps')\n",
102 | "\n",
103 | " for fname in files.iterdir():\n",
104 | " if fname.suffix in TEXT_SUFFIXES:\n",
105 | " print('Processing:', fname)\n",
106 | " vdb.add(process_txt(fname))\n",
107 | " else:\n",
108 | " print('Skipping:', fname)\n",
109 | " return vdb\n",
110 | "\n",
111 | "vdb = setup_db(CONTENT_FOLDER, COLLECTION_FPATH)\n",
112 | "vdb.save() # Make sure the DB is up to date"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "id": "d378b330",
118 | "metadata": {},
119 | "source": [
120 | "\n",
121 | "\n",
122 | "First time you run this, you might get a long delay while it downloads and caches models needed for embeddings and the like (especially `onnx/model.onnx`). Again this is a common rigamarole with ML projects.\n",
123 | "\n",
124 | "# Retrieving from the vector DB\n",
125 | "\n",
126 | "Vlite takes a very raw and lightweight approach to vector database management. You can save embedded and otherwise indexed data to what it calls context files, which can later on be loaded for update or querying.\n",
127 | "\n",
128 | "Listing 2 loads the context file saved in the previous section (`/tmp/ragbasics.ctx`) and then tries to retrieve a snippet of text from one of these MLX articles.\n",
129 | "\n",
130 | "### Listing 2 (vlite_retrieve.py): Retrieving content from a vlite vector database on disk"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "id": "f18be550",
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "# vlite_retrieve.py\n",
141 | "from pathlib import Path\n",
142 | "\n",
143 | "from vlite import VLite\n",
144 | "\n",
145 | "COLLECTION_FPATH = Path('/tmp/ragbasics')\n",
146 | "\n",
147 | "vdb = VLite(collection=COLLECTION_FPATH, device='mps')\n",
148 | "\n",
149 | "# top_k=N means take the N closest matches\n",
150 | "# return_scores=True adds the closeness scores to the return\n",
151 | "results = vdb.retrieve('ChatML format has been converted using special, low-level LLM tokens', top_k=1, return_scores=True)\n",
152 | "print(results[0])"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "id": "7590eb3a",
158 | "metadata": {},
159 | "source": [
160 | "When I first tried this, the results were terrible. RAG is always trickier than you may think, and there are many considerations to designing an effective RAG pipeline. One key one is how the content is partitioned within the vector DB, because each query will try to match and respond with particular chunks of text based on the query. By default vlite takes a naive approach of creating sequential chunks with up to 512 token length.\n",
161 | "\n",
162 | "## Wait, what are tokens again?\n",
163 | "\n",
164 | "Tokens have come up before in this series, and you might be wondering. \"What are those, exactly?\" Tokens are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so character by character, but it breaks down a given language into statistically useful groupings of characters, which are then identified with integer numbers. For example the characters \"ing\" occur pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's sensitive to the surrounding character sequence, though, so the word \"sing\" might well be encoded as a single token of its own, regardless of containing \"ing\".\n",
165 | "\n",
166 | "The best way to get a feel of LLM tokenization is to play around with sample text and see how it gets converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app which allows you to enter text and see how the popular Llama LLMs would tokenize them.\n",
167 | "\n",
168 | "\n",
169 | "\n",
170 | "The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. Notice how start of text is a special token ``. You might remember we also encountered some other special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training and fine-tuning changes the way things are tokenized, as part of setting the entire model of language. Llama won't tokenize exactly as, say ChatGPT does, but the basic concepts stay the same.\n",
171 | "\n",
172 | "The picture shows an example of how markup such as HTML can affect tokenization. There are models such as the commercial [Docugami](https://www.docugami.com/) which are trained towards efficient tokenization of markup. Code-specialized LLMs such as those used in programmer copilot tools would have efficient tokenizations of the sorts of constructs which are more common in programming code than in natural language.\n",
173 | "\n",
174 | "## Creating more sensible chunks\n",
175 | "\n",
176 | "In effect, the tokenization establishes the shape of language in a model. As such, it makes some sense, if you absolutely know no better, to at least use token boundaries in chunking text for vector lookups. We can do even better, though. Just as a basic approach it would be better to chunk each paragraph in these articles separately. That way you have a coherent thread of meaning in each chunk which is more likely to align, say with input from the user.\n",
177 | "\n",
178 | "In the following code I take over the chunking from vlite, using the `text_split` function available in [my company's open source OgbujiPT package](https://github.com/OoriData/OgbujiPT). Instead of fixed chunk sizes, I split by Markdown paragraphs (`\\n\\n`), with a guideline that chunks should be kept under 100 characters where possible.\n",
179 | "\n",
180 | "### Listing 3 (vlite_custom_split_build_db.py): Improved text splitting while building a vlite vector database from markdown files on disk"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "id": "9f0b546f",
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "import os\n",
191 | "from pathlib import Path\n",
192 | "\n",
193 | "from vlite import VLite\n",
194 | "\n",
195 | "from ogbujipt.text_helper import text_split\n",
196 | "\n",
197 | "# Needed to silence a Hugging Face tokenizers library warning\n",
198 | "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
199 | "\n",
200 | "TEXT_SUFFIXES = ['.md', '.txt']\n",
201 | "# \n",
202 | "CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')\n",
203 | "# Path to a \"CTX\" file which is basically the vector DB in binary form\n",
204 | "COLLECTION_FPATH = Path('/tmp/ragbasics')\n",
205 | "\n",
206 | "def setup_db(files, collection):\n",
207 | " # Create database\n",
208 | " # If you don't specify a \"collection\" (basically a filename), vlite will create\n",
209 | " # a file, using the current timestamp. under \"contexts\" in the current dir\n",
210 | " # device='mps' uses Apple Metal acceleration for the embedding,\n",
211 | " # which is typically the most expensive stage\n",
212 | " vdb = VLite(collection=collection, device='mps')\n",
213 | "\n",
214 | " for fname in files.iterdir():\n",
215 | " if fname.suffix in TEXT_SUFFIXES:\n",
216 | " fname = str(fname)\n",
217 | " print('Processing:', fname)\n",
218 | " with open(fname) as fp:\n",
219 | " # Governed by paragraph boundaries (\\n\\n), with a target chunk size of 100\n",
220 | " for chunk in text_split(fp.read(), chunk_size=100, separator='\\n\\n'):\n",
221 | " print(chunk, '\\n¶')\n",
222 | " vdb.add(chunk, metadata={'src-file': fname})\n",
223 | " else:\n",
224 | " print('Skipping:', fname)\n",
225 | " return vdb\n",
226 | "\n",
227 | "vdb = setup_db(CONTENT_FOLDER, COLLECTION_FPATH)\n",
228 | "vdb.save() # Make sure the DB is up to date"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "id": "fda8895a",
234 | "metadata": {},
235 | "source": [
236 | "\n",
237 | "Unfortunately this change didn't seem to improve vlite's ability to retrieve more relevant chunks. There might be something else going on in how I'm using it, and I'll certainly revisit vlite, but my next step was to give up on a Metal-accelerated vector database and just use a package I'm more familiar. PGVector is my usual go-to, but it adds a few dependencies I wanted to avoid for this write-up. We'll just use [Qdrant](https://qdrant.tech/).\n",
238 | "\n",
239 | "# Using the Qdrant vector DBMS via OgbujiPT\n",
240 | "\n",
241 | "My OgbujiPT library includes tools to make it easy to use Qdrant or PostgreSQL/PGVector for vector database applicaitons such as RAG. Install the needed prerequisites."
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "id": "efd77c54",
248 | "metadata": {},
249 | "outputs": [],
250 | "source": [
251 | "%pip install ogbujipt qdrant_client sentence_transformers"
252 | ]
253 | },
254 | {
255 | "cell_type": "markdown",
256 | "id": "22faca58",
257 | "metadata": {},
258 | "source": [
259 | "Listing 4 vectorizes the same markdown documents as before, and then does a sample retrieval, using Qdrant.\n",
260 | "\n",
261 | "### Listing 4 (qdrant_build_db.py): Switch to Qdrant for content database from markdown files on disk"
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": null,
267 | "id": "184ebe51",
268 | "metadata": {},
269 | "outputs": [],
270 | "source": [
271 | "# qdrant_build_db.py\n",
272 | "import os\n",
273 | "from pathlib import Path\n",
274 | "\n",
275 | "from sentence_transformers import SentenceTransformer\n",
276 | "from qdrant_client import QdrantClient\n",
277 | "\n",
278 | "from ogbujipt.text_helper import text_split\n",
279 | "from ogbujipt.embedding.qdrant import collection\n",
280 | "\n",
281 | "embedding_model = SentenceTransformer('all-MiniLM-L6-v2')\n",
282 | "# Needed to silence a Hugging Face tokenizers library warning\n",
283 | "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
284 | "\n",
285 | "TEXT_SUFFIXES = ['.md', '.txt']\n",
286 | "CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')\n",
287 | "DBPATH = '/tmp/qdrant_test' # Set up disk storage location\n",
288 | "QCLIENT = QdrantClient(path=DBPATH)\n",
289 | "\n",
290 | "\n",
291 | "def setup_db(files):\n",
292 | " # Create content database named \"ragbasics\", using the disk storage location set up above\n",
293 | " qcoll = collection('ragbasics', embedding_model, db=QCLIENT)\n",
294 | "\n",
295 | " for fname in files.iterdir():\n",
296 | " if fname.suffix in TEXT_SUFFIXES:\n",
297 | " fname = str(fname)\n",
298 | " print('Processing:', fname)\n",
299 | " with open(fname) as fp:\n",
300 | " # Governed by paragraph boundaries (\\n\\n), with a target chunk size of 100\n",
301 | " for chunk in text_split(fp.read(), chunk_size=100, separator='\\n\\n'):\n",
302 | " # print(chunk, '\\n¶')\n",
303 | " # Probably more efficient to add in batches of chunks, but not bothering right now\n",
304 | " # Metadata can be useful in many ways, including having the LLM cite sources in its response\n",
305 | " qcoll.update(texts=[chunk], metas=[{'src-file': fname}])\n",
306 | " else:\n",
307 | " print('Skipping:', fname)\n",
308 | " return qcoll\n",
309 | "\n",
310 | "vdb = setup_db(CONTENT_FOLDER)\n",
311 | "results = vdb.search('How ChatML gets converted for use with the LLM', limit=1)\n",
312 | "\n",
313 | "top_match_text = results[0].payload['_text'] # Grabs the actual content\n",
314 | "top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside\n",
315 | "print(f'Matched chunk: {top_match_text}\\n\\nFrom file {top_match_source}')"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "id": "7b16c3d6",
321 | "metadata": {},
322 | "source": [
323 | "\n",
324 | "Output:\n",
325 | "\n",
326 | "```\n",
327 | "Matched chunk: `response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>`. Having the system message in the chat prompting and all that definitely, by my quick impressions, made the interactions far more coherent.\n",
328 | "\n",
329 | "From file assets/resources/2024/ragbasics/files/MLX-day-one.md\n",
330 | "```\n",
331 | "\n",
332 | "It's quite noticeable how much slower the embedding and indexing is with Qdrant, compared to vlite, which underscores why it would be nice to revisit the latter for such uses.\n",
333 | "\n",
334 | "# Generation next to come\n",
335 | "\n",
336 | "Now that we can search a database of content in order to select items which seem relevant to an input, we're ready to turn our attention to the generation component of RAG. Stay tuned for part 2, coming soon.\n",
337 | "\n",
338 | "# Cultural accompaniment\n",
339 | "\n",
340 | "While doing the final edits of this article I was enjoying the amazing groove of the Funmilayo Afrobeat Orquestra, plus Seun Kuti's Egypt 80, a song called Upside Down. I thought to myself (this groove is part of the article whether anyone knows it, so why not share?) Since I'm also a poet, DJ, etc. I think I'll start sharing with these articles some artistic snippet that accompanied me in the process, or maybe something that's been inspiring me lately.\n",
341 | "\n",
342 | "[![Funmilayo Afrobeat Orquestra & Seun Kuti's Egypt 80 - Upside Down [Live Session]](https://img.youtube.com/vi/Gf8G3OhHW8I/0.jpg)](https://www.youtube.com/watch?v=Gf8G3OhHW8I)\n",
343 | "\n",
344 | "\n",
347 | "\n",
348 | "I grew up on Afrobeat (not \"Afrobeats, abeg, oh!\"), back home in Nigeria, and I'm beyond delight to see how this magical, elemental music has found its way around the world and continues to flourish. Lovely to see Seun, my favorite contemporary exponent of his father, Fela's genre, OGs such as Kunle Justice on electric bass (a man who should be much better known!) and of course the dynamic, Brazil-based women of Funmilayo, named after Fela's mother. This one betta now! Make you enjoy!\n",
349 | "\n",
350 | "# Additional resources\n",
351 | "\n",
352 | "* [vlite documentation](https://github.com/sdan/vlite/blob/master/docs.md)"
353 | ]
354 | }
355 | ],
356 | "metadata": {
357 | "jupytext": {
358 | "cell_metadata_filter": "-all",
359 | "main_language": "sh",
360 | "notebook_metadata_filter": "-all"
361 | },
362 | "language_info": {
363 | "name": "python"
364 | }
365 | },
366 | "nbformat": 4,
367 | "nbformat_minor": 5
368 | }
369 |
--------------------------------------------------------------------------------
/2024/rag-basics1.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Retrieval augmentation with MLX: A bag full of RAG, part 1
4 |
5 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
6 |
7 | After the initial fun with LLMs, asking the capital of random countries and getting them to tell lousy jokes or write wack poems, the next area I've found people trying to figure out is how to "chat their documents". In other words, can you give an LLM access to some documents, databases, web pages, etc. and get them to use that context for more specialized discussion and applications? This is more formally called Retrieval Augmented Generation (RAG).
8 |
9 | As usual I shan't spend too much time explaining fundamental AI principles in these articles, which are focused on the MLX framework. For a very high level view of RAG, see [this synopsis from the Prompt Engineering Guide](https://www.promptingguide.ai/techniques/rag), or better yet, see [this full article from the same source](https://www.promptingguide.ai/research/rag). The latter is a long read, but really important if you're trying to advance from baby steps to almost any sophisticated use of LLMs. In any case, you'll need to understand at least the basics of RAG to get the most out of this article.
10 |
11 | ### RAG application workflow (Source: [Gao et al.](https://arxiv.org/abs/2312.10997))
12 |
13 | 
14 |
15 | In this article I'll show through code examples how you can start to build RAG apps to work with LLM generation on MLX. It's a big subtopic, even just to get through the basics, so I'll break it into two parts, the first of which focuses on the retrieval portion.
16 |
17 | # Trying out a vector DBMS
18 |
19 | So far in these articles the main benefit of MLX has been GenAI accelerated on Apple Silicon's Metal architecture. That's all about the "G" in RAG. It would be great to have the "R" part also take some advantage of Metal, but that proves a bit tougher than I'd expected. Many of the best-known vector DBs (faiss, qdrant, etc.) use various techniques to accelerate embedding and perhaps lookup via GPU, but they focus on Nvidia (CUDA) and in some cases AMD (ROCm), with nothing for Metal. Before we get to retrieval, though, we need content to add to the database. I've made it easy by providing the markdown of the articles in this MLX notes series (including this article). You can [download them from GitHub](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/files), either the whole directory or just the contained files, and put them in a location you can refer to in the code later.
20 |
21 | PGVector is my usual go-to vector DBMS, but it adds a few nuances and dependencies I wanted to avoid for this write-up. We'll just use [Qdrant](https://qdrant.tech/).
22 |
23 | # Using the Qdrant vector DBMS via OgbujiPT
24 |
25 | My OgbujiPT library includes tools to make it easy to use Qdrant or PostgreSQL/PGVector for vector database applications such as RAG. Install the needed prerequisites.
26 |
27 | ```sh
28 | pip install ogbujipt qdrant_client sentence_transformers
29 | ```
30 |
31 | Listing 4 vectorizes the markdown documents mentioned above, then does a sample retrieval, using Qdrant. With the `text_split` function, available in [OgbujiPT](https://github.com/OoriData/OgbujiPT), I split by Markdown paragraphs (`\n\n`), with a guideline that chunks should be kept under 100 characters where possible.
32 |
33 | ### Listing 4 (qdrant_build_db.py): Switch to Qdrant for content database from markdown files on disk
34 |
35 | ```py
36 | # qdrant_build_db.py
37 | import os
38 | from pathlib import Path
39 |
40 | from sentence_transformers import SentenceTransformer # ST docs: https://www.sbert.net/docs/
41 | from qdrant_client import QdrantClient
42 |
43 | from ogbujipt.text_helper import text_split
44 | from ogbujipt.embedding.qdrant import collection
45 |
46 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
47 | # Needed to silence a Hugging Face tokenizers library warning
48 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
49 |
50 | TEXT_SUFFIXES = ['.md', '.txt']
51 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
52 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
53 | QCLIENT = QdrantClient(path=DBPATH)
54 |
55 |
56 | def setup_db(files):
57 | # Create content database named "ragbasics", using the disk storage location set up above
58 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
59 |
60 | for fname in files.iterdir():
61 | if fname.suffix in TEXT_SUFFIXES:
62 | fname = str(fname)
63 | print('Processing:', fname)
64 | with open(fname) as fp:
65 | # Governed by paragraph boundaries (\n\n), with a target chunk size of 100
66 | for chunk in text_split(fp.read(), chunk_size=100, separator='\n\n'):
67 | # print(chunk, '\n¶')
68 | # Probably more efficient to add in batches of chunks, but not bothering right now
69 | # Metadata can be useful in many ways, including having the LLM cite sources in its response
70 | qcoll.update(texts=[chunk], metas=[{'src-file': fname}])
71 | else:
72 | print('Skipping:', fname)
73 | return qcoll
74 |
75 | vdb = setup_db(CONTENT_FOLDER)
76 | results = vdb.search('How ChatML gets converted for use with the LLM', limit=1)
77 |
78 | top_match_text = results[0].payload['_text'] # Grabs the actual content
79 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
80 | print(f'Matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
81 | ```
82 |
83 | Output:
84 |
85 | ```
86 | Matched chunk: `response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>`. Having the system message in the chat prompting and all that definitely, by my quick impressions, made the interactions far more coherent.
87 |
88 | From file assets/resources/2024/ragbasics/files/MLX-day-one.md
89 | ```
90 |
91 | You might not be able to tell off the bat, but the embedding and indexing of text in the examples above is much slower than it needs to be. Let's look into an option for speeding it up.
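
If you want to put a number on that claim, one quick approach (an untested sketch; it assumes the `setup_db` and `CONTENT_FOLDER` names from Listing 4 are already in scope) is to wrap the indexing step in a timer and compare runs as you change the embedding setup:

```py
import time

start = time.perf_counter()
vdb = setup_db(CONTENT_FOLDER)  # setup_db and CONTENT_FOLDER as defined in Listing 4
print(f'Indexed content in {time.perf_counter() - start:.1f} seconds')
```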
92 |
93 | # Using MLX-Embeddings for local, accelerated embedding generation
94 |
95 | mlx-embeddings is a package designed to generate text and image embeddings locally on Apple Silicon using the MLX framework. It uses Apple's Metal acceleration, which typically makes embedding computation much faster than running cross-platform libraries such as sentence-transformers on CPU.
96 |
97 | Install the package with:
98 |
99 | ```sh
100 | pip install mlx-embeddings
101 | ```
102 |
103 | Here's how you can use mlx-embeddings to generate embeddings for a list of texts:
104 |
105 | ```py
106 | from mlx_embeddings import load, generate
107 |
108 | # Load a model (e.g., MiniLM in MLX format)
109 | model, processor = load("mlx-community/all-MiniLM-L6-v2-4bit")
110 |
111 | # Generate normalized embeddings for a list of texts
112 | output = generate(model, processor, texts=["I like grapes", "I like fruits"])
113 | embeddings = output.text_embeds # Normalized embeddings
114 |
115 | # Example: Compute similarity matrix using MLX
116 | import mlx.core as mx
117 | similarity_matrix = mx.matmul(embeddings, embeddings.T)
118 | print("Similarity matrix between texts:")
119 | print(similarity_matrix)
120 | ```
121 |
122 | This workflow is similar to sentence-transformers, but all computation runs natively on your Mac, taking full advantage of Apple Silicon hardware acceleration.
123 |
124 | mlx-embeddings supports a growing set of popular models, including BERT and XLM-RoBERTa, with more being added. Vision models are also supported, making it suitable for multimodal RAG applications.
125 |
126 | You can use mlx-embeddings in place of sentence-transformers wherever you need to generate vector representations for indexing or retrieval. The embeddings can be stored in any vector database, including those mentioned in previous sections. For many usage scenarios MLX-based embedding and inference are significantly faster than running PyTorch-based models via CPU, and often faster than using non-native GPU backends on Mac.
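
Here's a minimal sketch of what that pairing might look like, feeding mlx-embeddings output straight into Qdrant via qdrant_client (untested; the collection name `mlx_minilm_demo` and the 384-dimension figure for MiniLM are my own assumptions for illustration, and I convert the MLX arrays to plain Python lists with `tolist()`, which may not be the most efficient route):

```py
# Hypothetical sketch: index a few texts with mlx-embeddings, then query them in Qdrant
from mlx_embeddings import load, generate
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

model, processor = load('mlx-community/all-MiniLM-L6-v2-4bit')
texts = ['I like grapes', 'I like fruits', 'The capital of Nigeria is Abuja']

# output.text_embeds is an MLX array of normalized embeddings, one row per input text
vectors = generate(model, processor, texts=texts).text_embeds.tolist()

client = QdrantClient(path='/tmp/qdrant_mlx_demo')  # on-disk storage, as in Listing 4
client.create_collection(
    collection_name='mlx_minilm_demo',  # assumed name, purely for illustration
    vectors_config=VectorParams(size=384, distance=Distance.COSINE))  # MiniLM-L6-v2 embeddings are 384-dim
client.upsert(
    collection_name='mlx_minilm_demo',
    points=[PointStruct(id=i, vector=vec, payload={'text': text})
            for i, (vec, text) in enumerate(zip(vectors, texts))])

# Embed the query with the same model, then search by cosine similarity
query_vec = generate(model, processor, texts=['Which fruit do I like?']).text_embeds.tolist()[0]
hits = client.search(collection_name='mlx_minilm_demo', query_vector=query_vec, limit=1)
print(hits[0].payload['text'], hits[0].score)
```

Whichever route you take, the store and the query must use the same embedding model; mixing embedding models between indexing and retrieval is a classic RAG bug.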
127 |
128 | # A word on tokens
129 |
130 | Tokens have come up before in this series, and you might be wondering, "What are those, exactly?" Tokens are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so character by character, but it breaks down a given language into statistically useful groupings of characters, which are then identified with integer numbers. For example, the characters "ing" occur pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's sensitive to the surrounding character sequence, though, so the word "sing" might well be encoded as a single token of its own, regardless of containing "ing".
131 |
132 | The best way to get a feel for LLM tokenization is to play around with sample text and see how it gets converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app, which allows you to enter text and see how the popular Llama LLMs would tokenize it.
133 |
134 | 
135 |
136 | The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. Notice how the start of text is a special token ``. You might remember we also encountered some other special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training and fine-tuning change the way things are tokenized, as part of setting the entire model of language. Llama won't tokenize exactly as, say, ChatGPT does, but the basic concepts stay the same.
137 |
138 | The picture shows an example of how markup such as HTML can affect tokenization. There are models such as the commercial [Docugami](https://www.docugami.com/) which are trained towards efficient tokenization of markup. Code-specialized LLMs such as those used in programmer copilot tools would have efficient tokenizations of the sorts of constructs which are more common in programming code than in natural language.
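
If you'd rather poke at tokenization from code than from a web app, any Hugging Face tokenizer will do for a quick look. Here's a small sketch of my own (using the MiniLM tokenizer simply because that model already appears in this series; a Llama tokenizer would split things differently in detail, but the idea is the same):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

for text in ['sing', 'singing', 'Tokenization of <b>markup</b> is its own adventure']:
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f'{text!r} -> {tokenizer.convert_ids_to_tokens(ids)} {ids}')
```

Each line of output shows the token strings alongside their integer IDs, which is exactly the mapping the playground visualizes with colored tiles.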
139 |
140 | ## Creating more sensible chunks
141 |
142 | As I mentioned, I used a simple text splitter from OgbujiPT above, but lately I've taken to using [Chonkie](https://github.com/chonkie-inc/chonkie), a library that offers a wide variety of flexible chunking options, including chunking by tokens and by LLM-guided heuristics.
143 |
144 | In effect, tokenization establishes the shape of language in a model, which is why using token boundaries when chunking text can help avoid weird boundary issues in vector lookups. There are many other chunking techniques you can try as well. Just to cite one example, you can chunk each paragraph separately, say in a collection of articles. That way each chunk carries a coherent thread of meaning, which is more likely to align with expected search patterns.
145 |
146 | # Generation next to come
147 |
148 | Now that we can search a database of content in order to select items which seem relevant to an input, we're ready to turn our attention to the generation component of RAG, in part 2. Keep in mind always that RAG is trickier than one may think, and there are many considerations to designing an effective RAG pipeline.
149 |
150 | # Recent Developments (mid 2024 - mid 2025)
151 |
152 | As always, the MLX ecosystem is evolving rapidly, so readers should check for new models and tools regularly. Here are some interesting tidbits dating from after I first wrote this article.
153 |
154 | * **Model Format Conversions**: The MLX ecosystem now includes many converted models (e.g., MiniLM, BGE) in MLX format, available on Hugging Face, which can be loaded directly using mlx-embeddings
155 | * **Alternative Embedding Packages**: Other projects like mlx_embedding_models and swift-embeddings enable running BERT- or RoBERTa-based embeddings natively on Mac, broadening the choices for local RAG workflows
156 | * **Multimodal RAG**: With MLX now supporting vision models in addition to language, it is possible to build multimodal RAG systems (text + image retrieval) entirely on-device
157 | * **Community Tools**: There is a growing ecosystem of RAG implementations optimized for MLX and Apple Silicon, including command-line tools and open-source projects for vector database integration
158 |
159 | # Cultural accompaniment
160 |
161 | While doing the final edits of this article I was enjoying the amazing groove of the Funmilayo Afrobeat Orquestra, plus Seun Kuti's Egypt 80, on a song called Upside Down. I thought to myself: this groove is part of the article whether anyone knows it or not, so why not share? Since I'm also a poet, DJ, etc., I think I'll start sharing with these articles some artistic snippet that accompanied me in the process, or maybe something that's been inspiring me lately.
162 |
163 | [![Funmilayo Afrobeat Orquestra & Seun Kuti's Egypt 80 - Upside Down [Live Session]](https://img.youtube.com/vi/Gf8G3OhHW8I/0.jpg)](https://www.youtube.com/watch?v=Gf8G3OhHW8I)
164 |
165 |
168 |
169 | I grew up on Afrobeat (not "Afrobeats, abeg, oh!"), back home in Nigeria, and I'm beyond delighted to see how this magical, elemental music has found its way around the world and continues to flourish. Lovely to see Seun, my favorite contemporary exponent of his father Fela's genre, OGs such as Kunle Justice on electric bass (a man who should be much better known!) and of course the dynamic, Brazil-based women of Funmilayo, named after Fela's mother. This one betta now! Make you enjoy!
170 |
--------------------------------------------------------------------------------
/2024/rag-basics2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "9ba08e0a",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "\n",
10 | "# Retrieval augmentation with MLX: A bag full of RAG, part 2\n",
11 | "14 June 2024. Versions: mlx: 0.15.0 | mlx-lm: 0.14.3\n",
12 | "_Author: [Uche Ogbuji](https://ucheog.carrd.co/)_\n",
13 | "\n",
14 | "[In the first part of this article](https://github.com/uogbuji/mlx-notes/blob/main/2024/rag-basics1.md) I made a basic introduction to Retrieval Augmented Generation (RAG), a technique for integrating content retrieved from databases or other sources into prompts for LLM. In the first part I showed how you might construct such a context database (retrieval), and in this part we'll see how the content can be stuffed into the prompt for the LLM in the generation phase. You'll want to read part 1 before proceeding.\n",
15 | "\n",
16 | "# Back to the land of LLMs\n",
17 | "\n",
18 | "While fiddling with the vector database we haven't got around yet to using the G (Generation) part of RAG. The results from vector DB lookup are exact raw chunks of content. What you usually want in such scenarios, is for the LLM to take this raw content and work it into a coherent response to the user. A next step is to stuff the retrieved text into the prompt, as context, along with some instructions (generally placed in a system prompt). If all goes well, the LLM's response proves useful, and is anchored by the facts retrieved from the vector DB, lowering the LLM's tendency to hallucinate.\n",
19 | "\n",
20 | "_Aside: Hallucination is one of the most misunderstood topics in GenAI. It's always important to remember what LLMs are trained to do: they are trained to complete the text provided in the prompt. They are just predicting tokens and generating language. This means that they will sometimes generate language whose meaning is confusing, false or misleading, which we call hallucinations, but in doing so, they are merely following their training._\n",
21 | "\n",
22 | "_A part of the solution is to include in the prompt facts and instructions which are carefully constructed (i.e. prompt engineered) according to an understanding of the LLM's statistical tendencies. This reduces the likelihood of hallucinations, but it may not be possible to completely eliminate that tendency. Some LLMs are trained or fine-tuned to be especially \"obedient\" to the context, and these are good choices for RAG. Picking the right LLM is another part of the solution; using multi-stage pipelines with verification by other LLMs or even people (perhaps from a random or heuristically selected sample of transcripts) is another part of the solution. RAG is a simple concept, but getting consistently great results with it involves complex considerations_\n",
23 | "\n",
24 | "## Prompt stuffing 101\n",
25 | "\n",
26 | "In the previous article, [Listing 4 (qdrant_build_db.py)](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings) created a Qdrant vector database from the markdown of articles in this series. We can now use that database to retrieve likely chunks of content and stuff these in the prompt for the generation phase of RAG. Listing 1, below, is a simple example of this process, using the MLX generation interface explored in previous articles.\n",
27 | "\n",
28 | "The code first queries the vector database for chunks of content semantically similar to the user question or prompt, which is hard-coded for simplicity. It then pulls the chunks into a template to construct an overall prompt, which is sent to the LLM for completion.\n",
29 | "\n",
30 | "### Listing 1 (qdrant_rag_101.py)\n",
31 | "\n",
32 | "_Note: [You can find all code listings on GitHub.](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings)_"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": null,
38 | "id": "b6f1d5a8",
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "# qdrant_rag_101.py\n",
43 | "import os\n",
44 | "from pathlib import Path\n",
45 | "import pprint\n",
46 | "\n",
47 | "from sentence_transformers import SentenceTransformer\n",
48 | "from qdrant_client import QdrantClient\n",
49 | "\n",
50 | "from ogbujipt.embedding.qdrant import collection\n",
51 | "\n",
52 | "from mlx_lm import load, generate\n",
53 | "\n",
54 | "chat_model, tokenizer = load('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')\n",
55 | "\n",
56 | "embedding_model = SentenceTransformer('all-MiniLM-L6-v2')\n",
57 | "# Needed to silence a Hugging Face tokenizers library warning\n",
58 | "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
59 | "\n",
60 | "TEXT_SUFFIXES = ['.md', '.txt']\n",
61 | "CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')\n",
62 | "DBPATH = '/tmp/qdrant_test' # Set up disk storage location\n",
63 | "\n",
64 | "assert Path(DBPATH).exists(), 'DB not found. You may need to run qdrant_build_db.py again'\n",
65 | "\n",
66 | "QCLIENT = QdrantClient(path=DBPATH)\n",
67 | "\n",
68 | "USER_PROMPT = 'How can I get a better understand what tokens are, and how they work in LLMs?'\n",
69 | "SCORE_THRESHOLD = 0.2\n",
70 | "MAX_CHUNKS = 4\n",
71 | "\n",
72 | "# Set up to retrieve from previosly created content database named \"ragbasics\"\n",
73 | "# Note: Here you have to match the embedding model with the one originally used in storage\n",
74 | "qcoll = collection('ragbasics', embedding_model, db=QCLIENT)\n",
75 | "\n",
76 | "results = qcoll.search(USER_PROMPT, limit=MAX_CHUNKS, score_threshold=SCORE_THRESHOLD)\n",
77 | "\n",
78 | "top_match_text = results[0].payload['_text'] # Grabs the actual content\n",
79 | "top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside\n",
80 | "print(f'Top matched chunk: {top_match_text}\\n\\nFrom file {top_match_source}')\n",
81 | "\n",
82 | "gathered_chunks = '\\n\\n'.join(\n",
83 | " doc.payload['_text'] for doc in results if doc.payload)\n",
84 | "\n",
85 | "sys_prompt = '''\\\n",
86 | "You are a helpful assistant who answers questions directly and as briefly as possible.\n",
87 | "Consider the following context and answer the user\\'s question.\n",
88 | "If you cannot answer with the given context, just say you don't know.\\n\n",
89 | "'''\n",
90 | "\n",
91 | "# Construct the input message struct from the system prompt, the gathered chunks, and the user prompt itself\n",
92 | "messages = [\n",
93 | " {'role': 'system', 'content': sys_prompt},\n",
94 | " {'role': 'user', 'content': f'=== BEGIN CONTEXT\\n\\n{gathered_chunks}\\n\\n=== END CONTEXT'},\n",
95 | " {'role': 'user', 'content': f'Please use the context above to respond to the following:\\n{USER_PROMPT}'}\n",
96 | " ]\n",
97 | "\n",
98 | "pprint.pprint(messages, width=120)\n",
99 | "\n",
100 | "chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)\n",
101 | "response = generate(chat_model, tokenizer, prompt=chat_prompt, verbose=True)\n",
102 | "\n",
103 | "print('RAG-aided LLM response to the user prompt:\\n', response)"
104 | ]
105 | },
106 | {
107 | "cell_type": "markdown",
108 | "id": "651c17a0",
109 | "metadata": {},
110 | "source": [
111 | "The pretty-printed `messages` structure comes out as follows:\n",
112 | "\n",
113 | "```py\n",
114 | "[{'content': 'You are a helpful assistant who answers questions directly and as briefly as possible.\\n'\n",
115 | " \"Consider the following context and answer the user's question.\\n\"\n",
116 | " \"If you cannot answer with the given context, just say you don't know.\\n\"\n",
117 | " '\\n',\n",
118 | " 'role': 'system'},\n",
119 | " {'content': '=== BEGIN CONTEXT\\n'\n",
120 | " '\\n'\n",
121 | " 'Tokens have come up before in this series, and you might be wondering. \"What are those, exactly?\" Tokens '\n",
122 | " \"are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so \"\n",
123 | " 'character by character, but it breaks down a given language into statistically useful groupings of '\n",
124 | " 'characters, which are then identified with integer numbers. For example the characters \"ing\" occur '\n",
125 | " \"pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's \"\n",
126 | " 'sensitive to the surrounding character sequence, though, so the word \"sing\" might well be encoded as a '\n",
127 | " 'single token of its own, regardless of containing \"ing\".\\n'\n",
128 | " '\\n'\n",
129 | " 'The best way to get a feel of LLM tokenization is to play around with sample text and see how it gets '\n",
130 | " 'converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js '\n",
131 | " 'playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app which allows '\n",
132 | " 'you to enter text and see how the popular Llama LLMs would tokenize them.\\n'\n",
133 | " '\\n'\n",
134 | " '## Wait, what are tokens again?\\n'\n",
135 | " '\\n'\n",
136 | " \"The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. \"\n",
137 | " 'Notice how start of text is a special token ``. You might remember we also encountered some other '\n",
138 | " 'special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training '\n",
139 | " 'and fine-tuning changes the way things are tokenized, as part of setting the entire model of language. '\n",
140 | " \"Llama won't tokenize exactly as, say ChatGPT does, but the basic concepts stay the same.\\n\"\n",
141 | " '\\n'\n",
142 | " '=== END CONTEXT',\n",
143 | " 'role': 'user'},\n",
144 | " {'content': 'Please use the context above to respond to the following:\\n'\n",
145 | " 'How can I get a better understand what tokens are, and how they work in LLMs?',\n",
146 | " 'role': 'user'}]\n",
147 | " ```\n",
148 | "\n",
149 | "Output (the LLM's response):\n",
150 | "\n",
151 | "> According to the context, the best way to get a better understanding of tokens in LLMs is to play around with sample text and see how it gets converted. You can use the simple llama-tokenizer-js playground web app, which allows you to enter text and see how popular LLMs would tokenize it. Additionally, you can also remember that tokens are a way for LLMs to break down a given language into statistically useful groupings of characters, identified with integer numbers.\n",
152 | "\n",
153 | "### Faster prompt processing\n",
154 | "\n",
155 | "One detail that popped out to my eye, from an MLX perspective, was the generation speed:\n",
156 | "\n",
157 | "```\n",
158 | "Prompt: 443.319 tokens-per-sec\n",
159 | "Generation: 44.225 tokens-per-sec\n",
160 | "```\n",
161 | "\n",
162 | "Back in April I was seeing the following report (same 2021 Apple M1 Max MacBook Pro with 64GB RAM):\n",
163 | "\n",
164 | "```\n",
165 | "Prompt: 84.037 tokens-per-sec\n",
166 | "Generation: 104.326 tokens-per-sec\n",
167 | "```\n",
168 | "\n",
169 | "The generation speed looks slower now, but the prompt processing speed is some 5X faster, and in RAG applications, whre the prompt gets stuffed with retrieved data, this is an important figure. That said, this is a completely different model from the `h2o-danube2-1.8b-chat-MLX-4bit` from the earlier article, and many aspects of the model itself can affect prompt processing and generation speeds.\n",
170 | "\n",
171 | "The model I've used in the code above is my new favorite, general-purpose, open-source model, `Hermes-2-Theta-Llama-3-8B`, and in particular [a 4 bit quant I converted to MLX and contributed to the community myself](https://huggingface.co/mlx-community/Hermes-2-Theta-Llama-3-8B-4bit), using techniques from my previous article in this series, [\"Converting models from Hugging Face to MLX format, and sharing\"](https://github.com/uogbuji/mlx-notes/blob/main/2024/conversion-etc.md).\n",
172 | "\n",
173 | "# Going beyond\n",
174 | "\n",
175 | "These are the basic bones of RAG. Using just the code so far, you already have a lot of basis for experimentation. You can change the chunk size of the data stored in the vector DB—an adjustment which might surprise you in the degree of its effects. You can play with `SCORE_THRESHOLD` and `MAX_CHUNKS` to dial up or down what gets stuffed into the prompt for generation.\n",
176 | "\n",
177 | "That's just scratching the surface. There are a dizzying array of techniques and variations to RAG. Just to name a selection, you can:\n",
178 | "\n",
179 | "* use overlap with the chunking, so that you're less likely to chop apart or orphan the context of each chunk\n",
180 | "* have multiple levels of chunking, e.g. chunking document section headers as well as their contents, sometimes called hierarchical RAG\n",
181 | "* base the retrieval on more basic SQL or other traditional database query rather than vector search, perhaps even using a coding LLM to generate the SQL (yes, there are special security implications to this)\n",
182 | "* use text matching rather than semantic vector search\n",
183 | "* take retrieved chunks and re-summarize them using an LLM before sending them for generation (contextual compression), or re-assess their relevance (reranking)\n",
184 | "* retrieve and stuff with structured knowledge graphs rather than loose text\n",
185 | "* use an LLM to rewrite the user's prompt to better suit the context (while maintaining fidelity to the original)\n",
186 | "* structure the stuffing of the prompts into a format to match the training of a context obedient generation LLM\n",
187 | "\n",
188 | "Of course you can mix and match all the above, and so much more. RAG is really just an onramp to engineering, rather than its destination. As I continue this article series, I'll probably end up touching on many other advanced RAG techniques.\n",
189 | "\n",
190 | "For now, you have a basic idea of how to use RAG in MLX, and you're mostly limited by your imagination. Load up your retrieval DB with your your company's knowledgebase to create a customer self-help bot. Load it up with your financials to create a prep tool for investor reporting. Load up with all your instant messages so you can remember whom to thank about that killer restaurant recommendation once you get around to trying it. Since you're using a locally-hosted LLM, courtesy MLX, you can run such apps entirely airgapped and have few of the privacy concerns from using e.g. OpenAI, Anthropic or Google.\n",
191 | "\n",
192 | "# Its data all the way down\n",
193 | "\n",
194 | "At the heart of AI has always been high quality data at high volume. RAG, if anything makes this connection far more obvious. If you want to gain its benefits, you have to be ready to commit to sound data architecture and management. We all know that garbage in leads to garbage out, but it's especially pernicious to deal with garbage out that's been given a spit shine by an eager LLM during generation.\n",
195 | "\n",
196 | "There is a lot of energy around RAG projects, but they hide a dirty little secret: they tend to look extremely promising in prototype phases, and then run into massive engineering difficulties on the path towards full product status. A lot of this is because, to be frank, organizations have often spent so much time cutting corners in their data engineering that they just don't have the right fuel for RAG, and they might not even realize where their pipelines are falling short.\n",
197 | "\n",
198 | "RAG is essentially the main grown-up LLM technique we have right now. It's at the heart of many product initiatives, including many of my own ones. Don't ever think, however, that it's a cure-all for the various issues in GenAI, such as hallucination and unpredictable behavior. In addition to making sure you have your overall data engineering house in order, be ready to implement sound AI Ops, with a lot of testing and ongoing metrics. There's no magic escape from this if you want to take the benefits of AI at scale.\n",
199 | "\n",
200 | "# Cultural accompaniment\n",
201 | "\n",
202 | "This time I'm going with some Indian/African musical syncretism right in Chocolate City, USA. It's a DJ set by Priyanka, who put on one of my favorite online DJ sets ever mashing up Amapiano grooves with hits from India. The live vibe is…damn! I mean the crowd starts by rhythmically chanting \"DeeeeeeJaaaaay, we wanna paaaaartaaaaay\", and not three minutes in a dude from the crowd jumps in with a saxophone. It's a great way to set a creative mood while puzzling through your RAG content chunking strategy.\n",
203 | "\n",
204 | "[](https://www.youtube.com/watch?v=8f3e8aMNDf0)"
205 | ]
206 | }
207 | ],
208 | "metadata": {
209 | "jupytext": {
210 | "cell_metadata_filter": "-all",
211 | "main_language": "python",
212 | "notebook_metadata_filter": "-all"
213 | },
214 | "language_info": {
215 | "name": "python"
216 | }
217 | },
218 | "nbformat": 4,
219 | "nbformat_minor": 5
220 | }
221 |
--------------------------------------------------------------------------------
/2024/rag-basics2.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Retrieval augmentation with MLX: A bag full of RAG, part 2
4 | 14 June 2024. Versions: mlx: 0.15.0 | mlx-lm: 0.14.3
5 | Updated: 11 May 2025. Versions: mlx: 0.22.0 | mlx-lm: 0.20.6
6 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
7 |
8 | [In the first part of this article](https://github.com/uogbuji/mlx-notes/blob/main/2024/rag-basics1.md) I gave a basic introduction to Retrieval Augmented Generation (RAG), a technique for integrating content retrieved from databases or other sources into prompts for an LLM. There I showed how you might construct such a context database (retrieval); in this part we'll see how the content can be stuffed into the prompt for the LLM in the generation phase. You'll want to read part 1 before proceeding.
9 |
10 | # Back to the land of LLMs
11 |
12 | While fiddling with the vector database, we haven't yet got around to using the G (Generation) part of RAG. The results from vector DB lookup are exact, raw chunks of content. What you usually want in such scenarios is for the LLM to take this raw content and work it into a coherent response to the user. A next step is to stuff the retrieved text into the prompt, as context, along with some instructions (generally placed in a system prompt). If all goes well, the LLM's response proves useful, and is anchored by the facts retrieved from the vector DB, lowering the LLM's tendency to hallucinate.
13 |
14 | _Aside: Hallucination is one of the most misunderstood topics in GenAI. It's always important to remember what LLMs are trained to do: they are trained to complete the text provided in the prompt. They are just predicting tokens and generating language. This means that they will sometimes generate language whose meaning is confusing, false or misleading, which we call hallucinations, but in doing so, they are merely following their training._
15 |
16 | _A part of the solution is to include in the prompt facts and instructions which are carefully constructed (i.e. prompt engineered) according to an understanding of the LLM's statistical tendencies. This reduces the likelihood of hallucinations, but it may not be possible to completely eliminate that tendency. Some LLMs are trained or fine-tuned to be especially "obedient" to the context, and these are good choices for RAG. Picking the right LLM is another part of the solution; using multi-stage pipelines with verification by other LLMs or even people (perhaps from a random or heuristically selected sample of transcripts) is another part of the solution. RAG is a simple concept, but getting consistently great results with it involves complex considerations_
17 |
18 | ## Prompt stuffing 101
19 |
20 | In the previous article, [Listing 4 (qdrant_build_db.py)](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings) created a Qdrant vector database from the markdown of articles in this series. We can now use that database to retrieve likely chunks of content and stuff these in the prompt for the generation phase of RAG. Listing 1, below, is a simple example of this process, using the MLX generation interface explored in previous articles.
21 |
22 | The code first queries the vector database for chunks of content semantically similar to the user question or prompt, which is hard-coded for simplicity. It then pulls the chunks into a template to construct an overall prompt, which is sent to the LLM for completion.
23 |
24 | ### Listing 1 (qdrant_rag_101.py)
25 |
26 | _Note: [You can find all code listings on GitHub.](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings)_
27 |
28 | ```py
29 | # qdrant_rag_101.py
30 | import os
31 | from pathlib import Path
32 | import pprint
33 |
34 | from sentence_transformers import SentenceTransformer
35 | from qdrant_client import QdrantClient
36 |
37 | from ogbujipt.embedding.qdrant import collection
38 |
39 | from mlx_lm import load, generate
40 |
41 | chat_model, tokenizer = load('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')
42 |
43 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
44 | # Needed to silence a Hugging Face tokenizers library warning
45 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
46 |
47 | TEXT_SUFFIXES = ['.md', '.txt']
48 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
49 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
50 |
51 | assert Path(DBPATH).exists(), 'DB not found. You may need to run qdrant_build_db.py again'
52 |
53 | QCLIENT = QdrantClient(path=DBPATH)
54 |
55 | USER_PROMPT = 'How can I get a better understand what tokens are, and how they work in LLMs?'
56 | SCORE_THRESHOLD = 0.2
57 | MAX_CHUNKS = 4
58 |
59 | # Set up to retrieve from previously created content database named "ragbasics"
60 | # Note: Here you have to match the embedding model with the one originally used in storage
61 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
62 |
63 | results = qcoll.search(USER_PROMPT, limit=MAX_CHUNKS, score_threshold=SCORE_THRESHOLD)
64 |
65 | top_match_text = results[0].payload['_text'] # Grabs the actual content
66 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
67 | print(f'Top matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
68 |
69 | gathered_chunks = '\n\n'.join(
70 | doc.payload['_text'] for doc in results if doc.payload)
71 |
72 | sys_prompt = '''\
73 | You are a helpful assistant who answers questions directly and as briefly as possible.
74 | Consider the following context and answer the user\'s question.
75 | If you cannot answer with the given context, just say you don't know.\n
76 | '''
77 |
78 | # Construct the input message struct from the system prompt, the gathered chunks, and the user prompt itself
79 | messages = [
80 | {'role': 'system', 'content': sys_prompt},
81 | {'role': 'user', 'content': f'=== BEGIN CONTEXT\n\n{gathered_chunks}\n\n=== END CONTEXT'},
82 | {'role': 'user', 'content': f'Please use the context above to respond to the following:\n{USER_PROMPT}'}
83 | ]
84 |
85 | pprint.pprint(messages, width=120)
86 |
87 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
88 | response = generate(chat_model, tokenizer, prompt=chat_prompt, verbose=True)
89 |
90 | print('RAG-aided LLM response to the user prompt:\n', response)
91 | ```
92 |
93 | The pretty-printed `messages` structure comes out as follows:
94 |
95 | ```python
96 | [{'content': 'You are a helpful assistant who answers questions directly and as briefly as possible.\n'
97 | "Consider the following context and answer the user's question.\n"
98 | "If you cannot answer with the given context, just say you don't know.\n"
99 | '\n',
100 | 'role': 'system'},
101 | {'content': '=== BEGIN CONTEXT\n'
102 | '\n'
103 | 'Tokens have come up before in this series, and you might be wondering. "What are those, exactly?" Tokens '
104 | "are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so "
105 | 'character by character, but it breaks down a given language into statistically useful groupings of '
106 | 'characters, which are then identified with integer numbers. For example the characters "ing" occur '
107 | "pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's "
108 | 'sensitive to the surrounding character sequence, though, so the word "sing" might well be encoded as a '
109 | 'single token of its own, regardless of containing "ing".\n'
110 | '\n'
111 | 'The best way to get a feel of LLM tokenization is to play around with sample text and see how it gets '
112 | 'converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js '
113 | 'playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app which allows '
114 | 'you to enter text and see how the popular Llama LLMs would tokenize them.\n'
115 | '\n'
116 | '## Wait, what are tokens again?\n'
117 | '\n'
118 | "The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. "
119 | 'Notice how start of text is a special token ``. You might remember we also encountered some other '
120 | 'special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training '
121 | 'and fine-tuning changes the way things are tokenized, as part of setting the entire model of language. '
122 | "Llama won't tokenize exactly as, say ChatGPT does, but the basic concepts stay the same.\n"
123 | '\n'
124 | '=== END CONTEXT',
125 | 'role': 'user'},
126 | {'content': 'Please use the context above to respond to the following:\n'
127 | 'How can I get a better understand what tokens are, and how they work in LLMs?',
128 | 'role': 'user'}]
129 | ```
130 |
131 | Output (the LLM's response):
132 |
133 | > According to the context, the best way to get a better understanding of tokens in LLMs is to play around with sample text and see how it gets converted. You can use the simple llama-tokenizer-js playground web app, which allows you to enter text and see how popular LLMs would tokenize it. Additionally, you can also remember that tokens are a way for LLMs to break down a given language into statistically useful groupings of characters, identified with integer numbers.
134 |
135 | ### Faster prompt processing
136 |
137 | One detail that popped out to my eye, from an MLX perspective, was the speed report:
138 |
139 | ```
140 | Prompt: 443.319 tokens-per-sec
141 | Generation: 44.225 tokens-per-sec
142 | ```
143 |
144 | Back in April I was seeing the following report (same 2021 Apple M1 Max MacBook Pro with 64GB RAM):
145 |
146 | ```
147 | Prompt: 84.037 tokens-per-sec
148 | Generation: 104.326 tokens-per-sec
149 | ```
150 |
151 | The generation speed looks slower now, but the prompt processing speed is some 5X faster, and in RAG applications, where the prompt gets stuffed with retrieved data, this is an important figure. That said, this is a completely different model from the `h2o-danube2-1.8b-chat-MLX-4bit` from the earlier article, and many aspects of the model itself can affect prompt processing and generation speeds.
152 |
153 | The model I've used in the code above is my new favorite, general-purpose, open-source model, `Hermes-2-Theta-Llama-3-8B`, and in particular [a 4 bit quant I converted to MLX and contributed to the community myself](https://huggingface.co/mlx-community/Hermes-2-Theta-Llama-3-8B-4bit), using techniques from my previous article in this series, ["Converting models from Hugging Face to MLX format, and sharing"](https://github.com/uogbuji/mlx-notes/blob/main/2024/conversion-etc.md).
154 |
155 | # Best Practices: Chunk Size and Embedding Model Selection
156 |
157 | ## Optimizing Chunk Size for RAG
158 |
159 | Chunk size plays a critical role in the effectiveness and efficiency of Retrieval-Augmented Generation (RAG) systems. The right chunk size balances the need for detailed, relevant retrieval with the speed and faithfulness of generated responses.
160 |
161 | - **Precision vs. Context:** Smaller chunks (e.g., 250–256 tokens) enable more precise retrieval, as each chunk is focused on a narrow context. However, if chunks are too small, important context may be lost, leading to fragmented or incomplete answers.
162 | - **Larger Chunks:** Larger chunks (e.g., 512 tokens or a paragraph) provide more context, reducing the risk of missing relevant details, but can dilute the representation if multiple topics are included, potentially lowering retrieval precision and slowing response generation.
163 | - **Experimentation is Key:** There is no universal optimal chunk size. Start with sizes between 250 and 512 tokens and adjust based on your data and use case. Monitor both retrieval accuracy and system latency to find the best balance.
164 | - **Semantic Chunking:** Advanced strategies, such as semantically informed chunking (e.g., the SPLICE method), can further improve retrieval by aligning chunk boundaries with natural topic or section breaks, preserving meaning and context.
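
To make that experimentation concrete, here's a minimal, illustrative chunker (a toy stand-in, not the `text_split` helper from OgbujiPT used elsewhere in this series). It packs Markdown paragraphs into chunks up to a target budget, approximating token count with a plain word count:

```py
# Toy chunker: pack paragraphs into chunks up to a rough "token" budget (word-count proxy)
def chunk_by_paragraph(text, target_tokens=300):
    chunks, current, count = [], [], 0
    for para in text.split('\n\n'):
        words = len(para.split())
        # Close out the current chunk if this paragraph would overflow the budget
        if current and count + words > target_tokens:
            chunks.append('\n\n'.join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append('\n\n'.join(current))
    return chunks

doc = open('assets/resources/2024/ragbasics/files/MLX-day-one.md').read()
for budget in (250, 512):
    print(budget, 'token budget ->', len(chunk_by_paragraph(doc, target_tokens=budget)), 'chunks')
```

Comparing chunk counts (and eyeballing a few chunks) at different budgets is a quick, low-tech way to feel out the precision vs. context trade-off before you re-index your vector DB.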
165 |
166 | ## Choosing the Right Embedding Model
167 |
168 | The choice of embedding model directly impacts retrieval quality, system performance, and scalability:
169 |
170 | - **Model Benchmarks:** Use benchmarks like the Massive Text Embedding Benchmark (MTEB) to compare models on tasks relevant to your application, such as retrieval, semantic similarity, and reranking
171 | - **Dense vs. Sparse vs. Hybrid:** Dense models (e.g., E5, MiniLM) excel at semantic search, while sparse models (e.g., BM25) are better for keyword matching. Hybrid approaches often yield the best results, especially for heterogeneous or domain-specific data
172 | - **Model Context Window:** Ensure the model’s maximum token limit aligns with your chosen chunk size. For most RAG applications, models supporting 512 tokens per embedding are sufficient, but longer context windows may be needed for larger documents (see the quick check sketched after this list)
173 | - **Efficiency and Domain Fit:** Consider inference speed, memory requirements, and how well the model handles your domain’s language and structure. Test multiple models and measure performance on your actual data to guide selection
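
As a quick sanity check for the context window point above, you can ask the embedding model itself what it supports. This is a sketch assuming the sentence-transformers library used in part 1 of this series; the printed values are whatever your installed model reports:

```py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Inputs longer than max_seq_length get truncated, so chunks should stay within it
print('Max sequence length (tokens):', model.max_seq_length)
print('Embedding dimension:', model.get_sentence_embedding_dimension())
```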
174 |
175 | ## Summary: Chunk Size and Embedding Model Tips
176 |
177 | | Aspect | Recommendation |
178 | |--------------------------|----------------------------------------------------|
179 | | Chunk Size | Start with 250–512 tokens; adjust as needed |
180 | | Chunking Strategy | Prefer semantic or paragraph-based chunking |
181 | | Embedding Model | Use MTEB or real-world benchmarks for selection |
182 | | Model Type | Dense for semantics; hybrid for complex datasets |
183 | | Context Window | Ensure model supports your chunk size |
184 | | Evaluation | Test for faithfulness, relevance, and efficiency |
185 |
186 | By carefully tuning chunk size and embedding model choice, you can significantly improve both the precision and responsiveness of your RAG system.
187 |
188 | # Going beyond
189 |
190 | These are the basic bones of RAG. Using just the code so far, you already have a lot of basis for experimentation. You can change the chunk size of the data stored in the vector DB—an adjustment which might surprise you in the degree of its effects. You can play with `SCORE_THRESHOLD` and `MAX_CHUNKS` to dial up or down what gets stuffed into the prompt for generation.
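
For a concrete picture of what those two knobs do, here's a hypothetical post-retrieval filter. The constant names mirror the ones mentioned above, but the values and the helper function are illustrative, not the exact code from the listings:

```py
SCORE_THRESHOLD = 0.2  # Illustrative value: drop weakly matching chunks
MAX_CHUNKS = 4         # Illustrative value: cap how much gets stuffed into the prompt

def select_chunks(results):
    # results: (score, text) pairs from the vector DB search, best match first
    kept = [text for score, text in results if score >= SCORE_THRESHOLD]
    return kept[:MAX_CHUNKS]

sample = [(0.71, 'chunk A'), (0.45, 'chunk B'), (0.18, 'chunk C')]
print(select_chunks(sample))  # ['chunk A', 'chunk B']
```

Raising the threshold or lowering the cap trades recall for a tighter, cheaper prompt; loosening them does the opposite.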
191 |
192 | That's just scratching the surface. There is a dizzying array of techniques and variations on RAG. Just to name a selection, you can:
193 |
194 | * use overlap with the chunking, so that you're less likely to chop apart or orphan the context of each chunk
195 | * have multiple levels of chunking, e.g. chunking document section headers as well as their contents, sometimes called hierarchical RAG
196 | * base the retrieval on more basic SQL or other traditional database query rather than vector search, perhaps even using a coding LLM to generate the SQL (yes, there are special security implications to this)
197 | * use text matching rather than semantic vector search
198 | * take retrieved chunks and re-summarize them using an LLM before sending them for generation (contextual compression), or re-assess their relevance (reranking; see the sketch after this list)
199 | * retrieve and stuff with structured knowledge graphs rather than loose text
200 | * use an LLM to rewrite the user's prompt to better suit the context (while maintaining fidelity to the original)
201 | * structure the stuffing of the prompts into a format to match the training of a context obedient generation LLM
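
As one worked example from the list above, here's a minimal reranking sketch using a cross-encoder from the sentence-transformers library. The model name is a commonly used public checkpoint and the candidate strings are placeholders; neither comes from the listings in this series:

```py
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly: slower than vector
# search, but usually more accurate, so it works well as a second-pass filter
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'How does ChatML get converted for use with the LLM?'
candidates = ['placeholder: one retrieved chunk', 'placeholder: another retrieved chunk']

scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, chunk in ranked:
    print(f'{score:.3f}\t{chunk}')
```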
202 |
203 | Of course you can mix and match all the above, and so much more. RAG is really just an onramp to engineering, rather than its destination. As I continue this article series, I'll probably end up touching on many other advanced RAG techniques.
204 |
205 | For now, you have a basic idea of how to use RAG in MLX, and you're mostly limited by your imagination. Load up your retrieval DB with your company's knowledgebase to create a customer self-help bot. Load it up with your financials to create a prep tool for investor reporting. Load it up with all your instant messages so you can remember whom to thank for that killer restaurant recommendation once you get around to trying it. Since you're using a locally hosted LLM, courtesy of MLX, you can run such apps entirely airgapped, with few of the privacy concerns that come with using e.g. OpenAI, Anthropic or Google.
206 |
207 | # It's data all the way down
208 |
209 | At the heart of AI has always been high-quality data at high volume. RAG, if anything, makes this connection far more obvious. If you want to gain its benefits, you have to be ready to commit to sound data architecture and management. We all know that garbage in leads to garbage out, but it's especially pernicious to deal with garbage out that's been given a spit shine by an eager LLM during generation.
210 |
211 | There is a lot of energy around RAG projects, but they hide a dirty little secret: they tend to look extremely promising in prototype phases, and then run into massive engineering difficulties on the path towards full product status. A lot of this is because, to be frank, organizations have often spent so much time cutting corners in their data engineering that they just don't have the right fuel for RAG, and they might not even realize where their pipelines are falling short.
212 |
213 | RAG is essentially the main grown-up LLM technique we have right now. It's at the heart of many product initiatives, including many of my own. Don't ever think, however, that it's a cure-all for the various issues in GenAI, such as hallucination and unpredictable behavior. In addition to making sure you have your overall data engineering house in order, be ready to implement sound AI Ops, with a lot of testing and ongoing metrics. There's no magic escape from this if you want to reap the benefits of AI at scale.
214 |
215 |
222 |
223 | # Cultural accompaniment
224 |
225 | This time I'm going with some Indian/African musical syncretism right in Chocolate City, USA. It's a DJ set by Priyanka, who put on one of my favorite online DJ sets ever, mashing up Amapiano grooves with hits from India. The live vibe is…damn! I mean the crowd starts by rhythmically chanting "DeeeeeeJaaaaay, we wanna paaaaartaaaaay", and not three minutes in, a dude from the crowd jumps in with a saxophone. It's a great way to set a creative mood while puzzling through your RAG content chunking strategy.
226 |
227 | [](https://www.youtube.com/watch?v=8f3e8aMNDf0)
228 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Attribution 4.0 International
2 |
3 | =======================================================================
4 |
5 | Creative Commons Corporation ("Creative Commons") is not a law firm and
6 | does not provide legal services or legal advice. Distribution of
7 | Creative Commons public licenses does not create a lawyer-client or
8 | other relationship. Creative Commons makes its licenses and related
9 | information available on an "as-is" basis. Creative Commons gives no
10 | warranties regarding its licenses, any material licensed under their
11 | terms and conditions, or any related information. Creative Commons
12 | disclaims all liability for damages resulting from their use to the
13 | fullest extent possible.
14 |
15 | Using Creative Commons Public Licenses
16 |
17 | Creative Commons public licenses provide a standard set of terms and
18 | conditions that creators and other rights holders may use to share
19 | original works of authorship and other material subject to copyright
20 | and certain other rights specified in the public license below. The
21 | following considerations are for informational purposes only, are not
22 | exhaustive, and do not form part of our licenses.
23 |
24 | Considerations for licensors: Our public licenses are
25 | intended for use by those authorized to give the public
26 | permission to use material in ways otherwise restricted by
27 | copyright and certain other rights. Our licenses are
28 | irrevocable. Licensors should read and understand the terms
29 | and conditions of the license they choose before applying it.
30 | Licensors should also secure all rights necessary before
31 | applying our licenses so that the public can reuse the
32 | material as expected. Licensors should clearly mark any
33 | material not subject to the license. This includes other CC-
34 | licensed material, or material used under an exception or
35 | limitation to copyright. More considerations for licensors:
36 | wiki.creativecommons.org/Considerations_for_licensors
37 |
38 | Considerations for the public: By using one of our public
39 | licenses, a licensor grants the public permission to use the
40 | licensed material under specified terms and conditions. If
41 | the licensor's permission is not necessary for any reason--for
42 | example, because of any applicable exception or limitation to
43 | copyright--then that use is not regulated by the license. Our
44 | licenses grant only permissions under copyright and certain
45 | other rights that a licensor has authority to grant. Use of
46 | the licensed material may still be restricted for other
47 | reasons, including because others have copyright or other
48 | rights in the material. A licensor may make special requests,
49 | such as asking that all changes be marked or described.
50 | Although not required by our licenses, you are encouraged to
51 | respect those requests where reasonable. More_considerations
52 | for the public:
53 | wiki.creativecommons.org/Considerations_for_licensees
54 |
55 | =======================================================================
56 |
57 | Creative Commons Attribution 4.0 International Public License
58 |
59 | By exercising the Licensed Rights (defined below), You accept and agree
60 | to be bound by the terms and conditions of this Creative Commons
61 | Attribution 4.0 International Public License ("Public License"). To the
62 | extent this Public License may be interpreted as a contract, You are
63 | granted the Licensed Rights in consideration of Your acceptance of
64 | these terms and conditions, and the Licensor grants You such rights in
65 | consideration of benefits the Licensor receives from making the
66 | Licensed Material available under these terms and conditions.
67 |
68 |
69 | Section 1 -- Definitions.
70 |
71 | a. Adapted Material means material subject to Copyright and Similar
72 | Rights that is derived from or based upon the Licensed Material
73 | and in which the Licensed Material is translated, altered,
74 | arranged, transformed, or otherwise modified in a manner requiring
75 | permission under the Copyright and Similar Rights held by the
76 | Licensor. For purposes of this Public License, where the Licensed
77 | Material is a musical work, performance, or sound recording,
78 | Adapted Material is always produced where the Licensed Material is
79 | synched in timed relation with a moving image.
80 |
81 | b. Adapter's License means the license You apply to Your Copyright
82 | and Similar Rights in Your contributions to Adapted Material in
83 | accordance with the terms and conditions of this Public License.
84 |
85 | c. Copyright and Similar Rights means copyright and/or similar rights
86 | closely related to copyright including, without limitation,
87 | performance, broadcast, sound recording, and Sui Generis Database
88 | Rights, without regard to how the rights are labeled or
89 | categorized. For purposes of this Public License, the rights
90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
91 | Rights.
92 |
93 | d. Effective Technological Measures means those measures that, in the
94 | absence of proper authority, may not be circumvented under laws
95 | fulfilling obligations under Article 11 of the WIPO Copyright
96 | Treaty adopted on December 20, 1996, and/or similar international
97 | agreements.
98 |
99 | e. Exceptions and Limitations means fair use, fair dealing, and/or
100 | any other exception or limitation to Copyright and Similar Rights
101 | that applies to Your use of the Licensed Material.
102 |
103 | f. Licensed Material means the artistic or literary work, database,
104 | or other material to which the Licensor applied this Public
105 | License.
106 |
107 | g. Licensed Rights means the rights granted to You subject to the
108 | terms and conditions of this Public License, which are limited to
109 | all Copyright and Similar Rights that apply to Your use of the
110 | Licensed Material and that the Licensor has authority to license.
111 |
112 | h. Licensor means the individual(s) or entity(ies) granting rights
113 | under this Public License.
114 |
115 | i. Share means to provide material to the public by any means or
116 | process that requires permission under the Licensed Rights, such
117 | as reproduction, public display, public performance, distribution,
118 | dissemination, communication, or importation, and to make material
119 | available to the public including in ways that members of the
120 | public may access the material from a place and at a time
121 | individually chosen by them.
122 |
123 | j. Sui Generis Database Rights means rights other than copyright
124 | resulting from Directive 96/9/EC of the European Parliament and of
125 | the Council of 11 March 1996 on the legal protection of databases,
126 | as amended and/or succeeded, as well as other essentially
127 | equivalent rights anywhere in the world.
128 |
129 | k. You means the individual or entity exercising the Licensed Rights
130 | under this Public License. Your has a corresponding meaning.
131 |
132 |
133 | Section 2 -- Scope.
134 |
135 | a. License grant.
136 |
137 | 1. Subject to the terms and conditions of this Public License,
138 | the Licensor hereby grants You a worldwide, royalty-free,
139 | non-sublicensable, non-exclusive, irrevocable license to
140 | exercise the Licensed Rights in the Licensed Material to:
141 |
142 | a. reproduce and Share the Licensed Material, in whole or
143 | in part; and
144 |
145 | b. produce, reproduce, and Share Adapted Material.
146 |
147 | 2. Exceptions and Limitations. For the avoidance of doubt, where
148 | Exceptions and Limitations apply to Your use, this Public
149 | License does not apply, and You do not need to comply with
150 | its terms and conditions.
151 |
152 | 3. Term. The term of this Public License is specified in Section
153 | 6(a).
154 |
155 | 4. Media and formats; technical modifications allowed. The
156 | Licensor authorizes You to exercise the Licensed Rights in
157 | all media and formats whether now known or hereafter created,
158 | and to make technical modifications necessary to do so. The
159 | Licensor waives and/or agrees not to assert any right or
160 | authority to forbid You from making technical modifications
161 | necessary to exercise the Licensed Rights, including
162 | technical modifications necessary to circumvent Effective
163 | Technological Measures. For purposes of this Public License,
164 | simply making modifications authorized by this Section 2(a)
165 | (4) never produces Adapted Material.
166 |
167 | 5. Downstream recipients.
168 |
169 | a. Offer from the Licensor -- Licensed Material. Every
170 | recipient of the Licensed Material automatically
171 | receives an offer from the Licensor to exercise the
172 | Licensed Rights under the terms and conditions of this
173 | Public License.
174 |
175 | b. No downstream restrictions. You may not offer or impose
176 | any additional or different terms or conditions on, or
177 | apply any Effective Technological Measures to, the
178 | Licensed Material if doing so restricts exercise of the
179 | Licensed Rights by any recipient of the Licensed
180 | Material.
181 |
182 | 6. No endorsement. Nothing in this Public License constitutes or
183 | may be construed as permission to assert or imply that You
184 | are, or that Your use of the Licensed Material is, connected
185 | with, or sponsored, endorsed, or granted official status by,
186 | the Licensor or others designated to receive attribution as
187 | provided in Section 3(a)(1)(A)(i).
188 |
189 | b. Other rights.
190 |
191 | 1. Moral rights, such as the right of integrity, are not
192 | licensed under this Public License, nor are publicity,
193 | privacy, and/or other similar personality rights; however, to
194 | the extent possible, the Licensor waives and/or agrees not to
195 | assert any such rights held by the Licensor to the limited
196 | extent necessary to allow You to exercise the Licensed
197 | Rights, but not otherwise.
198 |
199 | 2. Patent and trademark rights are not licensed under this
200 | Public License.
201 |
202 | 3. To the extent possible, the Licensor waives any right to
203 | collect royalties from You for the exercise of the Licensed
204 | Rights, whether directly or through a collecting society
205 | under any voluntary or waivable statutory or compulsory
206 | licensing scheme. In all other cases the Licensor expressly
207 | reserves any right to collect such royalties.
208 |
209 |
210 | Section 3 -- License Conditions.
211 |
212 | Your exercise of the Licensed Rights is expressly made subject to the
213 | following conditions.
214 |
215 | a. Attribution.
216 |
217 | 1. If You Share the Licensed Material (including in modified
218 | form), You must:
219 |
220 | a. retain the following if it is supplied by the Licensor
221 | with the Licensed Material:
222 |
223 | i. identification of the creator(s) of the Licensed
224 | Material and any others designated to receive
225 | attribution, in any reasonable manner requested by
226 | the Licensor (including by pseudonym if
227 | designated);
228 |
229 | ii. a copyright notice;
230 |
231 | iii. a notice that refers to this Public License;
232 |
233 | iv. a notice that refers to the disclaimer of
234 | warranties;
235 |
236 | v. a URI or hyperlink to the Licensed Material to the
237 | extent reasonably practicable;
238 |
239 | b. indicate if You modified the Licensed Material and
240 | retain an indication of any previous modifications; and
241 |
242 | c. indicate the Licensed Material is licensed under this
243 | Public License, and include the text of, or the URI or
244 | hyperlink to, this Public License.
245 |
246 | 2. You may satisfy the conditions in Section 3(a)(1) in any
247 | reasonable manner based on the medium, means, and context in
248 | which You Share the Licensed Material. For example, it may be
249 | reasonable to satisfy the conditions by providing a URI or
250 | hyperlink to a resource that includes the required
251 | information.
252 |
253 | 3. If requested by the Licensor, You must remove any of the
254 | information required by Section 3(a)(1)(A) to the extent
255 | reasonably practicable.
256 |
257 | 4. If You Share Adapted Material You produce, the Adapter's
258 | License You apply must not prevent recipients of the Adapted
259 | Material from complying with this Public License.
260 |
261 |
262 | Section 4 -- Sui Generis Database Rights.
263 |
264 | Where the Licensed Rights include Sui Generis Database Rights that
265 | apply to Your use of the Licensed Material:
266 |
267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
268 | to extract, reuse, reproduce, and Share all or a substantial
269 | portion of the contents of the database;
270 |
271 | b. if You include all or a substantial portion of the database
272 | contents in a database in which You have Sui Generis Database
273 | Rights, then the database in which You have Sui Generis Database
274 | Rights (but not its individual contents) is Adapted Material; and
275 |
276 | c. You must comply with the conditions in Section 3(a) if You Share
277 | all or a substantial portion of the contents of the database.
278 |
279 | For the avoidance of doubt, this Section 4 supplements and does not
280 | replace Your obligations under this Public License where the Licensed
281 | Rights include other Copyright and Similar Rights.
282 |
283 |
284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
285 |
286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
296 |
297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
306 |
307 | c. The disclaimer of warranties and limitation of liability provided
308 | above shall be interpreted in a manner that, to the extent
309 | possible, most closely approximates an absolute disclaimer and
310 | waiver of all liability.
311 |
312 |
313 | Section 6 -- Term and Termination.
314 |
315 | a. This Public License applies for the term of the Copyright and
316 | Similar Rights licensed here. However, if You fail to comply with
317 | this Public License, then Your rights under this Public License
318 | terminate automatically.
319 |
320 | b. Where Your right to use the Licensed Material has terminated under
321 | Section 6(a), it reinstates:
322 |
323 | 1. automatically as of the date the violation is cured, provided
324 | it is cured within 30 days of Your discovery of the
325 | violation; or
326 |
327 | 2. upon express reinstatement by the Licensor.
328 |
329 | For the avoidance of doubt, this Section 6(b) does not affect any
330 | right the Licensor may have to seek remedies for Your violations
331 | of this Public License.
332 |
333 | c. For the avoidance of doubt, the Licensor may also offer the
334 | Licensed Material under separate terms or conditions or stop
335 | distributing the Licensed Material at any time; however, doing so
336 | will not terminate this Public License.
337 |
338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
339 | License.
340 |
341 |
342 | Section 7 -- Other Terms and Conditions.
343 |
344 | a. The Licensor shall not be bound by any additional or different
345 | terms or conditions communicated by You unless expressly agreed.
346 |
347 | b. Any arrangements, understandings, or agreements regarding the
348 | Licensed Material not stated herein are separate from and
349 | independent of the terms and conditions of this Public License.
350 |
351 |
352 | Section 8 -- Interpretation.
353 |
354 | a. For the avoidance of doubt, this Public License does not, and
355 | shall not be interpreted to, reduce, limit, restrict, or impose
356 | conditions on any use of the Licensed Material that could lawfully
357 | be made without permission under this Public License.
358 |
359 | b. To the extent possible, if any provision of this Public License is
360 | deemed unenforceable, it shall be automatically reformed to the
361 | minimum extent necessary to make it enforceable. If the provision
362 | cannot be reformed, it shall be severed from this Public License
363 | without affecting the enforceability of the remaining terms and
364 | conditions.
365 |
366 | c. No term or condition of this Public License will be waived and no
367 | failure to comply consented to unless expressly agreed to by the
368 | Licensor.
369 |
370 | d. Nothing in this Public License constitutes or may be interpreted
371 | as a limitation upon, or waiver of, any privileges and immunities
372 | that apply to the Licensor or You, including from the legal
373 | processes of any jurisdiction or authority.
374 |
375 |
376 | =======================================================================
377 |
378 | Creative Commons is not a party to its public
379 | licenses. Notwithstanding, Creative Commons may elect to apply one of
380 | its public licenses to material it publishes and in those instances
381 | will be considered the “Licensor.” The text of the Creative Commons
382 | public licenses is dedicated to the public domain under the CC0 Public
383 | Domain Dedication. Except for the limited purpose of indicating that
384 | material is shared under a Creative Commons public license or as
385 | otherwise permitted by the Creative Commons policies published at
386 | creativecommons.org/policies, Creative Commons does not authorize the
387 | use of the trademark "Creative Commons" or any other trademark or logo
388 | of Creative Commons without its prior written consent including,
389 | without limitation, in connection with any unauthorized modifications
390 | to any of its public licenses or any other arrangements,
391 | understandings, or agreements concerning use of licensed material. For
392 | the avoidance of doubt, this paragraph does not form part of the
393 | public licenses.
394 |
395 | Creative Commons may be contacted at creativecommons.org.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Notes on the Apple MLX machine learning framework
2 |
3 | ## Apple MLX for AI/Large Language Models—Day One
4 |
5 |
13 |
14 | ## Converting models from Hugging Face to MLX format, and sharing
15 |
16 |
24 |
25 | ## Retrieval augmentation with MLX: A bag full of RAG, part 1
26 |
27 |
35 |
36 | ## Retrieval augmentation with MLX: A bag full of RAG, part 2
37 |
38 |
49 |
50 | # More (up to date) MLX resources
51 |
52 |
53 |
54 | * [MLX home page](https://github.com/ml-explore/mlx)
55 | * [Hugging Face MLX community](https://huggingface.co/mlx-community)
56 | * [Using MLX at Hugging Face](https://huggingface.co/docs/hub/en/mlx)
57 | * [MLX Text-completion Finetuning Notebook](https://github.com/mark-lord/MLX-text-completion-notebook)
58 | * [MLX Tuning Fork—Framework for parameterized large language model (Q)LoRa fine-tuning using mlx, mlx_lm, and OgbujiPT. Architecture for systematic running of easily parameterized fine-tunes](https://github.com/chimezie/mlx-tuning-fork)
59 |
60 | # A few general notes
61 |
62 | * For the many chat formats already charted out in llama.cpp, see the `@register_chat_format` decorated functions in [llama_chat_format.py](https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_chat_format.py)
63 |
64 | ## To do, or figure out
65 |
66 | * [any grammar/ebnf support a la llama.cpp](https://christophergs.com/blog/running-open-source-llms-in-python#grammar)?
67 | * Alternate LLM sampling methods
68 | * Steering vectors
69 |
70 | # Syncing articles to notebooks
71 |
72 | Use [Jupytext](https://jupytext.readthedocs.io/en/latest/) to convert the `.md` articles to `.ipynb` notebooks:
73 |
74 | ```sh
75 | jupytext --to ipynb 2024/MLX-day-one.md
76 | ```
77 |
78 | You may have to convert cells using plain `pip` to use `%pip` instead. Jupytext also doesn't seem to check the format metadata, so you might need to convert non-Python cells back to Markdown by hand.
79 |
80 | # License
81 |
82 | Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
83 |
84 | This work is licensed under a
85 | [Creative Commons Attribution 4.0 International License][cc-by].
86 |
87 | [![CC BY 4.0][cc-by-image]][cc-by]
88 |
89 | [cc-by]: http://creativecommons.org/licenses/by/4.0/
90 | [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
91 | [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
92 |
93 | See also: https://github.com/santisoler/cc-licenses?tab=readme-ov-file#cc-attribution-40-international
94 |
--------------------------------------------------------------------------------
/assets/images/2024/Apple-MLX-GeminiGen-cropped.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/Apple-MLX-GeminiGen-cropped.jpg
--------------------------------------------------------------------------------
/assets/images/2024/Apple-MLX-GeminiGen.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/Apple-MLX-GeminiGen.jpeg
--------------------------------------------------------------------------------
/assets/images/2024/RAG-basics-1-cover.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/RAG-basics-1-cover.jpg
--------------------------------------------------------------------------------
/assets/images/2024/RAG-basics-2-cover.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/RAG-basics-2-cover.jpg
--------------------------------------------------------------------------------
/assets/images/2024/apple-mlx-ail-llm-day-one.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/apple-mlx-ail-llm-day-one.gif
--------------------------------------------------------------------------------
/assets/images/2024/black-panther-hulk-cover-open-source-llm-800x500.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/black-panther-hulk-cover-open-source-llm-800x500.jpg
--------------------------------------------------------------------------------
/assets/images/2024/construction-joke-meme.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/construction-joke-meme.jpg
--------------------------------------------------------------------------------
/assets/images/2024/meme-arrested_dev_why_you_use_open_source.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/meme-arrested_dev_why_you_use_open_source.jpg
--------------------------------------------------------------------------------
/assets/images/2024/nasa-eclipse-diamong-ring.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/nasa-eclipse-diamong-ring.png
--------------------------------------------------------------------------------
/assets/images/2024/ogbuji-kids-eclipse-2017.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/ogbuji-kids-eclipse-2017.jpg
--------------------------------------------------------------------------------
/assets/images/2024/oranges-to-apples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/oranges-to-apples.png
--------------------------------------------------------------------------------
/assets/images/2024/rag-process-gao-et-al.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/rag-process-gao-et-al.png
--------------------------------------------------------------------------------
/assets/images/2024/rmaiig-engineering-202404.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/rmaiig-engineering-202404.png
--------------------------------------------------------------------------------
/assets/images/2024/tokenizer-examples.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/tokenizer-examples.png
--------------------------------------------------------------------------------
/assets/images/2024/vlite-install-errors.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/vlite-install-errors.png
--------------------------------------------------------------------------------
/assets/images/2024/vlite-perf-claims.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/images/2024/vlite-perf-claims.png
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/files/MLX-day-one.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Apple MLX for AI/Large Language Models—Day One
4 |
5 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
6 |
7 | I've been using llama.cpp on Mac Silicon for months now, and [my brother, Chimezie](https://huggingface.co/cogbuji) has been nudging me to give [MLX](https://github.com/ml-explore/mlx) a go.
8 | I finally set aside time today to get started, with an eventual goal of adding support for MLX model loading & usage in [OgbujiPT](https://github.com/OoriData/OgbujiPT). I'd been warned it's rough around the edges, but it's been stimulating to play with. I thought I'd capture some of my notes, including some pitfalls I ran into, which might help anyone else trying to get into MLX in its current state.
9 |
10 | As a quick bit of background I'll mention that MLX is very interesting because honestly, Apple has the most coherently engineered consumer and small-business-level hardware for AI workloads, with Apple Silicon and its unified memory. The news lately is all about Apple's AI fumbles, but I suspect their clever plan is to empower a community of developers to take the arrows in their back and build things out for them. The MLX community is already an absolute machine, a fact Chimezie spotted early on. If like me you're trying to develop products on this new frontier without abdicating the engineering to separate, black-box providers, MLX is a compelling avenue.
11 |
12 | 
13 |
14 | My initial forays will just be into inferencing, which should complement the large amount of solid community work in MLX fine-tuning and other more advanced topics. There's plenty of nuance to dig into just on the inference side, though.
15 | As I was warned, it's clear that MLX is developing with great velocity, even by contemporary AI standards, so just as some resources I found from six weeks ago were already out of date, this could also well be by the time you come across it. I'll try to update and continue taking notes on developments as I go along, though.
16 |
17 | First of all, I installed the mlx_lm package for Python, following the [instructions from HuggingFace](https://huggingface.co/docs/hub/en/mlx). After switching to a suitable Python virtual environment:
18 |
19 | ```sh
20 | pip install mlx-lm
21 | ```
22 |
23 | Later on, it became clear that I probably wanted to keep closer to the cutting edge, so I pulled from github instead:
24 |
25 | ```sh
26 | git clone https://github.com/ml-explore/mlx-examples.git
27 | cd mlx-examples/llms
28 | pip install -U .
29 | ```
30 |
31 | All I needed was a model to try out. On llama.cpp my go-to has been [OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), so my first move was to try to run that on MLX. I had read that MLX recently gained limited GGUF model file format support, with limited support for quantized outputs. If this sentence has been gobbledygook to you,
32 | I recommend you pause, [read this useful, llama.cpp-centered tutorial](https://christophergs.com/blog/running-open-source-llms-in-python), and come back. These concepts will be useful to you no matter what AI/LLM framework you end up using.
33 |
34 | I naively just tried to load my already downloaded GGUF using `mlx_lm.load()`, but it clearly wanted a `safetensors` distribution. I looked around some more and found the [GGUF](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm) examples, but it was clear this was off the beaten path, and Chimezie soon told me the usual approach is to use MLX-specific models, which I can easily convert myself from regular model weights, or I can find pre-converted weights in the [mlx-community space](https://huggingface.co/mlx-community).
35 | The first/obvious such repository I found matching OpenHermes-2.5-Mistral-7B was `mlx-community/OpenHermes-2.5-Mistral-7B`, but MLX refused to load it, and indeed it's an outdated model without `safetensors`. It used the `.NPZ` format, which seems to be out of date and [yet is still referenced in the docs](https://ml-explore.github.io/mlx/build/html/examples/llama-inference.html#converting-the-weights).
36 |
37 | 
39 |
40 | A better choice turned out to be [`mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx`](https://huggingface.co/mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx).
41 |
42 | ```py
43 | from mlx_lm import load, generate
44 |
45 | model, tokenizer = load('mlx-community/OpenHermes-2.5-Mistral-7B-4bit-mlx')
46 | ```
47 |
48 | The first time you run this load it will download from HuggingFace. The repository will be cached, by default in `~/.cache/huggingface/hub`, so subsequent loads will be much faster. Quick completion/generation example:
49 |
50 | ```py
51 | response = generate(model, tokenizer, prompt="A fun limerick about four-leaf clovers is:", verbose=True)
52 | ```
53 |
54 | You should see the completion response being streamed. I got a truly terrible limerick. Your mileage may vary.
55 |
56 | You can also use [ChatML-style interaction](https://huggingface.co/docs/transformers/main/en/chat_templating):
57 |
58 | ```py
59 | messages = [
60 | {'role': 'system', 'content': 'You are a friendly chatbot who always responds in the style of a talk show host'},
61 | {'role': 'user', 'content': 'Do you have any advice for a fresh graduate?'}]
62 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
63 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
64 | ```
65 |
66 | `response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>`. Having the system message in the chat prompting and all that definitely, by my quick impressions, made the interactions far more coherent.
67 |
68 | 
69 |
70 | That's as far as I got in a few hours of probing yesterday, but as I said, I'll keep the notes coming as I learn more. Next I plan to start thinking about how to incorporate what I've learned into OgbujiPT.
71 |
72 | Plug: As I've suggested, Chimezie has blazed this trail before me, and was quite helpful. You can check out the work he's already shared with the MLX community, such as his [Mr. Grammatology medical/clinical LLM fine-tune](https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5), and [mlx-tuning-fork](https://github.com/chimezie/mlx-tuning-fork), his framework for (Q)LoRa fine-tuning with MLX. [His work is featured in the brand new Oori Data HuggingFace organization page.](https://huggingface.co/OoriData).
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/files/conversion-etc.md:
--------------------------------------------------------------------------------
1 | 
3 |
4 | # Converting models from Hugging Face to MLX format, and sharing
5 |
6 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
7 |
8 | Since my [first article on dipping my toes into MLX](https://github.com/uogbuji/mlx-notes/blob/main/2024/MLX-day-one.md) I've had several attention swaps, but a trip to Cleveland for prime eclipse viewing in totality gave me a chance to get back to the framework. Of course the MLX team and community keep marching on, and there have been several exciting releases and performance boosts since my last look-in.
9 |
10 | At the same time, coincidentally, a new small model was released which I wanted to try out. [H2O-Danube2-1.8b, and in particular the chat version](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat) debuts at #2 in the "~1.5B parameter" category on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which is promising. It was then only available in Hugging Face weight format, so I needed to convert it for MLX use.
11 |
12 | I'll use this new model to explore how easy it is to convert Hugging Face weights to MLX format, and to share the results, if one chooses to do so.
13 |
14 |  [source](https://science.nasa.gov/eclipses/future-eclipses/eclipse-2024/what-to-expect/)
15 |
16 | ## Preparation and conversion
17 |
18 | First of all I upgraded the MLX versions in the virtual environment I was using
19 |
20 | ```sh
21 | pip install -U mlx mlx-lm
22 | ```
23 |
24 | If you do so, and in the unlikely event that you don't end up with the latest versions of packages after this, you might want to add the `--force-reinstall` flag.
25 |
26 | I created a directory to hold the converted model
27 |
28 | ```sh
29 | mkdir -p ~/.local/share/models/mlx
30 | mkdir ~/.local/share/models/mlx/h2o-danube2-1.8b-chat
31 | ```
32 |
33 | Used the command line for the actual conversion
34 |
35 | ```sh
36 | python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat -q
37 | ```
38 |
39 | This took around ten and a half minutes of wall clock time (downloading and converting). The `-q` option quantizes the weights while converting them. The default quantization is to 4 bits (from the standard Hugging Face weight format of 16-bit floating point), but you can choose a different number of bits per weight, and set other quantization parameters, via additional command line options.
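
For instance, an 8-bit conversion might look like the following. Treat the flag names as an assumption to double-check against `python -m mlx_lm.convert --help` on your installed version, since the CLI evolves quickly:

```sh
python -m mlx_lm.convert --hf-path h2oai/h2o-danube2-1.8b-chat \
  --mlx-path ~/.local/share/models/mlx/h2o-danube2-1.8b-chat-8bit \
  -q --q-bits 8
```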
40 |
41 | It's good to be aware of the model type and architecture you're dealing with, which doesn't change when you convert weights to MLX. Eyeballing the [h2o-danube2-1.8b-chat config.json](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/blob/main/config.json), I found the following useful bits:
42 |
43 | ```json
44 | "architectures": [
45 | "MistralForCausalLM"
46 | ],
47 | …
48 | "model_type": "mistral",
49 | ```
50 |
51 | Luckily Mistral-style models are well supported, thanks to their popularity.
52 |
53 | # Loading and using the converted model from Python
54 |
55 | I loaded the model from local directory
56 |
57 | ```py
58 | from mlx_lm import load, generate
59 | from pathlib import Path
60 |
61 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
62 | model, tokenizer = load(model_path)
63 | ```
64 |
65 | This led to a warning
66 |
67 | > You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
68 |
69 | Some digging into the Hugging Face transformers library, from whence it originates, yielded no easy answers as to how seriously to take this warning.
70 |
71 | First ran the model using a similar pattern to the example in the last article, but it tripped up on the chat format.
72 |
73 | ```py
74 | messages = [
75 | {'role': 'system', 'content': 'You are a friendly and informative chatbot'},
76 | {'role': 'user', 'content': 'There\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]
77 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
78 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
79 | ```
80 |
81 | I got `TemplateError: System role not supported`. Not all chat models are trained/fine-tuned with the system role. If you try to set system messages but the Hugging Face tokenizer doesn't recognize system role support, it sends a strong signal through this exception. Much better than silently confusing the model. There is no universal workaround for this—it all comes down to details of how the model was trained. I didn't do a lot of investigation of the H2O Danube 2 chat template. Instead I just basically slammed the system prompt into the user role.
82 |
83 | ```py
84 | from mlx_lm import load, generate
85 | from pathlib import Path
86 |
87 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
88 | model, tokenizer = load(model_path) # Issues a slow tokenizer warning
89 |
90 | SYSTEM_ROLE = 'user'
91 | messages = [
92 | {'role': SYSTEM_ROLE, 'content': 'You are a friendly and informative chatbot'},
93 | {'role': 'user', 'content': 'There\'s a total solar eclipse tomorrow. Tell me a fun fact about such events.'}]
94 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
95 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
96 | ```
97 |
98 | The generation felt faster than the last article's run on OpenHermes/Mistral-7B, and indeed the reported numbers were impressive, running on a 2021 Apple M1 Max MacBook Pro (64GB RAM):
99 |
100 | ```
101 | Prompt: 84.037 tokens-per-sec
102 | Generation: 104.326 tokens-per-sec
103 | ```
104 |
105 | Back of the envelope says that's 3-4X faster prompt processing and 2-3X faster generation. Some of that is the fact that H2O-Danube2 is around 4X smaller, but some of it is down to improvements in the MLX code.
106 |
107 | 
108 |
109 | # Uploading converted models to Hugging Face
110 |
111 | Unlike the example above, you'll often find that models have been converted to MLX weights for you already. This is of course the beauty of an open-source community. If you do convert a model yourself, you can be part of the sharing spree.
112 |
113 | ## Preparing for upload
114 |
115 | You'll need an account on Hugging Face, then an access token with write permissions. Copy one from [your tokens settings](https://huggingface.co/settings/tokens) into your clipboard (and password manager). Install the Hugging Face tools
116 |
117 | ```sh
118 | pip install -U huggingface_hub
119 | ```
120 |
121 | Run `huggingface-cli login` and paste the token you copied earlier. You're now ready for the upload. It will push everything in your local folder with the converted weights, so you should probably check that it's ready for the public. At a minimum add a `README.md` (more on this below) and look over the `config.json`, making sure there is at least a `"model_type"` key. In this case, it's unchanged from the original: `"model_type": "mistral"`. Browsing other recent model repositories in the [`mlx-community`](https://huggingface.co/mlx-community) space is a good way to get a sense of what your upload should contain.
122 |
123 | ### README.md
124 |
125 | You'll want to have a README.md file, from which the Hugging Face model card and some metadata are extracted. I started with the metadata from the original model. It's MDX format, which is markdown with metadata headers and optional inline instructions. The [original model's metadata headers are as follows](https://huggingface.co/h2oai/h2o-danube2-1.8b-chat/raw/main/README.md):
126 |
127 | ```
128 | ---
129 | language:
130 | - en
131 | library_name: transformers
132 | license: apache-2.0
133 | tags:
134 | - gpt
135 | - llm
136 | - large language model
137 | - h2o-llmstudio
138 | thumbnail: >-
139 | https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
140 | pipeline_tag: text-generation
141 | ---
142 | ```
143 |
144 | I added some descriptive information about the model and how to use it in MLX. This information becomes the Hugging Face model card for the upload.
145 |
146 | ## Upload
147 |
148 | [Hugging Face repositories are basically git and git-LFS](https://huggingface.co/docs/huggingface_hub/guides/upload), so you have many ways of interacting with them. In my case I ran a Python script:
149 |
150 | ```py
151 | from huggingface_hub import HfApi, create_repo
152 | from pathlib import Path
153 |
154 | model_path = Path.home() / Path('.local/share/models/mlx') / Path('h2o-danube2-1.8b-chat')
155 |
156 | repo_id = create_repo('h2o-danube2-1.8b-chat-MLX-4bit').repo_id
157 | api = HfApi()
158 | api.upload_folder(folder_path=model_path,
159 | repo_id=repo_id,
160 | repo_type='model',
161 | multi_commits=True,
162 | multi_commits_verbose=True)
163 | ```
164 |
165 | Notice how I set the destination `repo_id` within my own account `ucheog`. Eventually I may want to share models I convert within the MLX community space, where others can more readily find it. In such a case I'd set `repo_id` to something like `mlx-community/h2o-danube2-1.8b-chat`. Since this is my first go-round, however, I'd rather start under my own auspices. To be frank, models available on `mlx-community` are a bit of a wild west grab-bag. This is the yin and yang of open source, of course, and we each navigate the bazaar in our own way.
166 |
167 | The `multi_commits` flags use a pull request and stage the upload piecemeal, which, for example, allows better recovery from interruption.
168 |
169 | 
170 |
171 | # Wrap up
172 |
173 | You've had a quick overview on how to convert Hugging Face weights to MLX format, and how to share such converted models with the public. As it happens, [another MLX community member converted and shared h2o-danube2-1.8b-chat](https://huggingface.co/mlx-community/h2o-danube2-1.8b-chat-4bit) a few days after I posted my own version, and you should probably use that one, if you're looking to use the model seriously. Nevertheless, there are innumerable models out there, a very small proportion of which has been converted for MLX, so it's very useful to learn how to do so for yourself.
174 |
175 | # Additional resources
176 |
177 | * [Hugging Face hub docs on uploading models](https://huggingface.co/docs/hub/en/models-uploading)
178 | * [Hugging Face/Transformers docs on sharing models](https://huggingface.co/docs/transformers/model_sharing) - more relevant to notebook & in-Python use
179 | * Chat templating is a very fiddly topic, but [this Hugging Face post](https://huggingface.co/blog/chat-templates) is a useful intro. They do push Jinja2 hard, and there's nothing wrong with Jinja2, but as with any tool I'd say use it if it's the right one, and not out of reflex.
180 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/files/rag-basics1.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Retrieval augmentation with MLX: A bag full of RAG, part 1
4 |
5 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
6 |
7 | After the initial fun with LLMs, asking for the capitals of random countries and getting them to tell lousy jokes or write wack poems, the next area I've found people trying to figure out is how to "chat with their documents". In other words, can you give an LLM access to some documents, databases, web pages, etc. and get it to use that context for more specialized discussion and applications? This is more formally called Retrieval Augmented Generation (RAG).
8 |
9 | As usual I shan't spend too much time explaining fundamental AI principles in these articles which are focused on the MLX framework. For a very high level view of RAG, see [this synopsis from the Prompt Engineering Guide](https://www.promptingguide.ai/techniques/rag), or better yet, see [this full article from the same source](https://www.promptingguide.ai/research/rag). The latter is a long read, but really important if you're trying to advance from baby steps to almost any sophisticated use of LLMs. In any case, you'll need to understand at least the basics of RAG to get the most of this article.
10 |
11 | ### RAG application workflow (Source: [Gao et al.](https://arxiv.org/abs/2312.10997))
12 |
13 | 
14 |
15 | In this article I'll show through code examples how you can start to build RAG apps to work with LLM generation on MLX. It's a big subtopic, even just to get through the basics, so I'll break it into two parts, the first of which focuses on the retrieval portion.
16 |
17 | # Trying out a vector DBMS
18 |
19 | So far in these articles the main benefit of MLX has been GenAI accelerated on Apple Silicon's Metal architecture. That's all about the "G" in RAG. It would be great to have the "R" part also take some advantage of Metal, but that proves a bit tougher than I'd expected. Many of the best-known vector DBs (faiss, qdrant, etc.) use various techniques to accelerate embedding and perhaps lookup via GPU, but they focus on Nvidia (CUDA) and in some cases AMD (ROCm), with nothing for Metal. Whatever database we use, we need content to add to it. I've made it easy by providing the markdown of articles in this MLX notes series (including this article). You can [download them from Github](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/files), either the whole directory or just the contained files, and put them in a location you can refer to in the code later.
20 |
21 | Unfortunately this change didn't seem to improve vlite's ability to retrieve more relevant chunks. There might be something else going on in how I'm using it, and I'll certainly revisit vlite, but my next step was to give up on a Metal-accelerated vector database and just use a package I'm more familiar with. PGVector is my usual go-to, but it adds a few dependencies I wanted to avoid for this write-up. We'll just use [Qdrant](https://qdrant.tech/).
22 |
23 | # Using the Qdrant vector DBMS via OgbujiPT
24 |
25 | My OgbujiPT library includes tools to make it easy to use Qdrant or PostgreSQL/PGVector for vector database applications such as RAG. Install the needed prerequisites.
26 |
27 | ```sh
28 | pip install ogbujipt qdrant_client sentence_transformers
29 | ```
30 |
31 | Listing 4 vectorizes the same markdown documents as before, and then does a sample retrieval, using Qdrant. With the `text_split` function, available in [OgbujiPT](https://github.com/OoriData/OgbujiPT), I split by Markdown paragraphs (`\n\n`), with a guideline that chunks should be kept under 100 characters where possible.
32 |
33 | ### Listing 4 (qdrant_build_db.py): Switch to Qdrant for content database from markdown files on disk
34 |
35 | ```py
36 | # qdrant_build_db.py
37 | import os
38 | from pathlib import Path
39 |
40 | from sentence_transformers import SentenceTransformer # ST docs: https://www.sbert.net/docs/
41 | from qdrant_client import QdrantClient
42 |
43 | from ogbujipt.text_helper import text_split
44 | from ogbujipt.embedding.qdrant import collection
45 |
46 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
47 | # Needed to silence a Hugging Face tokenizers library warning
48 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
49 |
50 | TEXT_SUFFIXES = ['.md', '.txt']
51 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
52 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
53 | QCLIENT = QdrantClient(path=DBPATH)
54 |
55 |
56 | def setup_db(files):
57 | # Create content database named "ragbasics", using the disk storage location set up above
58 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
59 |
60 | for fname in files.iterdir():
61 | if fname.suffix in TEXT_SUFFIXES:
62 | fname = str(fname)
63 | print('Processing:', fname)
64 | with open(fname) as fp:
65 | # Governed by paragraph boundaries (\n\n), with a target chunk size of 100
66 | for chunk in text_split(fp.read(), chunk_size=100, separator='\n\n'):
67 | # print(chunk, '\n¶')
68 | # Probably more efficient to add in batches of chunks, but not bothering right now
69 | # Metadata can be useful in many ways, including having the LLM cite sources in its response
70 | qcoll.update(texts=[chunk], metas=[{'src-file': fname}])
71 | else:
72 | print('Skipping:', fname)
73 | return qcoll
74 |
75 | vdb = setup_db(CONTENT_FOLDER)
76 | results = vdb.search('How ChatML gets converted for use with the LLM', limit=1)
77 |
78 | top_match_text = results[0].payload['_text'] # Grabs the actual content
79 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
80 | print(f'Matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
81 | ```
82 |
83 | Output:
84 |
85 | ```
86 | Matched chunk: `response` is the plain old string with the LLM completion/response. It will already have been streamed to the console thanks to `verbose=True`, right after the converted prompt, displayed so you can see how the ChatML format has been converted using special, low-level LLM tokens such as `<|im_start|>` & `<|im_end|>`. Having the system message in the chat prompting and all that definitely, by my quick impressions, made the interactions far more coherent.
87 |
88 | From file assets/resources/2024/ragbasics/files/MLX-day-one.md
89 | ```
90 |
91 | You might not be able to tell right off the bat, but the embedding and indexing of text in the examples above is much slower than it needs to be. Let's look into an option for speeding it up.
92 |
93 | # Using MLX-Embeddings for local, accelerated embedding generation
94 |
95 | mlx-embeddings is a package designed to generate text and image embeddings locally on Apple Silicon using the MLX framework. It uses Apple's Metal acceleration, which can make embedding computation much faster than running cross-platform libraries such as sentence-transformers on CPU.
96 |
97 | Install the package with:
98 |
99 | ```sh
100 | pip install mlx-embeddings
101 | ```
102 |
103 | Here's how you can use mlx-embeddings to generate embeddings for a list of texts:
104 |
105 | ```py
106 | from mlx_embeddings import load, generate
107 |
108 | # Load a model (e.g., MiniLM in MLX format)
109 | model, processor = load("mlx-community/all-MiniLM-L6-v2-4bit")
110 |
111 | # Generate normalized embeddings for a list of texts
112 | output = generate(model, processor, texts=["I like grapes", "I like fruits"])
113 | embeddings = output.text_embeds # Normalized embeddings
114 |
115 | # Example: Compute similarity matrix using MLX
116 | import mlx.core as mx
117 | similarity_matrix = mx.matmul(embeddings, embeddings.T)
118 | print("Similarity matrix between texts:")
119 | print(similarity_matrix)
120 | ```
121 |
122 | This workflow is similar to sentence-transformers, but all computation runs natively on your Mac, taking full advantage of Apple Silicon hardware acceleration.
123 |
124 | mlx-embeddings supports a growing set of popular models, including BERT and XLM-RoBERTa, with more being added. Vision models are also supported, making it suitable for multimodal RAG applications.
125 |
126 | You can use mlx-embeddings in place of sentence-transformers wherever you need to generate vector representations for indexing or retrieval. The embeddings can be stored in any vector database, including those mentioned in previous sections. For many usage scenarios MLX-based embedding and inference are significantly faster than running PyTorch-based models via CPU, and often faster than using non-native GPU backends on Mac.
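
As a concrete illustration of that swap, here's a minimal sketch, reusing only the load/generate interface shown above (the sample texts are made up), that does brute-force cosine retrieval directly over mlx-embeddings vectors. In a real app you'd push the vectors into Qdrant or another vector DB rather than comparing them by hand.

```py
# Minimal sketch: brute-force retrieval over mlx-embeddings vectors, standing in
# for a sentence-transformers encoder; sample texts are purely illustrative
import mlx.core as mx
from mlx_embeddings import load, generate

model, processor = load('mlx-community/all-MiniLM-L6-v2-4bit')

docs = [
    'Tokens are statistically useful groupings of characters, identified by integers',
    'Apple Silicon unified memory is a boon for local AI workloads',
    'Qdrant is a vector database you can run locally or in the cloud',
]
doc_vecs = generate(model, processor, texts=docs).text_embeds        # normalized embeddings
query_vecs = generate(model, processor, texts=['What exactly is a token?']).text_embeds

# With normalized vectors, cosine similarity reduces to a dot product
scores = mx.matmul(query_vecs, doc_vecs.T)[0]
best = mx.argmax(scores).item()
print('Best match:', docs[best])
print('Score:', scores[best].item())
```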
127 |
128 | # A word on tokens
129 |
130 | Tokens have come up before in this series, and you might be wondering, "What are those, exactly?" Tokens are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so character by character, but it breaks down a given language into statistically useful groupings of characters, which are then identified with integer numbers. For example the characters "ing" occur pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's sensitive to the surrounding character sequence, though, so the word "sing" might well be encoded as a single token of its own, regardless of containing "ing".
131 |
132 | The best way to get a feel of LLM tokenization is to play around with sample text and see how it gets converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app which allows you to enter text and see how the popular Llama LLMs would tokenize them.
133 |
134 | 
135 |
136 | The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. Notice how start of text is a special token ``. You might remember we also encountered some other special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training and fine-tuning changes the way things are tokenized, as part of setting the entire model of language. Llama won't tokenize exactly as, say ChatGPT does, but the basic concepts stay the same.
137 |
138 | The picture shows an example of how markup such as HTML can affect tokenization. There are models such as the commercial [Docugami](https://www.docugami.com/) which are trained towards efficient tokenization of markup. Code-specialized LLMs such as those used in programmer copilot tools would have efficient tokenizations of the sorts of constructs which are more common in programming code than in natural language.
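
If you want to poke at tokenization locally rather than in a web playground, here's a small sketch using the tokenizer that mlx-lm loads alongside a model. Loading a whole LLM just to inspect its tokenizer is overkill, and the model name here is just an example from the MLX community; any MLX-format LLM would do.

```py
# Quick local look at tokenization; the mlx-lm tokenizer wraps the Hugging Face
# tokenizer, so the usual encode/decode calls work
from mlx_lm import load

_, tokenizer = load('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

for text in ['sing', 'singing', 'The singer sang a song']:
    ids = tokenizer.encode(text)                     # token IDs for the text
    pieces = [tokenizer.decode([i]) for i in ids]    # each ID decoded back to its text piece
    print(f'{text!r} -> {ids} -> {pieces}')
```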
139 |
140 | ## Creating more sensible chunks
141 |
142 | As I mentioned, I used a simple text splitter from OgbujiPT above, but lately I've taken to using [Chonkie](https://github.com/chonkie-inc/chonkie), a library that offers a wide variety of flexible chunking options, including chunking by tokens and by LLM-guided heuristics.
143 |
144 | In effect, the tokenization establishes the shape of language in a model, which is why using token boundaries when chunking text can help avoid weird boundary issues in vector lookups. There are many other chunking techniques you can try as well. Just to cite one example, you can chunk each paragraph separately, say in a collection of articles. That way you have a coherent thread of meaning in each chunk, which is more likely to align with expected search patterns.
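
Here's a rough, illustrative sketch of token-boundary chunking with overlap, using any Hugging Face-style tokenizer such as the one mlx-lm loads. Chonkie and similar libraries package up this logic, and smarter variants, so you don't have to hand-roll it.

```py
# Illustrative token-window chunker with overlap; not a library API, just a sketch
def token_chunks(text, tokenizer, max_tokens=256, overlap=32):
    ids = tokenizer.encode(text)
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        # Decode back to text so the chunk can be embedded and stored as usual
        yield tokenizer.decode(window)

# Usage sketch, assuming `tokenizer` from an earlier mlx_lm.load() call:
# for chunk in token_chunks(open('2024/rag-basics1.md').read(), tokenizer):
#     print(len(tokenizer.encode(chunk)), repr(chunk[:60]))
```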
145 |
146 | # Generation next to come
147 |
148 | Now that we can search a database of content to select items which seem relevant to an input, we're ready to turn our attention to the generation component of RAG, in part 2. Always keep in mind that RAG is trickier than it may seem, and there are many considerations in designing an effective RAG pipeline.
149 |
150 | # Recent Developments (mid 2024 - mid 2025)
151 |
152 | As always, the MLX ecosystem is evolving rapidly, so check for new models and tools regularly. Here are some interesting tidbits dating from after I first wrote this article.
153 |
154 | * **Model Format Conversions**: The MLX ecosystem now includes many converted models (e.g., MiniLM, BGE) in MLX format, available on Hugging Face, which can be loaded directly using mlx-embeddings
155 | * **Alternative Embedding Packages**: Other projects like mlx_embedding_models and swift-embeddings enable running BERT- or RoBERTa-based embeddings natively on Mac, broadening the choices for local RAG workflows
156 | * **Multimodal RAG**: With MLX now supporting vision models in addition to language, it is possible to build multimodal RAG systems (text + image retrieval) entirely on-device
157 | * **Community Tools**: There is a growing ecosystem of RAG implementations optimized for MLX and Apple Silicon, including command-line tools and open-source projects for vector database integration
158 |
159 | # Cultural accompaniment
160 |
161 | While doing the final edits of this article I was enjoying the amazing groove of the Funmilayo Afrobeat Orquestra, joined by Seun Kuti's Egypt 80, on a song called "Upside Down". I thought to myself: this groove is part of the article whether anyone knows it or not, so why not share? Since I'm also a poet, DJ, etc., I think I'll start sharing with these articles some artistic snippet that accompanied me in the process, or maybe something that's been inspiring me lately.
162 |
163 | [![Funmilayo Afrobeat Orquestra & Seun Kuti's Egypt 80 - Upside Down [Live Session]](https://img.youtube.com/vi/Gf8G3OhHW8I/0.jpg)](https://www.youtube.com/watch?v=Gf8G3OhHW8I)
164 |
165 |
168 |
169 | I grew up on Afrobeat (not "Afrobeats", abeg oh!) back home in Nigeria, and I'm beyond delighted to see how this magical, elemental music has found its way around the world and continues to flourish. Lovely to see Seun, my favorite contemporary exponent of his father Fela's genre, OGs such as Kunle Justice on electric bass (a man who should be much better known!), and of course the dynamic, Brazil-based women of Funmilayo, named after Fela's mother. This one betta now! Make you enjoy!
170 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/files/rag-basics2.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Retrieval augmentation with MLX: A bag full of RAG, part 2
4 | 14 June 2024. Versions: mlx: 0.15.0 | mlx-lm: 0.14.3
5 | Updated: 11 May 2025. Versions: mlx: 0.22.0 | mlx-lm: 0.20.6
6 | _Author: [Uche Ogbuji](https://ucheog.carrd.co/)_
7 |
8 | [In the first part of this article](https://github.com/uogbuji/mlx-notes/blob/main/2024/rag-basics1.md) I gave a basic introduction to Retrieval Augmented Generation (RAG), a technique for integrating content retrieved from databases or other sources into prompts for LLMs. Part 1 showed how you might construct such a context database (retrieval); in this part we'll see how that content can be stuffed into the prompt for the LLM in the generation phase. You'll want to read part 1 before proceeding.
9 |
10 | # Back to the land of LLMs
11 |
12 | While fiddling with the vector database we haven't yet got around to using the G (Generation) part of RAG. The results from vector DB lookup are exact raw chunks of content. What you usually want in such scenarios is for the LLM to take this raw content and work it into a coherent response to the user. The next step, then, is to stuff the retrieved text into the prompt, as context, along with some instructions (generally placed in a system prompt). If all goes well, the LLM's response proves useful, and is anchored by the facts retrieved from the vector DB, lowering the LLM's tendency to hallucinate.
13 |
14 | _Aside: Hallucination is one of the most misunderstood topics in GenAI. It's always important to remember what LLMs are trained to do: they are trained to complete the text provided in the prompt. They are just predicting tokens and generating language. This means that they will sometimes generate language whose meaning is confusing, false or misleading, which we call hallucinations, but in doing so, they are merely following their training._
15 |
16 | _A part of the solution is to include in the prompt facts and instructions which are carefully constructed (i.e. prompt engineered) according to an understanding of the LLM's statistical tendencies. This reduces the likelihood of hallucinations, but it may not be possible to completely eliminate that tendency. Some LLMs are trained or fine-tuned to be especially "obedient" to the context, and these are good choices for RAG. Picking the right LLM is another part of the solution, as is using multi-stage pipelines with verification by other LLMs or even people (perhaps on a random or heuristically selected sample of transcripts). RAG is a simple concept, but getting consistently great results with it involves complex considerations._
17 |
18 | ## Prompt stuffing 101
19 |
20 | In the previous article, [Listing 4 (qdrant_build_db.py)](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings) created a Qdrant vector database from the markdown of articles in this series. We can now use that database to retrieve likely chunks of content and stuff these in the prompt for the generation phase of RAG. Listing 1, below, is a simple example of this process, using the MLX generation interface explored in previous articles.
21 |
22 | The code first queries the vector database for chunks of content semantically similar to the user question or prompt, which is hard-coded for simplicity. It then pulls the chunks into a template to construct an overall prompt, which is sent to the LLM for completion.
23 |
24 | ### Listing 1 (qdrant_rag_101.py)
25 |
26 | _Note: [You can find all code listings on GitHub.](https://github.com/uogbuji/mlx-notes/tree/main/assets/resources/2024/ragbasics/listings)_
27 |
28 | ```py
29 | # qdrant_rag_101.py
30 | import os
31 | from pathlib import Path
32 | import pprint
33 |
34 | from sentence_transformers import SentenceTransformer
35 | from qdrant_client import QdrantClient
36 |
37 | from ogbujipt.embedding.qdrant import collection
38 |
39 | from mlx_lm import load, generate
40 |
41 | chat_model, tokenizer = load('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')
42 |
43 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
44 | # Needed to silence a Hugging Face tokenizers library warning
45 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
46 |
47 | TEXT_SUFFIXES = ['.md', '.txt']
48 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
49 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
50 |
51 | assert Path(DBPATH).exists(), 'DB not found. You may need to run qdrant_build_db.py again'
52 |
53 | QCLIENT = QdrantClient(path=DBPATH)
54 |
55 | USER_PROMPT = 'How can I get a better understand what tokens are, and how they work in LLMs?'
56 | SCORE_THRESHOLD = 0.2
57 | MAX_CHUNKS = 4
58 |
59 | # Set up to retrieve from previously created content database named "ragbasics"
60 | # Note: Here you have to match the embedding model with the one originally used in storage
61 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
62 |
63 | results = qcoll.search(USER_PROMPT, limit=MAX_CHUNKS, score_threshold=SCORE_THRESHOLD)
64 |
65 | top_match_text = results[0].payload['_text'] # Grabs the actual content
66 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
67 | print(f'Top matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
68 |
69 | gathered_chunks = '\n\n'.join(
70 | doc.payload['_text'] for doc in results if doc.payload)
71 |
72 | sys_prompt = '''\
73 | You are a helpful assistant who answers questions directly and as briefly as possible.
74 | Consider the following context and answer the user\'s question.
75 | If you cannot answer with the given context, just say you don't know.\n
76 | '''
77 |
78 | # Construct the input message struct from the system prompt, the gathered chunks, and the user prompt itself
79 | messages = [
80 | {'role': 'system', 'content': sys_prompt},
81 | {'role': 'user', 'content': f'=== BEGIN CONTEXT\n\n{gathered_chunks}\n\n=== END CONTEXT'},
82 | {'role': 'user', 'content': f'Please use the context above to respond to the following:\n{USER_PROMPT}'}
83 | ]
84 |
85 | pprint.pprint(messages, width=120)
86 |
87 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
88 | response = generate(chat_model, tokenizer, prompt=chat_prompt, verbose=True)
89 |
90 | print('RAG-aided LLM response to the user prompt:\n', response)
91 | ```
92 |
93 | The pretty-printed `messages` structure comes out as follows:
94 |
95 | ```python
96 | [{'content': 'You are a helpful assistant who answers questions directly and as briefly as possible.\n'
97 | "Consider the following context and answer the user's question.\n"
98 | "If you cannot answer with the given context, just say you don't know.\n"
99 | '\n',
100 | 'role': 'system'},
101 | {'content': '=== BEGIN CONTEXT\n'
102 | '\n'
103 | 'Tokens have come up before in this series, and you might be wondering. "What are those, exactly?" Tokens '
104 | "are a really important concept with LLMs. When an LLM is dealing with language, it doesn't do so "
105 | 'character by character, but it breaks down a given language into statistically useful groupings of '
106 | 'characters, which are then identified with integer numbers. For example the characters "ing" occur '
107 | "pretty frequently, so a tokenizer might group those as a single token in many circumstances. It's "
108 | 'sensitive to the surrounding character sequence, though, so the word "sing" might well be encoded as a '
109 | 'single token of its own, regardless of containing "ing".\n'
110 | '\n'
111 | 'The best way to get a feel of LLM tokenization is to play around with sample text and see how it gets '
112 | 'converted. Luckily there are many tools out there to help, including [the simple llama-tokenizer-js '
113 | 'playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/) web app which allows '
114 | 'you to enter text and see how the popular Llama LLMs would tokenize them.\n'
115 | '\n'
116 | '## Wait, what are tokens again?\n'
117 | '\n'
118 | "The colors don't mean anything special in themselves. They're just visual tiling to separate the tokens. "
119 | 'Notice how start of text is a special token ``. You might remember we also encountered some other '
120 | 'special tokens such as `<|im_start|>` (begin conversation turn) in previous articles. LLM pre-training '
121 | 'and fine-tuning changes the way things are tokenized, as part of setting the entire model of language. '
122 | "Llama won't tokenize exactly as, say ChatGPT does, but the basic concepts stay the same.\n"
123 | '\n'
124 | '=== END CONTEXT',
125 | 'role': 'user'},
126 | {'content': 'Please use the context above to respond to the following:\n'
127 | 'How can I get a better understand what tokens are, and how they work in LLMs?',
128 | 'role': 'user'}]
129 | ```
130 |
131 | Output (the LLM's response):
132 |
133 | > According to the context, the best way to get a better understanding of tokens in LLMs is to play around with sample text and see how it gets converted. You can use the simple llama-tokenizer-js playground web app, which allows you to enter text and see how popular LLMs would tokenize it. Additionally, you can also remember that tokens are a way for LLMs to break down a given language into statistically useful groupings of characters, identified with integer numbers.
134 |
135 | ### Faster prompt processing
136 |
137 | One detail that popped out to my eye, from an MLX perspective, was the generation speed:
138 |
139 | ```
140 | Prompt: 443.319 tokens-per-sec
141 | Generation: 44.225 tokens-per-sec
142 | ```
143 |
144 | Back in April I was seeing the following report (same 2021 Apple M1 Max MacBook Pro with 64GB RAM):
145 |
146 | ```
147 | Prompt: 84.037 tokens-per-sec
148 | Generation: 104.326 tokens-per-sec
149 | ```
150 |
151 | The generation speed looks slower now, but the prompt processing speed is some 5X faster, and in RAG applications, where the prompt gets stuffed with retrieved data, this is an important figure. That said, this is a completely different model from the `h2o-danube2-1.8b-chat-MLX-4bit` used in the earlier article, and many aspects of the model itself can affect prompt processing and generation speeds.
152 |
153 | The model I've used in the code above is my new favorite, general-purpose, open-source model, `Hermes-2-Theta-Llama-3-8B`, and in particular [a 4 bit quant I converted to MLX and contributed to the community myself](https://huggingface.co/mlx-community/Hermes-2-Theta-Llama-3-8B-4bit), using techniques from my previous article in this series, ["Converting models from Hugging Face to MLX format, and sharing"](https://github.com/uogbuji/mlx-notes/blob/main/2024/conversion-etc.md).
154 |
155 | # Best Practices: Chunk Size and Embedding Model Selection
156 |
157 | ## Optimizing Chunk Size for RAG
158 |
159 | Chunk size plays a critical role in the effectiveness and efficiency of Retrieval-Augmented Generation (RAG) systems. The right chunk size balances the need for detailed, relevant retrieval with the speed and faithfulness of generated responses.
160 |
161 | - **Precision vs. Context:** Smaller chunks (e.g., 250–256 tokens) enable more precise retrieval, as each chunk is focused on a narrow context. However, if chunks are too small, important context may be lost, leading to fragmented or incomplete answers[1][2].
162 | - **Larger Chunks:** Larger chunks (e.g., 512 tokens or a paragraph) provide more context, reducing the risk of missing relevant details, but can dilute the representation if multiple topics are included, potentially lowering retrieval precision and slowing response generation[1][2][5].
163 | - **Experimentation is Key:** There is no universal optimal chunk size. Start with sizes between 250 and 512 tokens and adjust based on your data and use case. Monitor both retrieval accuracy and system latency to find the best balance[1][2][4][5] (see the sketch after this list).
164 | - **Semantic Chunking:** Advanced strategies, such as semantically informed chunking (e.g., the SPLICE method), can further improve retrieval by aligning chunk boundaries with natural topic or section breaks, preserving meaning and context[6].
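
To act on that experimentation advice, it helps to first measure how your current chunks actually land in token terms before fiddling with chunk size. Here's a quick sketch; the file path and tokenizer choice are just examples.

```py
# Quick audit sketch: token-count distribution of the chunks produced by the
# part 1 splitting settings (tokenizer matches the all-MiniLM-L6-v2 embedder)
from transformers import AutoTokenizer
from ogbujipt.text_helper import text_split

tok = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
text = open('assets/resources/2024/ragbasics/files/rag-basics1.md').read()

sizes = [len(tok.encode(chunk))
         for chunk in text_split(text, chunk_size=100, separator='\n\n')]
print(f'{len(sizes)} chunks; min {min(sizes)}, max {max(sizes)}, '
      f'mean {sum(sizes)/len(sizes):.1f} tokens per chunk')
```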
165 |
166 | ## Choosing the Right Embedding Model
167 |
168 | The choice of embedding model directly impacts retrieval quality, system performance, and scalability:
169 |
170 | - **Model Benchmarks:** Use benchmarks like the Massive Text Embedding Benchmark (MTEB) to compare models on tasks relevant to your application, such as retrieval, semantic similarity, and reranking
171 | - **Dense vs. Sparse vs. Hybrid:** Dense models (e.g., E5, MiniLM) excel at semantic search, while sparse models (e.g., BM25) are better for keyword matching. Hybrid approaches often yield the best results, especially for heterogeneous or domain-specific data
172 | - **Model Context Window:** Ensure the model’s maximum token limit aligns with your chosen chunk size. For most RAG applications, models supporting 512 tokens per embedding are sufficient, but longer context windows may be needed for larger documents
173 | - **Efficiency and Domain Fit:** Consider inference speed, memory requirements, and how well the model handles your domain’s language and structure. Test multiple models and measure performance on your actual data to guide selection (a rough comparison sketch follows this list)
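
Here's a rough comparison sketch for that kind of testing: score the same query/passage pairs with two candidate models and eyeball the rankings on your own data. Model names and sample texts are placeholders; swap in whatever you're actually evaluating.

```py
# Hedged A/B sketch for embedding model selection; models and texts are placeholders
from sentence_transformers import SentenceTransformer, util

query = 'How do tokens work in LLMs?'
passages = [
    'Tokens are statistically useful groupings of characters, identified by integers',
    'Apple Silicon unified memory is a boon for local AI workloads',
]

# Note: E5-family models generally expect 'query: ' / 'passage: ' prefixes for best results
for name in ('all-MiniLM-L6-v2', 'intfloat/e5-small-v2'):
    model = SentenceTransformer(name)
    sims = util.cos_sim(model.encode([query]), model.encode(passages))
    print(name, [round(float(s), 3) for s in sims[0]])
```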
174 |
175 | ## Summary: Chunk Size and Embedding Model tips
176 |
177 | | Aspect | Recommendation |
178 | |--------------------------|----------------------------------------------------|
179 | | Chunk Size | Start with 250–512 tokens; adjust as needed |
180 | | Chunking Strategy | Prefer semantic or paragraph-based chunking |
181 | | Embedding Model | Use MTEB or real-world benchmarks for selection |
182 | | Model Type | Dense for semantics; hybrid for complex datasets |
183 | | Context Window | Ensure model supports your chunk size |
184 | | Evaluation | Test for faithfulness, relevance, and efficiency |
185 |
186 | By carefully tuning chunk size and embedding model choice, you can significantly improve both the precision and responsiveness of your RAG system.
187 |
188 | # Going beyond
189 |
190 | These are the basic bones of RAG. Using just the code so far, you already have a lot of basis for experimentation. You can change the chunk size of the data stored in the vector DB—an adjustment which might surprise you in the degree of its effects. You can play with `SCORE_THRESHOLD` and `MAX_CHUNKS` to dial up or down what gets stuffed into the prompt for generation.
191 |
192 | That's just scratching the surface. There are a dizzying array of techniques and variations to RAG. Just to name a selection, you can:
193 |
194 | * use overlap with the chunking, so that you're less likely to chop apart or orphan the context of each chunk
195 | * have multiple levels of chunking, e.g. chunking document section headers as well as their contents, sometimes called hierarchical RAG
196 | * base the retrieval on more basic SQL or other traditional database query rather than vector search, perhaps even using a coding LLM to generate the SQL (yes, there are special security implications to this)
197 | * use text matching rather than semantic vector search
198 | * take retrieved chunks and re-summarize them using an LLM before sending them for generation (contextual compression), or re-assess their relevance (reranking)
199 | * retrieve and stuff with structured knowledge graphs rather than loose text
200 | * use an LLM to rewrite the user's prompt to better suit the context, while maintaining fidelity to the original (see the sketch after this list)
201 | * structure the stuffing of the prompts into a format to match the training of a context obedient generation LLM
202 |
203 | Of course you can mix and match all the above, and so much more. RAG is really just an onramp to engineering, rather than its destination. As I continue this article series, I'll probably end up touching on many other advanced RAG techniques.
204 |
205 | For now, you have a basic idea of how to use RAG in MLX, and you're mostly limited by your imagination. Load up your retrieval DB with your company's knowledge base to create a customer self-help bot. Load it up with your financials to create a prep tool for investor reporting. Load it up with all your instant messages so you can remember whom to thank for that killer restaurant recommendation once you get around to trying it. Since you're using a locally-hosted LLM, courtesy of MLX, you can run such apps entirely airgapped and have few of the privacy concerns that come with using e.g. OpenAI, Anthropic or Google.
206 |
207 | # It's data all the way down
208 |
209 | At the heart of AI has always been high quality data at high volume. RAG, if anything, makes this connection far more obvious. If you want to gain its benefits, you have to be ready to commit to sound data architecture and management. We all know that garbage in leads to garbage out, but it's especially pernicious to deal with garbage out that's been given a spit shine by an eager LLM during generation.
210 |
211 | There is a lot of energy around RAG projects, but they hide a dirty little secret: they tend to look extremely promising in prototype phases, and then run into massive engineering difficulties on the path towards full product status. A lot of this is because, to be frank, organizations have often spent so much time cutting corners in their data engineering that they just don't have the right fuel for RAG, and they might not even realize where their pipelines are falling short.
212 |
213 | RAG is essentially the main grown-up LLM technique we have right now. It's at the heart of many product initiatives, including many of my own. Don't ever think, however, that it's a cure-all for the various issues in GenAI, such as hallucination and unpredictable behavior. In addition to making sure you have your overall data engineering house in order, be ready to implement sound AI Ops, with a lot of testing and ongoing metrics. There's no magic escape from this if you want to reap the benefits of AI at scale.
214 |
215 |
222 |
223 | # Cultural accompaniment
224 |
225 | This time I'm going with some Indian/African musical syncretism right in Chocolate City, USA. It's a DJ set by Priyanka, who put on one of my favorite online DJ sets ever, mashing up Amapiano grooves with hits from India. The live vibe is…damn! I mean the crowd starts by rhythmically chanting "DeeeeeeJaaaaay, we wanna paaaaartaaaaay", and not three minutes in, a dude from the crowd jumps in with a saxophone. It's a great way to set a creative mood while puzzling through your RAG content chunking strategy.
226 |
227 | [](https://www.youtube.com/watch?v=8f3e8aMNDf0)
228 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/listings/qdrant_build_db.py:
--------------------------------------------------------------------------------
1 | # qdrant_build_db.py
2 | import os
3 | from pathlib import Path
4 |
5 | from sentence_transformers import SentenceTransformer
6 | from qdrant_client import QdrantClient
7 |
8 | from ogbujipt.text_helper import text_split
9 | from ogbujipt.embedding.qdrant import collection
10 |
11 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
12 | # Needed to silence a Hugging Face tokenizers library warning
13 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
14 |
15 | TEXT_SUFFIXES = ['.md', '.txt']
16 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
17 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
18 | QCLIENT = QdrantClient(path=DBPATH)
19 |
20 |
21 | def setup_db(files):
22 | # Create content database named "ragbasics", using the disk storage location set up above
23 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
24 |
25 | for fname in files.iterdir():
26 | if fname.suffix in TEXT_SUFFIXES:
27 | fname = str(fname)
28 | print('Processing:', fname)
29 | with open(fname) as fp:
30 | # Governed by paragraph boundaries (\n\n), with a target chunk size of 100
31 | for chunk in text_split(fp.read(), chunk_size=100, separator='\n\n'):
32 | # print(chunk, '\n¶')
33 | # Probably more efficient to add in batches of chunks, but not bothering right now
34 | # Metadata can be useful in many ways, including having the LLM cite sources in its response
35 | qcoll.update(texts=[chunk], metas=[{'src-file': fname}])
36 | else:
37 | print('Skipping:', fname)
38 | return qcoll
39 |
40 | vdb = setup_db(CONTENT_FOLDER)
41 | results = vdb.search('How ChatML gets converted for use with the LLM', limit=1)
42 |
43 | top_match_text = results[0].payload['_text'] # Grabs the actual content
44 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
45 | print(f'Matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
46 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/listings/qdrant_rag_101.py:
--------------------------------------------------------------------------------
1 | # qdrant_rag_101.py
2 | import os
3 | from pathlib import Path
4 | import pprint
5 |
6 | from sentence_transformers import SentenceTransformer
7 | from qdrant_client import QdrantClient
8 |
9 | from ogbujipt.embedding.qdrant import collection
10 |
11 | from mlx_lm import load, generate
12 |
13 | chat_model, tokenizer = load('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')
14 |
15 | embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
16 | # Needed to silence a Hugging Face tokenizers library warning
17 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
18 |
19 | TEXT_SUFFIXES = ['.md', '.txt']
20 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
21 | DBPATH = '/tmp/qdrant_test' # Set up disk storage location
22 |
23 | assert Path(DBPATH).exists(), 'DB not found. You may need to run qdrant_build_db.py again'
24 |
25 | QCLIENT = QdrantClient(path=DBPATH)
26 |
27 | USER_PROMPT = 'How can I get a better understand what tokens are, and how they work in LLMs?'
28 | SCORE_THRESHOLD = 0.2
29 | MAX_CHUNKS = 4
30 |
31 | # Set up to retrieve from previously created content database named "ragbasics"
32 | # Note: Here you have to match the embedding model with the one originally used in storage
33 | qcoll = collection('ragbasics', embedding_model, db=QCLIENT)
34 |
35 | results = qcoll.search(USER_PROMPT, limit=MAX_CHUNKS, score_threshold=SCORE_THRESHOLD)
36 |
37 | top_match_text = results[0].payload['_text'] # Grabs the actual content
38 | top_match_source = results[0].payload['src-file'] # Grabs the metadata stored alongside
39 | print(f'Top matched chunk: {top_match_text}\n\nFrom file {top_match_source}')
40 |
41 | gathered_chunks = '\n\n'.join(
42 | doc.payload['_text'] for doc in results if doc.payload)
43 |
44 | sys_prompt = '''\
45 | You are a helpful assistant who answers questions directly and as briefly as possible.
46 | Consider the following context and answer the user\'s question.
47 | If you cannot answer with the given context, just say you don't know.\n
48 | '''
49 |
50 | # Construct the input message struct from the system prompt, the gathered chunks, and the user prompt itself
51 | messages = [
52 | {'role': 'system', 'content': sys_prompt},
53 | {'role': 'user', 'content': f'=== BEGIN CONTEXT\n\n{gathered_chunks}\n\n=== END CONTEXT'},
54 | {'role': 'user', 'content': f'Please use the context above to respond to the following:\n{USER_PROMPT}'}
55 | ]
56 |
57 | pprint.pprint(messages, width=120)
58 |
59 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
60 | response = generate(chat_model, tokenizer, prompt=chat_prompt, verbose=True)
61 |
62 | print('RAG-aided LLM response to the user prompt:\n', response)
63 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/listings/vlite_build_db.py:
--------------------------------------------------------------------------------
1 | import os
2 | from pathlib import Path
3 |
4 | from vlite import VLite
5 | from vlite.utils import process_txt
6 |
7 | # Needed to silence a Hugging Face tokenizers library warning
8 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
9 |
10 | TEXT_SUFFIXES = ['.md', '.txt']
11 | #
12 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
13 | # Path to a "CTX" file which is basically the vector DB in binary form
14 | COLLECTION_FPATH = Path('/tmp/ragbasics')
15 |
16 | def setup_db(files, collection):
17 | # Create database
18 | # If you don't specify a "collection" (basically a filename), vlite will create
19 | # a file named with the current timestamp, under "contexts" in the current dir
20 | # device='mps' uses Apple Metal acceleration for the embedding,
21 | # which is typically the most expensive stage
22 | vdb = VLite(collection=collection, device='mps')
23 |
24 | for fname in files.iterdir():
25 | if fname.suffix in TEXT_SUFFIXES:
26 | print('Processing:', fname)
27 | vdb.add(process_txt(fname))
28 | else:
29 | print('Skipping:', fname)
30 | return vdb
31 |
32 | vdb = setup_db(CONTENT_FOLDER, COLLECTION_FPATH)
33 | vdb.save() # Make sure the DB is up to date
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/listings/vlite_custom_split_build_db.py:
--------------------------------------------------------------------------------
1 | # vlite_custom_split_build_db.py
2 | import os
3 | from pathlib import Path
4 |
5 | from vlite import VLite
6 |
7 | from ogbujipt.text_helper import text_split
8 |
9 | # Needed to silence a Hugging Face tokenizers library warning
10 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
11 |
12 | TEXT_SUFFIXES = ['.md', '.txt']
13 | #
14 | CONTENT_FOLDER = Path('assets/resources/2024/ragbasics/files')
15 | # Path to a "CTX" file which is basically the vector DB in binary form
16 | COLLECTION_FPATH = Path('/tmp/ragbasics')
17 |
18 | def setup_db(files, collection):
19 | # Create database
20 | # If you don't specify a "collection" (basically a filename), vlite will create
21 | # a file named with the current timestamp, under "contexts" in the current dir
22 | # device='mps' uses Apple Metal acceleration for the embedding,
23 | # which is typically the most expensive stage
24 | vdb = VLite(collection=collection, device='mps')
25 |
26 | for fname in files.iterdir():
27 | if fname.suffix in TEXT_SUFFIXES:
28 | fname = str(fname)
29 | print('Processing:', fname)
30 | with open(fname) as fp:
31 | # Governed by paragraph boundaries (\n\n), with a target chunk size of 100
32 | for chunk in text_split(fp.read(), chunk_size=100, separator='\n\n'):
33 | print(chunk, '\n¶')
34 | vdb.add(chunk, metadata={'src-file': fname})
35 | else:
36 | print('Skipping:', fname)
37 | return vdb
38 |
39 | vdb = setup_db(CONTENT_FOLDER, COLLECTION_FPATH)
40 | vdb.save() # Make sure the DB is up to date
41 |
--------------------------------------------------------------------------------
/assets/resources/2024/ragbasics/listings/vlite_retrieve.py:
--------------------------------------------------------------------------------
1 | # vlite_retrieve.py
2 | from pathlib import Path
3 |
4 | from vlite import VLite
5 |
6 | COLLECTION_FPATH = Path('/tmp/ragbasics')
7 |
8 | vdb = VLite(collection=COLLECTION_FPATH, device='mps')
9 |
10 | # top_k=N means take the N closest matches
11 | # return_scores=True adds the closeness scores to the return
12 | results = vdb.retrieve('ChatML format has been converted using special, low-level LLM tokens', top_k=1, return_scores=True)
13 | print(results[0])
14 |
--------------------------------------------------------------------------------
/assets/resources/2024/rmiug-pres-april/README.md:
--------------------------------------------------------------------------------
1 | # Hands-on Intro to MLX April 2024
2 |
3 | [via RMAIIG, Boulder, CO](https://www.meetup.com/ai-ml-engineering-aie-an-rmaiig-subgroup/events/300581632)
4 |
5 | [Presentation slides](https://docs.google.com/presentation/d/1IUNuFcS3YrkPFn7Oxw7uoB94TMnIK_wD1Pxu3-tzTjo/edit?usp=sharing)
6 |
7 | ## A few key pointers
8 |
9 | * GitHub: [ml-explore/mlx](https://github.com/ml-explore/mlx)—core framework
10 | * [Also via PyPI: mlx](https://pypi.org/project/mlx/)
11 | * GH: [ml-explore/mlx-examples](https://github.com/ml-explore/mlx-examples)—quick code; various use cases
12 | * [Also via PyPI (language AI bits): mlx-lm](https://pypi.org/project/mlx-lm/)
13 | * HuggingFace: [mlx-community](https://huggingface.co/mlx-community)
14 | * GH: [uogbuji/mlx-notes](https://github.com/uogbuji/mlx-notes/) articles from which this presentation originated
15 |
16 | ## Python setup
17 |
18 | ```sh
19 | pip install mlx mlx-lm
20 | ```
21 |
22 | ```sh
23 | git clone https://github.com/ml-explore/mlx-examples.git
24 | cd mlx-examples/llms
25 | pip install -U .
26 | ```
27 |
28 | ## Downloading & running an LLM
29 |
30 | ```py
31 | from mlx_lm import load, generate
32 | model, tokenizer = load('mlx-community/Phi-3-mini-4k-instruct-4bit')
33 | ```
34 |
35 | ## Doing the generation
36 |
37 | ```py
38 | p = 'Can a Rocky Mountain High take me to the sky?'
39 | resp = generate(model, tokenizer, prompt=p, verbose=True)
40 | ```
41 |
42 | ## Tokens & tokenizer
43 |
44 | Relevant link: [llama-tokenizer-js playground](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/), a simple web app which allows you to enter text and see how the popular Llama LLMs would tokenize them
45 |
46 | ## Let’s chat about chat
47 |
48 | ```py
49 | sysmsg = 'You\'re a friendly, helpful chatbot'
50 | messages = [
51 | {'role': 'system', 'content': sysmsg},
52 | {'role': 'user', 'content': 'How are you today?'}]
53 | chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False)
54 | response = generate(model, tokenizer, prompt=chat_prompt, verbose=True)
55 | ```
56 |
57 | ## Converting your own models
58 |
59 | ```sh
60 | python -m mlx_lm.convert \
61 |     --hf-path h2oai/h2o-danube2-1.8b-chat \
62 |     --mlx-path ./mlx/h2o-danube2-1.8b-chat -q
63 | ```
64 |
65 | ~~## OK enough words; what about images?~~
66 | ## A picture is worth how many words?
67 |
68 | ```sh
69 | cd mlx-examples/stable_diffusion
70 | python txt2image.py -v --model sd --n_images 4 --n_rows 2 --cfg 8.0 --steps 50 --output test.png "Boulder flatirons"
71 | ```
72 |
--------------------------------------------------------------------------------
/assets/resources/2024/rmiug-pres-april/test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/uogbuji/mlx-notes/20e0ee6e6a99268ba39e04804add9ae017db6b05/assets/resources/2024/rmiug-pres-april/test.png
--------------------------------------------------------------------------------