├── README.md └── rag-elixir.livemd /README.md: -------------------------------------------------------------------------------- 1 | # rag-elixir-doc 2 | 3 |

Building LLM-enhanced search with the help of LLMs.

4 | 5 | We want to improve the search for the Elixir/Phoenix/Plug/LiveView documentation when using an LLM and experiment with a `RAG` pipeline. 6 | 7 | All the tools used here are "free", meaning everything runs locally. 8 | 9 | [![Run in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fdwyl%2Frag-elixir-doc%2Fblob%2Fmain%2Frag-elixir.livemd) 10 | 11 | ## What is `RAG`? 12 | 13 | It is a "chat with your documents" process, meaning you ask an LLM to respond based on additional resources. 14 | 15 | These sources may or may not already be incorporated in the training data used for the LLM. 16 | 17 | Using RAG is _not about fine-tuning_ the model, which means changing the coefficients or structure of the model based on additional sources. 18 | 19 | RAG is about giving additional context - the "context window" - to enhance or constrain the response from the LLM. 20 | 21 | > Note that the LLM accepts a limited amount of tokens, thus the context window is limited. 22 | 23 | 24 | ## Scope of this POC: 25 | 26 | We want to improve the LLM's response when we ask questions related to the Elixir/Phoenix/Plug/LiveView documentation. We will build a "context" to add information that helps the LLM build a response. 27 | 28 | Running such a helper locally means that we need to have the extra resources available locally. Our database will be local and our LLM will run locally, using only local resources. 29 | 30 | We will extract some markdown files from the Phoenix_LiveView GitHub repo. 31 | 32 | We will use a database to store chunks extracted from these files. 33 | 34 | 35 | - One way is **SQL Full-Text-Search**. If we use `Postgres`, we have a [built-in functionality](https://www.postgresql.org/docs/current/textsearch-intro.html#TEXTSEARCH-DOCUMENT). This works by matching **keywords**, so given that we may have various ways to express the same question, we may want a more semantic search. [A good introduction to SQL FTS](https://leandronsp.com/a-powerful-full-text-search-in-postgresql-in-less-than-20-lines) or [here](https://peterullrich.com/complete-guide-to-full-text-search-with-postgres-and-ecto). 36 | 37 | - We can also use the open-source version of Elasticsearch maintained by Amazon, named `OpenSearch`, via a Docker image. We may also use the [local image of Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html). There exist [text chunking tools](https://opensearch.org/docs/latest/search-plugins/text-chunking/). The matching is based on the **BM25** algorithm. Using this instead of the **bi-encoding** step might be an interesting extension, as a second version. Indeed, Elixir has an OpenSearch client: [snap](https://github.com/breakroom/snap) 38 | 39 | - We can build a RAG pipeline with an LLM. We follow this route here. We save chunks with their embedding into a database, run a similarity search against our query, then rerank our top-k responses, and finally use these last results to build our prompt and submit it as context to the LLM. This will help to constrain responses on questions about Elixir/Plug/Phoenix_Liveview. 40 | 41 | 42 | ## Using LLMs to find which models to use 43 | 44 | We are using pre-trained models. We will also use an LLM to elaborate a response. 45 | 46 | The first question is: what does a RAG pipeline bring? 47 | 48 | We believe the next two paragraphs give some confidence that this technique works, or at least improves some responses.
49 | 50 | ### Testing Codellama 51 | 52 | ```text 53 | Question: "how to handle forms server-side?" 54 | ``` 55 | 56 |
The "raw" response 57 | 58 | Handling forms on the server-side involves using the `Phoenix.HTML` module's `form_for` function to create a form struct with the necessary information for submitting the form data to the server. 59 | Here's an example: 60 | 61 | ```elixir 62 | import Phoenix.HTML 63 | 64 | # Create a form struct with the necessary information 65 | form = form_for(conn, :action, :controller) 66 | 67 | # Add fields to the form using the `add_field` function 68 | form = form |> add_field(:name, :text) |> add_field(:age, :number) 69 | 70 | # Use the `handle_event` function to handle form submissions 71 | handle_event(form, :submit, conn) do 72 | # Do something with the submitted data here, e.g. update a database 73 | end 74 | ``` 75 | 76 | The `form_for` function takes three arguments: 77 | 78 | * `conn`: The current connection struct 79 | * `action`: The action to perform when submitting the form(e.g. `:create`, `:update`, etc.) 80 | * `controller`: The controller that handles the form submission 81 | 82 | Once you have created the form struct, you can add fields to it using the `add_field` function. 83 | This function takes two arguments: 84 | 85 | * `name`: The name of the field (e.g. `:username`,`:password`, etc.) 86 | * `type`: The type of field (e.g. `:text`, `:number`, etc.) 87 | 88 | You can then use the `handle_event` function to handle form submissions. This function takes three arguments: 89 | 90 | * `form`: The form struct that was created using the `form_for` function 91 | * `event`: The event that triggered the submission (e.g. `:submit`, etc.) 92 | * `conn`: The current connection struct 93 | 94 | Inside the `handle_event` function, you can perform any necessary actions when the form is submitted, such as updating a database or sending an email. 95 | 96 | To handle form validation and error handling, you can use the `Phoenix.LiveView` module's `validate` and `error` functions. 97 | 98 | For example: 99 | 100 | ```elixir 101 | import Phoenix.LiveView 102 | 103 | # Add validation rules to the form using the `validate` function 104 | form = form |> validate(:name, :required) |> validate(:age, :number) 105 | 106 | # Handle errors in the form submission using the `error` function 107 | handle_event(form, :submit, conn) do 108 | if error?(conn) do 109 | # Do something with the validation errors here, e.g. display an error message 110 | else 111 | # Do something with the submitted data here, e.g. update a database 112 | end 113 | end 114 | ``` 115 | 116 | In this example, we use the `validate` function to add validation rules to the form. The `:required` and `:number` arguments specify that the fields should be required and contain only numbers, respectively. 117 | If there are any errors in the submitted data, the `error?` function will return true and we can handle them using the `error` function. 118 |
119 |
120 | 121 |
The "RAGed" response 122 | 123 | To handle forms with LiveView, you can use the `Phoenix.Component` module and its functions to define a form component that will submit and validate the form data using LiveView's built-in validation mechanism. Here's an example of how to define a form component in your LiveView: 124 | 125 | ```elixir 126 | defmodule MyAppWeb.Components.Form do 127 | use Phoenix.Component 128 | def form(assigns) do 129 | ~H""" 130 | <.form for={@form} phx-change="validate" phx-submit="save">\ 131 | ... 132 | 133 | """ 134 | end 135 | end 136 | ... 137 | ``` 138 | 139 | In the above example, we define a `Form` component that renders a form tag with a `phx-change` and `phx-submit` event to handle validation and submission of the form data. The `for` attribute is used to bind the form to a changeset or user parameters, which can be created from a form struct using `Phoenix.Component.to_form/1`. 140 | 141 | 142 | Additionally, we strongly recommend including a unique HTML "id" attribute on the form. When DOM siblings change, elements without an ID will be replaced rather than moved, which can cause issues such as form fields losing focus. 143 | 144 | You may wish for an individual input to use its own change event or to target a different component. This can be accomplished by annotating the input itself with `phx-change`, for example: 145 | 146 | ```elixir 147 | <.form for={@form} phx-change="validate" phx-submit="save"> 148 | ... 149 | <.input field={@form[:email]} phx-change="email_changed" phx-target={@myself} /> 150 | 151 | ``` 152 | 153 | The `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field. 154 | 155 | You may also wish for an individual input to use its own change event or to target a different component. This can be accomplished by annotating the input itself with `phx-change`, for example: 156 | 157 | ```elixir 158 | <.form for={@form} phx-change="validate" phx-submit="save"> 159 | ... 160 | <.input field={@form[:email]} phx-change="email_changed" phx-target={@myself} /> 161 | 162 | ``` 163 | 164 | The `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field.\n\nIt's also important to note that you may want to use a unique HTML \"id\" attribute on the form. When DOM siblings change, elements without an ID will be replaced rather than moved, which can cause issues such as form fields losing focus.\n\nYou may also wish for an individual input to use its own change event or to target a different component. This can be accomplished by annotating the input itself with `phx-change`, for example: 165 | 166 | ```elixir 167 | <.form for={@form} phx-change="validate" phx-submit="save"> 168 | ... 169 | <.input field={@form[:email]} phx-change="email_changed" phx-target={@myself} /> 170 | 171 | ``` 172 | 173 | The `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field. 174 |
175 |
176 | 177 | The first conclusion is that the RAG pipeline improves the response a lot. 178 | This is also most probably because the current version of the LLM (July 2024) is not up-to-date with the state of the art at the time of writing. 179 | 180 | We continue by asking the LLM to help us with the technical task below. We encounter the main limitation: hallucination. 181 | 182 |
Which embedding models can we use? Some examples of how LLMs respond to this question 183 | 184 | We know that most questions submitted to an LLM will produce so-called "hallucinated" - in other words, invented - responses. 185 | 186 | Let's test this with Llama3.1, chosen because you can run it locally via `ollama run llama3.1`. 187 | 188 | ```txt 189 | question: "I want to use an Elixir portage of llama. Can you suggest a library? 190 | 191 | response: "**Elixir-T5**: This is an Elixir port of a popular LLM architecture (T5). It's designed specifically for use with Elixir and might be worth considering. You may also consider **LLaMA-erl**: This is an Elixir wrapper for the LLaMA model, which is similar to Ollama. 192 | ``` 193 | 194 | Neither `Elixir-T5` nor `LLaMA-erl` exists! 195 | 196 | I tried CodeLlama: `ollama run codellama`: 197 | 198 | ```txt 199 | question: which Elixir library would you suggest for text chunking to compute embeddings efficiently? 200 | ``` 201 | 202 |
codellama response 203 | 204 | ```console 205 | response: 206 | There are several Elixir libraries available for text chunking and computing embeddings, each with its own strengths and weaknesses. Here are a few popular options: 207 | 208 | 1. `elixir_bert`: This is an Elixir port of the BERT (Bidirectional Encoder Representations from Transformers) model, which is a powerful language model that can be 209 | used for text chunking. It provides a simple and efficient interface for computing embeddings, and it supports both CPU and GPU acceleration. 210 | 2. `elixir_bert_embedding`: This is an Elixir wrapper for the BERT embedding layer, which allows you to compute embeddings for input sequences using a pre-trained 211 | BERT model. It provides a simple interface for text chunking and supports both CPU and GPU acceleration. 212 | 3. `elixir_word2vec`: This is an Elixir port of the Word2Vec model, which is another popular language model that can be used for text chunking. It provides a simple 213 | and efficient interface for computing embeddings, and it supports both CPU and GPU acceleration. 214 | 4. `elixir_glove`: This is an Elixir port of the GLOVE (Global Vectors for Word Representation) model, which is a language model that can be used for text chunking. 215 | It provides a simple and efficient interface for computing embeddings, and it supports both CPU and GPU acceleration. 216 | 217 | All of these libraries should work well for your purposes, but you may want to consider which one best fits your specific needs. If you have any specific requirements 218 | or preferences (e.g., support for a particular file format or feature), you may want to choose the library that is most suitable for your use case. 219 | ``` 220 |
221 |
222 | 223 | We check that `elixir_bert` does not exist, nor do `elixir_bert_embedding`, `elixir_glove` or `elixir_word2vec`. 224 | 225 | However, after researching a bit, the library `ExFinalFusion` does exist, a NIF wrapper of `FinalFusion`. 226 | [![Screenshot 2024-08-24 at 19 33 08](https://github.com/user-attachments/assets/11d7daeb-8f3f-49ea-a2af-38c2c654c42d)](https://hexdocs.pm/ex_final_fusion/ExFinalFusion.html#summary) 227 | 228 | ### Let the LLMs choose the models 229 | 230 | The question: 231 | 232 | ```console 233 | I want to code a RAG pipeline in the Elixir langugage using its ML libraries (Nx, Axon, Bumble 234 | bee). I want to use a bi-encoder to compute embeddings from chunks of text, and then run a 235 | cross-encoding process. Can you recommend models firstly for the bi-encoding and then for the 236 | cross-encoding step, having in mind that I need a model for tokenizing the input. 237 | ``` 238 | 239 | Some responses from different LLMs. 240 | 241 |
codellama response 242 | 243 | ```console 244 | Bi-encoding (Tokenization): 245 | 246 | * BERT (Bidirectional Encoder Representations from Transformers) 247 | + This is a popular pre-trained language model that can be fine-tuned for various NLP tasks, 248 | including tokenization. 249 | + BERT uses a multi-layer bidirectional transformer encoder to generate contextualized 250 | representations of words in a sentence. 251 | * RoBERTa (Robustly Optimized BERT Pretraining Approach) 252 | + This is a variant of BERT that was specifically designed for text classification tasks and 253 | has been shown to achieve state-of-the-art results on several benchmark datasets. 254 | + RoBERTa uses a combination of tokenization and language modeling techniques to generate 255 | 256 | contextualized representations of words in a sentence. 257 | * DistilBERT (Distilled BERT) 258 | + This is a smaller and more efficient variant of BERT that has been trained to match the 259 | 260 | performance of the full BERT model while requiring fewer computational resources. 261 | + DistilBERT uses a combination of tokenization and attention techniques to generate 262 | contextualized representations of words in a sentence. 263 | 264 | Cross-encoding: 265 | 266 | * MLP (Multi-layer perceptron) 267 | + This is a simple feedforward neural network that can be used for classification tasks such 268 | as text classification, sentiment analysis, and topic modeling. 269 | + An MLP consists of multiple layers of artificial neurons that learn to represent the input 270 | data in a higher dimensional space. 271 | * CNN (Convolutional neural network) 272 | + This is a deep learning architecture that is particularly well-suited for text 273 | classification tasks where the input data has a spatial or temporal structure. 274 | + A CNN consists of multiple convolutional layers that learn to extract features from the 275 | 276 | input data, followed by pooling and fully connected layers to make predictions. 277 | 278 | It's important to note that the specific models you choose will depend on your specific use 279 | case and requirements. You may also want to consider using a combination of different models 280 | to achieve better results. 281 | ``` 282 |
283 |
284 | 285 |
Llama3 response 286 | 287 | ```console 288 | **Bi-Encoding Step: Tokenization and Embedding** 289 | 290 | For tokenization, you can use a simple word tokenizer like `Nx.Tokenizer` from the Elixir 291 | `axon` library. However, if you need more advanced tokenization capabilities (e.g., handling 292 | out-of-vocabulary words), consider using Bumblebee's `Bee.Tokenizer`. For embedding the 293 | tokens into vectors, I recommend using: 294 | 295 | 1. **Universal Sentence Encoder (USE)**: This is a pre-trained language model that can be 296 | used for token embedding. You can use the Elixir wrapper provided by Axon to load and use 297 | the USE model. 298 | 299 | **Cross-Encoding Step** 300 | 301 | For cross-encoding, you'll need to take the output of the bi-encoding step and generate a 302 | new representation based on it. I recommend using: 303 | 304 | 1. **Reformer**: This is a state-of-the-art transformer architecture that's well-suited for 305 | the RAG pipeline. You can use Bumblebee to load and use the Reformer model. 306 | 307 | Here are some additional recommendations: 308 | 309 | * For tokenizing input text, you might want to consider using more advanced techniques like 310 | subwording (e.g., WordPiece) or character-level encoding. 311 | * If you're dealing with long documents, you may need to split them into smaller chunks 312 | before applying bi-encoding and cross-encoding. Consider using a technique like document 313 | chunking or summarization. 314 | 315 | To give you a better understanding of the pipeline flow, here's a high-level outline: 316 | 317 | 1. **Tokenize input text**: Use `Nx.Tokenizer` or Bumblebee's `Bee.Tokenizer` to break down 318 | the text into individual tokens. 319 | 2. **Bi-encode tokens**: Apply USE to embed each token into a vector space (e.g., 320 | 128-dimensional). 321 | 3. **Cross-encode output**: Use Reformer to generate a new representation based on the 322 | bi-encoded output. 323 | 324 | Remember to preprocess your data (e.g., normalize text, remove stop words) before feeding it 325 | into these models. 326 | 327 | For more information and implementation details, I recommend checking out the following 328 | resources: 329 | 330 | * [Bumblebee documentation](https://hexdocs.pm/bumblebee/) 331 | * [Axon documentation](https://hexdocs.pm/axon/) 332 | * [Elixir ML 333 | libraries](https://elixir-lang.org/docs/master/libraries.html#machine-learning-libraries) 334 | * [RAG paper and implementation](https://arxiv.org/abs/2006.16262) 335 | ``` 336 |
337 |
338 | 339 |
Claude3.5 Sonnet response 340 | 341 | ```console 342 | Bi-encoding step: 343 | 344 | For the bi-encoding step, where you'll encode both the chunks from the documentation and the query, a good choice would be a sentence transformer model. A popular option that works well for semantic search is: 345 | 346 | Model: "sentence-transformers/all-MiniLM-L6-v2" 347 | 348 | This model provides a good balance between performance and speed. It's relatively small (80MB) but still gives good results for semantic similarity tasks. 349 | 350 | Cross-encoding step: 351 | 352 | For the cross-encoding step, where you'll want to get a more precise relevance score between the query and the retrieved documents, you could use: 353 | 354 | Model: "cross-encoder/ms-marco-MiniLM-L-6-v2" 355 | 356 | This model is specifically trained for relevance ranking and should work well for your cross-encoding needs. 357 | 358 | Tokenization: 359 | 360 | For tokenization, you'll want to use a tokenizer that matches the models you're using. Since both recommended models are based on MiniLM, you can use the MiniLM tokenizer: 361 | 362 | Tokenizer: "microsoft/MiniLM-L6-H384-uncased" 363 | ``` 364 |
365 |
366 | 367 |
ChatGPT response 368 | 369 | ```console 370 | To create a Retrieval-Augmented Generation (RAG) pipeline in Elixir using the Nx, Axon, and Bumblebee libraries, you'll need to carefully choose models that are suitable for both the bi-encoding and cross-encoding processes. Let's break down the components: 371 | 372 | 1. Bi-Encoder Step: 373 | Purpose: The bi-encoder step involves encoding chunks of text independently to generate embeddings that can be used for retrieval. 374 | Recommended Models: 375 | MiniLM (Microsoft): A small, efficient model designed for sentence embeddings. It's often used for bi-encoder tasks due to its balance between speed and accuracy. 376 | DistilBERT: A smaller and faster version of BERT, ideal for generating embeddings from text while maintaining a good balance of performance. 377 | Sentence-BERT (SBERT): This is a modification of BERT that is specifically designed for creating dense vector representations of sentences, making it a solid choice for bi-encoding. 378 | 2. Cross-Encoder Step: 379 | Purpose: The cross-encoder step involves comparing the query and retrieved passages together, typically used to re-rank the retrieved results. 380 | Recommended Models: 381 | BERT: The standard BERT model is highly effective for cross-encoding tasks because it considers the interaction between the query and the passage. 382 | RoBERTa: A variant of BERT with more robust training, which tends to perform better in NLP tasks. 383 | DeBERTa: Known for its improved handling of word order and dependency information, making it a strong candidate for cross-encoding. 384 | 3. Tokenization: 385 | Purpose: Tokenizing the input is crucial for both the bi-encoder and cross-encoder steps. The tokenizer needs to be compatible with the chosen models. 386 | Recommended Tokenizers: 387 | BERT Tokenizer: Compatible with BERT-based models (BERT, RoBERTa, DistilBERT, etc.). 388 | SentencePiece Tokenizer: Often used with models like MiniLM and other transformer-based models. 389 | Elixir Implementation: 390 | Libraries: 391 | Nx: For numerical computation, handling tensors, and integrating with the underlying ML operations. 392 | Axon: For building neural networks, including the bi-encoder and cross-encoder models. 393 | Bumblebee: For pre-trained models, including tokenizers and transformers, which will help with the embedding and cross-encoding tasks. 394 | Example Models in Bumblebee: 395 | For bi-encoding: Use models like "MiniLM-L6-v2" or "distilbert-base-uncased" from Bumblebee. 396 | For cross-encoding: Use models like "bert-base-uncased" or "roberta-base". 397 | Steps to Implement: 398 | Load the Model: Use Bumblebee to load the bi-encoder and cross-encoder models. 399 | Tokenize the Input: Tokenize the text using the appropriate tokenizer. 400 | Compute Embeddings: Generate embeddings using the bi-encoder model. 401 | Retrieve and Rank: Use the cross-encoder model to compare and rank the retrieved documents. 402 | ``` 403 |
404 | 405 | 406 |
407 | 408 | ### First conclusion for the LLM 409 | 410 | It seems that Claude3.5 Sonnet and ChatGPT give much "better" responses than Llama. However, we need an API key to reach their REST APIs, whilst we can run a local version of the Llama LLM. 411 | 412 | We will therefore choose Llama, running it at no cost, knowing that it can only get better by choosing a "paid" version from OpenAI or, better, Anthropic. 413 | 414 | ### Conclusion for the choice of our models 415 | 416 | We will use SBERT-based models: 417 | - the bi-encoder "sentence-transformers/all-MiniLM-L6-v2", which also provides a tokenizer, 418 | - the cross-encoder "cross-encoder/ms-marco-MiniLM-L-6-v2", along with the tokenizer "bert-base-uncased". 419 | 420 | 421 | We check that the models are correctly implemented in Elixir by running the equivalent code in Python. This is done in [#8](https://github.com/dwyl/rag-elixir-doc/discussions/8) and [#9](https://github.com/dwyl/rag-elixir-doc/discussions/9). 422 | 423 | ## Source of knowledge 424 | 425 | We first seed the vector database with some GitHub markdown pages from the Elixir documentation. 426 | 427 | The sources will be extracted from the files that the GitHub API returns when querying some directories: 428 | - 429 | - 430 | - 431 | 432 | - We can also add some ".ex" modules when they provide documentation in a moduledoc. 433 | 434 | 435 | ## Overview of the RAG process: 436 | * installed tools: the database `Postgres` with the `pgvector` extension, and the platform `ollama` to run the LLM locally. 437 | 438 | * Build the external sources. 439 | - download the "external sources" as a string 440 | - chunk the sources 441 | - produce an embedding based on a "sentence-transformer" model for each chunk 442 | - insert chunk + embedding into a vector database using an HNSW index 443 | 444 | * Build a RAG pipeline 445 | - produce an embedding (a vector representation) from the question 446 | - perform a first vector similarity search (HNSW) against the database 447 | - rerank the top-k results with "cross-encoding" 448 | - build a prompt by injecting the latter results with the query as a context 449 | - submit the prompt to the LLM for completion 450 | 451 | ### Pseudo-code pipeline 452 | 453 | The pipeline will use three SBERT-based models: "sentence-transformers/all-MiniLM-L6-v2" for the embedding, "cross-encoder/ms-marco-MiniLM-L-6-v2" for the reranking, and "bert-base-uncased" for tokenizing. 454 | 455 | In pseudo-code, we have: 456 | 457 | ```elixir 458 | # Data collection and chunking 459 | defmodule DataCollector do 460 | def fetch_and_chunk_docs do 461 | ... 462 | end 463 | end 464 | 465 | # Embedding generation: "sentence-transformers/all-MiniLM-L6-v2" 466 | defmodule Embedder do 467 | def generate_embeddings(text) do 468 | ... 469 | end 470 | end 471 | 472 | # Semantic search 473 | defmodule SemanticSearch do 474 | def search(query, top_k) do 475 | ... 476 | end 477 | end 478 | 479 | # Cross-encoder reranking: "cross-encoder/ms-marco-MiniLM-L-6-v2" 480 | defmodule CrossEncoder do 481 | def rerank(query, documents) do 482 | ... 483 | end 484 | end 485 | 486 | # Prompt construction 487 | defmodule PromptBuilder do 488 | def build_prompt(query, context) do 489 | ... 490 | end 491 | end 492 | 493 | # LLM integration 494 | defmodule LLM do 495 | def generate_response(prompt) do 496 | ...
497 | end 498 | end 499 | 500 | # Main RAG pipeline 501 | defmodule RAG do 502 | def process_query(query) do 503 | query 504 | |> SemanticSearch.search(10) 505 | |> CrossEncoder.rerank(query) 506 | |> PromptBuilder.build_prompt(query) 507 | |> LLM.generate_response() 508 | end 509 | end 510 | ``` 511 | 512 | ## What is **bi-encoding** and **cross-encoding**? 513 | 514 | - Bi-encoders: Encode the query and document separately, then compare their vector representations. This is the "standard" similarity search. 515 | 516 | Bi-encoding does consider the relationship between the query and each document, but it does so independently for each document. The main problem is that bi-encoding might not capture nuanced differences between documents or complex query-document relationships. `HNSW` indexes or `BM25` can be used for this. 517 | 518 | - Cross-encoders: Take both the query and the document as input simultaneously, allowing for more complex interactions between them. A cross-encoder processes the query and document together through a neural network (typically a transformer model like BERT) to produce a single relevance score. This allows the model to capture complex interactions between the query and document at all levels of representation. 519 | 520 | Cross-encoders typically perform better than bi-encoders in terms of accuracy, but are computationally more expensive and slower at inference time. 521 | They are not suitable for large-scale retrieval because they require comparing the query with every document from scratch, which doesn't scale well. 522 | Therefore, cross-encoding is often used in a two-stage retrieval process. 523 | 524 | - How do cross-encoders work in reranking? 525 | - After the initial retrieval (e.g., using vector similarity), you pass each query-document pair through the cross-encoder. 526 | - The cross-encoder outputs a relevance score for each pair. 527 | - Results are then sorted based on these scores, potentially significantly changing the order from the initial retrieval. 528 | 529 | 530 | ## How to **chunk**? 531 | 532 | We need to define how to ingest these documents to produce _embeddings_ saved into a _vector database_. 533 | 534 | Do we run a naive chunking, or [use this package](https://github.com/revelrylabs/text_chunker_ex), or [structured chunks](https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_vs_recursive_retriever/), or [Chunk + Document Hybrid Retrieval](https://docs.llamaindex.ai/en/stable/examples/retrievers/multi_doc_together_hybrid/), or use [BM25](https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/), with an Elixir implementation [BM25](https://github.com/elliotekj/bm25)? 535 | 536 | ## Which embedding? 537 | 538 | - [SBert](https://www.sbert.net/): check: . 539 | 540 | 541 | ## Vector database or index? 542 | 543 | - An index: [HNSW](https://github.com/elixir-nx/hnswlib), the Elixir port of `hnswlib`, a KNN search, 544 | - or a vector database? 545 | - Postgres with [pgvector](https://github.com/pgvector/pgvector), with the Elixir port [pgvector-elixir](https://github.com/pgvector/pgvector-elixir), 546 | - SQLite with [sqlite-vec](https://github.com/asg017/sqlite-vec). 
The extension has to be installed manually from the repo and loaded (with `exqlite`), 547 | - or [Supabase](https://github.com/supabase/supabase), with an [Elixir client](https://github.com/zoedsoupe/supabase-ex) 548 | - or [ChromaDB](https://github.com/3zcurdia/chroma), with an [Elixir client](https://github.com/3zcurdia/chroma) 549 | 550 | We will use Postgres with the extension `pgvector` and the `HNSW` algorithm. See the discussion on the Postgres + pgvector setup. 551 | 552 | ## How to **prompt**? 553 | 554 | This is where we define the scope of the response we want from the LLM, based on the context retrieved by the database nearest-neighbour search. 555 | 556 | The LLM should be able to generate an "accurate" response constrained by this context. 557 | 558 | ## A word on **LLMs** 559 | 560 | A Dockyard post on this: . 561 | 562 | A comparison of different LLMs (source: Anthropic) 563 | Screenshot 2024-08-14 at 17 56 40 564 | 565 | ### Pricing 566 | 567 | [![Screenshot 2024-08-28 at 21 19 41](https://github.com/user-attachments/assets/a66e8689-a7c2-46e6-a597-fb141426b9cf)](https://openai.com/api/pricing/) 568 | 569 | [![Screenshot 2024-08-28 at 21 21 51](https://github.com/user-attachments/assets/081a4b1c-579d-4801-b441-d7f14be9c76a)](https://www.anthropic.com/pricing#anthropic-api) 570 | 571 | ## Going further? 572 | 573 | - accept new documents "on the fly" (download a given link), and maybe run the database ingestion in a background job. 574 | 575 | 576 | - use `OpenSearch` instead of the bi-encoding step: 577 | - install locally: 578 | - 579 | - 580 | - ingest data: and 581 | 582 | - clusterise data? 583 | 584 | ## Sources of inspiration 585 | 586 | Which repos, blog posts? 587 | - 588 | - using the cross-encoder: 589 | - Bumblebee, RAG: 590 | - Supabase: 591 | - Langchain: 592 | - 593 | - 594 | - 595 | - 596 | - A Fly.io post on using `llama.cpp` with `Rustler`: 597 | - ExLLama: LlammaCpp.rs NIF wrapper for Elixir/Erlang: and 598 | - ollama-ex to run LLMs locally: 599 | 600 | 601 | -------------------------------------------------------------------------------- /rag-elixir.livemd: -------------------------------------------------------------------------------- 1 | # RAG Elixir Phoenix Liveview documentation 2 | 3 | ```elixir 4 | Mix.install( 5 | [ 6 | {:req, "~> 0.5.6"}, 7 | {:bumblebee, "~> 0.5.3"}, 8 | {:ollama, "~> 0.7.1"}, 9 | {:text_chunker, "~> 0.3.1"}, 10 | {:postgrex, "~> 0.19.1"}, 11 | {:pgvector, "~> 0.3.0"}, 12 | {:ecto_sql, "~> 3.12"}, 13 | {:exla, "~> 0.7.3"}, 14 | {:kino_bumblebee, "~> 0.5.0"}, 15 | {:scholar, "~> 0.3.1"}, 16 | {:explorer, "~> 0.9.2"}, 17 | {:tucan, "~> 0.3.1"} 18 | ], 19 | config: [nx: [default_backend: EXLA.Backend]] 20 | ) 21 | 22 | Nx.Defn.global_default_options(compiler: EXLA, client: :host) 23 | ``` 24 | 25 | ## Vector extension to Postgres with Docker 26 | 27 | ### 1) pgvector 28 | 29 | To add the [pgvector](https://github.com/pgvector/pgvector) extension to your `PostgreSQL` container, you'll need to use a `PostgreSQL` image that includes this extension. 30 | 31 | The official PostgreSQL image doesn't include `pgvector` by default, so we'll extend the Postgres image and build and use a custom image that has the pgvector extension pre-installed. 32 | 33 | Create a Dockerfile with the following content: 34 | 35 | 36 | 37 |
38 | Dockerfile 39 | 40 | ```dockerfile 41 | FROM postgres:16 42 | 43 | RUN apt-get update && apt-get install -y \ 44 | git \ 45 | build-essential \ 46 | postgresql-server-dev-16 47 | 48 | RUN git clone https://github.com/pgvector/pgvector.git && \ 49 | cd pgvector && \ 50 | make && \ 51 | make install 52 | 53 | CMD ["postgres"] 54 | ``` 55 |
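Note: as an alternative to building the image yourself, the `pgvector` project also publishes pre-built images on Docker Hub (`pgvector/pgvector`); if a tag matching your Postgres version is available, it saves the build step:

```console
> docker pull pgvector/pgvector:pg16
```

The rest of this notebook works the same with either image.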
56 | 57 | 58 | 59 | Build the custom image named "postgres-with-vector": we have a **1.5Gb** image. 60 | 61 | 62 | 63 | ```bash 64 | > docker build -t postgres-with-vector . 65 | ``` 66 | 67 | 68 | 69 | Run a container in detached mode named "postgres-rag" from this custom "postgres-with-vector" image, create the database "rag_example", and open the port 5432 for the Elixir backend to be able to connect to: 70 | 71 | 72 | 73 | ```console 74 | > docker run \ 75 | -d --rm \ 76 | --name postgres-rag \ 77 | -e POSTGRES_PASSWORD=secret \ 78 | -e POSTGRES_DB=rag_example \ 79 | -p 5432:5432 \ 80 | postgres-with-vector 81 | ``` 82 | 83 | 84 | 85 | Check the logs: 86 | 87 | 88 | 89 | ```console 90 | > docker logs postgres-rag 91 | 92 | LOG: database system is ready to accept connections 93 | ``` 94 | 95 | 96 | 97 | In another terminal, connect to the running "postgres-rag" container and execute `psql` on the "rag_example" database: 98 | 99 | ```console 100 | > docker exec -it postgres-rag psql -U postgres -d rag_example 101 | ``` 102 | 103 | 104 | 105 | We execute the `psql` CLI in the container (with the default username "postgres" and password as above) to connect to the database "rag_example": 106 | 107 | 108 | 109 | ```bash 110 | rag_example=# 111 | ``` 112 | 113 | 114 | 115 | ### 2) Use an Ecto.Repo 116 | 117 | 118 | 119 | The extension will define a custom type: 120 | 121 | ```elixir 122 | Postgrex.Types.define( 123 | RAG.PostgrexTypes, 124 | Pgvector.extensions() ++ Ecto.Adapters.Postgres.extensions(), 125 | [] 126 | ) 127 | ``` 128 | 129 | Note that you can also use the Postgres adaptor [Postgrex](https://github.com/elixir-ecto/postgrex) directly with raw SQL commands. 130 | 131 | 132 | 133 |
134 | Postgrex code without Ecto 135 | 136 | ```elixir 137 | {:ok, pg} = Postgrex.start_link( 138 | username: "postgres", 139 | password: "secret", 140 | database: "rag_example", 141 | types: RAG.PostgrexTypes 142 | ) 143 | 144 | Postgrex.query!(pg, "create extension if not exists vector;", []) 145 | Postgrex.query!(pg, "drop table if exists documents;", []) 146 | Postgrex.query!(pg, "create table documents ....", []) 147 | ``` 148 |
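To complete the raw-SQL path, here is a sketch (not used in the rest of the notebook) of how inserting and querying vectors could look with `Postgrex` alone, matching the `documents` table and the 384-dimension embeddings defined below. `Pgvector.new/1` wraps a list of floats so that the driver can encode it through the custom types module; the sample embedding is a placeholder.

```elixir
# Sketch only: raw-SQL insert and nearest-neighbour query with Postgrex + pgvector.
# Assumes `pg` is the connection started above with `types: RAG.PostgrexTypes`.
Postgrex.query!(
  pg,
  "CREATE TABLE IF NOT EXISTS documents (id SERIAL PRIMARY KEY, content TEXT, embedding vector(384))",
  []
)

# A placeholder 384-dimension embedding, just to illustrate the encoding.
embedding = List.duplicate(0.0, 384)

Postgrex.query!(
  pg,
  "INSERT INTO documents (content, embedding) VALUES ($1, $2)",
  ["phoenix liveview", Pgvector.new(embedding)]
)

# Nearest neighbours by L2 distance (`<->`).
Postgrex.query!(
  pg,
  "SELECT content FROM documents ORDER BY embedding <-> $1 LIMIT 5",
  [Pgvector.new(embedding)]
)
```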
149 | 150 | 151 | 152 | We use [Ecto.Repo](https://hexdocs.pm/ecto/Ecto.Repo.html) behaviour. We can use a more friendly DSL than raw SQL commands. 153 | 154 | ```elixir 155 | defmodule RAG.Repo do 156 | use Ecto.Repo, 157 | otp_app: :rag, 158 | adapter: Ecto.Adapters.Postgres 159 | end 160 | 161 | defmodule RAG.Document do 162 | use Ecto.Schema 163 | 164 | schema "documents" do 165 | field :content, :string 166 | field :embedding, Pgvector.Ecto.Vector 167 | end 168 | end 169 | 170 | {:ok, pg} = 171 | RAG.Repo.start_link( 172 | hostname: "localhost", 173 | username: "postgres", 174 | password: "secret", 175 | database: "rag_example", 176 | types: RAG.PostgrexTypes 177 | ) 178 | ``` 179 | 180 | We create the extension: 181 | 182 | ```elixir 183 | RAG.Repo.query!("create extension if not exists vector;") 184 | ``` 185 | 186 | We check in the terminal that the index `HNSW` method is available: 187 | 188 | ``` 189 | rag_example=# select * from pg_am where amname='hnsw'; 190 | 191 | 16450 | hnsw | hnswhandler | i 192 | ``` 193 | 194 | 195 | 196 | We create a table with two columns, "content" and "embedding" where the datatypes are respectively "text" and "vector(384)". The later is because we will be using an embedding model with 384 dimensions (see further). 197 | 198 | We create an `hnsw` index on the "embedding" column using the "cosine" distance. 199 | 200 | cf [documentation](https://github.com/pgvector/pgvector#hnsw): an HNSW index creates a multilayer graph. It has better query performance than IVFFlat (in terms of speed-recall tradeoff), but has slower build times and uses more memory. Also, an index can be created without any data in the table 201 | 202 | ```elixir 203 | # reset the table 204 | RAG.Repo.query!("drop table if exists documents;") 205 | 206 | RAG.Repo.query!(""" 207 | CREATE TABLE IF NOT EXISTS documents ( 208 | id SERIAL PRIMARY KEY, 209 | content TEXT, 210 | embedding vector(384) 211 | ) 212 | """) 213 | 214 | RAG.Repo.query!( 215 | "create index if not exists embedding_idx on documents using hnsw (embedding vector_l2_ops);" 216 | ) 217 | ``` 218 | 219 |
220 | Check in the terminal (that runs `psql` in the container) the details of the created table "documents" and the indexes we created 221 | 222 | ```bash 223 | rag_example=# \d documents 224 | 225 | id | integer | | not null | nextval('documents_id_seq'::regclass) 226 | content | text | | | 227 | embedding | vector(384) | | | 228 | ``` 229 | 230 | ```bash 231 | rag_example=# select * from pg_indexes where tablename='documents'; 232 | 233 | public | documents | documents_pkey | | CREATE UNIQUE INDEX documents_pkey ON public.documents USING btree (id) 234 | public | documents | documents_embedding_idx | | CREATE INDEX documents_embedding_idx ON public.documents USING hnsw (embedding vector_cosine_ops) 235 | ``` 236 |
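One detail worth keeping consistent: `pgvector` only uses an HNSW index when the query operator matches the operator class the index was built with. The code above creates the index with `vector_l2_ops` (L2 distance, queried with `<->`, which is what the semantic search below uses), while the prose and the `psql` output mention the cosine distance (`vector_cosine_ops`, queried with `<=>`). A sketch of the cosine variant, reusing the table and column defined above, in case cosine is the distance you want:

```elixir
# Cosine flavour: the operator class and the query operator must match
# for Postgres to use the HNSW index.
RAG.Repo.query!(
  "create index if not exists embedding_cosine_idx on documents using hnsw (embedding vector_cosine_ops);"
)

# A cosine nearest-neighbour query then orders by `<=>` instead of `<->`:
# SELECT content FROM documents ORDER BY embedding <=> $1 LIMIT 5;
```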
237 | 238 | ## Fetching and chunking documents 239 | 240 | We implement the logic to fetch documents from the `Phoenix LiveView` GitHub repo and chunk them with `TextChunker`. 241 | 242 | ```elixir 243 | defmodule RAG.DataCollector do 244 | def process_directory(url, extractor) do 245 | Req.get!(url).body 246 | #|> Enum.flat_map(&extract_chunks/1) 247 | |> Enum.flat_map(fn file -> extractor.(file) end) 248 | end 249 | end 250 | ``` 251 | 252 | ## Generate & insert embeddings from the sources 253 | 254 | We use `Bumblebee` to load a sentence transformer model, and then compute the embeddings and insert into the database 255 | 256 | ```elixir 257 | defmodule RAG.Embedder do 258 | def load_model do 259 | repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"} 260 | {:ok, model_info} = Bumblebee.load_model(repo) 261 | {:ok, tokenizer} = Bumblebee.load_tokenizer(repo) 262 | 263 | embedding_serving = 264 | Bumblebee.Text.text_embedding( 265 | model_info, 266 | tokenizer, 267 | output_pool: :mean_pooling, 268 | output_attribute: :hidden_state, 269 | embedding_processor: :l2_norm, 270 | compile: [batch_size: 1, sequence_length: [2000]], 271 | defn_options: [compiler: EXLA] 272 | ) 273 | 274 | Kino.start_child({Nx.Serving, serving: embedding_serving, name: ChunkEmbedder}) 275 | end 276 | 277 | def generate_embedding(text) do 278 | %{embedding: vector} = Nx.Serving.batched_run(ChunkEmbedder, String.trim(text)) 279 | Nx.to_flat_list(vector) 280 | end 281 | end 282 | ``` 283 | 284 | ### Test the embedding against Python 285 | 286 |
287 | Python check 288 | 289 | Lets firstly test that our embedding works correctly. 290 | 291 | We use the `Python` results running this model as our source of truth. 292 | 293 | We use the `Python` library [llm](https://github.com/simonw/llm?ref=samrat.me) to compute an embedding of a given chunk. 294 | 295 | We install a plugin to bring in an embedding model "sentence-transformers": 296 | 297 | ```console 298 | > llm install llm-sentence-transformers 299 | ``` 300 | 301 | We check the installation: 302 | 303 | ```console 304 | > llm plugins 305 | 306 | [ 307 | { 308 | "name": "llm-sentence-transformers", 309 | "hooks": [ 310 | "register_commands", 311 | "register_embedding_models" 312 | ], 313 | "version": "0.2" 314 | } 315 | ] 316 | ``` 317 | 318 | We load the model and use the `llm` CLI to test the output of the chunk "phoenix liveview": 319 | 320 | ```console 321 | > llm embed -c 'phoenix liveview' -m sentence-transformers/all-MiniLM-L6-v2 322 | ``` 323 | 324 | We obtain a vector of length 384 (as expected when we craeted the row "embedding" in our "documents" table) 325 | 326 | ```console 327 | [-0.009706685319542885, -0.052094198763370514, -0.09055887907743454, -0.020933324471116066, -0.009688383899629116, 0.013350575231015682, 0.025953974574804306, -0.16938750445842743, -0.010423310101032257, -0.011145276017487049, 0.027349309995770454, -0.001918078283779323, -0.021567553281784058, -0.003199926810339093, -0.0008285145158879459, -0.015139210037887096, 0.06255557388067245, -0.06932919472455978, 0.013888751156628132, -0.004555793013423681, -0.07562420517206192, -0.009811706840991974, -0.012136539444327354, 0.04693487659096718,...] 328 | ``` 329 |
330 | 331 | 332 | 333 | We now test our `Bumblebee` settings. 334 | 335 | We load the model: 336 | 337 | ```elixir 338 | RAG.Embedder.load_model() 339 | ``` 340 | 341 | and we check that we obtain the same (first!) values as above when we run our `Bumblebee` based embedder against the same chunk: 342 | 343 | ```elixir 344 | RAG.Embedder.generate_embedding("phoenix liveview") 345 | ``` 346 | 347 | ### Build the RAG source 348 | 349 | We setup the foundations of our RAG by chunking and inserting our documents as strngs and embeddings, their numerical representation, into our vector database. 350 | 351 | We read each Github folder and download the markdown file, chunk it into a list of strings, and then compute an embedding for each chunk and save it into the vector database. 352 | 353 | ```elixir 354 | defmodule RAG.ExternalSources do 355 | 356 | def extract_chunks(file) do 357 | case file do 358 | %{"type" => "file", "name" => name, "download_url" => download_url} -> 359 | if String.ends_with?(name, ".md") do 360 | Req.get!(download_url).body 361 | |> TextChunker.split(format: :markdown, chunk_size: 800, chunk_overlap: 200) 362 | |> Enum.map(&Map.get(&1, :text)) 363 | else 364 | [] 365 | end 366 | _ -> [] 367 | end 368 | end 369 | 370 | def build(guides) do 371 | guides 372 | |> Task.async_stream(fn guide -> 373 | chunks = RAG.DataCollector.process_directory(guide, &extract_chunks/1) 374 | IO.puts("chunks length: #{length(chunks)}") 375 | Enum.each(chunks, fn chunk -> 376 | Task.start(fn -> 377 | embedding = RAG.Embedder.generate_embedding(chunk) 378 | RAG.Repo.insert!(%RAG.Document{content: chunk, embedding: embedding}) 379 | end) 380 | end) 381 | end, 382 | ordered: false, 383 | timeout: :infinity 384 | ) 385 | |> Stream.run() 386 | end 387 | end 388 | ``` 389 | 390 | ```elixir 391 | guides = [ 392 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/server", 393 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/client", 394 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/introduction" 395 | ] 396 | 397 | RAG.ExternalSources.build(guides) 398 | ``` 399 | 400 | We check the number of insertions. We should have 422. 401 | 402 | ```elixir 403 | RAG.Repo.aggregate(RAG.Document, :count, :id) 404 | ``` 405 | 406 | ## Semantic search 407 | 408 | We implement the l2 similarity search on the embeddings 409 | 410 | ```elixir 411 | top_k = 20 412 | ``` 413 | 414 | ```elixir 415 | defmodule RAG.SemanticSearch do 416 | import Ecto.Query 417 | 418 | def search(query, top_k) do 419 | query_embedding = RAG.Embedder.generate_embedding(query) 420 | 421 | from(d in RAG.Document, 422 | order_by: fragment("embedding <-> ?", ^query_embedding), 423 | limit: ^top_k 424 | ) 425 | |> RAG.Repo.all() 426 | end 427 | end 428 | ``` 429 | 430 | ```elixir 431 | # Usage 432 | query = "how to handle forms server-side?" 433 | 434 | # a list of %RAG.Document{content: content, embedding: embedding} 435 | top_results = RAG.SemanticSearch.search(query, top_k) 436 | ``` 437 | 438 | We inspect the first reranking: 439 | 440 | ```elixir 441 | List.first(top_results).content 442 | ``` 443 | 444 | ## Re-ranking with cross-encoder 445 | 446 | For this step, we'll load another model from Huggingface compatible with Bumblebee to rerank the results. 
447 | 448 | We use the pre-trained model "cross-encoder/ms-marco-MiniLM-L-6-v2", as shown in the SBERT [documentation on cross-encoders](https://www.sbert.net/docs/cross_encoder/pretrained_models.html). 449 | 450 | ```elixir 451 | defmodule RAG.CrossEncoder do 452 | @first 5 453 | 454 | def load_model do 455 | repo = {:hf, "cross-encoder/ms-marco-MiniLM-L-6-v2"} 456 | tokenizer = {:hf, "bert-base-uncased"} 457 | {:ok, model_info} = Bumblebee.load_model(repo) 458 | {:ok, tokenizer} = Bumblebee.load_tokenizer(tokenizer) 459 | 460 | {model_info, tokenizer} 461 | end 462 | 463 | def rerank(documents, query) do 464 | # Prepare input pairs for the cross-encoder (the model is reloaded on each call for simplicity) 465 | {model_info, tokenizer} = load_model() 466 | input_pairs = 467 | Bumblebee.apply_tokenizer(tokenizer, 468 | Enum.map(documents, fn doc -> 469 | {query, doc.content} 470 | end) 471 | ) 472 | 473 | # Run the cross-encoder 474 | outputs = Axon.predict(model_info.model, model_info.params, input_pairs) 475 | 476 | 477 | # Combine scores with original documents and sort 478 | Enum.zip(documents, outputs.logits |> Nx.to_flat_list()) 479 | |> Enum.sort_by(fn {_, score} -> score end, :desc) 480 | |> Enum.map(fn {doc, _} -> doc.content end) 481 | |> Enum.take(@first) 482 | end 483 | end 484 | ``` 485 | 486 | #### Check reranking against Python 487 | 488 | [TODO] 489 | 490 | 491 | 492 | This model uses the architecture `:for_sequence_classification`; there is no such function yet coded in Bumblebee at the time of writing. 493 | 494 | 495 | 496 | ### Build the context by re-ranking 497 | 498 | ```elixir 499 | # Load the model 500 | RAG.CrossEncoder.load_model() 501 | 502 | # Rerank the results 503 | #query = "how to handle forms server-side?" 504 | context = RAG.CrossEncoder.rerank(top_results, query) 505 | ``` 506 | 507 | ## Build the prompt 508 | 509 | We define the prompt with a context and a question. 510 | 511 | ```elixir 512 | defmodule RAG.PromptBuilder do 513 | def build_prompt(context, query) do 514 | context_text = Enum.join(context, "\n\n") 515 | """ 516 | You are a proficient Elixir developer, with full knowledge of the framework Phoenix LiveView. 517 | You are given context information below that is relevant to the query submitted to you. 518 | ----------------------- 519 | #{context_text} 520 | ----------------------- 521 | Answer the query using primarily the context information given above, and cite it. 522 | The response should be in markdown format. 523 | 524 | Query: #{query} 525 | Answer: 526 | """ 527 | end 528 | end 529 | ``` 530 | 531 | ## LLM integration 532 | 533 | Most LLMs are paid solutions accessible via an endpoint. Very few models can be run locally, as LLMs tend to be large. 534 | 535 | We run the "codellama" model via the `ollama` platform. 536 | 537 | 538 | 539 | ### Llama CLI 540 | 541 |
542 | Install and start ollama server 543 | 544 | We install `ollama` (see [the repo](https://github.com/ollama/ollama/tree/main)) to install de "codellama" LLM. 545 | 546 | We pull a model from the registry: 547 | 548 | ```console 549 | > ollama pull codellama 550 | ``` 551 | 552 | We start an LLM server: 553 | 554 | ```console 555 | > ollama serve 556 | ``` 557 | 558 | This gives us an interactive CLI and a [REST API](https://github.com/ollama/ollama/tree/main#rest-api). 559 | 560 | 561 | We can test this and send a **POST** request to generate a completion where we pass a json `{"model": "codellama", "prompt": "...."}`. 562 | 563 | ```console 564 | > curl http://localhost:11434/api/generate -d \ 565 | '{"model": "codellama", "prompt": "how to handle forms with Phoenix Liveview?", "stream": false}' 566 | ``` 567 | 568 | We get a response back: 569 | 570 | ```json 571 | { 572 | "model":"codellama", 573 | "created_at":"2024-08-29T07:25:31.941263Z", 574 | "response":"\nTo handle forms in Phoenix LiveView, you can use the `Phoenix.LiveView.Form` module. This module provides a set of functions for creating and manipulating HTML form elements, as well as handling form data on the server.\n\nHere's an example of how to create a simple form using Phoenix LiveView:\n```\nimport Ecto.Changeset\n\n# Create a changeset for the form\nchangeset = Ecto.Changeset.change(%YourModel{}, %{})\n\n# Render the form in your template\n\u003cform phx-submit=\"save\"\u003e\n \u003cdiv\u003e\n \u003clabel for=\"name\"\u003eName:\u003c/label\u003e\n \u003cinput type=\"text\" id=\"name\" name=\"name\" value={changeset.data[\"name\"]} /\u003e\n \u003c/div\u003e\n\n \u003cdiv\u003e\n \u003clabel for=\"age\"\u003eAge:\u003c/label\u003e\n \u003cinput type=\"number\" id=\"age\" name=\"age\" value={changeset.data[\"age\"]} /\u003e\n \u003c/div\u003e\n\n \u003cbutton type=\"submit\"\u003eSave\u003c/button\u003e\n\u003c/form\u003e\n```\nIn this example, we're creating a changeset for the form, which is used to validate and update the form data on the server. We then render the form in our template using the `phx-submit` attribute, which tells Phoenix to send the form data to the server when the form is submitted.\n\nWhen the form is submitted, Phoenix will automatically handle the form data and update the changeset with any validation errors or updates. You can then use the updated changeset to persist the data in your database.\n\nTo handle the form submission on the server, you can define a `save` function in your LiveView module that will be called when the form is submitted. This function will receive the updated changeset as an argument, and you can use it to update the data in your database or perform any other necessary actions.\n```\ndef save(changeset) do\n # Validate the changeset and return an error if there are any validation errors\n case Ecto.Changeset.apply_action(changeset, :update) do\n {:ok, _model} -\u003e\n # Update the data in your database or perform any other necessary actions\n :ok\n\n {:error, _changeset} -\u003e\n # Render an error page if there were validation errors\n render(:index, changeset: changeset)\n end\nend\n```\nIn this example, we're using the `Ecto.Changeset` module to validate the form data and update the changeset with any validation errors or updates. If there are no validation errors, we can use the updated changeset to persist the data in our database or perform any other necessary actions. 
If there are validation errors, we render an error page with the updated changeset.\n\nOverall, using Phoenix LiveView forms provides a convenient and efficient way to handle form data on the server, while also providing a seamless user experience for your users.", 575 | "done":true, 576 | ... 577 | } 578 | 579 | ``` 580 |
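Side note: the `Mix.install` at the top also pulls in the `:ollama` package (ollama-ex), which wraps this REST API, so the raw HTTP calls could be replaced by the client. A sketch, assuming the client functions documented by the package (`Ollama.init/1` and `Ollama.completion/2`; check https://hexdocs.pm/ollama for the exact API):

```elixir
# Assumed ollama-ex API: the function names and options below are assumptions
# taken from the package documentation and should be verified against the installed version.
client = Ollama.init()

{:ok, %{"response" => response}} =
  Ollama.completion(client,
    model: "codellama",
    prompt: "how to handle forms with Phoenix Liveview?"
  )
```

In this notebook we keep calling the REST API directly with `Req`, as shown below.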
581 | 582 | 583 | 584 | We check that `ollama` is running: 585 | 586 | ```console 587 | lsof -i -P | grep LISTEN | grep 11434 588 | ``` 589 | 590 | ## Generate a response via the ollama REST API and Elixir 591 | 592 | The Livebook runs a **POST** request with `Req` and passes a JSON payload via the `:json` option. 593 | 594 | As per the [documentation](https://hexdocs.pm/req/Req.Steps.html#encode_body/1-request-options), it runs `Jason.encode_to_iodata(%{model: "codellama", prompt: "..."})` and sets the appropriate headers. 595 | 596 | Note that we need to increase the receive timeout (`:receive_timeout`) well above its default. 597 | 598 | ```elixir 599 | defmodule LLM do 600 | def generate_response(prompt) do 601 | json = %{stream: false, model: "codellama", prompt: prompt} 602 | 603 | res = 604 | Req.post!( 605 | "http://localhost:11434/api/generate", 606 | json: json, 607 | receive_timeout: 120_000 608 | ) 609 | 610 | case res do 611 | %{status: 200, body: body} -> 612 | body["response"] 613 | _ -> 614 | IO.puts "error" 615 | end 616 | end 617 | end 618 | ``` 619 | 620 | ```elixir 621 | query 622 | ``` 623 | 624 | ```elixir 625 | RAG.PromptBuilder.build_prompt(context, query) 626 | |> LLM.generate_response() 627 | ``` 628 | 629 | ## Wrap up 630 | 631 | Seed the database with the external sources: 632 | 633 | 634 | 635 | ```elixir 636 | guides = [ 637 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/server", 638 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/client", 639 | "https://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/introduction" 640 | ] 641 | 642 | RAG.ExternalSources.build(guides) 643 | ``` 644 | 645 | ```elixir 646 | defmodule RAG do 647 | def process_query(query) do 648 | top_k = 10 649 | 650 | query 651 | |> RAG.SemanticSearch.search(top_k) 652 | # top_results 653 | |> RAG.CrossEncoder.rerank(query) 654 | # context 655 | |> tap(&IO.puts/1) 656 | |> RAG.PromptBuilder.build_prompt(query) 657 | # prompt 658 | |> LLM.generate_response() 659 | end 660 | end 661 | 662 | query = "explain Javascript interoperability on the server-side" 663 | 664 | RAG.process_query(query) 665 | ``` 666 | 667 | ## Dimension reduction & visualization 668 | 669 | We will use the `scholar` library. 670 | 671 | ```elixir 672 | require Explorer.DataFrame, as: DF 673 | ``` 674 |
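The notebook stops here, so below is only a sketch of how this last section could continue: project the 384-dimensional embeddings down to two principal components with `Scholar` and plot them with `Tucan`. The exact function names used (`Scholar.Decomposition.PCA.fit/2`, `Scholar.Decomposition.PCA.transform/2`, `Pgvector.to_list/1`, `Tucan.scatter/3`) are assumptions to be checked against the installed versions.

```elixir
# Sketch only: reduce the stored embeddings to 2D and scatter-plot them.
# The Scholar/Tucan calls below are assumed APIs; check the package docs for your versions.
docs = RAG.Repo.all(RAG.Document)

embeddings =
  docs
  |> Enum.map(fn doc -> Pgvector.to_list(doc.embedding) end)
  |> Nx.tensor()

# Fit a 2-component PCA on the {n, 384} matrix and project the embeddings.
pca = Scholar.Decomposition.PCA.fit(embeddings, num_components: 2)
coords = Scholar.Decomposition.PCA.transform(pca, embeddings)

# Put the two principal components into a dataframe and plot a scatter.
df =
  DF.new(
    x: coords[[.., 0]] |> Nx.to_flat_list(),
    y: coords[[.., 1]] |> Nx.to_flat_list()
  )

Tucan.scatter(df, "x", "y")
```

--------------------------------------------------------------------------------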