├── .gitignore ├── README.md ├── requirements.txt ├── data1.txt ├── data2.txt └── code.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | app 3 | notes.txt -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Applied-Advanced-RAG 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | langchain-openai 2 | langchain 3 | chromadb 4 | python-dotenv 5 | tiktoken 6 | umap-learn 7 | sentence-transformers 8 | matplotlib 9 | langchain-community -------------------------------------------------------------------------------- /data1.txt: -------------------------------------------------------------------------------- 1 | Elmwood Banking & Trust has firmly established itself as a beacon of financial reliability and personalized service in its community. With a deep understanding of its customers' needs, Elmwood offers a plethora of financial products and services, ensuring every individual's and business's financial goals are met with precision and care. 2 | The bank's personal loan offerings are diverse, providing financial solutions for various life events. From home renovations to unexpected expenses, Elmwood's personal loans are structured with flexibility and competitive interest rates, making them accessible to a broad range of customers. For businesses, the bank offers commercial loans that are essential for growth and expansion. These loans are tailored to each business's unique needs, whether it's for upgrading equipment, expanding operations, or increasing working capital. 3 | Elmwood's wealth management services are a cornerstone of its offerings. 
The bank's financial advisors are skilled in crafting investment strategies that align with clients' long-term financial objectives. They offer guidance in portfolio management, retirement planning, and estate planning, ensuring clients' financial futures are secure. 4 | For those looking to buy or refinance a home, Elmwood's mortgage options are comprehensive. The bank's mortgage experts work closely with clients to find the best mortgage solutions, whether it's a fixed-rate, adjustable-rate, or a more specialized mortgage product. Their goal is to make the process of home financing as seamless and straightforward as possible, providing support and advice at every step. 5 | Elmwood Banking & Trust also places a high value on financial education. The bank regularly hosts workshops and seminars on various financial topics, from basic budgeting and savings strategies to more complex investment and tax planning concepts. These educational initiatives are part of Elmwood's commitment to empowering its customers with the knowledge to make informed financial decisions. 6 | In addition to its financial services, Elmwood is deeply ingrained in the community. The bank supports local charities, sponsors community events, and encourages its employees to volunteer, reflecting its commitment to making a positive impact in the community it serves. This community-centric approach is what truly distinguishes Elmwood Banking & Trust as more than just a financial institution, but a pillar of support and growth in its community. -------------------------------------------------------------------------------- /data2.txt: -------------------------------------------------------------------------------- 1 | La Bella Vita offers a gastronomic journey through the rich landscapes of Italian cuisine, encapsulating the essence of Italy's culinary tradition. The restaurant's ambiance, reminiscent of a rustic Italian villa, sets the stage for an authentic dining experience. 
2 | The chefs at La Bella Vita are masters of their craft, bringing generations of Italian cooking expertise to the table. The menu features a variety of regional specialties, showcasing the diversity of Italian cuisine. From the northern delicacies like risotto and polenta to the southern favorites like Neapolitan pizza and Sicilian seafood, each dish tells a story of its regional origin. 3 | The pasta selection is a highlight of the menu, featuring classics like lasagna, ravioli, and gnocchi, each prepared with homemade pasta and cooked to perfection. The sauces, ranging from rich and creamy to light and zesty, are made with the freshest ingredients, bringing out the authentic flavors of Italy. 4 | The meat and fish dishes are equally impressive, with offerings like osso buco, saltimbocca, and grilled branzino. These dishes are cooked with traditional methods, ensuring every bite is infused with the depth and richness of Italian culinary tradition. 5 | La Bella Vita also prides itself on its vegetarian and vegan options, catering to all dietary preferences. Dishes like eggplant parmigiana and mushroom risotto are crafted with the same attention to detail and flavor as their meat-based counterparts. 6 | The wine list at La Bella Vita is a carefully curated selection of Italian wines, ranging from well-known labels to hidden gems from small vineyards across Italy. The staff are knowledgeable and can recommend the perfect wine to complement each dish, enhancing the overall dining experience. 7 | Desserts at La Bella Vita are a must-try, with a range of traditional Italian sweets. The cannoli, filled with sweet, creamy ricotta, and the affogato, a delightful blend of espresso and gelato, are perfect endings to an exquisite meal. 8 | La Bella Vita also offers a unique dining experience with its private dining rooms and catering services. 
Whether it's for a special occasion, a business meeting, or a family gathering, the restaurant provides a personalized service that makes every event memorable. 9 | The staff at La Bella Vita are the heart of the restaurant. Their passion for Italian cuisine and culture is evident in their attentive service and willingness to go above and beyond to ensure every guest has an enjoyable dining experience. 10 | In essence, La Bella Vita is not just a restaurant; it's a destination for those who appreciate the art of Italian cuisine. It's a place where food, culture, and passion come together, creating an unforgettable dining experience that keeps guests coming back for more. -------------------------------------------------------------------------------- /code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from langchain_openai import OpenAIEmbeddings\n", 10 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", 11 | "from langchain_community.document_loaders import TextLoader, DirectoryLoader\n", 12 | "from langchain_community.vectorstores import Chroma\n", 13 | "from dotenv import load_dotenv, find_dotenv\n", 14 | "\n", 15 | "load_dotenv(find_dotenv())" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "loader = DirectoryLoader('./', glob=\"data*.txt\", loader_cls=TextLoader)\n", 25 | "docs = loader.load()" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "text_splitter = RecursiveCharacterTextSplitter(\n", 35 | " chunk_size=100,\n", 36 | " chunk_overlap=10,\n", 37 | " length_function=len\n", 38 | ")\n", 39 | "new_docs = text_splitter.split_documents(documents=docs)\n", 40 | "doc_strings = 
[doc.page_content for doc in new_docs]" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": null, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n", 50 | "vectors = embeddings.embed_documents(doc_strings)" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "db = Chroma.from_documents(new_docs, embeddings)\n", 60 | "retriever = db.as_retriever(search_kwargs={\"k\": 6})" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "import umap\n", 70 | "import numpy as np\n", 71 | "from tqdm import tqdm\n", 72 | "\n", 73 | "umap_transformer = umap.UMAP(random_state=0, transform_seed=0).fit(vectors)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "def umap_embed(vectors, umap_transformer):\n", 83 | " umap_embeddings = np.array([umap_transformer.transform([vector])[0] for vector in tqdm(vectors)])\n", 84 | " return umap_embeddings" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "global_embeddings = umap_embed(vectors, umap_transformer)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "global_embeddings" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "import matplotlib.pyplot as plt\n", 112 | "\n", 113 | "plt.figure()\n", 114 | "plt.scatter(global_embeddings[:, 0], global_embeddings[:, 1], s=10)\n", 115 | "plt.gca().set_aspect('equal', 'datalim')\n", 116 | "plt.title('Embeddings')\n", 117 | "plt.axis('off')" 118 | ] 119 | }, 
120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "def calc_global_embeddings(query, embeddings, retriever, umap_transformer, embed_function, global_embeddings):\n", 127 | " q_embedding = embeddings.embed_query(query)\n", 128 | "\n", 129 | " docs = retriever.get_relevant_documents(query)\n", 130 | " page_contents = [doc.page_content for doc in docs]\n", 131 | " vectors_content_vectors = embeddings.embed_documents(page_contents)\n", 132 | "\n", 133 | " query_embeddings = embed_function([q_embedding], umap_transformer)\n", 134 | " retrieved_embeddings = embed_function(vectors_content_vectors, umap_transformer)\n", 135 | "\n", 136 | " plt.figure()\n", 137 | " plt.scatter(global_embeddings[:, 0], global_embeddings[:, 1], s=10, color='gray')\n", 138 | " plt.scatter(query_embeddings[:, 0], query_embeddings[:, 1], s=150, marker='X', color='r')\n", 139 | " plt.scatter(retrieved_embeddings[:, 0], retrieved_embeddings[:, 1], s=50, facecolors='none', edgecolors='g')\n", 140 | " plt.gca().set_aspect('equal', 'datalim')\n", 141 | " plt.title(f'{query}')\n", 142 | " plt.axis('off')\n", 143 | " plt.show()\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "calc_global_embeddings(\"What loan do you offer?\", embeddings, retriever, umap_transformer, umap_embed, global_embeddings)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "calc_global_embeddings(\"Do you offer vegetarian food?\", embeddings, retriever, umap_transformer, umap_embed, global_embeddings)\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "### Query Expansion " 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | 
"outputs": [], 176 | "source": [ 177 | "from langchain.chat_models import ChatOpenAI\n", 178 | "\n", 179 | "llm = ChatOpenAI(\n", 180 | " temperature=0,\n", 181 | " max_tokens=800,\n", 182 | " model_kwargs={\"top_p\": 0, \"frequency_penalty\": 0, \"presence_penalty\": 0},\n", 183 | " )" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "from langchain.chains import LLMChain\n", 193 | "from langchain.output_parsers import PydanticOutputParser\n", 194 | "from langchain.prompts import PromptTemplate\n", 195 | "from pydantic import BaseModel, Field\n", 196 | "\n", 197 | "query = \"Do you offer vegetarian food?\"\n", 198 | "\n", 199 | "class LineList(BaseModel):\n", 200 | " lines: list[str] = Field(description=\"Lines of text\")\n", 201 | "\n", 202 | "\n", 203 | "class LineListOutputParser(PydanticOutputParser):\n", 204 | " def __init__(self) -> None:\n", 205 | " super().__init__(pydantic_object=LineList)\n", 206 | "\n", 207 | " def parse(self, text: str) -> list[str]:\n", 208 | " lines = text.strip().split(\"\\n\")\n", 209 | " return lines\n", 210 | "\n", 211 | "\n", 212 | "output_parser = LineListOutputParser()\n", 213 | "\n", 214 | "QUERY_PROMPT = PromptTemplate(\n", 215 | " input_variables=[\"question\"],\n", 216 | " template=\"\"\"You are an AI language model assistant. Your task is to generate five\n", 217 | " different versions of the given user question to retrieve relevant documents from a vector\n", 218 | " database. By generating multiple perspectives on the user question, your goal is to help\n", 219 | " the user overcome some of the limitations of the distance-based similarity search.\n", 220 | " Provide these alternative questions separated by newlines. 
Only provide the query, no numbering.\n", 221 | " Original question: {question}\"\"\",\n", 222 | ")\n", 223 | "\n", 224 | "llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)\n", 225 | "queries = llm_chain.invoke(query)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "queries = queries.get(\"text\")\n", 235 | "queries" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": null, 241 | "metadata": {}, 242 | "outputs": [], 243 | "source": [ 244 | "docs = [retriever.get_relevant_documents(query) for query in queries]" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "unique_contents = set()\n", 254 | "unique_docs = []\n", 255 | "for sublist in docs:\n", 256 | " for doc in sublist:\n", 257 | " if doc.page_content not in unique_contents:\n", 258 | " unique_docs.append(doc)\n", 259 | " unique_contents.add(doc.page_content)\n", 260 | "unique_contents = list(unique_contents)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "unique_contents" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "### Cross-encoder re-ranking" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "from sentence_transformers import CrossEncoder\n", 286 | "cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "pairs = []\n", 296 | "for doc in unique_contents:\n", 297 | " pairs.append([query, doc])" 298 | ] 299 | }, 300 | { 301 | "cell_type": 
"code", 302 | "execution_count": null, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "scores = cross_encoder.predict(pairs)\n", 307 | "scores" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [ 316 | "scored_docs = zip(scores, unique_contents)\n", 317 | "sorted_docs = sorted(scored_docs, reverse=True)\n", 318 | "sorted_docs" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "reranked_docs = [doc for _, doc in sorted_docs][0:8]\n", 328 | "reranked_docs" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "### Lost in the middle Problem" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "from langchain_community.document_transformers import (\n", 345 | " LongContextReorder\n", 346 | ")" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "reordering = LongContextReorder()\n", 356 | "reordered_docs = reordering.transform_documents(reranked_docs)\n", 357 | "reordered_docs" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "app", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.11.0" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 2 382 | } 383 | --------------------------------------------------------------------------------