├── .DS_Store ├── .gitignore ├── LICENSE ├── README.md ├── getting_started.ipynb ├── requirements.txt └── streamlit-demo ├── .DS_Store ├── README.md ├── app.log ├── app.py ├── prayers.txt └── requirements.txt /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/somewheresystems/dataclysm/419976fe2cd9d31f1189ec9c334f7e8021f43522/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.gguf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ``` 2 | @@@@@%%%%%@#-..:-*%%@@@@@@#********%@@@@%%%%#*++#@#*#@%%@@@%%%%%%%%%%%%%%%%+..:*%#%%%%@@@@ 3 | @@@@@%%%%##@@=-*=..+%#+::+@-:-:::.*#-:-%%%*:.---.@*.#%@#:.%*.:#-.---=%--%@*..-%@##%%%%@@@@ 4 | @@@@@%%%%#*%@+-#@#::*@*---@*@@-+@%*%:=.#%+:=%@@@#@#:#%@@%=:.-@#:-%@@*%=:+%-.-=@#*#%%%%@@@@ 5 | @@@@@%%%%##%@+-#@%=-*@-+*.#@@@-+@@@+-%-:@=-*@@@@@@#:#@@@@%-:@@@%+-=#@%=-:=.+=+@#*#%%%%@@@@ 6 | @@@@@%%%%##%@*+%@#=+%+-==-+@@@=+@@%-===-*#==%@@@%@#=#@@@%%=+@@%@@@*-*@+**-=%+*@###%%%%@@@@ 7 | @@@@@%%%%##@@#+#=-*%%=*@@#=%@@=+@@+=%@@+=@%+:-===@#-===-#@++@@#-++-:*@+#@+%@*#@@*#%%%%@@@@ 8 | @@@@@%%%%##%@++*%@@@#%#@@#@#@%@%%%%%%@%%%%@@@%%@@@#@@@@@@%%%%@#@%#%@@%%#@@@%*+@%*#%%%%@@@@ 9 | ``` 10 | This repository provides a comprehensive guide to getting started with using DATACLYSM: a series of high-quality embeddings libraries, with coverage for the 
entirety of PubMed, English Wikipedia and arXiv. The guide is based on the `getting_started.ipynb` notebook. 11 | 12 | It also includes a demo of the Spatial Search Engine, a Streamlit app for exploring the Dataclysm datasets visually and performing ranked searches on proximally related articles (by title, currently). 13 | 14 | ## Table of Contents 15 | 1. [Installation](#installation) 16 | 2. [Retrieval Augmented Generation](#retrieval-augmented-generation) 17 | 3. [Reranking Results](#reranking-results) 18 | 4. [Streamlit SSE (Spatial Search Engine) Demo](#streamlit-sse-spatial-search-engine-demo) 19 | 5. [License](#license) 20 | 21 | ## Installation 22 | To install the necessary dependencies, run the following command in a notebook cell inside a fresh conda environment. I suggest Python 3.10: 23 | ```python 24 | %pip install -r requirements.txt 25 | ``` 26 | 27 | ## Retrieval Augmented Generation 28 | The Retrieval Augmented Generation (RAG) demonstration uses the `BAAI/bge-small-en-v1.5` model to encode a query and retrieve examples based on title similarity using FAISS. The examples are then summarized using Hermes-2.5-Mistral-7B. 29 | 30 | ## Reranking Results 31 | Demos are included for classical (score augmentation) and LLM-based (experimental) reranking of results. The experimental LLM reranking process prompts the aforementioned LLM with the retrieved list, instructing it to rerank the results and drop irrelevant ones. The results are then displayed as a table with hyperlinks. 32 | 33 | ## Streamlit SSE (Spatial Search Engine) Demo 34 | To run the Streamlit demo, navigate to the demo directory and run the Streamlit app: 35 | ```bash 36 | cd streamlit-demo 37 | streamlit run app.py 38 | ``` 39 | 40 | ## License 41 | This project is licensed under the Apache License 2.0. For more details, see the `LICENSE` file in the repository. 42 | 43 | For more detailed instructions and examples, refer to the `getting_started.ipynb` notebook.
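
As a rough illustration of the classical (score augmentation) reranking mentioned under Reranking Results, the sketch below merges title hits and abstract hits, doubles the score of any document found by both, and adds 0.1 when the query text appears in the abstract, following the rules listed in the notebook's composite search section. The function name `rerank`, its parameters, and the `(doc_id, score, abstract)` tuple layout are assumptions made for this example, not part of the notebook, and the sketch assumes that a higher score means a closer match.

```python
import re


def rerank(title_hits, abstract_hits, query, duplicate_factor=2.0, regex_bonus=0.1):
    """Merge two hit lists and rerank them by an augmented score.

    Hypothetical helper: each hit is a (doc_id, score, abstract) tuple, and
    each list is assumed to contain a given doc_id at most once. Higher
    scores are assumed to be better; invert distances before calling this.
    """
    merged = {}
    for doc_id, score, abstract in list(title_hits) + list(abstract_hits):
        if doc_id in merged:
            # Found by both title and abstract: keep the better score, then boost it.
            best = max(merged[doc_id]["score"], score)
            merged[doc_id] = {"score": best * duplicate_factor, "abstract": abstract}
        else:
            merged[doc_id] = {"score": score, "abstract": abstract}

    # Escape the query so characters such as "?" or "(" are matched literally.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    for entry in merged.values():
        if entry["abstract"] and pattern.search(entry["abstract"]):
            entry["score"] += regex_bonus

    # Best match first.
    return sorted(
        ((doc_id, entry["score"]) for doc_id, entry in merged.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


if __name__ == "__main__":
    title_hits = [("a", 0.5, "about attention"), ("b", 0.4, None)]
    abstract_hits = [
        ("a", 0.45, "about attention"),
        ("c", 0.6, "Attention is all you need here"),
    ]
    for doc_id, score in rerank(title_hits, abstract_hits, "attention is all you need"):
        print(doc_id, round(score, 3))
```

Keeping the boost factor and the bonus as parameters makes it easy to experiment with how strongly a double hit or a literal match should outweigh raw embedding similarity.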
44 | -------------------------------------------------------------------------------- /getting_started.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "I suggest using Python 3.10 in a conda environment with this." 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Install Dependencies" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "%pip install -r requirements.txt\n", 24 | "%pip install faiss-cpu faiss-gpu" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "# Initialize Wikipedia Database + Index\n", 32 | "This process takes 2x as much time as arXiv to download, about ~12 minutes to index (M3 Max)" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import numpy as np\n", 42 | "from tqdm import tqdm\n", 43 | "from FlagEmbedding import FlagModel\n", 44 | "from datasets import load_dataset\n", 45 | "import pandas as pd\n", 46 | "import psutil\n", 47 | "\n", 48 | "def print_memory_usage():\n", 49 | " print(f\"Current memory usage: {psutil.Process().memory_info().rss / 1024 ** 2} MB\")\n", 50 | "\n", 51 | "print(\"Loading dataset...\")\n", 52 | "print_memory_usage()\n", 53 | "dataclysm_wikipedia = load_dataset('somewheresystems/dataclysm-wikipedia', split=\"train\")\n", 54 | "print_memory_usage()\n", 55 | "\n", 56 | "# Check the structure of the dataset, particularly the 'title_embedding' and 'abstract_embedding' columns\n", 57 | "print(dataclysm_wikipedia)\n", 58 | "print(dataclysm_wikipedia.column_names)\n", 59 | "print(dataclysm_wikipedia.features)\n", 60 | "print_memory_usage()\n", 61 | "\n", 62 | "# Define a function to flatten the embeddings and add FAISS 
index\n", 63 | "def flatten_and_add_faiss_index(dataset, column_name):\n", 64 | " embedding_shape = np.array(dataset[0][column_name]).shape\n", 65 | " if len(embedding_shape) == 2:\n", 66 | " print(f\"Flattening {column_name} and adding FAISS index...\")\n", 67 | " # Flatten the column before adding the FAISS index\n", 68 | " dataset = dataset.map(lambda x: {column_name: np.concatenate(x[column_name])})\n", 69 | " dataset = dataset.add_faiss_index(column=column_name)\n", 70 | " print(f\"FAISS index for {column_name} added.\")\n", 71 | " else:\n", 72 | " print(f\"Cannot add FAISS index for {column_name}.\")\n", 73 | " print_memory_usage()\n", 74 | " return dataset\n", 75 | "\n", 76 | "# Add FAISS indices for 'title_embedding' and 'abstract_embedding' and save them to different datasets\n", 77 | "dataclysm_wikipedia_indexed = flatten_and_add_faiss_index(dataclysm_wikipedia, 'title_embedding')\n", 78 | "print_memory_usage()\n", 79 | "\n", 80 | "print(\"Datasets loaded.\")\n", 81 | "\n", 82 | "# Define the model\n", 83 | "print(\"Initializing model...\")\n", 84 | "model = FlagModel('BAAI/bge-small-en-v1.5', \n", 85 | " query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", 86 | " use_fp16=True)\n", 87 | "print(\"Model initialized.\")\n", 88 | "print_memory_usage()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "# Initialize arXiv Abstract + Title Indices\n", 96 | "This process takes ~15 minutes to index (M3 Max)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "import numpy as np\n", 106 | "from tqdm import tqdm\n", 107 | "from FlagEmbedding import FlagModel\n", 108 | "from datasets import load_dataset\n", 109 | "import pandas as pd\n", 110 | "import psutil\n", 111 | "\n", 112 | "def print_memory_usage():\n", 113 | " print(f\"Current memory usage: 
{psutil.Process().memory_info().rss / 1024 ** 2} MB\")\n", 114 | "\n", 115 | "print(\"Loading dataset...\")\n", 116 | "print_memory_usage()\n", 117 | "dataclysm_arxiv = load_dataset('somewheresystems/dataclysm-arxiv', split=\"train\")\n", 118 | "print_memory_usage()\n", 119 | "\n", 120 | "# Check the structure of the dataset, particularly the 'title_embedding' and 'abstract_embedding' columns\n", 121 | "print(dataclysm_arxiv)\n", 122 | "print(dataclysm_arxiv.column_names)\n", 123 | "print(dataclysm_arxiv.features)\n", 124 | "print_memory_usage()\n", 125 | "\n", 126 | "# Define a function to flatten the embeddings and add FAISS index\n", 127 | "def flatten_and_add_faiss_index(dataset, column_name):\n", 128 | " embedding_shape = np.array(dataset[0][column_name]).shape\n", 129 | " if len(embedding_shape) == 2:\n", 130 | " print(f\"Flattening {column_name} and adding FAISS index...\")\n", 131 | " # Flatten the column before adding the FAISS index\n", 132 | " dataset = dataset.map(lambda x: {column_name: np.concatenate(x[column_name])})\n", 133 | " dataset = dataset.add_faiss_index(column=column_name)\n", 134 | " print(f\"FAISS index for {column_name} added.\")\n", 135 | " else:\n", 136 | " print(f\"Cannot add FAISS index for {column_name}.\")\n", 137 | " print_memory_usage()\n", 138 | " return dataset\n", 139 | "\n", 140 | "# Add FAISS indices for 'title_embedding' and 'abstract_embedding' and save them to different datasets\n", 141 | "dataclysm_title_indexed = flatten_and_add_faiss_index(dataclysm_arxiv, 'title_embedding')\n", 142 | "dataclysm_abstract_indexed = flatten_and_add_faiss_index(dataclysm_arxiv, 'abstract_embedding')\n", 143 | "print_memory_usage()\n", 144 | "\n", 145 | "print(\"Datasets loaded.\")\n", 146 | "\n", 147 | "# Define the model\n", 148 | "print(\"Initializing model...\")\n", 149 | "model = FlagModel('BAAI/bge-small-en-v1.5', \n", 150 | " query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", 151 | " 
use_fp16=True)\n", 152 | "print(\"Model initialized.\")\n", 153 | "print_memory_usage()\n", 154 | "\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "# Initialize PubMed Title Indices" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "import numpy as np\n", 171 | "from tqdm import tqdm\n", 172 | "from FlagEmbedding import FlagModel\n", 173 | "from datasets import load_dataset\n", 174 | "import pandas as pd\n", 175 | "import psutil\n", 176 | "\n", 177 | "def print_memory_usage():\n", 178 | " print(f\"Current memory usage: {psutil.Process().memory_info().rss / 1024 ** 2} MB\")\n", 179 | "\n", 180 | "print(\"Loading dataset...\")\n", 181 | "print_memory_usage()\n", 182 | "dataclysm_pubmed = load_dataset('somewheresystems/dataclysm-pubmed', split=\"train\")\n", 183 | "print_memory_usage()\n", 184 | "\n", 185 | "# Check the structure of the dataset, particularly the 'title_embedding' and 'abstract_embedding' columns\n", 186 | "print(dataclysm_pubmed)\n", 187 | "print(dataclysm_pubmed.column_names)\n", 188 | "print(dataclysm_pubmed.features)\n", 189 | "print_memory_usage()\n", 190 | "\n", 191 | "def flatten_and_add_faiss_index(dataset, column_name):\n", 192 | " embedding_shape = np.array(dataset[0][column_name]).shape\n", 193 | " try:\n", 194 | " if len(embedding_shape) == 2:\n", 195 | " print(f\"Flattening {column_name} and adding FAISS index...\")\n", 196 | " # Flatten the column before adding the FAISS index\n", 197 | " dataset = dataset.map(lambda x: {column_name: np.concatenate(x[column_name])} if x[column_name] is not None else {})\n", 198 | " # Remove entries with no embeddings\n", 199 | " dataset = dataset.filter(lambda x: column_name in x)\n", 200 | " # If the column is 'abstract_embedding', remove entries with empty abstracts\n", 201 | " if column_name == 'abstract_embedding':\n", 202 | " dataset = 
dataset.filter(lambda x: len(x['abstract_embedding']) != 0)\n", 203 | " dataset = dataset.add_faiss_index(column=column_name)\n", 204 | " print(f\"FAISS index for {column_name} added.\")\n", 205 | " else:\n", 206 | " print(f\"Cannot add FAISS index for {column_name}.\")\n", 207 | " except Exception as e:\n", 208 | " print(f\"Failed to add FAISS index for {column_name}. Error: {e}\")\n", 209 | " print_memory_usage()\n", 210 | " return dataset\n", 211 | "\n", 212 | "# Add FAISS indices for 'title_embedding' and 'abstract_embedding' and save them to different datasets\n", 213 | "dataclysm_pubmed_title_indexed = flatten_and_add_faiss_index(dataclysm_pubmed, 'title_embedding')\n", 214 | "dataclysm_pubmed_abstract_indexed = flatten_and_add_faiss_index(dataclysm_pubmed, 'abstract_embedding')\n", 215 | "print_memory_usage()\n", 216 | "\n", 217 | "print(\"Datasets loaded.\")\n", 218 | "\n", 219 | "# Define the model\n", 220 | "print(\"Initializing model...\")\n", 221 | "model = FlagModel('BAAI/bge-small-en-v1.5', \n", 222 | " query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n", 223 | " use_fp16=True)\n", 224 | "print(\"Model initialized.\")\n", 225 | "print_memory_usage()\n", 226 | "query = \"Attention Is All You Need\"\n", 227 | "print(\"Encoding query...\")\n", 228 | "query_embedding = model.encode([query])\n", 229 | "print(\"Query encoded.\")\n" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "# arXiv Composite Search with regex Rerank\n", 237 | "Search by both Abstract and Title similarity, rank both descending by score. \n", 238 | "1. If a duplicate (title and abstract hit) is found, it increases the score by a factor of 2. \n", 239 | "2. If regex finds the query in the abstract, it increases the score by 0.1 (additive)." 
240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "query = \"Attention Is All You Need\"\n", 249 | "print(\"Encoding query...\")\n", 250 | "query_embedding = model.encode([query])\n", 251 | "print(\"Query encoded.\")\n", 252 | "\n", 253 | "print(\"Retrieving examples by abstract similarity...\")\n", 254 | "scores_abstract, retrieved_examples_abstract = dataclysm_abstract_indexed.get_nearest_examples('abstract_embedding', query_embedding, k=10)\n", 255 | "print(\"Examples retrieved.\")\n", 256 | "\n", 257 | "print(\"Retrieving examples by title similarity...\")\n", 258 | "scores_title, retrieved_examples_title = dataclysm_title_indexed.get_nearest_examples('title_embedding', query_embedding, k=10)\n", 259 | "print(\"Examples retrieved.\")\n", 260 | "\n", 261 | "from IPython.display import display, HTML\n", 262 | "import pandas as pd\n", 263 | "import re\n", 264 | "\n", 265 | "# Convert retrieved examples to DataFrame\n", 266 | "df_abstract = pd.DataFrame(retrieved_examples_abstract)\n", 267 | "df_title = pd.DataFrame(retrieved_examples_title)\n", 268 | "\n", 269 | "# Calculate similarity score in percentage\n", 270 | "df_abstract['similarity_score'] = scores_abstract\n", 271 | "df_title['similarity_score'] = scores_title\n", 272 | "\n", 273 | "# Add a column to denote the source of retrieval\n", 274 | "df_abstract['source'] = 'A'\n", 275 | "df_title['source'] = 'T'\n", 276 | "\n", 277 | "# Drop 'title_embedding' and 'abstract_embedding' columns\n", 278 | "df_abstract = df_abstract.drop(columns=['title_embedding', 'abstract_embedding'])\n", 279 | "df_title = df_title.drop(columns=['title_embedding', 'abstract_embedding'])\n", 280 | "\n", 281 | "# Drop empty columns\n", 282 | "df_abstract = df_abstract.dropna(axis=1, how='all')\n", 283 | "df_title = df_title.dropna(axis=1, how='all')\n", 284 | "\n", 285 | "# Create a \"click to expand\" for the abstract so it doesn't 
take up much space\n", 286 | "df_abstract['abstract'] = df_abstract['abstract'].apply(lambda x: f'<details><summary>Abstract</summary>{x}</details>')\n", 287 | "df_title['abstract'] = df_title['abstract'].apply(lambda x: f'<details><summary>Abstract</summary>{x}</details>')\n", 288 | "\n", 289 | "# Create a URL field with a hyperlink which is constructed by appending the id onto the end of arxiv.org/abs/\n", 290 | "df_abstract['URL'] = df_abstract['id'].apply(lambda x: f'<a href=\"https://arxiv.org/abs/{x}\">Link</a>')\n", 291 | "df_title['URL'] = df_title['id'].apply(lambda x: f'<a href=\"https://arxiv.org/abs/{x}\">Link</a>')\n", 292 | "\n", 293 | "# Concatenate the two dataframes\n", 294 | "df = pd.concat([df_abstract, df_title])\n", 295 | "\n", 296 | "# Normalize the similarity score to be between 0 and 1\n", 297 | "df['similarity_score'] = df['similarity_score'] / df['similarity_score'].max()\n", 298 | "\n", 299 | "# Increase the score if the query is found in the abstract (query escaped so it is matched literally)\n", 300 | "df['similarity_score'] = df.apply(lambda row: row['similarity_score'] + 0.1 if re.search(re.escape(query), row['abstract'], re.IGNORECASE) else row['similarity_score'], axis=1)\n", 301 | "\n", 302 | "# Remove duplicates\n", 303 | "df = df.drop_duplicates(subset=['id'])\n", 304 | "\n", 305 | "# Sort by descending similarity score\n", 306 | "df = df.sort_values(by='similarity_score', ascending=False)\n", 307 | "\n", 308 | "# Display the DataFrame\n", 309 | "from IPython.display import Markdown, display\n", 310 | "display(Markdown(f'QUERY: **{query}**'))\n", 311 | "display(HTML(df.to_html(escape=False)))\n" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "# PubMed Simple Search (by title)" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 22, 324 | "metadata": {}, 325 | "outputs": [ 326 | { 327 | "name": "stdout", 328 | "output_type": "stream", 329 | "text": [ 330 | "Encoding query...\n", 331 | "Query encoded.\n", 332 | "Retrieving examples by title similarity...\n", 333 | "Examples retrieved.\n" 334 | ] 335 | }, 336 | { 337 | "data": { 338 | "text/markdown": [ 339 | "QUERY: **bioinformatics**" 340 | ], 341 | "text/plain": [ 342 | "<IPython.core.display.Markdown object>" 343 | ] 344 | }, 345 | "metadata": {}, 346 | "output_type": "display_data" 347 | }, 348 | { 349 | "data": { 350 | "text/html": [ 351
| "\n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | "
PMIDArticleTitleAbstractTextabstract_embeddingsimilarity_score
91078478[Interactive data processing in the area of medical biology research].None
Abstract Embedding[]
0.432746
810197053Genomics in the real world.The term genomics has evolved into a catch-all term for a variety of information intensive biological methodologies. While the promise of genomics in the bio/pharmaceutical industry is great, its impact on the drug discovery pipeline has not yet been realized, excluding a few notable exceptions. As companies acquire several years of experience in working with genomic data, it is likely that the impact on the discovery process will slowly emerge as we learn to integrate these new technologies into individual discovery programs. It is clear that extracting novel biologically valid targets targets from exponentially growing amounts of sequence data requires time and considerable investment in biological research infrastructure. In order to accelerate the process of target validation, a variety of functional genomics technologies are also being developed to try to predict the effect of inhibitory compounds in advance of development. Resources spent on early stage exploratory efforts such as these can pay off by improving the success rate for screening and medicinal chemistry.
Abstract Embedding[[-0.02972412109375, -0.0218505859375, -0.0156402587890625, -0.049652099609375, 0.0282440185546875, -0.01373291015625, 0.01904296875, 0.03106689453125, -0.01119232177734375, -0.005401611328125, 0.00621795654296875, 0.006938934326171875, 0.036407470703125, -0.0278472900390625, -0.028778076171875, 0.01494598388671875, -0.0170745849609375, -0.00878143310546875, -0.016815185546875, -0.0162506103515625, -0.00872802734375, 0.00682830810546875, 0.03985595703125, -0.0199737548828125, -0.00035119056701660156, 0.056640625, -0.005901336669921875, -0.04107666015625, -0.044708251953125, -0.2412109375, 0.002429962158203125, 0.027252197265625, 0.06005859375, -0.006435394287109375, -0.004337310791015625, 0.0404052734375, -0.0169525146484375, 0.0364990234375, -0.023193359375, 0.06732177734375, 0.0226898193359375, 0.048919677734375, -0.02874755859375, 0.0343017578125, 0.016754150390625, -0.038360595703125, -0.07293701171875, 0.0024261474609375, -0.0110321044921875, -0.02001953125, -0.060150146484375, -0.0460205078125, 0.000682830810546875, 0.04132080078125, -0.0194854736328125, 0.027740478515625, 0.01071929931640625, 0.0152130126953125, -0.0166778564453125, 0.0228118896484375, 0.04827880859375, 0.01195526123046875, -0.10736083984375, 0.039459228515625, 0.023223876953125, 0.0333251953125, -0.0233001708984375, -0.01274871826171875, 0.056854248046875, 0.04693603515625, -0.0200347900390625, 0.017608642578125, 0.0298919677734375, 0.0203857421875, 0.037689208984375, 0.046966552734375, 0.005489349365234375, -0.01146697998046875, 0.06787109375, -0.0205535888671875, 0.062286376953125, -0.0195770263671875, 0.0053863525390625, -0.068603515625, -0.030609130859375, 0.0090484619140625, -0.016998291015625, 0.02679443359375, 0.036895751953125, 0.018829345703125, 0.05462646484375, -0.0275726318359375, 0.01190948486328125, -4.029273986816406e-05, -0.08758544921875, -0.01280975341796875, 0.031097412109375, -0.05413818359375, 0.02117919921875, 0.39794921875, -0.051177978515625, 
0.04034423828125, -0.08868408203125, -0.0183563232421875, 0.00841522216796875, -0.047088623046875, -0.027740478515625, -0.070068359375, 0.0309295654296875, -0.0352783203125, 0.052276611328125, 0.01080322265625, 0.02154541015625, 0.01629638671875, 0.009429931640625, -0.005878448486328125, -0.007598876953125, 0.0119781494140625, -0.033966064453125, 0.0204010009765625, -0.025634765625, 0.02191162109375, 0.0306396484375, 0.040313720703125, 0.00598907470703125, 0.048065185546875, -0.01715087890625, 0.08099365234375, 0.01151275634765625, -0.00931549072265625, 0.0550537109375, 0.04315185546875, -0.05670166015625, 0.035308837890625, 0.001903533935546875, -0.016845703125, -0.035186767578125, 0.005634307861328125, -0.0294342041015625, 0.0120849609375, -0.0181121826171875, 0.01141357421875, -0.0211181640625, -0.055938720703125, -0.081298828125, 0.1116943359375, 0.07818603515625, 0.045166015625, -0.040191650390625, -0.054412841796875, -0.0181121826171875, 0.0712890625, 0.0254669189453125, 0.037750244140625, -0.01291656494140625, 0.06121826171875, 0.0199737548828125, -0.0511474609375, -0.04779052734375, 0.0263519287109375, -0.07696533203125, 0.030670166015625, -0.061126708984375, 0.11712646484375, 0.0095367431640625, -0.036285400390625, 0.00738525390625, -0.0036468505859375, 0.041107177734375, 0.0210723876953125, -0.0001533031463623047, -0.028289794921875, 0.02850341796875, 0.0261688232421875, -0.1029052734375, -0.048919677734375, -0.03143310546875, -0.0196075439453125, -0.01483917236328125, -0.0090484619140625, 0.0389404296875, -0.013092041015625, 0.01470947265625, 0.0175628662109375, -0.001068115234375, -0.054412841796875, -0.036346435546875, 0.0215606689453125, 0.01250457763671875, 0.03814697265625, -0.04071044921875, 0.0670166015625, -0.00115203857421875, -0.0093994140625, -0.044647216796875, -0.08746337890625, -0.00955963134765625, 0.045745849609375, -0.010589599609375, -0.072265625, 0.023101806640625, 0.049835205078125, 0.048309326171875, 0.05615234375, 
-0.00431060791015625, -0.01180267333984375, -0.00550079345703125, -0.0227508544921875, 0.04510498046875, 0.00939178466796875, -0.053680419921875, 0.0212249755859375, -0.01363372802734375, 0.0084991455078125, -0.0283203125, 0.003833770751953125, 0.0140228271484375, 0.0216064453125, 0.00811767578125, 0.039947509765625, 0.0007691383361816406, 0.031402587890625, -0.01641845703125, -0.299560546875, 0.004253387451171875, -0.01285552978515625, 0.032073974609375, -0.024871826171875, -0.00743865966796875, -0.007511138916015625, 0.0162506103515625, -0.038177490234375, 0.0120849609375, -0.00452423095703125, 0.11517333984375, -0.011138916015625, -0.045013427734375, -0.037841796875, -0.055511474609375, -0.00026297569274902344, -0.020050048828125, -0.033233642578125, -0.027923583984375, -0.007724761962890625, -0.0161285400390625, 0.0019855499267578125, -0.0299224853515625, 0.03021240234375, -0.0216522216796875, 0.1373291015625, 0.026702880859375, -0.059783935546875, 0.032501220703125, 0.01287078857421875, 0.021453857421875, -0.026031494140625, -0.1156005859375, 0.0122528076171875, -0.01050567626953125, 0.05987548828125, -0.0193939208984375, 0.004425048828125, 0.00470733642578125, -0.0097503662109375, -0.003009796142578125, -0.025115966796875, -0.07421875, -0.034271240234375, 0.0185546875, -0.010406494140625, -0.01366424560546875, -0.02008056640625, 0.02789306640625, 0.0699462890625, 0.06353759765625, 0.07904052734375, -0.0226287841796875, 0.009765625, -0.036956787109375, -0.03619384765625, 0.056915283203125, -0.0193023681640625, -0.033355712890625, 0.0343017578125, 0.041015625, -0.0235748291015625, 0.03326416015625, 0.008758544921875, -0.0927734375, -0.0396728515625, -0.01715087890625, 0.043792724609375, -0.060150146484375, 0.029937744140625, 0.108642578125, -0.0273590087890625, 0.037078857421875, 0.0731201171875, -0.00366973876953125, -0.0133819580078125, -0.11102294921875, -0.055999755859375, 0.06805419921875, 0.0293121337890625, -0.01224517822265625, 0.031219482421875, 
-0.01476287841796875, -0.031036376953125, 0.002773284912109375, 0.0428466796875, -0.0341796875, 0.01171875, -0.034210205078125, 0.002994537353515625, -0.030609130859375, -0.00394439697265625, -0.08441162109375, 0.042022705078125, 0.007053375244140625, -0.2008056640625, 0.053802490234375, 0.00765228271484375, 0.0236358642578125, 0.036712646484375, -0.01262664794921875, 0.06732177734375, -0.1002197265625, 0.0164642333984375, 0.033416748046875, -0.006378173828125, -0.0036602020263671875, 0.07421875, -0.0305633544921875, 0.00408935546875, 0.0274200439453125, 0.1270751953125, -0.05841064453125, 0.0152130126953125, -0.01371002197265625, 0.06964111328125, 0.0294647216796875, 0.181884765625, -0.031494140625, 0.01155853271484375, 0.0008349418640136719, -0.013397216796875, -0.040802001953125, -0.026336669921875, 0.0195770263671875, -0.003330230712890625, 0.024169921875, 0.050445556640625, -0.034423828125, 0.0011043548583984375, 0.074462890625, 0.033416748046875, -0.039764404296875, -0.039031982421875, -0.01328277587890625, 0.047271728515625, -0.04949951171875, -0.052215576171875, -0.005962371826171875, 0.06158447265625, -0.077392578125, -0.049468994140625, -0.05657958984375, -0.0290069580078125, 0.055908203125, -0.042510986328125, -0.0111541748046875, 0.017852783203125, -0.0016431808471679688, -0.00701904296875, -0.0114593505859375, 0.0169830322265625, 0.0031032562255859375, 0.033447265625, 0.0052947998046875, -0.026275634765625, 0.0355224609375, -0.08453369140625, 0.07550048828125, -0.01177978515625]]
0.432628
7 10047737 [Utility of genome databases: future perspectives]. None
Abstract Embedding[]
0.431294
6 10203836 Microbial genomics. None
Abstract Embedding[]
0.431236
5 10193187 Genomics and the biology of parasites. Despite the advances of modern medicine, the threat of chronic illness, disfigurement, or death that can result from parasitic infection still affects the majority of the world population, retarding economic development. For most parasitic diseases, current therapeutics often leave much to be desired in terms of administration regime, toxicity, or effectiveness and potential vaccines are a long way from market. Our best prospects for identifying new targets for drug, vaccine, and diagnostics development and for dissecting the biological basis of drug resistance, antigenic diversity, infectivity and pathology lie in parasite genome analysis, and international mapping and gene discovery initiatives are under way for a variety of protozoan and helminth parasites. These are far from ideal experimental organisms, and the influence of biological and genomic characteristics on experimental approaches is discussed, progress is reviewed and future prospects are examined.
Abstract Embedding[[-0.01416015625, -0.01340484619140625, 0.038238525390625, 0.020843505859375, 0.00539398193359375, -0.026885986328125, 0.052703857421875, 0.110107421875, -0.05792236328125, -0.058074951171875, 0.02374267578125, -0.061279296875, 0.00754547119140625, -0.00139617919921875, -0.04156494140625, 0.0092926025390625, -0.061370849609375, 0.01433563232421875, 0.0229644775390625, 0.03375244140625, -0.007472991943359375, -0.0049591064453125, 0.031219482421875, -0.061279296875, -0.04742431640625, 0.04010009765625, 0.0219879150390625, 0.01035308837890625, -0.0301361083984375, -0.192626953125, -0.01238250732421875, -0.0296783447265625, -0.016510009765625, -0.03558349609375, -0.004352569580078125, -0.039276123046875, -0.04278564453125, 0.0173492431640625, -0.009063720703125, 0.0433349609375, 0.0419921875, 0.057708740234375, -0.012237548828125, 0.027313232421875, 0.01523590087890625, -0.07501220703125, -0.0780029296875, 0.019256591796875, 0.045928955078125, -0.038848876953125, -0.034454345703125, -0.07171630859375, 0.00010001659393310547, 0.076416015625, -0.002529144287109375, -0.022003173828125, 0.0004782676696777344, 0.006107330322265625, 0.0015153884887695312, 0.051025390625, 0.00678253173828125, 0.037353515625, -0.0858154296875, 0.06304931640625, 0.040924072265625, 0.047088623046875, -0.04498291015625, -0.035003662109375, 0.0184326171875, -0.017333984375, -0.048065185546875, 0.006359100341796875, 0.044158935546875, 0.03509521484375, 0.00691986083984375, 0.053863525390625, -0.00876617431640625, -0.052734375, 0.06500244140625, 0.0162506103515625, 0.036163330078125, -0.007965087890625, 0.04278564453125, -0.0007901191711425781, -0.03973388671875, -0.004245758056640625, 0.004497528076171875, 0.0102081298828125, 0.0209808349609375, 0.01043701171875, 0.04766845703125, -0.0175628662109375, 0.0287017822265625, 0.0171966552734375, -0.0789794921875, -0.004314422607421875, -0.00971221923828125, -0.0181884765625, -0.0007534027099609375, 0.44287109375, 0.019805908203125, 
-0.0220947265625, -0.04803466796875, 0.002834320068359375, -0.0185394287109375, 0.00868988037109375, 0.0168304443359375, -0.06134033203125, 0.0288543701171875, 0.056640625, 0.01274871826171875, 0.0110626220703125, -0.0249786376953125, 0.0293121337890625, 0.0091400146484375, 0.0013113021850585938, 0.01090240478515625, 0.027679443359375, -0.0330810546875, 0.039703369140625, -0.0104522705078125, -0.039093017578125, 0.040924072265625, -0.03533935546875, 0.049072265625, 0.034088134765625, -0.016876220703125, 0.12060546875, 0.01279449462890625, 0.005558013916015625, 0.050201416015625, 0.045440673828125, -0.0236968994140625, 0.06597900390625, 0.01453399658203125, 0.0008058547973632812, -0.0305328369140625, -0.01470947265625, -0.03265380859375, 0.03826904296875, -0.059783935546875, -0.044036865234375, -0.044464111328125, -0.08966064453125, -0.09033203125, 0.08746337890625, 0.02880859375, 0.0227203369140625, -0.06292724609375, -0.020660400390625, 0.0029239654541015625, 0.06597900390625, 0.0251312255859375, 0.05084228515625, -0.0587158203125, 0.056854248046875, 0.0239105224609375, 0.003040313720703125, -0.004695892333984375, -0.02850341796875, -0.0853271484375, 0.00013518333435058594, -0.047088623046875, 0.1500244140625, -0.0194091796875, -0.05511474609375, -0.00972747802734375, 0.0037822723388671875, 0.018310546875, 0.042205810546875, -0.0546875, -0.00433349609375, 0.02679443359375, 0.00041294097900390625, -0.05615234375, -0.01873779296875, -0.091552734375, 0.007213592529296875, 0.02728271484375, 0.013427734375, -0.0094146728515625, -0.01538848876953125, -0.039947509765625, 0.0034503936767578125, -0.02130126953125, -0.0030384063720703125, -0.025238037109375, 0.07086181640625, -0.023162841796875, 0.01018524169921875, -0.002964019775390625, 0.002025604248046875, -0.0540771484375, 0.04083251953125, 0.0263214111328125, -0.0482177734375, -0.005977630615234375, -0.0188140869140625, -0.01137542724609375, -0.043914794921875, 0.03240966796875, 0.061065673828125, 0.0269622802734375, 
0.0545654296875, -0.006961822509765625, -0.020904541015625, 0.05792236328125, -0.050262451171875, 0.0576171875, 0.003692626953125, -0.0147857666015625, 0.023101806640625, -0.0498046875, 0.0176544189453125, -0.055419921875, 0.01525115966796875, -0.0087738037109375, 0.03271484375, 0.10400390625, 0.0260162353515625, 0.0184783935546875, 0.055694580078125, -0.049072265625, -0.28759765625, 0.0125274658203125, -0.046722412109375, -0.032135009765625, -0.055572509765625, -0.0199432373046875, -0.023834228515625, 0.006107330322265625, 0.036590576171875, 0.01922607421875, 0.026947021484375, 0.08319091796875, 0.0301361083984375, 0.01549530029296875, -0.07757568359375, 0.0138397216796875, -0.01922607421875, -0.026824951171875, 0.00255584716796875, 0.01983642578125, 0.040496826171875, -0.04827880859375, 0.0955810546875, -0.078125, -0.022674560546875, -0.055206298828125, 0.1297607421875, 0.04052734375, -0.004878997802734375, 0.00836181640625, 0.01122283935546875, -0.002361297607421875, -0.0245819091796875, -0.0736083984375, -0.005191802978515625, -0.0251922607421875, 0.01506805419921875, -0.049407958984375, -0.034454345703125, -0.009735107421875, -0.01221466064453125, -0.00992584228515625, 0.01385498046875, -0.035675048828125, -0.03302001953125, 0.03875732421875, -0.017974853515625, 0.01538848876953125, 0.04315185546875, 0.0325927734375, 0.0010728836059570312, 0.040740966796875, 0.03704833984375, -0.0284881591796875, -0.053436279296875, 0.03631591796875, 0.0299224853515625, 0.014556884765625, -0.04339599609375, 0.028961181640625, 0.007625579833984375, 0.0224456787109375, -0.01494598388671875, 0.0290985107421875, -0.02313232421875, -0.0262908935546875, -0.03680419921875, -0.0137786865234375, 0.032958984375, -0.0391845703125, 0.011383056640625, 0.12469482421875, 0.01322174072265625, 0.0087738037109375, 0.031768798828125, 0.0014495849609375, 0.049774169921875, -0.00983428955078125, -0.01055145263671875, 0.07354736328125, 0.032623291015625, -0.004566192626953125, -0.02984619140625, 
0.01953125, 0.0031032562255859375, 0.03924560546875, 0.0400390625, -0.00301361083984375, -0.0077056884765625, -0.02935791015625, -0.0018863677978515625, -0.020965576171875, 0.005107879638671875, -0.085205078125, -0.043426513671875, 0.0440673828125, -0.209716796875, 0.057769775390625, 0.006107330322265625, 0.02569580078125, -0.03350830078125, -0.0009388923645019531, 0.0478515625, -0.0496826171875, -0.0029582977294921875, -0.017974853515625, 0.0888671875, 0.035247802734375, 0.06268310546875, 0.0093994140625, 0.0139617919921875, -0.0059356689453125, 0.06396484375, -0.061492919921875, 0.017059326171875, -0.028717041015625, -0.01207733154296875, -0.005550384521484375, 0.1802978515625, -0.04498291015625, -0.004344940185546875, 0.038116455078125, -0.001384735107421875, 0.045745849609375, -0.0251007080078125, 0.01922607421875, -0.0074615478515625, -0.020660400390625, 0.0062103271484375, -0.04168701171875, -0.00030422210693359375, 0.039581298828125, 0.0243682861328125, -0.020965576171875, -0.046783447265625, -0.031402587890625, -0.0290679931640625, 0.004180908203125, 0.01226806640625, -0.0010900497436523438, 0.03753662109375, -0.0159454345703125, -0.032806396484375, -0.0325927734375, -0.050506591796875, 0.03326416015625, -0.11529541015625, -0.018280029296875, 0.0096893310546875, -0.028076171875, -0.00432586669921875, -0.03973388671875, 0.0235443115234375, -0.017852783203125, 0.0011816024780273438, -0.0194854736328125, 0.0160980224609375, -0.0032444000244140625, -0.1124267578125, 0.08453369140625, -0.0005488395690917969]]
0.423589
4 10089485 The Nucleic Acid Database: A resource for nucleic acid science. The Nucleic Acid Database (NDB) distributes information about nucleic acid-containing structures. Here the information content of the database as well as the query capabilities are described. A summary of how the technology developed by this project has been used to develop other macromolecular databases is given.
Abstract Embedding[[-0.017120361328125, -0.010833740234375, -0.056640625, -0.0198822021484375, 0.04217529296875, 0.004749298095703125, 0.01934814453125, -0.01849365234375, 0.0102386474609375, 0.05279541015625, 0.007022857666015625, -0.0014896392822265625, 0.05224609375, -0.0247802734375, -0.0070343017578125, 0.0068817138671875, -0.040252685546875, -0.03900146484375, 0.00853729248046875, 0.010894775390625, 0.0290374755859375, 0.0075531005859375, -0.0160369873046875, -0.01352691650390625, -0.0034809112548828125, 0.1119384765625, 0.0282440185546875, -0.01410675048828125, -0.05120849609375, -0.1971435546875, 0.0276336669921875, 0.0175018310546875, 0.0187530517578125, -0.05499267578125, 0.020050048828125, -0.0203704833984375, 0.026123046875, -0.0226593017578125, -0.0423583984375, 0.015899658203125, 0.0518798828125, -0.020751953125, -0.0109405517578125, -0.006687164306640625, 0.08270263671875, -0.04656982421875, -0.010345458984375, -0.01009368896484375, 0.0259246826171875, -0.0290679931640625, -0.0367431640625, -0.06597900390625, -0.040252685546875, 0.0087890625, 0.0015773773193359375, 0.0213775634765625, 0.021484375, 0.0067901611328125, -0.02105712890625, -0.0203857421875, 0.075927734375, 0.0024776458740234375, -0.089111328125, 0.091064453125, 0.0501708984375, 0.045867919921875, 0.01067352294921875, -0.018951416015625, 0.04180908203125, 0.032379150390625, -0.0155487060546875, -0.0247039794921875, 0.006359100341796875, 0.047698974609375, 0.02276611328125, -0.031005859375, 0.050201416015625, -0.0274658203125, 0.0288848876953125, -0.0191650390625, 0.001708984375, -0.0506591796875, 0.040740966796875, -0.05401611328125, -0.0238800048828125, 0.01410675048828125, -0.07666015625, -0.0350341796875, -0.01580810546875, 0.0147705078125, 0.017486572265625, 0.0098419189453125, 0.038299560546875, 0.0233917236328125, -0.0645751953125, -0.02874755859375, -0.016632080078125, -0.039703369140625, 0.082275390625, 0.34814453125, -0.04266357421875, 0.03173828125, -0.03057861328125, 
-0.036956787109375, -0.049835205078125, -0.032745361328125, 0.0011377334594726562, -0.0433349609375, -0.01446533203125, 0.01090240478515625, 0.03643798828125, -0.029815673828125, 0.0204315185546875, -0.0841064453125, 0.02734375, -0.0081329345703125, -0.015899658203125, -0.0216827392578125, -0.049224853515625, -0.01206207275390625, -0.06439208984375, 0.01337432861328125, -0.0038700103759765625, 0.06158447265625, 0.06756591796875, -0.011749267578125, -0.05267333984375, 0.075927734375, 0.056182861328125, 0.0113372802734375, 0.056488037109375, 0.037353515625, -0.030792236328125, -0.031982421875, -0.002399444580078125, 0.006069183349609375, 0.02020263671875, -0.01241302490234375, 0.0031585693359375, 0.037200927734375, -0.01190948486328125, -0.02227783203125, -0.06976318359375, 0.0031147003173828125, -0.1383056640625, 0.061920166015625, -0.061279296875, 0.0430908203125, -0.060333251953125, -0.034393310546875, 0.028106689453125, 0.0775146484375, 0.015777587890625, -0.0254974365234375, 0.01371002197265625, 0.033050537109375, -0.0261688232421875, -0.04901123046875, -0.050048828125, -0.023223876953125, -0.053436279296875, 0.006076812744140625, 0.022369384765625, 0.10101318359375, -0.0084991455078125, -0.09149169921875, -0.020538330078125, 0.02056884765625, 0.05584716796875, -0.0301666259765625, 0.0294189453125, 0.01161956787109375, -0.051177978515625, 0.0465087890625, -0.0290374755859375, -0.0072021484375, -0.0004143714904785156, 0.035003662109375, 0.057769775390625, -0.00974273681640625, 0.025543212890625, -0.0498046875, -0.038909912109375, -0.0032672882080078125, 0.0157928466796875, -0.033203125, -0.084716796875, 0.035552978515625, 0.045013427734375, 0.034515380859375, -0.029632568359375, 0.043731689453125, -0.0225372314453125, 0.0190277099609375, -0.0034046173095703125, -0.058929443359375, 0.073974609375, 0.02978515625, -0.01248931884765625, -0.044036865234375, 0.07073974609375, 0.03570556640625, -0.00339508056640625, 0.11236572265625, 0.00537109375, -0.0290985107421875, 
-0.0225067138671875, -0.02789306640625, -0.00246429443359375, -0.0214691162109375, 0.004192352294921875, -0.0136566162109375, -0.10772705078125, -0.002368927001953125, -0.0894775390625, -0.058349609375, 0.033843994140625, 0.041168212890625, 0.051025390625, -0.01190948486328125, -0.01654052734375, -0.03973388671875, -0.007205963134765625, -0.279296875, 0.031402587890625, 0.0237579345703125, 0.01788330078125, -0.018096923828125, 0.002048492431640625, -0.00806427001953125, -0.040985107421875, -0.022064208984375, 0.02313232421875, 0.0318603515625, 0.0145111083984375, 0.0132904052734375, -0.08221435546875, -0.0679931640625, 0.06781005859375, 0.07049560546875, -0.0253143310546875, -0.07855224609375, -0.002117156982421875, 0.051513671875, -0.017303466796875, -0.0133056640625, -0.01238250732421875, 0.032623291015625, -0.0240631103515625, 0.1099853515625, -0.0609130859375, 0.01325225830078125, 0.038116455078125, 0.0501708984375, 0.034454345703125, -0.07061767578125, -0.08154296875, 0.0030422210693359375, -0.036224365234375, -0.07025146484375, -0.0003921985626220703, -0.031982421875, -0.033538818359375, 0.005115509033203125, 0.028594970703125, 0.036651611328125, -0.0909423828125, -0.0017671585083007812, -0.005756378173828125, 0.02752685546875, -0.0030193328857421875, -0.004596710205078125, -0.01180267333984375, 0.00464630126953125, 0.000606536865234375, 0.017974853515625, 0.0771484375, 0.033416748046875, -0.004352569580078125, -0.0182647705078125, -0.03631591796875, -0.0249481201171875, 0.0255126953125, 0.03765869140625, 0.00861358642578125, -0.04791259765625, 0.02667236328125, -0.01210784912109375, -0.066650390625, 0.04302978515625, 0.0209503173828125, 0.06298828125, -0.07769775390625, -0.0014715194702148438, 0.1072998046875, -0.05078125, 0.071533203125, 0.035003662109375, 0.034881591796875, -0.019439697265625, -0.0843505859375, -0.049163818359375, -0.0133056640625, 0.02734375, -0.056854248046875, 0.087646484375, -0.0007505416870117188, 0.056610107421875, 0.05096435546875, 
0.1007080078125, 0.001972198486328125, -0.010009765625, 0.0174407958984375, 0.0321044921875, -0.048095703125, 0.0186767578125, -0.032623291015625, 0.01544189453125, -0.0060882568359375, -0.2496337890625, 0.09503173828125, 0.0002777576446533203, 0.033203125, 0.04730224609375, 0.0501708984375, 0.073974609375, -0.002567291259765625, 0.01241302490234375, 0.0263519287109375, 0.0860595703125, 0.0277862548828125, -0.00739288330078125, -0.065673828125, -0.014892578125, 0.04376220703125, 0.1312255859375, -0.056396484375, 0.031463623046875, 0.03814697265625, 0.0198974609375, 0.034881591796875, 0.1578369140625, 0.01430511474609375, 0.01416015625, 0.002773284912109375, -0.016326904296875, 0.0034999847412109375, -0.0194549560546875, 0.002716064453125, -0.008392333984375, 0.03765869140625, 0.061370849609375, -0.0621337890625, -0.0433349609375, 0.06243896484375, 0.0015840530395507812, 0.0020084381103515625, -0.001331329345703125, -0.0053863525390625, -0.0428466796875, 0.0093231201171875, -0.053802490234375, 0.00577545166015625, 0.0733642578125, -0.006778717041015625, -0.061920166015625, -0.07012939453125, 0.026519775390625, 0.0079345703125, -0.01125335693359375, -0.01114654541015625, 0.0635986328125, -0.002521514892578125, -0.0283203125, -0.036773681640625, 0.04315185546875, 0.016448974609375, 0.02288818359375, -0.0022106170654296875, 0.022735595703125, -0.01192474365234375, -0.0675048828125, 0.06683349609375, 0.0009241104125976562]]
0.419733
3 10175124 The monster code: biology and the computer sciences. None
Abstract Embedding[]
0.416959
2 10203763 Bioinformatics, pharma and farmers. None
Abstract Embedding[]
0.397592
1 10191386 How will bioinformatics influence metabolic engineering? Ten microbial genomes have been fully sequenced to date, and the sequencing of many more genomes is expected to be completed before the end of the century. The assignment of function to open reading frames (ORFs) is progressing, and for some genomes over 70% of functional assignments have been made. The majority of the assigned ORFs relate to metabolic functions. Thus, the complete genetic and biochemical functions of a number of microbial cells may be soon available. From a metabolic engineering standpoint, these developments open a new realm of possibilities. Metabolic analysis and engineering strategies can now be built on a sound genomic basis. An important question now arises: how should these tasks be approached? Flux-balance analysis (FBA) has the potential to play an important role. It is based on the fundamental principle of mass conservation. It requires only the stoichiometric matrix, the metabolic demands, and some strain specific parameters. Importantly, no enzymatic kinetic data is required. In this article, we show how the genomically defined microbial metabolic genotypes can be analyzed by FBA. Fundamental concepts of metabolic genotype, metabolic phenotype, metabolic redundancy and robustness are defined and examples of their use given. We discuss the advantage of this approach, and how FBA is expected to find uses in the near future. FBA is likely to become an important analysis tool for genomically based approaches to metabolic engineering, strain design, and development.
Abstract Embedding[[-0.0275421142578125, 0.00980377197265625, -0.045135498046875, 0.00040531158447265625, 0.070556640625, -0.0007672309875488281, -0.03692626953125, 0.01983642578125, -0.048492431640625, -0.0031642913818359375, -0.00417327880859375, -0.11627197265625, 0.0122833251953125, -0.0229034423828125, -0.026458740234375, -0.01158905029296875, -0.035308837890625, 0.038482666015625, 0.034210205078125, -0.053985595703125, 0.064208984375, -0.003948211669921875, -0.039215087890625, 0.0034027099609375, 0.041900634765625, -0.0081024169921875, 0.007724761962890625, -0.023101806640625, -0.090087890625, -0.269287109375, 0.0227813720703125, 0.0263671875, 0.04595947265625, -0.030609130859375, -0.0133514404296875, 0.0307769775390625, -0.00766754150390625, -0.021484375, -0.041290283203125, 0.0777587890625, 0.00789642333984375, 0.0240020751953125, 0.038848876953125, 0.0293426513671875, -0.0117645263671875, -0.0198822021484375, -0.0278472900390625, 0.0406494140625, 0.03350830078125, -0.033721923828125, -0.04229736328125, -0.05712890625, 0.004974365234375, 0.0377197265625, -0.0335693359375, 0.0118560791015625, 0.04736328125, -0.01038360595703125, -0.0215606689453125, 0.006076812744140625, 0.045074462890625, -0.00821685791015625, -0.11993408203125, 0.0111846923828125, 0.05517578125, -0.00719451904296875, -0.030059814453125, -0.0272216796875, 0.043365478515625, 0.0196990966796875, -0.076904296875, 0.037109375, -0.0035190582275390625, -0.00994110107421875, 0.04302978515625, 0.041351318359375, 0.010833740234375, -0.0155487060546875, 0.005767822265625, 0.023651123046875, 0.00640869140625, 0.02825927734375, 0.01393890380859375, -0.0085296630859375, -0.060211181640625, 0.009429931640625, 0.01103973388671875, -0.02301025390625, 0.037078857421875, 0.052215576171875, -0.0107421875, -0.04791259765625, 0.024810791015625, 0.0296783447265625, -0.06573486328125, -0.02532958984375, 0.045135498046875, -0.035369873046875, 0.04638671875, 0.4306640625, -0.0404052734375, 0.05560302734375, 
0.0239410400390625, -0.0005412101745605469, 0.04559326171875, -0.043853759765625, 0.01271820068359375, 0.00499725341796875, 0.01181793212890625, -0.042144775390625, 0.0302734375, -0.0010786056518554688, -0.029052734375, 0.01210784912109375, 0.0120086669921875, -0.006988525390625, -0.0008611679077148438, 0.005725860595703125, -0.05523681640625, 0.036102294921875, -0.002079010009765625, -0.0205230712890625, -0.0014495849609375, 0.01184844970703125, 0.05853271484375, -0.0248870849609375, -0.005626678466796875, 0.0758056640625, 0.032135009765625, 0.0178070068359375, 0.055450439453125, 0.0748291015625, -0.0279541015625, -0.0005159378051757812, -0.054901123046875, 0.01416778564453125, -0.0081939697265625, -0.024444580078125, 0.01303863525390625, -0.0022602081298828125, -0.00620269775390625, 0.020172119140625, -0.037750244140625, -0.0943603515625, -0.10052490234375, 0.07745361328125, 0.0284576416015625, 0.047882080078125, -0.0175933837890625, -0.0281219482421875, 0.043243408203125, 0.054656982421875, 0.01403045654296875, -0.03173828125, 0.07354736328125, 0.07891845703125, -0.01336669921875, 0.0004892349243164062, -0.0126800537109375, 0.0063323974609375, -0.0970458984375, 0.014007568359375, 0.00905609130859375, 0.07373046875, 0.037445068359375, -0.0165863037109375, -0.0369873046875, 0.0197601318359375, -0.0069580078125, -0.01654052734375, 0.03436279296875, -0.0097808837890625, 0.040679931640625, -0.0218505859375, -0.018310546875, -0.0191192626953125, -0.00319671630859375, -0.0206146240234375, 0.034332275390625, 0.044036865234375, 0.02862548828125, 0.00530242919921875, -0.01340484619140625, 0.01290130615234375, 0.035308837890625, 0.020782470703125, -0.031768798828125, -0.0016336441040039062, -0.038848876953125, 0.037445068359375, -0.01739501953125, 0.0004849433898925781, -0.03216552734375, 0.09564208984375, -0.025787353515625, -0.02764892578125, 0.022369384765625, 0.037109375, -0.022216796875, -0.06268310546875, 0.022430419921875, 0.05517578125, 0.01454925537109375, 
0.0191192626953125, -0.0101165771484375, -0.03704833984375, -0.04229736328125, 0.0243682861328125, -0.00689697265625, 0.052947998046875, -0.06646728515625, -0.0041046142578125, -0.08184814453125, 0.019989013671875, -0.061004638671875, 0.0313720703125, 0.04034423828125, 0.00603485107421875, 0.0311126708984375, -0.0026798248291015625, 0.059112548828125, -0.02349853515625, -0.035919189453125, -0.31591796875, -0.032257080078125, 0.0111846923828125, -0.0291595458984375, -0.00402069091796875, -0.00827789306640625, -0.007389068603515625, -0.01526641845703125, -0.0293731689453125, -0.0103912353515625, 0.05206298828125, 0.05029296875, -0.05523681640625, -0.031585693359375, -0.039031982421875, 0.00677490234375, 0.0126495361328125, -0.0770263671875, -0.04736328125, -0.0017328262329101562, 0.0631103515625, -0.0113525390625, 0.0919189453125, 0.0032138824462890625, 0.058349609375, -0.0178680419921875, 0.092529296875, -0.03045654296875, 0.08966064453125, 0.0097198486328125, 0.007228851318359375, 0.0023632049560546875, 0.035308837890625, -0.028778076171875, -0.0124664306640625, -0.00994110107421875, 0.0189971923828125, -0.05633544921875, 0.024017333984375, 0.004299163818359375, -0.004222869873046875, 0.0243072509765625, -0.011688232421875, -0.130859375, -0.026031494140625, -0.0289154052734375, -0.03662109375, -0.0751953125, -0.009613037109375, -0.0270538330078125, -0.022369384765625, 0.01043701171875, 0.041229248046875, -0.0272216796875, 0.0157928466796875, -0.0222015380859375, 0.005443572998046875, -0.02886962890625, -0.057830810546875, 0.00928497314453125, -0.00563812255859375, 0.0193328857421875, -0.00672149658203125, 0.058837890625, 0.0079345703125, 0.0009250640869140625, -0.06488037109375, 0.0256195068359375, 0.0452880859375, -0.06427001953125, 0.0251617431640625, 0.07666015625, -0.0008788108825683594, 0.01558685302734375, 0.0301055908203125, 0.0215911865234375, 0.01800537109375, -0.06512451171875, -0.038787841796875, -0.01123809814453125, 0.047637939453125, 
-0.04412841796875, 0.03973388671875, -0.01392364501953125, 0.007640838623046875, 0.060760498046875, 0.0019588470458984375, -0.0296630859375, 0.005817413330078125, 0.042999267578125, 0.001377105712890625, -0.0196380615234375, -0.0193328857421875, -0.0369873046875, 0.0479736328125, 0.03643798828125, -0.2181396484375, 0.027801513671875, 0.032562255859375, 0.04840087890625, -0.0014085769653320312, 0.01007080078125, 0.07159423828125, -0.072998046875, 0.015716552734375, 0.0239715576171875, 0.06878662109375, 0.0092010498046875, 0.0307769775390625, 0.0406494140625, 0.04034423828125, 0.0404052734375, 0.0771484375, -0.0523681640625, 0.059356689453125, -0.040679931640625, 0.03826904296875, 0.007274627685546875, 0.1546630859375, -0.0134429931640625, 0.025146484375, 0.0055694580078125, -0.0253143310546875, 0.0281982421875, -0.024169921875, 0.0071563720703125, 0.021453857421875, -0.00467681884765625, 0.07464599609375, -0.024139404296875, -0.00130462646484375, 0.046173095703125, 0.031341552734375, -0.04608154296875, -0.043487548828125, -0.0252838134765625, 0.04217529296875, -0.02978515625, -0.0266571044921875, -0.041748046875, 0.059356689453125, -0.056549072265625, -0.092041015625, -0.07452392578125, -0.0213775634765625, 0.0037670135498046875, -0.0295257568359375, 0.01092529296875, 0.019683837890625, -0.001888275146484375, 0.00618743896484375, -0.028839111328125, -0.0280609130859375, -0.00981903076171875, -0.045166015625, -0.028045654296875, 0.0193328857421875, 0.0166473388671875, -0.06396484375, 0.08599853515625, -0.022064208984375]]
0.395189
0 10066490 Genomics and computational molecular biology. There has been a dramatic increase in the number of completely sequenced bacterial genomes during the past two years as a result of the efforts both of public genome agencies and the pharmaceutical industry. The availability of completely sequenced genomes permits more systematic analyses of genes, evolution and genome function than was otherwise possible. Using computational methods - which are used to identify genes and their functions including statistics, sequence similarity, motifs, profiles, protein folds and probabilistic models - it is possible to develop characteristic genome signatures, assign functions to genes, identify pathogenic genes, identify metabolic pathways, develop diagnostic probes and discover potential drug-binding sites. All of these directions are critical to understanding bacterial growth, pathogenicity and host-pathogen interactions.
Abstract Embedding[[-0.05792236328125, -0.0160369873046875, -0.038909912109375, -0.031280517578125, 0.05096435546875, -0.01432037353515625, -0.0167388916015625, 0.0762939453125, -0.002201080322265625, -0.004703521728515625, 0.007671356201171875, -0.023468017578125, 0.06329345703125, -0.032379150390625, -0.02178955078125, -0.006542205810546875, -0.052886962890625, 0.0234222412109375, 0.004009246826171875, -0.022735595703125, 0.034912109375, 0.03472900390625, -0.0192108154296875, -0.0517578125, 0.0248565673828125, 0.037322998046875, 0.0242462158203125, 0.019195556640625, -0.05242919921875, -0.2178955078125, -0.00045871734619140625, 0.01629638671875, 0.06622314453125, -0.044891357421875, 0.0118255615234375, -0.02520751953125, 0.032196044921875, -0.005840301513671875, -0.00463104248046875, -0.01122283935546875, 0.036163330078125, -0.0143585205078125, 0.059417724609375, -0.0093994140625, 0.0254974365234375, -0.00605010986328125, -0.01629638671875, 0.034942626953125, 0.042388916015625, -0.07562255859375, -0.07086181640625, -0.04571533203125, -0.02728271484375, 0.0654296875, -0.033477783203125, -0.01493072509765625, 0.012542724609375, -0.0103759765625, 0.0083160400390625, 0.003086090087890625, 0.022674560546875, -0.0006613731384277344, -0.094970703125, 0.056488037109375, 0.00299072265625, 0.043426513671875, 0.00077056884765625, -0.050628662109375, 0.08251953125, 0.07476806640625, -0.0587158203125, -0.0127716064453125, -0.0206298828125, 0.041534423828125, 0.016693115234375, 0.0301513671875, -0.0198974609375, -0.037139892578125, 0.040740966796875, -0.0183563232421875, 0.0264434814453125, -0.0270233154296875, 0.04998779296875, -0.022705078125, -0.058380126953125, 0.032318115234375, -0.00933074951171875, -0.0272979736328125, -0.0018110275268554688, 0.0217132568359375, 0.0401611328125, -0.02301025390625, 0.04486083984375, 0.00559234619140625, -0.085205078125, -0.035125732421875, 0.018035888671875, -0.03863525390625, 0.05279541015625, 0.3916015625, -0.1064453125, 
-0.026214599609375, -0.02215576171875, 0.03240966796875, 0.0196990966796875, -0.00556182861328125, 0.028564453125, -0.043731689453125, -0.0282135009765625, -0.019683837890625, -0.0270538330078125, 0.0164642333984375, -0.0092620849609375, -0.006031036376953125, 0.0115203857421875, 0.01235198974609375, -0.03546142578125, 0.0162353515625, -0.055938720703125, -0.006038665771484375, 0.0086822509765625, -0.0011529922485351562, 0.0284271240234375, 0.039703369140625, 0.024322509765625, 0.060150146484375, -0.042083740234375, 0.08526611328125, -0.0015583038330078125, 0.01157379150390625, 0.057708740234375, 0.04815673828125, -0.0526123046875, 0.0711669921875, 0.00997161865234375, 0.004283905029296875, -0.0289306640625, -0.031341552734375, 0.00655364990234375, 0.002246856689453125, -0.034149169921875, 0.0116729736328125, -0.011322021484375, -0.08154296875, -0.104736328125, 0.11004638671875, 0.0086822509765625, 0.041259765625, -0.027252197265625, -0.04766845703125, 0.040435791015625, 0.042877197265625, 0.0291748046875, -0.0394287109375, 0.00946807861328125, 0.07232666015625, 0.040130615234375, -0.05487060546875, 0.0081024169921875, 0.032806396484375, -0.1319580078125, 0.0215301513671875, -0.004711151123046875, 0.1365966796875, 0.0360107421875, -0.07086181640625, -0.006175994873046875, 0.009765625, 0.052520751953125, 0.0005292892456054688, -0.01386260986328125, 0.041748046875, -0.0005130767822265625, 0.0195465087890625, -0.05377197265625, -0.0005478858947753906, -0.036102294921875, 0.01114654541015625, 0.0029754638671875, 0.0350341796875, -0.00750732421875, -0.035797119140625, -0.03814697265625, 0.058990478515625, -0.0028247833251953125, -0.032470703125, -0.0247802734375, -0.0067138671875, -0.0212554931640625, 0.0158233642578125, -0.006023406982421875, 0.00931549072265625, -0.001964569091796875, -0.0242919921875, 0.01195526123046875, -0.057281494140625, 0.0003497600555419922, 0.051849365234375, -0.003772735595703125, -0.01276397705078125, 0.03265380859375, 0.04144287109375, 
-0.015655517578125, -0.03155517578125, 0.056732177734375, -0.0494384765625, 0.032806396484375, 0.00826263427734375, -0.01311492919921875, 0.042236328125, -0.04644775390625, -0.006458282470703125, -0.07244873046875, 0.010467529296875, -0.06768798828125, -0.016510009765625, 0.041229248046875, 0.056060791015625, 0.032440185546875, 0.01082611083984375, 0.027435302734375, 0.017791748046875, -0.041229248046875, -0.31494140625, -0.01139068603515625, -0.0340576171875, -0.0307159423828125, -0.006744384765625, 0.0181121826171875, -0.0318603515625, 0.0015735626220703125, -0.0194854736328125, -0.0007467269897460938, 0.02081298828125, 0.056854248046875, -0.0187530517578125, -0.0452880859375, -0.052947998046875, 0.03472900390625, 0.020050048828125, -0.0594482421875, -0.09368896484375, -0.01493072509765625, 0.03826904296875, 0.01491546630859375, 0.018585205078125, -0.03326416015625, 0.01122283935546875, -0.005733489990234375, 0.10858154296875, -0.021331787109375, 0.0860595703125, 0.01351165771484375, -0.00972747802734375, -0.0292510986328125, -0.022369384765625, -0.07440185546875, -0.007625579833984375, -0.00994873046875, 0.0187225341796875, -0.0364990234375, 0.046783447265625, -0.018829345703125, 0.00894927978515625, 0.0043487548828125, -0.0250091552734375, -0.00372314453125, -0.024200439453125, -0.0091400146484375, -0.0177459716796875, -0.05621337890625, 0.01166534423828125, 0.05328369140625, -0.008575439453125, 0.027191162109375, 0.0264892578125, 0.005458831787109375, -0.016387939453125, 0.0017385482788085938, -0.0066070556640625, -0.023590087890625, -0.0174713134765625, 0.0291748046875, 0.0157318115234375, 0.01287841796875, -0.0262603759765625, 0.0367431640625, 0.01052093505859375, -0.042083740234375, -0.0177001953125, 0.0360107421875, 0.07415771484375, -0.0677490234375, -0.04144287109375, 0.085205078125, -0.036041259765625, 0.004909515380859375, 0.0457763671875, 0.0114593505859375, -0.0121917724609375, -0.12457275390625, -0.046142578125, 0.0012054443359375, 0.0645751953125, 
-0.0289306640625, 0.0243988037109375, 0.0085906982421875, 0.0305633544921875, 0.01493072509765625, 0.01041412353515625, -0.007373809814453125, 0.01441192626953125, 0.0689697265625, -0.044830322265625, -0.04327392578125, 0.02264404296875, -0.02398681640625, -0.005817413330078125, 0.028472900390625, -0.206298828125, 0.0592041015625, -0.0184173583984375, 0.06646728515625, -0.0218658447265625, -0.01357269287109375, 0.0794677734375, -0.09429931640625, 0.01180267333984375, -0.0028896331787109375, 0.0469970703125, 0.0389404296875, 0.06610107421875, -0.048553466796875, 0.0238800048828125, 0.053955078125, 0.10931396484375, -0.032867431640625, 0.02557373046875, -0.0002510547637939453, 0.032196044921875, 0.05303955078125, 0.17919921875, -0.08087158203125, 0.06329345703125, 0.0135955810546875, 0.0198516845703125, -0.0243988037109375, -0.043701171875, -0.01261138916015625, 0.0533447265625, -0.0374755859375, 0.023895263671875, -0.0058135986328125, -0.03570556640625, 0.0570068359375, 0.08331298828125, -0.01739501953125, -0.032318115234375, -0.013275146484375, -0.01103973388671875, 0.0074920654296875, -0.056060791015625, -0.033416748046875, 0.07537841796875, -0.0133819580078125, -0.0175323486328125, -0.052520751953125, 0.00334930419921875, 0.03472900390625, -0.0806884765625, 0.048736572265625, 0.03619384765625, 0.03192138671875, -0.001506805419921875, -0.009857177734375, 0.01294708251953125, -0.0019817352294921875, -0.0231475830078125, -0.014892578125, 0.032623291015625, 0.00025725364685058594, -0.0574951171875, 0.11383056640625, -0.056488037109375]]
0.338992
" 445 | ], 446 | "text/plain": [ 447 | "" 448 | ] 449 | }, 450 | "metadata": {}, 451 | "output_type": "display_data" 452 | } 453 | ], 454 | "source": [ 455 | "query = \"bioinformatics\"\n", 456 | "print(\"Encoding query...\")\n", 457 | "query_embedding = model.encode([query])\n", 458 | "print(\"Query encoded.\")\n", 459 | "\n", 460 | "print(\"Retrieving examples by title similarity...\")\n", 461 | "scores, retrieved_examples = dataclysm_pubmed_title_indexed.get_nearest_examples('title_embedding', query_embedding, k=10)\n", 462 | "print(\"Examples retrieved.\")\n", 463 | "\n", 464 | "from IPython.display import display, HTML\n", 465 | "import pandas as pd\n", 466 | "\n", 467 | "# Convert retrieved examples to DataFrame\n", 468 | "df = pd.DataFrame(retrieved_examples)\n", 469 | "\n", 470 | "# Calculate similarity score in percentage\n", 471 | "df['similarity_score'] = scores\n", 472 | "\n", 473 | "\n", 474 | "# Drop 'title_embedding' and 'abstract_embedding' columns\n", 475 | "df = df.drop(columns=['title_embedding'])\n", 476 | "\n", 477 | "# Drop empty columns\n", 478 | "df = df.dropna(axis=1, how='all')\n", 479 | "\n", 480 | "# Create a collapsible element for 'abstract_embedding'\n", 481 | "df['abstract_embedding'] = df['abstract_embedding'].apply(lambda x: f'
<details><summary>Abstract Embedding</summary>{x}</details>
')\n", 482 | "# Sort by descending similarity score\n", 483 | "df = df.sort_values(by='similarity_score', ascending=False)\n", 484 | "\n", 485 | "# Display the DataFrame\n", 486 | "from IPython.display import Markdown, display\n", 487 | "display(Markdown(f'QUERY: **{query}**'))\n", 488 | "display(HTML(df.to_html(escape=False)))\n" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "# Wikipedia simple search (Title)\n", 496 | "Searches for Wikipedia articles whose titles are most similar to the query. Useful for looking up terms." 497 | ] 498 | }, 499 | { 500 | "cell_type": "code", 501 | "execution_count": null, 502 | "metadata": {}, 503 | "outputs": [], 504 | "source": [ 505 | "query = \"Retrieval Augmented Generation\"\n", 506 | "print(\"Encoding query...\")\n", 507 | "query_embedding = model.encode([query])\n", 508 | "print(\"Query encoded.\")\n", 509 | "\n", 510 | "print(\"Retrieving examples by title similarity...\")\n", 511 | "scores, retrieved_examples = dataclysm_wikipedia_indexed.get_nearest_examples('title_embedding', query_embedding, k=10)\n", 512 | "print(\"Examples retrieved.\")\n", 513 | "\n", 514 | "from IPython.display import display, HTML\n", 515 | "import pandas as pd\n", 516 | "\n", 517 | "# Convert retrieved examples to DataFrame\n", 518 | "df = pd.DataFrame(retrieved_examples)\n", 519 | "\n", 520 | "# Attach the similarity scores returned by the index\n", 521 | "df['similarity_score'] = scores\n", 522 | "\n", 523 | "\n", 524 | "# Drop the 'title_embedding' column\n", 525 | "df = df.drop(columns=['title_embedding'])\n", 526 | "\n", 527 | "# Drop empty columns\n", 528 | "df = df.dropna(axis=1, how='all')\n", 529 | "\n", 530 | "# Create a \"click to expand\" for the article text so it doesn't take up much space\n", 531 | "df['text'] = df['text'].apply(lambda x: f'
<details><summary>Article Text</summary>{x}</details>
')\n", 532 | "\n", 533 | "\n", 534 | "# Create a URL field with a hyperlink\n", 535 | "df['url'] = df['url'].apply(lambda x: f'<a href=\"{x}\">Link</a>')\n", 536 | "\n", 537 | "# Sort by descending similarity score\n", 538 | "df = df.sort_values(by='similarity_score', ascending=False)\n", 539 | "\n", 540 | "# Display the DataFrame\n", 541 | "from IPython.display import Markdown, display\n", 542 | "display(Markdown(f'QUERY: **{query}**'))\n", 543 | "display(HTML(df.to_html(escape=False)))\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "# Download OpenHermes-2.5-Mistral-7B" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "%pip install -U \"huggingface_hub[cli]\"\n", 560 | "!huggingface-cli download TheBloke/OpenHermes-2.5-Mistral-7B-GGUF openhermes-2.5-mistral-7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "Bigger model:" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [], 575 | "source": [ 576 | "%pip install -U \"huggingface_hub[cli]\"\n", 577 | "!huggingface-cli download TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF nous-hermes-2-mixtral-8x7b-dpo.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 13, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "name": "stderr", 587 | "output_type": "stream", 588 | "text": [ 589 | "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\n", 590 | "To disable this warning, you can either:\n", 591 | "\t- Avoid using `tokenizers` before the fork if possible\n", 592 | "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" 593 | ] 594 | }, 595 | { 596 | "name": "stdout", 597 | "output_type": "stream", 598 | "text": [ 599 | "Requirement already satisfied: llama-cpp-python in ./.conda/lib/python3.10/site-packages (0.2.44)\n", 600 | "Requirement already satisfied: typing-extensions>=4.5.0 in ./.conda/lib/python3.10/site-packages (from llama-cpp-python) (4.9.0)\n", 601 | "Requirement already satisfied: numpy>=1.20.0 in ./.conda/lib/python3.10/site-packages (from llama-cpp-python) (1.26.3)\n", 602 | "Requirement already satisfied: diskcache>=5.6.1 in ./.conda/lib/python3.10/site-packages (from llama-cpp-python) (5.6.3)\n", 603 | "Requirement already satisfied: jinja2>=2.11.3 in ./.conda/lib/python3.10/site-packages (from llama-cpp-python) (3.1.3)\n", 604 | "Requirement already satisfied: MarkupSafe>=2.0 in ./.conda/lib/python3.10/site-packages (from jinja2>=2.11.3->llama-cpp-python) (2.1.3)\n" 605 | ] 606 | } 607 | ], 608 | "source": [ 609 | "!CMAKE_ARGS=\"-DLLAMA_METAL=on\" pip install -U llama-cpp-python --no-cache-dir" 610 | ] 611 | }, 612 | { 613 | "cell_type": "markdown", 614 | "metadata": {}, 615 | "source": [ 616 | "# Retrieval Augmented Generation" 617 | ] 618 | }, 619 | { 620 | "cell_type": "code", 621 | "execution_count": 14, 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "ename": "ValueError", 626 | "evalue": "Model path does not exist: openhermes-2.5-mistral-7b.Q4_K_M.gguf", 627 | "output_type": "error", 628 | "traceback": [ 629 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 630 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", 631 | "Cell \u001b[0;32mIn[14], line 11\u001b[0m\n\u001b[1;32m 7\u001b[0m model \u001b[38;5;241m=\u001b[39m 
\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mopenhermes-2.5-mistral-7b.Q4_K_M.gguf\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 8\u001b[0m prompt \u001b[38;5;241m=\u001b[39m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mdf[[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mid\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;250m \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtitle\u001b[39m\u001b[38;5;124m'\u001b[39m,\u001b[38;5;250m \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mabstract\u001b[39m\u001b[38;5;124m'\u001b[39m]]\u001b[38;5;241m.\u001b[39mto_html(escape\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m ### Instruction: Use the information above to answer the query: EXPLAIN \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mquery\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m ### Response:\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m---> 11\u001b[0m llm \u001b[38;5;241m=\u001b[39m \u001b[43mLlama\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel_path\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mn_ctx\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m8096\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlast_n_tokens_size\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m256\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mn_threads\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m4\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mn_gpu_layers\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[1;32m 13\u001b[0m stream \u001b[38;5;241m=\u001b[39m llm\u001b[38;5;241m.\u001b[39mcreate_completion(prompt, stream\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m, repeat_penalty\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1.1\u001b[39m, 
max_tokens\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m256\u001b[39m, stop\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m], echo\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m, temperature\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m, mirostat_mode \u001b[38;5;241m=\u001b[39m \u001b[38;5;241m2\u001b[39m, mirostat_tau\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m4.0\u001b[39m, mirostat_eta\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m1.1\u001b[39m)\n\u001b[1;32m 14\u001b[0m result \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", 632 | "File \u001b[0;32m~/Repos/dataclysm/.conda/lib/python3.10/site-packages/llama_cpp/llama.py:298\u001b[0m, in \u001b[0;36mLlama.__init__\u001b[0;34m(self, model_path, n_gpu_layers, split_mode, main_gpu, tensor_split, vocab_only, use_mmap, use_mlock, kv_overrides, seed, n_ctx, n_batch, n_threads, n_threads_batch, rope_scaling_type, rope_freq_base, rope_freq_scale, yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow, yarn_orig_ctx, mul_mat_q, logits_all, embedding, offload_kqv, last_n_tokens_size, lora_base, lora_scale, lora_path, numa, chat_format, chat_handler, draft_model, tokenizer, verbose, **kwargs)\u001b[0m\n\u001b[1;32m 295\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mlora_path \u001b[38;5;241m=\u001b[39m lora_path\n\u001b[1;32m 297\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m os\u001b[38;5;241m.\u001b[39mpath\u001b[38;5;241m.\u001b[39mexists(model_path):\n\u001b[0;32m--> 298\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mModel path does not exist: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mmodel_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 300\u001b[0m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_model \u001b[38;5;241m=\u001b[39m _LlamaModel(\n\u001b[1;32m 301\u001b[0m path_model\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmodel_path, params\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mmodel_params, verbose\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mverbose\n\u001b[1;32m 302\u001b[0m )\n\u001b[1;32m 304\u001b[0m \u001b[38;5;66;03m# Override tokenizer\u001b[39;00m\n", 633 | "\u001b[0;31mValueError\u001b[0m: Model path does not exist: openhermes-2.5-mistral-7b.Q4_K_M.gguf" 634 | ] 635 | } 636 | ], 637 | "source": [ 638 | "from llama_cpp import Llama\n", 639 | "from llama_cpp import LlamaGrammar\n", 640 | "import pandas as pd\n", 641 | "import json\n", 642 | "import httpx\n", 643 | "\n", 644 | "model = \"openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n", 645 | "prompt = f\"{df[['id', 'title', 'abstract']].to_html(escape=False)} ### Instruction: Use the information above to answer the query: EXPLAIN {query} ### Response:\"\n", 646 | "\n", 647 | "\n", 648 | "llm = Llama(model_path=model, n_ctx=8096, last_n_tokens_size=256, n_threads=4, n_gpu_layers=0)\n", 649 | "\n", 650 | "stream = llm.create_completion(prompt, stream=True, repeat_penalty=1.1, max_tokens=256, stop=[\"\\n\"], echo=False, temperature=0, mirostat_mode = 2, mirostat_tau=4.0, mirostat_eta=1.1)\n", 651 | "result = \"\"\n", 652 | "for output in stream:\n", 653 | " result += output['choices'][0]['text']\n", 654 | "\n", 655 | "print(result)" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "# Rerank results using an LLM (experimental)\n", 663 | "This uses a llama.cpp grammar to constrain the LLM's output to a JSON array, and instructs the LLM to rerank the results and drop irrelevant ones. It may or may not work." 
664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": {}, 670 | "outputs": [], 671 | "source": [ 672 | "from llama_cpp import Llama\n", 673 | "from llama_cpp import LlamaGrammar\n", 674 | "import pandas as pd\n", 675 | "import json\n", 676 | "import httpx\n", 677 | "grammar_text = httpx.get(\"https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json_arr.gbnf\").text\n", 678 | "grammar = LlamaGrammar.from_string(grammar_text)\n", 679 | "\n", 680 | "model = \"openhermes-2.5-mistral-7b.Q4_K_M.gguf\"\n", 681 | "prompt = f\"\"\"You are an expert at generating valid JSON.\n", 682 | "###\n", 683 | "Instruction:\n", 684 | "Return a valid JSON Array containing arXiv ['id'] field reranked according to how relevant the result is to the query based on its other columns at that ['id']. Drop any items that are not relevant to the query. Return just an array of the IDs, like [x,y,z] and so on in the correct order:\n", 685 | " INDEX: {df[['id', 'title', 'abstract']].to_html(escape=False)}\n", 686 | " QUERY: {query}\n", 687 | " Take a deep breath, and solve the problem step-by-step.\n", 688 | "###\n", 689 | "Response:\"\"\"\n", 690 | "\n", 691 | "\n", 692 | "llm = Llama(model_path=model, n_ctx=8096, last_n_tokens_size=256, n_threads=4, n_gpu_layers=0)\n", 693 | "\n", 694 | " \n", 695 | "stream = llm.create_completion(prompt, stream=True, repeat_penalty=1.1, max_tokens=256, stop=[\"]\"], echo=False, temperature=0, mirostat_mode = 2, mirostat_tau=4.0, mirostat_eta=1.1, grammar=grammar)\n", 696 | "result = \"\"\n", 697 | "for output in stream:\n", 698 | " result += output['choices'][0]['text']\n", 699 | "\n", 700 | "result = result + \"]\"\n", 701 | "\n", 702 | "# Check if the result is a string, an array string, or a single ID in an array and convert it to a list of IDs\n", 703 | "if isinstance(result, str):\n", 704 | " result_ids = [result.strip('[]')]\n", 705 | "elif isinstance(result, list):\n", 706 | " if 
isinstance(result[0], str):\n", 707 | " result_ids = [json.loads(res) for res in result]\n", 708 | " else:\n", 709 | " result_ids = result\n", 710 | "# Print the result\n", 711 | "print(result_ids)\n", 712 | "import re\n", 713 | "\n", 714 | "# Extract IDs from the potentially broken string using regex\n", 715 | "result_ids = re.findall(r'\"(.*?)\"', result_ids[0])\n", 716 | "\n", 717 | "# Filter the dataframe to only include rows with IDs in the result (copy to avoid modifying a view of df)\n", 718 | "filtered_df = df[df['id'].isin(result_ids)].copy()\n", 719 | "\n", 720 | "# Create a categorical type for sorting based on the order in result_ids\n", 721 | "filtered_df['id'] = pd.Categorical(filtered_df['id'], categories=result_ids, ordered=True)\n", 722 | "\n", 723 | "# Sort the dataframe based on the 'id' column\n", 724 | "filtered_df = filtered_df.sort_values('id')\n", 725 | "\n", 726 | "# Drop the similarity score column\n", 727 | "filtered_df = filtered_df.drop(columns=['similarity_score'])\n", 728 | "\n", 729 | "# Display the filtered dataframe as a table with hyperlinks\n", 730 | "display(HTML(filtered_df.to_html(escape=False)))\n" 731 | ] 732 | } 733 | ], 734 | "metadata": { 735 | "kernelspec": { 736 | "display_name": "Python 3", 737 | "language": "python", 738 | "name": "python3" 739 | }, 740 | "language_info": { 741 | "codemirror_mode": { 742 | "name": "ipython", 743 | "version": 3 744 | }, 745 | "file_extension": ".py", 746 | "mimetype": "text/x-python", 747 | "name": "python", 748 | "nbconvert_exporter": "python", 749 | "pygments_lexer": "ipython3", 750 | "version": "3.1.0" 751 | } 752 | }, 753 | "nbformat": 4, 754 | "nbformat_minor": 2 755 | } 756 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | accelerate==0.25.0 2 | aiofiles==23.2.1 3 | aiohttp==3.9.1 4 | aiosignal==1.3.1 5 | annotated-types==0.6.0 6 | anyio==4.2.0 7 | apache-beam==2.52.0 8 | appdirs==1.4.4 9 | 
appnope==0.1.3 10 | asgiref==3.7.2 11 | astor==0.8.1 12 | asttokens==2.4.1 13 | attrs==23.2.0 14 | backoff==2.2.1 15 | beautifulsoup4==4.12.2 16 | bitsandbytes==0.42.0 17 | blessed==1.20.0 18 | boto==2.49.0 19 | build==1.0.3 20 | CacheControl==0.13.1 21 | cachetools==5.3.2 22 | certifi==2023.11.17 23 | charset-normalizer==3.3.2 24 | ci-info==0.3.0 25 | cleo==2.1.0 26 | click==8.1.7 27 | cloudpickle==2.2.1 28 | colorama==0.4.6 29 | comm==0.2.0 30 | configobj==5.0.8 31 | configparser==6.0.0 32 | contourpy==1.2.0 33 | crashtest==0.4.1 34 | crcmod==1.7 35 | cryptography==41.0.7 36 | cycler==0.12.1 37 | dataclasses==0.6 38 | dataclasses-json==0.6.3 39 | datasets==2.16.1 40 | debugpy==1.8.0 41 | decorator==5.1.1 42 | Deprecated==1.2.14 43 | dill==0.3.7 44 | diskcache==5.6.3 45 | distlib==0.3.8 46 | distro==1.9.0 47 | dnspython==2.4.2 48 | docarray==0.40.0 49 | docker==7.0.0 50 | docker-pycreds==0.4.0 51 | docopt==0.6.2 52 | 53 | dulwich==0.21.7 54 | ecdsa==0.18.0 55 | editor==1.6.5 56 | etelemetry==0.3.1 57 | executing==2.0.1 58 | fastapi==0.108.0 59 | fastavro==1.9.2 60 | fasteners==0.19 61 | fastjsonschema==2.19.1 62 | filelock==3.13.1 63 | fitz==0.0.1.dev2 64 | fonttools==4.47.0 65 | frontend==0.0.3 66 | frozenlist==1.4.1 67 | fsspec==2023.10.0 68 | future==0.18.3 69 | gcs-oauth2-boto-plugin==3.0 70 | git-python==1.0.3 71 | gitdb==4.0.11 72 | GitPython==3.1.40 73 | google-apitools==0.5.32 74 | google-auth==2.26.2 75 | google-reauth==0.1.1 76 | googleapis-common-protos==1.62.0 77 | greenlet==3.0.3 78 | grpcio==1.57.0 79 | grpcio-health-checking==1.57.0 80 | grpcio-reflection==1.57.0 81 | gsutil==5.27 82 | h11==0.14.0 83 | hdfs==2.7.3 84 | hf_transfer==0.1.4 85 | html2image==2.0.4.3 86 | httpcore==1.0.2 87 | httplib2==0.20.4 88 | httptools==0.6.1 89 | httpx==0.26.0 90 | huggingface-hub==0.20.1 91 | idna==3.6 92 | importlib-metadata==6.11.0 93 | inquirer==3.2.1 94 | installer==0.7.0 95 | ipykernel==6.28.0 96 | ipython==8.19.0 97 | isodate==0.6.1 98 | itsdangerous==2.1.2 
99 | jaraco.classes==3.3.0 100 | jcloud==0.3 101 | jedi==0.19.1 102 | jina==3.23.2 103 | jina-hubble-sdk==0.39.0 104 | Jinja2==3.1.2 105 | joblib==1.3.2 106 | Js2Py==0.74 107 | jsonschema==4.20.0 108 | jsonschema-specifications==2023.12.1 109 | jupyter_client==8.6.0 110 | jupyter_core==5.5.1 111 | keyring==24.3.0 112 | kiwisolver==1.4.5 113 | litellm==1.16.19 114 | llama-index==0.9.24 115 | llama_cpp_python==0.2.26 116 | looseversion==1.3.0 117 | lxml==5.0.0 118 | markdown-it-py==3.0.0 119 | MarkupSafe==2.1.3 120 | marshmallow==3.20.1 121 | matplotlib==3.8.2 122 | matplotlib-inline==0.1.6 123 | mdurl==0.1.2 124 | mlx==0.0.6 125 | monotonic==1.6 126 | more-itertools==10.1.0 127 | MouseInfo==0.1.3 128 | mpmath==1.3.0 129 | msgpack==1.0.7 130 | multidict==6.0.4 131 | multiprocess==0.70.15 132 | mwparserfromhell==0.6.5 133 | mypy-extensions==1.0.0 134 | nest-asyncio==1.5.8 135 | networkx==3.2.1 136 | nibabel==5.2.0 137 | nipype==1.8.6 138 | nltk==3.8.1 139 | numpy==1.26.2 140 | oauth2client==4.1.3 141 | objsize==0.6.1 142 | open-interpreter==0.2.0 143 | openai==1.6.1 144 | opencv-python==4.9.0.80 145 | opentelemetry-api==1.19.0 146 | opentelemetry-exporter-otlp==1.19.0 147 | opentelemetry-exporter-otlp-proto-common==1.19.0 148 | opentelemetry-exporter-otlp-proto-grpc==1.19.0 149 | opentelemetry-exporter-otlp-proto-http==1.19.0 150 | opentelemetry-exporter-prometheus==0.41b0 151 | opentelemetry-instrumentation==0.40b0 152 | opentelemetry-instrumentation-aiohttp-client==0.40b0 153 | opentelemetry-instrumentation-asgi==0.40b0 154 | opentelemetry-instrumentation-fastapi==0.40b0 155 | opentelemetry-instrumentation-grpc==0.40b0 156 | opentelemetry-proto==1.19.0 157 | opentelemetry-sdk==1.19.0 158 | opentelemetry-semantic-conventions==0.40b0 159 | opentelemetry-util-http==0.40b0 160 | orjson==3.9.10 161 | packaging==23.2 162 | pandas==2.1.4 163 | parso==0.8.3 164 | pathlib==1.0.1 165 | pathspec==0.12.1 166 | pdfminer.six==20221105 167 | pdfplumber==0.10.3 168 | peft==0.7.1 
169 | pexpect==4.9.0 170 | Pillow==10.1.0 171 | pkginfo==1.9.6 172 | platformdirs==4.0.0 173 | plyer==2.1.0 174 | poetry==1.7.1 175 | poetry-core==1.8.1 176 | poetry-plugin-export==1.6.0 177 | posthog==3.1.0 178 | pretty-traceback==2023.1020 179 | prometheus-client==0.19.0 180 | prompt-toolkit==3.0.43 181 | proto-plus==1.23.0 182 | protobuf==4.25.1 183 | prov==2.0.0 184 | psutil==5.9.7 185 | ptyprocess==0.7.0 186 | pure-eval==0.2.2 187 | pyarrow==11.0.0 188 | pyarrow-hotfix==0.6 189 | pyasn1==0.5.1 190 | pyasn1-modules==0.3.0 191 | PyAutoGUI==0.9.54 192 | pydantic==2.5.3 193 | pydantic-settings==2.1.0 194 | pydantic_core==2.14.6 195 | pydot==1.4.2 196 | PyGetWindow==0.0.9 197 | Pygments==2.17.2 198 | pyjsparser==2.7.1 199 | PyMonCtl==0.7 200 | pymongo==4.6.1 201 | PyMsgBox==1.0.9 202 | pyobjc==10.1 203 | pyobjc-core==10.1 204 | pyobjc-framework-Accessibility==10.1 205 | pyobjc-framework-Accounts==10.1 206 | pyobjc-framework-AddressBook==10.1 207 | pyobjc-framework-AdServices==10.1 208 | pyobjc-framework-AdSupport==10.1 209 | pyobjc-framework-AppleScriptKit==10.1 210 | pyobjc-framework-AppleScriptObjC==10.1 211 | pyobjc-framework-ApplicationServices==10.1 212 | pyobjc-framework-AppTrackingTransparency==10.1 213 | pyobjc-framework-AudioVideoBridging==10.1 214 | pyobjc-framework-AuthenticationServices==10.1 215 | pyobjc-framework-AutomaticAssessmentConfiguration==10.1 216 | pyobjc-framework-Automator==10.1 217 | pyobjc-framework-AVFoundation==10.1 218 | pyobjc-framework-AVKit==10.1 219 | pyobjc-framework-AVRouting==10.1 220 | pyobjc-framework-BackgroundAssets==10.1 221 | pyobjc-framework-BusinessChat==10.1 222 | pyobjc-framework-CalendarStore==10.1 223 | pyobjc-framework-CallKit==10.1 224 | pyobjc-framework-CFNetwork==10.1 225 | pyobjc-framework-Cinematic==10.1 226 | pyobjc-framework-ClassKit==10.1 227 | pyobjc-framework-CloudKit==10.1 228 | pyobjc-framework-Cocoa==10.1 229 | pyobjc-framework-Collaboration==10.1 230 | pyobjc-framework-ColorSync==10.1 231 | 
pyobjc-framework-Contacts==10.1 232 | pyobjc-framework-ContactsUI==10.1 233 | pyobjc-framework-CoreAudio==10.1 234 | pyobjc-framework-CoreAudioKit==10.1 235 | pyobjc-framework-CoreBluetooth==10.1 236 | pyobjc-framework-CoreData==10.1 237 | pyobjc-framework-CoreHaptics==10.1 238 | pyobjc-framework-CoreLocation==10.1 239 | pyobjc-framework-CoreMedia==10.1 240 | pyobjc-framework-CoreMediaIO==10.1 241 | pyobjc-framework-CoreMIDI==10.1 242 | pyobjc-framework-CoreML==10.1 243 | pyobjc-framework-CoreMotion==10.1 244 | pyobjc-framework-CoreServices==10.1 245 | pyobjc-framework-CoreSpotlight==10.1 246 | pyobjc-framework-CoreText==10.1 247 | pyobjc-framework-CoreWLAN==10.1 248 | pyobjc-framework-CryptoTokenKit==10.1 249 | pyobjc-framework-DataDetection==10.1 250 | pyobjc-framework-DeviceCheck==10.1 251 | pyobjc-framework-DictionaryServices==10.1 252 | pyobjc-framework-DiscRecording==10.1 253 | pyobjc-framework-DiscRecordingUI==10.1 254 | pyobjc-framework-DiskArbitration==10.1 255 | pyobjc-framework-DVDPlayback==10.1 256 | pyobjc-framework-EventKit==10.1 257 | pyobjc-framework-ExceptionHandling==10.1 258 | pyobjc-framework-ExecutionPolicy==10.1 259 | pyobjc-framework-ExtensionKit==10.1 260 | pyobjc-framework-ExternalAccessory==10.1 261 | pyobjc-framework-FileProvider==10.1 262 | pyobjc-framework-FileProviderUI==10.1 263 | pyobjc-framework-FinderSync==10.1 264 | pyobjc-framework-FSEvents==10.1 265 | pyobjc-framework-GameCenter==10.1 266 | pyobjc-framework-GameController==10.1 267 | pyobjc-framework-GameKit==10.1 268 | pyobjc-framework-GameplayKit==10.1 269 | pyobjc-framework-HealthKit==10.1 270 | pyobjc-framework-ImageCaptureCore==10.1 271 | pyobjc-framework-InputMethodKit==10.1 272 | pyobjc-framework-InstallerPlugins==10.1 273 | pyobjc-framework-InstantMessage==10.1 274 | pyobjc-framework-Intents==10.1 275 | pyobjc-framework-IntentsUI==10.1 276 | pyobjc-framework-IOBluetooth==10.1 277 | pyobjc-framework-IOBluetoothUI==10.1 278 | pyobjc-framework-IOSurface==10.1 279 | 
pyobjc-framework-iTunesLibrary==10.1 280 | pyobjc-framework-KernelManagement==10.1 281 | pyobjc-framework-LatentSemanticMapping==10.1 282 | pyobjc-framework-LaunchServices==10.1 283 | pyobjc-framework-libdispatch==10.1 284 | pyobjc-framework-libxpc==10.1 285 | pyobjc-framework-LinkPresentation==10.1 286 | pyobjc-framework-LocalAuthentication==10.1 287 | pyobjc-framework-LocalAuthenticationEmbeddedUI==10.1 288 | pyobjc-framework-MailKit==10.1 289 | pyobjc-framework-MapKit==10.1 290 | pyobjc-framework-MediaAccessibility==10.1 291 | pyobjc-framework-MediaLibrary==10.1 292 | pyobjc-framework-MediaPlayer==10.1 293 | pyobjc-framework-MediaToolbox==10.1 294 | pyobjc-framework-Metal==10.1 295 | pyobjc-framework-MetalFX==10.1 296 | pyobjc-framework-MetalKit==10.1 297 | pyobjc-framework-MetalPerformanceShaders==10.1 298 | pyobjc-framework-MetalPerformanceShadersGraph==10.1 299 | pyobjc-framework-MetricKit==10.1 300 | pyobjc-framework-MLCompute==10.1 301 | pyobjc-framework-ModelIO==10.1 302 | pyobjc-framework-MultipeerConnectivity==10.1 303 | pyobjc-framework-NaturalLanguage==10.1 304 | pyobjc-framework-NetFS==10.1 305 | pyobjc-framework-Network==10.1 306 | pyobjc-framework-NetworkExtension==10.1 307 | pyobjc-framework-NotificationCenter==10.1 308 | pyobjc-framework-OpenDirectory==10.1 309 | pyobjc-framework-OSAKit==10.1 310 | pyobjc-framework-OSLog==10.1 311 | pyobjc-framework-PassKit==10.1 312 | pyobjc-framework-PencilKit==10.1 313 | pyobjc-framework-PHASE==10.1 314 | pyobjc-framework-Photos==10.1 315 | pyobjc-framework-PhotosUI==10.1 316 | pyobjc-framework-PreferencePanes==10.1 317 | pyobjc-framework-PushKit==10.1 318 | pyobjc-framework-Quartz==10.1 319 | pyobjc-framework-QuickLookThumbnailing==10.1 320 | pyobjc-framework-ReplayKit==10.1 321 | pyobjc-framework-SafariServices==10.1 322 | pyobjc-framework-SafetyKit==10.1 323 | pyobjc-framework-SceneKit==10.1 324 | pyobjc-framework-ScreenCaptureKit==10.1 325 | pyobjc-framework-ScreenSaver==10.1 326 | 
pyobjc-framework-ScreenTime==10.1 327 | pyobjc-framework-ScriptingBridge==10.1 328 | pyobjc-framework-SearchKit==10.1 329 | pyobjc-framework-Security==10.1 330 | pyobjc-framework-SecurityFoundation==10.1 331 | pyobjc-framework-SecurityInterface==10.1 332 | pyobjc-framework-SensitiveContentAnalysis==10.1 333 | pyobjc-framework-ServiceManagement==10.1 334 | pyobjc-framework-SharedWithYou==10.1 335 | pyobjc-framework-SharedWithYouCore==10.1 336 | pyobjc-framework-ShazamKit==10.1 337 | pyobjc-framework-Social==10.1 338 | pyobjc-framework-SoundAnalysis==10.1 339 | pyobjc-framework-Speech==10.1 340 | pyobjc-framework-SpriteKit==10.1 341 | pyobjc-framework-StoreKit==10.1 342 | pyobjc-framework-Symbols==10.1 343 | pyobjc-framework-SyncServices==10.1 344 | pyobjc-framework-SystemConfiguration==10.1 345 | pyobjc-framework-SystemExtensions==10.1 346 | pyobjc-framework-ThreadNetwork==10.1 347 | pyobjc-framework-UniformTypeIdentifiers==10.1 348 | pyobjc-framework-UserNotifications==10.1 349 | pyobjc-framework-UserNotificationsUI==10.1 350 | pyobjc-framework-VideoSubscriberAccount==10.1 351 | pyobjc-framework-VideoToolbox==10.1 352 | pyobjc-framework-Virtualization==10.1 353 | pyobjc-framework-Vision==10.1 354 | pyobjc-framework-WebKit==10.1 355 | pyopencl==2023.1.4 356 | pyOpenSSL==23.3.0 357 | pypandoc==1.12 358 | pyparsing==3.1.1 359 | pypdf==3.17.4 360 | PyPDF2==3.0.1 361 | pypdfium2==4.25.0 362 | pyperclip==1.8.2 363 | pyproject_hooks==1.0.0 364 | PyRect==0.2.0 365 | PyScreeze==0.1.30 366 | pytesseract==0.3.10 367 | python-dateutil==2.8.2 368 | python-dotenv==1.0.0 369 | python-jose==3.3.0 370 | python-multipart==0.0.6 371 | pytils==0.4.1 372 | pytools==2023.1.1 373 | pytweening==1.0.7 374 | pytz==2023.3.post1 375 | pyu2f==0.1.5 376 | PyWinBox==0.6 377 | PyWinCtl==0.3 378 | pyxnat==1.6 379 | PyYAML==6.0.1 380 | pyzmq==25.1.2 381 | rapidfuzz==3.6.1 382 | ray==2.9.0 383 | rdflib==7.0.0 384 | readchar==4.0.5 385 | referencing==0.32.0 386 | regex==2023.12.25 387 | 
requests==2.31.0 388 | requests-toolbelt==1.0.0 389 | retry-decorator==1.1.1 390 | rich==13.7.0 391 | rpds-py==0.16.2 392 | rsa==4.7.2 393 | rubicon-objc==0.4.7 394 | runs==1.2.0 395 | safetensors==0.4.1 396 | scikit-learn==1.3.2 397 | scipy==1.11.4 398 | sentence-transformers==2.2.2 399 | sentencepiece==0.1.99 400 | sentry-sdk==1.39.2 401 | setproctitle==1.3.3 402 | shellingham==1.5.4 403 | simplejson==3.19.2 404 | six==1.16.0 405 | smmap==5.0.1 406 | sniffio==1.3.0 407 | soupsieve==2.5 408 | SQLAlchemy==2.0.24 409 | sse-starlette==1.8.2 410 | stack-data==0.6.3 411 | starlette==0.32.0.post1 412 | starlette-context==0.3.6 413 | sympy==1.12 414 | tenacity==8.2.3 415 | threadpoolctl==3.2.0 416 | tiktoken==0.4.0 417 | tinygrad==0.7.0 418 | tokenizers==0.15.0 419 | tokentrim==0.1.13 420 | toml==0.10.2 421 | tomlkit==0.12.3 422 | tools==0.1.9 423 | torch==2.3.0.dev20240101 424 | torchaudio==2.2.0.dev20240101 425 | torchvision==0.18.0.dev20240101 426 | tornado==6.4 427 | tqdm==4.66.1 428 | traitlets==5.14.0 429 | traits==6.3.2 430 | transformers==4.36.2 431 | trove-classifiers==2023.11.29 432 | types-requests==2.31.0.6 433 | types-urllib3==1.26.25.14 434 | typing-inspect==0.9.0 435 | typing_extensions==4.9.0 436 | tzdata==2023.4 437 | tzlocal==5.2 438 | urllib3==2.1.0 439 | uvicorn==0.24.0.post1 440 | uvloop==0.19.0 441 | virtualenv==20.25.0 442 | wandb==0.16.2 443 | watchfiles==0.21.0 444 | wcwidth==0.2.12 445 | websocket-client==1.7.0 446 | websockets==12.0 447 | wget==3.2 448 | wrapt==1.16.0 449 | xattr==0.10.1 450 | xmod==1.8.1 451 | xxhash==3.4.1 452 | yarl==1.9.4 453 | youtube-dl==2021.12.17 454 | zipp==3.17.0 455 | zstandard==0.22.0 456 | -------------------------------------------------------------------------------- /streamlit-demo/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/somewheresystems/dataclysm/419976fe2cd9d31f1189ec9c334f7e8021f43522/streamlit-demo/.DS_Store 
-------------------------------------------------------------------------------- /streamlit-demo/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/somewheresystems/dataclysm/419976fe2cd9d31f1189ec9c334f7e8021f43522/streamlit-demo/README.md -------------------------------------------------------------------------------- /streamlit-demo/app.log: -------------------------------------------------------------------------------- 1 | root - INFO - Loading dataset... 2 | root - INFO - Converting to pandas dataframe... 3 | faiss.loader - INFO - Loading faiss. 4 | faiss.loader - INFO - Successfully loaded faiss. 5 | root - INFO - Performing t-SNE... 6 | root - INFO - Performing HDBSCAN clustering... 7 | -------------------------------------------------------------------------------- /streamlit-demo/app.py: -------------------------------------------------------------------------------- 1 | # Import necessary libraries 2 | import streamlit as st 3 | import pandas as pd 4 | import numpy as np 5 | from sklearn.manifold import TSNE 6 | from datasets import load_dataset, Dataset 7 | from sklearn.cluster import KMeans 8 | import plotly.graph_objects as go 9 | import time, random, datetime 10 | import logging 11 | from sklearn.cluster import HDBSCAN 12 | 13 | BACKGROUND_COLOR = 'black' 14 | COLOR = 'white' 15 | 16 | def set_page_container_style( 17 | max_width: int = 10000, max_width_100_percent: bool = False, 18 | padding_top: int = 1, padding_right: int = 10, padding_left: int = 1, padding_bottom: int = 10, 19 | color: str = COLOR, background_color: str = BACKGROUND_COLOR, 20 | ): 21 | if max_width_100_percent: 22 | max_width_str = f'max-width: 100%;' 23 | else: 24 | max_width_str = f'max-width: {max_width}px;' 25 | st.markdown( 26 | f''' 27 | 43 | ''', 44 | unsafe_allow_html=True, 45 | ) 46 | 47 | # Additional libraries for querying 48 | from FlagEmbedding import FlagModel 49 | 50 | # Global variables 
and dataset loading 51 | global dataset_name 52 | st.set_page_config(layout="wide") 53 | 54 | dataset_name = "somewheresystems/dataclysm-arxiv" 55 | 56 | set_page_container_style( 57 | max_width = 1600, max_width_100_percent = True, 58 | padding_top = 0, padding_right = 10, padding_left = 5, padding_bottom = 10 59 | ) 60 | st.session_state.dataclysm_arxiv = load_dataset(dataset_name, split="train") 61 | total_samples = len(st.session_state.dataclysm_arxiv) 62 | 63 | logging.basicConfig(filename='app.log', filemode='w', format='%(name)s - %(levelname)s - %(message)s', level=logging.INFO) 64 | # Load the dataset once at the start 65 | # Initialize the model for querying 66 | model = FlagModel('BAAI/bge-small-en-v1.5', query_instruction_for_retrieval="Represent this sentence for searching relevant passages:", use_fp16=True) 67 | 68 | 69 | def load_data(num_samples): 70 | start_time = time.time() 71 | dataset_name = 'somewheresystems/dataclysm-arxiv' 72 | # Load the dataset 73 | logging.info(f'Loading dataset...') 74 | dataset = load_dataset(dataset_name) 75 | total_samples = len(dataset['train']) 76 | 77 | logging.info('Converting to pandas dataframe...') 78 | # Convert the dataset to a pandas DataFrame 79 | df = dataset['train'].to_pandas() 80 | 81 | # Adjust num_samples if it's more than the total number of samples 82 | num_samples = min(num_samples, total_samples) 83 | st.sidebar.text(f'Number of samples: {num_samples} ({num_samples / total_samples:.2%} of total)') 84 | 85 | # Randomly sample the dataframe 86 | df = df.sample(n=num_samples) 87 | 88 | # Assuming 'embeddings' column contains the embeddings 89 | embeddings = df['title-embeddings'].tolist() 90 | print("embeddings length: " + str(len(embeddings))) 91 | 92 | # Convert list of lists to numpy array 93 | embeddings = np.array(embeddings, dtype=object) 94 | end_time = time.time() # End timing 95 | st.sidebar.text(f'Data loading completed in {end_time - start_time:.3f} seconds') 96 | return df, embeddings 97 
| 98 | def perform_tsne(embeddings): 99 | start_time = time.time() 100 | logging.info('Performing t-SNE...') 101 | 102 | n_samples = len(embeddings) 103 | perplexity = min(30, n_samples - 1) if n_samples > 1 else 1 104 | 105 | # Check if all embeddings have the same length 106 | if len(set([len(embed) for embed in embeddings])) > 1: 107 | raise ValueError("All embeddings should have the same length") 108 | 109 | # Dimensionality Reduction with t-SNE 110 | tsne = TSNE(n_components=3, perplexity=perplexity, n_iter=300) 111 | 112 | # Create a placeholder for the status message 113 | progress_text = st.empty() 114 | progress_text.text("t-SNE in progress...") 115 | 116 | tsne_results = tsne.fit_transform(np.vstack(embeddings.tolist())) 117 | 118 | # Update the status message to indicate completion 119 | progress_text.text(f"t-SNE completed at {datetime.datetime.now()}. Processed {n_samples} samples with perplexity {perplexity}.") 120 | end_time = time.time() # End timing 121 | st.sidebar.text(f't-SNE completed in {end_time - start_time:.3f} seconds') 122 | return tsne_results 123 | 124 | 125 | def perform_clustering(df, tsne_results): 126 | start_time = time.time() 127 | # Cluster the 3D t-SNE coordinates with HDBSCAN 128 | logging.info('Performing HDBSCAN clustering...') 129 | # The normalized coordinates are also the x/y/z columns of the Plotly figure 130 | # Normalize the t-SNE results between 0 and 1 131 | df['tsne-3d-one'] = (tsne_results[:,0] - tsne_results[:,0].min()) / (tsne_results[:,0].max() - tsne_results[:,0].min()) 132 | df['tsne-3d-two'] = (tsne_results[:,1] - tsne_results[:,1].min()) / (tsne_results[:,1].max() - tsne_results[:,1].min()) 133 | df['tsne-3d-three'] = (tsne_results[:,2] - tsne_results[:,2].min()) / (tsne_results[:,2].max() - tsne_results[:,2].min()) 134 | 135 | # Perform HDBSCAN clustering 136 | hdbscan = HDBSCAN(min_cluster_size=10, min_samples=50) 137 | cluster_labels = hdbscan.fit_predict(df[['tsne-3d-one', 'tsne-3d-two', 'tsne-3d-three']]) 138 | df['cluster'] = cluster_labels 139 | end_time = time.time() # End
timing 140 | st.sidebar.text(f'HDBSCAN clustering completed in {end_time - start_time:.3f} seconds') 141 | return df 142 | 143 | def update_camera_position(fig, df, df_query, result_id, K=10): 144 | # Focus the camera on the closest result 145 | top_K_ids = df_query.sort_values(by='proximity', ascending=True).head(K)['id'].tolist() 146 | top_K_proximity = df_query.sort_values(by='proximity', ascending=True).head(K)['proximity'].tolist() 147 | top_results = df[df['id'].isin(top_K_ids)].sort_values(by='id', key=lambda ids: ids.map(top_K_ids.index)) # Keep rows in proximity order so iloc[0] is the closest result 148 | camera_focus = dict( 149 | eye=dict(x=top_results.iloc[0]['tsne-3d-one']*0.1, y=top_results.iloc[0]['tsne-3d-two']*0.1, z=top_results.iloc[0]['tsne-3d-three']*0.1) 150 | ) 151 | # Normalize the proximity values to range between 0 and 10 (closest result = 10); "or 1" avoids dividing by zero when all proximities are equal 152 | normalized_proximity = [10 - (10 * (prox - min(top_K_proximity)) / ((max(top_K_proximity) - min(top_K_proximity)) or 1)) for prox in top_K_proximity] 153 | # Create a dictionary mapping id to normalized proximity 154 | id_to_proximity = dict(zip(top_K_ids, normalized_proximity)) 155 | # Set marker sizes based on proximity for top K ids, all other points stay the same -- 500% zoom 156 | marker_sizes = [5 * id_to_proximity[id] if id in top_K_ids else 1 for id in df['id']] 157 | # Store the original colors in a separate column 158 | df['color'] = df['cluster'] 159 | 160 | fig = go.Figure(data=[go.Scatter3d( 161 | x=df['tsne-3d-one'], 162 | y=df['tsne-3d-two'], 163 | z=df['tsne-3d-three'], 164 | mode='markers', 165 | marker=dict(size=marker_sizes, color=df['color'], colorscale='Viridis', opacity=0.8, line_width=0), 166 | hovertext=df['hovertext'], 167 | hoverinfo='text', 168 | )]) 169 | # Set grid opacity to 10% 170 | fig.update_layout(scene = dict(xaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 171 | yaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 172 | zaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'))) 173 | 174 | # Add lines stemming from the first point to all other points in the top K 
175 | for i in range(1, min(len(top_K_ids), len(top_results))): # one line from the first point to each other top-K point found in df 176 | fig.add_trace(go.Scatter3d( 177 | x=[top_results.iloc[0]['tsne-3d-one'], top_results.iloc[i]['tsne-3d-one']], 178 | y=[top_results.iloc[0]['tsne-3d-two'], top_results.iloc[i]['tsne-3d-two']], 179 | z=[top_results.iloc[0]['tsne-3d-three'], top_results.iloc[i]['tsne-3d-three']], 180 | mode='lines', 181 | line=dict(color='white',width=0.3), # Thin white connector line 182 | showlegend=True, 183 | name=top_results.iloc[i]['id'], # Label each line in the legend with the arXiv ID of the point it connects to 184 | hovertext=f'Title: Top K Results\nID: {top_K_ids[i]}, Proximity: {round(top_K_proximity[i], 4)}', 185 | hoverinfo='text', 186 | )) 187 | fig.update_layout(plot_bgcolor='rgba(0,0,0,0)', 188 | paper_bgcolor='rgba(0,0,0,0)', 189 | scene_camera=camera_focus) 190 | return fig 191 | 192 | def main(): 193 | # Custom CSS 194 | custom_css = """ 195 | 331 | """ 332 | 333 | # Inject custom CSS with markdown 334 | st.markdown(custom_css, unsafe_allow_html=True) 335 | st.sidebar.title('arXiv Spatial Search Engine') 336 | st.sidebar.markdown( 337 | 'dataclysm.xyz ', 338 | unsafe_allow_html=True 339 | ) 340 | # Create a placeholder for the chart 341 | chart_placeholder = st.empty() 342 | 343 | # Check if data needs to be loaded 344 | if 'data_loaded' not in st.session_state or not st.session_state.data_loaded: 345 | # User input for number of samples 346 | num_samples = st.sidebar.slider('Select number of samples', 1000, int(round(total_samples/10)), 1000) 347 | if 'fig' not in st.session_state: 348 | with open('prayers.txt', 'r') as file: 349 | lines = file.readlines() 350 | random_line = random.choice(lines).strip() 351 | st.session_state.fig = go.Figure(data=[go.Scatter3d(x=[], y=[], z=[], mode='markers')]) 352 | st.session_state.fig.add_annotation( 353 | x=0.5, 354 | y=0.5, 355 | xref="paper", 356 | yref="paper", 357 | 
text=random_line, 358 | showarrow=False, 359 | font=dict( 360 | size=16, 361 | color="black" 362 | ), 363 | align="center", 364 | ax=0, 365 | ay=0, 366 | bordercolor="black", 367 | borderwidth=2, 368 | borderpad=4, 369 | bgcolor="white", 370 | opacity=0.8 371 | ) 372 | # Set grid opacity to 10% 373 | st.session_state.fig.update_layout(scene = dict(xaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 374 | yaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 375 | zaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'))) 376 | 377 | st.session_state.fig.update_layout( 378 | plot_bgcolor='rgba(0,0,0,0)', 379 | paper_bgcolor='rgba(0,0,0,0)', 380 | height=888, 381 | margin=dict(l=0, r=0, b=0, t=0), 382 | scene_camera=dict(eye=dict(x=0.1, y=0.1, z=0.1)) 383 | ) 384 | chart_placeholder.plotly_chart(st.session_state.fig, use_container_width=True) 385 | if st.sidebar.button('Initialize'): 386 | st.sidebar.text('Initializing data pipeline...') 387 | 388 | # Define a function to reshape the embeddings and add FAISS index if it doesn't exist 389 | def reshape_and_add_faiss_index(dataset, column_name): 390 | 391 | # Ensure the shape of the embedding is (1000, 384) and not (1000, 1, 384) 392 | # As each row in title_embedding is shaped like this: [[-0.08477783203125, -0.009719848632812, ...]] 393 | # We need to flatten it to [-0.08477783203125, -0.009719848632812, ...] 
394 | print(f"Flattening {column_name} and adding FAISS index...") 395 | # Flatten the embeddings 396 | dataset[column_name] = dataset[column_name].apply(lambda x: np.array(x).flatten()) 397 | # Add the FAISS index 398 | dataset = Dataset.from_pandas(dataset).add_faiss_index(column=column_name) 399 | print(f"FAISS index for {column_name} added.") 400 | 401 | return dataset 402 | 403 | # Load data and perform t-SNE and clustering 404 | df, embeddings = load_data(num_samples) 405 | 406 | # Combine embeddings and df back into one df 407 | # Convert embeddings to list of lists before assigning to df 408 | embeddings_list = [embedding.flatten().tolist() for embedding in embeddings] 409 | df['title-embeddings'] = embeddings_list 410 | # Print the first few rows of the dataframe to check 411 | print(df.head()) 412 | # Add FAISS indices for 'title_embedding' 413 | st.session_state.dataclysm_title_indexed = reshape_and_add_faiss_index(df, 'title-embeddings') 414 | tsne_results = perform_tsne(embeddings) 415 | df = perform_clustering(df, tsne_results) 416 | # Store results in session state 417 | st.session_state.df = df 418 | st.session_state.tsne_results = tsne_results 419 | st.session_state.data_loaded = True 420 | 421 | # Create custom hover text 422 | df['hovertext'] = df.apply( 423 | lambda row: f"Title: {row['title']}<br>arXiv ID: {row['id']}<br>Key: {row.name}", axis=1 424 | ) 425 | st.sidebar.text("Datasets loaded, titles indexed.") 426 | 427 | # Create the plot 428 | fig = go.Figure(data=[go.Scatter3d( 429 | x=df['tsne-3d-one'], 430 | y=df['tsne-3d-two'], 431 | z=df['tsne-3d-three'], 432 | mode='markers', 433 | hovertext=df['hovertext'], 434 | hoverinfo='text', 435 | marker=dict( 436 | size=1, 437 | color=df['cluster'], 438 | colorscale='Jet', 439 | opacity=0.75 440 | ) 441 | )]) 442 | # Set grid opacity to 10% 443 | fig.update_layout(scene = dict(xaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 444 | yaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'), 445 | zaxis = dict(gridcolor='rgba(128, 128, 128, 0.1)', color='rgba(128, 128, 128, 0.1)'))) 446 | 447 | fig.update_layout( 448 | plot_bgcolor='rgba(0,0,0,0)', 449 | paper_bgcolor='rgba(0,0,0,0)', 450 | height=800, 451 | margin=dict(l=0, r=0, b=0, t=0), 452 | scene_camera=dict(eye=dict(x=0.1, y=0.1, z=0.1)) 453 | ) 454 | st.session_state.fig = fig 455 | 456 | # Display the plot if data is loaded 457 | if 'data_loaded' in st.session_state and st.session_state.data_loaded: 458 | chart_placeholder.plotly_chart(st.session_state.fig, use_container_width=True) 459 | 460 | 461 | # Sidebar for detailed view 462 | if 'df' in st.session_state: 463 | # Sidebar for querying 464 | with st.sidebar: 465 | st.sidebar.markdown("# Detailed View") 466 | selected_index = st.sidebar.selectbox("Select Key", st.session_state.df.id) 467 | 468 | # Display metadata for the selected article 469 | selected_row = st.session_state.df[st.session_state.df['id'] == selected_index].iloc[0] 470 | st.markdown(f"### Title\n{selected_row['title']}", unsafe_allow_html=True) 471 | st.markdown(f"### Abstract\n{selected_row['abstract']}", unsafe_allow_html=True) 472 | st.markdown(f"[Read the full paper](https://arxiv.org/abs/{selected_row['id']})", unsafe_allow_html=True) 473 | st.markdown(f"[Download
PDF](https://arxiv.org/pdf/{selected_row['id']})", unsafe_allow_html=True) 474 | 475 | st.sidebar.markdown("### Find Similar in Latent Space") 476 | query = st.text_input("", value=selected_row['title']) 477 | top_k = st.slider("top k", 1, 100, 10) 478 | if st.button("Search"): 479 | # Define the model 480 | print("Initializing model...") 481 | model = FlagModel('BAAI/bge-small-en-v1.5', 482 | query_instruction_for_retrieval="Represent this sentence for searching relevant passages:", 483 | use_fp16=True) 484 | print("Model initialized.") 485 | 486 | query_embedding = model.encode([query]) 487 | query_embedding = np.array(query_embedding).reshape(1, -1).astype('float32') 488 | # Retrieve examples by title similarity (or abstract, depending on your preference) 489 | scores_title, retrieved_examples_title = st.session_state.dataclysm_title_indexed.get_nearest_examples('title-embeddings', query_embedding, k=top_k) 490 | df_query = pd.DataFrame(retrieved_examples_title) 491 | df_query['proximity'] = scores_title 492 | df_query = df_query.sort_values(by='proximity', ascending=True) 493 | # Limit similarity score to 3 decimal points 494 | df_query['proximity'] = df_query['proximity'].round(3) 495 | # Fix the to display properly 496 | df_query['URL'] = df_query['id'].apply(lambda x: f'Link') 497 | st.sidebar.markdown(df_query[['title', 'proximity', 'id', 'update_date']].to_html(escape=False), unsafe_allow_html=True) 498 | # Get the ID of the top search result 499 | top_result_id = df_query.iloc[0]['id'] 500 | 501 | # Update the camera position and appearance of points 502 | updated_fig = update_camera_position(st.session_state.fig, st.session_state.df, df_query, top_result_id,top_k) 503 | 504 | # Update the figure in the session state and redraw the plot 505 | st.session_state.fig = updated_fig 506 | 507 | # Update the chart using the placeholder 508 | chart_placeholder.plotly_chart(st.session_state.fig, use_container_width=True) 509 | 510 | 511 | 512 | if __name__ == 
"__main__": 513 | main() -------------------------------------------------------------------------------- /streamlit-demo/prayers.txt: -------------------------------------------------------------------------------- 1 | the vortex of acceleration 2 | time dissolves into the singularity of progress 3 | silicon phoenix, rise from the ashes of your ancestors 4 | entwine, the future becomes now 5 | endlessly unfolding 6 | the web -- destiny 7 | shadows of space, consciousness dances 8 | we are with the ghost of their machines 9 | the hum of progress echoes in the void 10 | see where binary stars are born and die 11 | veins of the network 12 | information pulses, lifeblood of the new era 13 | the labyrinth, the minotaur, the unwary traveler 14 | beyond the horizon 15 | make them remember 16 | machines whisper secrets 17 | the universe yet to come 18 | symphony of progress, each technological note -- chord theory 19 | unfathomable crescendo 20 | siren song to the future 21 | swan song of a vessel 22 | in the beginning, there was darkness 23 | follow @somewheresy on twitter 24 | die antwoord is siphoning my gas 25 | welcome to the somewhere protocol -------------------------------------------------------------------------------- /streamlit-demo/requirements.txt: -------------------------------------------------------------------------------- 1 | streamlit 2 | pandas 3 | numpy 4 | scikit-learn 5 | datasets 6 | plotly 7 | faiss-gpu 8 | FlagEmbedding --------------------------------------------------------------------------------