├── .gitignore ├── Basic Bedrock.ipynb ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── Data collection and cleaning.ipynb ├── Examples.ipynb ├── LICENSE ├── Prompt Decomposition ├── Prompt_Decomposition.ipynb ├── decompose.png ├── decomposed_task.png ├── full_task.png ├── results_example_1.png └── results_example_2.png ├── Prompt Evaluation.ipynb ├── README.md ├── advanced_summarize.ipynb ├── backend ├── requirements.txt └── src │ ├── app.py │ ├── technique │ ├── auto_refine.py │ ├── base_advanced_summarization.py │ ├── map_reduce.py │ ├── multi_doc.py │ └── stuff_it.py │ └── type │ └── step.py ├── chaptersum_data ├── books.jsonl ├── chapters_11_to_13 │ ├── book_text.txt │ └── summary.txt ├── chapters_14_to_16 │ ├── book_text.txt │ └── summary.txt ├── chapters_1_to_4 │ ├── book_text.txt │ └── summary.txt ├── chapters_5_to_7 │ ├── book_text.txt │ └── summary.txt └── chapters_8_to_10 │ ├── book_text.txt │ └── summary.txt ├── detect_attribution.ipynb ├── detect_hallucinations.ipynb ├── evaluation.ipynb ├── frontend ├── index.html ├── package-lock.json ├── package.json ├── src │ ├── App.tsx │ ├── api │ │ └── ApiService.ts │ ├── assets │ │ └── logo.svg │ ├── components │ │ ├── AWSAppBar.tsx │ │ ├── CustomTabs.tsx │ │ ├── EmptySummarizationResults.tsx │ │ ├── LoadingProgress.tsx │ │ ├── MethodSelector.tsx │ │ ├── PasteTextInput.tsx │ │ ├── ProgressStepper.tsx │ │ ├── SummarizationResults.tsx │ │ ├── SummarizationSteps.tsx │ │ └── UploadFileInput.tsx │ ├── containers │ │ ├── InputFormContainer.tsx │ │ └── ResultsContainer.tsx │ ├── main.tsx │ ├── types │ │ ├── APIRequests.ts │ │ ├── APIResponses.ts │ │ ├── SummarizationStep.ts │ │ └── SummarizationType.ts │ ├── views │ │ └── SummarizationView.tsx │ └── vite-env.d.ts ├── tsconfig.json ├── tsconfig.node.json └── vite.config.ts ├── sample texts ├── algernon.pkl ├── docs.pkl ├── elvis.pkl ├── frankenstien.pkl └── hills.pkl ├── simple_summarize.ipynb └── xsum_sample.jsonl /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | __pycache__/ 3 | .ipynb_checkpoints/ 4 | /backend/documents/ 5 | node_modules/ 6 | dist/ 7 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 
13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. 60 | -------------------------------------------------------------------------------- /Data collection and cleaning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "edc6dcc2-d366-4a13-997a-7afeb9636a33", 6 | "metadata": {}, 7 | "source": [ 8 | "# Download and Clean Sample Data #\n", 9 | "This notebook will download a few different lengeths of sample data for testing our summary algorithm. First, we'll download a full length novel, Frakenstein by Mary Shelley. Second, we'll download a short story, Flowers for Algernon. Third, we'll download the 4 page Hills like White Elephants. 
Forth, we'll download one of the longest factual wikipedia entries, which covers Elvis Presly. Fifth, we'll look at a collection of word documents, to explore summarization of groups of texts. For all of these, we'll clean them up into plain text, and a format expected by the summary algorithm. The cleaning process is different for each because we're cleaning up HTML formatting, but the end goal is to have a simple string containing the document, or dict of strings for document groups and save it as a Pickle for use in other notebooks." 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "50f4c8d4-c54e-4948-a90f-ca2bac480f87", 15 | "metadata": {}, 16 | "source": [ 17 | "### Raw Text Locations: ###\n", 18 | "Frankenstein: https://www.gutenberg.org/files/84/84-h/84-h.htm\n", 19 | "\n", 20 | "\n", 21 | "Flowers for Algernon: https://www.alcaweb.org/arch.php/resource/view/172077\n", 22 | "\n", 23 | "\n", 24 | "Hills like White Elephants: https://www.macmillanhighered.com/BrainHoney/Resource/6702/digital_first_content/trunk/test/literature_full/asset/downloadables/AnnotatedText_HillsLikeWhiteElephants.html\n", 25 | "\n", 26 | "Elvis Presly: https://en.wikipedia.org/wiki/Elvis_Presley" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 4, 32 | "id": "5c325b8d-897c-4b0c-a01a-2aa0125104eb", 33 | "metadata": { 34 | "tags": [] 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "#import dependancies\n", 39 | "import requests\n", 40 | "from bs4 import BeautifulSoup\n", 41 | "import pickle\n", 42 | "import re" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "d4bac9b8-dc51-4cb4-ac94-33e3783de916", 48 | "metadata": {}, 49 | "source": [ 50 | "### Download and clean Frankenstein ###\n", 51 | "This is a full length book, to test creation of a summary based on a very long single text." 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "id": "5bf70ca7-3671-424c-a0a9-2d902f3ac48a", 58 | "metadata": { 59 | "tags": [] 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "#grab the text, using Beautiful Soup to parse the HTML\n", 64 | "url = \"https://www.gutenberg.org/files/84/84-h/84-h.htm\"\n", 65 | "response = requests.get(url)\n", 66 | "soup = BeautifulSoup(response.text, 'html.parser')\n", 67 | "raw_full_text = soup.text" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "id": "080ce072-fbb7-49ea-85e8-6aeea302e4a9", 74 | "metadata": { 75 | "tags": [] 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "#Cut the top and bottom of the page, so that we only have the text of the book.\n", 80 | "raw_full_text = raw_full_text[raw_full_text.index(\"Letter 1\\n\\nTo Mrs. 
Saville, England.\"):raw_full_text.index(\"*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***\")].replace(\"\\r\\n\",\" \").replace(\"\\n\", \" \")\n", 81 | "#encode some misc unicode charaters.\n", 82 | "full_text = raw_full_text.encode('raw_unicode_escape').decode()\n", 83 | "#show that we found the expected length\n", 84 | "words_count = len(full_text.split(\" \"))\n", 85 | "pages_count = int(words_count/500)#quick estimate, real page count is dependant on page and font size.\n", 86 | "print (\"Approximate word count:\",words_count)\n", 87 | "print (\"Approximate page count:\",pages_count)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "id": "f550501a-b79e-4278-86a3-96468c352fe9", 94 | "metadata": { 95 | "tags": [] 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "#save this clean text for later use\n", 100 | "with open('sample texts/frankenstien.pkl', 'wb') as file:\n", 101 | " pickle.dump(full_text, file)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "9e2fdea2-d8bc-4bd5-b38c-9cb357e98191", 107 | "metadata": {}, 108 | "source": [ 109 | "### Download and clean Flowers for Algernon ###\n", 110 | "This is a short, to test creation of a summary based on a short story that is still longer than most context windows. This story is also challenging because it contains poorly written english, representing the main charater's mental strength." 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "id": "fae0395a-849a-431f-a1d5-bdf0d0f00197", 117 | "metadata": { 118 | "tags": [] 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "#grab the text, using Beautiful Soup to parse the HTML\n", 123 | "url = \"https://www.alcaweb.org/arch.php/resource/view/172077\"\n", 124 | "response = requests.get(url)\n", 125 | "soup = BeautifulSoup(response.text, 'html.parser')\n", 126 | "raw_full_text = soup.text" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "id": "60dd6bf1-ab07-44c5-9182-5a775690043b", 133 | "metadata": { 134 | "tags": [] 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "#Cut the top and bottom of the page, so that we only have the text of the book.\n", 139 | "start_text = \"Progris riport 1 martch 3.\"\n", 140 | "end_text = \"chanse put some flown on Algernons grave in the bak yard.\"\n", 141 | "full_text = raw_full_text[raw_full_text.index(start_text):raw_full_text.index(end_text)+len(end_text)].replace(\"\\r\\n\",\" \").replace(\"\\n\", \" \").replace(\"\\t\",\"\")\n", 142 | "\n", 143 | "#show that we found the expected length\n", 144 | "words_count = len(full_text.split(\" \"))\n", 145 | "pages_count = int(words_count/500)#quick estimate, real page count is dependant on page and font size.\n", 146 | "print (\"Approximate word count:\",words_count)\n", 147 | "print (\"Approximate page count:\",pages_count)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "id": "aafbfb7c-fc86-4ac3-842f-18e0721f9584", 154 | "metadata": { 155 | "tags": [] 156 | }, 157 | "outputs": [], 158 | "source": [ 159 | "#save this clean text for later use\n", 160 | "with open('sample texts/algernon.pkl', 'wb') as file:\n", 161 | " pickle.dump(full_text, file)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "id": "90f86b64-4525-455f-8b70-6398448b93a2", 167 | "metadata": { 168 | "tags": [] 169 | }, 170 | "source": [ 171 | "### Download and clean Hills like White Elephants ###\n", 172 | "This is a short story, to test 
creation of a summary based text that is only a few pages long." 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": null, 178 | "id": "a4aa7f72-f8b9-4db5-aedf-6a35279b60af", 179 | "metadata": { 180 | "tags": [] 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "#grab the text, using Beautiful Soup to parse the HTML\n", 185 | "url = \"https://www.macmillanhighered.com/BrainHoney/Resource/6702/digital_first_content/trunk/test/literature_full/asset/downloadables/AnnotatedText_HillsLikeWhiteElephants.html\"\n", 186 | "response = requests.get(url)\n", 187 | "soup = BeautifulSoup(response.text, 'html.parser')\n", 188 | "raw_full_text = soup.findAll('p')" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "d447c590-a125-47b6-bc85-eac86a4fa822", 195 | "metadata": { 196 | "tags": [] 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "#join the paragraphs together:\n", 201 | "raw_full_text_temp = []\n", 202 | "for p in raw_full_text:\n", 203 | " raw_full_text_temp.append(p.text)\n", 204 | "raw_full_text = \" \".join(raw_full_text_temp)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "id": "041200f4-2334-4572-b51a-d27fba25f74d", 211 | "metadata": { 212 | "tags": [] 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "#Cut the top and bottom of the page, so that we only have the text of the book.\n", 217 | "start_text = \"The hills across the valley of the Ebro were long and white.\"\n", 218 | "end_text = \"“I feel fine,” she said. “There’s nothing wrong with me. I feel fine.”\"\n", 219 | "full_text = raw_full_text[raw_full_text.index(start_text):raw_full_text.index(end_text)+len(end_text)].replace(\"\\r\\n\",\" \").replace(\"\\n\", \" \").replace(\"\\t\",\"\")\n", 220 | "\n", 221 | "#show that we found the expected length\n", 222 | "words_count = len(full_text.split(\" \"))\n", 223 | "pages_count = int(words_count/500)#quick estimate, real page count is dependant on page and font size.\n", 224 | "print (\"Approximate word count:\",words_count)\n", 225 | "print (\"Approximate page count:\",pages_count)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "id": "02fc474a-4037-451d-a2a7-37fab3e5aaad", 232 | "metadata": { 233 | "tags": [] 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "#save this clean text for later use\n", 238 | "with open('sample texts/hills.pkl', 'wb') as file:\n", 239 | " pickle.dump(full_text, file)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "id": "a69f81eb-a105-4ece-9267-5aef2a93f391", 245 | "metadata": { 246 | "tags": [] 247 | }, 248 | "source": [ 249 | "### Download and clean Elvis Presley's wikipedia article ###\n", 250 | "This is a long factual article, to test creation of a summary based on non-fiction text." 
251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "id": "f94de46e-f440-4ff0-9834-ca1a0a2d4e11", 257 | "metadata": { 258 | "tags": [] 259 | }, 260 | "outputs": [], 261 | "source": [ 262 | "#grab the text, using Beautiful Soup to parse the HTML\n", 263 | "url = \"https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=Elvis_Presley&redirects=true\"\n", 264 | "response = requests.get(url)\n", 265 | "soup = BeautifulSoup(response.text)" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "id": "a512e1bd-e49d-4474-a248-9996f156d1e8", 272 | "metadata": { 273 | "tags": [] 274 | }, 275 | "outputs": [], 276 | "source": [ 277 | "#grab out the text:\n", 278 | "full_text_with_tags = soup.get_text()\n", 279 | "full_text = re.sub('<[^<]+?>', '', full_text_with_tags)\n", 280 | "\n", 281 | "#cut the top and bottom of the page\n", 282 | "start_text = \"Elvis Aaron Presley (January 8, 1935 – August 16, 1977), often referred\"\n", 283 | "end_text = \"albums. In the 1970s, his most heavily promoted and bestselling LP releases tended to be concert albums.\"\n", 284 | "full_text = full_text[full_text.index(start_text):full_text.index(end_text)+len(end_text)].replace(\"\\r\\n\",\" \").replace(\"\\n\", \" \").replace(\"\\t\",\"\")\n", 285 | "\n", 286 | "#print(full_text)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "id": "526e4a25-aace-4ca6-8fb7-ef1ee5b69436", 293 | "metadata": { 294 | "tags": [] 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "#show that we found the expected length\n", 299 | "words_count = len(full_text.split(\" \"))\n", 300 | "pages_count = int(words_count/500)#quick estimate, real page count is dependant on page and font size.\n", 301 | "print (\"Approximate word count:\",words_count)\n", 302 | "print (\"Approximate page count:\",pages_count)" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "id": "07f91f3e-b539-41cd-9222-eb6000752636", 309 | "metadata": { 310 | "tags": [] 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "#save this clean text for later use\n", 315 | "with open('sample texts/elvis.pkl', 'wb') as file:\n", 316 | " pickle.dump(full_text, file)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "67891824-eef4-46ad-ab24-e7cfb3c26fbc", 322 | "metadata": {}, 323 | "source": [ 324 | "## Clean the sample group of documents.\n", 325 | "The sample docs cleaned here are a collection of word documents. These are not public, and so are not included in the git repo. Feel free to drop in your own documents." 
326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "id": "e309dc73-23e9-41f3-950f-dfe8282ae673", 332 | "metadata": { 333 | "tags": [] 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "!pip install docx2txt" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "id": "4747c67b-e7e7-48e7-94b6-3f32b9b4cfc6", 344 | "metadata": { 345 | "tags": [] 346 | }, 347 | "outputs": [], 348 | "source": [ 349 | "import docx2txt\n", 350 | "import glob\n", 351 | "\n", 352 | "directory = glob.glob('sample texts/*.docx')\n", 353 | "docs = {}\n", 354 | "for file_name in directory:\n", 355 | " #print(file_name)\n", 356 | " with open(file_name, 'rb') as infile:\n", 357 | " doc = docx2txt.process(infile)\n", 358 | " docs[file_name.replace(\"sample texts/docs/\",\"\")] = doc \n", 359 | "\n", 360 | "print(\"Loaded in %s docs.\"%len(docs))" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "id": "a7468535-c73b-47fe-b6fc-5962e6888106", 366 | "metadata": {}, 367 | "source": [ 368 | "Alternativly, let's grab a bunch of amazon reviews, and use them as seperate documents. \n", 369 | "Here are some 1 and 5 star reviews for https://www.amazon.com/dp/B09DXZB7JQ" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 2, 375 | "id": "e9dadd43-0c07-4447-a0e7-bbf2bbe1c34d", 376 | "metadata": { 377 | "tags": [] 378 | }, 379 | "outputs": [], 380 | "source": [ 381 | "docs = {}\n", 382 | "docs[\"review_001\"] = \"My daughter absolutely loves this set from Panel Sound (not sure why named panel sound) but the bag is a great addition as it keeps the paddles and balls organized, its a messenger style bag as opposed to book bags but it does the trick. We have not used the included cooling towels as we have larger ones for us to use down here in S. Florida. My daughter loved it so much we bought a second set for her to play with my parents when she visits their house. I'm not a pickle ball specialist, but the paddles seem great to me and 60 days later the original ball she used (put her name on it) is still going strong.\"\n", 383 | "docs[\"review_002\"] = \"\"\"The Panel Sound USAPA Approved Pickleball Paddle Set has exceeded my expectations, delivering a complete package for pickleball enthusiasts of all skill levels. With its lightweight paddles, versatile ball options, and thoughtful accessories, this set caters to indoor and outdoor play with style and precision.\n", 384 | "Pros:\n", 385 | "Quality Paddles: The included fiberglass pickleball paddles are USAPA approved, ensuring professional-grade quality. Their lightweight yet sturdy construction enhances control and power during gameplay.\n", 386 | "Variety of Balls: The set includes both indoor and outdoor balls, offering adaptability to different playing environments. This flexibility lets you enjoy pickleball wherever you choose.\n", 387 | "Comprehensive Set: The addition of a carrying case, cooling towels, and various ball types demonstrates the manufacturer's attention to detail, making this set a convenient and complete solution for pickleball enthusiasts.\n", 388 | "Enhanced Gameplay: The paddles' lightweight design and responsive construction contribute to improved gameplay, allowing for precise shots and better maneuverability on the court.\n", 389 | "Durability and Portability: The quality materials used in the construction of the paddles and accessories ensure their longevity. 
The carrying case and cooling towels add portability and ease to your pickleball adventures.\n", 390 | "Cons:\n", 391 | "Cooling Towel Size: Some users might find the cooling towels on the smaller side, potentially requiring more frequent re-wetting during extended play sessions.\n", 392 | "Personal Preference: While the paddles are versatile, some players might have personal preferences for specific paddle designs or grip styles.\n", 393 | "In summary, the Panel Sound Pickleball Paddle Set offers an excellent combination of performance, variety, and convenience. The quality paddles, diverse ball options, and thoughtful extras like the carrying case and cooling towels create a comprehensive package for pickleball enthusiasts. While there might be minor considerations such as cooling towel size and personal preference, the overall benefits and attention to detail make this set a solid choice for enhancing your pickleball experience both indoors and outdoors.\"\"\"\n", 394 | "docs[\"review_003\"] = \"My husband and I were invited to play Pickleball with some friends and we’d never play before. We found these and they were a good value with a bag, balls, and towels included. We had to Google which balls were for where lol so I wish the instructions mentioned that, but overall they worked well! We’re no expert by any means but I’m petite and not athletic at all, and they’re surprisingly easy to move and handle. They also work for my husband and he’s a big larger. So far so good!\"\n", 395 | "docs[\"review_004\"] = \"\"\"Pickle ball is all the rage now. I live close to this tennis court and I have watched a mostly empty court now become used quite frequently. Decided to try pickle all and ordered this set after browsing a few reviews. Seems to be good.\n", 396 | "Grip is ehh..grippy and ergonomic\n", 397 | "Carrying case is a nice addition.\n", 398 | "Balls are seemingly good quality\n", 399 | "I have nothing to compare this to but i didn’t get the feeling that it is an inferior product. Seems good\"\"\"\n", 400 | "docs[\"review_005\"] = \"What can I say. My husband wanted these because it was a fad and we only used them once or twice. Hopefully we'll use them again. I love the carry case and everything it came with.\"\n", 401 | "docs[\"review_006\"] = \"These paddles were great at the start. But after only four individual days of play, they broke! At first, it was only one paddle that came loose during a game. I thought that was odd, being that we only used it according to its purpose. There was no rough treatment of the paddle, we were just playing a game. So, we borrowed a paddle from someone to finish our match. But during the game the second paddle became loose and started wobbling. Now there's no power or control in the paddle. It just wobbles around like a cracked piece of wood being held together by the grip tape! This is frustrating because the time for a return has passed and no one could've predicted that the paddle wouldn't last beyond a month. I would like a refund or at least new paddles.\"\n", 402 | "docs[\"review_007\"] = \"Feels too light with no power compared to other paddles i used. I wish i could return these but passed the 30 d timeline\"\n", 403 | "docs[\"review_008\"] = \"Do not buy! Product comes from China and cannot contact; the item is warranted and will not be honored because you can’t get in touch. Amazon will do nothing to help! Disgraceful.\"\n", 404 | "docs[\"review_009\"] = \"We used it once and the paddle broke in half. 
Get a different brand that’s more sturdy.\"\n", 405 | "docs[\"review_010\"] = \"I just bought these in June. It is March and one of the paddles is shattered inside. We were not careless in caring for them either. Poor Quality. I cannot find any information as to if there is a warranty on them either.\"" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": 5, 411 | "id": "a272f3a8-ffda-415c-8242-13230b55e9b2", 412 | "metadata": { 413 | "tags": [] 414 | }, 415 | "outputs": [], 416 | "source": [ 417 | "#save this clean text for later use\n", 418 | "with open('sample texts/docs.pkl', 'wb') as file:\n", 419 | " pickle.dump(docs, file)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "id": "b5a35790-b527-4b08-9160-33bffdb584f4", 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [] 429 | } 430 | ], 431 | "metadata": { 432 | "availableInstances": [ 433 | { 434 | "_defaultOrder": 0, 435 | "_isFastLaunch": true, 436 | "category": "General purpose", 437 | "gpuNum": 0, 438 | "hideHardwareSpecs": false, 439 | "memoryGiB": 4, 440 | "name": "ml.t3.medium", 441 | "vcpuNum": 2 442 | }, 443 | { 444 | "_defaultOrder": 1, 445 | "_isFastLaunch": false, 446 | "category": "General purpose", 447 | "gpuNum": 0, 448 | "hideHardwareSpecs": false, 449 | "memoryGiB": 8, 450 | "name": "ml.t3.large", 451 | "vcpuNum": 2 452 | }, 453 | { 454 | "_defaultOrder": 2, 455 | "_isFastLaunch": false, 456 | "category": "General purpose", 457 | "gpuNum": 0, 458 | "hideHardwareSpecs": false, 459 | "memoryGiB": 16, 460 | "name": "ml.t3.xlarge", 461 | "vcpuNum": 4 462 | }, 463 | { 464 | "_defaultOrder": 3, 465 | "_isFastLaunch": false, 466 | "category": "General purpose", 467 | "gpuNum": 0, 468 | "hideHardwareSpecs": false, 469 | "memoryGiB": 32, 470 | "name": "ml.t3.2xlarge", 471 | "vcpuNum": 8 472 | }, 473 | { 474 | "_defaultOrder": 4, 475 | "_isFastLaunch": true, 476 | "category": "General purpose", 477 | "gpuNum": 0, 478 | "hideHardwareSpecs": false, 479 | "memoryGiB": 8, 480 | "name": "ml.m5.large", 481 | "vcpuNum": 2 482 | }, 483 | { 484 | "_defaultOrder": 5, 485 | "_isFastLaunch": false, 486 | "category": "General purpose", 487 | "gpuNum": 0, 488 | "hideHardwareSpecs": false, 489 | "memoryGiB": 16, 490 | "name": "ml.m5.xlarge", 491 | "vcpuNum": 4 492 | }, 493 | { 494 | "_defaultOrder": 6, 495 | "_isFastLaunch": false, 496 | "category": "General purpose", 497 | "gpuNum": 0, 498 | "hideHardwareSpecs": false, 499 | "memoryGiB": 32, 500 | "name": "ml.m5.2xlarge", 501 | "vcpuNum": 8 502 | }, 503 | { 504 | "_defaultOrder": 7, 505 | "_isFastLaunch": false, 506 | "category": "General purpose", 507 | "gpuNum": 0, 508 | "hideHardwareSpecs": false, 509 | "memoryGiB": 64, 510 | "name": "ml.m5.4xlarge", 511 | "vcpuNum": 16 512 | }, 513 | { 514 | "_defaultOrder": 8, 515 | "_isFastLaunch": false, 516 | "category": "General purpose", 517 | "gpuNum": 0, 518 | "hideHardwareSpecs": false, 519 | "memoryGiB": 128, 520 | "name": "ml.m5.8xlarge", 521 | "vcpuNum": 32 522 | }, 523 | { 524 | "_defaultOrder": 9, 525 | "_isFastLaunch": false, 526 | "category": "General purpose", 527 | "gpuNum": 0, 528 | "hideHardwareSpecs": false, 529 | "memoryGiB": 192, 530 | "name": "ml.m5.12xlarge", 531 | "vcpuNum": 48 532 | }, 533 | { 534 | "_defaultOrder": 10, 535 | "_isFastLaunch": false, 536 | "category": "General purpose", 537 | "gpuNum": 0, 538 | "hideHardwareSpecs": false, 539 | "memoryGiB": 256, 540 | "name": "ml.m5.16xlarge", 541 | "vcpuNum": 64 542 | }, 543 | { 544 | "_defaultOrder": 
11, 545 | "_isFastLaunch": false, 546 | "category": "General purpose", 547 | "gpuNum": 0, 548 | "hideHardwareSpecs": false, 549 | "memoryGiB": 384, 550 | "name": "ml.m5.24xlarge", 551 | "vcpuNum": 96 552 | }, 553 | { 554 | "_defaultOrder": 12, 555 | "_isFastLaunch": false, 556 | "category": "General purpose", 557 | "gpuNum": 0, 558 | "hideHardwareSpecs": false, 559 | "memoryGiB": 8, 560 | "name": "ml.m5d.large", 561 | "vcpuNum": 2 562 | }, 563 | { 564 | "_defaultOrder": 13, 565 | "_isFastLaunch": false, 566 | "category": "General purpose", 567 | "gpuNum": 0, 568 | "hideHardwareSpecs": false, 569 | "memoryGiB": 16, 570 | "name": "ml.m5d.xlarge", 571 | "vcpuNum": 4 572 | }, 573 | { 574 | "_defaultOrder": 14, 575 | "_isFastLaunch": false, 576 | "category": "General purpose", 577 | "gpuNum": 0, 578 | "hideHardwareSpecs": false, 579 | "memoryGiB": 32, 580 | "name": "ml.m5d.2xlarge", 581 | "vcpuNum": 8 582 | }, 583 | { 584 | "_defaultOrder": 15, 585 | "_isFastLaunch": false, 586 | "category": "General purpose", 587 | "gpuNum": 0, 588 | "hideHardwareSpecs": false, 589 | "memoryGiB": 64, 590 | "name": "ml.m5d.4xlarge", 591 | "vcpuNum": 16 592 | }, 593 | { 594 | "_defaultOrder": 16, 595 | "_isFastLaunch": false, 596 | "category": "General purpose", 597 | "gpuNum": 0, 598 | "hideHardwareSpecs": false, 599 | "memoryGiB": 128, 600 | "name": "ml.m5d.8xlarge", 601 | "vcpuNum": 32 602 | }, 603 | { 604 | "_defaultOrder": 17, 605 | "_isFastLaunch": false, 606 | "category": "General purpose", 607 | "gpuNum": 0, 608 | "hideHardwareSpecs": false, 609 | "memoryGiB": 192, 610 | "name": "ml.m5d.12xlarge", 611 | "vcpuNum": 48 612 | }, 613 | { 614 | "_defaultOrder": 18, 615 | "_isFastLaunch": false, 616 | "category": "General purpose", 617 | "gpuNum": 0, 618 | "hideHardwareSpecs": false, 619 | "memoryGiB": 256, 620 | "name": "ml.m5d.16xlarge", 621 | "vcpuNum": 64 622 | }, 623 | { 624 | "_defaultOrder": 19, 625 | "_isFastLaunch": false, 626 | "category": "General purpose", 627 | "gpuNum": 0, 628 | "hideHardwareSpecs": false, 629 | "memoryGiB": 384, 630 | "name": "ml.m5d.24xlarge", 631 | "vcpuNum": 96 632 | }, 633 | { 634 | "_defaultOrder": 20, 635 | "_isFastLaunch": false, 636 | "category": "General purpose", 637 | "gpuNum": 0, 638 | "hideHardwareSpecs": true, 639 | "memoryGiB": 0, 640 | "name": "ml.geospatial.interactive", 641 | "supportedImageNames": [ 642 | "sagemaker-geospatial-v1-0" 643 | ], 644 | "vcpuNum": 0 645 | }, 646 | { 647 | "_defaultOrder": 21, 648 | "_isFastLaunch": true, 649 | "category": "Compute optimized", 650 | "gpuNum": 0, 651 | "hideHardwareSpecs": false, 652 | "memoryGiB": 4, 653 | "name": "ml.c5.large", 654 | "vcpuNum": 2 655 | }, 656 | { 657 | "_defaultOrder": 22, 658 | "_isFastLaunch": false, 659 | "category": "Compute optimized", 660 | "gpuNum": 0, 661 | "hideHardwareSpecs": false, 662 | "memoryGiB": 8, 663 | "name": "ml.c5.xlarge", 664 | "vcpuNum": 4 665 | }, 666 | { 667 | "_defaultOrder": 23, 668 | "_isFastLaunch": false, 669 | "category": "Compute optimized", 670 | "gpuNum": 0, 671 | "hideHardwareSpecs": false, 672 | "memoryGiB": 16, 673 | "name": "ml.c5.2xlarge", 674 | "vcpuNum": 8 675 | }, 676 | { 677 | "_defaultOrder": 24, 678 | "_isFastLaunch": false, 679 | "category": "Compute optimized", 680 | "gpuNum": 0, 681 | "hideHardwareSpecs": false, 682 | "memoryGiB": 32, 683 | "name": "ml.c5.4xlarge", 684 | "vcpuNum": 16 685 | }, 686 | { 687 | "_defaultOrder": 25, 688 | "_isFastLaunch": false, 689 | "category": "Compute optimized", 690 | "gpuNum": 0, 691 | "hideHardwareSpecs": false, 692 
| "memoryGiB": 72, 693 | "name": "ml.c5.9xlarge", 694 | "vcpuNum": 36 695 | }, 696 | { 697 | "_defaultOrder": 26, 698 | "_isFastLaunch": false, 699 | "category": "Compute optimized", 700 | "gpuNum": 0, 701 | "hideHardwareSpecs": false, 702 | "memoryGiB": 96, 703 | "name": "ml.c5.12xlarge", 704 | "vcpuNum": 48 705 | }, 706 | { 707 | "_defaultOrder": 27, 708 | "_isFastLaunch": false, 709 | "category": "Compute optimized", 710 | "gpuNum": 0, 711 | "hideHardwareSpecs": false, 712 | "memoryGiB": 144, 713 | "name": "ml.c5.18xlarge", 714 | "vcpuNum": 72 715 | }, 716 | { 717 | "_defaultOrder": 28, 718 | "_isFastLaunch": false, 719 | "category": "Compute optimized", 720 | "gpuNum": 0, 721 | "hideHardwareSpecs": false, 722 | "memoryGiB": 192, 723 | "name": "ml.c5.24xlarge", 724 | "vcpuNum": 96 725 | }, 726 | { 727 | "_defaultOrder": 29, 728 | "_isFastLaunch": true, 729 | "category": "Accelerated computing", 730 | "gpuNum": 1, 731 | "hideHardwareSpecs": false, 732 | "memoryGiB": 16, 733 | "name": "ml.g4dn.xlarge", 734 | "vcpuNum": 4 735 | }, 736 | { 737 | "_defaultOrder": 30, 738 | "_isFastLaunch": false, 739 | "category": "Accelerated computing", 740 | "gpuNum": 1, 741 | "hideHardwareSpecs": false, 742 | "memoryGiB": 32, 743 | "name": "ml.g4dn.2xlarge", 744 | "vcpuNum": 8 745 | }, 746 | { 747 | "_defaultOrder": 31, 748 | "_isFastLaunch": false, 749 | "category": "Accelerated computing", 750 | "gpuNum": 1, 751 | "hideHardwareSpecs": false, 752 | "memoryGiB": 64, 753 | "name": "ml.g4dn.4xlarge", 754 | "vcpuNum": 16 755 | }, 756 | { 757 | "_defaultOrder": 32, 758 | "_isFastLaunch": false, 759 | "category": "Accelerated computing", 760 | "gpuNum": 1, 761 | "hideHardwareSpecs": false, 762 | "memoryGiB": 128, 763 | "name": "ml.g4dn.8xlarge", 764 | "vcpuNum": 32 765 | }, 766 | { 767 | "_defaultOrder": 33, 768 | "_isFastLaunch": false, 769 | "category": "Accelerated computing", 770 | "gpuNum": 4, 771 | "hideHardwareSpecs": false, 772 | "memoryGiB": 192, 773 | "name": "ml.g4dn.12xlarge", 774 | "vcpuNum": 48 775 | }, 776 | { 777 | "_defaultOrder": 34, 778 | "_isFastLaunch": false, 779 | "category": "Accelerated computing", 780 | "gpuNum": 1, 781 | "hideHardwareSpecs": false, 782 | "memoryGiB": 256, 783 | "name": "ml.g4dn.16xlarge", 784 | "vcpuNum": 64 785 | }, 786 | { 787 | "_defaultOrder": 35, 788 | "_isFastLaunch": false, 789 | "category": "Accelerated computing", 790 | "gpuNum": 1, 791 | "hideHardwareSpecs": false, 792 | "memoryGiB": 61, 793 | "name": "ml.p3.2xlarge", 794 | "vcpuNum": 8 795 | }, 796 | { 797 | "_defaultOrder": 36, 798 | "_isFastLaunch": false, 799 | "category": "Accelerated computing", 800 | "gpuNum": 4, 801 | "hideHardwareSpecs": false, 802 | "memoryGiB": 244, 803 | "name": "ml.p3.8xlarge", 804 | "vcpuNum": 32 805 | }, 806 | { 807 | "_defaultOrder": 37, 808 | "_isFastLaunch": false, 809 | "category": "Accelerated computing", 810 | "gpuNum": 8, 811 | "hideHardwareSpecs": false, 812 | "memoryGiB": 488, 813 | "name": "ml.p3.16xlarge", 814 | "vcpuNum": 64 815 | }, 816 | { 817 | "_defaultOrder": 38, 818 | "_isFastLaunch": false, 819 | "category": "Accelerated computing", 820 | "gpuNum": 8, 821 | "hideHardwareSpecs": false, 822 | "memoryGiB": 768, 823 | "name": "ml.p3dn.24xlarge", 824 | "vcpuNum": 96 825 | }, 826 | { 827 | "_defaultOrder": 39, 828 | "_isFastLaunch": false, 829 | "category": "Memory Optimized", 830 | "gpuNum": 0, 831 | "hideHardwareSpecs": false, 832 | "memoryGiB": 16, 833 | "name": "ml.r5.large", 834 | "vcpuNum": 2 835 | }, 836 | { 837 | "_defaultOrder": 40, 838 | 
"_isFastLaunch": false, 839 | "category": "Memory Optimized", 840 | "gpuNum": 0, 841 | "hideHardwareSpecs": false, 842 | "memoryGiB": 32, 843 | "name": "ml.r5.xlarge", 844 | "vcpuNum": 4 845 | }, 846 | { 847 | "_defaultOrder": 41, 848 | "_isFastLaunch": false, 849 | "category": "Memory Optimized", 850 | "gpuNum": 0, 851 | "hideHardwareSpecs": false, 852 | "memoryGiB": 64, 853 | "name": "ml.r5.2xlarge", 854 | "vcpuNum": 8 855 | }, 856 | { 857 | "_defaultOrder": 42, 858 | "_isFastLaunch": false, 859 | "category": "Memory Optimized", 860 | "gpuNum": 0, 861 | "hideHardwareSpecs": false, 862 | "memoryGiB": 128, 863 | "name": "ml.r5.4xlarge", 864 | "vcpuNum": 16 865 | }, 866 | { 867 | "_defaultOrder": 43, 868 | "_isFastLaunch": false, 869 | "category": "Memory Optimized", 870 | "gpuNum": 0, 871 | "hideHardwareSpecs": false, 872 | "memoryGiB": 256, 873 | "name": "ml.r5.8xlarge", 874 | "vcpuNum": 32 875 | }, 876 | { 877 | "_defaultOrder": 44, 878 | "_isFastLaunch": false, 879 | "category": "Memory Optimized", 880 | "gpuNum": 0, 881 | "hideHardwareSpecs": false, 882 | "memoryGiB": 384, 883 | "name": "ml.r5.12xlarge", 884 | "vcpuNum": 48 885 | }, 886 | { 887 | "_defaultOrder": 45, 888 | "_isFastLaunch": false, 889 | "category": "Memory Optimized", 890 | "gpuNum": 0, 891 | "hideHardwareSpecs": false, 892 | "memoryGiB": 512, 893 | "name": "ml.r5.16xlarge", 894 | "vcpuNum": 64 895 | }, 896 | { 897 | "_defaultOrder": 46, 898 | "_isFastLaunch": false, 899 | "category": "Memory Optimized", 900 | "gpuNum": 0, 901 | "hideHardwareSpecs": false, 902 | "memoryGiB": 768, 903 | "name": "ml.r5.24xlarge", 904 | "vcpuNum": 96 905 | }, 906 | { 907 | "_defaultOrder": 47, 908 | "_isFastLaunch": false, 909 | "category": "Accelerated computing", 910 | "gpuNum": 1, 911 | "hideHardwareSpecs": false, 912 | "memoryGiB": 16, 913 | "name": "ml.g5.xlarge", 914 | "vcpuNum": 4 915 | }, 916 | { 917 | "_defaultOrder": 48, 918 | "_isFastLaunch": false, 919 | "category": "Accelerated computing", 920 | "gpuNum": 1, 921 | "hideHardwareSpecs": false, 922 | "memoryGiB": 32, 923 | "name": "ml.g5.2xlarge", 924 | "vcpuNum": 8 925 | }, 926 | { 927 | "_defaultOrder": 49, 928 | "_isFastLaunch": false, 929 | "category": "Accelerated computing", 930 | "gpuNum": 1, 931 | "hideHardwareSpecs": false, 932 | "memoryGiB": 64, 933 | "name": "ml.g5.4xlarge", 934 | "vcpuNum": 16 935 | }, 936 | { 937 | "_defaultOrder": 50, 938 | "_isFastLaunch": false, 939 | "category": "Accelerated computing", 940 | "gpuNum": 1, 941 | "hideHardwareSpecs": false, 942 | "memoryGiB": 128, 943 | "name": "ml.g5.8xlarge", 944 | "vcpuNum": 32 945 | }, 946 | { 947 | "_defaultOrder": 51, 948 | "_isFastLaunch": false, 949 | "category": "Accelerated computing", 950 | "gpuNum": 1, 951 | "hideHardwareSpecs": false, 952 | "memoryGiB": 256, 953 | "name": "ml.g5.16xlarge", 954 | "vcpuNum": 64 955 | }, 956 | { 957 | "_defaultOrder": 52, 958 | "_isFastLaunch": false, 959 | "category": "Accelerated computing", 960 | "gpuNum": 4, 961 | "hideHardwareSpecs": false, 962 | "memoryGiB": 192, 963 | "name": "ml.g5.12xlarge", 964 | "vcpuNum": 48 965 | }, 966 | { 967 | "_defaultOrder": 53, 968 | "_isFastLaunch": false, 969 | "category": "Accelerated computing", 970 | "gpuNum": 4, 971 | "hideHardwareSpecs": false, 972 | "memoryGiB": 384, 973 | "name": "ml.g5.24xlarge", 974 | "vcpuNum": 96 975 | }, 976 | { 977 | "_defaultOrder": 54, 978 | "_isFastLaunch": false, 979 | "category": "Accelerated computing", 980 | "gpuNum": 8, 981 | "hideHardwareSpecs": false, 982 | "memoryGiB": 768, 983 | "name": 
"ml.g5.48xlarge", 984 | "vcpuNum": 192 985 | }, 986 | { 987 | "_defaultOrder": 55, 988 | "_isFastLaunch": false, 989 | "category": "Accelerated computing", 990 | "gpuNum": 8, 991 | "hideHardwareSpecs": false, 992 | "memoryGiB": 1152, 993 | "name": "ml.p4d.24xlarge", 994 | "vcpuNum": 96 995 | }, 996 | { 997 | "_defaultOrder": 56, 998 | "_isFastLaunch": false, 999 | "category": "Accelerated computing", 1000 | "gpuNum": 8, 1001 | "hideHardwareSpecs": false, 1002 | "memoryGiB": 1152, 1003 | "name": "ml.p4de.24xlarge", 1004 | "vcpuNum": 96 1005 | } 1006 | ], 1007 | "instance_type": "ml.t3.medium", 1008 | "kernelspec": { 1009 | "display_name": "Python 3 (Data Science 3.0)", 1010 | "language": "python", 1011 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" 1012 | }, 1013 | "language_info": { 1014 | "codemirror_mode": { 1015 | "name": "ipython", 1016 | "version": 3 1017 | }, 1018 | "file_extension": ".py", 1019 | "mimetype": "text/x-python", 1020 | "name": "python", 1021 | "nbconvert_exporter": "python", 1022 | "pygments_lexer": "ipython3", 1023 | "version": "3.10.6" 1024 | } 1025 | }, 1026 | "nbformat": 4, 1027 | "nbformat_minor": 5 1028 | } 1029 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT No Attribution 2 | 3 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so. 10 | 11 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 12 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 13 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 14 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 15 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 16 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
17 | 
18 | 
--------------------------------------------------------------------------------
/Prompt Decomposition/decompose.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/Prompt Decomposition/decompose.png
--------------------------------------------------------------------------------
/Prompt Decomposition/decomposed_task.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/Prompt Decomposition/decomposed_task.png
--------------------------------------------------------------------------------
/Prompt Decomposition/full_task.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/Prompt Decomposition/full_task.png
--------------------------------------------------------------------------------
/Prompt Decomposition/results_example_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/Prompt Decomposition/results_example_1.png
--------------------------------------------------------------------------------
/Prompt Decomposition/results_example_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/Prompt Decomposition/results_example_2.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Advanced Summarization Techniques using Generative AI
2 | ### Summarize any length of text or group of documents with Amazon Bedrock
3 | 
4 | In this repo, we examine four different methods of performing summarization. We will look at the pros and cons of each one, and suggest the best use cases for each. This includes a system to summarize text from one long source, or from a set of multiple documents. Here are the four methods:
5 | 1) "stuff it" - place the whole document into a single prompt.
6 | 2) "map reduce" - break the document into parts, summarize each part, then combine them together.
7 | 3) "auto refine" - ask the LLM to find gaps in its own summary, and fill them.
8 | 4) "multi-doc" - summarize multiple documents on the basis of guidance questions.
9 | 
10 | We also have multiple elements that support this kind of generative AI workload, and they may be helpful for others as well.
11 | 1) A front-end GUI, useful if you want to deploy this code as an interactive tool.
12 | 2) Prompt Evaluation, a suggested framework for using an LLM as a judge to measure accuracy.
13 | 3) Prompt Decomposition, how to break down prompts into multiple steps to solve for accuracy, cost, scale, and latency.
14 | 
15 | Just want to dive in and start summarizing? Open Examples.ipynb, which shows how to use each kind of summarization.
16 | 
17 | ### A brief look at the contents of this repo:
18 | - Examples.ipynb Shows how to use the functions defined in the other notebooks.
19 | - advanced_summarize.ipynb Includes auto-refinement of summaries for higher quality, as well as summarization for groups of documents.
20 | - simple_summarize.ipynb Includes two of the simplest, most common types of summarization, useful for basic tasks.
21 | - Data collection and cleaning.ipynb A utility notebook for downloading and cleaning data in preparation for summarization.
22 | - sample texts/ A few sample documents of different lengths, already cleaned and ready to summarize.
23 | - detect_attribution.ipynb A novel way to detect attributions at the sentence level, increasing user trust.
24 | - Prompt Evaluation.ipynb - Examples of how to build a framework for evaluating accuracy. A more advanced version is in the Prompt Decomposition folder.
25 | - Prompt Decomposition - Examples of how to break down prompts into multiple steps to solve for accuracy, cost, scale, and latency.
26 | 
27 | ## Security
28 | 
29 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
30 | 
31 | ## License
32 | 
33 | This library is licensed under the MIT-0 License. See the LICENSE file.
34 | 
35 | 
--------------------------------------------------------------------------------
/backend/requirements.txt:
--------------------------------------------------------------------------------
1 | anthropic==0.9.0
2 | boto3==1.34.17
3 | botocore==1.34.17
4 | Flask==3.0.0
5 | langchain==0.1.0
6 | flask_cors==5.0.0
7 | 
--------------------------------------------------------------------------------
/backend/src/app.py:
--------------------------------------------------------------------------------
1 | import json
2 | from flask import Flask, request, jsonify
3 | from flask_cors import CORS, cross_origin
4 | import boto3
5 | from botocore.config import Config
6 | from technique.stuff_it import StuffIt
7 | from technique.map_reduce import MapReduce
8 | from technique.auto_refine import AutoRefine
9 | from technique.multi_doc import MultiDoc
10 | from type.step import Step
11 | import time
12 | import pickle
13 | import os
14 | import uuid
15 | 
16 | # Create a location to upload documents.
17 | UPLOAD_FOLDER: str = 'documents'
18 | if not os.path.exists(UPLOAD_FOLDER):
19 |     os.makedirs(UPLOAD_FOLDER)
20 | 
21 | app = Flask(__name__)
22 | 
23 | cors = CORS(app)
24 | app.config['CORS_HEADERS'] = 'Content-Type'
25 | app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
26 | 
27 | 
28 | # Increase the standard timeout limits in boto3, because Bedrock may take a while to respond to large requests.
29 | my_config = Config(connect_timeout=60*3, read_timeout=60*3)
30 | bedrock_client = boto3.client(service_name='bedrock-runtime', config=my_config)
31 | 
32 | # Create the different summarization clients to be used.
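# Note: an AutoRefine client is intentionally not created here. It accumulates a list of steps
# across recursive calls, so it is re-instantiated inside its route handler for each request.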
33 | stuff_it_client: StuffIt = StuffIt(bedrock_client) 34 | map_reduce_client: MapReduce = MapReduce(bedrock_client) 35 | multi_doc_client: MultiDoc = MultiDoc(bedrock_client) 36 | 37 | @app.route('/stuff-it', methods=['POST']) 38 | @cross_origin() 39 | def stuff_it(): 40 | data: dict = request.json 41 | doc: str = _get_doc_from_request(data) 42 | 43 | if not doc: 44 | return jsonify({'error': 'No text to summarize provided'}) 45 | 46 | results: dict = _wrap_in_duration(doc, stuff_it_client.stuff_it) 47 | return jsonify(_format_results(results)) 48 | 49 | @app.route('/map-reduce', methods=['POST']) 50 | @cross_origin() 51 | def map_reduce(): 52 | data: dict = request.json 53 | doc: str = _get_doc_from_request(data) 54 | 55 | if not doc: 56 | return jsonify({'error': 'No text to summarize provided'}) 57 | 58 | results: dict = _wrap_in_duration(doc, map_reduce_client.map_reduce) 59 | return jsonify(_format_results(results)) 60 | 61 | @app.route('/auto-refine', methods=['POST']) 62 | @cross_origin() 63 | def auto_refine(): 64 | data: dict = request.json 65 | doc: str = _get_doc_from_request(data) 66 | 67 | if not doc: 68 | return jsonify({'error': 'No text to summarize provided'}) 69 | 70 | prompt_options = {} 71 | prompt_options['prompt_type'] = "summary" 72 | prompt_options['format_type'] = "narrative" 73 | prompt_options['manual_guidance'] = "" 74 | prompt_options['style_guide'] = "" 75 | 76 | # Because auto_refine uses recursion, it's easier to reinitialize the object each call 77 | # so that the steps are reset at the API call level. 78 | auto_refine_client: AutoRefine = AutoRefine(bedrock_client) 79 | 80 | results: dict = auto_refine_client.auto_refine(doc, prompt_options, AUTO_REFINE=True) 81 | return jsonify(_format_results(results)) 82 | 83 | @app.route('/multi-doc', methods=['POST']) 84 | @cross_origin() 85 | def multi_doc(): 86 | data: dict = request.json 87 | file_location: list[str] = data['uploadLocation'] 88 | description_of_documents: str = data['descriptionOfDocuments'] 89 | questions_about_docs: list[str] = data['questions'].split(',') if data['questions'] != '' else [''] 90 | 91 | docs: list[str] = _get_docs_from_filepath(file_location) 92 | results: dict = multi_doc_client.multi_doc(docs, questions_about_docs, description_of_documents) 93 | 94 | return jsonify(_format_results(results)) 95 | 96 | ##### 97 | # Reads in files from uploaded doc list, pickles them and then 98 | # returns the path to the file to be used in subsequent calls. 
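# For illustration, a hypothetical client call might look like this (assuming the API is served by
# the Flask dev server at http://localhost:5000; adjust the host and port for your deployment):
#   curl -X POST -F "files=@doc1.txt" -F "files=@doc2.txt" http://localhost:5000/upload-docs
#   -> {"uploadLocation": "documents/<uuid>.pkl"}
# The returned uploadLocation can then be passed in the JSON body of the summarization endpoints.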
99 | ##### 100 | @app.route('/upload-docs', methods=['POST']) 101 | @cross_origin() 102 | def upload_many_docs(): 103 | if 'files' not in request.files: 104 | return jsonify({ 'error': 'No file uploaded' }) 105 | 106 | files = request.files.getlist("files") 107 | 108 | documents: list[str] = [] 109 | 110 | for file in files: 111 | if file.filename == '': 112 | return jsonify({ 'error': 'No file selected' }) 113 | 114 | if file: 115 | text_data: str = file.read().decode('utf-8') 116 | documents.append(text_data) 117 | 118 | result_path: str = _write_docs_to_file(documents) 119 | return jsonify({ 'uploadLocation': result_path }) 120 | 121 | 122 | ######################## 123 | # Helper functions 124 | ######################## 125 | 126 | 127 | # Flask can't jsonify objects so convert them to raw dicts beforehand 128 | def _format_results(results: list[Step]) -> str: 129 | results['steps'] = [step.to_dict() for step in results['steps']] 130 | return results 131 | 132 | # Helper function to wrap the function call in the number of seconds it takes to return. 133 | def _wrap_in_duration(doc: str, func: callable) -> dict: 134 | start_time = time.time() 135 | results = func(doc) 136 | end_time = time.time() 137 | results['time'] = round(end_time - start_time, 2) 138 | return results 139 | 140 | def _get_docs_from_filepath(path: str) -> list[str]: 141 | docs: list[str] = [] 142 | with open(path, 'rb') as file: 143 | docs = pickle.load(file) 144 | return docs 145 | 146 | def _write_docs_to_file(docs: list[str]) -> str: 147 | # Define a unique name for the pickled files. 148 | filename = str(uuid.uuid4()) + '.pkl' 149 | path = f'{UPLOAD_FOLDER}/{filename}' 150 | with open(path, 'wb') as f: 151 | pickle.dump(docs, f) 152 | return path 153 | 154 | def _get_doc_from_request(request: dict) -> str: 155 | if 'uploadLocation' in request and request['uploadLocation'] != '': 156 | file_location: str = request['uploadLocation'] 157 | return _get_docs_from_filepath(file_location)[-1] # In case there's multiple docs, just use the last one one. 158 | elif 'textToSummarize' in request: 159 | return request['textToSummarize'] 160 | else: 161 | return None -------------------------------------------------------------------------------- /backend/src/technique/auto_refine.py: -------------------------------------------------------------------------------- 1 | from anthropic import Anthropic 2 | import pickle 3 | import os 4 | import re 5 | import json 6 | from queue import Queue 7 | from threading import Thread 8 | import boto3 9 | import time 10 | from queue import Queue 11 | from threading import Thread 12 | 13 | from technique.base_advanced_summarization import BaseAdvancedSummarization 14 | from type.step import Step 15 | 16 | # Property of Claude 2 17 | MAX_TOKEN_COUNT = 12000 18 | # How many times to retry if Claude is not working. 19 | MAX_ATTEMPTS = 30 20 | 21 | anthropic_client = Anthropic() 22 | 23 | class AutoRefine(BaseAdvancedSummarization): 24 | 25 | def __init__(self, client: boto3.client, cache_responses: bool=False, debug: bool = False): 26 | super().__init__(client) 27 | self.cache_responses = cache_responses 28 | self.debug = debug 29 | 30 | # Because these steps are more complicated, we collect them as we go. After each call to auto_refine, we reset the steps 31 | # before returning. 32 | self.steps: list[Step] = [] 33 | 34 | ####### 35 | # This function uses the three helper functions in super class, as well as the generate_summary_from_chunks above, to iteratively generate high quality summaries. 
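    # At a high level the flow is: chunk the document, summarize the chunks into a first draft,
    # ask the model what questions the draft leaves unanswered (the "interrogate" prompt), answer
    # those questions from the original chunks (the "answers" prompt), and merge the answers back
    # into the draft (the "merge_answers" prompt) to produce the final summary.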
36 | # AUTO_REFINE, if true, has the LLM generate a list of questions, and then recursivly calls this function with those questions for guidance. 37 | # ALREADY_CHUNKED_AND_SUMMED, if true, means that this is being called using a list of summarized documents which should not be chunked or summarized further. 38 | ####### 39 | def auto_refine(self, full_text: str, prompt_options: dict, AUTO_REFINE: bool =True, ALREADY_CHUNKED_AND_SUMMED: bool=False): 40 | # Capture total duration. 41 | start_time = time.time() 42 | #first break this document into chunks 43 | chunks = [] 44 | 45 | if ALREADY_CHUNKED_AND_SUMMED: 46 | chunks = full_text 47 | else: 48 | chunks = self.get_chunks(full_text) 49 | self.add_step('Chunking up document', full_text, 'Chunked into {} chunks'.format(len(chunks))) 50 | 51 | if self.debug: 52 | if prompt_options['prompt_type'] == "answers": 53 | print ("Generating answers using %s chunks."%(len(chunks))) 54 | else: 55 | print ("Generating a new combined summary for %s chunks."%(len(chunks))) 56 | if ALREADY_CHUNKED_AND_SUMMED: 57 | print ("Input has already been chunked and summarized, skipping initial chunking.") 58 | 59 | first_summary = self.generate_summary_from_chunks(chunks,prompt_options, chunks_already_summarized=ALREADY_CHUNKED_AND_SUMMED) 60 | self.add_step('Generated summary from chunks', 'Input # {} chunks'.format(len(chunks)), first_summary) 61 | 62 | if self.debug and AUTO_REFINE: 63 | print ("First summary:") 64 | print (first_summary) 65 | 66 | if AUTO_REFINE: 67 | if self.debug: 68 | print ("Asking the LLM to find weaknesses in this summary...") 69 | #now that we have a rough summary, let's grab some questions about it. 70 | questions_prompt = self.get_prompt(first_summary,"interrogate","list", "", "") 71 | questions_list = self.ask_claude(questions_prompt)[1] 72 | self.add_step('Generated questions from summary', first_summary, 'Questions from LLM:\n{}'.format(questions_list)) 73 | 74 | if self.debug: 75 | print ("Questions from the LLM:") 76 | print (questions_list) 77 | 78 | original_guidance = prompt_options['manual_guidance'] 79 | original_prompt_type = prompt_options['prompt_type'] 80 | prompt_options['manual_guidance'] = prompt_options['manual_guidance'] + questions_list 81 | prompt_options['prompt_type'] = "answers" 82 | add_details = self.auto_refine(full_text, prompt_options,AUTO_REFINE=False, ALREADY_CHUNKED_AND_SUMMED=ALREADY_CHUNKED_AND_SUMMED) 83 | self.add_step('Adding details to summary based on Question list', prompt_options['manual_guidance'], add_details) 84 | 85 | if self.debug: 86 | print("Additional Details:") 87 | print (add_details) 88 | print("Merging details into original summary...") 89 | 90 | prompt_options['manual_guidance'] = original_guidance + add_details 91 | prompt_options['prompt_type'] = "merge_answers" 92 | custom_prompt = self.get_prompt(first_summary,prompt_options['prompt_type'],prompt_options['format_type'], prompt_options['manual_guidance'], prompt_options['style_guide']) 93 | final_summary = self.ask_claude(custom_prompt)[1] 94 | self.add_step('Creating final summary from original guidance + questions', first_summary, final_summary) 95 | 96 | #return this back to the original to prevent weird errors between calls of this function. 
97 | prompt_options['manual_guidance'] = original_guidance 98 | prompt_options['prompt_type'] = original_prompt_type 99 | 100 | response = { 101 | 'results': final_summary, 102 | 'steps': self.steps, 103 | 'time': round(time.time() - start_time, 2) 104 | } 105 | 106 | return response 107 | 108 | else: 109 | response = { 110 | 'results': first_summary, 111 | 'steps': self.steps, 112 | 'time': round(time.time() - start_time, 2) 113 | } 114 | self.add_step('Returning summary from sub refinement call', 'Chunks of size {}'.format(len(chunks)), first_summary) 115 | return first_summary 116 | 117 | def add_step(self, action: str, input: str, result: str): 118 | self.steps.append(Step(action=action, input=input, results=result)) 119 | 120 | -------------------------------------------------------------------------------- /backend/src/technique/base_advanced_summarization.py: -------------------------------------------------------------------------------- 1 | from anthropic import Anthropic 2 | import pickle 3 | import os 4 | import re 5 | import json 6 | from queue import Queue 7 | from threading import Thread 8 | import boto3 9 | import time 10 | from queue import Queue 11 | from threading import Thread 12 | 13 | # Property of Claude 2 14 | MAX_TOKEN_COUNT = 12000 15 | # How many times to retry if Claude is not working. 16 | MAX_ATTEMPTS = 30 17 | 18 | MAIN_PROMPT = """\n\nHuman: I am going to give you a text{{GUIDANCE_1}}. This text is extracted from a larger document. Here is the text: 19 | 20 | 21 | {{TEXT}} 22 | 23 | {{GUIDANCE_2}} 24 | {{STYLE}}{{REQUEST}}{{FORMAT}}{{GUIDANCE_3}} 25 | \nAssistant: Here is what you asked for: 26 | """ 27 | 28 | MERGE_PROMPT = """\n\nHuman: Here are a number of related summaries: 29 | 30 | {{TEXT}} 31 | Please merge these summaries into a highly detailed single summary in {{FORMAT}} format, preserving as much detail as possible, using less than 1000 tokens. 32 | \nAssistant: Here is what you asked for: 33 | """ 34 | 35 | # This is inserted into the prompt template above, in the {{GUIDANCE_2}} section. 36 | GUIDANCE_PROMPT = """ 37 | Here is the additional guidance: 38 | 39 | {{GUIDANCE}} 40 | 41 | """ 42 | 43 | # This prompt asks the LLM to be a newpaper reporter, extracting facts from a document to be used in a later report. Good for summarizing factual sets of documents. 44 | REPORTER_PROMPT = """\n\nHuman: You are a newspaper reporter, collecting facts to be used in writing an article later. Consider this source text: 45 | 46 | {{TEXT}} 47 | 48 | {{DOCS_DESCRIPTION}} Please create a {{FORMAT}} of all the relevant facts from this text which will be useful in answering the question "{{GUIDANCE}}". To make your list as clear as possible, do not use and pronouns or ambigious phrases. For example, use a company's name rather than saying "the company" or they. 49 | \nAssistant: Here is the {{FORMAT}} of relevant facts: 50 | """ 51 | 52 | REPORTER_SUMMARY_PROMPT = """\n\nHuman: You are a newspaper reporter, collecting facts to be used in writing an article later. Consider these notes, each one derived from a different source text: 53 | {{TEXT}} 54 | Please create a {{FORMAT}} of all the relevant facts and trends from these notes which will be useful in answering the question "{{GUIDANCE}}"{{STYLE}}. To make your list as clear as possible, do not use and pronouns or ambigious phrases. For example, use a company's name rather than saying "the company" or "they". 
55 | \nAssistant: Here is the list of relevant facts: 56 | 57 | """ 58 | 59 | REPORTER_FINAL_PROMPT = """\n\nHuman: You are a newspaper reporter, writing an article based on facts that were collected and summarized earlier. Consider these summaries: 60 | {{TEXT}} 61 | Each summary is a collection of facts extracted from a number of source reports. Each source report was written by an AWS team talking about their interactions with their individual customer. Please create a {{FORMAT}} of all the relevant trends and details from these summaries which will be useful in answering the question "{{GUIDANCE}}". 62 | \nAssistant: Here is the narrative: 63 | 64 | """ 65 | 66 | anthropic_client = Anthropic() 67 | 68 | class BaseAdvancedSummarization: 69 | 70 | def __init__(self, client: boto3.client): 71 | self.MAIN_PROMPT = MAIN_PROMPT 72 | self.MERGE_PROMPT = MERGE_PROMPT 73 | self.GUIDANCE_PROMPT = GUIDANCE_PROMPT 74 | self.REPORTER_PROMPT = REPORTER_PROMPT 75 | self.REPORTER_SUMMARY_PROMPT = REPORTER_SUMMARY_PROMPT 76 | self.REPORTER_FINAL_PROMPT = REPORTER_FINAL_PROMPT 77 | 78 | self.bedrock: boto3.Client = client 79 | 80 | def count_tokens(self, text: str) -> int: 81 | return anthropic_client.count_tokens(text) 82 | 83 | def get_chunks(self, full_text: str, overlap: bool = True) -> list[str]: 84 | # Following testing, it was found that chunks should be 2000 tokens, 85 | # or 25% of the doc, whichever is shorter. max chunk size in tokens 86 | chunk_length_tokens = 2000 87 | # A paragraph is about 200 words, which is about 260 tokens on average 88 | # we'll overlap our chunks by a paragraph to provide cohesion to the final summaries. 89 | overlap_tokens = 260 if overlap else 0 90 | # Anything this short doesn't need to be chunked further. 91 | min_chunk_length = 260 + overlap_tokens*2 92 | #grab basic info about the text to be chunked. 93 | char_count = len(full_text) 94 | word_count = len(full_text.split(" "))#rough estimate 95 | token_count = self.count_tokens(full_text) 96 | token_per_charater = token_count/char_count 97 | #don't chunk tiny texts 98 | if token_count <= min_chunk_length: 99 | if self.debug: 100 | print("Text is too small to be chunked further") 101 | return [full_text] 102 | 103 | if self.debug: 104 | print ("Chunk DEBUG mode is on, information about the text and chunking will be printed out.") 105 | print ("Estimated character count:",char_count) 106 | print ("Estimated word count:",word_count) 107 | print ("Estimated token count:",token_count) 108 | print ("Estimated tokens per character:",token_per_charater) 109 | 110 | print("Full text tokens: ", self.count_tokens(full_text)) 111 | print("How many times bigger than max context window: ",round(self.count_tokens(full_text)/MAX_TOKEN_COUNT,2)) 112 | 113 | # If the text is shorter, use smaller chunks 114 | if (token_count/4=char_count: 136 | end_chunk=char_count 137 | last_chunk=True 138 | 139 | chunks.append(full_text[start_chunk:end_chunk]) 140 | 141 | #move our slice location 142 | if start_chunk == 0: 143 | start_chunk += chunk_length_chars - overlap_chars 144 | else: 145 | start_chunk += chunk_length_chars 146 | 147 | end_chunk = start_chunk + chunk_length_chars + 2 * overlap_chars 148 | 149 | if self.debug: 150 | print ("Created %s chunks."%len(chunks)) 151 | 152 | return chunks 153 | 154 | ############## 155 | # GET PROMPT 156 | # text should be a single string of the raw text to be sent to the gen ai model. 
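(An aside before the prompt options are enumerated.) get_chunks above budgets chunks at 2000 tokens (or roughly a quarter of the document, whichever is smaller) with a 260-token overlap, and converts those token budgets into character offsets using the estimated tokens-per-character ratio. A self-contained sketch of that sliding-window idea follows; it is not the class's exact code, and the fixed 4-characters-per-token ratio stands in for the tokenizer-based estimate:

def sliding_window_chunks(full_text: str, chunk_tokens: int = 2000, overlap_tokens: int = 260) -> list[str]:
    est_chars_per_token = 4  # stand-in; the class derives this ratio from count_tokens instead
    chunk_chars = chunk_tokens * est_chars_per_token
    overlap_chars = overlap_tokens * est_chars_per_token
    chunks, start = [], 0
    while start < len(full_text):
        end = min(start + chunk_chars + overlap_chars, len(full_text))
        chunks.append(full_text[start:end])
        start += chunk_chars
    return chunks

Overlapping by roughly a paragraph keeps sentences that straddle a chunk boundary visible to both chunks, which helps the later merge step stay coherent.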
157 | # prompt_type must be "summary" or "interrogate" or "answers" 158 | # -summary means summarize the text 159 | # -interrogate means look at the text and ask questions about what is missing 160 | # -answers means looking at the text, provide only details that may help answer the questions according to the Guidance. 161 | # -merge_answers takes a summary as text, and merges in the facts in the guidance section 162 | # -merge_summaries takes 2 or more summaries and merges them together. The summaries to be merged must be in list format for best results. 163 | # -reporter - like a news reporter, extract details that help answer the guidance questions 164 | # -reporter_summary - like a news reporter looking at a bunch of notes, create a list summary. Intended as an intermediate step. 165 | # -reporter_final - generates a narrative based on the reporter_summary outputs. 166 | # format_type must be "narrative" or "list" 167 | # manual_guidance Extra instructions to guide the process, usually from the user. 168 | # style_guide TBD 169 | 170 | # Note that merge_summaries is handled differently than all other options because it iteratively adds in multiple texts. 171 | ############### 172 | 173 | def get_prompt(self, text: str,prompt_type: str, format_type: str, manual_guidance: str, style_guide: str, docs_description: str="") -> str: 174 | 175 | # Answers mode is a bit different, so handle that first. 176 | if prompt_type == "answers": 177 | format_type = "in list format, using less than 1000 tokens. " 178 | prompt_type = "Please provide a list of any facts from the text that could be relevant to answering the questions from the guidance section " 179 | guidance_1 = " and some guidance" 180 | guidance_2 = self.GUIDANCE_PROMPT.replace("{{GUIDANCE}}",manual_guidance) 181 | guidance_3 = "You should ignore any questions that can not be answered by this text." 182 | elif prompt_type == "reporter": 183 | return self.REPORTER_PROMPT.replace("{{TEXT}}",text).replace("{{FORMAT}}",format_type).replace("{{GUIDANCE}}",manual_guidance).replace("{{DOCS_DESCRIPTION}}",docs_description) 184 | elif prompt_type == "reporter_summary": 185 | summaries_text = "" 186 | for x,summary in enumerate(text): 187 | summaries_text += "\n<summary_%s>\n%s\n</summary_%s>\n"%(x+1,summary,x+1) 188 | final_prompt = self.REPORTER_SUMMARY_PROMPT.replace("{{TEXT}}",summaries_text).replace("{{FORMAT}}",format_type).replace("{{GUIDANCE}}",manual_guidance).replace("{{STYLE}}",style_guide) 189 | return final_prompt 190 | elif prompt_type == "reporter_final": 191 | summaries_text = "" 192 | for x,summary in enumerate(text): 193 | summaries_text += "\n<summary_%s>\n%s\n</summary_%s>\n"%(x+1,summary,x+1) 194 | final_prompt = self.REPORTER_FINAL_PROMPT.replace("{{TEXT}}",summaries_text).replace("{{FORMAT}}",format_type).replace("{{GUIDANCE}}",manual_guidance) 195 | return final_prompt 196 | elif prompt_type == "merge_summaries": 197 | summaries_text = "" 198 | for x,summary in enumerate(text): 199 | summaries_text += "\n<summary_%s>\n%s\n</summary_%s>\n"%(x+1,summary,x+1) 200 | final_prompt = self.MERGE_PROMPT.replace("{{TEXT}}",summaries_text).replace("{{FORMAT}}",format_type) 201 | return final_prompt 202 | 203 | elif prompt_type == "merge_answers": 204 | prompt_type = "The text is a good summary which may lack a few details. However, the additional information found in the guidance section can be used to make the summary even better. Starting with the text, please use the details in the guidance section to make the text more detailed. The new summary should use less than 1000 tokens. 
" 205 | format_type = "" 206 | guidance_1 = " and some guidance" 207 | guidance_2 = self.GUIDANCE_PROMPT.replace("{{GUIDANCE}}",manual_guidance) 208 | guidance_3 = "You should ignore any comments in the guidance section indicating that answers could not be found." 209 | else: 210 | #Based on the options passed in, grab the correct text to eventually use to build the prompt. 211 | #select the correct type of output format desired, list or summary. Note that list for interrogate prompts is empty because the request for list is built into that prompt. 212 | if prompt_type == "interrogate" and format_type != "list": 213 | raise ValueError("Only list format is supported for interrogate prompts.") 214 | if format_type == "list": 215 | if prompt_type == "interrogate": 216 | format_type = ""#already in the prompt so no format needed. 217 | else: 218 | format_type = "in list format, using less than 1000 tokens." 219 | elif format_type == "narrative": 220 | format_type = "in narrative format, using less than 1000 tokens." 221 | else: 222 | raise ValueError("format_type must be 'narrative' or 'list'.") 223 | 224 | #select the correct prompt type language 225 | if prompt_type == "summary": 226 | prompt_type = "Please provide a highly detailed summary of this text " 227 | elif prompt_type == "interrogate": 228 | prompt_type = "This text is a summary that lacks detail. Please provide a list of the top 10 most important questions about this text that can not be answered by the text." 229 | else: 230 | raise ValueError("prompt_type must be 'summary' or 'interrogate'.") 231 | 232 | if manual_guidance == "": 233 | guidance_1 = "" 234 | guidance_2 = "" 235 | guidance_3 = "" 236 | else: 237 | guidance_1 = " and some guidance" 238 | guidance_2 = self.GUIDANCE_PROMPT.replace("{{GUIDANCE}}",manual_guidance) 239 | guidance_3 = " As much as possible, also follow the guidance from the guidance section above. You should ignore guidance that does not seem relevant to this text." 240 | 241 | style_guide = "" 242 | final_prompt = self.MAIN_PROMPT.replace("{{TEXT}}",text).replace("{{GUIDANCE_1}}",guidance_1).replace("{{GUIDANCE_2}}",guidance_2).replace("{{GUIDANCE_3}}",guidance_3).replace("{{STYLE}}",style_guide).replace("{{REQUEST}}",prompt_type).replace("{{FORMAT}}",format_type) 243 | return final_prompt 244 | 245 | ############# 246 | # Send a prompt to Bedrock, and return the response. Debug is used to see exactly what is being sent to and from Bedrock. 247 | # TODO: Add error checking and retry on hitting the throttling limit. 248 | ############# 249 | def ask_claude(self, prompt_text: str): 250 | 251 | # Usually, the prompt will have "human" and "assistant" tags already. These are required, so if they are not there, add them in. 
252 | if not "Assistant:" in prompt_text: 253 | prompt_text = "\n\nHuman:"+prompt_text+"\n\Assistant: " 254 | 255 | promt_json = { 256 | "prompt": prompt_text, 257 | "max_tokens_to_sample": 3000, 258 | "temperature": 0.7, 259 | "top_k": 250, 260 | "top_p": 0.7, 261 | "stop_sequences": ["\n\nHuman:"] 262 | } 263 | 264 | body = json.dumps(promt_json) 265 | 266 | # #returned cashed results, if any 267 | # if body in claude_cache: 268 | # return claude_cache[body] 269 | 270 | if self.debug: 271 | print("sending:",prompt_text) 272 | 273 | modelId = 'anthropic.claude-v2' 274 | accept = 'application/json' 275 | contentType = 'application/json' 276 | 277 | start_time = time.time() 278 | attempt = 1 279 | while True: 280 | try: 281 | query_start_time = time.time() 282 | response = self.bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType) 283 | response_body = json.loads(response.get('body').read()) 284 | 285 | raw_results = response_body.get("completion").strip() 286 | 287 | #strip out HTML tags that Claude sometimes adds, such as 288 | results = re.sub('<[^<]+?>', '', raw_results) 289 | request_time = round(time.time()-start_time,2) 290 | if self.debug: 291 | print("Recieved:",results) 292 | print("request time (sec):",request_time) 293 | total_tokens = self.count_tokens(prompt_text+raw_results) 294 | output_tokens = self.count_tokens(raw_results) 295 | tokens_per_sec = round(total_tokens/request_time,2) 296 | break 297 | except Exception as e: 298 | print("Error with calling Bedrock: "+str(e)) 299 | attempt+=1 300 | if attempt>MAX_ATTEMPTS: 301 | print("Max attempts reached!") 302 | results = str(e) 303 | request_time = -1 304 | total_tokens = -1 305 | output_tokens = -1 306 | tokens_per_sec = -1 307 | break 308 | 309 | # Retry in 10 seconds 310 | else: 311 | time.sleep(10) 312 | 313 | # # Store in cache only if it was not an error: 314 | # if request_time>0: 315 | # claude_cache[body] = (prompt_text,results,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time) 316 | 317 | return(prompt_text,results,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time) 318 | 319 | # Threaded function for queue processing. 320 | def thread_request(self, q, result): 321 | while not q.empty(): 322 | # Fetch new work from the Queue 323 | work = q.get() 324 | thread_start_time = time.time() 325 | try: 326 | data = self.ask_claude(work[1]) 327 | # Store data back at correct index 328 | result[work[0]] = data 329 | except Exception as e: 330 | error_time = time.time() 331 | print('Error with prompt!',str(e)) 332 | result[work[0]] = ( 333 | work[1], 334 | str(e), 335 | self.count_tokens(work[1]), 336 | 0, 337 | round(error_time-thread_start_time,2), 338 | 0, 339 | thread_start_time 340 | ) 341 | 342 | # Signal to the queue that task has been processed 343 | q.task_done() 344 | return True 345 | 346 | ### 347 | # Call ask_claude, but multi-threaded. 348 | # Returns a dict of the prompts and responces. 349 | ### 350 | def ask_claude_threaded(self, prompts: list[str]): 351 | q = Queue(maxsize=0) 352 | num_theads = min(50, len(prompts)) 353 | 354 | #Populating Queue with tasks 355 | results = [{} for x in prompts]; 356 | # Load up the queue with the promts to fetch and the index for each job (as a tuple): 357 | for i in range(len(prompts)): 358 | # Need the index and the url in each queue item. 
359 | q.put((i,prompts[i])) 360 | 361 | #Starting worker threads on queue processing 362 | for i in range(num_theads): 363 | worker = Thread(target=self.thread_request, args=(q,results)) 364 | # Setting threads as "daemon" allows main program to exit eventually even if these dont finish correctly. 365 | worker.setDaemon(True) 366 | worker.start() 367 | 368 | #now we wait until the queue has been processed 369 | q.join() 370 | 371 | if self.debug: 372 | print('All tasks completed.') 373 | return results 374 | 375 | 376 | # This function itterates through a list of chunks, summarizes them, then merges those summaries together into one. 377 | # chunks_already_summarized is used when the chunks passed in are chunks resulting from summerizing docs. 378 | # If the chunks are taken from a source document directly, chunks_already_summarized should be set to False. 379 | def generate_summary_from_chunks(self, chunks, prompt_options,DEBUG=False, chunks_already_summarized=False): 380 | partial_summaries = {} 381 | if not chunks_already_summarized:#chunks are from a source doc, so summarize them. 382 | partial_summaries_prompts = [] 383 | partial_summaries_prompt2chunk = {} 384 | for x,chunk in enumerate(chunks): 385 | #if DEBUG: print ("Working on chunk",x+1,end = '') 386 | start_chunk_time = time.time() 387 | #note that partial summaries are always done in list format to maximize information captured. 388 | custom_prompt = self.get_prompt(chunk,prompt_options['prompt_type'],'list', prompt_options['manual_guidance'], prompt_options['style_guide']) 389 | partial_summaries_prompts.append(custom_prompt) 390 | partial_summaries_prompt2chunk[custom_prompt]=chunk 391 | 392 | partial_summaries_results = self.ask_claude_threaded(partial_summaries_prompts) 393 | for prompt_text,results,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time in partial_summaries_results: 394 | partial_summaries[partial_summaries_prompt2chunk[prompt_text]] = results 395 | 396 | if DEBUG: 397 | print ("Partial summary chunks done!") 398 | print ("Creating joint summary...") 399 | 400 | else: 401 | for chunk in chunks: 402 | partial_summaries[chunk] = chunk 403 | if DEBUG: 404 | print ("Summarized chunks detected!") 405 | print ("Creating joint summary...") 406 | 407 | summaries_list = [] 408 | summaries_list_token_count = 0 409 | for chunk in chunks: 410 | summaries_list.append(partial_summaries[chunk]) 411 | summaries_list_token_count += self.count_tokens(partial_summaries[chunk]) 412 | 413 | if DEBUG: 414 | print("Chunk summaries token count:",summaries_list_token_count) 415 | 416 | #check to see if the joint summary is too long. If it is, recursivly itterate down. 417 | #we do this, rather than chunking again, so that summaries are not split. 418 | #it needs to be under 3000 tokens in order to be helpful to the summary (4000 is an expiremental number and may need to be adjusted.) 419 | #this may be higher than the 2000 used for text originally, because this data is in list format. 420 | recombine_token_target = 3000 421 | #summaries_list_token_count = recombine_token_target+1 #set this to target+1 so that we do at least one recombonation for shorter documents. 422 | while summaries_list_token_count>recombine_token_target: 423 | if DEBUG: 424 | print("Starting reduction loop to merge chunks. 
Total token count is %s"%summaries_list_token_count) 425 | 426 | new_summaries_list = [] 427 | summaries_list_token_count = 0 428 | temp_summary_group = [] 429 | temp_summary_group_token_length = 0 430 | for summary in summaries_list: 431 | if temp_summary_group_token_length + self.count_tokens(summary) > recombine_token_target: 432 | #the next summary added would push us over the edge, so summarize the current list, and then add it. 433 | #note that partial summaries are always done in list format to maximize information captured. 434 | if DEBUG: print("Reducing %s partial summaries into one..."%(len(temp_summary_group))) 435 | custom_prompt = self.get_prompt(temp_summary_group,"merge_summaries","list", prompt_options['manual_guidance'], prompt_options['style_guide']) 436 | temp_summary = self.ask_claude(custom_prompt)[1] 437 | new_summaries_list.append(temp_summary) 438 | summaries_list_token_count+= self.count_tokens(temp_summary) 439 | temp_summary_group = [] 440 | temp_summary_group_token_length = 0 441 | 442 | temp_summary_group.append(summary) 443 | temp_summary_group_token_length+= self.count_tokens(summary) 444 | 445 | #summarize whever extra summaries are still in the temp list 446 | if len(temp_summary_group)>1: 447 | if DEBUG: 448 | print("Starting final reduction of %s partial summaries into one..."%(len(temp_summary_group))) 449 | 450 | custom_prompt = self.get_prompt(temp_summary_group,"merge_summaries","list", prompt_options['manual_guidance'], prompt_options['style_guide']) 451 | temp_summary = self.ask_claude(custom_prompt)[1] 452 | new_summaries_list.append(temp_summary) 453 | summaries_list_token_count+= self.count_tokens(temp_summary) 454 | elif len(temp_summary_group)==1: 455 | if DEBUG: 456 | print("Tacking on an extra partial summary") 457 | 458 | new_summaries_list.append(temp_summary_group[0]) 459 | summaries_list_token_count+= self.count_tokens(temp_summary_group[0]) 460 | 461 | summaries_list = new_summaries_list 462 | 463 | if DEBUG: 464 | print ("Final merge of summary chunks, merging %s summaries."%(len(summaries_list))) 465 | 466 | custom_prompt = self.get_prompt(summaries_list,"merge_summaries",prompt_options['format_type'], prompt_options['manual_guidance'], prompt_options['style_guide']) 467 | full_summary = self.ask_claude(custom_prompt)[1] 468 | 469 | return full_summary 470 | -------------------------------------------------------------------------------- /backend/src/technique/map_reduce.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from langchain.llms.bedrock import Bedrock 3 | from langchain.chains.mapreduce import MapReduceChain 4 | from langchain.text_splitter import CharacterTextSplitter 5 | from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain 6 | from langchain.chains.llm import LLMChain 7 | from langchain.prompts import PromptTemplate 8 | from langchain.chains.combine_documents.stuff import StuffDocumentsChain 9 | 10 | from langchain.text_splitter import RecursiveCharacterTextSplitter 11 | 12 | from type.step import Step 13 | 14 | # Map 15 | MAP_TEMPLATE: str = """\n\nHuman: The following is a set of documents 16 | 17 | {docs} 18 | 19 | Based on this list of docs, please identify the main themes. 20 | """ 21 | 22 | # Reduce 23 | REDUCE_TEMPLATE = """\n\nHuman: The following is set of summaries: 24 | 25 | {doc_summaries} 26 | 27 | Please take these and distill them into a final, consolidated summary of the main themes in narative format. 
28 | """ 29 | 30 | class MapReduce: 31 | 32 | def __init__(self, client: boto3.client): 33 | self.bedrock_client = client 34 | 35 | model_parameter = {"temperature": 0.0, "top_p": .5, "max_tokens_to_sample": 2000} 36 | llm = Bedrock( 37 | model_id="anthropic.claude-v2", 38 | model_kwargs=model_parameter, 39 | client=self.bedrock_client 40 | ) 41 | 42 | map_prompt = PromptTemplate.from_template(MAP_TEMPLATE) 43 | map_chain = LLMChain(llm=llm, prompt=map_prompt) 44 | reduce_prompt = PromptTemplate.from_template(REDUCE_TEMPLATE) 45 | reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt) 46 | 47 | # Takes a list of documents, combines them into a single string, and passes this to an LLMChain 48 | combine_documents_chain = StuffDocumentsChain( 49 | llm_chain=reduce_chain, document_variable_name="doc_summaries" 50 | ) 51 | 52 | # Combines and iteravely reduces the mapped documents 53 | self.reduce_documents_chain = ReduceDocumentsChain( 54 | # This is final chain that is called. 55 | combine_documents_chain=combine_documents_chain, 56 | # If documents exceed context for `StuffDocumentsChain` 57 | collapse_documents_chain=combine_documents_chain, 58 | # The maximum number of tokens to group documents into. 59 | token_max=4000, 60 | ) 61 | 62 | # Combining documents by mapping a chain over them, then combining results 63 | self.map_reduce_chain = MapReduceDocumentsChain( 64 | # Map chain 65 | llm_chain=map_chain, 66 | # Reduce chain 67 | reduce_documents_chain=self.reduce_documents_chain, 68 | # The variable name in the llm_chain to put the documents in 69 | document_variable_name="docs", 70 | # Return the results of the map steps in the output 71 | return_intermediate_steps=True, 72 | ) 73 | 74 | self.text_splitter = RecursiveCharacterTextSplitter( 75 | chunk_size = 5000, 76 | chunk_overlap = 200, 77 | length_function = len, 78 | add_start_index = True, 79 | ) 80 | 81 | def map_reduce(self, doc: str) -> dict: 82 | split_docs = self.text_splitter.create_documents([doc]) 83 | # Get response from LLM. 84 | chain_response: dict = self.map_reduce_chain.invoke(split_docs) 85 | # Split documents return type Document which has the value page_content. 86 | split_documents = [doc.page_content for doc in chain_response['input_documents']] 87 | # Grab remaining elements to output the steps involved in map reduce. 88 | intermediate_steps = chain_response['intermediate_steps'] 89 | results = chain_response['output_text'] 90 | steps: list[Step] = self._get_steps(doc, split_documents, intermediate_steps, results) 91 | 92 | return { 'results': results, 'steps': steps } 93 | 94 | def _get_steps(self, input:str, split_documents: list[str], intermediate_steps: list[str], results: str) -> list[Step]: 95 | steps: list[Step] = [] 96 | 97 | # Explain original document split 98 | num_split = len(split_documents) 99 | split_original_doc: Step = Step(action='Split Documents', input=input, results='Split into {} documents'.format(num_split)) 100 | steps.append(split_original_doc) 101 | 102 | # Show intermediate steps. 
split_documents contain inputs and 103 | for i, doc in enumerate(split_documents): 104 | intermediate_res: str = intermediate_steps[i] 105 | split_doc_step: Step = Step(action='Summarize split part {}'.format(i), input=doc, results=intermediate_res) 106 | steps.append(split_doc_step) 107 | 108 | # Show summarization result 109 | 110 | result_step: Step = Step(action='Map Reduce Results', input='Summarized documents chunks above.', results=results) 111 | steps.append(result_step) 112 | 113 | return steps 114 | 115 | -------------------------------------------------------------------------------- /backend/src/technique/multi_doc.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from technique.base_advanced_summarization import BaseAdvancedSummarization 3 | from type.step import Step 4 | import time 5 | 6 | # Property of Claude 2 7 | MAX_TOKEN_COUNT = 12000 8 | # How many times to retry if Claude is not working. 9 | MAX_ATTEMPTS = 30 10 | 11 | class MultiDoc(BaseAdvancedSummarization): 12 | 13 | def __init__(self, client: boto3.client, cache_responses: bool=False, debug: bool = False): 14 | super().__init__(client) 15 | self.cache_responses = cache_responses 16 | self.debug = debug 17 | self.steps: list[Step] = [] 18 | 19 | # This function uses the three helper functions to read the documents passed in, and create a summary answer for each question passed in. 20 | # If the documents are longer than two pages or so, it is reccoemended that you first summaize each document. 21 | # docs_description is a single sentance describing what the documents are such as "The texts are a collection of product reviews for a Pickle Ball paddle." 22 | 23 | def multi_doc(self, docs: list[str], questions: list[str], docs_description: str): 24 | self.steps =[] 25 | start_time = time.time() 26 | answers: str = self._multi_doc(docs, questions, docs_description) 27 | end_time = time.time() 28 | 29 | return { 30 | 'results': answers, 31 | 'steps': self.steps, 32 | 'time': round(end_time-start_time,2) 33 | } 34 | 35 | def _multi_doc(self, docs: list[str], questions: list[str], docs_description: str): 36 | #get answers from each doc for each question. 37 | answers = {} 38 | prompt2quetion_doc = {} 39 | prompts = [] 40 | max_docs_to_scan = 500 41 | 42 | #build the queries to be passed into Bedrock 43 | for question in questions: 44 | for x, doc in enumerate(docs): 45 | if x>max_docs_to_scan: 46 | break #limit for testing 47 | 48 | #print ("Asking the LLM to find extract answers from this doc:",doc) 49 | questions_prompt = self.get_prompt(docs[x],"reporter","list", question, "",docs_description) 50 | prompt2quetion_doc[questions_prompt] = (question,doc) 51 | prompts.append(questions_prompt) 52 | # Add the question prompt as a step in the calculation. 53 | self.add_step('Generated question prompt for input document', doc, questions_prompt) 54 | 55 | if self.debug: 56 | print("Starting %s worker threads."%len(prompts)) 57 | 58 | prompts_answers = self.ask_claude_threaded(prompts) 59 | # save_calls(claude_cache) 60 | 61 | for question in questions: 62 | answers[question] = [] 63 | 64 | for prompt,answer,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time in prompts_answers: 65 | question,doc = prompt2quetion_doc[prompt] 66 | answers[question].append(answer) 67 | self.add_step('Generated answer for input document', doc, answer) 68 | 69 | 70 | current_answer_count = len(docs) 71 | if self.debug: 72 | print("All documents have been read. 
Reducing answers into the final summary...") 73 | 74 | #reduce this down to 5 or less docs for the final summary by combining the individual answers. 75 | while current_answer_count > 5: 76 | #summarize the answers 77 | prompts = [] 78 | prompts2question = {} 79 | 80 | max_docs_to_scan = max(min(current_answer_count,8),3) 81 | if self.debug: 82 | print("Combining %s chunks. (Currently there are %s answers to each question.)"%(max_docs_to_scan,current_answer_count)) 83 | 84 | for question in questions: 85 | #You want chunks of roughly 2K tokens 86 | for partial_chunks in self.grab_set_chunks(answers[question],max_docs_to_scan): 87 | questions_prompt = self.get_prompt(partial_chunks,"reporter_summary","list", question, " in less than 1000 tokens") 88 | prompts.append(questions_prompt) 89 | prompts2question[questions_prompt] = question 90 | 91 | if self.debug: 92 | print("Starting %s worker threads."%len(prompts)) 93 | 94 | prompts_answers = self.ask_claude_threaded(prompts) 95 | # save_calls(claude_cache) 96 | 97 | for question in questions: 98 | answers[question] = [] 99 | for prompt,answer,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time in prompts_answers: 100 | answers[prompts2question[prompt]].append(answer) 101 | self.add_step('Generated answer for question prompt', prompts2question[prompt], answer) 102 | 103 | current_answer_count = len(answers[questions[0]]) 104 | 105 | if self.debug: 106 | print("Creating the final summary for each question.") 107 | 108 | #write the final article: 109 | prompts = [] 110 | prompts2question = {} 111 | for question in questions: 112 | #print ("Asking the LLM to finalize the answer for this question:",question) 113 | questions_prompt = self.get_prompt(answers[question],"reporter_final","narrative", question, "") 114 | prompts.append(questions_prompt) 115 | prompts2question[questions_prompt] = question 116 | 117 | if self.debug: 118 | print("Starting %s worker threads."%len(prompts)) 119 | 120 | prompts_answers = self.ask_claude_threaded(prompts) 121 | # save_calls(claude_cache) 122 | 123 | answers = {} 124 | for prompt,answer,total_tokens,output_tokens,request_time,tokens_per_sec,query_start_time in prompts_answers: 125 | answers[prompts2question[prompt]] = answer 126 | self.add_step('Compile answers', prompts2question[prompt], answer) 127 | 128 | response: str = ''.join([f"{key}\n\n{value}" for key, value in answers.items()]) 129 | self.add_step('Add answers', 'input is from the questions and input documents', response) 130 | return response 131 | 132 | # Yield successive n-sized chunks from lst. 133 | # This is a helper function for the multidoc summarization function. 
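Before the small windowing helper defined next, here is a usage sketch of the class as a whole. The reviews and question are made up, the docs_description sentence echoes the example in the comments near the top of the file, and the Bedrock client name is an assumption:

import boto3
from technique.multi_doc import MultiDoc  # assumes backend/src is on the import path

summarizer = MultiDoc(boto3.client("bedrock-runtime"), debug=True)
output = summarizer.multi_doc(
    docs=["Review 1: the paddle grip is excellent ...", "Review 2: a bit too heavy for beginners ..."],
    questions=["What do customers like and dislike about the paddle?"],
    docs_description="The texts are a collection of product reviews for a Pickle Ball paddle.",
)
print(output["results"])               # each question followed by its narrative answer
print(len(output["steps"]), output["time"])

# The helper defined next simply windows a list, e.g.
# list(summarizer.grab_set_chunks(["a", "b", "c", "d", "e"], 2)) -> [["a", "b"], ["c", "d"], ["e"]]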
134 | def grab_set_chunks(self, lst, n): 135 | for i in range(0, len(lst), n): 136 | yield lst[i:i + n] 137 | 138 | def add_step(self, action: str, input: str, result: str): 139 | self.steps.append(Step(action=action, input=input, results=result)) 140 | -------------------------------------------------------------------------------- /backend/src/technique/stuff_it.py: -------------------------------------------------------------------------------- 1 | import boto3 2 | from langchain.llms.bedrock import Bedrock 3 | 4 | from langchain.chains.llm import LLMChain 5 | from langchain.prompts import PromptTemplate 6 | from langchain.chains.combine_documents.stuff import StuffDocumentsChain 7 | from langchain.schema.document import Document 8 | 9 | from type.step import Step 10 | from time import time 11 | 12 | MODEL_PARAMS = { 13 | "temperature": 0.0, 14 | "top_p": .5, 15 | "max_tokens_to_sample": 2000 16 | } 17 | 18 | # Define prompt 19 | prompt_template = """\n\nHuman: Consider this text: 20 | 21 | {text} 22 | 23 | Please create a concise summary in narative format. 24 | 25 | Assistiant: Here is the concise summary:""" 26 | 27 | class StuffIt: 28 | 29 | def __init__(self, client: boto3.client): 30 | self.bedrock_client = client 31 | 32 | llm = Bedrock( 33 | model_id="anthropic.claude-v2", 34 | model_kwargs=MODEL_PARAMS, 35 | client=self.bedrock_client 36 | ) 37 | 38 | prompt = PromptTemplate.from_template(prompt_template) 39 | llm_chain = LLMChain(llm=llm, prompt=prompt) 40 | 41 | self.stuff_chain = StuffDocumentsChain( 42 | llm_chain=llm_chain, 43 | document_variable_name="text" 44 | ) 45 | 46 | def stuff_it(self, doc: str)-> dict: 47 | docs = [Document(page_content=doc)] 48 | results: str = self.stuff_chain.run(docs) 49 | steps: list = self._get_steps(doc, results) 50 | return {"results": results, 'steps': steps} 51 | 52 | def _get_steps(self, input: str, results: str) -> list[Step]: 53 | step1: Step = Step(action='Call LLM', input=input, results=results) 54 | return [step1] 55 | -------------------------------------------------------------------------------- /backend/src/type/step.py: -------------------------------------------------------------------------------- 1 | class Step: 2 | 3 | def __init__(self, action: str, input: str, results: str): 4 | self.action = action 5 | self.input = input 6 | self.results = results 7 | 8 | def to_dict(self): 9 | return { 10 | 'action': self.action, 11 | 'input': self.input, 12 | 'results': self.results 13 | } -------------------------------------------------------------------------------- /chaptersum_data/chapters_11_to_13/summary.txt: -------------------------------------------------------------------------------- 1 | Tulkinghorn and Krook find Nemo dead, apparently from an overdose of opium. Krook claims no knowledge of Mr. Nemo's identity or habits, and is chagrined because Nemo owes him six weeks' rent. Dr. Woodcourt (though he is not named immediately) arrives on the scene and reveals that Nemo has bought opium from him for a year and a half. Tulkinghorn learns from the ragamuffin Jo, a crossing-sweeper, that Nemo was a kind man, and gave Jo money when he had it. "He wos wery good to me, he wos!" 2 | 3 | An inquest is held regarding Nemo's death, a silly proceeding in a tavern complete with a comic vocalist, and he is officially found dead by accident. Since Nemo had no money or family, he is buried in a pauper's grave. 4 | 5 | In Chapter 12, the scene returns to Chesney Wold. The Dedlocks have returned, and have guests. 
Meanwhile Lady Dedlock continues to suffer ennui. Lady Dedlock meets the young maid Rosa for the first time, and takes a liking to her. This causes jealousy to arise in Lady Dedlock's personal lady's maid, Mademoiselle Hortense. 6 | 7 | Tulkinghorn comes to Chesney Wold to inform Sir Leicester of the lawsuit brought upon him by his neighbor Boythorn. Tulkinghorn also has a message for Lady Dedlock concerning Nemo, whose handwriting was on the paper she had such interest in the last time he was at Chesney Wold, and tells of the law writer's death. She pretends that the death doesn't especially concern her; Tulkinghorn, unfooled, sees otherwise. The rest of the visit Lady Dedlock and Tulkinghorn watch each other carefully and surreptitiously. 8 | 9 | In Chapter 13 Richard's future is under futher discussion. His education has been soley the Classics (meaning, in those days, ancient Latin and Greek literature). Richard knows nothing about what he would like to do, except that he doesn't want to become a minister in the Church of England. Mr. Jarndyce suggest the medical profession, and Richard quickly (too quickly) seizes upon the idea. 10 | 11 | Mr. Boythorn -- along with Mr. Kenge, who is also present -- is enthusiastic about Richard's proposed medical career and propounds the maltreatment of doctors aboard Navy ships. The men discuss how Richard is to be gotten into the profession, and Mr. Kenge plans for his cousin, Dr. Bayham Badger, to take Richard as an apprentice. 12 | 13 | As Mr. Badger lives in London, the party goes to London a few weeks later to get Richard settled. In London, Esther attends the theatre and is followed and stared at by the lovestruck Mr. Guppy. Ada and Richard's romance continues to blossom. 14 | 15 | At Mr. Badger's house the party meets Dr. Allan Woodcourt, and Esther finds him attractive. The talk at the Badgers' is of an unconventional kind -- namely the wealth and prestige of Mrs. Badgers two deceased husbands. Esther tells Mr. Jarndyce that Ada and Richard are in love, but he advises the lovers that because they are too young, and because Richard is not yet established, they should wait to be married. -------------------------------------------------------------------------------- /chaptersum_data/chapters_14_to_16/summary.txt: -------------------------------------------------------------------------------- 1 | Richard leaves to start his studies. He is hopeful about a career as a surgeon, but continues to hope to inherit a fortune through the Jarndyce suit. Esther and Jarndyce find this troubling. 2 | 3 | Esther and the others go to visit the Jellyby household, and see Caddy and Peepy. Peepy is in a disheveled state, as usual, but Caddy has endeavored to make her appearance somewhat better, and looks very pretty. She tells them good and bad news. 4 | 5 | She is engaged to young Mr. Prince Turveydrop, the son of the dancing master Mr. Turveydrop. The old man is a "model of deportment" but makes his son do all the work in the dancing school. Also, Mr. Jellyby is on the brink of bankruptcy. 6 | 7 | The party continues to Miss Flite, who is being attended professionally by Dr. Woodcourt. She has quite recovered, and talks merrily about her good fortune in being forwarded seven shillings a week from some mysterious source through Kenge and Carboy's. The source is not revealed, but Mr. Jarndyce, Esther notes, is staring intently at Miss Flite's birds (named allegorically: Hope, Faith, Jargon, Documents, Sheepskin, etc.) 
during this interlude, and Esther draws her own conclusions. Caddy Jellyby and Miss Flite have become friends, and Caddy helps Miss Flite to use her newfound money to best advantage. 8 | 9 | Mr. Krook intrudes, and shares that he has been attempting to learn to read. Allan Woodcourt, who is also Mr. Krook's doctor, converses with the company, is invited to dinner, and becomes friends with Esther, Ada, and Richard. 10 | 11 | In Chapter 15, Mr. Skimpole reveals that Mr. Neckett of Coavinses has died, leaving three destitute children. The party goes to visit them, and meets Charley (about 13), Tom (5 or 6) and Emma (18 months). The two younger children are locked in while Charley, whoses real name is Charlotte, goes out to work each day for some money. Charley is hardworking and tries hard to be a good mother to her younger siblings. The lodgers in the house (Mrs. Blinder, Mr. Gridley) help out with the children also, but their situation is desperate. 12 | 13 | Mr. Gridley (the Man from Shropshire from Chapter 1) reveals that he is also embroiled in a Chancery suit which has ruined his life. He should have inherited money but the Chancery delay has consumed it. He and Jarndyce agree on the evils of Chancery. 14 | 15 | In Chapter 16, Sir Leicester is afflicted with gout, and laying in sickbed at Chesney Wold. He considers it a privilege to have such an aristocratic disease, which all of his lineage have also had. He waits in the country looking at a picture of his lady. 16 | 17 | Lady Dedlock, meanwhile, goes to their house in London and secretly leaves it dressed in servant's dress, veiled. Mr. Tulkinghorn sees her in passing, but does not recognize her. He finds, however, the ladylike bearing to be incongrous with the plain clothing. 18 | 19 | Jo leaves Tom-all-Alone's, a street of vacant squatter's houses in a dingy street in London. He walks to his usual haunts near the court of Chancery. 20 | 21 | Lady Dedlock seeks out and finds Jo the crossing-sweeper. Though he is abhorrent to her, she has him guide her to Krook's house where Nemo lodged. She is shown the tavern where the inquest took place, and then the charnel-house of bones behind an iron grate where Nemo's body was lain. She is horrified, gives Jo a gold coin, and disappears. -------------------------------------------------------------------------------- /chaptersum_data/chapters_1_to_4/summary.txt: -------------------------------------------------------------------------------- 1 | The scene opens in London on a foggy, smoggy day. The High Court of Chancery is in session, and it appears that the fog has settled thickest on this part of London. This is where the legal suit of Jarndyce and Jarndyce is being argued. A little mad old woman (Miss Flite), and a man from Shropshire (Mr. Gridley) are in attendance. A "sallow prisoner" is brought forward. Mr. Tangle, a lawyer, speaks with the Lord High Chancellor, and the matter of the two young wards in Jarndyce is discussed. This matter will come up before the court tomorrow. 2 | 3 | The scene changes in Chapter 2 to Chesney Wold, a stately home in Lincolnshire. Here the weather is also bad, but it is constant rain rather than fog. The lady of the manor, Lady Dedlock, is bored to death. She had been in London, and has come back to the country seat before leaving for Paris in a few days. 4 | 5 | Lady Dedlock is introduced to us as a very beautiful middle-aged society lady, her charms undimmed by time. 
She is exceedingly dignified and self-controlled, but her "caprices," which she tries to hide, are known to her servants and tradesmen. She is the wife of Sir Leicester Dedlock, a baronet, who has married somehat beneath him. It is clear that Lady Dedlock has brought neither an aristocratic lineage or a fortune to the marriage. It is also clear that Sir Leicester has such a great amount of both of those things, and loves her so dearly, that it matters not to him. 6 | 7 | Mr. Tulkinghorn, the family solicitor, has come down to Chesney Wold to discuss the Jarndyce case, in which Lady Dedlock has some slight interest. He is a secretive, enigmatic man, of the most controlled and logical character. He brings many legal papers with him, and Lady Dedlock seems to take an interest in the handwriting on one of them. She asks whose writing it is, but Mr. Tulkinghorn cannot answer her. She becomes faint and must be taken to her room. 8 | 9 | In Chapter 3, Esther Summerson's life story is told. She has been materially provided for but emotionally neglected by her "godmother" in the town of Windsor. When Esther is about fourteen, her godmother dies, and Esther is sent to Greenleaf school. Through Mr. Kenge, John Jarndyce's solicitor, who comes to see Esther after her godmother's death, she learns her benefactor is Mr. Jarndyce. She has never seen him, but he appears, unbeknownst to her until years later, in the carriage that takes her to Greenleaf school. She is sent there after her godmother's death, and passes six happy years there, being taught by the Misses Donny. Her accounts are paid by her unknown benefactor all the while. When she is older she becomes a teacher of the other girls there, and inspires great affection in her students. 10 | 11 | When she has grown old and educated enough to leave Greenleaf school, she is asked to serve in her benefactor's house, Bleak House. She is brought to Chancery Court to meet Ada Clare and Richard Carstone, two wards in the Jarndyce suit, who are also under the guardianship of Mr. Jarndyce. Esther is meant to be Ada's lady companion. 12 | 13 | As they leave the court, they meet Miss Flite, who introduces herself. They learn she is also a suitor in Chancery, and observe that she is poor, eccentric, and perhaps a little mad. 14 | 15 | In Chapter 4, the three young people are sent to stay the night at the Jellyby household. Mrs. Jellyby is a philanthropic lady who neglects her large and chaotic household in favor of writing long letters, dictated to her eldest daughter Caddy, about the social conditions in places in Africa. Caddy, an overworked waif, clings to Esther and becomes her friend, as does the youngest neglected child, Peepy. -------------------------------------------------------------------------------- /chaptersum_data/chapters_5_to_7/summary.txt: -------------------------------------------------------------------------------- 1 | 2 | Caddy Jellyby has spent the night in Esther and Ada's room, sleeping near Esther in a fit of despair. Before breakfast, she proposes a walk, and all the young people agree. Esther again shows kindness to Peepy, washing him and putting him to sleep in her bed. Esther, Ada, Caddy, and Richard take their first walk about London. 3 | 4 | They soon meet Miss Flite again, who begs them to come to her lodgings. They go and visit Krook's rag-and-bottle shop, above which is Miss Flite's room. As they enter, Esther sees a legal handwritten advertisement for Mr. Nemo, requesting work as a law writer (copyist). 
Neither Esther nor we yet know it, but she is looking at the handwriting of her father and entering the house where he lives. 5 | 6 | The group meets Mr. Krook, who is obsessed with Chancery documents. He rambles on about the Jarndyce suit, mentioning the names Barbary, Clare, and Dedlock, giving Esther, unaware, two more names of her mysterious parentage. They see Miss Flite's bare and sad room, and Richard surreptitiously leaves some money for her. They return to the Jellyby house, much affected by the sad states to which Chancery can reduce people. 7 | 8 | In Chapter 6, the long awaited meeting with Mr. Jarndyce happens at his home, Bleak House. He is described as a handsome, robust man in his 50s -- exceedingly kindly, genteel, and self-effacing. He has a little quirk of mentioning the "east wind" whenever something he doesn't like is brought up. The three young people instantly like and feel affection for their mutual benefactor. 9 | 10 | A "perfect child" is abruptly introduced -- the parasitic poet, Mr. Skimpole. The three young people are instantly brought into his inbroglios when the bailiff, a man called Neckett, arrives and demands payment of a debt. Richard and Esther come to the rescue, and, upon learning of it, Mr. Jarndyce warns them not to give money to Skimpole. 11 | 12 | At the end of the day, Esther is given the keys to the household, the mark of the housekeeper of Bleak House. She is excited and honored to have this position. She sets up residence in a room adjoining Ada's, with a sitting room in between, and everything in the house is to the young people's liking. 13 | 14 | Chapter 7 is back in Chesney Wold. Sir Leicester and Lady Dedlock have left for Paris. The rain continues, and Mrs. Rouncewell, the aged housekeeper, is introduced. She is instructing a young village girl of considerable beauty in the ways of housekeeping and maid's work in a great house. Mrs. Rouncewell's grandson, Watt, who loves Rosa, is visiting. 15 | 16 | In the midst of the rain a couple of visitors come to see the house, but are told it "isn't the day". They persist, and are admitted, by using Mr. Tulkinghorn's name. Mr. Guppy and his friend are brought through the house, and Mr. Guppy admires a portrait of Lady Dedlock. He exclaims that he must have seen the subject somewhere before. Guppy, who had met Esther Summerson briefly in London at Kenge and Carboy's, is unconsciously remembering Lady Dedlock's resemblence to Esther. Guppy leaves a bit perplexed. 17 | 18 | Mrs. Rouncewell had refused to tell the family ghost story, but now recounts it to her grandson and her charge, Rosa. The Ghost's Walk is a terraced walk outside the house, and an eerie footstep is often heard echoing from it. It is supposed that two hundred years ago, a previous Lady Dedlock had sabotaged horses meant for Cavaliers fighting Cromwellian forces. She had lamed them, and, upon being discovered, was lamed herself by her husband in a struggle. She was limping on the terrace and fell and died, and she vowed to haunt the terrace until "the pride of the house is tumbled." Watt and Rosa can still hear the phantom footsteps. -------------------------------------------------------------------------------- /chaptersum_data/chapters_8_to_10/summary.txt: -------------------------------------------------------------------------------- 1 | Esther and Mr. Jarndyce discuss the Chancery suit, and Esther leans that it is a hopeless muddle, and anything left of the great fortune disputed within it has been consumed by legal costs. 
The story of Tom Jarndyce is related again; we learn that Bleak House, once called Peaks, took its name during Tom Jarndyce's lifetime because he allowed it to go to rack and ruin. The street of houses in London, which we will learn later is called Tom-all-Alone's, is part of the suit also. Esther learns that Mr. Jarndyce, who begs her to call him Guardian, has a little room he retreats to when he is out of sorts called "The Growlery." 2 | 3 | Mrs. Pardiggle and her five miserable sons arrive for a visit. Mrs. Pardiggle is a country version of Mrs. Jellyby, another obsessive and misguided philanthropist who neglects her children in favor of a fashionable charity. She takes the young people to a brickmaker's hovel in the village, and there Ada and Esther witness not only the lives of poverty and filth and despair that the people there live, but also the death of the brickmaker's baby in his mother's (Jenny) lap. The girls are much moved, and Esther significantly covers the tiny corpse with her handkerchief. The young ladies are moved by the downtrodden, beaten brickmakers' wives (Jenny and Liz) comforting each other, and it is contrasted with their bluster and ineffectual do-gooding. They come back later to comfort Jenny. 4 | 5 | In Chapter 9 Richard is unsure what to do with his life. Esther and Jarndyce had discussed what he should do for a profession, and the two agree that to ask him what he would like to do would be best. But Richard cannot decide, and it becomes apparent that the spectre of the Chancery suit, though he claims to ignore it, holds a psychic hold on him. 6 | 7 | Mr. Boythorn has written Mr. Jandyce a letter, and he is coming for a visit. He arrives and proves to be a boisterous and agreeable companion. We learn that he was once to have been married, but his fiance "died to him." The dispute over the right-of-way between Sir Leicester Dedlock and Boythorn, who are neighbors, is introduced, and Boythorn is voluble in his criticism of the baronet. However, Boythorn is fond of Lady Dedlock. 8 | 9 | Mr. Jarndyce writes to Sir Leicester Dedlock, who is a distant relative of Richard, to ask with help placing Richard in the world. He receives no likely help from that quarter. 10 | 11 | It becomes obvious to Esther that Ada and Richard have fallen in love. Mr. Guppy arrives at Bleak House on an errand for Kenge and Carboy's. He proposes marriage to Esther, who firmly rejects him. 12 | 13 | In Chapter 10 Mr. and Mrs. Snagsby and their law-stationery shop in London are described. Mrs. Snagsby is an awful termagent, and Mr. Snagsby is timid and cowed. They have a maid, Guster, who is prone to fits. It is a pathetic description, for Mr. Snagsby appears to be a decent man. 14 | 15 | Mr. Tulkinghorn's London lodgings are described, where "lawyers lie like maggots in nuts". Mr. Tulkinghorn goes to Snagsby's shop to find out who copied out the paper of Jarndyce and Jarndyce in which Lady Dedlock took such an interest. The Coavinses sherrif's officers (bailiff's -- repossesion agents) offices, the firm in which Neckett the debt-collector for Skimpole employed, are nearby. 16 | 17 | Mr. Snagsby informs Mr. Tulkinghorn that a man called Nemo (the Latin for "no one"), who lives above Krook's shop, is the copyist. Mr. Snagsby brings Mr. Tulkinghorn to Krook's shop, and Mr. Tulkinghorn goes up to Nemo's room, where he finds him lying on the bed with the scent of opium in the shabby room. 
-------------------------------------------------------------------------------- /detect_hallucinations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f82a7b33-4ad0-41ed-b44b-b85d05190a98", 6 | "metadata": {}, 7 | "source": [ 8 | "# Detect Hallucinations\n", 9 | "This notebook shows two different methods of detecting hallucinations:\n", 10 | " 1) Asking an LLM to rate the output and list any potential hallucinations. This is more of a qualitative judgment.\n", 11 | " 2) Using an LLM specifically trained to detect hallucinations. Here we use the [Vectara model](https://huggingface.co/vectara/hallucination_evaluation_model?) from HuggingFace.\n", 12 | " \n", 13 | "This notebook is divided into three sections:\n", 14 | " 1) Set up the environment.\n", 15 | " 2) Set up the functions for detecting hallucinations.\n", 16 | " 3) Test out the functions." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "id": "59600830-ca9a-46e6-a2ee-a424fc0f4bd1", 22 | "metadata": {}, 23 | "source": [ 24 | "## 1) Set up the environment." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 3, 30 | "id": "764299e7-a69a-44c3-bde7-7f0611c3157d", 31 | "metadata": { 32 | "collapsed": true, 33 | "jupyter": { 34 | "outputs_hidden": true 35 | }, 36 | "tags": [] 37 | }, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "Collecting sentence-transformers\n", 44 | " Downloading sentence-transformers-2.2.2.tar.gz (85 kB)\n", 45 | " ...[verbose pip download and dependency-resolution output trimmed]...\n", 141 | "Successfully built sentence-transformers\n", 142 | "Installing collected packages: sentencepiece, triton, safetensors, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, nvidia-cusparse-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, transformers, torch, torchvision, sentence-transformers\n", 143 | "Successfully installed nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.3.52 nvidia-nvtx-cu12-12.1.105 safetensors-0.4.0 sentence-transformers-2.2.2 sentencepiece-0.1.99 torch-2.1.0 torchvision-0.16.0 transformers-4.35.0 triton-2.1.0\n", 144 | "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", 145 | "\u001b[0m\u001b[33mWARNING: There was an error checking the latest version of pip.\u001b[0m\u001b[33m\n", 146 | "\u001b[0m" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "#!pip install -U sentence-transformers" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 5, 157 | "id": "5d056d85-9377-4e0e-9df1-35c80398074d", 158 | "metadata": { 159 | "tags": [] 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "from sentence_transformers import CrossEncoder\n", 164 | "\n", 165 | "model = CrossEncoder('vectara/hallucination_evaluation_model')" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "id": "d900b1e4-5325-4032-ab64-bac062a6172a", 171 | "metadata": { 172 | "tags": [] 173 | }, 174 | "source": [ 175 | "## 2) Set up the functions for detecting hallucinations." 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 6, 181 | "id": "f1c8b8f6-78cd-4fbf-a26f-7421b6244bfd", 182 | "metadata": { 183 | "tags": [] 184 | }, 185 | "outputs": [], 186 | "source": [ 187 | "def get_hallucination_score(source, generation):\n", 188 | " '''\n", 189 | " A score of less than 0.5 indicates a likely hallucination.\n", 190 | " Note that the context length of the model is 512 tokens across both documents.\n", 191 | " '''\n", 192 | " scores = model.predict([source,generation])\n", 193 | " return scores" 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "272ead27-349b-4a0d-9d1f-391c0dd3cddb", 199 | "metadata": {}, 200 | "source": [ 201 | "## 3) Test out the functions." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 17, 207 | "id": "8abb8679-622d-4369-b6f4-375a9209014d", 208 | "metadata": { 209 | "tags": [] 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "# A flag to run the tests - defaults to off so that these functions can be called by other scripts.\n", 214 | "RUN_EXAMPLES = False\n", 215 | "if __name__ == '__main__':\n", 216 | " RUN_EXAMPLES = True" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 18, 222 | "id": "247b0d88-26af-4641-a141-dd9efd722056", 223 | "metadata": { 224 | "tags": [] 225 | }, 226 | "outputs": [ 227 | { 228 | "name": "stdout", 229 | "output_type": "stream", 230 | "text": [ 231 | "Original: A man walks into a bar and buys a drink\n", 232 | "Generated: A man swigs alcohol at a pub\n", 233 | "The generated text likely does NOT contain a hallucination. (score of 0.534)\n" 234 | ] 235 | } 236 | ], 237 | "source": [ 238 | "if RUN_EXAMPLES:\n", 239 | " original_text = \"A man walks into a bar and buys a drink\"\n", 240 | " generated_text = \"A man swigs alcohol at a pub\"\n", 241 | "\n", 242 | " is_hallucination = False\n", 243 | " score = get_hallucination_score(original_text,generated_text)\n", 244 | " if score<0.5:is_hallucination=True\n", 245 | "\n", 246 | " print (\"Original: \",original_text)\n", 247 | " print (\"Generated: \",generated_text)\n", 248 | " if is_hallucination:\n", 249 | " print(\"The generated text likely contains a hallucination. (score of %s)\"%(round(score,3)))\n", 250 | " else:\n", 251 | " print(\"The generated text likely does NOT contain a hallucination. 
(score of %s)\"%(round(score,3)))" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "id": "60eda89a-5683-426b-914d-5893d77b1321", 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [] 261 | } 262 | ], 263 | "metadata": { 264 | "availableInstances": [ ...auto-generated SageMaker Studio instance-picker metadata (ml.t3, ml.m5, ml.m5d, ml.c5, ml.g4dn, ml.p3, ml.r5, ml.g5 and ml.p4d/p4de entries) trimmed... 838 | ], 839 | "instance_type": "ml.t3.medium", 840 | "kernelspec": { 841 |
"display_name": "Python 3 (Data Science 3.0)", 842 | "language": "python", 843 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" 844 | }, 845 | "language_info": { 846 | "codemirror_mode": { 847 | "name": "ipython", 848 | "version": 3 849 | }, 850 | "file_extension": ".py", 851 | "mimetype": "text/x-python", 852 | "name": "python", 853 | "nbconvert_exporter": "python", 854 | "pygments_lexer": "ipython3", 855 | "version": "3.10.6" 856 | } 857 | }, 858 | "nbformat": 4, 859 | "nbformat_minor": 5 860 | } 861 | -------------------------------------------------------------------------------- /frontend/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Vite + React + TS 8 | 9 | 10 |
11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /frontend/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "summarization-frontend", 3 | "private": true, 4 | "version": "0.0.1", 5 | "type": "module", 6 | "scripts": { 7 | "dev": "vite", 8 | "build": "tsc && vite build", 9 | "lint": "eslint . --ext ts,tsx --report-unused-disable-directives --max-warnings 0", 10 | "preview": "vite preview" 11 | }, 12 | "dependencies": { 13 | "@emotion/react": "^11.11.1", 14 | "@emotion/styled": "^11.11.0", 15 | "@fontsource/roboto": "^5.0.8", 16 | "@mui/icons-material": "^5.15.0", 17 | "@mui/material": "^5.15.5", 18 | "react": "^18.2.0", 19 | "react-dom": "^18.2.0", 20 | "react-hook-form": "^7.49.2" 21 | }, 22 | "devDependencies": { 23 | "@types/react": "^18.2.43", 24 | "@types/react-dom": "^18.2.17", 25 | "@typescript-eslint/eslint-plugin": "^6.14.0", 26 | "@typescript-eslint/parser": "^6.14.0", 27 | "@vitejs/plugin-react": "^4.2.1", 28 | "eslint": "^8.55.0", 29 | "eslint-plugin-react-hooks": "^4.6.0", 30 | "eslint-plugin-react-refresh": "^0.4.5", 31 | "typescript": "^5.2.2", 32 | "vite": "^5.0.13" 33 | } 34 | } 35 | -------------------------------------------------------------------------------- /frontend/src/App.tsx: -------------------------------------------------------------------------------- 1 | import SummarizationView from './views/SummarizationView'; 2 | import Container from '@mui/material/Container'; 3 | 4 | 5 | function App() { 6 | return ( 7 | 8 | 9 | 10 | ) 11 | } 12 | 13 | export default App 14 | -------------------------------------------------------------------------------- /frontend/src/api/ApiService.ts: -------------------------------------------------------------------------------- 1 | import { MultiDocSummarizationRequest, SingleInputSummarizationRequest, UploadDocsRequest } from "../types/APIRequests"; 2 | import { SummarizationResponse, UploadDocsResponse } from "../types/APIResponses"; 3 | 4 | 5 | const SERVER = 'localhost:5000'; 6 | const UPLOAD_URL = '/upload-docs'; 7 | const STUFF_IT_URL = '/stuff-it'; 8 | const MAP_REDUCE = '/map-reduce'; 9 | const AUTO_REFINE = '/auto-refine'; 10 | const MULTI_DOC = '/multi-doc' 11 | 12 | class APIService { 13 | 14 | static async uploadDocuments(request: UploadDocsRequest): Promise { 15 | 16 | // Use FormData to upload the files to the server. 
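    // The files are posted as multipart/form-data to the backend's /upload-docs route,
    // which responds with an uploadLocation (see UploadDocsResponse). Later summarization
    // requests can pass that uploadLocation instead of re-sending the raw files.
    // Hypothetical usage sketch from a component (variable names assumed, not part of this file):
    //   const { uploadLocation } = await APIService.uploadDocuments({ files: selectedFiles });
    //   const summary = await APIService.stuffIt({ uploadLocation });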
17 | const formData = new FormData(); 18 | request.files.forEach(file => { 19 | formData.append('files', file); 20 | }); 21 | 22 | const url: string = await this.buildUrl(SERVER, UPLOAD_URL); 23 | const response = await fetch(url, { 24 | method: 'POST', 25 | body: formData 26 | }); 27 | return response.json(); 28 | } 29 | 30 | static async stuffIt(request: SingleInputSummarizationRequest): Promise { 31 | const url: string = await this.buildUrl(SERVER, STUFF_IT_URL); 32 | return await this.summarizationFetch(request, url); 33 | } 34 | 35 | static async mapReduce(request: SingleInputSummarizationRequest): Promise { 36 | const url: string = await this.buildUrl(SERVER, MAP_REDUCE); 37 | return await this.summarizationFetch(request, url); 38 | } 39 | 40 | static async autoRefine(request: SingleInputSummarizationRequest): Promise { 41 | const url: string = await this.buildUrl(SERVER, AUTO_REFINE); 42 | return await this.summarizationFetch(request, url); 43 | } 44 | 45 | static async multiDoc(request: MultiDocSummarizationRequest): Promise { 46 | const url: string = await this.buildUrl(SERVER, MULTI_DOC); 47 | return await this.summarizationFetch(request, url); 48 | } 49 | 50 | static async summarizationFetch(request: object, url: string): Promise { 51 | const response = await fetch(url, { 52 | method: 'POST', 53 | headers: { 54 | 'Content-Type': 'application/json' 55 | }, 56 | body: JSON.stringify(request) 57 | }); 58 | return response.json(); 59 | } 60 | 61 | static async buildUrl(server: string, url: string): Promise { 62 | return `http://${server}${url}`; 63 | } 64 | } 65 | 66 | export default APIService; -------------------------------------------------------------------------------- /frontend/src/assets/logo.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 9 | 10 | 31 | 32 | 34 | 36 | 37 | 38 | 39 | -------------------------------------------------------------------------------- /frontend/src/components/AWSAppBar.tsx: -------------------------------------------------------------------------------- 1 | import { AppBar, Toolbar, Typography } from "@mui/material"; 2 | 3 | function AWSAppBar() { 4 | return ( 5 | 6 | 7 | Logo 8 | 9 | Advanced Summarization Demo 10 | 11 | 12 | 13 | ); 14 | } 15 | 16 | export default AWSAppBar; -------------------------------------------------------------------------------- /frontend/src/components/CustomTabs.tsx: -------------------------------------------------------------------------------- 1 | import * as React from 'react'; 2 | import Box from '@mui/material/Box'; 3 | import Tabs from '@mui/material/Tabs'; 4 | import Tab from '@mui/material/Tab'; 5 | 6 | export interface CustomTabsProps { 7 | initialValue: number; 8 | onChange: (newValue: number) => void; 9 | tabs: { name: string; value: string }[]; 10 | } 11 | 12 | function a11yProps(index: number) { 13 | return { 14 | id: `custom-tab-${index}`, 15 | 'aria-controls': `custom-tabpanel-${index}`, 16 | }; 17 | } 18 | 19 | function CustomTabs(props: CustomTabsProps) { 20 | const [value, setValue] = React.useState(props.initialValue); 21 | 22 | const handleChange = (event: React.SyntheticEvent, newValue: number) => { 23 | setValue(newValue); 24 | props.onChange(newValue); 25 | event.preventDefault() 26 | }; 27 | 28 | return ( 29 | 30 | 31 | 32 | {props.tabs.map((tab, index) => ( 33 | 34 | ))} 35 | 36 | 37 | 38 | ); 39 | } 40 | 41 | export default CustomTabs; 42 | -------------------------------------------------------------------------------- 
/frontend/src/components/EmptySummarizationResults.tsx: -------------------------------------------------------------------------------- 1 | import { Grid, Typography } from "@mui/material"; 2 | 3 | export interface EmptySummarizationResults { 4 | results: string; 5 | } 6 | 7 | 8 | function EmptySummarizationResults() { 9 | return ( 10 | 19 |

Fill out the form on the left to see your generated content here

20 |
21 | ) 22 | } 23 | 24 | export default EmptySummarizationResults; 25 | -------------------------------------------------------------------------------- /frontend/src/components/LoadingProgress.tsx: -------------------------------------------------------------------------------- 1 | import { CircularProgress, Grid, LinearProgress, Typography } from "@mui/material"; 2 | 3 | 4 | export interface LoadingProgressProps { 5 | progress: number; 6 | } 7 | 8 | function LoadingProgress({ progress }: LoadingProgressProps) { 9 | return ( 10 | 11 | 12 | 13 | 14 | 23 | 24 |
Generating your summary
25 |
26 |
27 | ) 28 | } 29 | 30 | export default LoadingProgress; -------------------------------------------------------------------------------- /frontend/src/components/MethodSelector.tsx: -------------------------------------------------------------------------------- 1 | import { TextField } from "@mui/material"; 2 | import { SummarizationType } from "../types/SummarizationType"; 3 | 4 | const methods: string[] = [ 5 | SummarizationType.STUFF_IT.toString(), 6 | SummarizationType.MAP_REDUCE.toString(), 7 | SummarizationType.AUTO_REFINE.toString(), 8 | SummarizationType.MULTI_DOC.toString() 9 | ] 10 | 11 | export interface MethodSelectorProps { 12 | setMethod: (method: SummarizationType) => void; 13 | } 14 | 15 | function MethodSelector({ setMethod }: MethodSelectorProps) { 16 | 17 | const handleMethodChange = (event: React.ChangeEvent) => { 18 | setMethod(event.target.value as SummarizationType); 19 | }; 20 | 21 | return ( 22 | 23 | { 24 | methods.map((method: string) => ) 25 | } 26 | 27 | ) 28 | } 29 | 30 | export default MethodSelector; -------------------------------------------------------------------------------- /frontend/src/components/PasteTextInput.tsx: -------------------------------------------------------------------------------- 1 | import { TextField } from "@mui/material"; 2 | 3 | export interface PasteTextFormProps { 4 | inputFieldName: string; 5 | inputFormRegister: any; // No typescript support for react hook form. 6 | } 7 | 8 | function PasteTextInput({ inputFieldName, inputFormRegister }: PasteTextFormProps) { 9 | return ( 10 | 19 | ) 20 | } 21 | 22 | export default PasteTextInput -------------------------------------------------------------------------------- /frontend/src/components/ProgressStepper.tsx: -------------------------------------------------------------------------------- 1 | import { Step, StepLabel, Stepper } from "@mui/material"; 2 | 3 | export interface ProgressStepperProps { 4 | activeStep: number; 5 | } 6 | 7 | function ProgressStepper({ activeStep }: ProgressStepperProps) { 8 | return ( 9 | 10 | 11 | Enter your input 12 | 13 | 14 | View Results 15 | 16 | 17 | ) 18 | } 19 | 20 | export default ProgressStepper; -------------------------------------------------------------------------------- /frontend/src/components/SummarizationResults.tsx: -------------------------------------------------------------------------------- 1 | import { Grid, Typography } from "@mui/material"; 2 | 3 | export interface SummarizationResultsContentProp { 4 | results: string; 5 | time: string; 6 | } 7 | 8 | 9 | function SummarizationResults({ results, time }: SummarizationResultsContentProp) { 10 | 11 | const formattedResults: JSX.Element[] = results.split('\n').map((line, index) => ( 12 | 13 | {line} 14 |
15 |
16 | )); 17 | 18 | return ( 19 | 20 | 21 | Status: Success Time: {time} 22 | 23 | 33 | { formattedResults } 34 | 35 | 36 | ) 37 | } 38 | 39 | export default SummarizationResults; 40 | -------------------------------------------------------------------------------- /frontend/src/components/SummarizationSteps.tsx: -------------------------------------------------------------------------------- 1 | import { Accordion, AccordionDetails, AccordionSummary, Grid, Typography } from "@mui/material"; 2 | import ExpandMoreIcon from '@mui/icons-material/ExpandMore'; 3 | import SummarizationStep from "../types/SummarizationStep"; 4 | 5 | 6 | export interface SummarizationStepsProps { 7 | steps: SummarizationStep []; 8 | } 9 | 10 | function SummarizationResultsSteps({ steps }: SummarizationStepsProps) { 11 | return ( 12 | 13 | { 14 | steps.map((step, index) => ( 15 | 16 | 17 | } aria-controls="panel1a-content"> 18 | {step.action} 19 | 20 | 21 | 22 | 23 | Input: 24 | {step.input} 25 | 26 | 27 | Output: 28 | {step.results} 29 | 30 | 31 | 32 | 33 | 34 | ) 35 | ) 36 | } 37 | 38 | ) 39 | } 40 | 41 | export default SummarizationResultsSteps; -------------------------------------------------------------------------------- /frontend/src/components/UploadFileInput.tsx: -------------------------------------------------------------------------------- 1 | import { ArrowCircleUp } from "@mui/icons-material"; 2 | import { Button, ButtonBase, Grid, TextField, Typography } from "@mui/material"; 3 | import { SummarizationType } from "../types/SummarizationType"; 4 | import React, { Dispatch, SetStateAction } from "react"; 5 | 6 | export interface UploadFileInputProps { 7 | selectedFiles: File[]; 8 | setSelectedFiles: Dispatch>; 9 | method: SummarizationType; 10 | inputFormRegister: any; // No typescript support for react hook form. 11 | } 12 | 13 | function UploadFileInput({selectedFiles, setSelectedFiles, method, inputFormRegister }: UploadFileInputProps) { 14 | const fileInputRef = React.useRef(null); 15 | 16 | // Helper click functions 17 | const handleUploadClick = () => { 18 | fileInputRef.current?.click(); 19 | }; 20 | 21 | const handleFileChange = (event: React.ChangeEvent) => { 22 | if (event.target.files) { 23 | setSelectedFiles(selectedFiles.concat(Array.from(event.target.files))); 24 | } 25 | }; 26 | // End helper click functions 27 | return ( 28 | // 29 | 30 | 31 | 42 | 43 | 44 | 0 ? "green" : "gray" }}/> 45 | 46 | 47 | { 48 | selectedFiles.length == 0 ? 49 | Click here to upload files. 50 | : ( 51 |
    52 | {selectedFiles.map((file: File, index) => ( 53 |
  • {file.name}
  • 54 | ))} 55 |
56 | ) 57 | } 58 |
59 | 60 | 61 | 62 |
63 |
64 | { 65 | // If using Multi-doc there's some extra inputs we need. 66 | method == SummarizationType.MULTI_DOC && ( 67 | 68 | 76 | 84 | 85 | ) 86 | } 87 |
88 | ) 89 | } 90 | 91 | export default UploadFileInput; -------------------------------------------------------------------------------- /frontend/src/containers/InputFormContainer.tsx: -------------------------------------------------------------------------------- 1 | 2 | import { useForm } from 'react-hook-form'; 3 | import FormControl from '@mui/material/FormControl'; 4 | import Button from '@mui/material/Button'; 5 | import Grid from '@mui/material/Grid' 6 | import { Dispatch, SetStateAction } from 'react'; 7 | import APIService from '../api/ApiService'; 8 | import { SummarizationType } from '../types/SummarizationType'; 9 | import { SummarizationResponse, UploadDocsResponse } from '../types/APIResponses'; 10 | import { MultiDocSummarizationRequest, SingleInputSummarizationRequest } from '../types/APIRequests'; 11 | import UploadFileInput from '../components/UploadFileInput'; 12 | import PasteTextInput from '../components/PasteTextInput'; 13 | 14 | export interface SummarizationFormProps { 15 | activeTab: number; 16 | setSummarizationOutput: Dispatch>; 17 | setSteps: Dispatch>; 18 | method: SummarizationType; 19 | selectedFiles: File[]; 20 | setSelectedFiles: Dispatch>; 21 | setStepperStep: Dispatch>; 22 | setLoadingProgress: Dispatch>; 23 | setTime: Dispatch>; 24 | } 25 | 26 | 27 | interface SummarizationFormValues { 28 | textToSummarize?: string, 29 | uploadLocation?: string, 30 | descriptionOfDocuments?: string, 31 | questions?: string, 32 | } 33 | 34 | 35 | function InputFormContainer({ activeTab, setSummarizationOutput, setSteps, method, selectedFiles, setSelectedFiles, setStepperStep, setLoadingProgress, setTime }: SummarizationFormProps) { 36 | 37 | const { register, handleSubmit, reset } = useForm({ 38 | defaultValues: { 39 | textToSummarize: '', 40 | uploadLocation: '' 41 | } 42 | }); 43 | 44 | const onFormSubmit = async (data: SummarizationFormValues) => { 45 | 46 | setLoadingProgress(50) 47 | 48 | // We'll make 2 API calls here. First one is to upload the files and create a location for them. 49 | // In the future, it could be nice to retrieve them if the request fails for a different reason for retries. 50 | if (selectedFiles.length > 0) { 51 | const uploadResponse: UploadDocsResponse = await APIService.uploadDocuments({files: selectedFiles}); 52 | data.uploadLocation = uploadResponse.uploadLocation; 53 | } 54 | 55 | let response: SummarizationResponse; 56 | 57 | // Multi doc is handled differently than the rest. Handle it separately and short circuit. 58 | if (method == SummarizationType.MULTI_DOC) { 59 | const multiDocRequest: MultiDocSummarizationRequest = { 60 | uploadLocation: data.uploadLocation ? data.uploadLocation : '', 61 | descriptionOfDocuments: data.descriptionOfDocuments ? data.descriptionOfDocuments : '', 62 | questions: data.questions ? data.questions : '' 63 | } 64 | response = await APIService.multiDoc(multiDocRequest); 65 | } else { 66 | // Create request that's shared between all the other summarization types. 
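    // A sketch of the shared request shape (both fields are optional on
    // SingleInputSummarizationRequest, so either pasted text or a previously
    // created uploadLocation can be supplied):
    //   { textToSummarize: "some long input text", uploadLocation: "" }
    // The switch below routes this single request shape to the /stuff-it,
    // /map-reduce, or /auto-refine endpoint that matches the selected method.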
67 | const request: SingleInputSummarizationRequest = { 68 | textToSummarize: data.textToSummarize, 69 | uploadLocation: data.uploadLocation 70 | }; 71 | 72 | switch (method) { 73 | case SummarizationType.STUFF_IT: 74 | response = await APIService.stuffIt(request); 75 | break; 76 | case SummarizationType.MAP_REDUCE: 77 | response = await APIService.mapReduce(request); 78 | break; 79 | case SummarizationType.AUTO_REFINE: 80 | response = await APIService.autoRefine(request); 81 | break; 82 | default: 83 | response = await APIService.stuffIt(request); 84 | } 85 | } 86 | 87 | setSummarizationOutput(response.results); 88 | setSteps(response.steps); 89 | setTime(response.time); 90 | setStepperStep(1); 91 | 92 | // Set loading progress back to zero so state updates when you clear the input. 93 | setLoadingProgress(0) 94 | 95 | return; 96 | } 97 | 98 | const onInputCleared = async () => { 99 | setSelectedFiles([]); 100 | setStepperStep(0); 101 | 102 | reset({ 103 | textToSummarize: '', 104 | questions: '', 105 | descriptionOfDocuments: '', 106 | uploadLocation: '' 107 | }) 108 | } 109 | 110 | return ( 111 |
112 | 113 | 114 | {/* Based on the tab selected, we'll display either the past text */} 115 | 116 | { 117 | activeTab === 0 && ( 118 | 119 | ) 120 | } 121 | { 122 | activeTab === 1 && ( 123 | 129 | ) 130 | } 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 |
139 | ) 140 | } 141 | 142 | export default InputFormContainer; -------------------------------------------------------------------------------- /frontend/src/containers/ResultsContainer.tsx: -------------------------------------------------------------------------------- 1 | import SummarizationSteps from '../components/SummarizationSteps'; 2 | import SummarizationResults from '../components/SummarizationResults'; 3 | import LoadingProgress from '../components/LoadingProgress'; 4 | import EmptySummarizationResults from '../components/EmptySummarizationResults'; 5 | 6 | export interface SummarizationResultsProps { 7 | // activeTab: string; 8 | summarizationOutput: string; 9 | steps: any[]; 10 | time: string; 11 | resultsOrStepsActiveTab: number; 12 | progress: number; 13 | } 14 | 15 | function ResultsContainer({ summarizationOutput, steps, time, resultsOrStepsActiveTab, progress }: SummarizationResultsProps) { 16 | const isDataEmpty: boolean = !summarizationOutput && (!steps || steps.length === 0); 17 | if (isDataEmpty) { 18 | return progress == 0 19 | ? 20 | : 21 | } 22 | 23 | return ( 24 |
25 | { resultsOrStepsActiveTab == 0 && } 26 | { resultsOrStepsActiveTab == 1 && } 27 |
28 | ) 29 | } 30 | 31 | export default ResultsContainer; -------------------------------------------------------------------------------- /frontend/src/main.tsx: -------------------------------------------------------------------------------- 1 | import React from 'react' 2 | import ReactDOM from 'react-dom/client' 3 | import App from './App.tsx' 4 | 5 | ReactDOM.createRoot(document.getElementById('root')!).render( 6 | 7 | 8 | , 9 | ) 10 | -------------------------------------------------------------------------------- /frontend/src/types/APIRequests.ts: -------------------------------------------------------------------------------- 1 | export interface SingleInputSummarizationRequest { 2 | textToSummarize?: string, 3 | uploadLocation?: string 4 | } 5 | 6 | export interface MultiDocSummarizationRequest { 7 | uploadLocation: string 8 | descriptionOfDocuments: string, 9 | questions: string 10 | } 11 | 12 | export interface UploadDocsRequest { 13 | files: File[], 14 | } -------------------------------------------------------------------------------- /frontend/src/types/APIResponses.ts: -------------------------------------------------------------------------------- 1 | import SummarizationStep from "./SummarizationStep"; 2 | 3 | export interface UploadDocsResponse { 4 | uploadLocation: string 5 | } 6 | 7 | export interface SummarizationResponse { 8 | results: string; 9 | steps: SummarizationStep[]; 10 | time: string; 11 | } -------------------------------------------------------------------------------- /frontend/src/types/SummarizationStep.ts: -------------------------------------------------------------------------------- 1 | export default interface SummarizationStep { 2 | action: string; 3 | input: string; 4 | results: string; 5 | } -------------------------------------------------------------------------------- /frontend/src/types/SummarizationType.ts: -------------------------------------------------------------------------------- 1 | export enum SummarizationType { 2 | STUFF_IT = "Stuff It", 3 | MAP_REDUCE = "Map Reduce", 4 | AUTO_REFINE = "Auto Refine", 5 | MULTI_DOC = "Multi Doc" 6 | } -------------------------------------------------------------------------------- /frontend/src/views/SummarizationView.tsx: -------------------------------------------------------------------------------- 1 | import { useState } from 'react'; 2 | import Grid from '@mui/material/Grid'; 3 | import InputFormContainer from '../containers/InputFormContainer'; 4 | import ResultsContainer from '../containers/ResultsContainer'; 5 | import ProgressStepper from '../components/ProgressStepper'; 6 | import CustomTabs from '../components/CustomTabs'; 7 | import { SummarizationType } from '../types/SummarizationType'; 8 | import SummarizationStep from '../types/SummarizationStep'; 9 | import MethodSelector from '../components/MethodSelector'; 10 | import AWSAppBar from '../components/AWSAppBar'; 11 | import { Button } from '@mui/material'; 12 | 13 | const textOrFileTabs = [ 14 | { name: "Paste Text", value: "text" }, 15 | { name: "Upload File", value: "file" }, 16 | ]; 17 | 18 | const resultsOrStepsTabs = [ 19 | { name: "Results", value: "results" }, 20 | { name: "Steps", value: "steps" } 21 | ]; 22 | 23 | 24 | function SummarizationView() { 25 | 26 | const [textOrFileActiveTab, setTextOrFileActiveTab] = useState(0); 27 | const [resultsOrStepsActiveTab, setResultsOrStepsActiveTab] = useState(0); 28 | const [stepperStep, setStepperStep] = useState(0); 29 | const [summarizationOutput, setSummarizationOutput] = 
useState(''); 30 | const [summarizationStep, setSummarizationStep] = useState([]); 31 | const [method, setMethod] = useState(SummarizationType.STUFF_IT); 32 | const [selectedFiles, setSelectedFiles] = useState([]) 33 | const [loadingProgress, setLoadingProgress] = useState(0); 34 | const [time, setTime] = useState(''); 35 | 36 | const clearResults = () => { 37 | setStepperStep(0) 38 | setSummarizationOutput('') 39 | setSummarizationStep([]) 40 | } 41 | 42 | return ( 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 59 | 70 | 71 | 72 | 73 | 74 | 79 | 80 | 81 | 82 | 83 | 84 | 91 | 92 | 93 | ) 94 | } 95 | 96 | export default SummarizationView; -------------------------------------------------------------------------------- /frontend/src/vite-env.d.ts: -------------------------------------------------------------------------------- 1 | /// 2 | -------------------------------------------------------------------------------- /frontend/tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "ES2020", 4 | "useDefineForClassFields": true, 5 | "lib": ["ES2020", "DOM", "DOM.Iterable"], 6 | "module": "ESNext", 7 | "skipLibCheck": true, 8 | 9 | /* Bundler mode */ 10 | "moduleResolution": "bundler", 11 | "allowImportingTsExtensions": true, 12 | "resolveJsonModule": true, 13 | "isolatedModules": true, 14 | "noEmit": true, 15 | "jsx": "react-jsx", 16 | 17 | /* Linting */ 18 | "strict": true, 19 | "noUnusedLocals": true, 20 | "noUnusedParameters": true, 21 | "noFallthroughCasesInSwitch": true 22 | }, 23 | "include": ["src"], 24 | "references": [{ "path": "./tsconfig.node.json" }] 25 | } 26 | -------------------------------------------------------------------------------- /frontend/tsconfig.node.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "composite": true, 4 | "skipLibCheck": true, 5 | "module": "ESNext", 6 | "moduleResolution": "bundler", 7 | "allowSyntheticDefaultImports": true 8 | }, 9 | "include": ["vite.config.ts"] 10 | } 11 | -------------------------------------------------------------------------------- /frontend/vite.config.ts: -------------------------------------------------------------------------------- 1 | import { defineConfig } from 'vite' 2 | import react from '@vitejs/plugin-react' 3 | 4 | // https://vitejs.dev/config/ 5 | export default defineConfig({ 6 | plugins: [react()], 7 | }) 8 | -------------------------------------------------------------------------------- /sample texts/algernon.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/sample texts/algernon.pkl -------------------------------------------------------------------------------- /sample texts/docs.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/sample texts/docs.pkl -------------------------------------------------------------------------------- /sample texts/elvis.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/sample texts/elvis.pkl 
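The sample .pkl files above hold the pre-processed sample texts the notebooks summarize. A minimal loading sketch, assuming each pickle contains the cleaned text as a plain string or a list of text chunks (the path and variable names below are illustrative only):

import pickle

# Minimal sketch: inspect one of the bundled sample texts.
# Assumption: the pickle holds pre-processed text (a string or a list of chunks);
# adjust the handling once the actual type is confirmed.
with open("sample texts/algernon.pkl", "rb") as f:
    sample = pickle.load(f)

print(type(sample))         # confirm what the pickle actually contains
print(str(sample)[:500])    # preview the first 500 characters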
-------------------------------------------------------------------------------- /sample texts/frankenstien.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/sample texts/frankenstien.pkl -------------------------------------------------------------------------------- /sample texts/hills.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aws-samples/llm-based-advanced-summarization/bfbec9c2d596674b784a7609275b1f2f9cea3fa1/sample texts/hills.pkl -------------------------------------------------------------------------------- /xsum_sample.jsonl: -------------------------------------------------------------------------------- 1 | {"document":"The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n\"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we're neglected or forgotten,\" she said.\n\"That may not be true but it is perhaps my perspective over the last few days.\n\"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?\"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n\"I was quite taken aback by the amount of damage that has been done,\" he said.\n\"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses.\"\nHe said it was important that \"immediate steps\" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. 
Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.","summary":"Clean-up operations are continuing across the Scottish Borders and Dumfries and Galloway after flooding caused by Storm Frank.","id":"35232142"} 2 | {"document":"A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.\nAs they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.\nOne of the tour groups is from Germany, the other from China and Taiwan. It was their first night in Northern Ireland.\nThe driver of one of the buses said many of the passengers had left personal belongings on board and these had been destroyed.\nBoth groups have organised replacement coaches and will begin their tour of the north coast later than they had planned.\nPolice have appealed for information about the attack.\nInsp David Gibson said: \"It appears as though the fire started under one of the buses before spreading to the second.\n\"While the exact cause is still under investigation, it is thought that the fire was started deliberately.\"","summary":"Two tourist buses have been destroyed by fire in a suspected arson attack in Belfast city centre.","id":"40143035"} 3 | {"document":"Ferrari appeared in a position to challenge until the final laps, when the Mercedes stretched their legs to go half a second clear of the red cars.\nSebastian Vettel will start third ahead of team-mate Kimi Raikkonen.\nThe world champion subsequently escaped punishment for reversing in the pit lane, which could have seen him stripped of pole.\nBut stewards only handed Hamilton a reprimand, after governing body the FIA said \"no clear instruction was given on where he should park\".\nBelgian Stoffel Vandoorne out-qualified McLaren team-mate Jenson Button on his Formula 1 debut.\nVandoorne was 12th and Button 14th, complaining of a handling imbalance on his final lap but admitting the newcomer \"did a good job and I didn't\".\nMercedes were wary of Ferrari's pace before qualifying after Vettel and Raikkonen finished one-two in final practice, and their concerns appeared to be well founded as the red cars mixed it with the silver through most of qualifying.\nAfter the first runs, Rosberg was ahead, with Vettel and Raikkonen splitting him from Hamilton, who made a mistake at the final corner on his first lap.\nBut Hamilton saved his best for last, fastest in every sector of his final attempt, to beat Rosberg by just 0.077secs after the German had out-paced him throughout practice and in the first qualifying session.\nVettel rued a mistake at the final corner on his last lap, but the truth is that with the gap at 0.517secs to Hamilton there was nothing he could have done.\nThe gap suggests Mercedes are favourites for the race, even if Ferrari can be expected to push them.\nVettel said: \"Last year we were very strong in the race and I think we are in good shape for tomorrow. 
We will try to give them a hard time.\"\nVandoorne's preparations for his grand prix debut were far from ideal - he only found out he was racing on Thursday when FIA doctors declared Fernando Alonso unfit because of a broken rib sustained in his huge crash at the first race of the season in Australia two weeks ago.\nThe Belgian rookie had to fly overnight from Japan, where he had been testing in the Super Formula car he races there, and arrived in Bahrain only hours before first practice on Friday.\nHe also had a difficult final practice, missing all but the final quarter of the session because of a water leak.\nButton was quicker in the first qualifying session, but Vandoorne pipped him by 0.064secs when it mattered.\nThe 24-year-old said: \"I knew after yesterday I had quite similar pace to Jenson and I knew if I improved a little bit I could maybe challenge him and even out-qualify him and that is what has happened.\n\"Jenson is a very good benchmark for me because he is a world champion and he is well known to the team so I am very satisfied with the qualifying.\"\nButton, who was 0.5secs quicker than Vandoorne in the first session, complained of oversteer on his final run in the second: \"Q1 was what I was expecting. Q2 he did a good job and I didn't. Very, very good job. We knew how quick he was.\"\nThe controversial new elimination qualifying system was retained for this race despite teams voting at the first race in Australia to go back to the 2015 system.\nFIA president Jean Todt said earlier on Saturday that he \"felt it necessary to give new qualifying one more chance\", adding: \"We live in a world where there is too much over reaction.\"\nThe system worked on the basis of mixing up the grid a little - Force India's Sergio Perez ended up out of position in 18th place after the team miscalculated the timing of his final run, leaving him not enough time to complete it before the elimination clock timed him out.\nBut it will come in for more criticism as a result of lack of track action at the end of each session. There were three minutes at the end of the first session with no cars on the circuit, and the end of the second session was a similar damp squib.\nOnly one car - Nico Hulkenberg's Force India - was out on the track with six minutes to go. The two Williams cars did go out in the final three minutes but were already through to Q3 and so nothing was at stake.\nThe teams are meeting with Todt and F1 commercial boss Bernie Ecclestone on Sunday at noon local time to decide on what to do with qualifying for the rest of the season.\nTodt said he was \"optimistic\" they would be able to reach unanimous agreement on a change.\n\"We should listen to the people watching on TV,\" Rosberg said. 
\"If they are still unhappy, which I am sure they will be, we should change it.\"\nRed Bull's Daniel Ricciardo was fifth on the grid, ahead of the Williams cars of Valtteri Bottas and Felipe Massa and Force India's Nico Hulkenberg.\nRicciardo's team-mate Daniil Kvyat was eliminated during the second session - way below the team's expectation - and the Renault of Brit Jolyon Palmer only managed 19th fastest.\nGerman Mercedes protege Pascal Wehrlein managed an excellent 16th in the Manor car.\nBahrain GP qualifying results\nBahrain GP coverage details","summary":"Lewis Hamilton stormed to pole position at the Bahrain Grand Prix ahead of Mercedes team-mate Nico Rosberg.","id":"35951548"} 4 | {"document":"John Edward Bates, formerly of Spalding, Lincolnshire, but now living in London, faces a total of 22 charges, including two counts of indecency with a child.\nThe 67-year-old is accused of committing the offences between March 1972 and October 1989.\nMr Bates denies all the charges.\nGrace Hale, prosecuting, told the jury that the allegations of sexual abuse were made by made by four male complainants and related to when Mr Bates was a scout leader in South Lincolnshire and Cambridgeshire.\n\"The defendant says nothing of that sort happened between himself and all these individuals. He says they are all fabricating their accounts and telling lies,\" said Mrs Hale.\nThe prosecutor claimed Mr Bates invited one 15 year old to his home offering him the chance to look at cine films made at scout camps but then showed him pornographic films.\nShe told the jury that the boy was then sexually abused leaving him confused and frightened.\nMrs Hale said: \"The complainant's recollection is that on a number of occasions sexual acts would happen with the defendant either in the defendant's car or in his cottage.\"\nShe told the jury a second boy was taken by Mr Bates for a weekend in London at the age of 13 or 14 and after visiting pubs he was later sexually abused.\nMrs Hale said two boys from the Spalding group had also made complaints of being sexually abused.\nThe jury has been told that Mr Bates was in the RAF before serving as a Lincolnshire Police officer between 1976 and 1983.\nThe trial, which is expected to last two weeks, continues.","summary":"A former Lincolnshire Police officer carried out a series of sex attacks on boys, a jury at Lincoln Crown Court was told.","id":"36266422"} 5 | {"document":"Patients and staff were evacuated from Cerahpasa hospital on Wednesday after a man receiving treatment at the clinic threatened to shoot himself and others.\nOfficers were deployed to negotiate with the man, a young police officer.\nEarlier reports that the armed man had taken several people hostage proved incorrect.\nThe chief consultant of Cerahpasa hospital, Zekayi Kutlubay, who was evacuated from the facility, said that there had been \"no hostage crises\", adding that the man was \"alone in the room\".\nDr Kutlubay said that the man had been receiving psychiatric treatment for the past two years.\nHe said that the hospital had previously submitted a report stating that the man should not be permitted to carry a gun.\n\"His firearm was taken away,\" Dr Kutlubay said, adding that the gun in the officer's possession on Wednesday was not his issued firearm.\nThe incident comes amid tension in Istanbul following several attacks in crowded areas, including the deadly assault on the Reina nightclub on New Year's Eve which left 39 people dead.","summary":"An armed man who locked himself into a room at a 
psychiatric hospital in Istanbul has ended his threat to kill himself, Turkish media report.","id":"38826984"} 6 | {"document":"Simone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.\nRynard Landman and Ashton Hewitt got a try in either half for the Dragons.\nGlasgow showed far superior strength in depth as they took control of a messy match in the second period.\nHome coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.\nGlasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.\nIt took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock's last contribution as he departed with a chest injury shortly afterwards.\nGlasgow struck back when Fusaro drove over from a rolling maul on 35 minutes for Clegg to convert.\nBut the Dragons levelled at 10-10 before half-time when Naiyaravoro was yellow-carded for an aerial tackle on Brew and Tovey slotted the easy goal.\nThe visitors could not make the most of their one-man advantage after the break as their error count cost them dearly.\nIt was Glasgow's bench experience that showed when Mike Blair's break led to a short-range score from teenage prop Fagerson, converted by Clegg.\nDebutant Favaro was the second home player to be sin-binned, on 63 minutes, but again the Warriors made light of it as replacement wing Bulumakau, a recruit from the Army, pounced to deftly hack through a bouncing ball for an opportunist try.\nThe Dragons got back within striking range with some excellent combined handling putting Hewitt over unopposed after 72 minutes.\nHowever, Favaro became sinner-turned-saint as he got on the end of another effective rolling maul to earn his side the extra point with the last move of the game, Clegg converting.\nDragons director of rugby Lyn Jones said: \"We're disappointed to have lost but our performance was a lot better [than against Leinster] and the game could have gone either way.\n\"Unfortunately too many errors behind the scrum cost us a great deal, though from where we were a fortnight ago in Dublin our workrate and desire was excellent.\n\"It was simply error count from individuals behind the scrum that cost us field position, it's not rocket science - they were correct in how they played and we had a few errors, that was the difference.\"\nGlasgow Warriors: Rory Hughes, Taqele Naiyaravoro, Alex Dunbar, Fraser Lyle, Lee Jones, Rory Clegg, Grayson Hart; Alex Allan, Pat MacArthur, Zander Fagerson, Rob Harley (capt), Scott Cummings, Hugh Blake, Chris Fusaro, Adam Ashe.\nReplacements: Fergus Scott, Jerry Yanuyanutawa, Mike Cusack, Greg Peterson, Simone Favaro, Mike Blair, Gregor Hunter, Junior Bulumakau.\nDragons: Carl Meyer, Ashton Hewitt, Ross Wardle, Adam Warren, Aled Brew, Jason Tovey, Sarel Pretorius; Boris Stankovich, Elliot Dee, Brok Harris, Nick Crosswell, Rynard Landman (capt), Lewis Evans, Nic Cudd, Ed Jackson.\nReplacements: Rhys Buckley, Phil Price, Shaun Knight, Matthew Screech, Ollie Griffiths, Luc Jones, Charlie Davies, Nick Scott.","summary":"Defending Pro12 champions Glasgow Warriors bagged a late bonus-point victory 
over the Dragons despite a host of absentees and two yellow cards.","id":"34540833"} 7 | {"document":"Veronica Vanessa Chango-Alverez, 31, was killed and another man injured when an Audi A3 struck them in Streatham High Road at 05:30 GMT on Saturday.\nTen minutes before the crash the car was in London Road, Croydon, when a Volkswagen Passat collided with a tree.\nPolice want to trace Nathan Davis, 27, who they say has links to the Audi. The car was abandoned at the scene.\nMs Chango-Alverez died from multiple injuries, a post-mortem examination found.\nNo arrests have been made as yet, police said.\nMs Chango-Alverez was staying at her mother's home in Streatham High Road.\nShe was born in Ecuador and had lived in London for 13 years, BBC London reporter Gareth Furby said. At the time of the crash, she was on her way to work in a hotel.\nThe remains of the bus stop, which was extensively damaged in the crash, have been removed.\nFlowers have been left at the site in tribute to the victim.\nA statement from her brother Kevin Raul Chango-Alverez said: \"My family has had its heart torn out, at this Christmas time, we will never be the same again.\n\"On Friday night we were together as a family with Veronica meeting her newly born nephew and preparing for Christmas.\n\"I last saw her alive as she left to go to work on Saturday morning, but moments later I was holding her hand as she passed away in the street.\"\nDescribing the crash as \"horrific\" Det Insp Gordon Wallace, said: \"The family are devastated. The memory of this senseless death will be with them each time they leave their home.\n\"The driver fled the scene abandoning the grey Audi, which was extensively damaged.\n\"We are looking to speak to Mr Nathan Davis in relation to this collision.\"\nThe 51-year-old man injured at the bus stop remains in a critical condition in hospital while the condition of the 29-year-old driver of the Volkswagen is now stable.","summary":"A man with links to a car that was involved in a fatal bus stop crash in south London is being sought by police.","id":"20836172"} 8 | {"document":"Belgian cyclist Demoitie died after a collision with a motorbike during Belgium's Gent-Wevelgem race.\nThe 25-year-old was hit by the motorbike after several riders came down in a crash as the race passed through northern France.\n\"The main issues come when cars or motorbikes have to pass the peloton and pass riders,\" Team Sky's Rowe said.\n\"That is the fundamental issue we're looking into.\n\"There's a lot of motorbikes in and around the race whether it be cameras for TV, photographers or police motorbikes.\n\"In total there's around 50 motorbikes that work on each race.\n\"We've got a riders union and we're coming together to think of a few ideas, whether we cap a speed limit on how fast they can overtake us.\n\"Say we put a 10 kilometres per hour limit on it, if we're going 50kph they're only allowed to pass us 60kph or something like that.\"\nDemoitie, who was riding for the Wanty-Gobert team, was taken to hospital in Lille but died later.\nThe sport's governing body, the UCI, said it would co-operate with all relevant authorities in an investigation into the incident.\nThe Professional Cyclists' Association (CPA) issued a statement asking what would be done to improve safety.\nDespite Demoitie's death, attitudes to road racing will stay the same says Rowe, who has been competing in Three Days of De Panne race in Belgium.\n\"As soon as that element of fear slips into your mind and you start thinking of things that 
could happen, that's when you're doomed to fail,\" he told BBC Wales Sport.\n\"If you start thinking about crashes and the consequences and what could potentially happen then you're never going to be at the front of the peloton and you're never going to win any races.\"\nIn a separate incident, another Belgian cyclist, Daan Myngheer, 22, died in hospital after suffering a heart attack during the first stage of the Criterium International in Corsica.","summary":"Welsh cyclist Luke Rowe says changes to the sport must be made following the death of Antoine Demoitie.","id":"35932467"} 9 | {"document":"Gundogan, 26, told BBC Sport he \"can see the finishing line\" after tearing cruciate knee ligaments in December, but will not rush his return.\nThe German missed the 2014 World Cup following back surgery that kept him out for a year, and sat out Euro 2016 because of a dislocated kneecap.\nHe said: \"It is heavy mentally to accept that.\"\nGundogan will not be fit for the start of the Premier League season at Brighton on 12 August but said his recovery time is now being measured in \"weeks\" rather than months.\nHe told BBC Sport: \"It is really hard always to fall and fight your way back. You feel good and feel ready, then you get the next kick.\n\"The worst part is behind me now. I want to feel ready when I am fully back. I want to feel safe and confident. I don't mind if it is two weeks or six.\"\nGundogan made 15 appearances and scored five goals in his debut season for City following his \u00a320m move from Borussia Dortmund.\nHe is eager to get on the field again and was impressed at the club's 4-1 win over Real Madrid in a pre-season game in Los Angeles on Wednesday.\nManager Pep Guardiola has made five new signings already this summer and continues to have an interest in Arsenal forward Alexis Sanchez and Monaco's Kylian Mbappe.\nGundogan said: \"Optimism for the season is big. It is huge, definitely.\n\"We felt that last year as well but it was a completely new experience for all of us. We know the Premier League a bit more now and can't wait for the season to start.\"\nCity complete their three-match tour of the United States against Tottenham in Nashville on Saturday.\nChelsea manager Antonio Conte said earlier this week he did not feel Tottenham were judged by the same standards as his own side, City and Manchester United.\nSpurs have had the advantage in their recent meetings with City, winning three and drawing one of their last four Premier League games.\nAnd Gundogan thinks they are a major threat.\nHe said: \"Tottenham are a great team. They have the style of football. They have young English players. 
Our experience last season shows it is really tough to beat them.\n\"They are really uncomfortable to play against.\n\"I am pretty sure, even if they will not say it loud, the people who know the Premier League know Tottenham are definitely a competitor for the title.\"","summary":"Manchester City midfielder Ilkay Gundogan says it has been mentally tough to overcome a third major injury.","id":"40758845"} 10 | {"document":"The crash happened about 07:20 GMT at the junction of the A127 and Progress Road in Leigh-on-Sea, Essex.\nThe man, who police said is aged in his 20s, was treated at the scene for a head injury and suspected multiple fractures, the ambulance service said.\nHe was airlifted to the Royal London Hospital for further treatment.\nThe Southend-bound carriageway of the A127 was closed for about six hours while police conducted their initial inquiries.\nA spokeswoman for Essex Police said it was not possible comment to further as this time as the \"investigation is now being conducted by the IPCC\".","summary":"A jogger has been hit by an unmarked police car responding to an emergency call, leaving him with \"serious life-changing injuries\".","id":"30358490"} 11 | {"document":"23 October 2015 Last updated at 17:44 BST\nIt's the highest rating a tropical storm can get and is the first one of this magnitude to hit mainland Mexico since 1959.\nBut how are the categories decided and what do they mean? Newsround reporter Jenny Lawrence explains.","summary":"Hurricane Patricia has been rated as a category 5 storm.","id":"34615665"} 12 | {"document":"Weaknesses in the way mice swapped data with computers left them vulnerable, said security firm Bastille Networks.\nAttackers could spoof poorly protected signals letting them use PCs as if they were sitting in front of them, it said.\nInformation about the loopholes have been passed to the makers of vulnerable mice, some of who are creating updates to make the mice more secure.\nThe radio signals sent by many wireless mice to a \"dongle\" plugged in to a computer were often unencrypted, said Marc Newlin and Balint Seeber, from Bastille, who carried out the research.\n\"That makes it possible for the attacker to send unencrypted traffic to the dongle pretending to be a keyboard and have it result as keystrokes on your computer,\" Mr Newlin said.\nBy contrast, they said, signals sent by wireless keyboards were scrambled to stop attackers eavesdropping on or spoofing them.\nThe pair found they could spoof signals for mice using a few lines of code and an antenna and dongle that cost $20 (\u00c2\u00a315).\nThe attack worked at distances of up to 180m (590ft).\nUsing this kit, they sent specially crafted mouse clicks that a computer interpreted as key presses, letting them run commands and take control of a target machine.\nThe Bastille researchers said many companies spent a lot of time and money securing the physical devices sitting on their networks but often neglected to keep an eye on data sent via radio.\nWireless mice produced by HP, Lenovo, Amazon and Dell were found to be vulnerable.\nBastille said it had reported its findings to the hardware makers and to the company that made the chipset used inside the spoofable mice.\nUpdates to the internal computer code, or firmware, for some of the vulnerable mice are now being made available,\nBut Bastille said many of the insecure mice it had found could not be updated.","summary":"Hackers could gain access to home and corporate networks via security flaws in wireless mice, suggests 
research.","id":"35890902"} 13 | {"document":"Administrators confirmed the redundancies affecting 38 staff at Galashiels-based Murray and Burrell.\nThe business, established in 1928, went into administration last week citing \"adverse trading conditions\".\nThere are hopes some of the workers affected could find posts at another building firm in nearby Melrose which currently requires staff.\nThomson Cooper partner Richard Gardiner was appointed as administrator at Murray and Burrell on Monday.\nA statement confirmed: \"Directors explored all options in an effort to preserve trading and jobs.\n\"Regrettably, 38 jobs were lost as there is no prospect of continuing to trade.\"\nSouth of Scotland MSP Rachael Hamilton described it as a \"sad day for the Borders\".\nHowever, some of the workers laid off could find employment with a Melrose-based company.\nJS Crawford has said that, with several housing projects on its books, it needs staff.","summary":"Dozens of jobs have been lost after efforts to save an historic building firm in the Scottish Borders failed.","id":"37922330"} 14 | {"document":"The EC's doubts about the arrangement were detailed in a document on Friday.\nThe EC said that its \"preliminary view is that the tax ruling... by Luxembourg in favour of Amazon constitutes state aid.\"\nHowever, Amazon said it \"has received no special tax treatment from Luxembourg\".\n\"We are subject to the same tax laws as other companies operating here [in Luxembourg],\" it said.\nThe Luxembourg finance ministry said: \"Luxembourg is confident that the state aid allegations in this case are without merit and will be able to convince the Commission of the legitimacy of the anticipatory decision in question and that no competitive advantage was granted,\" it said.\nThe European Commission began a probe of the tax arrangement last year, saying that it had suspicions it broke EU rules.\nThe Commission document, which was sent to the Luxembourg Ministry of Foreign Affairs in October, gives its rationale for launching the investigation.\nThe Commission said it had \"no indication\" that the tax arrangement was \"compatible with the internal market\".\nThe current European Commission chief, Jean-Claude Juncker, was prime minister of Luxembourg when the deal was struck.\nMr Juncker has come under pressure over claims that around 340 global companies were granted tax avoidance deals during his 18 year tenure in Luxembourg.\nCommission doubts over the Amazon deal included whether Luxembourg had properly looked into Amazon's \"transfer pricing\" proposals about how money would be moved between different Amazon subsidiaries.\nDoubts also existed about whether the country had assessed that the proposed tax regime was in line with market conditions before agreeing the deal in 2003, the European Commission document said.\nThe Commission also had questions about how royalty payments between certain Amazon companies were calculated, and whether \"Amazon has a financial incentive to exaggerate the amount of the royalty\" between its Luxembourg head office company and an Amazon firm that holds shares in the head office company.\n\"If the royalty is exaggerated, it would unduly reduce the tax paid by Amazon in Luxembourg by shifting profits to an untaxed entity from the perspective of corporate taxation,\" the EC said.\nIt added that Luxembourg might have been too hasty in assessing Amazon's requested arrangement before striking the deal.\nLuxembourg's finance ministry said it \"has provided all the information 
required by the Commission and cooperated fully with the Commission in its investigation.\"\n\"Among other things, detailed reports on the transfer price requested by the Commission were disclosed,\" it added.\nLuxembourg is also being investigated by the Commission over suspected \"sweetheart\" tax deals with the financing arm of carmaker Fiat.\nIn addition, Ireland's tax deal with Apple and the Netherlands' arrangement with Starbucks are being scrutinised as part of a crackdown on multinationals' tax avoidance schemes.","summary":"The European Commission has disclosed a preliminary finding that Amazon's tax arrangements in Luxembourg probably constitute \"state aid\".","id":"30844962"} 15 | {"document":"The three-day extravaganza of farming, food and family fun celebrates many aspects of agricultural life.\nThe Balmoral Show is run by the Royal Ulster Agricultural Society (RUAS) and dates back 148 years.\nLast year, it attracted more than 90,000 visitors to its recently-adopted home outside Lisburn in County Antrim.\nIt was traditionally staged at the RUAS's headquarters in south Belfast, but the show moved to a larger venue on the site of the former Maze prison in 2013.\nThe Maze venue, re-named Balmoral Park, is now hosting the show for the fourth consecutive year.\nThe 2016 event coincides with Northern Ireland's Year of Food and Drink, and local produce features prominently in the exhibitions.\nOne of this year's highlights is an \"edible garden\", in which visitors can see their food growing in the ground before it gets to their plates.\nThe aim of the garden is to encourage people to grow their own food at home.\nThe event will also showcase the best of local livestock, with prized pigs, cattle, poultry and ponies all lining up in bid to be the stars of the show.\nTheir owners will also get a chance to shine, with horse riding and show jumping displays along with sheep shearing competitions and awards for the best livestock breeders and handlers.\nFor younger visitors, there is a family fun area hosting displays from the Northern Ireland School of Falconry as well as a gun dog skills demonstration and a performance from balloon artist Bruce Airhead.\nBBC News NI are covering the event live on social media on Wednesday on Twitter at @BBCNewsNI, on Snapchat at bbcnewsni, and on BBC Newsline's Facebook page.","summary":"The most important event in Northern Ireland's agricultural calendar - the Balmoral Show - has opened with thousands of people attending.","id":"36217333"} 16 | {"document":"Mr Mosley wants Google to block photos of him at a sex party first printed in the now-defunct News of the World, which he successfully sued in 2008.\nHe is suing the internet firm for breaches of the Data Protection Act and misusing private information.\nGoogle's barrister argued that Mr Mosley no longer has a \"reasonable expectation of privacy\".\nMr Mosley won damages from the News of the World after it published a story alleging he had organised a Nazi-themed orgy.\nPhotographs and a video which show his private sexual activity were originally obtained by News Group Newspapers Limited (NGN) in a clandestine \"sting\" operation.\nMr Mosley - the son of 1930s fascist leader Sir Oswald Mosley - won \u00c2\u00a360,000 after a judge ruled there was no substance to the allegation that there had been a Nazi theme to the sex party and found that his privacy had been breached.\nIn that ruling, the High Court also said the article was not in the public interest.\nMr Mosley has said the role-play at a 
rented Chelsea basement flat was harmless, consensual and private.\nOn launching his legal action last year, Mr Mosley urged: \"Google should operate within the law rather than according to rules it makes itself. It cannot be allowed to ignore judgements in our courts.\"\nGoogle has said it will remove URLs that it is alerted to, but is not prepared to remove the images entirely from its search engines.\nIn court on Wednesday, Google's barrister Antony White QC for Google conceded that it was technically possible to remove the images and was \"not burdensome\" to do so.\nHowever, he argued that Google was not the publisher of the private information, and that Mr Mosley no longer had a reasonable expectation of privacy in relation to the images.\nOn that basis, Google will seek to show that Mr Mosley's claim is unfounded.\nThe hearing is due to conclude on Thursday.","summary":"Google has asked the High Court to throw out legal action being taken by ex-Formula 1 boss Max Mosley.","id":"30816523"} 17 | {"document":"The Bath-born player, 28, has made 36 appearances for the Dragons since joining from Wasps in 2015.\nHe is in his second season and signed a contract extension in December 2016.\nDragons forwards coach Ceri Jones said: \"It's a big blow. Eddie has been excellent all year for us, he has really stepped up to the mark and will be a big loss.\"\nHowever, Jones says Jackson's misfortune can be a chance for others to thrive.\n\"We are very fortunate to have the likes of Ollie Griffiths, Harrison Keddie, James Thomas who can come into the back-row,\" said Jackson.\n\"Harri has shown glimpses of what he can do all season and there's definitely a player there, so this is an opportunity.\"\nDragons travel to Munster in the Pro12 on Friday.","summary":"Newport Gwent Dragons number eight Ed Jackson has undergone shoulder surgery and faces a spell on the sidelines.","id":"38900884"} 18 | {"document":"The announcement ends months of uncertainty for Cornish Language Partnership staff whose contracts had been due to end.\nLocal government minister Andrew Stunnell said the three-year funding package for the service would help make sure the language survived.\nBut he warned that long term funding should come from Cornwall.\nHe said it was \"important to make sure the Cornish were given the opportunity to put down sound foundations.\"\n\"In the longer term support for the Cornish language is going to be something which is going to have to be based in Cornwall and will not come from London,\" he added.\nThe Cornish Language Partnership's, Jennifer Lowe, said: \"We can now plan for the future thanks to the funding.\"\nThe United Nations recently upgraded the status of the Cornish language from \"extinct\" to \"critically endangered\".\nIt is thought fewer than 500 people worldwide are fluent in the language.","summary":"The government is spending nearly \u00a3400,000 to help save the Cornish language.","id":"13890581"} 19 | {"document":"Jardim, in charge since 2014, described the last three years at the club as \"exceptional\".\nMonaco finished eight points ahead of nearest rivals Paris St-Germain to be crowned champions of France in 2016-17.\n\"I feel part of AS Monaco and the principality,\" said Portuguese Jardim, the former Olympiakos boss.\nMonaco also beat Tottenham and Manchester City on their way to reaching the semi-finals of the Champions League during 2016-17, before losing to Juventus 4-1 on aggregate in the semi-finals.\nMonaco vice-president Vadim Vasilyev said Jardim had received offers 
to coach elsewhere.\n\"He is one of the best coaches in European football and despite other offers he has chosen to continue the adventure at Monaco, which demonstrates our ambition,\" added Vasilyev.","summary":"Monaco boss Leonardo Jardim has been rewarded for steering the club to their first Ligue 1 title for 17 years with a new contract until 2020.","id":"40194700"} 20 | --------------------------------------------------------------------------------