├── LICENSE
├── README.md
├── answers.py
├── answersgradio.py
├── docs
    └── EDPannualreport.pdf
└── models
    └── Download and put model here.txt


/LICENSE:
--------------------------------------------------------------------------------
Creative Commons Legal Code

CC0 1.0 Universal

    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
    LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
    REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
    PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
    THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
    HEREUNDER.

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.

For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:

  i. the right to reproduce, adapt, distribute, perform, display,
     communicate, and translate a Work;
 ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
     likeness depicted in a Work;
 iv. rights protecting against unfair competition in regards to a Work,
     subject to the limitations in paragraph 4(a), below;
  v. rights protecting the extraction, dissemination, use and reuse of data
     in a Work;
 vi. database rights (such as those arising under Directive 96/9/EC of the
     European Parliament and of the Council of 11 March 1996 on the legal
     protection of databases, and under any national implementation
     thereof, including any amended or successor version of such
     directive); and
vii. other similar, equivalent or corresponding rights throughout the
     world based on applicable law or treaty, and any national
     implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.

4. Limitations and Disclaimers.

 a. No trademark or patent rights held by Affirmer are waived, abandoned,
    surrendered, licensed or otherwise affected by this document.
 b. Affirmer offers the Work as-is and makes no representations or
    warranties of any kind concerning the Work, express, implied,
    statutory or otherwise, including without limitation warranties of
    title, merchantability, fitness for a particular purpose, non
    infringement, or the absence of latent or other defects, accuracy, or
    the present or absence of errors, whether or not discoverable, all to
    the greatest extent permissible under applicable law.
 c. Affirmer disclaims responsibility for clearing rights of other persons
    that may apply to the Work or any use thereof, including without
    limitation any person's Copyright and Related Rights in the Work.
    Further, Affirmer disclaims responsibility for obtaining any necessary
    consents, permissions or other rights required for any use of the
    Work.
 d. Affirmer understands and acknowledges that Creative Commons is not a
    party to this document and has no duty or obligation with respect to
    this CC0 or use of the Work.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# LLM your Docs Locally
### Use a local Large Language Model to read and provide answers on your local files


Uses https://huggingface.co/mosaicml/mpt-7b-instruct, a model published by
MosaicML https://mosaicml.com

PDF documents in the `docs` folder are loaded into a vectorstore.
EDPannualreport.pdf is included as an example.
It is the Annual Report from EDP, a large company in the energy sector.

The model can then answer questions about the document(s) it has ingested.

There are two versions: a CLI version and a GUI version based on Gradio.

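To make the vectorstore idea concrete, here is a minimal, dependency-free sketch of the flow: split text into chunks, turn each chunk into a vector, and return the chunks closest to a question. It is not the project's code. The scripts in this repository use LangChain with HuggingFace embeddings, while this toy version assumes simple bag-of-words vectors and cosine similarity. The names `split_text`, `embed`, `cosine` and `ToyVectorStore` are hypothetical, and the sample sentences are invented, not taken from the EDP report.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store holding (vector, chunk, source) entries."""

    def __init__(self):
        self.entries = []

    def add(self, text, source, chunk_size=1000):
        for chunk in split_text(text, chunk_size):
            self.entries.append((embed(chunk), chunk, source))

    def search(self, query, k=2):
        """Return the k chunks most similar to the query, with their sources."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [(chunk, source) for _, chunk, source in ranked[:k]]


# Invented sample text, used only to exercise the sketch
store = ToyVectorStore()
store.add("Wind and solar capacity grew during the year.", "report.pdf")
store.add("The board approved a new dividend policy.", "report.pdf")
top_chunk, top_source = store.search("How much did solar capacity grow?", k=1)[0]
print(top_source, "->", top_chunk)
```

In the real scripts the retrieved chunks are not printed directly; they are passed to the language model as context, which is why each answer is followed by its source passages.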

(Optional) You can create and activate a virtual environment with (Windows activation shown):
```
python -m venv "venv"
venv\Scripts\activate
```

To install, run:
```
git clone https://github.com/vluz/LLMDocsLocal.git
cd LLMDocsLocal
pip install -r requirements.txt
```

To run, use:
`python answers.py` for the CLI version
or
`python answersgradio.py` for the Gradio version

Note: Not tested, do not use in production

--------------------------------------------------------------------------------
/answers.py:
--------------------------------------------------------------------------------
import re
import os
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GPT4All


embeddings_model_name = "all-MiniLM-L6-v2"
model_path = "models/ggml-mpt-7b-instruct.bin"
pdf_folder_path = 'docs'


def clear_screen():
    # 'cls' only exists on Windows; other platforms use 'clear'
    os.system('cls' if os.name == 'nt' else 'clear')


clear_screen()
print("Loading Documents...")
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
# Only pick up PDF files, so stray files in the folder are not handed to the PDF loader
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
           for fn in os.listdir(pdf_folder_path) if fn.lower().endswith('.pdf')]
# Reuse the embeddings configured above (the original created a second, default-model instance here)
index = VectorstoreIndexCreator(
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
).from_loaders(loaders)
print("Done.")
llm = GPT4All(model=model_path, n_ctx=1000, backend='mpt', verbose=False)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=index.vectorstore.as_retriever(), return_source_documents=True)
clear_screen()
print("\n\n\n")
# Raw strings keep the backslashes in the banner from being read as escape sequences
print(r" __ __ __ __ ____ _____ ___ ___ __ _____ ___ __ __ ")
print(r"( ) ( ) ( \/ ) ( _ \( _ )/ __)/ __) ( ) ( _ )/ __) /__\ ( )")
print(r" )(__ )(__ ) ( )(_) ))(_)(( (__ \__ \ )(__ )(_)(( (__ /(__)\ )(__ ")
print(r"(____)(____)(_/\/\_) (____/(_____)\___)(___/ (____)(_____)\___)(__)(__)(____)")

while True:
    prompt = input("\nPrompt: ")
    res = qa(prompt)
    answer = res['result']
    docs = res['source_documents']
    print("\nAnswer: ")
    print(answer)
    print("\n----------------------")
    for document in docs:
        # Strip everything except letters, digits and spaces before printing the source passage
        texto = re.sub('[^A-Za-z0-9 ]+', '', document.page_content)
        print("\n" + document.metadata["source"] + ' -> ' + texto)
    print("\n######################")

--------------------------------------------------------------------------------
/answersgradio.py:
--------------------------------------------------------------------------------
import os
import gradio as gr
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GPT4All


embeddings_model_name = "all-MiniLM-L6-v2"
model_path = "models/ggml-mpt-7b-instruct.bin"
pdf_folder_path = 'docs'
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
# Only pick up PDF files, so stray files in the folder are not handed to the PDF loader
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
           for fn in os.listdir(pdf_folder_path) if fn.lower().endswith('.pdf')]
# Reuse the embeddings configured above (the original created a second, default-model instance here)
index = VectorstoreIndexCreator(
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
).from_loaders(loaders)
llm = GPT4All(model=model_path, n_ctx=1000, backend='mpt', verbose=False)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=index.vectorstore.as_retriever(), return_source_documents=True)
examples = [["How much did EDP earn in the last quarter of 2022?"]]


def generate(prompt):
    # Run the retrieval chain and append the source passages to the answer
    res = qa(prompt)
    answer = res['result']
    docs = res['source_documents']
    output = answer + "\n"
    for document in docs:
        output += ("\n\n\n" + document.metadata["source"] + ' -> ' + document.page_content)
    print(output)
    return output


# gr.Textbox replaces the deprecated gr.inputs.Textbox / gr.outputs.Textbox (Gradio 3 and later)
app = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Textbox(label="Answer"),
    examples=examples
)

app.launch()

--------------------------------------------------------------------------------
/docs/EDPannualreport.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vluz/LLMDocsLocal/1f00512ae0d7fc5dd35c5fa16d1582668158ac5a/docs/EDPannualreport.pdf

--------------------------------------------------------------------------------
/models/Download and put model here.txt:
--------------------------------------------------------------------------------
ggml-mpt-7b-instruct.bin from GPT4ALL

--------------------------------------------------------------------------------
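
Both scripts assume the model file has been downloaded into `models` and that `docs` holds at least one PDF before they start. A pre-flight check along the following lines might catch a missing file early. This is only a sketch: `check_setup` is a hypothetical helper that is not part of the repository, and its default paths are simply the values hard-coded in `answers.py`.

```python
import os


def check_setup(model_path="models/ggml-mpt-7b-instruct.bin", pdf_folder_path="docs"):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if not os.path.isfile(model_path):
        problems.append(f"Model file not found: {model_path}")
    if not os.path.isdir(pdf_folder_path):
        problems.append(f"Document folder not found: {pdf_folder_path}")
    else:
        # Same filter idea as the loaders: only files ending in .pdf count
        pdfs = [fn for fn in os.listdir(pdf_folder_path) if fn.lower().endswith(".pdf")]
        if not pdfs:
            problems.append(f"No PDF files in: {pdf_folder_path}")
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

Run from the repository root, it prints nothing when both the model and a PDF are in place.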