├── README.md
├── app.py
├── app1.py
├── app2.py
├── app3.py
└── requirement.txt


/README.md:
--------------------------------------------------------------------------------
 1 | # Document-Summarization
 2 | A document summarization app built with a large language model (LLM) and the LangChain framework. It loads a pre-trained T5 model and its tokenizer from the Hugging Face
 3 | Transformers library and wraps them in a summarization pipeline that generates the summary.
 4 | 
 5 | 1. Import Statements:
 6 |    - The script begins by importing Streamlit, LangChain, Transformers, PyTorch, and Python's `base64` module.
 7 | 
 8 | 2. Model and Tokenizer Loading:
 9 |    - The code loads a pre-trained T5 model (`MBZUAI/LaMini-Flan-T5-248M`, a Transformer-based sequence-to-sequence model) and its associated tokenizer from the Hugging Face
10 |      Transformers library. This model is used for text summarization.
11 | 
12 | 3. File Loader and Preprocessing:
13 |    - The `file_preprocessing` function loads a PDF file with LangChain's `PyPDFLoader` and splits it into chunks of up to 200 characters with a 50-character overlap. The
14 |      chunks are then concatenated into a single string that is passed to the summarization pipeline.
15 | 
16 | 4. LLM Pipeline:
17 |    - The `llm_pipeline` function sets up a summarization pipeline using the pre-trained T5 model and tokenizer. It takes the path of the uploaded PDF, runs `file_preprocessing`
18 |      on it, and generates a summary of 50 to 500 tokens using the model.
19 | 
20 | 5. Streamlit Setup:
21 |    - The Streamlit app is set up with a title and an option to upload a PDF file.
22 | 
23 | 6. Main Function:
24 |    - The `main` function is the entry point of the app.
25 |    - It provides a file upload button and a "Summarize" button.
26 |    - When a PDF file is uploaded and the "Summarize" button is clicked, it displays the uploaded PDF on the left side and the generated summary on the right side
27 |      of the Streamlit app.
28 | 
29 | 7. HTML Display of PDF:
30 |    - The `displayPDF` function converts the uploaded PDF file into base64 format and embeds it in an HTML iframe, allowing the PDF to be displayed in the app.
31 | 
32 | 8. Streamlit Configuration:
33 |    - The app's layout is configured to be "wide" using `st.set_page_config`.
34 | 
35 | 9. Running the App:
36 |    - The app is launched when the script is run as the main module (`if __name__ == "__main__": main()`).
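
The chunking described in step 3 can be pictured with a small standard-library sketch. This is not LangChain's `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and newlines; `split_with_overlap` is a hypothetical helper that only illustrates what a chunk size of 200 and an overlap of 50 mean for fixed-width character windows.

```python
def split_with_overlap(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-width character windows that overlap.

    Hypothetical helper for illustration; it is not part of the app or of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 500 characters of sample text, split with the same settings the app uses.
sample = "".join(str(i % 10) for i in range(500))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Adjacent chunks share their boundary characters, so joining the chunks back together repeats the overlapping text.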
37 | 
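
One possible way to use the chunks from steps 3 and 4, sketched under assumptions: summarize each chunk separately and join the partial summaries, rather than passing one long string to the model. `summarize_in_chunks` and `first_sentence` are hypothetical names; in the app, the `summarize` argument could be a function that wraps the Transformers pipeline and returns `result[0]['summary_text']`.

```python
from typing import Callable, Iterable


def summarize_in_chunks(chunks: Iterable[str], summarize: Callable[[str], str], separator: str = " ") -> str:
    """Summarize each chunk independently and join the partial summaries.

    Hypothetical helper: `summarize` is any callable that maps text to a summary.
    """
    partial = []
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue  # skip chunks that contain only whitespace
        partial.append(summarize(chunk).strip())
    return separator.join(partial)


def first_sentence(text: str) -> str:
    """Stand-in summarizer (an assumption for this demo): keep the text up to the first period."""
    head, _, _ = text.partition(".")
    return head + "."


pages = ["Cats sleep a lot. They also hunt.", "   ", "Dogs bark. They guard homes."]
combined = summarize_in_chunks(pages, first_sentence)
print(combined)
```

Passing the summarizer in as a parameter keeps the chunk handling testable without loading a model.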
38 | The main functionality of this app is to upload a PDF document, process it, and then display both the PDF and a summarized version of the document.
39 | It utilizes a pre-trained language model for text summarization and Streamlit for creating a user-friendly interface. Users can upload PDFs and quickly obtain 
40 | summarized content from them.
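
The PDF embedding from step 7 can be sketched with the standard library alone. `pdf_iframe_html` is a hypothetical helper, not part of the app; it builds the same kind of `<iframe>` markup that `displayPDF` passes to `st.markdown`, but from bytes in memory instead of a file path.

```python
import base64


def pdf_iframe_html(pdf_bytes: bytes, height: int = 600) -> str:
    """Return an <iframe> tag that embeds the PDF as a base64 data URI.

    Hypothetical helper for illustration; the app's displayPDF reads the bytes from a file.
    """
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        f'<iframe src="data:application/pdf;base64,{encoded}" '
        f'width="100%" height="{int(height)}" type="application/pdf"></iframe>'
    )


# Placeholder bytes stand in for a real PDF file's contents.
markup = pdf_iframe_html(b"%PDF-1.4 minimal example bytes")
print(markup[:60])
```

In a Streamlit app the returned string would be rendered with `st.markdown(markup, unsafe_allow_html=True)`, as `displayPDF` does.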
41 | 


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
 1 | import streamlit as st 
 2 | from langchain.text_splitter import RecursiveCharacterTextSplitter
 3 | from langchain.document_loaders import PyPDFLoader, DirectoryLoader
 4 | from langchain.chains.summarize import load_summarize_chain
 5 | from transformers import T5Tokenizer, T5ForConditionalGeneration
 6 | from transformers import pipeline
 7 | import torch
 8 | import base64
 9 | 
10 | #model and tokenizer loading
11 | checkpoint = "MBZUAI/LaMini-Flan-T5-248M"
12 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
13 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint, device_map='auto', torch_dtype=torch.float32)
14 | 
15 | #file loader and preprocessing
16 | def file_preprocessing(file):
17 |     loader =  PyPDFLoader(file)
18 |     pages = loader.load_and_split()
19 |     text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
20 |     texts = text_splitter.split_documents(pages)
21 |     final_texts = ""
22 |     for text in texts:
23 |         print(text)
24 |         final_texts = final_texts + text.page_content
25 |     return final_texts
26 | 
27 | #LLM pipeline
28 | def llm_pipeline(filepath):
29 |     pipe_sum = pipeline(
30 |         'summarization',
31 |         model = base_model,
32 |         tokenizer = tokenizer,
33 |         max_length = 500, 
34 |         min_length = 50)
35 |     input_text = file_preprocessing(filepath)
36 |     result = pipe_sum(input_text)
37 |     result = result[0]['summary_text']
38 |     return result
39 | 
40 | #function to display the PDF of a given file 
41 | @st.cache_data
42 | def displayPDF(file):
43 |     # Opening file from file path
44 |     with open(file, "rb") as f:
45 |         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
46 | 
47 |     # Embedding PDF in HTML
48 |     pdf_display = F'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
49 | 
50 |     # Displaying File
51 |     st.markdown(pdf_display, unsafe_allow_html=True)
52 | 
53 | #streamlit code 
54 | st.set_page_config(layout="wide")
55 | 
56 | def main():
57 |     st.title("Document Summarization App")
58 | 
59 |     uploaded_file = st.file_uploader("Upload your PDF file", type=['pdf'])
60 | 
61 |     if uploaded_file is not None:
62 |         if st.button("Summarize"):
63 |             col1, col2 = st.columns(2)
64 |             filepath = "data/"+uploaded_file.name
65 |             with open(filepath, "wb") as temp_file:
66 |                 temp_file.write(uploaded_file.read())
67 |             with col1:
68 |                 st.info("Uploaded File")
69 |                 pdf_view = displayPDF(filepath)
70 | 
71 |             with col2:
72 |                 summary = llm_pipeline(filepath)
73 |                 st.info("Summarization Complete")
74 |                 st.success(summary)
75 | 
76 | 
77 | 
78 | if __name__ == "__main__":
79 |     main()


--------------------------------------------------------------------------------
/app1.py:
--------------------------------------------------------------------------------
 1 | import streamlit as st
 2 | from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
 3 | import torch
 4 | import base64
 5 | from langchain.text_splitter import RecursiveCharacterTextSplitter
 6 | from langchain.document_loaders import PyPDFLoader
 7 | 
 8 | # Model and tokenizer loading
 9 | # model_name ="google/pegasus-large" # good
10 | # model_name = "t5-large" # good
11 | model_name = "facebook/bart-large"
12 | tokenizer = AutoTokenizer.from_pretrained(model_name)
13 | base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
14 | 
15 | # File loader and preprocessing
16 | def file_preprocessing(file):
17 |     loader = PyPDFLoader(file)
18 |     pages = loader.load_and_split()
19 |     text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
20 |     texts = text_splitter.split_documents(pages)
21 |     final_texts = ""
22 |     for text in texts:
23 |         print(text)
24 |         final_texts = final_texts + text.page_content
25 |     return final_texts
26 | 
27 | # LLM pipeline
28 | def llm_pipeline(filepath):
29 |     pipe_sum = pipeline(
30 |         'summarization',
31 |         model=base_model,
32 |         tokenizer=tokenizer,
33 |         min_length=50
34 |     )
35 |     input_text = file_preprocessing(filepath)
36 |     result = pipe_sum(input_text)
37 |     result = result[0]['summary_text']
38 |     return result
39 | 
40 | # Function to display the PDF of a given file
41 | def displayPDF(file):
42 |     # Opening file from file path
43 |     with open(file, "rb") as f:
44 |         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
45 |     # Embedding PDF in HTML
46 |     pdf_display = F'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
47 |     # Displaying File
48 |     st.markdown(pdf_display, unsafe_allow_html=True)
49 | 
50 | # Streamlit code
51 | st.set_page_config(layout="wide")
52 | 
53 | def main():
54 |     st.title("Document Summarization App")
55 |     uploaded_file = st.file_uploader("Upload your PDF file", type=['pdf'])
56 |     if uploaded_file is not None:
57 |         if st.button("Summarize"):
58 |             col1, col2 = st.columns(2)
59 |             filepath = "data/" + uploaded_file.name
60 |             with open(filepath, "wb") as temp_file:
61 |                 temp_file.write(uploaded_file.read())
62 |             with col1:
63 |                 st.info("Uploaded File")
64 |                 pdf_view = displayPDF(filepath)
65 |             with col2:
66 |                 summary = llm_pipeline(filepath)
67 |                 st.info("Summarization Complete")
68 |                 st.success(summary)
69 | 
70 | if __name__ == "__main__":
71 |     main()


--------------------------------------------------------------------------------
/app2.py:
--------------------------------------------------------------------------------
 1 | import streamlit as st
 2 | import faiss
 3 | import torch
 4 | import numpy as np
 5 | import base64
 6 | from langchain.text_splitter import RecursiveCharacterTextSplitter
 7 | from langchain.document_loaders import PyPDFLoader
 8 | from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline
 9 | 
10 | # Initialize Faiss index and storage
11 | dimension = 768 # Change this dimension to match your language model's output dimension
12 | index = faiss.IndexFlatL2(dimension) # You can choose a different index type if needed
13 | doc_vectors = [] # List to store document vectors
14 | 
15 | # Load tokenizer and model
16 | checkpoint = "MBZUAI/LaMini-Flan-T5-248M"
17 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
18 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint, device_map='auto', torch_dtype=torch.float32)
19 | 
20 | # Vectorize text by mean-pooling the encoder's final hidden states
21 | def vectorize_text(text):
22 |     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
23 |     with torch.no_grad():
24 |         outputs = base_model.encoder(**inputs.to(base_model.device))  # encoder only: the full seq2seq forward pass also needs decoder inputs
25 |     return outputs.last_hidden_state.mean(dim=1).cpu().numpy().astype('float32')
26 | 
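27 | # Load the PDF, split it into chunks, and store each chunk's vector in the Faiss index
28 | def file_preprocessing_and_vectorization(file):
29 |     loader = PyPDFLoader(file)
30 |     pages = loader.load_and_split()
31 |     text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
32 |     texts = text_splitter.split_documents(pages)
33 | 
34 |     for text in texts:
35 |         vector = vectorize_text(text.page_content)
36 |         doc_vectors.append(vector)
37 |         index.add(vector) # vector already has shape (1, dimension), as Faiss expects
38 | 
39 |     return texts
40 | 
41 | # Function to display the PDF of a given file
42 | def displayPDF(file):
43 |     with open(file, "rb") as f:
44 |         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
45 |     pdf_display = F'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
46 |     st.markdown(pdf_display, unsafe_allow_html=True)
47 | 
48 | # Document summarization pipeline using the language model
49 | def llm_pipeline(input_text):
50 |     pipe_sum = pipeline(
51 |         'summarization',
52 |         model=base_model,
53 |         tokenizer=tokenizer,
54 |         max_length=500,
55 |         min_length=50
56 |     )
57 |     result = pipe_sum(input_text)
58 |     result = result[0]['summary_text']
59 |     return result
60 | 
61 | # Streamlit code
62 | st.set_page_config(layout="wide")
63 | 
64 | def main():
65 |     st.title("Document Summarization App")
66 |     uploaded_file = st.file_uploader("Upload your PDF file", type=['pdf'])
67 |     if uploaded_file is not None:
68 |         if st.button("Summarize"):
69 |             col1, col2 = st.columns(2)
70 |             filepath = "data/" + uploaded_file.name
71 |             with open(filepath, "wb") as temp_file:
72 |                 temp_file.write(uploaded_file.read())
73 |             with col1:
74 |                 st.info("Uploaded File")
75 |                 displayPDF(filepath)
76 |             with col2:
77 |                 texts = file_preprocessing_and_vectorization(filepath)
78 |                 input_text = "".join(text.page_content for text in texts)  # join all chunks so the whole document is summarized
79 |                 summary = llm_pipeline(input_text)
80 |                 st.info("Summarization Complete")
81 |                 st.success(summary)
82 | 
83 | 
84 | if __name__ == "__main__":
85 |     main()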


--------------------------------------------------------------------------------
/app3.py:
--------------------------------------------------------------------------------
 1 | import streamlit as st
 2 | import faiss
 3 | import numpy as np
 4 | import base64
 5 | import torch
 6 | from transformers import T5Tokenizer, T5ForConditionalGeneration, pipeline
 7 | 
 8 | 
 9 | # Initialize Faiss index and storage
10 | dimension = 768  # Change this dimension to match your language model's output dimension
11 | num_clusters = 1000 # Adjust the number of clusters based on your requirements
12 | num_sub_quantizers = 64 # Adjust the number of sub-quantizers for IndexIVFPQ
13 | index = faiss.IndexIVFPQ(faiss.IndexFlatL2(dimension), dimension, num_clusters, num_sub_quantizers, 8)  # arguments: quantizer, d, nlist, m, nbits
14 | doc_ids = []  # List to store document IDs for retrieval
15 | doc_vectors = []  # List to store document vectors
16 | 
17 | # Load tokenizer and model
18 | checkpoint = "MBZUAI/LaMini-Flan-T5-248M"
19 | tokenizer = T5Tokenizer.from_pretrained(checkpoint)
20 | base_model = T5ForConditionalGeneration.from_pretrained(checkpoint, device_map='auto', torch_dtype=torch.float32)
21 | 
22 | # Function to vectorize text using your language model
23 | def vectorize_text(text):
24 |     inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
25 |     with torch.no_grad():
26 |         outputs = base_model(**inputs)
27 |     return outputs.last_hidden_state.mean(dim=1).numpy().astype('float32')
28 | 
29 | # Function to display the PDF of a given file
30 | def displayPDF(file):
31 |     with open(file, "rb") as f:
32 |         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
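33 |     pdf_display = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'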
34 |     st.markdown(pdf_display, unsafe_allow_html=True)
35 | 
36 | # Document summarization pipeline using the language model
37 | def llm_pipeline(input_text):
38 |     pipe_sum = pipeline(
39 |         'summarization',
40 |         model=base_model,
41 |         tokenizer=tokenizer,
42 |         max_length=500,
43 |         min_length=50
44 |     )
45 |     result = pipe_sum(input_text)
46 |     result = result[0]['summary_text']
47 |     return result
48 | 
49 | # Main Streamlit application
50 | st.set_page_config(layout="wide")
51 | 
52 | def main():
53 |         st.title("Document Summarization App")
54 |         uploaded_file = st.file_uploader("Upload your PDF file", type=['pdf'])
55 | 
56 |         if uploaded_file is not None:
57 |             if st.button("Summarize"):
58 |                 col1, col2 = st.columns(2)
59 |                 filepath = "data/" + uploaded_file.name
60 |                 
61 |                 with open(filepath, "wb") as temp_file:
62 |                     temp_file.write(uploaded_file.read())
63 |                 with col1:
64 |                     st.info("Uploaded File")
65 |                     displayPDF(filepath)
66 | 
67 |                 with col2:
68 |                     from langchain.document_loaders import PyPDFLoader  # extract the page text instead of decoding the raw PDF bytes
69 |                     input_text = "".join(page.page_content for page in PyPDFLoader(filepath).load())
70 |                     summary = llm_pipeline(input_text)
71 |                     st.info("Summarization Complete")
72 |                     st.success(summary)
73 | 
74 | if __name__ == "__main__":
75 |         main()
76 | 


--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
 1 | langchain
 2 | sentence_transformers
 3 | torch
 4 | sentencepiece
 5 | transformers
 6 | accelerate
 7 | pypdf
 8 | tiktoken
 9 | streamlit
10 | chromadb
11 | faiss-cpu


--------------------------------------------------------------------------------