├── Img
│   ├── distiller.gif
│   ├── sidebar_img.jpeg
│   └── workflow.png
├── Index
│   ├── .DS_Store
│   ├── BERT
│   │   ├── index.faiss
│   │   └── index.pkl
│   ├── attention_is_all_you_need
│   │   ├── index.faiss
│   │   └── index.pkl
│   ├── gpt3
│   │   ├── index.faiss
│   │   └── index.pkl
│   └── whisper
│       ├── index.faiss
│       └── index.pkl
├── Papers
│   ├── .DS_Store
│   ├── attention_is_all_you_need.pdf
│   ├── bert.pdf
│   ├── gpt3.pdf
│   └── whisper.pdf
├── README.md
├── distiller.py
└── distller.db

/Img/distiller.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Img/distiller.gif
--------------------------------------------------------------------------------
/Img/sidebar_img.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Img/sidebar_img.jpeg
--------------------------------------------------------------------------------
/Img/workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Img/workflow.png
--------------------------------------------------------------------------------
/Index/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/.DS_Store
--------------------------------------------------------------------------------
/Index/BERT/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/BERT/index.faiss
--------------------------------------------------------------------------------
/Index/BERT/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/BERT/index.pkl
--------------------------------------------------------------------------------
/Index/attention_is_all_you_need/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/attention_is_all_you_need/index.faiss
--------------------------------------------------------------------------------
/Index/attention_is_all_you_need/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/attention_is_all_you_need/index.pkl
--------------------------------------------------------------------------------
/Index/gpt3/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/gpt3/index.faiss
--------------------------------------------------------------------------------
/Index/gpt3/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/gpt3/index.pkl
--------------------------------------------------------------------------------
/Index/whisper/index.faiss:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/whisper/index.faiss
--------------------------------------------------------------------------------
/Index/whisper/index.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Index/whisper/index.pkl
--------------------------------------------------------------------------------
/Papers/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Papers/.DS_Store
--------------------------------------------------------------------------------
/Papers/attention_is_all_you_need.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Papers/attention_is_all_you_need.pdf
--------------------------------------------------------------------------------
/Papers/bert.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Papers/bert.pdf
--------------------------------------------------------------------------------
/Papers/gpt3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Papers/gpt3.pdf
--------------------------------------------------------------------------------
/Papers/whisper.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rlancemartin/paper-distiller/6cff93335e654d2f3304bb8e4c16931501eba6a6/Papers/whisper.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Write-up:
https://lancemartin.notion.site/Langchain-for-paper-summarization-d4ad122ea9a64c0eb1f981e743d6c419

Local testing:
`streamlit run distiller.py`

Hosted app for general QA using ChatGPT:
https://pineappleexpress808-doc-gpt-doc-gpt-q0823l.streamlit.app/

Improvements:
+ Improve data loader (chunk size sensitivity is a problem)
+ Add ChatGPT

--------------------------------------------------------------------------------
/distiller.py:
--------------------------------------------------------------------------------
import os
import time
import pypdf
import pickledb
import pandas as pd
import streamlit as st
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains.question_answering import load_qa_chain

# Image banner for Streamlit app
st.sidebar.image("Img/sidebar_img.jpeg")

# Get papers (format as pdfs)
papers = [l.split('.')[0] for l in os.listdir("Papers/") if l.endswith('.pdf')]
selectbox = st.sidebar.radio('Which paper to distill?',papers)
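
# NOTE: OpenAIEmbeddings and OpenAI below read the OpenAI API key from the
# OPENAI_API_KEY environment variable by default, so export it before launching
# the app with `streamlit run distiller.py`.
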
# Paper distillation
class PaperDistiller:

    def __init__(self,paper_name):

        self.name = paper_name
        self.answers = {}
        # Use pickledb as local q-a store (save cost)
        self.cached_answers = pickledb.load('distller.db',auto_dump=False,sig=False)

    def split_pdf(self,chunk_chars=4000,overlap=50):
        """
        Pre-process PDF into chunks
        Some code from: https://github.com/whitead/paper-qa/blob/main/paperqa/readers.py
        """

        pdfFileObj = open("Papers/%s.pdf"%self.name, "rb")
        pdfReader = pypdf.PdfReader(pdfFileObj)
        splits = []
        split = ""
        for i, page in enumerate(pdfReader.pages):
            split += page.extract_text()
            if len(split) > chunk_chars:
                splits.append(split[:chunk_chars])
                split = split[chunk_chars - overlap:]
        # Keep any trailing text shorter than chunk_chars so the end of the paper is not dropped
        if split:
            splits.append(split)
        pdfFileObj.close()
        return splits


    def read_or_create_index(self):
        """
        Read or generate embeddings for pdf
        """

        if os.path.isdir('Index/%s'%self.name):
            print("Index Found!")
            self.ix = FAISS.load_local('Index/%s'%self.name,OpenAIEmbeddings())
        else:
            print("Creating index!")
            self.ix = FAISS.from_texts(self.split_pdf(), OpenAIEmbeddings())
            # Save index to local (save cost)
            self.ix.save_local('Index/%s'%self.name)

    def query_and_distill(self,query):
        """
        Query embeddings and pass relevant chunks to LLM for answer
        """

        # Answer already in memory
        if query in self.answers:
            print("Answer found!")
            return self.answers[query]
        # Answer cached (asked in the past) in pickledb
        elif self.cached_answers.get(query+"-%s"%self.name):
            print("Answered in the past!")
            return self.cached_answers.get(query+"-%s"%self.name)
        # Generate the answer
        else:
            print("Generating answer!")
            query_results = self.ix.similarity_search(query, k=2)
            chain = load_qa_chain(OpenAI(temperature=0.25), chain_type="stuff")
            self.answers[query] = chain.run(input_documents=query_results, question=query)
            self.cached_answers.set(query+"-%s"%self.name,self.answers[query])
            return self.answers[query]

    def cache_answers(self):
        """
        Write answers to local pickledb
        """
        self.cached_answers.dump()

# Select paper via radio button
print(selectbox)
p=PaperDistiller(selectbox)
p.read_or_create_index()

# Pre-set queries for each paper
# TODO: improve this w/ prompt engineering
queries = ["What is the main innovation or new idea in the paper?",
           "How many tokens or examples are in the training set?",
           "Where is the training set scraped from or obtained and what modalities does it include?",
           "What are the tasks performed by the model?",
           "How is evaluation performed?",
           "What is the model architecture and what prior work used similar architecture?"]

# UI headers
headers = ["Innovation","Dataset size","Dataset source","Tasks","Evaluation","Architecture"]

# Outputs
st.header("`Paper Distiller`")
for q,header in zip(queries,headers):
    st.subheader("`%s`"%header)
    st.info("`%s`"%p.query_and_distill(q))
    # time.sleep(3) # may be needed for OpenAI API limit

# Cache the answers
p.cache_answers()

--------------------------------------------------------------------------------
/distller.db:
--------------------------------------------------------------------------------
{"What is the training data
used for each experimemnt with dataset sizes?": " For the English-to-German translation task, the big transformer model used the WMT 2014 English-to-German dataset. For the English-to-French translation task, the big model used the WMT 2014 English-to-French dataset. For English constituency parsing, the model used the Wall Street Journal (WSJ) portion of the Penn Treebank and the larger high-confidence and BerkleyParser corpora.", "What is the main innovation or new idea in the paper?": " The main innovation or new idea in the paper is the Transformer, a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.", "What is the model architecture?": " The Transformer follows an encoder-decoder structure with a stack of 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder then generates an output sequence of symbols one element at a time.", "How many tokens or examples are in the training set?": " Approximately 40,000 training sentences and 16,000 tokens for the WSJ only setting and approximately 17M sentences and 32,000 tokens for the semi-supervised setting.", "Where is the training set scraped from or obtained and what modalities does it include?": " The training set is from the Wall Street Journal (WSJ) training set of 40K sentences and includes text.", "What are the tasks performed by the model?": " The Transformer model has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. It has also been used for simple-language question answering and language modeling tasks.", "How is evaluation performed?": " Evaluation is performed by measuring the change in performance on English-to-German translation on the development set, newstest2013, using beam search and no checkpoint averaging. For English constituency parsing, evaluation is performed by measuring the Training WSJ 23 F1 score on Section 23 of WSJ.", "What is the model architecture and what prior work used simnilar architecture?": " The Transformer model architecture uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. 
Prior work such as the Extended Neural GPU, ByteNet, and ConvS2S have used similar architectures, but the Transformer is the first model to rely entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.", "What is the main innovation or new idea in the paper?-attention_is_all_you_need": " The main innovation or new idea in the paper is the Transformer, a new simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.", "How many tokens or examples are in the training set?-attention_is_all_you_need": " Approximately 40,000 tokens and 17 million sentences.", "Where is the training set scraped from or obtained and what modalities does it include?-attention_is_all_you_need": " The training set is obtained from the Penn Treebank and includes text modalities.", "What are the tasks performed by the model?-attention_is_all_you_need": " The Transformer model is used for tasks such as language modeling, machine translation, reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.", "How is evaluation performed?-attention_is_all_you_need": " Evaluation is performed by measuring the change in performance on English-to-German translation on the development set, newstest2013. Beam search is used, but no checkpoint averaging. Results are presented in Table 3. For English constituency parsing, evaluation is performed by selecting the dropout, both attention and residual, learning rates and beam size on the Section 22 development set, and measuring the results on Section 23 of WSJ.", "What is the model architecture and what prior work used simnilar architecture?-attention_is_all_you_need": " The model architecture is composed of a stack of N= 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network. This architecture is similar to the Extended Neural GPU, ByteNet, and ConvS2S, which all use convolutional neural networks as basic building blocks to compute hidden representations in parallel for all input and output positions.", "What is the main innovation or new idea in the paper?-bert": " The main innovation or new idea in the paper is the use of a bidirectional Transformer for pre-training, which is different from the left-to-right Transformer used in OpenAI GPT and the concatenation of independently trained left-to-right and right-to-left LSTMs used in ELMo.", "How many tokens or examples are in the training set?-bert": " I don't know.", "Where is the training set scraped from or obtained and what modalities does it include?-bert": " The training set for BERT is obtained from a variety of sources, including Wikipedia, BookCorpus, and OpenWebText. It includes text modalities.", "What are the tasks performed by the model?-bert": " The model is used to perform natural language understanding tasks on the GLUE benchmark, question answering on the SQuAD v1.1 dataset, and named entity recognition on the CoNLL-2003 dataset.", "How is evaluation performed?-bert": " To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights W2RKH, where K is the number of labels. 
We compute a standard classification loss with CandW, i.e., log(softmax(CWT)).\n\nThe evaluation for GLUE is performed by computing a standard classification loss with CandW, i.e., log(softmax(CWT)).", "What is the model architecture and what prior work used simnilar architecture?-bert": " BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks.", "What is the main innovation or new idea in the paper?-whisper": " The main innovation or new idea in the paper is the development of Whisper, a robust speech recognition system that is trained on a large-scale weakly supervised dataset and is able to transfer between both text and speech language tasks, demonstrating transfer between them.", "How many tokens or examples are in the training set?-whisper": " The training set contains 680k hours of data.", "Where is the training set scraped from or obtained and what modalities does it include?-whisper": " The training set is scraped from the internet and includes audio paired with transcripts, as well as audio language detection, X!en translation data, and voice activity detection.", "What are the tasks performed by the model?-whisper": " The model performs language tasks, such as transcription and translation, as well as voice activity detection, speaker diarization, and language identification.", "How is evaluation performed?-whisper": " Evaluation is typically performed using the word error rate (WER) metric, which is based on string edit distance and penalizes all differences between the model's output and the reference transcript including innocuous differences in transcript style. To address this problem, extensive standardization of text before the WER calculation is used to minimize penalization of non-semantic differences.", "What is the model architecture and what prior work used simnilar architecture?-whisper": " The model architecture is an audio conditional language model and similar architectures have been used in works such as Narayanan et al. (2018), Likhomanenko et al. (2020), and Chan et al. (2021).", "What is the main innovation or new idea in the paper?-gpt3": " The main innovation or new idea in the paper is the investigation of how good humans are at detecting longer news articles generated by GPT-3 175B.", "How many tokens or examples are in the training set?-gpt3": " I don't know.", "Where is the training set scraped from or obtained and what modalities does it include?-gpt3": " The training set for GPT-3 is a combination of public datasets, including text from books, websites, and other sources. It includes natural language text.", "What are the tasks performed by the model?-gpt3": " The model performs tasks such as code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions.", "How is evaluation performed?-gpt3": " Human evaluation experiments are conducted to evaluate the model.", "What is the model architecture and what prior work used simnilar architecture?-gpt3": " The model architecture used is the same as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that they use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. 
Other prior work that used similar architecture includes the mixture-of-experts method [SMM+17], which produced 100 billion parameter models and 50 billion parameter translation models [AJF19], and the conditional computation framework [BLC13]."}
--------------------------------------------------------------------------------
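
The README lists "Add ChatGPT" as a planned improvement. Below is a minimal sketch of one way to do that, assuming LangChain's `ChatOpenAI` chat-model wrapper is available and `OPENAI_API_KEY` is set; `distill_with_chatgpt` is a hypothetical helper, and the only change relative to `query_and_distill` above is the LLM passed to `load_qa_chain`:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

def distill_with_chatgpt(index, query):
    # Retrieve the two most relevant chunks, mirroring query_and_distill()
    query_results = index.similarity_search(query, k=2)
    # Swap OpenAI(temperature=0.25) for the gpt-3.5-turbo chat model (ChatGPT)
    chain = load_qa_chain(ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.25),
                          chain_type="stuff")
    return chain.run(input_documents=query_results, question=query)
```

Applied inside `PaperDistiller.query_and_distill`, the same swap would leave retrieval and the pickledb caching untouched.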