├── .gitignore
├── LICENSE
├── README.md
├── app
│   ├── app.py
│   ├── constants
│   │   └── prompts.toml
│   ├── date_iterator.sh
│   ├── gen
│   │   ├── gemini.py
│   │   └── utils.py
│   ├── outputs.json
│   ├── paper
│   │   ├── download.py
│   │   └── parser.py
│   ├── requirements.txt
│   └── utils.py
└── notebooks
    └── analyze_arxiv_pdf.ipynb

/.gitignore:
--------------------------------------------------------------------------------
__pycache__
.DS_Store
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Auto Paper Analysis

This project automatically generates questions and answers (QAs) for a given set of arXiv papers. For now, the CLI tool can grab arXiv IDs from [Hugging Face 🤗 Daily Papers](https://huggingface.co/papers); it is also possible to generate QAs directly from a list of arXiv IDs.

You can browse a generated QA dataset in the [chansung/auto-paper-qa2](https://huggingface.co/datasets/chansung/auto-paper-qa2) repository, and you can see how such a dataset is used in the [PaperQA](https://huggingface.co/spaces/chansung/paper_qa) Space application.
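
The exported data is a regular Hugging Face dataset, so it can be pulled down with the `datasets` library. A minimal sketch (assuming `datasets` is installed; substitute your own repository ID if you generate one):

```python
from datasets import load_dataset

# This project only pushes a "train" split, so load that split directly.
ds = load_dataset("chansung/auto-paper-qa2", split="train")
print(ds.column_names)
```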
## Instructions

If you want to do prompt engineering, modify the [prompts.toml](https://github.com/deep-diver/auto-paper-analysis/tree/main/app/constants/prompts.toml) file. There are two prompts to play with.

### Hugging Face 🤗 Daily Papers

To generate QAs for the arXiv papers featured on a specific date, run:

```shell
export GEMINI_API_KEY=
export HF_ACCESS_TOKEN=

python app.py --target-date $current_date \
              --gemini-api $GEMINI_API_KEY \
              --hf-token $HF_ACCESS_TOKEN \
              --hf-repo-id $hf_repo_id \
              --hf-daily-papers
```

If you want to generate QAs for the papers over a range of dates, run:

```shell
export GEMINI_API_KEY=
export HF_ACCESS_TOKEN=
export HF_DATASET_REPO_ID=

./date_iterator.sh "2024-03-01" "2024-03-03" $HF_DATASET_REPO_ID
```

### arXiv IDs

To generate QAs for a given list of arXiv IDs, run:

```shell
export GEMINI_API_KEY=
export HF_ACCESS_TOKEN=

python app.py \
       --gemini-api $GEMINI_API_KEY \
       --hf-token $HF_ACCESS_TOKEN \
       --hf-repo-id $hf_repo_id \
       --arxiv-ids ...
```
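
### Output schema

Each exported record is one flat row per paper. The exact keys depend on the model output, but a record inferred from the generation code roughly looks like the sketch below; the `0_` prefix repeats as `1_` through `9_` for each of the ten seed questions, and all values shown are placeholders:

```python
# A sketch of one flattened record (hypothetical values).
record = {
    "title": "...",
    "summary": "...",
    "abstract": "...",
    "authors": "Author One,Author Two",  # comma-joined author names
    "arxiv_id": "2403.00000",            # hypothetical ID
    "target_date": "2024-03-01",
    "full_text": "...",
    "0_question": "...",
    "0_answers:eli5": "...",
    "0_answers:expert": "...",
    "0_additional_depth_q:follow up question": "...",
    "0_additional_depth_q:answers:eli5": "...",
    "0_additional_depth_q:answers:expert": "...",
    "0_additional_breath_q:follow up question": "...",
    "0_additional_breath_q:answers:eli5": "...",
    "0_additional_breath_q:answers:expert": "...",
}
```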

## Acknowledgements

This project was built during the Gemini sprint held by Google's ML Developer Programs team. I am thankful for the generous amount of GCP credits granted to finish up this project.
--------------------------------------------------------------------------------
/app/app.py:
--------------------------------------------------------------------------------
import argparse
from paper.download import (
    download_pdf_from_arxiv,
    get_papers_from_hf_daily_papers,
    get_papers_from_arxiv_ids
)
from paper.parser import extract_text_and_figures
from gen.gemini import get_basic_qa, get_deep_qa
from utils import push_to_hf_hub

def _process_hf_daily_papers(args):
    print("1. Get a list of papers from 🤗 Daily Papers")
    target_date, papers = get_papers_from_hf_daily_papers(args.target_date)
    print("...DONE")

    print("2. Generate QAs for each paper")
    for paper in papers:
        try:
            title = paper['title']
            abstract = paper['paper']['summary']
            arxiv_id = paper['paper']['id']
            authors = []

            print(f"...PROCESSING [{arxiv_id}, {title}]")
            print("......Extracting authors' names")
            for author in paper['paper']['authors']:
                if 'user' in author:
                    fullname = author['user']['fullname']
                else:
                    fullname = author['name']
                authors.append(fullname)
            print("......DONE")

            print("......Downloading the paper PDF")
            filename = download_pdf_from_arxiv(arxiv_id)
            print("......DONE")

            print("......Extracting text and figures")
            texts, figures = extract_text_and_figures(filename)
            text = ' '.join(texts)
            print("......DONE")

            print("......Generating the seed (basic) QAs")
            qnas = get_basic_qa(text, gemini_api_key=args.gemini_api, truncate=30000)
            qnas['title'] = title
            qnas['abstract'] = abstract
            qnas['authors'] = ','.join(authors)
            qnas['arxiv_id'] = arxiv_id
            qnas['target_date'] = target_date
            qnas['full_text'] = text
            print("......DONE")

            print("......Generating the follow-up QAs")
            qnas = get_deep_qa(text, qnas, gemini_api_key=args.gemini_api, truncate=30000)
            del qnas["qna"]
            print("......DONE")

            print(f"......Exporting to HF Dataset repo at [{args.hf_repo_id}]")
            push_to_hf_hub(qnas, args.hf_repo_id, args.hf_token)
            print("......DONE")
        except Exception as e:
            print(f".......failed due to exception: {e}")
            continue

    print("...DONE")

def _process_arxiv_ids(args):
    arxiv_ids = args.arxiv_ids

    print(f"1. Get metadata for the papers [{arxiv_ids}]")
    papers = get_papers_from_arxiv_ids(arxiv_ids)
    print("...DONE")

    print("2. Generate QAs for each paper")
    for paper in papers:
        try:
            title = paper['title']
            target_date = paper['target_date']
            abstract = paper['paper']['summary']
            arxiv_id = paper['paper']['id']
            authors = paper['paper']['authors']

            print(f"...PROCESSING [{arxiv_id}, {title}]")
            print("......Downloading the paper PDF")
            filename = download_pdf_from_arxiv(arxiv_id)
            print("......DONE")

            print("......Extracting text and figures")
            texts, figures = extract_text_and_figures(filename)
            text = ' '.join(texts)
            print("......DONE")

            print("......Generating the seed (basic) QAs")
            qnas = get_basic_qa(text, gemini_api_key=args.gemini_api, truncate=30000)
            qnas['title'] = title
            qnas['abstract'] = abstract
            qnas['authors'] = ','.join(authors)
            qnas['arxiv_id'] = arxiv_id
            qnas['target_date'] = target_date
            qnas['full_text'] = text
            print("......DONE")

            print("......Generating the follow-up QAs")
            qnas = get_deep_qa(text, qnas, gemini_api_key=args.gemini_api, truncate=30000)
            del qnas["qna"]
            print("......DONE")

            print(f"......Exporting to HF Dataset repo at [{args.hf_repo_id}]")
            push_to_hf_hub(qnas, args.hf_repo_id, args.hf_token)
            print("......DONE")
        except Exception as e:
            print(f".......failed due to exception: {e}")
            continue

def main(args):
    if args.hf_daily_papers:
        _process_hf_daily_papers(args)
    else:
        _process_arxiv_ids(args)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="auto paper analysis")
    parser.add_argument("--gemini-api", type=str)
    parser.add_argument("--hf-token", type=str)
    parser.add_argument("--hf-repo-id", type=str)

    parser.add_argument('--hf-daily-papers', action='store_true')
    parser.add_argument("--target-date", type=str, default=None)

    parser.add_argument('--arxiv-ids', nargs='+')
    args = parser.parse_args()

    main(args)
--------------------------------------------------------------------------------
/app/constants/prompts.toml:
--------------------------------------------------------------------------------
[basic_qa]
prompt = """
Come up with 10 questions and answers that people would commonly ask about the following paper.
There should be two types of answers included, one for an expert and the other for ELI5.
Your response should be recorded in a JSON format as ```json{"title": text, "summary": text, "qna": [{"question": text, "answers": {"eli5": text, "expert": text}}, ...]}```
"""

[deep_qa]
prompt = """
Paper title: $title
Previous question: $previous_question
The answer to the previous question: $previous_answer

Based on the previous question and answer above, and based on the paper content below, suggest a follow-up question and answers in a $tone manner.
There should be two types of answers included, one for an expert and the other for ELI5.
Your response should be recorded in a JSON format as ```json{"follow up question": text, "answers": {"eli5": text, "expert": text}}```
"""
--------------------------------------------------------------------------------
/app/date_iterator.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Set start and end dates (format YYYY-MM-DD)
start_date=$1
end_date=$2
hf_repo_id=$3

# Convert dates into seconds since epoch (for easier calculations)
start_seconds=$(date -j -f "%Y-%m-%d" "$start_date" "+%s")
end_seconds=$(date -j -f "%Y-%m-%d" "$end_date" "+%s")

# Iterate through the dates, one day at a time
current_seconds=$start_seconds
while [[ $current_seconds -le $end_seconds ]]; do
    current_date=$(date -j -r $current_seconds "+%Y-%m-%d")

    echo "Running program for date: $current_date"
    python app.py --target-date $current_date \
                  --gemini-api $GEMINI_API_KEY \
                  --hf-token $HF_ACCESS_TOKEN \
                  --hf-repo-id $hf_repo_id \
                  --hf-daily-papers

    current_seconds=$((current_seconds + 86400))  # Add 1 day (86400 seconds)
done
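
# Note: the `date -j`/`date -r` invocations above use BSD/macOS syntax.
# On GNU/Linux, an equivalent sketch would be:
#   start_seconds=$(date -d "$start_date" +%s)
#   end_seconds=$(date -d "$end_date" +%s)
#   current_date=$(date -d "@$current_seconds" +%Y-%m-%d)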
--------------------------------------------------------------------------------
/app/gen/gemini.py:
--------------------------------------------------------------------------------
import ast
import copy
import toml
from string import Template
from pathlib import Path
from flatdict import FlatDict
import google.generativeai as genai

from gen.utils import parse_first_json_snippet

def determine_model_name(given_image=None):
    if given_image is None:
        return "gemini-pro"
    else:
        return "gemini-pro-vision"

def construct_image_part(given_image):
    return {
        "mime_type": "image/jpeg",
        "data": given_image
    }

def call_gemini(prompt="", API_KEY=None, given_text=None, given_image=None, generation_config=None, safety_settings=None):
    genai.configure(api_key=API_KEY)

    if generation_config is None:
        generation_config = {
            "temperature": 0.8,
            "top_p": 1,
            "top_k": 32,
            "max_output_tokens": 8192,
        }

    if safety_settings is None:
        safety_settings = [
            {
                "category": "HARM_CATEGORY_HARASSMENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_HATE_SPEECH",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
        ]

    model_name = determine_model_name(given_image)
    model = genai.GenerativeModel(model_name=model_name,
                                  generation_config=generation_config,
                                  safety_settings=safety_settings)

    # Append the paper text below the instruction prompt, separated by a rule.
    USER_PROMPT = prompt
    if given_text is not None:
        USER_PROMPT += f"""
------------------------------------------------
{given_text}
"""
    prompt_parts = [USER_PROMPT]
    if given_image is not None:
        prompt_parts.append(construct_image_part(given_image))

    response = model.generate_content(prompt_parts)
    return response.text

def try_out(prompt, given_text, gemini_api_key, given_image=None, retry_num=3):
    qna_json = None
    cur_retry = 0

    while qna_json is None and cur_retry < retry_num:
        try:
            qna = call_gemini(
                prompt=prompt,
                given_text=given_text,
                given_image=given_image,
                API_KEY=gemini_api_key
            )

            qna_json = parse_first_json_snippet(qna)
        except Exception as e:
            cur_retry = cur_retry + 1
            print(f"......retry: {e}")

    return qna_json
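
# Usage sketch: `try_out` returns the parsed QA dict on success, or None
# once `retry_num` attempts have failed to yield parseable JSON.
#   qa = try_out(prompt, paper_text[:7000], gemini_api_key="<YOUR_KEY>")
#   if qa is not None:
#       print(qa["qna"][0]["question"])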

def get_basic_qa(text, gemini_api_key, truncate=7000):
    prompts = toml.load(Path('.') / 'constants' / 'prompts.toml')
    basic_qa = try_out(prompts['basic_qa']['prompt'], text[:truncate], gemini_api_key=gemini_api_key)
    return basic_qa


def get_deep_qa(text, basic_qa, gemini_api_key, truncate=7000):
    prompts = toml.load(Path('.') / 'constants' / 'prompts.toml')

    title = basic_qa['title']
    qnas = copy.deepcopy(basic_qa['qna'])

    for idx, qna in enumerate(qnas):
        q = qna['question']
        a_expert = qna['answers']['expert']

        depth_search_prompt = Template(prompts['deep_qa']['prompt']).substitute(
            title=title, previous_question=q, previous_answer=a_expert, tone="in-depth"
        )
        breadth_search_prompt = Template(prompts['deep_qa']['prompt']).substitute(
            title=title, previous_question=q, previous_answer=a_expert, tone="broad"
        )

        depth_search_response = None
        breadth_search_response = None

        # Retry until the response carries the full expected JSON structure.
        # The None check comes first because try_out returns None when all of
        # its retries fail, and a membership test on None would raise.
        while depth_search_response is None or \
            'follow up question' not in depth_search_response or \
            'answers' not in depth_search_response or \
            'eli5' not in depth_search_response['answers'] or \
            'expert' not in depth_search_response['answers']:
            depth_search_response = try_out(depth_search_prompt, text[:truncate], gemini_api_key=gemini_api_key)

        while breadth_search_response is None or \
            'follow up question' not in breadth_search_response or \
            'answers' not in breadth_search_response or \
            'eli5' not in breadth_search_response['answers'] or \
            'expert' not in breadth_search_response['answers']:
            breadth_search_response = try_out(breadth_search_prompt, text[:truncate], gemini_api_key=gemini_api_key)

        # The misspelled key "additional_breath_q" is kept for compatibility
        # with datasets already published under that column name.
        qna['additional_depth_q'] = depth_search_response
        qna['additional_breath_q'] = breadth_search_response

        # Flatten the nested QA dict and prefix every key with the question
        # index, so each record stays one flat row (e.g. "0_answers:eli5").
        qna = FlatDict(qna)
        qna_tmp = copy.deepcopy(qna)
        for k in qna_tmp:
            value = qna.pop(k)
            qna[f'{idx}_{k}'] = value
        # Round-trip through str/literal_eval to turn the FlatDict into a plain dict.
        basic_qa.update(ast.literal_eval(str(qna)))

    return basic_qa
--------------------------------------------------------------------------------
/app/gen/utils.py:
--------------------------------------------------------------------------------
import json

def find_json_snippet(raw_snippet):
    json_parsed_string = None

    json_start_index = raw_snippet.find('{')
    json_end_index = raw_snippet.rfind('}')

    if json_start_index >= 0 and json_end_index >= 0:
        json_snippet = raw_snippet[json_start_index:json_end_index+1]
        try:
            json_parsed_string = json.loads(json_snippet, strict=False)
        except json.JSONDecodeError:
            raise ValueError('......failed to parse string into JSON format')
    else:
        raise ValueError('......No JSON code snippet found in string.')

    return json_parsed_string

def parse_first_json_snippet(snippet):
    json_parsed_string = None

    if isinstance(snippet, list):
        for snippet_piece in snippet:
            try:
                json_parsed_string = find_json_snippet(snippet_piece)
                return json_parsed_string
            except ValueError:
                pass
    else:
        try:
            json_parsed_string = find_json_snippet(snippet)
        except ValueError as e:
            print(e)
            raise

    return json_parsed_string
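
# Example (sketch) of the intended behavior: everything before the first
# '{' and after the last '}' is ignored, so fenced model output parses fine.
#   >>> parse_first_json_snippet('```json\n{"title": "T", "qna": []}\n```')
#   {'title': 'T', 'qna': []}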
--------------------------------------------------------------------------------
/app/paper/download.py:
--------------------------------------------------------------------------------
import re
import json
import requests
from datetime import date
from datetime import datetime
import xml.etree.ElementTree as ET
from requests.exceptions import HTTPError

def _get_today():
    return str(date.today())

def _download_pdf_from_arxiv(filename):
    url = f'https://arxiv.org/pdf/{filename}'
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f"Failed to download pdf for arXiv id {filename}")

def download_pdf_from_arxiv(arxiv_id):
    filename = f"{arxiv_id}.pdf"
    pdf_content = _download_pdf_from_arxiv(filename)

    # Save the pdf content to a file
    with open(filename, "wb") as f:
        f.write(pdf_content)

    return filename

def _get_papers_from_hf_daily_papers(target_date):
    if target_date is None:
        target_date = _get_today()
        print(f"target_date is not set => scrape today's papers [{target_date}]")
    url = f"https://huggingface.co/api/daily_papers?date={target_date}"

    response = requests.get(url)

    if response.status_code == 200:
        return target_date, response.text
    else:
        raise HTTPError(f"Error fetching data. Status code: {response.status_code}")

def get_papers_from_hf_daily_papers(target_date):
    target_date, results = _get_papers_from_hf_daily_papers(target_date)
    results = json.loads(results)
    for result in results:
        result["target_date"] = target_date
    return target_date, results


def _get_paper_xml_by_arxiv_id(arxiv_id):
    url = f"http://export.arxiv.org/api/query?search_query=id:{arxiv_id}&start=0&max_results=1"
    return requests.get(url)

def _is_arxiv_id_valid(arxiv_id):
    # New-style arXiv IDs are YYMM followed by a 4- or 5-digit number
    # (4 digits before 2015, 5 digits afterwards).
    pattern = r"^\d{4}\.\d{4,5}$"
    return bool(re.match(pattern, arxiv_id))

def _get_paper_metadata_by_arxiv_id(response):
    root = ET.fromstring(response.content)

    # Extract title, authors, abstract, and the published date from the Atom feed
    title = root.find('{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}title').text
    authors = [author.find('{http://www.w3.org/2005/Atom}name').text for author in root.findall('{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}author')]
    abstract = root.find('{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}summary').text
    target_date = root.find('{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}published').text

    return title, authors, abstract, target_date

def get_papers_from_arxiv_ids(arxiv_ids):
    results = []

    for arxiv_id in arxiv_ids:
        print(arxiv_id)
        if _is_arxiv_id_valid(arxiv_id):
            try:
                xml_data = _get_paper_xml_by_arxiv_id(arxiv_id)
                title, authors, abstract, target_date = _get_paper_metadata_by_arxiv_id(xml_data)

                datetime_obj = datetime.strptime(target_date, "%Y-%m-%dT%H:%M:%SZ")
                formatted_date = datetime_obj.strftime("%Y-%m-%d")

                results.append(
                    {
                        "title": title,
                        "target_date": formatted_date,
                        "paper": {
                            "summary": abstract,
                            "id": arxiv_id,
                            "authors": authors,
                        }
                    }
                )
            except Exception as e:
                print(f"......something went wrong while downloading the metadata: {e}")
                print("......this usually happens for papers that were published today")
                continue
        else:
            print(f"......not a valid arXiv ID [{arxiv_id}]")

    return results
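
# Usage sketch (hypothetical IDs):
#   papers = get_papers_from_arxiv_ids(["2403.00001", "2402.00002"])
#   print(papers[0]["title"], papers[0]["target_date"])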
metadata") 97 | print("......this usually happens when you try out the today's published paper") 98 | continue 99 | else: 100 | print(f"......not a valid arXiv ID[{arxiv_id}]") 101 | 102 | return results -------------------------------------------------------------------------------- /app/paper/parser.py: -------------------------------------------------------------------------------- 1 | import os 2 | import fitz 3 | import PyPDF2 4 | 5 | def extract_text_and_figures(pdf_path): 6 | """ 7 | Extracts text and figures from a PDF file. 8 | 9 | Args: 10 | pdf_path (str): The path to the PDF file. 11 | 12 | Returns: 13 | tuple: A tuple containing two lists: 14 | * A list of extracted text blocks. 15 | * A list of extracted figures (as bytes). 16 | """ 17 | 18 | texts = [] 19 | figures = [] 20 | 21 | # Open the PDF using PyMuPDF (fitz) for image extraction 22 | doc = fitz.open(pdf_path) 23 | for page_num, page in enumerate(doc): 24 | text = page.get_text("text") # Extract text as plain text 25 | texts.append(text) 26 | 27 | # Process images on the page 28 | image_list = page.get_images() 29 | for image_index, img in enumerate(image_list): 30 | xref = img[0] # Image XREF 31 | pix = fitz.Pixmap(doc, xref) # Create Pixmap image 32 | 33 | # Save image in desired format (here, PNG) 34 | if not pix.colorspace.name in (fitz.csGRAY.name, fitz.csRGB.name): 35 | pix = fitz.Pixmap(fitz.csRGB, pix) 36 | img_bytes = pix.tobytes("png") 37 | 38 | figures.append(img_bytes) 39 | 40 | # Extract additional text using PyPDF2 (in case fitz didn't get everything) 41 | with open(pdf_path, 'rb') as pdf_file: 42 | pdf_reader = PyPDF2.PdfReader(pdf_file) 43 | for page_num in range(len(pdf_reader.pages)): 44 | page = pdf_reader.pages[page_num] 45 | text = page.extract_text() 46 | texts.append(text) 47 | 48 | try: 49 | os.remove(pdf_path) 50 | except FileNotFoundError: 51 | print(f"File '{pdf_path}' not found.") 52 | except PermissionError: 53 | print(f"Unable to remove '{pdf_path}'. Check permissions.") 54 | 55 | return texts, figures 56 | -------------------------------------------------------------------------------- /app/requirements.txt: -------------------------------------------------------------------------------- 1 | google-generativeai 2 | pypdf2 3 | PyMuPDF 4 | gradio 5 | requests 6 | toml 7 | datasets 8 | flatdict -------------------------------------------------------------------------------- /app/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import datasets 3 | from datasets import Dataset 4 | from huggingface_hub import create_repo 5 | from huggingface_hub.utils import HfHubHTTPError 6 | 7 | def push_to_hf_hub( 8 | qnas, repo_id, token, append=True 9 | ): 10 | exist = False 11 | df = pd.DataFrame([qnas]) 12 | ds = Dataset.from_pandas(df) 13 | ds = ds.cast_column("target_date", datasets.features.Value("timestamp[s]")) 14 | 15 | try: 16 | create_repo(repo_id, repo_type="dataset", token=token) 17 | except HfHubHTTPError as e: 18 | exist = True 19 | 20 | if exist and append: 21 | existing_ds = datasets.load_dataset(repo_id) 22 | ds = datasets.concatenate_datasets([existing_ds['train'], ds]) 23 | 24 | ds.push_to_hub(repo_id, token=token) --------------------------------------------------------------------------------