├── .gitignore ├── .gitmodules ├── LICENSE ├── README.md ├── __init__.py ├── alexandria ├── README.md ├── base.py ├── grobid.json ├── runner.py └── visualization.py ├── docs ├── larger_example.md └── simple_example.md ├── example-images ├── fuzz-barplot.png ├── fuzz-wordcloud.png └── paper-citation-graph.png ├── example-papers ├── 1908.09204.pdf └── 1908.10167.pdf ├── gen_wordcloud.py ├── install.sh ├── pq_format_reader.py ├── pq_main.py ├── pq_pdf_utility.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | pq-out/* 2 | paper-pdfs 3 | venv 4 | tmp 5 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/.gitmodules -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 AdaLogics Ltd 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Paper analyser 2 | The goal of this project is to enable quantitative analysis of 3 | academic papers. 4 | 5 | To achieve the goal, the project contains logic for 6 | 7 | 1. Parsing academic white papers into structured representation 8 | 1. Doing analysis on the structured representations 9 | 10 | #### Paper dependency graph 11 | The project as it currently stands focuses on the task of taking 12 | a list of arbitrary papers in the form of PDFs, and then creating 13 | a dependency graph of citations amongst these papers. This graph 14 | then shows how each of the PDFs reference each other. Paper analyser 15 | achieves this by going through the steps: 16 | 17 | 1. Parse the papers to extract relevant data 18 | 1. Read the PDF files to a format usable in Python 19 | 1. Extract title of a given paper 20 | 1. Extract the raw data of the "References" section 21 | 1. Parse the raw "References" section into individual refereces: 22 | 1. Extract the title and authors of the citation 23 | 1. Normalise the data of the extracted citations 24 | 1. 
Do dependency analysis based on the above citation extractions 25 | 26 | 27 | ## Usage 28 | For a simple example of usage, look at [simple_example.md](/docs/simple_example.md) 29 | 30 | Paper analyser takes as input PDF files of academic papers and outputs data about these papers. 31 | For convenience we maintain a list of links to software analysis papers 32 | focused on software security in our sister repository [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list) 33 | 34 | To see an example of doing analysis on many papers, look at the explanation in [larger_example.md](/docs/larger_example.md) 35 | 36 | ## Example visualisation 37 | 38 | We have also created visualisations for the output of the paper 39 | analyser, which make it easy to rapidly understand the 40 | relationships between the academic papers in the data set. 41 | 42 | See here for an example of the visualisations: 43 | [https://adalogics.com/software-security-research-citations-visualiser](https://adalogics.com/software-security-research-citations-visualiser) 44 | 45 | These visualisations will be open sourced in the near future. 46 | 47 | 48 | ### Citation graph example 49 | 50 | ![Paper citation graph](/example-images/paper-citation-graph.png) 51 | 52 | 53 | ### Wordcloud of 85 fuzzing papers 54 | Example of a wordcloud generated from the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This wordcloud discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/) 55 | ![Wordcloud of fuzzing papers](/example-images/fuzz-wordcloud.png) 56 | 57 | ### Word count of 85 fuzzing papers 58 | A bar plot of word frequencies in the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This plot discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/) 59 | ![Word-frequency bar plot of fuzzing papers](/example-images/fuzz-barplot.png) 60 | 61 | ## Installation 62 | ``` 63 | git clone https://github.com/AdaLogics/paper-analyser 64 | cd paper-analyser 65 | ./install.sh 66 | ``` 67 | 68 | ## Contribute 69 | We welcome contributions. 70 | 71 | Paper analyser is maintained by: 72 | * [David Korczynski](https://twitter.com/Davkorcz) 73 | * [Adam Korczynski](https://twitter.com/AdamKorcz4) 74 | * [Giovanni Cherubin](https://twitter.com/gchers) 75 | 76 | We are particularly interested in features for: 77 | 1. Improved parsing of the PDF files to get better structured output 78 | 1. Adding more data analysis to the project 79 | 80 | 81 | ### Feature suggestions 82 | If you would like to contribute but don't have a feature in mind, please see the list below for suggestions: 83 | 84 | * Extraction of authors from papers 85 | * Extraction of the actual text from the papers (see the sketch below).
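As a rough illustration of the text-extraction suggestion above, here is a minimal, hypothetical sketch; it is not part of paper-analyser and assumes the `pdfminer.six` package is installed:

```python
# Hypothetical sketch only -- not the project's own PDF-handling code.
# Assumes: pip install pdfminer.six
import sys

from pdfminer.high_level import extract_text


def pdf_to_text(pdf_path):
    """Return the raw text content of a PDF as a single string."""
    return extract_text(pdf_path)


if __name__ == "__main__":
    # Example: python pdf_to_text.py example-papers/1908.09204.pdf
    print(pdf_to_text(sys.argv[1]))
```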
This could be used for a lot of cool data analysis 86 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/__init__.py -------------------------------------------------------------------------------- /alexandria/README.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | Before using Alexandria, you need to have two services running: 4 | `grobid` (used for processing pdf files) and `mongodb`. 5 | 6 | 7 | ### Grobid 8 | 9 | You first need to have an instance of [`grobid`](https://grobid.readthedocs.io/en/latest/) running. 10 | 11 | Using Docker: 12 | ``` 13 | docker run -t --rm --init -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.6.2 14 | ``` 15 | The above command will download the image and run it. 16 | (Add `-d` to run in daemon mode.) 17 | 18 | Verify that `grobid` is running at http://localhost:8070. 19 | 20 | 21 | If you configure `grobid` to listen on a different host/port, please adapt 22 | the `grobid.json` configuration. 23 | 24 | ### Mongodb 25 | 26 | Alexandria requires you to have a mongodb instance running 27 | locally at the standard port (`27017`). 28 | 29 | Using docker: 30 | 31 | ``` 32 | docker run -p 27017:27017 mongo 33 | ``` 34 | (Add `-d` to run in daemon mode.) 35 | 36 | ### Alexandria 37 | 38 | You're now ready to install Alexandria. 39 | 40 | ``` 41 | git clone https://github.com/AdaLogics/paper-analyser 42 | cd paper-analyser/alexandria 43 | python3 -m venv venv 44 | . venv/bin/activate 45 | pip install -r ../requirements.txt 46 | ``` 47 | 48 | 49 | # Usage 50 | 51 | To process the pdf files in `example-papers` launch: 52 | 53 | ``` 54 | python runner.py ../example-papers 55 | ``` 56 | 57 |
58 | Click to see the output. 59 | 60 | ``` 61 | Please check out the script for more info. 62 | GROBID server is up and running 63 | Connecting to db. 64 | Processing files in ../example-papers/ 65 | Parsing 1908.09204.pdf 66 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction 67 | Authors: David Korczynski 68 | References: 69 | Title: System-Level Support for Intrusion Recovery 70 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos 71 | ------------------- 72 | Title: WYSINWYX: What You See Is Not What You eXecute 73 | Authors: G Balakrishnan, T Reps, D Melski, T Teitelbaum 74 | ------------------- 75 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code 76 | Authors: Bao Ti Any, Jonathan Burket, Maverick Woo, Rafael Turner, David Brumley 77 | ------------------- 78 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps 79 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla 80 | ------------------- 81 | Title: Bee Master: Detecting Host-Based Code Injection A acks 82 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla 83 | ------------------- 84 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions 85 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel 86 | ------------------- 87 | Title: Minemu: e World's Fastest Taint Tracker 88 | Authors: Erik Bosman, Asia Slowinska, Herbert Bos 89 | ------------------- 90 | Title: Decoupling Dynamic Program Analysis from Execution in Virtual Environments 91 | Authors: Jim Chow, Peter Chen 92 | ------------------- 93 | Title: Decompilation of Binary Programs. So w 94 | Authors: Cristina Cifuentes, K. 
John Gough 95 | ------------------- 96 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 97 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee 98 | ------------------- 99 | Title: Repeatable Reverse Engineering with PANDA 100 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan 101 | ------------------- 102 | Title: Graph-based comparison of executable objects (english version) 103 | Authors: Rolf Omas Dullien, Rolles 104 | ------------------- 105 | Title: Signatures for Library Functions in Executable Files 106 | Authors: Mike Van Emmerik 107 | ------------------- 108 | Title: Structural Comparison of Executable Objects 109 | Authors: ; Halvar Flake, G Sig Sidar, Workshop 110 | ------------------- 111 | Title: A Study of the Packer Problem and Its Solutions 112 | Authors: Fanglu Guo, Peter Ferrie, Tzi-Cker Chiueh 113 | ------------------- 114 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 115 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant 116 | ------------------- 117 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features 118 | Authors: Xin Hu, Sandeep Bhatkar, Kent Gri N, Kang Shin 119 | ------------------- 120 | Title: 121 | Authors: Hungenberg, Eckert Ma Hias 122 | ------------------- 123 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 124 | Authors: K Kyriakos, Mathias Ispoglou, Payer 125 | ------------------- 126 | Title: Labeling Library Functions in Stripped Binaries 127 | Authors: Emily Jacobson, Nathan Rosenblum, Barton Miller 128 | ------------------- 129 | Title: Secure and advanced unpacking using computer emulation 130 | Authors: Sébastien Josse 131 | ------------------- 132 | Title: Malware Dynamic Recompilation 133 | Authors: S Josse 134 | ------------------- 135 | Title: Renovo: A Hidden Code Extractor for Packed Executables 136 | Authors: Min Kang, Pongsin Poosankam, Heng Yin 137 | ------------------- 138 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation 139 | Authors: Yuhei Kawakoya, Makoto Iwamura, Jun Miyoshi 140 | ------------------- 141 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 142 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura 143 | ------------------- 144 | Title: Jakstab: A Static Analysis Platform for Binaries 145 | Authors: Johannes Kinder, Helmut Veith 146 | ------------------- 147 | Title: Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code 148 | Authors: Clemens Kolbitsch, Engin Kirda, Christopher Kruegel 149 | ------------------- 150 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction 151 | Authors: David Korczynski 152 | ------------------- 153 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks 154 | Authors: David Korczynski, Heng Yin 155 | ------------------- 156 | Title: Polymorphic Worm Detection Using Structural Information of Executables 157 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, Giovanni Vigna 158 | ------------------- 159 | Title: Static Disassembly of Obfuscated Binaries 160 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, Giovanni Vigna 161 | ------------------- 162 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects 163 | Authors: Yujia Li, Chenjie Gu, Omas Dullien 164 | ------------------- 
165 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware 166 | Authors: L Martignoni, M Christodorescu, S Jha 167 | ------------------- 168 | Title: Measuring and Defeating Anti-Instrumentation-Equipped Malware 169 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Lorenzo Stefano D'alessio, Fabio Fontana, Stefano Gri, Zanero 170 | ------------------- 171 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks 172 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos 173 | ------------------- 174 | Title: 175 | Authors: Symantec Security Response 176 | ------------------- 177 | Title: Learning to Analyze Binary Computer Code 178 | Authors: Nathan Rosenblum, Xiaojin Zhu, Barton Miller, Karen Hunt 179 | ------------------- 180 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware 181 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, Wenke Lee 182 | ------------------- 183 | Title: Eureka: A Framework for Enabling Static Malware Analysis 184 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, Wenke Lee 185 | ------------------- 186 | Title: Binary Translation 187 | Authors: Richard Sites, Anton Cherno, B Ma Hew, Maurice Kirk, Sco Marks, Robinson 188 | ------------------- 189 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis 190 | Authors: Wei Song, Chang Heng Yin, Dawn Liu, Song 191 | ------------------- 192 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers 193 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas 194 | ------------------- 195 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis 196 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda 197 | ------------------- 198 | 199 | Parsing 1908.10167.pdf 200 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction 201 | Authors: David Korczynski, Createprocessa, Createfilea, , 202 | References: 203 | Title: System-Level Support for Intrusion Recovery 204 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos 205 | ------------------- 206 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps 207 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla 208 | ------------------- 209 | Title: Bee Master: Detecting Host-Based Code Injection A acks 210 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla 211 | ------------------- 212 | Title: Host-based code injection a acks: A popular technique used by malware 213 | Authors: Elmar Omas Barabosch, Gerhards-Padilla 214 | ------------------- 215 | Title: A View on Current Malware Behaviors 216 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzaro I, Engin Kirda 217 | ------------------- 218 | Title: Dynamic Analysis of Malicious Code 219 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, Engin Kirda 220 | ------------------- 221 | Title: Dridex's Cold War 222 | Authors: Magal Baz, Or Safran 223 | ------------------- 224 | Title: QEMU, a Fast and Portable Dynamic Translator 225 | Authors: Fabrice Bellard 226 | ------------------- 227 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions 228 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel 229 | ------------------- 230 | Title: 
Understanding Linux Malware 231 | Authors: E Cozzi, M Graziano, Y Fratantonio, D Balzaro I 232 | ------------------- 233 | Title: Understanding Linux malware 234 | Authors: Emanuele Cozzi, Mariano Graziano, Yanick Fratantonio, Davide Balzaro I 235 | ------------------- 236 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 237 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee 238 | ------------------- 239 | Title: Repeatable Reverse Engineering with PANDA 240 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan 241 | ------------------- 242 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools 243 | Authors: Manuel Egele, Engin Eodoor Scholte, Christopher Kirda, Kruegel 244 | ------------------- 245 | Title: A Survey of Mobile Malware in the Wild 246 | Authors: Adrienne Felt, Ma Hew Fini Er, Erika Chin, Steve Hanna, David Wagner 247 | ------------------- 248 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 249 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant 250 | ------------------- 251 | Title: Ten Process Injection Techniques: A Technical Survey Of Common And Trending Process Injection Techniques 252 | Authors: Ashkan Hosseini 253 | ------------------- 254 | Title: 255 | Authors: Hungenberg, Eckert Ma Hias 256 | ------------------- 257 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 258 | Authors: K Kyriakos, Mathias Ispoglou, Payer 259 | ------------------- 260 | Title: Renovo: A Hidden Code Extractor for Packed Executables 261 | Authors: Min Kang, Pongsin Poosankam, Heng Yin 262 | ------------------- 263 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 264 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura 265 | ------------------- 266 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction 267 | Authors: David Korczynski 268 | ------------------- 269 | Title: Precise system-wide concatic malware unpacking. 
arXiv e-prints 270 | Authors: David Korczynski 271 | ------------------- 272 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks 273 | Authors: David Korczynski, Heng Yin 274 | ------------------- 275 | Title: 276 | Authors: Pasquale Giulio De 277 | ------------------- 278 | Title: 279 | Authors: Daniel Plohmann, Martin Clauß, Elmar Padilla 280 | ------------------- 281 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks 282 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos 283 | ------------------- 284 | Title: AVclass: A Tool for Massive Malware Labeling 285 | Authors: Marcos Sebastián, Richard Rivera, Platon Kotzias, Juan Caballero 286 | ------------------- 287 | Title: Malrec: Compact Full-Trace Malware Recording for Retrospective Deep Analysis 288 | Authors: Giorgio Severi, Tim Leek, Brendan Dolan-Gavi 289 | ------------------- 290 | Title: Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro 291 | Authors: Kimberly Tam, Ali Feizollah 292 | ------------------- 293 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers 294 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas 295 | ------------------- 296 | Title: Deep Ground Truth Analysis of Current Android Malware 297 | Authors: Fengguo Wei, Yuping Li, Sankardas Roy, Xinming Ou, Wu Zhou 298 | ------------------- 299 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis 300 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda 301 | ------------------- 302 | Title: Dissecting Android Malware: Characterization and Evolution 303 | Authors: Yajin Zhou, Xuxian Jiang 304 | ------------------- 305 | ``` 306 |
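`runner.py` persists each parsed article into the `articles` collection of the `alexandria` database. As a quick sanity check, the stored documents can be read back and summarised; the following is a minimal sketch (assuming the same local MongoDB instance) that reuses the `Article.from_dict` helper from `base.py`, in the spirit of `visualization.py`:

```python
from pymongo import MongoClient

from base import Article

# Connect to the same local MongoDB instance that runner.py writes to.
client = MongoClient("localhost", 27017)
db = client.alexandria

# Rebuild Article objects from the stored documents and print a short overview.
for doc in db.articles.find():
    article = Article.from_dict(doc)
    print(f"{article.title} ({len(article.references or [])} references)")
```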
307 | 308 | # Processing data 309 | 310 | A simple example of how to process the data: [visualization.py](visualization.py). 311 | -------------------------------------------------------------------------------- /alexandria/base.py: -------------------------------------------------------------------------------- 1 | """Contains the data structures and parsing functions to process 2 | and store data contained in an academic article. 3 | """ 4 | import re 5 | from bs4 import BeautifulSoup 6 | from serde import Model, fields 7 | 8 | 9 | def text(field): 10 | return field.text.strip() if field else "" 11 | 12 | 13 | class Author(Model): 14 | """Contains information about an author. 15 | """ 16 | name: fields.Str() 17 | surname: fields.Str() 18 | affiliation: fields.Optional(fields.List(fields.Str())) 19 | 20 | def __init__(self, soup): 21 | """Creates an `Author` by parsing a `soup` of type 22 | `BeautifulSoup`. 23 | """ 24 | if not soup.persname: 25 | self.name = "" 26 | self.surname = "" 27 | else: 28 | self.name = text(soup.persname.forename) 29 | self.surname = text(soup.persname.surname) 30 | # TODO: better affiliation parsing. 31 | self.affiliation = list(map(text, soup.find_all("affiliation"))) 32 | 33 | def __str__(self): 34 | s = "" 35 | if self.name: 36 | s += self.name + " " 37 | if self.surname: 38 | s += self.surname 39 | 40 | return s.strip() 41 | 42 | 43 | class _Article(Model): 44 | """This is required by serde for serialization, which 45 | is unable to take references to self as a field. 46 | For all purposes, refer to `Article`. 47 | """ 48 | pass 49 | 50 | 51 | class Article(_Article): 52 | """Represents an academic article or a reference contained in 53 | an article. 54 | 55 | The data is parsed from a TEI XML file (`from_file()`) or 56 | directly from a `BeautifulSoup` object. 57 | """ 58 | title: fields.Str() 59 | text: fields.Str() 60 | authors: fields.List(Author) 61 | year: fields.Optional(fields.Date()) 62 | references: fields.Optional(fields.List(_Article)) 63 | 64 | def __init__(self, soup, is_reference=False): 65 | """Create a new `Article` by parsing a `soup: BeautifulSoup` 66 | instance. 67 | The parameter `is_reference` specifies if the `soup` contains 68 | an entire article or just the content of a reference. 69 | """ 70 | self.title = text(soup.title) 71 | self.doi = text(soup.idno) 72 | self.abstract = text(soup.abstract) 73 | self.text = soup.text.strip() if soup.text else "" 74 | # FIXME 75 | self.year = None 76 | 77 | if is_reference: 78 | self.authors = list(map(Author, soup.find_all("author"))) 79 | self.references = [] 80 | else: 81 | self.authors = list(map(Author, soup.analytic.find_all("author"))) 82 | self.references = self._parse_biblio(soup) 83 | 84 | @staticmethod 85 | def from_file(tei_file): 86 | """Creates an `Article` by parsing a TEI XML file. 87 | """ 88 | with open(tei_file) as f: 89 | soup = BeautifulSoup(f, "lxml") 90 | 91 | return Article(soup) 92 | 93 | def _parse_biblio(self, soup): 94 | """Parses the bibliography from an article. 95 | """ 96 | references = [] 97 | # NOTE: we could do this without the regex. 98 | bibs = soup.find_all("biblstruct", {"xml:id": re.compile(r"b[0-9]*")}) 99 | 100 | for bib in bibs: 101 | if bib.analytic: 102 | references.append(Article(bib.analytic, is_reference=True)) 103 | # NOTE: in this case, bib.monogr contains more info 104 | # about the manuscript where the paper was published. 105 | # Not parsing for now. 
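# (Background: in Grobid's TEI output, <analytic> carries article-level metadata, while <monogr> describes the containing journal/volume, hence the fallback below.)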
106 | elif bib.monogr: 107 | references.append(Article(bib.monogr, is_reference=True)) 108 | else: 109 | print(f"Could not parse reference from {bib}") 110 | 111 | return references 112 | 113 | def __str__(self): 114 | return f"'{self.title}' - {' '.join(map(str, self.authors))}" 115 | 116 | def summary(self): 117 | """Prints a human-readable summary. 118 | """ 119 | print(f"Title: {self.title}") 120 | print("Authors: " + ", ".join(map(str, self.authors))) 121 | if self.references: 122 | print("References:") 123 | for r in self.references: 124 | r.summary() 125 | print("-------------------") 126 | -------------------------------------------------------------------------------- /alexandria/grobid.json: -------------------------------------------------------------------------------- 1 | { 2 | "grobid_server": "127.0.0.1", 3 | "grobid_port": "8080", 4 | "batch_size": 1000, 5 | "sleep_time": 5, 6 | "coordinates": [ "persName", "figure", "ref", "biblStruct", "formula" ] 7 | } 8 | -------------------------------------------------------------------------------- /alexandria/runner.py: -------------------------------------------------------------------------------- 1 | """ 2 | Usage: 3 | runner.py 4 | """ 5 | import os 6 | import sys 7 | import pymongo 8 | from docopt import docopt 9 | from bs4 import BeautifulSoup 10 | from grobid_client.grobid_client import GrobidClient 11 | 12 | from base import Article 13 | 14 | # Valid grobid services 15 | FULL = "processFulltextDocument" 16 | HEADER = "processHeaderDocument" 17 | REFS = "processReferences" 18 | 19 | 20 | def parse_pdf(client, pdf_file): 21 | _, status, r = client.process_pdf(service=FULL, 22 | pdf_file=pdf_file, 23 | generateIDs=False, 24 | consolidate_header=True, 25 | consolidate_citations=False, 26 | include_raw_citations=False, 27 | include_raw_affiliations=False, 28 | teiCoordinates=True, 29 | ) 30 | 31 | return Article(BeautifulSoup(r, "lxml")) 32 | 33 | 34 | if __name__ == "__main__": 35 | args = docopt(__doc__) 36 | 37 | grobid = GrobidClient("./grobid.json") 38 | client = pymongo.MongoClient("localhost", 27017) 39 | print("Connecting to db.") 40 | try: 41 | client.server_info() 42 | except pymongo.errors.ServerSelectionTimeoutError: 43 | print("Failed to connect to db.") 44 | sys.exit(1) 45 | 46 | db = client.alexandria 47 | 48 | print(f"Processing files in {args['']}") 49 | 50 | for pdf_file in os.listdir(args[""]): 51 | if not pdf_file.endswith(".pdf"): 52 | continue 53 | print(f"Parsing {pdf_file}") 54 | article = parse_pdf(grobid, os.path.join(args[""], pdf_file)) 55 | # Can also do print(a.to_json()) 56 | article.summary() 57 | print() 58 | 59 | db.articles.insert_one(article.to_dict()) 60 | -------------------------------------------------------------------------------- /alexandria/visualization.py: -------------------------------------------------------------------------------- 1 | import yake 2 | import matplotlib.pyplot as plt 3 | from pymongo import MongoClient 4 | from wordcloud import WordCloud 5 | 6 | from base import Article 7 | 8 | 9 | def find_keywords(articles, top_n=80, ngram_size=1): 10 | text = " ".join([a.text.lower() for a in articles]) 11 | 12 | return yake.KeywordExtractor(top=top_n, n=ngram_size).extract_keywords(text) 13 | 14 | def wordcloud(articles, dst=None): 15 | # Get keywords. 16 | keywords, importance = zip(*find_keywords(articles)) 17 | # Reverse importance. 
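# YAKE assigns lower scores to more relevant keywords, so 1 - score turns them into frequency-like weights for WordCloud.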
18 | importance = map(lambda x: 1-x, importance) 19 | keywords = dict(zip(keywords, importance)) 20 | 21 | wordcloud = WordCloud(background_color="white", 22 | colormap="winter", 23 | #font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf", 24 | height=800, 25 | width=800, 26 | min_font_size=10, 27 | max_font_size=120, 28 | prefer_horizontal=3.0).generate_from_frequencies(keywords) 29 | 30 | plt.figure(figsize=(10.0, 10.0)) 31 | plt.imshow(wordcloud, interpolation="bilinear") 32 | plt.axis("off") 33 | if dst: 34 | plt.savefig(dst) 35 | 36 | 37 | if __name__ == "__main__": 38 | client = MongoClient("localhost", 27017) 39 | db = client.alexandria 40 | articles = list(map(Article.from_dict, db.articles.find())) 41 | 42 | print(f"Fetched {len(articles)} articles.") 43 | 44 | keywords, _ = zip(*find_keywords(articles, top_n=5, ngram_size=2)) 45 | print(f"Top 5 keywords: {keywords}") 46 | 47 | wordcloud(articles, "examples/wordcloud-final.png") 48 | -------------------------------------------------------------------------------- /docs/larger_example.md: -------------------------------------------------------------------------------- 1 | # Example of larger analyses 2 | 3 | Paper analyser relies on PDF file representations of academic papers. 4 | As such, it is up to you to find these papers. 5 | 6 | For convenience we maintain a list of links to software analysis papers 7 | focused on software security in our sister repository [here](https://github.com/AdaLogics/software-security-paper-list) 8 | 9 | As an example of doing analysis on several Fuzzing papers, you can use the following commands: 10 | 11 | ``` 12 | cd paper-analyzer 13 | mkdir tmp && cd tmp 14 | git clone https://github.com/AdaLogics/software-security-paper-list 15 | cd software-security-paper-list 16 | python auto_download.py Fuzzing 17 | ``` 18 | 19 | At this point you will see more than 80 papers in the directory `out/Fuzzing/` 20 | 21 | We continue to do analysis on these papers: 22 | ``` 23 | cd ../.. 24 | python3 pq_main.py -f ./tmp/software-security-paper-list/out/Fuzzing 25 | ``` 26 | 27 | -------------------------------------------------------------------------------- /docs/simple_example.md: -------------------------------------------------------------------------------- 1 | # Simple example 2 | 3 | We include two whitepapers in the repository as examples for using 4 | paper-analyser to pass papers and get results out. 5 | 6 | To try the tool, simply follow the commands: 7 | ``` 8 | cd paper-analyzer 9 | . venv/bin/activate 10 | python3 ./pq_main.py -f ./example-papers/ 11 | ``` 12 | 13 | At this point you will see results in `pq-out/out-0` 14 | 15 | Specifically, you will see: 16 | ``` 17 | $ ls pq-out/out-0/ 18 | data_out img json_data normalised_dependency_graph.json parsed_paper_data.json report.txt 19 | ``` 20 | 21 | * `data_out` contains one `.txt` and one `.xml` file for each PDF. These `.txt` and `.xml` are simply data representations of the content of the given PDF file. 22 | * `json_data` contains JSON data representations for each paper 23 | * `img` contains a `.png` image of a citation-dependency graph of the PDF files in the folder 24 | * `parsed_paper_data.json` is a single json file containing data about the papers analysed, such as the title of each paper as well as the papers cited by each paper. 
25 | * `report.txt` contains a lot of information about the papers in data set 26 | 27 | ``` 28 | $ cat pq-out/out-0/report.txt 29 | ###################################### 30 | Parsed papers summary 31 | paper: 32 | pq-out/out-0/json_data/json_dump_0.json 33 | b'A characterisation of system-wide propagation in the malware landscape ' 34 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape 35 | References: 36 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos 37 | Title: System- Level Support for Intrusion Recovery 38 | Normalised: systemlevelsupportforintrusionrecovery 39 | ------------------- 40 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla 41 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps 42 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps 43 | ------------------- 44 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla 45 | Title: Bee Master: Detecting Host-Based Code Injection Attacks 46 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks 47 | ------------------- 48 | Authors: Thomas Barabosch and Elmar Gerhards-Padilla 49 | Title: Host-based code injection attacks: A popular technique used by malware 50 | Normalised: hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware 51 | ------------------- 52 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzarotti, and Engin Kirda 53 | Title: A View on Current Malware Behaviors 54 | Normalised: aviewoncurrentmalwarebehaviors 55 | ------------------- 56 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda 57 | Title: Dy- namic Analysis of Malicious Code 58 | Normalised: dynamicanalysisofmaliciouscode 59 | ------------------- 60 | Authors: Magal Baz and Or Safran 61 | Title: Dridexs Cold War: Enter AtomBombing 62 | Normalised: dridexscoldwar:enteratombombing 63 | ------------------- 64 | Authors: Fabrice Bellard 65 | Title: QEMU, a Fast and Portable Dynamic Translator 66 | Normalised: qemuafastandportabledynamictranslator 67 | ------------------- 68 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry 69 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions 70 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions 71 | ------------------- 72 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan 73 | Title: Repeatable Reverse Engineering with PANDA 74 | Normalised: repeatablereverseengineeringwithpanda 75 | ------------------- 76 | Authors: Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel 77 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools 78 | Normalised: asurveyonautomateddynamicmalwareanalysistechniquesandtools 79 | ------------------- 80 | Authors: Adrienne Porter Felt, Matthew Finifter, Erika Chin, Steve Hanna, and David Wagner 81 | Title: A Survey of Mobile Malware in the Wild 82 | Normalised: asurveyofmobilemalwareinthewild 83 | ------------------- 84 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant 85 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 86 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform 87 | ------------------- 88 | Authors: Min Gyung Kang, Pongsin Poosankam, 
and Heng Yin 89 | Title: Renovo: A Hidden Code Extractor for Packed Executables 90 | Normalised: renovo:ahiddencodeextractorforpackedexecutables 91 | ------------------- 92 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi 93 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 94 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis 95 | ------------------- 96 | Authors: David Korczynski 97 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction 98 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction 99 | ------------------- 100 | Authors: David Korczynski 101 | Title: Precise system-wide concatic malware unpacking 102 | Normalised: precisesystemwideconcaticmalwareunpacking 103 | ------------------- 104 | Authors: David Korczynski and Heng Yin 105 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks 106 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks 107 | ------------------- 108 | Authors: Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Caval- laro 109 | Title: The Evolution of Android Malware and Android Analysis Techniques 110 | Normalised: theevolutionofandroidmalwareandandroidanalysistechniques 111 | ------------------- 112 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda 113 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis 114 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis 115 | ------------------- 116 | Authors: Yajin Zhou and Xuxian Jiang 117 | Title: Dissecting Android Malware: Character- ization and Evolution 118 | Normalised: dissectingandroidmalware:characterizationandevolution 119 | ------------------- 120 | paper: 121 | pq-out/out-0/json_data/json_dump_1.json 122 | b'Precise system-wide concatic malware unpacking ' 123 | Normalised title: precisesystemwideconcaticmalwareunpacking 124 | References: 125 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos 126 | Title: System- Level Support for Intrusion Recovery 127 | Normalised: systemlevelsupportforintrusionrecovery 128 | ------------------- 129 | Authors: G. Balakrishnan, T. Reps, D. Melski, and T. 
Teitelbaum 130 | Title: WYSINWYX: What You See Is Not What You eXecute 131 | Normalised: wysinwyx:whatyouseeisnotwhatyouexecute 132 | ------------------- 133 | Authors: Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley 134 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code 135 | Normalised: byteweight:learningtorecognizefunctionsinbinarycode 136 | ------------------- 137 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla 138 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps 139 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps 140 | ------------------- 141 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla 142 | Title: Bee Master: Detecting Host-Based Code Injection Attacks 143 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks 144 | ------------------- 145 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry 146 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions 147 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions 148 | ------------------- 149 | Authors: Erik Bosman, Asia Slowinska, and Herbert Bos 150 | Title: Minemu: The Worlds Fastest Taint Tracker 151 | Normalised: minemu:theworldsfastesttainttracker 152 | ------------------- 153 | Could not parse 154 | ------------------- 155 | Authors: Cristina Cifuentes and K. John Gough 156 | Title: Decompilation of Binary Pro- grams 157 | Normalised: decompilationofbinaryprograms 158 | ------------------- 159 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee 160 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 161 | Normalised: ether:malwareanalysisviahardwarevirtualizationextensions 162 | ------------------- 163 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan 164 | Title: Repeatable Reverse Engineering with PANDA 165 | Normalised: repeatablereverseengineeringwithpanda 166 | ------------------- 167 | Authors: Thomas Dullien and Rolf Rolles 168 | Title: Graph-based comparison of executable objects (english version) 169 | Normalised: graphbasedcomparisonofexecutableobjects(englishversion) 170 | ------------------- 171 | Authors: Mike Van Emmerik 172 | Title: Signatures for Library Functions in Executable Files 173 | Normalised: signaturesforlibraryfunctionsinexecutablefiles 174 | ------------------- 175 | Authors: Halvar Flake 176 | Title: Structural Comparison of Executable Objects 177 | Normalised: structuralcomparisonofexecutableobjects 178 | ------------------- 179 | Authors: Fanglu Guo, Peter Ferrie, and Tzi-cker Chiueh 180 | Title: A Study of the Packer Problem and Its Solutions 181 | Normalised: astudyofthepackerproblemanditssolutions 182 | ------------------- 183 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant 184 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 185 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform 186 | ------------------- 187 | Authors: Xin Hu, Sandeep Bhatkar, Kent Griffin, and Kang G. 
Shin 188 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features 189 | Normalised: mutantxs:scalablemalwareclusteringbasedonstaticfeatures 190 | ------------------- 191 | Authors: Thomas Hungenberg and Matthias Eckert 192 | Title: http://www 193 | Normalised: http://www 194 | ------------------- 195 | Authors: Kyriakos K. Ispoglou and Mathias Payer 196 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 197 | Normalised: malwash:washingmalwaretoevadedynamicanalysis 198 | ------------------- 199 | Authors: Emily R. Jacobson, Nathan Rosenblum, and Barton P. Miller 200 | Title: Labeling Library Functions in Stripped Binaries 201 | Normalised: labelinglibraryfunctionsinstrippedbinaries 202 | ------------------- 203 | Authors: Sebastien Josse 204 | Title: Secure and advanced unpacking using computer emulation 205 | Normalised: secureandadvancedunpackingusingcomputeremulation 206 | ------------------- 207 | Authors: S. Josse 208 | Title: Malware Dynamic Recompilation 209 | Normalised: malwaredynamicrecompilation 210 | ------------------- 211 | Authors: Min Gyung Kang, Pongsin Poosankam, and Heng Yin 212 | Title: Renovo: A Hidden Code Extractor for Packed Executables 213 | Normalised: renovo:ahiddencodeextractorforpackedexecutables 214 | ------------------- 215 | Authors: Yuhei Kawakoya, Makoto Iwamura, and Jun Miyoshi 216 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation 217 | Normalised: taintassistediatreconstructionagainstpositionobfuscation 218 | ------------------- 219 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi 220 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 221 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis 222 | ------------------- 223 | Could not parse 224 | ------------------- 225 | Authors: Clemens Kolbitsch, Engin Kirda, and Christopher Kruegel 226 | Title: The Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code 227 | Normalised: thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode 228 | ------------------- 229 | Authors: David Korczynski 230 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction 231 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction 232 | ------------------- 233 | Authors: David Korczynski and Heng Yin 234 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks 235 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks 236 | ------------------- 237 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Gio- vanni Vigna 238 | Title: Polymorphic Worm Detection Using Structural Information of Executables 239 | Normalised: polymorphicwormdetectionusingstructuralinformationofexecutables 240 | ------------------- 241 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, and Giovanni Vigna 242 | Title: Static Disassembly of Obfuscated Binaries 243 | Normalised: staticdisassemblyofobfuscatedbinaries 244 | ------------------- 245 | Authors: Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli 246 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects 247 | Normalised: graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects 248 | ------------------- 249 | Authors: L. Martignoni, M. Christodorescu, and S. 
Jha 250 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware 251 | Normalised: omniunpack:fastgenericandsafeunpackingofmalware 252 | ------------------- 253 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Stefano DAlessio, Lorenzo Fontana, Fabio Gritti, and Stefano Zanero 254 | Title: Measuring and Defeating Anti- Instrumentation-Equipped Malware 255 | Normalised: measuringanddefeatingantiinstrumentationequippedmalware 256 | ------------------- 257 | Authors: Georgios Portokalidis, Asia Slowinska, and Herbert Bos 258 | Title: Argos: an Emula- tor for Fingerprinting Zero-Day Attacks 259 | Normalised: argos:anemulatorforfingerprintingzerodayattacks 260 | ------------------- 261 | Authors: Symantec Security Response 262 | Title: W32 263 | Normalised: w32 264 | ------------------- 265 | Authors: Nathan E. Rosenblum, Xiaojin Zhu, Barton P. Miller, and Karen Hunt 266 | Title: Learning to Analyze Binary Computer Code 267 | Normalised: learningtoanalyzebinarycomputercode 268 | ------------------- 269 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee 270 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware 271 | Normalised: polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware 272 | ------------------- 273 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, and Wenke Lee 274 | Title: Eureka: A Framework for Enabling Static Malware Analysis 275 | Normalised: eureka:aframeworkforenablingstaticmalwareanalysis 276 | ------------------- 277 | Authors: Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson 278 | Title: Binary Translation 279 | Normalised: binarytranslation 280 | ------------------- 281 | Authors: Wei Song, Heng Yin, Chang Liu, and Dawn Song 282 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis 283 | Normalised: deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis 284 | ------------------- 285 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda 286 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis 287 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis 288 | ------------------- 289 | ###################################### 290 | Papers in the data set : 291 | Cited papers: 292 | b'Precise system-wide concatic malware unpacking ' 293 | Noncited papers: 294 | b'A characterisation of system-wide propagation in the malware landscape ' 295 | Succ: 2 296 | Fail: 0 297 | Succ sigs: 61 298 | Failed sigs: 2 299 | ###################################### 300 | All Titles (normalised) of the papers in the data set: 301 | b'A characterisation of system-wide propagation in the malware landscape ' 302 | b'Precise system-wide concatic malware unpacking ' 303 | ###################################### 304 | ###################################### 305 | All references/citations (normalised) issued by these papers 306 | Name : Count 307 | hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware : 1 308 | aviewoncurrentmalwarebehaviors : 1 309 | dynamicanalysisofmaliciouscode : 1 310 | dridexscoldwar:enteratombombing : 1 311 | qemuafastandportabledynamictranslator : 1 312 | asurveyonautomateddynamicmalwareanalysistechniquesandtools : 1 313 | asurveyofmobilemalwareinthewild : 1 314 | precisesystemwideconcaticmalwareunpacking : 1 315 | 
theevolutionofandroidmalwareandandroidanalysistechniques : 1 316 | dissectingandroidmalware:characterizationandevolution : 1 317 | wysinwyx:whatyouseeisnotwhatyouexecute : 1 318 | byteweight:learningtorecognizefunctionsinbinarycode : 1 319 | minemu:theworldsfastesttainttracker : 1 320 | decompilationofbinaryprograms : 1 321 | ether:malwareanalysisviahardwarevirtualizationextensions : 1 322 | graphbasedcomparisonofexecutableobjects(englishversion) : 1 323 | signaturesforlibraryfunctionsinexecutablefiles : 1 324 | structuralcomparisonofexecutableobjects : 1 325 | astudyofthepackerproblemanditssolutions : 1 326 | mutantxs:scalablemalwareclusteringbasedonstaticfeatures : 1 327 | http://www : 1 328 | malwash:washingmalwaretoevadedynamicanalysis : 1 329 | labelinglibraryfunctionsinstrippedbinaries : 1 330 | secureandadvancedunpackingusingcomputeremulation : 1 331 | malwaredynamicrecompilation : 1 332 | taintassistediatreconstructionagainstpositionobfuscation : 1 333 | thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode : 1 334 | polymorphicwormdetectionusingstructuralinformationofexecutables : 1 335 | staticdisassemblyofobfuscatedbinaries : 1 336 | graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects : 1 337 | omniunpack:fastgenericandsafeunpackingofmalware : 1 338 | measuringanddefeatingantiinstrumentationequippedmalware : 1 339 | argos:anemulatorforfingerprintingzerodayattacks : 1 340 | w32 : 1 341 | learningtoanalyzebinarycomputercode : 1 342 | polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware : 1 343 | eureka:aframeworkforenablingstaticmalwareanalysis : 1 344 | binarytranslation : 1 345 | deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis : 1 346 | systemlevelsupportforintrusionrecovery : 2 347 | quincy:detectinghostbasedcodeinjectionattacksinmemorydumps : 2 348 | beemaster:detectinghostbasedcodeinjectionattacks : 2 349 | codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions : 2 350 | repeatablereverseengineeringwithpanda : 2 351 | decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform : 2 352 | renovo:ahiddencodeextractorforpackedexecutables : 2 353 | apichaser:taintassistedsandboxforevasivemalwareanalysis : 2 354 | repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction : 2 355 | capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks : 2 356 | panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis : 2 357 | ###################################### 358 | The total number of unique citations: 50 359 | ###################################### 360 | Dependency graph based on citations 361 | Paper: 362 | pq-out/out-0/json_data/json_dump_0.json 363 | b'A characterisation of system-wide propagation in the malware landscape ' 364 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape 365 | Cited by: 366 | --------------------------------- 367 | Paper: 368 | pq-out/out-0/json_data/json_dump_1.json 369 | b'Precise system-wide concatic malware unpacking ' 370 | Normalised title: precisesystemwideconcaticmalwareunpacking 371 | Cited by: 372 | b'A characterisation of system-wide propagation in the malware landscape ' 373 | --------------------------------- 374 | ``` 375 | -------------------------------------------------------------------------------- /example-images/fuzz-barplot.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-barplot.png -------------------------------------------------------------------------------- /example-images/fuzz-wordcloud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-wordcloud.png -------------------------------------------------------------------------------- /example-images/paper-citation-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/paper-citation-graph.png -------------------------------------------------------------------------------- /example-papers/1908.09204.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.09204.pdf -------------------------------------------------------------------------------- /example-papers/1908.10167.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.10167.pdf -------------------------------------------------------------------------------- /gen_wordcloud.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from wordcloud import WordCloud 4 | 5 | if len(sys.argv) != 2: 6 | print("usage: python pq_gen_wordcloud.py TEXTFILE") 7 | exit(0) 8 | 9 | whole_text = None 10 | #with open("copy-total-words", "r") as i_f: 11 | with open(sys.argv[1], "r") as i_f: 12 | whole_text = i_f.read() 13 | 14 | whole_text = whole_text.replace("cid:", "") 15 | 16 | #wordcloud = WordCloud().generate(whole_text) 17 | 18 | import matplotlib.pyplot as plt 19 | #plt.imshow(wordcloud, interpolation="bilinear") 20 | #plt.axis("off") 21 | 22 | wordcloud = WordCloud(background_color="white", 23 | colormap="winter", 24 | font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf", 25 | height=800, 26 | width=800, 27 | min_font_size=10, 28 | max_font_size=120, 29 | prefer_horizontal=3.0).generate(whole_text) 30 | 31 | plt.figure(figsize=(10.0, 10.0)) 32 | plt.imshow(wordcloud, interpolation="bilinear") 33 | plt.axis("off") 34 | plt.savefig("wordcloud-final.png") 35 | 36 | # Now let's do word frequency 37 | word_dict = dict() 38 | split_words = whole_text.split(" ") 39 | total_number_of_words = len(split_words) 40 | print(total_number_of_words) 41 | total_sorted = 0 42 | for w in split_words: 43 | total_sorted += 1 44 | if w in word_dict: 45 | word_dict[w] = word_dict[w] + 1 46 | else: 47 | word_dict[w] = 1 48 | print("Total sorted: %d"%(total_sorted)) 49 | 50 | listed_word_dict = [] 51 | for key,value in word_dict.items(): 52 | listed_word_dict.append((key, value, )) 53 | 54 | print("Length of listed word %d"%(len(listed_word_dict))) 55 | listed_word_dict.sort(key=lambda x:x[1]) 56 | listed_word_dict.reverse() 57 | sorted_list = listed_word_dict 58 | print(sorted_list) 59 | 60 | 61 | # https://www.espressoenglish.net/the-100-most-common-words-in-english/ 62 | avoid_words = { "the", "at", "there", "some", "my", "of", "be", "use", "her", "than", "and", "this", "an", "would", "first", "a", "have", 
"each", "make", "water", "to", "from", "which", "like", "been", "in", "or", "she", "him", "call", "is", "one", "do", "into", "who", "you", "had", "how", "time", "oil", "that", "by", "their", "has", "its", "it", "word", "if", "look", "now", "he", "but", "will", "two", "find", "was", "not", "up", "more", "long", "for", "what", "other", "write", "down", "on", "all", "about", "go", "day", "are", "were", "out", "see", "did", "as", "we", "many", "number", "get", "with", "when", "then", "no", "come", "his", "your", "them", "way", "made", "they", "can", "these", "could", "may", "I", "said", "so", "people", "part" } 63 | 64 | avoid_words.add("=") 65 | avoid_words.add(".") 66 | avoid_words.add(" ") 67 | avoid_words.add("") 68 | avoid_words.add(",") 69 | 70 | words = [] 71 | freqs = [] 72 | for key, value in sorted_list: 73 | if key.lower() in avoid_words: 74 | continue 75 | words.append(key) 76 | freqs.append(value) 77 | for i in range(140): 78 | print("sorted_list: %s"%(str(sorted_list[i]))) 79 | 80 | Data = { "words: " : words, "freqs" : freqs } 81 | 82 | plt.clf() 83 | plt.figure(figsize=(10.0, 10.0)) 84 | plt.barh(words[:50], freqs[:50]) 85 | 86 | plt.title("Word frequency") 87 | #plt.show() 88 | plt.savefig("barplot.png") 89 | -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | # Install some needed packages 2 | sudo apt-get install python3 3 | sudo apt-get install python3-pip 4 | sudo apt install graphviz 5 | 6 | # Create a virtual environment 7 | virtualenv venv 8 | 9 | # Launch the virtual environment 10 | . venv/bin/activate 11 | 12 | # Install various packages 13 | pip install -r requirements.txt 14 | -------------------------------------------------------------------------------- /pq_format_reader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import simplejson as json 4 | import traceback 5 | 6 | import graphviz as gv 7 | import pylab 8 | import random 9 | 10 | 11 | def append_to_report(workdir, str_to_write, to_print=True): 12 | work_report = os.path.join(workdir, "report.txt") 13 | if not os.path.isfile(work_report): 14 | with open(work_report, "w+") as wr: 15 | wr.write(str_to_write) 16 | wr.write("\n") 17 | else: 18 | with open(work_report, "ab") as wr: 19 | wr.write(str_to_write.encode("utf-8")) 20 | wr.write("\n".encode("utf-8")) 21 | 22 | if to_print: 23 | print(str_to_write) 24 | 25 | 26 | def split_reference_section_into_references(all_content, to_print=True): 27 | ''' 28 | Takes a raw string that resembles references and returns a python 29 | list where each element is composed of a tuple of four elements: 30 | - The number of the reference 31 | - The data of the reference 32 | - The offset where the reference begins in the raw data 33 | - The offset where the reference ends in the raw data 34 | (This end offset is not entirely workign and will often 35 | be 0) 36 | Each tuple is arranged in the order of the elements listed above. 
37 | ''' 38 | 39 | collect_num = False 40 | curr_num = "" 41 | curr_content = "" 42 | idx_offset = 0 43 | begin_offset = 0 44 | end_offset = 0 45 | 46 | references = list() 47 | 48 | for c in all_content: 49 | if collect_num == True: 50 | try: 51 | n = int(c) 52 | curr_num = "%s%s" % (curr_num, c) 53 | except: 54 | None 55 | 56 | if c == "[": 57 | # Make sure the next character is a number 58 | try: 59 | tmpi = int(all_content[idx_offset + 1]) 60 | except: 61 | curr_content += c 62 | continue 63 | 64 | if curr_content != "": 65 | references.append({ 66 | "number": curr_num, 67 | "raw_content": curr_content, 68 | "offset": begin_offset 69 | }) 70 | #references.append((curr_num, curr_content, begin_offset, end_offset)) 71 | curr_num = "" 72 | collect_num = True 73 | begin_offset = idx_offset 74 | elif c == "]": 75 | collect_num = False 76 | #print("Got a number: %d"%(int(curr_num))) 77 | #curr_num = "" 78 | curr_content = "" 79 | elif collect_num == False: 80 | curr_content += c 81 | 82 | idx_offset += 1 83 | 84 | # Add the final reference 85 | #references.append((curr_num, curr_content, begin_offset, end_offset)) 86 | references.append({ 87 | "number": curr_num, 88 | "raw_content": curr_content, 89 | "offset": begin_offset 90 | }) 91 | print("Size of references: %d" % (len(references))) 92 | 93 | if to_print == True: 94 | for ref in references: 95 | print("Ref: %s" % (str((ref)))) 96 | 97 | return references 98 | 99 | 100 | #################################################################### 101 | # Reference author routines 102 | # 103 | # 104 | # These routines are used to take raw reference data 105 | # and split it up between authors and title of reference. 106 | # 107 | # We need multiple routines for this since there are multiple 108 | # ways of writing references. 109 | #################################################################### 110 | def read_single_letter_references(r_num, r_con, r_beg, r_end): 111 | ''' 112 | Parse a raw reference string and extract authors as well as title. 113 | This parsing routine is concerned with references where there is only a single 114 | name spelled out entirely per author, and the rest are delivered as single letters 115 | followed by periods. 116 | 117 | Examples of strings it parses: 118 | - T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs. 119 | - A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack 120 | - A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets. 121 | 122 | Inputs: 123 | - An element from a list as parsed by read_harvard_references 124 | 125 | Outputs: 126 | - On failure: None 127 | - On success: 128 | ''' 129 | # Now find the period that divides the authors from the title 130 | authors = None 131 | rest = None 132 | space_split = r_con.split(" ") 133 | for idx2 in range(len(space_split) - 4): 134 | try: 135 | # This will capture when there are multiple authors 136 | if (space_split[idx2].lower() == "and" 137 | and space_split[idx2 + 1][-1] == "." 138 | and space_split[1][-1] == "."): 139 | # If we have a double single-letter name 140 | 141 | # Example: T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs. 142 | if space_split[idx2 + 2][-1] == "." 
and len( 143 | space_split[idx2 + 2]) == 2: 144 | print("Potentially last1: %s" % (space_split[idx2 + 4])) 145 | authors = " ".join(space_split[0:idx2 + 4]) 146 | rest = " ".join(space_split[idx2 + 4:]) 147 | # Example: A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack 148 | elif space_split[idx2 + 2][-1] != "." and space_split[ 149 | idx2 + 3][-1] == ".": 150 | print("Potentially last2: %s" % (space_split[idx2 + 4])) 151 | authors = " ".join(space_split[0:idx2 + 4]) 152 | rest = " ".join(space_split[idx2 + 4:]) 153 | # Example:A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets. 154 | else: 155 | print("Potentially last3: %s" % (space_split[idx2 + 3])) 156 | authors = " ".join(space_split[0:idx2 + 3]) 157 | rest = " ".join(space_split[idx2 + 3:]) 158 | break 159 | except: 160 | None 161 | 162 | if authors == None: 163 | for idx2 in range(len(space_split) - 4): 164 | try: 165 | # This will capture when there is only a single author 166 | if (len(space_split[idx2]) == 2 167 | and space_split[idx2][-1] == "." 168 | and len(space_split[idx2 + 2]) > 2): 169 | # If we have double single-letter naming 170 | if space_split[idx2 + 1][-1] == "." and len( 171 | space_split[idx2 + 1]) == 2: 172 | print("Potentially last single 1: %s" % 173 | (space_split[idx2 + 3])) 174 | authors = " ".join(space_split[0:idx2 + 3]) 175 | rest = " ".join(space_split[idx2 + 3:]) 176 | break 177 | else: 178 | print("Potentially last single 2: %s" % 179 | (space_split[idx2 + 2])) 180 | authors = " ".join(space_split[0:idx2 + 2]) 181 | rest = " ".join(space_split[idx2 + 2:]) 182 | break 183 | except: 184 | None 185 | 186 | # Do some post processing to ensure we got the right stuff. 187 | # First ensure we at least have two words in authors: 188 | if authors != None and len(authors.split(" ")) < 2: 189 | print("Breaking 1") 190 | return None 191 | 192 | # Now ensure that the title is not just a name: 193 | if authors != None: 194 | try: 195 | title = int(rest.split(".")[0]) 196 | # If the above is successful, then the title is just a number, which cannot be true 197 | print("Breaking 2") 198 | return None 199 | except: 200 | print("Could not break 2") 201 | None 202 | 203 | if authors != None: 204 | r_title = rest.split(".")[0] 205 | 206 | # with the special '' symbols (0x201c and 0x201d, respectively). 207 | #if chr(0x201d) in r_title: 208 | # r_title = rest.split(chr(0x201d))[0] 209 | # r_title = r_title[1:-1] 210 | # rest = ".".join(rest.split(chr(0x201d))[1:]) 211 | 212 | print("Authors: %s" % (authors)) 213 | print("Title: %s" % (r_title)) 214 | print("rest: %s" % (rest)) 215 | 216 | # Create a directory 217 | ref_dict = dict() 218 | ref_dict['Authors'] = authors 219 | ref_dict['Title'] = r_title 220 | ref_dict['rest'] = rest 221 | ref_dict['num'] = r_num 222 | return ref_dict 223 | else: 224 | return None 225 | print("----------------") 226 | 227 | 228 | def read_full_author_references(r_num, r_con, r_beg, r_end): 229 | ''' 230 | This routine is focused on parsing references where the authors names are 231 | spelled out fully, including all names of each author. 232 | 233 | Examples: Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing Mayhem on Binary Code. In IEEE Symposium on Security and Privacy (S&P). 
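    On the example above the routine roughly works as follows: it splits the
    raw reference on periods, re-merges short fragments produced by initials
    or abbreviations, and, when this gives a clean four-part split with a
    numeric year in the second slot, returns a dictionary with 'Authors',
    'Year', 'Title', 'Venue', 'rest' and 'num' keys (e.g. Year='2012' and
    Title='Unleashing Mayhem on Binary Code'). If the split has more parts it
    keeps Authors/Year/Title and joins the remainder into 'rest'; it returns
    None when the second part is not a year.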
234 | ''' 235 | # Now find the period that divides the authors from the title 236 | authors = None 237 | rest = None 238 | space_split = r_con.split(".") 239 | print("Space split: %s" % (str(space_split))) 240 | 241 | # Merge all of the elements of the list that are of less than 2 in length: 242 | new_space_split = list() 243 | curr_elem = "" 244 | for elem in space_split: 245 | if len(elem) <= 2: 246 | curr_elem += "%s." % (elem) 247 | elif len(elem) > 2 and elem[-2] == ' ' and elem[-1] != ' ': 248 | curr_elem += "%s." % (elem) 249 | else: 250 | new_elem = "%s%s" % (curr_elem, elem) 251 | print("New elem: %s" % (new_elem)) 252 | new_space_split.append(new_elem) 253 | curr_elem = "" 254 | space_split = new_space_split 255 | 256 | print("Refined space split: %s" % (str(space_split))) 257 | 258 | # See if it is clean 259 | if len(space_split) == 4: 260 | try: 261 | year = int(space_split[1]) 262 | print("Gotten the reference") 263 | 264 | ref_dict = dict() 265 | ref_dict['Authors'] = space_split[0] 266 | ref_dict['Title'] = space_split[2] 267 | ref_dict['Venue'] = space_split[3] 268 | ref_dict['Year'] = space_split[1] 269 | ref_dict['rest'] = space_split[3] 270 | ref_dict['num'] = r_num 271 | return ref_dict 272 | except: 273 | None 274 | # Let's try a bit vaguer for the sake of it 275 | try: 276 | year = int(space_split[1]) 277 | print("Gotten the reference") 278 | 279 | ref_dict = dict() 280 | ref_dict['Authors'] = space_split[0] 281 | ref_dict['Year'] = space_split[1] 282 | ref_dict['Title'] = space_split[2] 283 | ref_dict['rest'] = ".".join(space_split[3:]) 284 | ref_dict['num'] = r_num 285 | return ref_dict 286 | except: 287 | None 288 | 289 | return None 290 | 291 | 292 | def normalise_title(title): 293 | return title.lower().replace(",", "").replace(" ", "").replace("-", "") 294 | 295 | 296 | ########################################## 297 | # End of citation parsing routines 298 | ########################################## 299 | 300 | 301 | def parse_raw_reference_list(refs): 302 | ''' 303 | The goal of this function is to convert the raw references 304 | into more normalized references, and in particular split 305 | the raw reference into authors and title of reference. 306 | 307 | An important aspect of this function is that it tries to pass 308 | the reference in multiple ways, based on how the citations 309 | are written. 310 | 311 | As such, this routine is somewhat implemented with a 312 | test-and-check aproach in mind. 313 | ''' 314 | parse_funcs = [read_single_letter_references, read_full_author_references] 315 | parse_success = [] 316 | for parse_func in parse_funcs: 317 | missing_refs = [] 318 | refined = [] 319 | for ref in refs: 320 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'], 321 | 0) 322 | if res == None: 323 | missing_refs.append(ref) 324 | else: 325 | refined.append(res) 326 | parse_success.append({ 327 | 'missing_refs': missing_refs, 328 | 'refined': refined 329 | }) 330 | 331 | # Now pick the index with highest number of successful parses 332 | if len(parse_success[0]['refined']) > len(parse_success[1]['refined']): 333 | parse_func = parse_funcs[0] 334 | else: 335 | parse_func = parse_funcs[1] 336 | 337 | # Now parse a last time and insert the parsed data into 338 | # the references dictionary. 339 | # This loop will modify the content of the dictionary. 
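    # Illustrative sketch of the result (assumed data, not from a real paper):
    # after the loop below an entry of `refs` might look like
    #   {'number': '7', 'raw_content': ' T. Ball, ... of C programs. ',
    #    'offset': 512,
    #    'parsed': {'Authors': 'T. Ball, ...',
    #               'Title': 'Automatic predicate abstraction of C programs', ...},
    #    'normalised-title': 'automaticpredicateabstractionofcprograms'}
    # or have 'parsed' and 'normalised-title' set to None when neither
    # parsing routine recognised the reference format.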
340 | for ref in refs: 341 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'], 0) 342 | ref['parsed'] = res 343 | if res != None: 344 | ref['normalised-title'] = normalise_title(res['Title']) 345 | else: 346 | ref['normalised-title'] = None 347 | 348 | 349 | ######################################## 350 | # Utility functions 351 | ######################################## 352 | 353 | 354 | def print_refined(refined): 355 | print("Refined lists:") 356 | #for rd in refined: 357 | # try: 358 | # print("\t%d"%(int(rd['num']))) 359 | # except: 360 | # None 361 | # print("\tTitle: %s"%(rd['Title'])) 362 | # print("\tAuthors: %s"%(rd['Authors'])) 363 | # print("\tRest: %s"%(rd['rest'])) 364 | # print("#"*90) 365 | 366 | 367 | def print_missing(missings): 368 | print("Missing refs:") 369 | #for ms in missings: 370 | # print("\t%s"%(str(ms))) 371 | # print("<"*90) 372 | 373 | 374 | def read_decoded(file_path): 375 | ''' 376 | Reads a file an encodes each data as an ASCII string. We do 377 | the ASCII encoding because many of the papers have some form 378 | of weird UTF-coding and we do not handle those at the moment. 379 | ''' 380 | with open(file_path, "r") as f: 381 | json_content = json.loads(str(f.read().replace('\r\n', '')), 382 | strict=False) 383 | dec_content = dict() 384 | for elem in json_content: 385 | #dec_content[elem] = json_content[elem].encode('utf-8') 386 | dec_content[elem] = json_content[elem].encode('ascii', 'ignore') 387 | #asciidata= dec.encode("ascii","ignore") 388 | return dec_content 389 | 390 | 391 | def parse_file(file_path): 392 | ''' 393 | Parses a json file with the raw data of title and references 394 | and returns a list of citations made by this paper. 395 | ''' 396 | # Read the json file with our raw data. 397 | dec_json = read_decoded(file_path) 398 | 399 | # Merge all lines in the file by converting newlines to spaces 400 | dec_json['References'] = dec_json['References'].decode("utf-8") 401 | print("Type: %s" % (str(type(dec_json['References'])))) 402 | print(dec_json['References']) 403 | print(dec_json['References'].encode('ascii', 'ignore')) 404 | all_refs = dec_json['References'].replace( 405 | "\n", " ") #decode("utf-8").replace("\n", " ") 406 | 407 | # Extract the references in a raw format. 
This is just splitting 408 | # 409 | references = split_reference_section_into_references(all_refs) 410 | 411 | # Decipher the references 412 | print("References") 413 | print(references) 414 | #refined, missing_sigs = parse_raw_reference_list(references) 415 | parse_raw_reference_list(references) 416 | print("Refined refs:") 417 | print(references) 418 | 419 | #print_refined(refined) 420 | #print_missing(missing_sigs) 421 | #print("Refined sigs: %d"%(len(refined))) 422 | 423 | # Now create our dictionary where we will wrap all data in 424 | paper_dict = {} 425 | paper_dict['paper-title'] = dec_json['Title'] 426 | paper_dict['paper-title-normalised'] = normalise_title( 427 | dec_json['Title'].decode("utf-8")) 428 | paper_dict['references'] = references 429 | #paper_dict['success-sigs'] = refined 430 | #paper_dict['missing-sigs'] = missing_sigs 431 | 432 | return paper_dict 433 | #return refined, missing_sigs, dec_json['Title'] 434 | 435 | 436 | def read_json_file(filename): 437 | print("[+] Parsing file: %s" % (filename)) 438 | ret = None 439 | try: 440 | ret = parse_file(filename) 441 | except: 442 | exc_type, exc_value, exc_traceback = sys.exc_info() 443 | traceback.print_tb(exc_traceback, limit=10, file=sys.stdout) 444 | traceback.print_exception(exc_type, 445 | exc_value, 446 | exc_traceback, 447 | limit=20, 448 | file=sys.stdout) 449 | traceback.print_exc() 450 | formatted_lines = traceback.format_exc().splitlines() 451 | print("[+] Completed parsing of %s" % (filename)) 452 | return ret 453 | 454 | 455 | def identify_group_dependencies(workdir, parsed_papers): 456 | """ identifies who cites who in a group of papers """ 457 | append_to_report(workdir, "######################################") 458 | append_to_report(workdir, "Dependency graph based on citations") 459 | #append_to_report(workdir, "######################################") 460 | g1 = gv.Digraph(format='png') 461 | 462 | MAX_STR_SIZE = 15 463 | idx = 0 464 | paper_dict = {} 465 | normalised_to_id = {} 466 | 467 | for src_paper in parsed_papers: 468 | #paper_dict[src_paper['result']['paper-title-normalised']] = idx 469 | if len(src_paper['result']['paper-title-normalised']) > MAX_STR_SIZE: 470 | 471 | target_d = src_paper['result']['paper-title-normalised'][ 472 | 0:MAX_STR_SIZE] 473 | else: 474 | target_d = src_paper['result']['paper-title-normalised'] 475 | 476 | #paper_dict[src_paper['result']['paper-title-normalised']] = src_paper['result']['paper-title-normalised'] 477 | paper_dict[src_paper['result']['paper-title-normalised']] = target_d 478 | #g1.node(str(idx)) 479 | g1.node(target_d) 480 | idx += 1 481 | 482 | #print("Paper dict") 483 | #print(paper_dict) 484 | #print("-"*50) 485 | #raw_dependency_graph_path = os.path.join(workdir, "raw_dependency_graph.json") 486 | #with open(raw_dependency_graph_path, "w+") as dependency_graph_json: 487 | # json.dump(paper_dict, dependency_graph_json) 488 | 489 | dependency_graph = [] 490 | for src_paper in parsed_papers: 491 | 492 | cited_by = [] 493 | for tmp_paper in parsed_papers: 494 | if tmp_paper == src_paper: 495 | continue 496 | # Now go through references of tmp_paper and 497 | # see if it cites the src paper 498 | cites = False 499 | for ref in tmp_paper['result']['references']: 500 | if ref['parsed'] != None: 501 | if ref['normalised-title'] == src_paper['result'][ 502 | 'paper-title-normalised']: 503 | if len(tmp_paper['result'] 504 | ['paper-title-normalised']) > MAX_STR_SIZE: 505 | src_d = tmp_paper['result'][ 506 | 'paper-title-normalised'][0:MAX_STR_SIZE] 507 | 
else: 508 | src_d = tmp_paper['result'][ 509 | 'paper-title-normalised'] 510 | 511 | if len(src_paper['result'] 512 | ['paper-title-normalised']) > MAX_STR_SIZE: 513 | dst_d = src_paper['result'][ 514 | 'paper-title-normalised'][0:MAX_STR_SIZE] 515 | else: 516 | dst_d = src_paper['result'][ 517 | 'paper-title-normalised'] 518 | g1.edge(src_d, dst_d) 519 | 520 | #src_idx = paper_dict[tmp_paper['result']['paper-title-normalised']] 521 | #dst_idx = paper_dict[src_paper['result']['paper-title-normalised']] 522 | #g1.edge(str(src_idx), str(dst_idx)) 523 | cites = True 524 | if cites == True: 525 | cited_by.append(tmp_paper) 526 | src_paper['cited_by'] = cited_by 527 | append_to_report(workdir, "Paper:") 528 | append_to_report(workdir, "\t%s" % (src_paper['filename'])) 529 | append_to_report(workdir, "\t%s" % (src_paper['result']['paper-title'])) 530 | append_to_report(workdir, "\tNormalised title: %s" % 531 | (src_paper['result']['paper-title-normalised'])) 532 | append_to_report(workdir, "\tCited by: ") 533 | for cited_by_paper in cited_by: 534 | append_to_report(workdir, "\t\t%s" % (cited_by_paper['result']['paper-title'])) 535 | append_to_report(workdir, "---------------------------------") 536 | 537 | paper_info = {} 538 | paper_info['name'] = src_paper['result']['paper-title'].decode("utf-8") 539 | paper_info['minimized-name'] = src_paper['result'][ 540 | 'paper-title-normalised'] 541 | paper_info['imports'] = [] 542 | for cited_by_paper in cited_by: 543 | paper_info['imports'].append({ 544 | "name": 545 | cited_by_paper['result']['paper-title'].decode("utf-8"), 546 | "minimized-name": 547 | cited_by_paper['result']['paper-title-normalised'] 548 | }) 549 | dependency_graph.append(paper_info) 550 | 551 | #append_to_report(workdir, "Done identifying references within the group") 552 | #g1.view() 553 | 554 | filename = g1.render( 555 | filename=os.path.join(workdir, "img", "citation_graph")) 556 | print("idx: %d" % (idx)) 557 | #print(parsed_papers) 558 | 559 | dependency_graph_path = os.path.join(workdir, 560 | "normalised_dependency_graph.json") 561 | with open(dependency_graph_path, "w+") as dependency_graph_json: 562 | json.dump(dependency_graph, dependency_graph_json) 563 | 564 | 565 | def display_summary(workdir, parsed_papers_raw): 566 | succ = [] 567 | fail = [] 568 | succ_sigs = [] 569 | miss_sigs = [] 570 | all_titles = [] 571 | append_to_report(workdir, "######################################") 572 | append_to_report(workdir, " Parsed papers summary ") 573 | #append_to_report(workdir, "######################################") 574 | all_normalised_citations = set() 575 | for paper in parsed_papers_raw: 576 | append_to_report(workdir, "paper:") 577 | append_to_report(workdir, "\t%s" % (paper['filename'])) 578 | append_to_report(workdir, "\t%s" % (paper['result']['paper-title'])) 579 | append_to_report(workdir, "\tNormalised title: %s" % 580 | (paper['result']['paper-title-normalised'])) 581 | append_to_report(workdir, "\tReferences:") 582 | for ref in paper['result']['references']: 583 | if ref['parsed'] != None: 584 | append_to_report(workdir, "\t\tAuthors: %s" % (ref['parsed']['Authors'])) 585 | append_to_report(workdir, "\t\tTitle: %s" % (ref['parsed']['Title'])) 586 | append_to_report(workdir, "\t\tNormalised: %s" % (ref['normalised-title'])) 587 | all_normalised_citations.add(ref['normalised-title']) 588 | else: 589 | append_to_report(workdir, "\t\tCould not parse") 590 | append_to_report(workdir, "\t\t-------------------") 591 | #print("\t\t%s"%(str(ref['parsed']))) 592 | 593 | 
#print("[+] %s"%(paper['title'])) 594 | if paper['result'] == None: 595 | fail.append(paper['filename']) 596 | else: 597 | succ.append(paper['filename']) 598 | 599 | for ref in paper['result']['references']: 600 | if ref['parsed'] != None: 601 | succ_sigs.append(ref['parsed']) 602 | else: 603 | miss_sigs.append(ref['raw_content']) 604 | 605 | #succ_sigs.append(paper['result']['success-sigs']) 606 | #miss_sigs.append(paper['result']['missing-sigs']) 607 | all_titles.append(paper['result']['paper-title']) 608 | 609 | # Check which papers are in our set: 610 | append_to_report(workdir, "######################################") 611 | append_to_report(workdir, "Papers in the data set :") 612 | #append_to_report(workdir, "######################################") 613 | cited_papers = list() 614 | noncited_papers = list() 615 | for paper in parsed_papers_raw: 616 | print(paper['result']['paper-title']) 617 | cited_by_group = False 618 | for s in all_normalised_citations: 619 | if s == paper['result']['paper-title-normalised']: 620 | cited_by_group = True 621 | if cited_by_group: 622 | cited_papers.append(paper) 623 | else: 624 | noncited_papers.append(paper) 625 | #append_to_report(workdir, "######################################") 626 | append_to_report(workdir, "Cited papers:") 627 | #append_to_report(workdir, "######################################") 628 | for p in cited_papers: 629 | append_to_report(workdir, "\t%s" % (p['result']['paper-title'])) 630 | 631 | 632 | #append_to_report(workdir, "######################################") 633 | append_to_report(workdir, "Noncited papers:") 634 | #append_to_report(workdir, "######################################") 635 | for p in noncited_papers: 636 | append_to_report(workdir, "\t%s" % (p['result']['paper-title'])) 637 | 638 | #exit(0) 639 | 640 | # Now display the content. 
641 | append_to_report(workdir, "Succ: %d" % (len(succ))) 642 | append_to_report(workdir, "Fail: %d" % (len(fail))) 643 | append_to_report(workdir, "Succ sigs: %d" % (len(succ_sigs))) 644 | append_to_report(workdir, "Failed sigs: %d" % (len(miss_sigs))) 645 | 646 | #append_to_report(workdir, "Summary of references:") 647 | # Now do some status on how many references we have in total 648 | all_refs = dict() 649 | #for siglist in succ_sigs: 650 | for sig in succ_sigs: 651 | # Normalise 652 | #append_to_report(workdir, "Normalising: %s" % (sig['Title'])) 653 | tmp_sig = sig['Title'].lower().replace(",", "").replace(" ", 654 | "").replace( 655 | "-", "") 656 | #append_to_report(workdir, "\tNormalised: %s" % (tmp_sig)) 657 | 658 | #if "\xe2\x80\x9c" in tmp_sig: 659 | # s1_split = tmp_sig.split("\xe2\x80\x9c") 660 | # if "\xe2\x80\x9d" in s1_split[1]: 661 | # tmp_sig = s1_split[1].split("\xe2\x80\x9d")[0] 662 | # else: 663 | # tmp_sig=s1_split[1] 664 | 665 | if tmp_sig not in all_refs: 666 | all_refs[tmp_sig] = 0 667 | all_refs[tmp_sig] += 1 668 | 669 | #if sig['Title'].lower() not in all_refs: 670 | # all_refs[sig['Title'].lower().strip()] = 0 671 | #all_refs[sig['Title'].lower().strip()] += 1 672 | 673 | append_to_report(workdir, "######################################") 674 | append_to_report(workdir, "All Titles (normalised) of the papers in the data set:") 675 | #append_to_report(workdir, "######################################") 676 | for t in all_titles: 677 | append_to_report(workdir, "\t%s" % (t)) 678 | append_to_report(workdir, "######################################") 679 | 680 | append_to_report(workdir, "######################################") 681 | append_to_report(workdir, "All references/citations (normalised) issued by these papers") 682 | #append_to_report(workdir, "######################################") 683 | append_to_report(workdir, "\tName : Count") 684 | sorted_list = list() 685 | for title in all_refs: 686 | sorted_list.append((title, all_refs[title])) 687 | sorted_list = sorted(sorted_list, key=lambda x: x[1]) 688 | for title, counts in sorted_list: 689 | append_to_report(workdir, "\t%s : %d" % (title, counts)) 690 | 691 | append_to_report(workdir, "######################################") 692 | append_to_report(workdir, "The total number of unique citations: %d" % (len(sorted_list))) 693 | #append_to_report(workdir, "######################################") 694 | return 695 | #exit(0) 696 | 697 | #all_missing_sigs = [] 698 | #for missig in miss_sigs: 699 | # all_missing_sigs += missig 700 | 701 | #for miss in missig: 702 | # print("###\t%s"%(str(miss))) 703 | 704 | #print( 705 | # "[+] The references that we were unable to fully parse, but yet are cited by the papers:" 706 | #) 707 | #print( 708 | # "Total amount of references not parsed and thus not included in analysis: %d" 709 | # % (len(all_missing_sigs))) 710 | #print("All of these references:") 711 | #for missig in all_missing_sigs: 712 | # print("###\t%s" % (str(missig))) 713 | 714 | 715 | def parse_first_stage(workdir, target_dir): 716 | parsed_papers = [] 717 | for json_f in os.listdir(target_dir): 718 | if ".json" in json_f: 719 | complete_filename = os.path.join(target_dir, json_f) 720 | print("Checking: %s" % (complete_filename)) 721 | res = read_json_file(complete_filename) 722 | parsed_papers.append({ 723 | "filename": complete_filename, 724 | "result": res 725 | }) 726 | #display_summary(parsed_papers) 727 | 728 | parsed_papers_json = os.path.join(workdir, "parsed_paper_data.json") 729 | with 
open(parsed_papers_json, "w+", encoding='utf-8') as ppj:
730 |         json.dump(parsed_papers, ppj)
731 | 
732 |     return parsed_papers
733 | 
734 | 
735 | if __name__ == "__main__":
736 |     # Stand-alone mode: parse all json files in the current directory
737 |     # (optionally filtered by a substring given as the first argument)
738 |     # and write the summary report to the current directory.
739 |     parsed_papers = []
740 |     for json_f in os.listdir("."):
741 |         if ".json" in json_f:
742 |             if len(sys.argv) == 2:
743 |                 if sys.argv[1] not in json_f:
744 |                     continue
745 |             print("Checking: %s" % (json_f))
746 | 
747 |             res = read_json_file(json_f)
748 |             if res == None:
749 |                 sys.stdout.flush()
750 |                 continue
751 |             parsed_papers.append({"filename": json_f, "result": res})
752 | 
753 |     print("Parsed %d json files" % (len(parsed_papers)))
754 |     display_summary(".", parsed_papers)
755 | 
--------------------------------------------------------------------------------
/pq_main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import simplejson as json
4 | import argparse
5 | import shutil
6 | 
7 | import pq_format_reader
8 | import pq_pdf_utility
9 | 
10 | 
11 | ### Utilities
12 | def create_working_dir(target_dir):
13 |     if os.path.isdir(target_dir):
14 |         shutil.rmtree(target_dir)
15 |     os.mkdir(target_dir)
16 | 
17 | 
18 | def create_workdir():
19 |     MDIR = "pq-out"
20 |     if not os.path.isdir(MDIR):
21 |         os.mkdir(MDIR)
22 | 
23 |     FNAME = "out-"
24 |     max_idx = -1
25 |     for l in os.listdir(MDIR):
26 |         if FNAME in l:
27 |             try:
28 |                 val = int(l.replace(FNAME, ""))
29 |                 if val > max_idx:
30 |                     max_idx = val
31 |             except:
32 |                 None
33 |     max_idx += 1
34 |     new_workdir = os.path.join(MDIR, "%s%d" % (FNAME, max_idx))
35 |     print("Making the work directory: %s" % (new_workdir))
36 |     os.mkdir(new_workdir)
37 |     return new_workdir
38 | 
39 | 
40 | ### Action functions
41 | def convert_pdfs_to_data(paper_dir, workdir, target_dir):
42 |     # First extract data from the pdf files, such as
43 |     # title as well as the references cited by each paper.
44 |     # This will produce json files with the data that we need.
45 |     filtered_pdf_list = pq_pdf_utility.convert_folder(workdir, paper_dir)
46 | 
47 |     # Now write it out to a json folder
48 |     create_working_dir(target_dir)
49 |     pq_pdf_utility.write_to_json(filtered_pdf_list, target_dir)
50 | 
51 | 
52 | def read_parsed_json_data(workdir, target_dir):
53 |     # Reads the json files produced by the pdf conversion stage, and
54 |     # extracts the various forms of data.
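    # Output sketch (file names taken from the functions called below): this
    # stage populates the work directory with, roughly,
    #   report.txt                       - human readable summary report
    #   parsed_paper_data.json           - raw parsed data for every paper
    #   img/citation_graph(.png)         - rendered citation graph
    #   normalised_dependency_graph.json - who-cites-who data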
55 | parsed_papers = pq_format_reader.parse_first_stage(workdir, target_dir) 56 | 57 | pq_format_reader.display_summary(workdir, parsed_papers) 58 | 59 | # Now extract the group dependencies 60 | pq_format_reader.identify_group_dependencies(workdir, parsed_papers) 61 | 62 | 63 | ### CLI 64 | def parse_args(): 65 | parser = argparse.ArgumentParser( 66 | "pq_main.py", description="Quantitative analysis of papers") 67 | 68 | parser.add_argument("-f", "--folder", help="Folder with PDFs") 69 | parser.add_argument("--task", 70 | help="Specify a given task", 71 | default="dependency-graph") 72 | parser.add_argument("--wd", help="Workdir") 73 | 74 | args = parser.parse_args() 75 | if args.folder != None: 76 | print("Analysing folder the pdfs in folder: %s" % (args.folder)) 77 | 78 | return args 79 | 80 | 81 | if __name__ == "__main__": 82 | args = parse_args() 83 | if args.task == "json-only": 84 | read_parsed_json_data(args.wd, os.path.join(args.wd, "json_data")) 85 | elif args.task == "dependency-graph": 86 | workdir = create_workdir() 87 | json_data_dir = os.path.join(workdir, "json_data") 88 | 89 | convert_pdfs_to_data(args.folder, workdir, json_data_dir) 90 | read_parsed_json_data(workdir, json_data_dir) 91 | else: 92 | print("Task not supported") 93 | -------------------------------------------------------------------------------- /pq_pdf_utility.py: -------------------------------------------------------------------------------- 1 | import xml.etree.ElementTree as ET 2 | import subprocess 3 | import shutil 4 | import sys 5 | import os 6 | import simplejson as json 7 | 8 | 9 | def print_recursively(elem, depth): 10 | # The below is only for debugging 11 | #print("\t"*depth + "%s - %s"%(elem.tag, elem.attrib)) 12 | #if "size" in elem.attrib: 13 | #sz = float(elem.attrib['size']) 14 | #print("We got a size: %f"%(sz)) 15 | #if sz > 20.0: 16 | # print("Text: %s"%(elem.text)) 17 | for child in elem: 18 | print_recursively(child, depth + 1) 19 | 20 | 21 | def get_all_texts_rec(elem): 22 | ''' 23 | Extracts all elements from an xml 24 | file generatex by pdf2txt. As such, this function 25 | is simply a partial XML parser. 26 | This is a recursive function where the recursion 27 | is used to capture recursive XML elements. 28 | ''' 29 | global saw_page2 30 | all_texts = [] 31 | 32 | # If we hit page 2, then we return empty list 33 | if saw_page2 == True: 34 | return [] 35 | 36 | # We don't want to move beyond page 2, as the 37 | # we assume the title must have been given here. 38 | if elem.tag == "page": 39 | if int(elem.attrib['id']) > 3: 40 | saw_page2 = True 41 | return [] 42 | 43 | # Extract all XML elements recursively 44 | for c in elem: 45 | tmp_txts = get_all_texts_rec(c) 46 | for t2 in tmp_txts: 47 | all_texts.append(t2) 48 | 49 | # If the current elem is a text element, 50 | # add this to our list of all texts. 51 | if elem.tag == "text": 52 | all_texts.append(elem) 53 | return all_texts 54 | 55 | 56 | def get_all_texts(elem): 57 | global saw_page2 58 | saw_page2 = False 59 | return get_all_texts_rec(elem) 60 | 61 | 62 | def get_title(the_file): 63 | ''' 64 | Gets the title of the paper by: 65 | 1) Reading an xml representation of the PDF converted by pdf2txt 66 | 2) Searching for the top font size of page 1 67 | 3) Exiting if we reach page 2 of the paper. 68 | 69 | returns 70 | @max_string which is the string with the max font in a paper 71 | @second_max_string which is the string with the second max font 72 | in a paper. 
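    In a typical paper the largest font on the first pages is the title, so
    max_string is usually the title and second_max_string is usually the
    author list or venue; the caller decides which of the two to keep (see
    should_select_title below).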
73 | ''' 74 | print("[+] Getting title") 75 | print("Extracting content from %s" % (the_file)) 76 | #print("current working directory: %s"%(os.getcwd())) 77 | try: 78 | tree = ET.parse(the_file) 79 | except: 80 | print("[+] Error when ET.parse of the file %s" % (the_file)) 81 | return None 82 | 83 | root = tree.getroot() 84 | #print_recursively(root, 0) 85 | all_texts = get_all_texts(root) 86 | sizes = dict() 87 | latest_size = None 88 | for te in all_texts: 89 | #print("%s-%s"%(te.attrib, te.text)) 90 | if 'size' not in te.attrib: 91 | if latest_size != None: 92 | sizes[latest_size].append(" ") 93 | continue 94 | sz = float(te.attrib['size']) 95 | latest_size = sz 96 | if sz not in sizes: 97 | sizes[sz] = list() 98 | sizes[sz].append(te.text) 99 | 100 | # We now have all the text elements and we can proceed 101 | # to extract the elements with the highest and second-highest 102 | # font sizes. 103 | sorted_sizes = sorted(sizes.keys()) 104 | 105 | # Highest font size 106 | max_string = "" 107 | for c in sizes[sorted_sizes[-1]]: 108 | max_string += c 109 | 110 | # Second highest font size 111 | second_max_string = "" 112 | for c in sizes[sorted_sizes[-2]]: 113 | second_max_string += c 114 | 115 | # Log them 116 | #print("max %s"%(max_string)) 117 | #print("second max %s"%(second_max_string)) 118 | 119 | # Print the bytes: only used for debugging 120 | ords = list() 121 | for c in max_string: 122 | ords.append(hex(ord(c))) 123 | s1 = " ".join(ords) 124 | #print("\t\t%s"%(s1)) 125 | #print("\tCompleted getting title") 126 | return max_string, second_max_string 127 | 128 | 129 | def get_references(text_file): 130 | ''' 131 | Reads a txt file converted by pdf2txt and extracts 132 | the "references" section of the paper. 133 | 134 | This function assumes the references are at the end 135 | of a paper and that there might be an appendix after. 136 | As such, we read the text from "References" until end 137 | of the file, and only stop in case we hit an 138 | "appendix" keyword. 139 | ''' 140 | references = "" 141 | reading_references = False 142 | with open(text_file, "r") as tf: 143 | for l in tf: 144 | if reading_references == True and "appendix" in l.lower(): 145 | reading_references = False 146 | 147 | if reading_references: 148 | references += l 149 | 150 | if "references" in l.lower(): 151 | #print("We have a line with references: %s"%(l)) 152 | if len(l.split(" ")) < 3: 153 | reading_references = True 154 | #print("References: %s"%(references)) 155 | return references 156 | 157 | 158 | def convert_folder(workdir, folder_name): 159 | ''' 160 | Converts an entire folder of PDFs into representations 161 | where we have the: 162 | title 163 | references 164 | for each paper. 165 | 166 | 167 | returns a list of dictionaries, where each element 168 | in the dictionary holds data about a given paper. 169 | ''' 170 | all_titles = list() 171 | all_seconds = list() 172 | title_pairs = list() 173 | 174 | paper_list = [] 175 | data_out_dir = os.path.join(workdir, "data_out") 176 | if not os.path.isdir(data_out_dir): 177 | os.mkdir(data_out_dir) 178 | for l in os.listdir(folder_name): 179 | if ".pdf" not in l: 180 | continue 181 | 182 | print("======================================") 183 | print("[+] working on %s" % (l)) 184 | paper_dict = { 185 | "file": l, 186 | "title": None, 187 | "second_title": None, 188 | "references": None, 189 | "success": False 190 | } 191 | 192 | # First step. 193 | # Convert the pdf to xml format. This is for getting 194 | # the title of the paper. 
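        # Illustration (assuming pdfminer's pdf2txt.py tool is on the PATH):
        # for example-papers/1908.09204.pdf this builds a command along the
        # lines of
        #   pdf2txt.py -t xml -o <workdir>/data_out/1908.09204.pdf_analysed.xml example-papers/1908.09204.pdf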
195 | target_xml = os.path.join(data_out_dir, "%s_analysed.xml" % (l)) 196 | cmd = ["pdf2txt.py", "-t", "xml", "-o", target_xml] 197 | cmd.append(os.path.join(folder_name, l)) 198 | try: 199 | subprocess.check_call(" ".join(cmd), shell=True) 200 | except: 201 | paper_list.append(paper_dict) 202 | print("Could not execute the call") 203 | continue 204 | 205 | try: 206 | res = get_title(target_xml) 207 | if res == None: 208 | paper_list.append(paper_dict) 209 | continue 210 | the_title, second = res 211 | except: 212 | # print("Exception in get_title") 213 | paper_list.append(paper_dict) 214 | continue 215 | all_titles.append(the_title) 216 | all_seconds.append(second) 217 | 218 | # Second step. 219 | # Convert the pdf to txt format. This step if for getting 220 | # the references of the paper. 221 | target_txt = os.path.join(data_out_dir, "%s_analysed.txt" % (l)) 222 | cmd = ["pdf2txt.py", "-t", "text", "-o", target_txt] 223 | cmd.append(os.path.join(folder_name, l)) 224 | try: 225 | subprocess.check_call(" ".join(cmd), shell=True) 226 | except: 227 | print("Could not execute the call") 228 | paper_list.append(paper_dict) 229 | continue 230 | 231 | references = get_references(target_txt) 232 | paper_dict = { 233 | "file": l, 234 | "title": the_title, 235 | "second_title": second, 236 | "references": references, 237 | "success": True 238 | } 239 | paper_list.append(paper_dict) 240 | #print("Adding: %s ----- %s"%(the_title, second)) 241 | 242 | return paper_list 243 | 244 | 245 | def should_select_title(title): 246 | # Check if there is any characters with value greater than 0xff in first 247 | # If this is the case then we use the second highest font for title. 248 | total_above = 0 249 | for c in title: 250 | if ord(c) > 0xff: 251 | total_above += 1 252 | if total_above > 3: 253 | return False 254 | 255 | # Now check if "(cid:" is in the name:. If it is, 256 | # then we will use the second highest font for title. 
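    # pdf2txt emits placeholders such as "(cid:123)" for glyphs it cannot map
    # to text, so a candidate title containing "(cid:" is most likely garbled.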
257 | if "(cid:" in str(title): 258 | return False 259 | 260 | return True 261 | 262 | 263 | def write_to_json(results, target_directory=None): 264 | counter = 0 265 | for paper_dict in results: 266 | if paper_dict['success'] == False: 267 | print("####### %s" % (paper_dict['file'])) 268 | print("Unsuccessful") 269 | print("-" * 60) 270 | continue 271 | 272 | first = paper_dict['title'] 273 | second = paper_dict['second_title'] 274 | use_second = not should_select_title(first) 275 | 276 | # Create a json dictionary and write it to the file system 277 | json_dict = { 278 | "Title": first if use_second == False else second, 279 | "References": paper_dict['references'], 280 | "Year": "1999", 281 | "Authors": "David", 282 | "ReferenceType": "Automatically" 283 | } 284 | filepath = "json_dump_%d.json" % (counter) 285 | if target_directory != None: 286 | filepath = os.path.join(target_directory, filepath) 287 | with open(filepath, "w+") as jf: 288 | json.dump(json_dict, jf) 289 | counter += 1 290 | 291 | # Now print the content for convenience 292 | print("########## %s" % (paper_dict['file'])) 293 | print("[+] Title: %s" % (json_dict['Title'])) 294 | #print("[+] References: ") 295 | #print("%s"%(paper_dict['references'])) 296 | print("-" * 60) 297 | 298 | 299 | if __name__ == "__main__": 300 | results = convert_folder("papers") 301 | target_dir = "json_data" 302 | if os.path.isdir(target_dir): 303 | shutil.rmtree(target_dir) 304 | os.mkdir(target_dir) 305 | os.chdir(target_dir) 306 | write_to_json(results) 307 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | wheel 2 | graphviz 3 | matplotlib 4 | pdf2txt 5 | requests 6 | simplejson 7 | git+git://github.com/kermitt2/grobid_client_python@4bce8b574da363171079b97eb84fe5ecfce8cfdb#egg=grobid_client 8 | pymongo 9 | docopt 10 | bs4 11 | serde 12 | lxml --------------------------------------------------------------------------------