├── .gitignore ├── .gitmodules ├── LICENSE ├── README.md ├── __init__.py ├── alexandria ├── README.md ├── base.py ├── grobid.json ├── runner.py └── visualization.py ├── docs ├── larger_example.md └── simple_example.md ├── example-images ├── fuzz-barplot.png ├── fuzz-wordcloud.png └── paper-citation-graph.png ├── example-papers ├── 1908.09204.pdf └── 1908.10167.pdf ├── gen_wordcloud.py ├── install.sh ├── pq_format_reader.py ├── pq_main.py ├── pq_pdf_utility.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | pq-out/* 2 | paper-pdfs 3 | venv 4 | tmp 5 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/.gitmodules -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 AdaLogics Ltd 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Paper analyser 2 | The goal of this project is to enable quantitative analysis of 3 | academic papers. 4 | 5 | To achieve the goal, the project contains logic for 6 | 7 | 1. Parsing academic white papers into structured representation 8 | 1. Doing analysis on the structured representations 9 | 10 | #### Paper dependency graph 11 | The project as it currently stands focuses on the task of taking 12 | a list of arbitrary papers in the form of PDFs, and then creating 13 | a dependency graph of citations amongst these papers. This graph 14 | then shows how each of the PDFs reference each other. Paper analyser 15 | achieves this by going through the steps: 16 | 17 | 1. Parse the papers to extract relevant data 18 | 1. Read the PDF files to a format usable in Python 19 | 1. Extract title of a given paper 20 | 1. Extract the raw data of the "References" section 21 | 1. Parse the raw "References" section into individual refereces: 22 | 1. Extract the title and authors of the citation 23 | 1. Normalise the data of the extracted citations 24 | 1. 
Do dependency analysis based on the above citation extractions 25 | 26 | 27 | ## Usage 28 | For a simple example of usage, look at [simple_example.md](/docs/simple_example.md) 29 | 30 | Paper analyser takes as input PDF files of academic papers and outputs data about these papers. 31 | For convenience we maintain a list of links to software analysis papers 32 | focused on software security in our sister repository [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list) 33 | 34 | To see an example of doing analysis on many papers, look at the explanation in [larger_example.md](/docs/larger_example.md) 35 | 36 | ## Example visualisation 37 | 38 | We have also created visualisations for the output of the paper 39 | analyser, which make it easy to rapidly understand the 40 | relationships between the academic papers in the data set. 41 | 42 | See here for an example of the visualisations: 43 | [https://adalogics.com/software-security-research-citations-visualiser](https://adalogics.com/software-security-research-citations-visualiser) 44 | 45 | These visualisations will be open sourced in the near future. 46 | 47 | 48 | ### Citation graph example 49 | 50 | ![Paper citation graph](/example-images/paper-citation-graph.png) 51 | 52 | 53 | ### Wordcloud of 85 fuzzing papers 54 | Example of a wordcloud generated from the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This wordcloud discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/) 55 | ![Wordcloud of fuzzing papers](/example-images/fuzz-wordcloud.png) 56 | 57 | ### Word count of 85 fuzzing papers 58 | A bar plot of word frequencies in the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This plot discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/) 59 | ![Word-frequency bar plot of fuzzing papers](/example-images/fuzz-barplot.png) 60 | 61 | ## Installation 62 | ``` 63 | git clone https://github.com/AdaLogics/paper-analyser 64 | cd paper-analyser 65 | ./install.sh 66 | ``` 67 | 68 | ## Contribute 69 | We welcome contributions. 70 | 71 | Paper analyser is maintained by: 72 | * [David Korczynski](https://twitter.com/Davkorcz) 73 | * [Adam Korczynski](https://twitter.com/AdamKorcz4) 74 | * [Giovanni Cherubin](https://twitter.com/gchers) 75 | 76 | We are particularly interested in features for: 77 | 1. Improved parsing of the PDF files to get better structured output 78 | 1. Adding more data analysis to the project 79 | 80 | 81 | ### Feature suggestions 82 | If you would like to contribute but don't have a feature in mind, please see the list below for suggestions: 83 | 84 | * Extraction of authors from papers 85 | * Extraction of the actual text from the papers (see the sketch below).
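As a rough illustration of the text-extraction suggestion above, here is a minimal, hypothetical sketch; it is not part of paper-analyser and assumes the `pdfminer.six` package is installed:

```python
# Hypothetical sketch only -- not the project's own PDF-handling code.
# Assumes: pip install pdfminer.six
import sys

from pdfminer.high_level import extract_text


def pdf_to_text(pdf_path):
    """Return the raw text content of a PDF as a single string."""
    return extract_text(pdf_path)


if __name__ == "__main__":
    # Example: python pdf_to_text.py example-papers/1908.09204.pdf
    print(pdf_to_text(sys.argv[1]))
```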
This could be used for a lot of cool data analysis 86 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/__init__.py -------------------------------------------------------------------------------- /alexandria/README.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | Before using Alexandria, you need to have two services running: 4 | `grobid` (used for processing pdf files) and `mongodb`. 5 | 6 | 7 | ### Grobid 8 | 9 | You first need to have an instance of [`grobid`](https://grobid.readthedocs.io/en/latest/) running. 10 | 11 | Using Docker: 12 | ``` 13 | docker run -t --rm --init -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.6.2 14 | ``` 15 | The above command will download the image and run it. 16 | (Add `-d` to run in daemon mode.) 17 | 18 | Verify that `grobid` is running at http://localhost:8070. 19 | 20 | 21 | If you configure `grobid` to listen on a different host/port, please adapt 22 | the `grobid.json` configuration. 23 | 24 | ### Mongodb 25 | 26 | Alexandria requires you to have a mongodb instance running 27 | locally at the standard port (`27017`). 28 | 29 | Using docker: 30 | 31 | ``` 32 | docker run -p 27017:27017 mongo 33 | ``` 34 | (Add `-d` to run in daemon mode.) 35 | 36 | ### Alexandria 37 | 38 | You're now ready to install Alexandria. 39 | 40 | ``` 41 | git clone https://github.com/AdaLogics/paper-analyser 42 | cd paper-analyser/alexandria 43 | python3 -m venv venv 44 | . venv/bin/activate 45 | pip install -r ../requirements.txt 46 | ``` 47 | 48 | 49 | # Usage 50 | 51 | To process the pdf files in `example-papers` launch: 52 | 53 | ``` 54 | python runner.py ../example-papers 55 | ``` 56 | 57 |
58 | Click to see the output. 59 | 60 | ``` 61 | Please check out the script for more info. 62 | GROBID server is up and running 63 | Connecting to db. 64 | Processing files in ../example-papers/ 65 | Parsing 1908.09204.pdf 66 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction 67 | Authors: David Korczynski 68 | References: 69 | Title: System-Level Support for Intrusion Recovery 70 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos 71 | ------------------- 72 | Title: WYSINWYX: What You See Is Not What You eXecute 73 | Authors: G Balakrishnan, T Reps, D Melski, T Teitelbaum 74 | ------------------- 75 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code 76 | Authors: Bao Ti Any, Jonathan Burket, Maverick Woo, Rafael Turner, David Brumley 77 | ------------------- 78 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps 79 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla 80 | ------------------- 81 | Title: Bee Master: Detecting Host-Based Code Injection A acks 82 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla 83 | ------------------- 84 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions 85 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel 86 | ------------------- 87 | Title: Minemu: e World's Fastest Taint Tracker 88 | Authors: Erik Bosman, Asia Slowinska, Herbert Bos 89 | ------------------- 90 | Title: Decoupling Dynamic Program Analysis from Execution in Virtual Environments 91 | Authors: Jim Chow, Peter Chen 92 | ------------------- 93 | Title: Decompilation of Binary Programs. So w 94 | Authors: Cristina Cifuentes, K. 
John Gough 95 | ------------------- 96 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 97 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee 98 | ------------------- 99 | Title: Repeatable Reverse Engineering with PANDA 100 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan 101 | ------------------- 102 | Title: Graph-based comparison of executable objects (english version) 103 | Authors: Rolf Omas Dullien, Rolles 104 | ------------------- 105 | Title: Signatures for Library Functions in Executable Files 106 | Authors: Mike Van Emmerik 107 | ------------------- 108 | Title: Structural Comparison of Executable Objects 109 | Authors: ; Halvar Flake, G Sig Sidar, Workshop 110 | ------------------- 111 | Title: A Study of the Packer Problem and Its Solutions 112 | Authors: Fanglu Guo, Peter Ferrie, Tzi-Cker Chiueh 113 | ------------------- 114 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 115 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant 116 | ------------------- 117 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features 118 | Authors: Xin Hu, Sandeep Bhatkar, Kent Gri N, Kang Shin 119 | ------------------- 120 | Title: 121 | Authors: Hungenberg, Eckert Ma Hias 122 | ------------------- 123 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 124 | Authors: K Kyriakos, Mathias Ispoglou, Payer 125 | ------------------- 126 | Title: Labeling Library Functions in Stripped Binaries 127 | Authors: Emily Jacobson, Nathan Rosenblum, Barton Miller 128 | ------------------- 129 | Title: Secure and advanced unpacking using computer emulation 130 | Authors: Sébastien Josse 131 | ------------------- 132 | Title: Malware Dynamic Recompilation 133 | Authors: S Josse 134 | ------------------- 135 | Title: Renovo: A Hidden Code Extractor for Packed Executables 136 | Authors: Min Kang, Pongsin Poosankam, Heng Yin 137 | ------------------- 138 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation 139 | Authors: Yuhei Kawakoya, Makoto Iwamura, Jun Miyoshi 140 | ------------------- 141 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 142 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura 143 | ------------------- 144 | Title: Jakstab: A Static Analysis Platform for Binaries 145 | Authors: Johannes Kinder, Helmut Veith 146 | ------------------- 147 | Title: Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code 148 | Authors: Clemens Kolbitsch, Engin Kirda, Christopher Kruegel 149 | ------------------- 150 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction 151 | Authors: David Korczynski 152 | ------------------- 153 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks 154 | Authors: David Korczynski, Heng Yin 155 | ------------------- 156 | Title: Polymorphic Worm Detection Using Structural Information of Executables 157 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, Giovanni Vigna 158 | ------------------- 159 | Title: Static Disassembly of Obfuscated Binaries 160 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, Giovanni Vigna 161 | ------------------- 162 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects 163 | Authors: Yujia Li, Chenjie Gu, Omas Dullien 164 | ------------------- 
165 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware 166 | Authors: L Martignoni, M Christodorescu, S Jha 167 | ------------------- 168 | Title: Measuring and Defeating Anti-Instrumentation-Equipped Malware 169 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Lorenzo Stefano D'alessio, Fabio Fontana, Stefano Gri, Zanero 170 | ------------------- 171 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks 172 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos 173 | ------------------- 174 | Title: 175 | Authors: Symantec Security Response 176 | ------------------- 177 | Title: Learning to Analyze Binary Computer Code 178 | Authors: Nathan Rosenblum, Xiaojin Zhu, Barton Miller, Karen Hunt 179 | ------------------- 180 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware 181 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, Wenke Lee 182 | ------------------- 183 | Title: Eureka: A Framework for Enabling Static Malware Analysis 184 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, Wenke Lee 185 | ------------------- 186 | Title: Binary Translation 187 | Authors: Richard Sites, Anton Cherno, B Ma Hew, Maurice Kirk, Sco Marks, Robinson 188 | ------------------- 189 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis 190 | Authors: Wei Song, Chang Heng Yin, Dawn Liu, Song 191 | ------------------- 192 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers 193 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas 194 | ------------------- 195 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis 196 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda 197 | ------------------- 198 | 199 | Parsing 1908.10167.pdf 200 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction 201 | Authors: David Korczynski, Createprocessa, Createfilea, , 202 | References: 203 | Title: System-Level Support for Intrusion Recovery 204 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos 205 | ------------------- 206 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps 207 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla 208 | ------------------- 209 | Title: Bee Master: Detecting Host-Based Code Injection A acks 210 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla 211 | ------------------- 212 | Title: Host-based code injection a acks: A popular technique used by malware 213 | Authors: Elmar Omas Barabosch, Gerhards-Padilla 214 | ------------------- 215 | Title: A View on Current Malware Behaviors 216 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzaro I, Engin Kirda 217 | ------------------- 218 | Title: Dynamic Analysis of Malicious Code 219 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, Engin Kirda 220 | ------------------- 221 | Title: Dridex's Cold War 222 | Authors: Magal Baz, Or Safran 223 | ------------------- 224 | Title: QEMU, a Fast and Portable Dynamic Translator 225 | Authors: Fabrice Bellard 226 | ------------------- 227 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions 228 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel 229 | ------------------- 230 | Title: 
Understanding Linux Malware 231 | Authors: E Cozzi, M Graziano, Y Fratantonio, D Balzaro I 232 | ------------------- 233 | Title: Understanding Linux malware 234 | Authors: Emanuele Cozzi, Mariano Graziano, Yanick Fratantonio, Davide Balzaro I 235 | ------------------- 236 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 237 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee 238 | ------------------- 239 | Title: Repeatable Reverse Engineering with PANDA 240 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan 241 | ------------------- 242 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools 243 | Authors: Manuel Egele, Engin Eodoor Scholte, Christopher Kirda, Kruegel 244 | ------------------- 245 | Title: A Survey of Mobile Malware in the Wild 246 | Authors: Adrienne Felt, Ma Hew Fini Er, Erika Chin, Steve Hanna, David Wagner 247 | ------------------- 248 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 249 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant 250 | ------------------- 251 | Title: Ten Process Injection Techniques: A Technical Survey Of Common And Trending Process Injection Techniques 252 | Authors: Ashkan Hosseini 253 | ------------------- 254 | Title: 255 | Authors: Hungenberg, Eckert Ma Hias 256 | ------------------- 257 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 258 | Authors: K Kyriakos, Mathias Ispoglou, Payer 259 | ------------------- 260 | Title: Renovo: A Hidden Code Extractor for Packed Executables 261 | Authors: Min Kang, Pongsin Poosankam, Heng Yin 262 | ------------------- 263 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 264 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura 265 | ------------------- 266 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction 267 | Authors: David Korczynski 268 | ------------------- 269 | Title: Precise system-wide concatic malware unpacking. 
arXiv e-prints 270 | Authors: David Korczynski 271 | ------------------- 272 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks 273 | Authors: David Korczynski, Heng Yin 274 | ------------------- 275 | Title: 276 | Authors: Pasquale Giulio De 277 | ------------------- 278 | Title: 279 | Authors: Daniel Plohmann, Martin Clauß, Elmar Padilla 280 | ------------------- 281 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks 282 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos 283 | ------------------- 284 | Title: AVclass: A Tool for Massive Malware Labeling 285 | Authors: Marcos Sebastián, Richard Rivera, Platon Kotzias, Juan Caballero 286 | ------------------- 287 | Title: Malrec: Compact Full-Trace Malware Recording for Retrospective Deep Analysis 288 | Authors: Giorgio Severi, Tim Leek, Brendan Dolan-Gavi 289 | ------------------- 290 | Title: Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro 291 | Authors: Kimberly Tam, Ali Feizollah 292 | ------------------- 293 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers 294 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas 295 | ------------------- 296 | Title: Deep Ground Truth Analysis of Current Android Malware 297 | Authors: Fengguo Wei, Yuping Li, Sankardas Roy, Xinming Ou, Wu Zhou 298 | ------------------- 299 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis 300 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda 301 | ------------------- 302 | Title: Dissecting Android Malware: Characterization and Evolution 303 | Authors: Yajin Zhou, Xuxian Jiang 304 | ------------------- 305 | ``` 306 |
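`runner.py` persists each parsed article into the `articles` collection of the `alexandria` database. As a quick sanity check, the stored documents can be read back and summarised; the following is a minimal sketch (assuming the same local MongoDB instance) that reuses the `Article.from_dict` helper from `base.py`, in the spirit of `visualization.py`:

```python
from pymongo import MongoClient

from base import Article

# Connect to the same local MongoDB instance that runner.py writes to.
client = MongoClient("localhost", 27017)
db = client.alexandria

# Rebuild Article objects from the stored documents and print a short overview.
for doc in db.articles.find():
    article = Article.from_dict(doc)
    print(f"{article.title} ({len(article.references or [])} references)")
```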
307 | 308 | # Processing data 309 | 310 | A simple example of how to process the data: [visualization.py](visualization.py). 311 | -------------------------------------------------------------------------------- /alexandria/base.py: -------------------------------------------------------------------------------- 1 | """Contains the data structures and parsing functions to process 2 | and store data contained in an academic article. 3 | """ 4 | import re 5 | from bs4 import BeautifulSoup 6 | from serde import Model, fields 7 | 8 | 9 | def text(field): 10 | return field.text.strip() if field else "" 11 | 12 | 13 | class Author(Model): 14 | """Contains information about an author. 15 | """ 16 | name: fields.Str() 17 | surname: fields.Str() 18 | affiliation: fields.Optional(fields.List(fields.Str())) 19 | 20 | def __init__(self, soup): 21 | """Creates an `Author` by parsing a `soup` of type 22 | `BeautifulSoup`. 23 | """ 24 | if not soup.persname: 25 | self.name = "" 26 | self.surname = "" 27 | else: 28 | self.name = text(soup.persname.forename) 29 | self.surname = text(soup.persname.surname) 30 | # TODO: better affiliation parsing. 31 | self.affiliation = list(map(text, soup.find_all("affiliation"))) 32 | 33 | def __str__(self): 34 | s = "" 35 | if self.name: 36 | s += self.name + " " 37 | if self.surname: 38 | s += self.surname 39 | 40 | return s.strip() 41 | 42 | 43 | class _Article(Model): 44 | """This is required by serde for serialization, which 45 | is unable to take references to self as a field. 46 | For all purposes, refer to `Article`. 47 | """ 48 | pass 49 | 50 | 51 | class Article(_Article): 52 | """Represents an academic article or a reference contained in 53 | an article. 54 | 55 | The data is parsed from a TEI XML file (`from_file()`) or 56 | directly from a `BeautifulSoup` object. 57 | """ 58 | title: fields.Str() 59 | text: fields.Str() 60 | authors: fields.List(Author) 61 | year: fields.Optional(fields.Date()) 62 | references: fields.Optional(fields.List(_Article)) 63 | 64 | def __init__(self, soup, is_reference=False): 65 | """Create a new `Article` by parsing a `soup: BeautifulSoup` 66 | instance. 67 | The parameter `is_reference` specifies if the `soup` contains 68 | an entire article or just the content of a reference. 69 | """ 70 | self.title = text(soup.title) 71 | self.doi = text(soup.idno) 72 | self.abstract = text(soup.abstract) 73 | self.text = soup.text.strip() if soup.text else "" 74 | # FIXME 75 | self.year = None 76 | 77 | if is_reference: 78 | self.authors = list(map(Author, soup.find_all("author"))) 79 | self.references = [] 80 | else: 81 | self.authors = list(map(Author, soup.analytic.find_all("author"))) 82 | self.references = self._parse_biblio(soup) 83 | 84 | @staticmethod 85 | def from_file(tei_file): 86 | """Creates an `Article` by parsing a TEI XML file. 87 | """ 88 | with open(tei_file) as f: 89 | soup = BeautifulSoup(f, "lxml") 90 | 91 | return Article(soup) 92 | 93 | def _parse_biblio(self, soup): 94 | """Parses the bibliography from an article. 95 | """ 96 | references = [] 97 | # NOTE: we could do this without the regex. 98 | bibs = soup.find_all("biblstruct", {"xml:id": re.compile(r"b[0-9]*")}) 99 | 100 | for bib in bibs: 101 | if bib.analytic: 102 | references.append(Article(bib.analytic, is_reference=True)) 103 | # NOTE: in this case, bib.monogr contains more info 104 | # about the manuscript where the paper was published. 105 | # Not parsing for now. 
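# (Background: in Grobid's TEI output, <analytic> carries article-level metadata, while <monogr> describes the containing journal/volume, hence the fallback below.)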
106 | elif bib.monogr: 107 | references.append(Article(bib.monogr, is_reference=True)) 108 | else: 109 | print(f"Could not parse reference from {bib}") 110 | 111 | return references 112 | 113 | def __str__(self): 114 | return f"'{self.title}' - {' '.join(map(str, self.authors))}" 115 | 116 | def summary(self): 117 | """Prints a human-readable summary. 118 | """ 119 | print(f"Title: {self.title}") 120 | print("Authors: " + ", ".join(map(str, self.authors))) 121 | if self.references: 122 | print("References:") 123 | for r in self.references: 124 | r.summary() 125 | print("-------------------") 126 | -------------------------------------------------------------------------------- /alexandria/grobid.json: -------------------------------------------------------------------------------- 1 | { 2 | "grobid_server": "127.0.0.1", 3 | "grobid_port": "8080", 4 | "batch_size": 1000, 5 | "sleep_time": 5, 6 | "coordinates": [ "persName", "figure", "ref", "biblStruct", "formula" ] 7 | } 8 | -------------------------------------------------------------------------------- /alexandria/runner.py: -------------------------------------------------------------------------------- 1 | """ 2 | Usage: 3 | runner.py 4 | """ 5 | import os 6 | import sys 7 | import pymongo 8 | from docopt import docopt 9 | from bs4 import BeautifulSoup 10 | from grobid_client.grobid_client import GrobidClient 11 | 12 | from base import Article 13 | 14 | # Valid grobid services 15 | FULL = "processFulltextDocument" 16 | HEADER = "processHeaderDocument" 17 | REFS = "processReferences" 18 | 19 | 20 | def parse_pdf(client, pdf_file): 21 | _, status, r = client.process_pdf(service=FULL, 22 | pdf_file=pdf_file, 23 | generateIDs=False, 24 | consolidate_header=True, 25 | consolidate_citations=False, 26 | include_raw_citations=False, 27 | include_raw_affiliations=False, 28 | teiCoordinates=True, 29 | ) 30 | 31 | return Article(BeautifulSoup(r, "lxml")) 32 | 33 | 34 | if __name__ == "__main__": 35 | args = docopt(__doc__) 36 | 37 | grobid = GrobidClient("./grobid.json") 38 | client = pymongo.MongoClient("localhost", 27017) 39 | print("Connecting to db.") 40 | try: 41 | client.server_info() 42 | except pymongo.errors.ServerSelectionTimeoutError: 43 | print("Failed to connect to db.") 44 | sys.exit(1) 45 | 46 | db = client.alexandria 47 | 48 | print(f"Processing files in {args['']}") 49 | 50 | for pdf_file in os.listdir(args[""]): 51 | if not pdf_file.endswith(".pdf"): 52 | continue 53 | print(f"Parsing {pdf_file}") 54 | article = parse_pdf(grobid, os.path.join(args[""], pdf_file)) 55 | # Can also do print(a.to_json()) 56 | article.summary() 57 | print() 58 | 59 | db.articles.insert_one(article.to_dict()) 60 | -------------------------------------------------------------------------------- /alexandria/visualization.py: -------------------------------------------------------------------------------- 1 | import yake 2 | import matplotlib.pyplot as plt 3 | from pymongo import MongoClient 4 | from wordcloud import WordCloud 5 | 6 | from base import Article 7 | 8 | 9 | def find_keywords(articles, top_n=80, ngram_size=1): 10 | text = " ".join([a.text.lower() for a in articles]) 11 | 12 | return yake.KeywordExtractor(top=top_n, n=ngram_size).extract_keywords(text) 13 | 14 | def wordcloud(articles, dst=None): 15 | # Get keywords. 16 | keywords, importance = zip(*find_keywords(articles)) 17 | # Reverse importance. 
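# YAKE assigns lower scores to more relevant keywords, so 1 - score turns them into frequency-like weights for WordCloud.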
18 | importance = map(lambda x: 1-x, importance) 19 | keywords = dict(zip(keywords, importance)) 20 | 21 | wordcloud = WordCloud(background_color="white", 22 | colormap="winter", 23 | #font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf", 24 | height=800, 25 | width=800, 26 | min_font_size=10, 27 | max_font_size=120, 28 | prefer_horizontal=3.0).generate_from_frequencies(keywords) 29 | 30 | plt.figure(figsize=(10.0, 10.0)) 31 | plt.imshow(wordcloud, interpolation="bilinear") 32 | plt.axis("off") 33 | if dst: 34 | plt.savefig(dst) 35 | 36 | 37 | if __name__ == "__main__": 38 | client = MongoClient("localhost", 27017) 39 | db = client.alexandria 40 | articles = list(map(Article.from_dict, db.articles.find())) 41 | 42 | print(f"Fetched {len(articles)} articles.") 43 | 44 | keywords, _ = zip(*find_keywords(articles, top_n=5, ngram_size=2)) 45 | print(f"Top 5 keywords: {keywords}") 46 | 47 | wordcloud(articles, "examples/wordcloud-final.png") 48 | -------------------------------------------------------------------------------- /docs/larger_example.md: -------------------------------------------------------------------------------- 1 | # Example of larger analyses 2 | 3 | Paper analyser relies on PDF file representations of academic papers. 4 | As such, it is up to you to find these papers. 5 | 6 | For convenience we maintain a list of links to software analysis papers 7 | focused on software security in our sister repository [here](https://github.com/AdaLogics/software-security-paper-list) 8 | 9 | As an example of doing analysis on several Fuzzing papers, you can use the following commands: 10 | 11 | ``` 12 | cd paper-analyzer 13 | mkdir tmp && cd tmp 14 | git clone https://github.com/AdaLogics/software-security-paper-list 15 | cd software-security-paper-list 16 | python auto_download.py Fuzzing 17 | ``` 18 | 19 | At this point you will see more than 80 papers in the directory `out/Fuzzing/` 20 | 21 | We continue to do analysis on these papers: 22 | ``` 23 | cd ../.. 24 | python3 pq_main.py -f ./tmp/software-security-paper-list/out/Fuzzing 25 | ``` 26 | 27 | -------------------------------------------------------------------------------- /docs/simple_example.md: -------------------------------------------------------------------------------- 1 | # Simple example 2 | 3 | We include two whitepapers in the repository as examples for using 4 | paper-analyser to pass papers and get results out. 5 | 6 | To try the tool, simply follow the commands: 7 | ``` 8 | cd paper-analyzer 9 | . venv/bin/activate 10 | python3 ./pq_main.py -f ./example-papers/ 11 | ``` 12 | 13 | At this point you will see results in `pq-out/out-0` 14 | 15 | Specifically, you will see: 16 | ``` 17 | $ ls pq-out/out-0/ 18 | data_out img json_data normalised_dependency_graph.json parsed_paper_data.json report.txt 19 | ``` 20 | 21 | * `data_out` contains one `.txt` and one `.xml` file for each PDF. These `.txt` and `.xml` are simply data representations of the content of the given PDF file. 22 | * `json_data` contains JSON data representations for each paper 23 | * `img` contains a `.png` image of a citation-dependency graph of the PDF files in the folder 24 | * `parsed_paper_data.json` is a single json file containing data about the papers analysed, such as the title of each paper as well as the papers cited by each paper. 
25 | * `report.txt` contains a lot of information about the papers in data set 26 | 27 | ``` 28 | $ cat pq-out/out-0/report.txt 29 | ###################################### 30 | Parsed papers summary 31 | paper: 32 | pq-out/out-0/json_data/json_dump_0.json 33 | b'A characterisation of system-wide propagation in the malware landscape ' 34 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape 35 | References: 36 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos 37 | Title: System- Level Support for Intrusion Recovery 38 | Normalised: systemlevelsupportforintrusionrecovery 39 | ------------------- 40 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla 41 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps 42 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps 43 | ------------------- 44 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla 45 | Title: Bee Master: Detecting Host-Based Code Injection Attacks 46 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks 47 | ------------------- 48 | Authors: Thomas Barabosch and Elmar Gerhards-Padilla 49 | Title: Host-based code injection attacks: A popular technique used by malware 50 | Normalised: hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware 51 | ------------------- 52 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzarotti, and Engin Kirda 53 | Title: A View on Current Malware Behaviors 54 | Normalised: aviewoncurrentmalwarebehaviors 55 | ------------------- 56 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda 57 | Title: Dy- namic Analysis of Malicious Code 58 | Normalised: dynamicanalysisofmaliciouscode 59 | ------------------- 60 | Authors: Magal Baz and Or Safran 61 | Title: Dridexs Cold War: Enter AtomBombing 62 | Normalised: dridexscoldwar:enteratombombing 63 | ------------------- 64 | Authors: Fabrice Bellard 65 | Title: QEMU, a Fast and Portable Dynamic Translator 66 | Normalised: qemuafastandportabledynamictranslator 67 | ------------------- 68 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry 69 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions 70 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions 71 | ------------------- 72 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan 73 | Title: Repeatable Reverse Engineering with PANDA 74 | Normalised: repeatablereverseengineeringwithpanda 75 | ------------------- 76 | Authors: Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel 77 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools 78 | Normalised: asurveyonautomateddynamicmalwareanalysistechniquesandtools 79 | ------------------- 80 | Authors: Adrienne Porter Felt, Matthew Finifter, Erika Chin, Steve Hanna, and David Wagner 81 | Title: A Survey of Mobile Malware in the Wild 82 | Normalised: asurveyofmobilemalwareinthewild 83 | ------------------- 84 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant 85 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 86 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform 87 | ------------------- 88 | Authors: Min Gyung Kang, Pongsin Poosankam, 
and Heng Yin 89 | Title: Renovo: A Hidden Code Extractor for Packed Executables 90 | Normalised: renovo:ahiddencodeextractorforpackedexecutables 91 | ------------------- 92 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi 93 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 94 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis 95 | ------------------- 96 | Authors: David Korczynski 97 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction 98 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction 99 | ------------------- 100 | Authors: David Korczynski 101 | Title: Precise system-wide concatic malware unpacking 102 | Normalised: precisesystemwideconcaticmalwareunpacking 103 | ------------------- 104 | Authors: David Korczynski and Heng Yin 105 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks 106 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks 107 | ------------------- 108 | Authors: Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Caval- laro 109 | Title: The Evolution of Android Malware and Android Analysis Techniques 110 | Normalised: theevolutionofandroidmalwareandandroidanalysistechniques 111 | ------------------- 112 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda 113 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis 114 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis 115 | ------------------- 116 | Authors: Yajin Zhou and Xuxian Jiang 117 | Title: Dissecting Android Malware: Character- ization and Evolution 118 | Normalised: dissectingandroidmalware:characterizationandevolution 119 | ------------------- 120 | paper: 121 | pq-out/out-0/json_data/json_dump_1.json 122 | b'Precise system-wide concatic malware unpacking ' 123 | Normalised title: precisesystemwideconcaticmalwareunpacking 124 | References: 125 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos 126 | Title: System- Level Support for Intrusion Recovery 127 | Normalised: systemlevelsupportforintrusionrecovery 128 | ------------------- 129 | Authors: G. Balakrishnan, T. Reps, D. Melski, and T. 
Teitelbaum 130 | Title: WYSINWYX: What You See Is Not What You eXecute 131 | Normalised: wysinwyx:whatyouseeisnotwhatyouexecute 132 | ------------------- 133 | Authors: Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley 134 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code 135 | Normalised: byteweight:learningtorecognizefunctionsinbinarycode 136 | ------------------- 137 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla 138 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps 139 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps 140 | ------------------- 141 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla 142 | Title: Bee Master: Detecting Host-Based Code Injection Attacks 143 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks 144 | ------------------- 145 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry 146 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions 147 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions 148 | ------------------- 149 | Authors: Erik Bosman, Asia Slowinska, and Herbert Bos 150 | Title: Minemu: The Worlds Fastest Taint Tracker 151 | Normalised: minemu:theworldsfastesttainttracker 152 | ------------------- 153 | Could not parse 154 | ------------------- 155 | Authors: Cristina Cifuentes and K. John Gough 156 | Title: Decompilation of Binary Pro- grams 157 | Normalised: decompilationofbinaryprograms 158 | ------------------- 159 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee 160 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions 161 | Normalised: ether:malwareanalysisviahardwarevirtualizationextensions 162 | ------------------- 163 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan 164 | Title: Repeatable Reverse Engineering with PANDA 165 | Normalised: repeatablereverseengineeringwithpanda 166 | ------------------- 167 | Authors: Thomas Dullien and Rolf Rolles 168 | Title: Graph-based comparison of executable objects (english version) 169 | Normalised: graphbasedcomparisonofexecutableobjects(englishversion) 170 | ------------------- 171 | Authors: Mike Van Emmerik 172 | Title: Signatures for Library Functions in Executable Files 173 | Normalised: signaturesforlibraryfunctionsinexecutablefiles 174 | ------------------- 175 | Authors: Halvar Flake 176 | Title: Structural Comparison of Executable Objects 177 | Normalised: structuralcomparisonofexecutableobjects 178 | ------------------- 179 | Authors: Fanglu Guo, Peter Ferrie, and Tzi-cker Chiueh 180 | Title: A Study of the Packer Problem and Its Solutions 181 | Normalised: astudyofthepackerproblemanditssolutions 182 | ------------------- 183 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant 184 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform 185 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform 186 | ------------------- 187 | Authors: Xin Hu, Sandeep Bhatkar, Kent Griffin, and Kang G. 
Shin 188 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features 189 | Normalised: mutantxs:scalablemalwareclusteringbasedonstaticfeatures 190 | ------------------- 191 | Authors: Thomas Hungenberg and Matthias Eckert 192 | Title: http://www 193 | Normalised: http://www 194 | ------------------- 195 | Authors: Kyriakos K. Ispoglou and Mathias Payer 196 | Title: malWASH: Washing Malware to Evade Dynamic Analysis 197 | Normalised: malwash:washingmalwaretoevadedynamicanalysis 198 | ------------------- 199 | Authors: Emily R. Jacobson, Nathan Rosenblum, and Barton P. Miller 200 | Title: Labeling Library Functions in Stripped Binaries 201 | Normalised: labelinglibraryfunctionsinstrippedbinaries 202 | ------------------- 203 | Authors: Sebastien Josse 204 | Title: Secure and advanced unpacking using computer emulation 205 | Normalised: secureandadvancedunpackingusingcomputeremulation 206 | ------------------- 207 | Authors: S. Josse 208 | Title: Malware Dynamic Recompilation 209 | Normalised: malwaredynamicrecompilation 210 | ------------------- 211 | Authors: Min Gyung Kang, Pongsin Poosankam, and Heng Yin 212 | Title: Renovo: A Hidden Code Extractor for Packed Executables 213 | Normalised: renovo:ahiddencodeextractorforpackedexecutables 214 | ------------------- 215 | Authors: Yuhei Kawakoya, Makoto Iwamura, and Jun Miyoshi 216 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation 217 | Normalised: taintassistediatreconstructionagainstpositionobfuscation 218 | ------------------- 219 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi 220 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis 221 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis 222 | ------------------- 223 | Could not parse 224 | ------------------- 225 | Authors: Clemens Kolbitsch, Engin Kirda, and Christopher Kruegel 226 | Title: The Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code 227 | Normalised: thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode 228 | ------------------- 229 | Authors: David Korczynski 230 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction 231 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction 232 | ------------------- 233 | Authors: David Korczynski and Heng Yin 234 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks 235 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks 236 | ------------------- 237 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Gio- vanni Vigna 238 | Title: Polymorphic Worm Detection Using Structural Information of Executables 239 | Normalised: polymorphicwormdetectionusingstructuralinformationofexecutables 240 | ------------------- 241 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, and Giovanni Vigna 242 | Title: Static Disassembly of Obfuscated Binaries 243 | Normalised: staticdisassemblyofobfuscatedbinaries 244 | ------------------- 245 | Authors: Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli 246 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects 247 | Normalised: graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects 248 | ------------------- 249 | Authors: L. Martignoni, M. Christodorescu, and S. 
Jha 250 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware 251 | Normalised: omniunpack:fastgenericandsafeunpackingofmalware 252 | ------------------- 253 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Stefano DAlessio, Lorenzo Fontana, Fabio Gritti, and Stefano Zanero 254 | Title: Measuring and Defeating Anti- Instrumentation-Equipped Malware 255 | Normalised: measuringanddefeatingantiinstrumentationequippedmalware 256 | ------------------- 257 | Authors: Georgios Portokalidis, Asia Slowinska, and Herbert Bos 258 | Title: Argos: an Emula- tor for Fingerprinting Zero-Day Attacks 259 | Normalised: argos:anemulatorforfingerprintingzerodayattacks 260 | ------------------- 261 | Authors: Symantec Security Response 262 | Title: W32 263 | Normalised: w32 264 | ------------------- 265 | Authors: Nathan E. Rosenblum, Xiaojin Zhu, Barton P. Miller, and Karen Hunt 266 | Title: Learning to Analyze Binary Computer Code 267 | Normalised: learningtoanalyzebinarycomputercode 268 | ------------------- 269 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee 270 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware 271 | Normalised: polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware 272 | ------------------- 273 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, and Wenke Lee 274 | Title: Eureka: A Framework for Enabling Static Malware Analysis 275 | Normalised: eureka:aframeworkforenablingstaticmalwareanalysis 276 | ------------------- 277 | Authors: Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson 278 | Title: Binary Translation 279 | Normalised: binarytranslation 280 | ------------------- 281 | Authors: Wei Song, Heng Yin, Chang Liu, and Dawn Song 282 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis 283 | Normalised: deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis 284 | ------------------- 285 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda 286 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis 287 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis 288 | ------------------- 289 | ###################################### 290 | Papers in the data set : 291 | Cited papers: 292 | b'Precise system-wide concatic malware unpacking ' 293 | Noncited papers: 294 | b'A characterisation of system-wide propagation in the malware landscape ' 295 | Succ: 2 296 | Fail: 0 297 | Succ sigs: 61 298 | Failed sigs: 2 299 | ###################################### 300 | All Titles (normalised) of the papers in the data set: 301 | b'A characterisation of system-wide propagation in the malware landscape ' 302 | b'Precise system-wide concatic malware unpacking ' 303 | ###################################### 304 | ###################################### 305 | All references/citations (normalised) issued by these papers 306 | Name : Count 307 | hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware : 1 308 | aviewoncurrentmalwarebehaviors : 1 309 | dynamicanalysisofmaliciouscode : 1 310 | dridexscoldwar:enteratombombing : 1 311 | qemuafastandportabledynamictranslator : 1 312 | asurveyonautomateddynamicmalwareanalysistechniquesandtools : 1 313 | asurveyofmobilemalwareinthewild : 1 314 | precisesystemwideconcaticmalwareunpacking : 1 315 | 
theevolutionofandroidmalwareandandroidanalysistechniques : 1 316 | dissectingandroidmalware:characterizationandevolution : 1 317 | wysinwyx:whatyouseeisnotwhatyouexecute : 1 318 | byteweight:learningtorecognizefunctionsinbinarycode : 1 319 | minemu:theworldsfastesttainttracker : 1 320 | decompilationofbinaryprograms : 1 321 | ether:malwareanalysisviahardwarevirtualizationextensions : 1 322 | graphbasedcomparisonofexecutableobjects(englishversion) : 1 323 | signaturesforlibraryfunctionsinexecutablefiles : 1 324 | structuralcomparisonofexecutableobjects : 1 325 | astudyofthepackerproblemanditssolutions : 1 326 | mutantxs:scalablemalwareclusteringbasedonstaticfeatures : 1 327 | http://www : 1 328 | malwash:washingmalwaretoevadedynamicanalysis : 1 329 | labelinglibraryfunctionsinstrippedbinaries : 1 330 | secureandadvancedunpackingusingcomputeremulation : 1 331 | malwaredynamicrecompilation : 1 332 | taintassistediatreconstructionagainstpositionobfuscation : 1 333 | thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode : 1 334 | polymorphicwormdetectionusingstructuralinformationofexecutables : 1 335 | staticdisassemblyofobfuscatedbinaries : 1 336 | graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects : 1 337 | omniunpack:fastgenericandsafeunpackingofmalware : 1 338 | measuringanddefeatingantiinstrumentationequippedmalware : 1 339 | argos:anemulatorforfingerprintingzerodayattacks : 1 340 | w32 : 1 341 | learningtoanalyzebinarycomputercode : 1 342 | polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware : 1 343 | eureka:aframeworkforenablingstaticmalwareanalysis : 1 344 | binarytranslation : 1 345 | deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis : 1 346 | systemlevelsupportforintrusionrecovery : 2 347 | quincy:detectinghostbasedcodeinjectionattacksinmemorydumps : 2 348 | beemaster:detectinghostbasedcodeinjectionattacks : 2 349 | codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions : 2 350 | repeatablereverseengineeringwithpanda : 2 351 | decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform : 2 352 | renovo:ahiddencodeextractorforpackedexecutables : 2 353 | apichaser:taintassistedsandboxforevasivemalwareanalysis : 2 354 | repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction : 2 355 | capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks : 2 356 | panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis : 2 357 | ###################################### 358 | The total number of unique citations: 50 359 | ###################################### 360 | Dependency graph based on citations 361 | Paper: 362 | pq-out/out-0/json_data/json_dump_0.json 363 | b'A characterisation of system-wide propagation in the malware landscape ' 364 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape 365 | Cited by: 366 | --------------------------------- 367 | Paper: 368 | pq-out/out-0/json_data/json_dump_1.json 369 | b'Precise system-wide concatic malware unpacking ' 370 | Normalised title: precisesystemwideconcaticmalwareunpacking 371 | Cited by: 372 | b'A characterisation of system-wide propagation in the malware landscape ' 373 | --------------------------------- 374 | ``` 375 | -------------------------------------------------------------------------------- /example-images/fuzz-barplot.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-barplot.png -------------------------------------------------------------------------------- /example-images/fuzz-wordcloud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-wordcloud.png -------------------------------------------------------------------------------- /example-images/paper-citation-graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/paper-citation-graph.png -------------------------------------------------------------------------------- /example-papers/1908.09204.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.09204.pdf -------------------------------------------------------------------------------- /example-papers/1908.10167.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.10167.pdf -------------------------------------------------------------------------------- /gen_wordcloud.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | from wordcloud import WordCloud 4 | 5 | if len(sys.argv) != 2: 6 | print("usage: python pq_gen_wordcloud.py TEXTFILE") 7 | exit(0) 8 | 9 | whole_text = None 10 | #with open("copy-total-words", "r") as i_f: 11 | with open(sys.argv[1], "r") as i_f: 12 | whole_text = i_f.read() 13 | 14 | whole_text = whole_text.replace("cid:", "") 15 | 16 | #wordcloud = WordCloud().generate(whole_text) 17 | 18 | import matplotlib.pyplot as plt 19 | #plt.imshow(wordcloud, interpolation="bilinear") 20 | #plt.axis("off") 21 | 22 | wordcloud = WordCloud(background_color="white", 23 | colormap="winter", 24 | font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf", 25 | height=800, 26 | width=800, 27 | min_font_size=10, 28 | max_font_size=120, 29 | prefer_horizontal=3.0).generate(whole_text) 30 | 31 | plt.figure(figsize=(10.0, 10.0)) 32 | plt.imshow(wordcloud, interpolation="bilinear") 33 | plt.axis("off") 34 | plt.savefig("wordcloud-final.png") 35 | 36 | # Now let's do word frequency 37 | word_dict = dict() 38 | split_words = whole_text.split(" ") 39 | total_number_of_words = len(split_words) 40 | print(total_number_of_words) 41 | total_sorted = 0 42 | for w in split_words: 43 | total_sorted += 1 44 | if w in word_dict: 45 | word_dict[w] = word_dict[w] + 1 46 | else: 47 | word_dict[w] = 1 48 | print("Total sorted: %d"%(total_sorted)) 49 | 50 | listed_word_dict = [] 51 | for key,value in word_dict.items(): 52 | listed_word_dict.append((key, value, )) 53 | 54 | print("Length of listed word %d"%(len(listed_word_dict))) 55 | listed_word_dict.sort(key=lambda x:x[1]) 56 | listed_word_dict.reverse() 57 | sorted_list = listed_word_dict 58 | print(sorted_list) 59 | 60 | 61 | # https://www.espressoenglish.net/the-100-most-common-words-in-english/ 62 | avoid_words = { "the", "at", "there", "some", "my", "of", "be", "use", "her", "than", "and", "this", "an", "would", "first", "a", "have", 
"each", "make", "water", "to", "from", "which", "like", "been", "in", "or", "she", "him", "call", "is", "one", "do", "into", "who", "you", "had", "how", "time", "oil", "that", "by", "their", "has", "its", "it", "word", "if", "look", "now", "he", "but", "will", "two", "find", "was", "not", "up", "more", "long", "for", "what", "other", "write", "down", "on", "all", "about", "go", "day", "are", "were", "out", "see", "did", "as", "we", "many", "number", "get", "with", "when", "then", "no", "come", "his", "your", "them", "way", "made", "they", "can", "these", "could", "may", "I", "said", "so", "people", "part" } 63 | 64 | avoid_words.add("=") 65 | avoid_words.add(".") 66 | avoid_words.add(" ") 67 | avoid_words.add("") 68 | avoid_words.add(",") 69 | 70 | words = [] 71 | freqs = [] 72 | for key, value in sorted_list: 73 | if key.lower() in avoid_words: 74 | continue 75 | words.append(key) 76 | freqs.append(value) 77 | for i in range(140): 78 | print("sorted_list: %s"%(str(sorted_list[i]))) 79 | 80 | Data = { "words: " : words, "freqs" : freqs } 81 | 82 | plt.clf() 83 | plt.figure(figsize=(10.0, 10.0)) 84 | plt.barh(words[:50], freqs[:50]) 85 | 86 | plt.title("Word frequency") 87 | #plt.show() 88 | plt.savefig("barplot.png") 89 | -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | # Install some needed packages 2 | sudo apt-get install python3 3 | sudo apt-get install python3-pip 4 | sudo apt install graphviz 5 | 6 | # Create a virtual environment 7 | virtualenv venv 8 | 9 | # Launch the virtual environment 10 | . venv/bin/activate 11 | 12 | # Install various packages 13 | pip install -r requirements.txt 14 | -------------------------------------------------------------------------------- /pq_format_reader.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import simplejson as json 4 | import traceback 5 | 6 | import graphviz as gv 7 | import pylab 8 | import random 9 | 10 | 11 | def append_to_report(workdir, str_to_write, to_print=True): 12 | work_report = os.path.join(workdir, "report.txt") 13 | if not os.path.isfile(work_report): 14 | with open(work_report, "w+") as wr: 15 | wr.write(str_to_write) 16 | wr.write("\n") 17 | else: 18 | with open(work_report, "ab") as wr: 19 | wr.write(str_to_write.encode("utf-8")) 20 | wr.write("\n".encode("utf-8")) 21 | 22 | if to_print: 23 | print(str_to_write) 24 | 25 | 26 | def split_reference_section_into_references(all_content, to_print=True): 27 | ''' 28 | Takes a raw string that resembles references and returns a python 29 | list where each element is composed of a tuple of four elements: 30 | - The number of the reference 31 | - The data of the reference 32 | - The offset where the reference begins in the raw data 33 | - The offset where the reference ends in the raw data 34 | (This end offset is not entirely workign and will often 35 | be 0) 36 | Each tuple is arranged in the order of the elements listed above. 
37 | ''' 38 | 39 | collect_num = False 40 | curr_num = "" 41 | curr_content = "" 42 | idx_offset = 0 43 | begin_offset = 0 44 | end_offset = 0 45 | 46 | references = list() 47 | 48 | for c in all_content: 49 | if collect_num == True: 50 | try: 51 | n = int(c) 52 | curr_num = "%s%s" % (curr_num, c) 53 | except: 54 | None 55 | 56 | if c == "[": 57 | # Make sure the next character is a number 58 | try: 59 | tmpi = int(all_content[idx_offset + 1]) 60 | except: 61 | curr_content += c 62 | continue 63 | 64 | if curr_content != "": 65 | references.append({ 66 | "number": curr_num, 67 | "raw_content": curr_content, 68 | "offset": begin_offset 69 | }) 70 | #references.append((curr_num, curr_content, begin_offset, end_offset)) 71 | curr_num = "" 72 | collect_num = True 73 | begin_offset = idx_offset 74 | elif c == "]": 75 | collect_num = False 76 | #print("Got a number: %d"%(int(curr_num))) 77 | #curr_num = "" 78 | curr_content = "" 79 | elif collect_num == False: 80 | curr_content += c 81 | 82 | idx_offset += 1 83 | 84 | # Add the final reference 85 | #references.append((curr_num, curr_content, begin_offset, end_offset)) 86 | references.append({ 87 | "number": curr_num, 88 | "raw_content": curr_content, 89 | "offset": begin_offset 90 | }) 91 | print("Size of references: %d" % (len(references))) 92 | 93 | if to_print == True: 94 | for ref in references: 95 | print("Ref: %s" % (str((ref)))) 96 | 97 | return references 98 | 99 | 100 | #################################################################### 101 | # Reference author routines 102 | # 103 | # 104 | # These routines are used to take raw reference data 105 | # and split it up between authors and title of reference. 106 | # 107 | # We need multiple routines for this since there are multiple 108 | # ways of writing references. 109 | #################################################################### 110 | def read_single_letter_references(r_num, r_con, r_beg, r_end): 111 | ''' 112 | Parse a raw reference string and extract authors as well as title. 113 | This parsing routine is concerned with references where there is only a single 114 | name spelled out entirely per author, and the rest are delivered as single letters 115 | followed by periods. 116 | 117 | Examples of strings it parses: 118 | - T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs. 119 | - A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack 120 | - A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets. 121 | 122 | Inputs: 123 | - An element from a list as parsed by read_harvard_references 124 | 125 | Outputs: 126 | - On failure: None 127 | - On success: 128 | ''' 129 | # Now find the period that divides the authors from the title 130 | authors = None 131 | rest = None 132 | space_split = r_con.split(" ") 133 | for idx2 in range(len(space_split) - 4): 134 | try: 135 | # This will capture when there are multiple authors 136 | if (space_split[idx2].lower() == "and" 137 | and space_split[idx2 + 1][-1] == "." 138 | and space_split[1][-1] == "."): 139 | # If we have a double single-letter name 140 | 141 | # Example: T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs. 142 | if space_split[idx2 + 2][-1] == "." 
and len( 143 | space_split[idx2 + 2]) == 2: 144 | print("Potentially last1: %s" % (space_split[idx2 + 4])) 145 | authors = " ".join(space_split[0:idx2 + 4]) 146 | rest = " ".join(space_split[idx2 + 4:]) 147 | # Example: A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack 148 | elif space_split[idx2 + 2][-1] != "." and space_split[ 149 | idx2 + 3][-1] == ".": 150 | print("Potentially last2: %s" % (space_split[idx2 + 4])) 151 | authors = " ".join(space_split[0:idx2 + 4]) 152 | rest = " ".join(space_split[idx2 + 4:]) 153 | # Example:A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets. 154 | else: 155 | print("Potentially last3: %s" % (space_split[idx2 + 3])) 156 | authors = " ".join(space_split[0:idx2 + 3]) 157 | rest = " ".join(space_split[idx2 + 3:]) 158 | break 159 | except: 160 | None 161 | 162 | if authors == None: 163 | for idx2 in range(len(space_split) - 4): 164 | try: 165 | # This will capture when there is only a single author 166 | if (len(space_split[idx2]) == 2 167 | and space_split[idx2][-1] == "." 168 | and len(space_split[idx2 + 2]) > 2): 169 | # If we have double single-letter naming 170 | if space_split[idx2 + 1][-1] == "." and len( 171 | space_split[idx2 + 1]) == 2: 172 | print("Potentially last single 1: %s" % 173 | (space_split[idx2 + 3])) 174 | authors = " ".join(space_split[0:idx2 + 3]) 175 | rest = " ".join(space_split[idx2 + 3:]) 176 | break 177 | else: 178 | print("Potentially last single 2: %s" % 179 | (space_split[idx2 + 2])) 180 | authors = " ".join(space_split[0:idx2 + 2]) 181 | rest = " ".join(space_split[idx2 + 2:]) 182 | break 183 | except: 184 | None 185 | 186 | # Do some post processing to ensure we got the right stuff. 187 | # First ensure we at least have two words in authors: 188 | if authors != None and len(authors.split(" ")) < 2: 189 | print("Breaking 1") 190 | return None 191 | 192 | # Now ensure that the title is not just a name: 193 | if authors != None: 194 | try: 195 | title = int(rest.split(".")[0]) 196 | # If the above is successful, then the title is just a number, which cannot be true 197 | print("Breaking 2") 198 | return None 199 | except: 200 | print("Could not break 2") 201 | None 202 | 203 | if authors != None: 204 | r_title = rest.split(".")[0] 205 | 206 | # with the special '' symbols (0x201c and 0x201d, respectively). 207 | #if chr(0x201d) in r_title: 208 | # r_title = rest.split(chr(0x201d))[0] 209 | # r_title = r_title[1:-1] 210 | # rest = ".".join(rest.split(chr(0x201d))[1:]) 211 | 212 | print("Authors: %s" % (authors)) 213 | print("Title: %s" % (r_title)) 214 | print("rest: %s" % (rest)) 215 | 216 | # Create a directory 217 | ref_dict = dict() 218 | ref_dict['Authors'] = authors 219 | ref_dict['Title'] = r_title 220 | ref_dict['rest'] = rest 221 | ref_dict['num'] = r_num 222 | return ref_dict 223 | else: 224 | return None 225 | print("----------------") 226 | 227 | 228 | def read_full_author_references(r_num, r_con, r_beg, r_end): 229 | ''' 230 | This routine is focused on parsing references where the authors names are 231 | spelled out fully, including all names of each author. 232 | 233 | Examples: Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing Mayhem on Binary Code. In IEEE Symposium on Security and Privacy (S&P). 
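    On the example above the routine roughly works as follows: it splits the
    raw reference on periods, re-merges short fragments produced by initials
    or abbreviations, and, when this gives a clean four-part split with a
    numeric year in the second slot, returns a dictionary with 'Authors',
    'Year', 'Title', 'Venue', 'rest' and 'num' keys (e.g. Year='2012' and
    Title='Unleashing Mayhem on Binary Code'). If the split has more parts it
    keeps Authors/Year/Title and joins the remainder into 'rest'; it returns
    None when the second part is not a year.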
234 | ''' 235 | # Now find the period that divides the authors from the title 236 | authors = None 237 | rest = None 238 | space_split = r_con.split(".") 239 | print("Space split: %s" % (str(space_split))) 240 | 241 | # Merge all of the elements of the list that are of less than 2 in length: 242 | new_space_split = list() 243 | curr_elem = "" 244 | for elem in space_split: 245 | if len(elem) <= 2: 246 | curr_elem += "%s." % (elem) 247 | elif len(elem) > 2 and elem[-2] == ' ' and elem[-1] != ' ': 248 | curr_elem += "%s." % (elem) 249 | else: 250 | new_elem = "%s%s" % (curr_elem, elem) 251 | print("New elem: %s" % (new_elem)) 252 | new_space_split.append(new_elem) 253 | curr_elem = "" 254 | space_split = new_space_split 255 | 256 | print("Refined space split: %s" % (str(space_split))) 257 | 258 | # See if it is clean 259 | if len(space_split) == 4: 260 | try: 261 | year = int(space_split[1]) 262 | print("Gotten the reference") 263 | 264 | ref_dict = dict() 265 | ref_dict['Authors'] = space_split[0] 266 | ref_dict['Title'] = space_split[2] 267 | ref_dict['Venue'] = space_split[3] 268 | ref_dict['Year'] = space_split[1] 269 | ref_dict['rest'] = space_split[3] 270 | ref_dict['num'] = r_num 271 | return ref_dict 272 | except: 273 | None 274 | # Let's try a bit vaguer for the sake of it 275 | try: 276 | year = int(space_split[1]) 277 | print("Gotten the reference") 278 | 279 | ref_dict = dict() 280 | ref_dict['Authors'] = space_split[0] 281 | ref_dict['Year'] = space_split[1] 282 | ref_dict['Title'] = space_split[2] 283 | ref_dict['rest'] = ".".join(space_split[3:]) 284 | ref_dict['num'] = r_num 285 | return ref_dict 286 | except: 287 | None 288 | 289 | return None 290 | 291 | 292 | def normalise_title(title): 293 | return title.lower().replace(",", "").replace(" ", "").replace("-", "") 294 | 295 | 296 | ########################################## 297 | # End of citation parsing routines 298 | ########################################## 299 | 300 | 301 | def parse_raw_reference_list(refs): 302 | ''' 303 | The goal of this function is to convert the raw references 304 | into more normalized references, and in particular split 305 | the raw reference into authors and title of reference. 306 | 307 | An important aspect of this function is that it tries to pass 308 | the reference in multiple ways, based on how the citations 309 | are written. 310 | 311 | As such, this routine is somewhat implemented with a 312 | test-and-check aproach in mind. 313 | ''' 314 | parse_funcs = [read_single_letter_references, read_full_author_references] 315 | parse_success = [] 316 | for parse_func in parse_funcs: 317 | missing_refs = [] 318 | refined = [] 319 | for ref in refs: 320 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'], 321 | 0) 322 | if res == None: 323 | missing_refs.append(ref) 324 | else: 325 | refined.append(res) 326 | parse_success.append({ 327 | 'missing_refs': missing_refs, 328 | 'refined': refined 329 | }) 330 | 331 | # Now pick the index with highest number of successful parses 332 | if len(parse_success[0]['refined']) > len(parse_success[1]['refined']): 333 | parse_func = parse_funcs[0] 334 | else: 335 | parse_func = parse_funcs[1] 336 | 337 | # Now parse a last time and insert the parsed data into 338 | # the references dictionary. 339 | # This loop will modify the content of the dictionary. 
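    # Illustrative sketch of the result (assumed data, not from a real paper):
    # after the loop below an entry of `refs` might look like
    #   {'number': '7', 'raw_content': ' T. Ball, ... of C programs. ',
    #    'offset': 512,
    #    'parsed': {'Authors': 'T. Ball, ...',
    #               'Title': 'Automatic predicate abstraction of C programs', ...},
    #    'normalised-title': 'automaticpredicateabstractionofcprograms'}
    # or have 'parsed' and 'normalised-title' set to None when neither
    # parsing routine recognised the reference format.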
340 | for ref in refs: 341 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'], 0) 342 | ref['parsed'] = res 343 | if res != None: 344 | ref['normalised-title'] = normalise_title(res['Title']) 345 | else: 346 | ref['normalised-title'] = None 347 | 348 | 349 | ######################################## 350 | # Utility functions 351 | ######################################## 352 | 353 | 354 | def print_refined(refined): 355 | print("Refined lists:") 356 | #for rd in refined: 357 | # try: 358 | # print("\t%d"%(int(rd['num']))) 359 | # except: 360 | # None 361 | # print("\tTitle: %s"%(rd['Title'])) 362 | # print("\tAuthors: %s"%(rd['Authors'])) 363 | # print("\tRest: %s"%(rd['rest'])) 364 | # print("#"*90) 365 | 366 | 367 | def print_missing(missings): 368 | print("Missing refs:") 369 | #for ms in missings: 370 | # print("\t%s"%(str(ms))) 371 | # print("<"*90) 372 | 373 | 374 | def read_decoded(file_path): 375 | ''' 376 | Reads a file an encodes each data as an ASCII string. We do 377 | the ASCII encoding because many of the papers have some form 378 | of weird UTF-coding and we do not handle those at the moment. 379 | ''' 380 | with open(file_path, "r") as f: 381 | json_content = json.loads(str(f.read().replace('\r\n', '')), 382 | strict=False) 383 | dec_content = dict() 384 | for elem in json_content: 385 | #dec_content[elem] = json_content[elem].encode('utf-8') 386 | dec_content[elem] = json_content[elem].encode('ascii', 'ignore') 387 | #asciidata= dec.encode("ascii","ignore") 388 | return dec_content 389 | 390 | 391 | def parse_file(file_path): 392 | ''' 393 | Parses a json file with the raw data of title and references 394 | and returns a list of citations made by this paper. 395 | ''' 396 | # Read the json file with our raw data. 397 | dec_json = read_decoded(file_path) 398 | 399 | # Merge all lines in the file by converting newlines to spaces 400 | dec_json['References'] = dec_json['References'].decode("utf-8") 401 | print("Type: %s" % (str(type(dec_json['References'])))) 402 | print(dec_json['References']) 403 | print(dec_json['References'].encode('ascii', 'ignore')) 404 | all_refs = dec_json['References'].replace( 405 | "\n", " ") #decode("utf-8").replace("\n", " ") 406 | 407 | # Extract the references in a raw format. 
This is just splitting 408 | # 409 | references = split_reference_section_into_references(all_refs) 410 | 411 | # Decipher the references 412 | print("References") 413 | print(references) 414 | #refined, missing_sigs = parse_raw_reference_list(references) 415 | parse_raw_reference_list(references) 416 | print("Refined refs:") 417 | print(references) 418 | 419 | #print_refined(refined) 420 | #print_missing(missing_sigs) 421 | #print("Refined sigs: %d"%(len(refined))) 422 | 423 | # Now create our dictionary where we will wrap all data in 424 | paper_dict = {} 425 | paper_dict['paper-title'] = dec_json['Title'] 426 | paper_dict['paper-title-normalised'] = normalise_title( 427 | dec_json['Title'].decode("utf-8")) 428 | paper_dict['references'] = references 429 | #paper_dict['success-sigs'] = refined 430 | #paper_dict['missing-sigs'] = missing_sigs 431 | 432 | return paper_dict 433 | #return refined, missing_sigs, dec_json['Title'] 434 | 435 | 436 | def read_json_file(filename): 437 | print("[+] Parsing file: %s" % (filename)) 438 | ret = None 439 | try: 440 | ret = parse_file(filename) 441 | except: 442 | exc_type, exc_value, exc_traceback = sys.exc_info() 443 | traceback.print_tb(exc_traceback, limit=10, file=sys.stdout) 444 | traceback.print_exception(exc_type, 445 | exc_value, 446 | exc_traceback, 447 | limit=20, 448 | file=sys.stdout) 449 | traceback.print_exc() 450 | formatted_lines = traceback.format_exc().splitlines() 451 | print("[+] Completed parsing of %s" % (filename)) 452 | return ret 453 | 454 | 455 | def identify_group_dependencies(workdir, parsed_papers): 456 | """ identifies who cites who in a group of papers """ 457 | append_to_report(workdir, "######################################") 458 | append_to_report(workdir, "Dependency graph based on citations") 459 | #append_to_report(workdir, "######################################") 460 | g1 = gv.Digraph(format='png') 461 | 462 | MAX_STR_SIZE = 15 463 | idx = 0 464 | paper_dict = {} 465 | normalised_to_id = {} 466 | 467 | for src_paper in parsed_papers: 468 | #paper_dict[src_paper['result']['paper-title-normalised']] = idx 469 | if len(src_paper['result']['paper-title-normalised']) > MAX_STR_SIZE: 470 | 471 | target_d = src_paper['result']['paper-title-normalised'][ 472 | 0:MAX_STR_SIZE] 473 | else: 474 | target_d = src_paper['result']['paper-title-normalised'] 475 | 476 | #paper_dict[src_paper['result']['paper-title-normalised']] = src_paper['result']['paper-title-normalised'] 477 | paper_dict[src_paper['result']['paper-title-normalised']] = target_d 478 | #g1.node(str(idx)) 479 | g1.node(target_d) 480 | idx += 1 481 | 482 | #print("Paper dict") 483 | #print(paper_dict) 484 | #print("-"*50) 485 | #raw_dependency_graph_path = os.path.join(workdir, "raw_dependency_graph.json") 486 | #with open(raw_dependency_graph_path, "w+") as dependency_graph_json: 487 | # json.dump(paper_dict, dependency_graph_json) 488 | 489 | dependency_graph = [] 490 | for src_paper in parsed_papers: 491 | 492 | cited_by = [] 493 | for tmp_paper in parsed_papers: 494 | if tmp_paper == src_paper: 495 | continue 496 | # Now go through references of tmp_paper and 497 | # see if it cites the src paper 498 | cites = False 499 | for ref in tmp_paper['result']['references']: 500 | if ref['parsed'] != None: 501 | if ref['normalised-title'] == src_paper['result'][ 502 | 'paper-title-normalised']: 503 | if len(tmp_paper['result'] 504 | ['paper-title-normalised']) > MAX_STR_SIZE: 505 | src_d = tmp_paper['result'][ 506 | 'paper-title-normalised'][0:MAX_STR_SIZE] 507 | 
else: 508 | src_d = tmp_paper['result'][ 509 | 'paper-title-normalised'] 510 | 511 | if len(src_paper['result'] 512 | ['paper-title-normalised']) > MAX_STR_SIZE: 513 | dst_d = src_paper['result'][ 514 | 'paper-title-normalised'][0:MAX_STR_SIZE] 515 | else: 516 | dst_d = src_paper['result'][ 517 | 'paper-title-normalised'] 518 | g1.edge(src_d, dst_d) 519 | 520 | #src_idx = paper_dict[tmp_paper['result']['paper-title-normalised']] 521 | #dst_idx = paper_dict[src_paper['result']['paper-title-normalised']] 522 | #g1.edge(str(src_idx), str(dst_idx)) 523 | cites = True 524 | if cites == True: 525 | cited_by.append(tmp_paper) 526 | src_paper['cited_by'] = cited_by 527 | append_to_report(workdir, "Paper:") 528 | append_to_report(workdir, "\t%s" % (src_paper['filename'])) 529 | append_to_report(workdir, "\t%s" % (src_paper['result']['paper-title'])) 530 | append_to_report(workdir, "\tNormalised title: %s" % 531 | (src_paper['result']['paper-title-normalised'])) 532 | append_to_report(workdir, "\tCited by: ") 533 | for cited_by_paper in cited_by: 534 | append_to_report(workdir, "\t\t%s" % (cited_by_paper['result']['paper-title'])) 535 | append_to_report(workdir, "---------------------------------") 536 | 537 | paper_info = {} 538 | paper_info['name'] = src_paper['result']['paper-title'].decode("utf-8") 539 | paper_info['minimized-name'] = src_paper['result'][ 540 | 'paper-title-normalised'] 541 | paper_info['imports'] = [] 542 | for cited_by_paper in cited_by: 543 | paper_info['imports'].append({ 544 | "name": 545 | cited_by_paper['result']['paper-title'].decode("utf-8"), 546 | "minimized-name": 547 | cited_by_paper['result']['paper-title-normalised'] 548 | }) 549 | dependency_graph.append(paper_info) 550 | 551 | #append_to_report(workdir, "Done identifying references within the group") 552 | #g1.view() 553 | 554 | filename = g1.render( 555 | filename=os.path.join(workdir, "img", "citation_graph")) 556 | print("idx: %d" % (idx)) 557 | #print(parsed_papers) 558 | 559 | dependency_graph_path = os.path.join(workdir, 560 | "normalised_dependency_graph.json") 561 | with open(dependency_graph_path, "w+") as dependency_graph_json: 562 | json.dump(dependency_graph, dependency_graph_json) 563 | 564 | 565 | def display_summary(workdir, parsed_papers_raw): 566 | succ = [] 567 | fail = [] 568 | succ_sigs = [] 569 | miss_sigs = [] 570 | all_titles = [] 571 | append_to_report(workdir, "######################################") 572 | append_to_report(workdir, " Parsed papers summary ") 573 | #append_to_report(workdir, "######################################") 574 | all_normalised_citations = set() 575 | for paper in parsed_papers_raw: 576 | append_to_report(workdir, "paper:") 577 | append_to_report(workdir, "\t%s" % (paper['filename'])) 578 | append_to_report(workdir, "\t%s" % (paper['result']['paper-title'])) 579 | append_to_report(workdir, "\tNormalised title: %s" % 580 | (paper['result']['paper-title-normalised'])) 581 | append_to_report(workdir, "\tReferences:") 582 | for ref in paper['result']['references']: 583 | if ref['parsed'] != None: 584 | append_to_report(workdir, "\t\tAuthors: %s" % (ref['parsed']['Authors'])) 585 | append_to_report(workdir, "\t\tTitle: %s" % (ref['parsed']['Title'])) 586 | append_to_report(workdir, "\t\tNormalised: %s" % (ref['normalised-title'])) 587 | all_normalised_citations.add(ref['normalised-title']) 588 | else: 589 | append_to_report(workdir, "\t\tCould not parse") 590 | append_to_report(workdir, "\t\t-------------------") 591 | #print("\t\t%s"%(str(ref['parsed']))) 592 | 593 | 
#print("[+] %s"%(paper['title'])) 594 | if paper['result'] == None: 595 | fail.append(paper['filename']) 596 | else: 597 | succ.append(paper['filename']) 598 | 599 | for ref in paper['result']['references']: 600 | if ref['parsed'] != None: 601 | succ_sigs.append(ref['parsed']) 602 | else: 603 | miss_sigs.append(ref['raw_content']) 604 | 605 | #succ_sigs.append(paper['result']['success-sigs']) 606 | #miss_sigs.append(paper['result']['missing-sigs']) 607 | all_titles.append(paper['result']['paper-title']) 608 | 609 | # Check which papers are in our set: 610 | append_to_report(workdir, "######################################") 611 | append_to_report(workdir, "Papers in the data set :") 612 | #append_to_report(workdir, "######################################") 613 | cited_papers = list() 614 | noncited_papers = list() 615 | for paper in parsed_papers_raw: 616 | print(paper['result']['paper-title']) 617 | cited_by_group = False 618 | for s in all_normalised_citations: 619 | if s == paper['result']['paper-title-normalised']: 620 | cited_by_group = True 621 | if cited_by_group: 622 | cited_papers.append(paper) 623 | else: 624 | noncited_papers.append(paper) 625 | #append_to_report(workdir, "######################################") 626 | append_to_report(workdir, "Cited papers:") 627 | #append_to_report(workdir, "######################################") 628 | for p in cited_papers: 629 | append_to_report(workdir, "\t%s" % (p['result']['paper-title'])) 630 | 631 | 632 | #append_to_report(workdir, "######################################") 633 | append_to_report(workdir, "Noncited papers:") 634 | #append_to_report(workdir, "######################################") 635 | for p in noncited_papers: 636 | append_to_report(workdir, "\t%s" % (p['result']['paper-title'])) 637 | 638 | #exit(0) 639 | 640 | # Now display the content. 
641 | append_to_report(workdir, "Succ: %d" % (len(succ))) 642 | append_to_report(workdir, "Fail: %d" % (len(fail))) 643 | append_to_report(workdir, "Succ sigs: %d" % (len(succ_sigs))) 644 | append_to_report(workdir, "Failed sigs: %d" % (len(miss_sigs))) 645 | 646 | #append_to_report(workdir, "Summary of references:") 647 | # Now do some status on how many references we have in total 648 | all_refs = dict() 649 | #for siglist in succ_sigs: 650 | for sig in succ_sigs: 651 | # Normalise 652 | #append_to_report(workdir, "Normalising: %s" % (sig['Title'])) 653 | tmp_sig = sig['Title'].lower().replace(",", "").replace(" ", 654 | "").replace( 655 | "-", "") 656 | #append_to_report(workdir, "\tNormalised: %s" % (tmp_sig)) 657 | 658 | #if "\xe2\x80\x9c" in tmp_sig: 659 | # s1_split = tmp_sig.split("\xe2\x80\x9c") 660 | # if "\xe2\x80\x9d" in s1_split[1]: 661 | # tmp_sig = s1_split[1].split("\xe2\x80\x9d")[0] 662 | # else: 663 | # tmp_sig=s1_split[1] 664 | 665 | if tmp_sig not in all_refs: 666 | all_refs[tmp_sig] = 0 667 | all_refs[tmp_sig] += 1 668 | 669 | #if sig['Title'].lower() not in all_refs: 670 | # all_refs[sig['Title'].lower().strip()] = 0 671 | #all_refs[sig['Title'].lower().strip()] += 1 672 | 673 | append_to_report(workdir, "######################################") 674 | append_to_report(workdir, "All Titles (normalised) of the papers in the data set:") 675 | #append_to_report(workdir, "######################################") 676 | for t in all_titles: 677 | append_to_report(workdir, "\t%s" % (t)) 678 | append_to_report(workdir, "######################################") 679 | 680 | append_to_report(workdir, "######################################") 681 | append_to_report(workdir, "All references/citations (normalised) issued by these papers") 682 | #append_to_report(workdir, "######################################") 683 | append_to_report(workdir, "\tName : Count") 684 | sorted_list = list() 685 | for title in all_refs: 686 | sorted_list.append((title, all_refs[title])) 687 | sorted_list = sorted(sorted_list, key=lambda x: x[1]) 688 | for title, counts in sorted_list: 689 | append_to_report(workdir, "\t%s : %d" % (title, counts)) 690 | 691 | append_to_report(workdir, "######################################") 692 | append_to_report(workdir, "The total number of unique citations: %d" % (len(sorted_list))) 693 | #append_to_report(workdir, "######################################") 694 | return 695 | #exit(0) 696 | 697 | #all_missing_sigs = [] 698 | #for missig in miss_sigs: 699 | # all_missing_sigs += missig 700 | 701 | #for miss in missig: 702 | # print("###\t%s"%(str(miss))) 703 | 704 | #print( 705 | # "[+] The references that we were unable to fully parse, but yet are cited by the papers:" 706 | #) 707 | #print( 708 | # "Total amount of references not parsed and thus not included in analysis: %d" 709 | # % (len(all_missing_sigs))) 710 | #print("All of these references:") 711 | #for missig in all_missing_sigs: 712 | # print("###\t%s" % (str(missig))) 713 | 714 | 715 | def parse_first_stage(workdir, target_dir): 716 | parsed_papers = [] 717 | for json_f in os.listdir(target_dir): 718 | if ".json" in json_f: 719 | complete_filename = os.path.join(target_dir, json_f) 720 | print("Checking: %s" % (complete_filename)) 721 | res = read_json_file(complete_filename) 722 | parsed_papers.append({ 723 | "filename": complete_filename, 724 | "result": res 725 | }) 726 | #display_summary(parsed_papers) 727 | 728 | parsed_papers_json = os.path.join(workdir, "parsed_paper_data.json") 729 | with 
open(parsed_papers_json, "w+", encoding='utf-8') as ppj:
730 |         json.dump(parsed_papers, ppj)
731 | 
732 |     return parsed_papers
733 | 
734 | 
735 | if __name__ == "__main__":
736 |     # Stand-alone mode: parse all json files in the current directory
737 |     # (optionally filtered by a substring given as the first argument)
738 |     # and write the summary report to the current directory.
739 |     parsed_papers = []
740 |     for json_f in os.listdir("."):
741 |         if ".json" in json_f:
742 |             if len(sys.argv) == 2:
743 |                 if sys.argv[1] not in json_f:
744 |                     continue
745 |             print("Checking: %s" % (json_f))
746 | 
747 |             res = read_json_file(json_f)
748 |             if res == None:
749 |                 sys.stdout.flush()
750 |                 continue
751 |             parsed_papers.append({"filename": json_f, "result": res})
752 | 
753 |     print("Parsed %d json files" % (len(parsed_papers)))
754 |     display_summary(".", parsed_papers)
755 | 
--------------------------------------------------------------------------------
/pq_main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import simplejson as json
4 | import argparse
5 | import shutil
6 | 
7 | import pq_format_reader
8 | import pq_pdf_utility
9 | 
10 | 
11 | ### Utilities
12 | def create_working_dir(target_dir):
13 |     if os.path.isdir(target_dir):
14 |         shutil.rmtree(target_dir)
15 |     os.mkdir(target_dir)
16 | 
17 | 
18 | def create_workdir():
19 |     MDIR = "pq-out"
20 |     if not os.path.isdir(MDIR):
21 |         os.mkdir(MDIR)
22 | 
23 |     FNAME = "out-"
24 |     max_idx = -1
25 |     for l in os.listdir(MDIR):
26 |         if FNAME in l:
27 |             try:
28 |                 val = int(l.replace(FNAME, ""))
29 |                 if val > max_idx:
30 |                     max_idx = val
31 |             except:
32 |                 None
33 |     max_idx += 1
34 |     new_workdir = os.path.join(MDIR, "%s%d" % (FNAME, max_idx))
35 |     print("Making the work directory: %s" % (new_workdir))
36 |     os.mkdir(new_workdir)
37 |     return new_workdir
38 | 
39 | 
40 | ### Action functions
41 | def convert_pdfs_to_data(paper_dir, workdir, target_dir):
42 |     # First extract data from the pdf files, such as
43 |     # title as well as the references cited by each paper.
44 |     # This will produce json files with the data that we need.
45 |     filtered_pdf_list = pq_pdf_utility.convert_folder(workdir, paper_dir)
46 | 
47 |     # Now write it out to a json folder
48 |     create_working_dir(target_dir)
49 |     pq_pdf_utility.write_to_json(filtered_pdf_list, target_dir)
50 | 
51 | 
52 | def read_parsed_json_data(workdir, target_dir):
53 |     # Reads the json files produced by the pdf conversion stage, and
54 |     # extracts the various forms of data.
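    # Output sketch (file names taken from the functions called below): this
    # stage populates the work directory with, roughly,
    #   report.txt                       - human readable summary report
    #   parsed_paper_data.json           - raw parsed data for every paper
    #   img/citation_graph(.png)         - rendered citation graph
    #   normalised_dependency_graph.json - who-cites-who data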
55 | parsed_papers = pq_format_reader.parse_first_stage(workdir, target_dir) 56 | 57 | pq_format_reader.display_summary(workdir, parsed_papers) 58 | 59 | # Now extract the group dependencies 60 | pq_format_reader.identify_group_dependencies(workdir, parsed_papers) 61 | 62 | 63 | ### CLI 64 | def parse_args(): 65 | parser = argparse.ArgumentParser( 66 | "pq_main.py", description="Quantitative analysis of papers") 67 | 68 | parser.add_argument("-f", "--folder", help="Folder with PDFs") 69 | parser.add_argument("--task", 70 | help="Specify a given task", 71 | default="dependency-graph") 72 | parser.add_argument("--wd", help="Workdir") 73 | 74 | args = parser.parse_args() 75 | if args.folder != None: 76 | print("Analysing folder the pdfs in folder: %s" % (args.folder)) 77 | 78 | return args 79 | 80 | 81 | if __name__ == "__main__": 82 | args = parse_args() 83 | if args.task == "json-only": 84 | read_parsed_json_data(args.wd, os.path.join(args.wd, "json_data")) 85 | elif args.task == "dependency-graph": 86 | workdir = create_workdir() 87 | json_data_dir = os.path.join(workdir, "json_data") 88 | 89 | convert_pdfs_to_data(args.folder, workdir, json_data_dir) 90 | read_parsed_json_data(workdir, json_data_dir) 91 | else: 92 | print("Task not supported") 93 | -------------------------------------------------------------------------------- /pq_pdf_utility.py: -------------------------------------------------------------------------------- 1 | import xml.etree.ElementTree as ET 2 | import subprocess 3 | import shutil 4 | import sys 5 | import os 6 | import simplejson as json 7 | 8 | 9 | def print_recursively(elem, depth): 10 | # The below is only for debugging 11 | #print("\t"*depth + "%s - %s"%(elem.tag, elem.attrib)) 12 | #if "size" in elem.attrib: 13 | #sz = float(elem.attrib['size']) 14 | #print("We got a size: %f"%(sz)) 15 | #if sz > 20.0: 16 | # print("Text: %s"%(elem.text)) 17 | for child in elem: 18 | print_recursively(child, depth + 1) 19 | 20 | 21 | def get_all_texts_rec(elem): 22 | ''' 23 | Extracts all elements from an xml 24 | file generatex by pdf2txt. As such, this function 25 | is simply a partial XML parser. 26 | This is a recursive function where the recursion 27 | is used to capture recursive XML elements. 28 | ''' 29 | global saw_page2 30 | all_texts = [] 31 | 32 | # If we hit page 2, then we return empty list 33 | if saw_page2 == True: 34 | return [] 35 | 36 | # We don't want to move beyond page 2, as the 37 | # we assume the title must have been given here. 38 | if elem.tag == "page": 39 | if int(elem.attrib['id']) > 3: 40 | saw_page2 = True 41 | return [] 42 | 43 | # Extract all XML elements recursively 44 | for c in elem: 45 | tmp_txts = get_all_texts_rec(c) 46 | for t2 in tmp_txts: 47 | all_texts.append(t2) 48 | 49 | # If the current elem is a text element, 50 | # add this to our list of all texts. 51 | if elem.tag == "text": 52 | all_texts.append(elem) 53 | return all_texts 54 | 55 | 56 | def get_all_texts(elem): 57 | global saw_page2 58 | saw_page2 = False 59 | return get_all_texts_rec(elem) 60 | 61 | 62 | def get_title(the_file): 63 | ''' 64 | Gets the title of the paper by: 65 | 1) Reading an xml representation of the PDF converted by pdf2txt 66 | 2) Searching for the top font size of page 1 67 | 3) Exiting if we reach page 2 of the paper. 68 | 69 | returns 70 | @max_string which is the string with the max font in a paper 71 | @second_max_string which is the string with the second max font 72 | in a paper. 
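    In a typical paper the largest font on the first pages is the title, so
    max_string is usually the title and second_max_string is usually the
    author list or venue; the caller decides which of the two to keep (see
    should_select_title below).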
73 | ''' 74 | print("[+] Getting title") 75 | print("Extracting content from %s" % (the_file)) 76 | #print("current working directory: %s"%(os.getcwd())) 77 | try: 78 | tree = ET.parse(the_file) 79 | except: 80 | print("[+] Error when ET.parse of the file %s" % (the_file)) 81 | return None 82 | 83 | root = tree.getroot() 84 | #print_recursively(root, 0) 85 | all_texts = get_all_texts(root) 86 | sizes = dict() 87 | latest_size = None 88 | for te in all_texts: 89 | #print("%s-%s"%(te.attrib, te.text)) 90 | if 'size' not in te.attrib: 91 | if latest_size != None: 92 | sizes[latest_size].append(" ") 93 | continue 94 | sz = float(te.attrib['size']) 95 | latest_size = sz 96 | if sz not in sizes: 97 | sizes[sz] = list() 98 | sizes[sz].append(te.text) 99 | 100 | # We now have all the text elements and we can proceed 101 | # to extract the elements with the highest and second-highest 102 | # font sizes. 103 | sorted_sizes = sorted(sizes.keys()) 104 | 105 | # Highest font size 106 | max_string = "" 107 | for c in sizes[sorted_sizes[-1]]: 108 | max_string += c 109 | 110 | # Second highest font size 111 | second_max_string = "" 112 | for c in sizes[sorted_sizes[-2]]: 113 | second_max_string += c 114 | 115 | # Log them 116 | #print("max %s"%(max_string)) 117 | #print("second max %s"%(second_max_string)) 118 | 119 | # Print the bytes: only used for debugging 120 | ords = list() 121 | for c in max_string: 122 | ords.append(hex(ord(c))) 123 | s1 = " ".join(ords) 124 | #print("\t\t%s"%(s1)) 125 | #print("\tCompleted getting title") 126 | return max_string, second_max_string 127 | 128 | 129 | def get_references(text_file): 130 | ''' 131 | Reads a txt file converted by pdf2txt and extracts 132 | the "references" section of the paper. 133 | 134 | This function assumes the references are at the end 135 | of a paper and that there might be an appendix after. 136 | As such, we read the text from "References" until end 137 | of the file, and only stop in case we hit an 138 | "appendix" keyword. 139 | ''' 140 | references = "" 141 | reading_references = False 142 | with open(text_file, "r") as tf: 143 | for l in tf: 144 | if reading_references == True and "appendix" in l.lower(): 145 | reading_references = False 146 | 147 | if reading_references: 148 | references += l 149 | 150 | if "references" in l.lower(): 151 | #print("We have a line with references: %s"%(l)) 152 | if len(l.split(" ")) < 3: 153 | reading_references = True 154 | #print("References: %s"%(references)) 155 | return references 156 | 157 | 158 | def convert_folder(workdir, folder_name): 159 | ''' 160 | Converts an entire folder of PDFs into representations 161 | where we have the: 162 | title 163 | references 164 | for each paper. 165 | 166 | 167 | returns a list of dictionaries, where each element 168 | in the dictionary holds data about a given paper. 169 | ''' 170 | all_titles = list() 171 | all_seconds = list() 172 | title_pairs = list() 173 | 174 | paper_list = [] 175 | data_out_dir = os.path.join(workdir, "data_out") 176 | if not os.path.isdir(data_out_dir): 177 | os.mkdir(data_out_dir) 178 | for l in os.listdir(folder_name): 179 | if ".pdf" not in l: 180 | continue 181 | 182 | print("======================================") 183 | print("[+] working on %s" % (l)) 184 | paper_dict = { 185 | "file": l, 186 | "title": None, 187 | "second_title": None, 188 | "references": None, 189 | "success": False 190 | } 191 | 192 | # First step. 193 | # Convert the pdf to xml format. This is for getting 194 | # the title of the paper. 
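        # Illustration (assuming pdfminer's pdf2txt.py tool is on the PATH):
        # for example-papers/1908.09204.pdf this builds a command along the
        # lines of
        #   pdf2txt.py -t xml -o <workdir>/data_out/1908.09204.pdf_analysed.xml example-papers/1908.09204.pdf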
195 | target_xml = os.path.join(data_out_dir, "%s_analysed.xml" % (l)) 196 | cmd = ["pdf2txt.py", "-t", "xml", "-o", target_xml] 197 | cmd.append(os.path.join(folder_name, l)) 198 | try: 199 | subprocess.check_call(" ".join(cmd), shell=True) 200 | except: 201 | paper_list.append(paper_dict) 202 | print("Could not execute the call") 203 | continue 204 | 205 | try: 206 | res = get_title(target_xml) 207 | if res == None: 208 | paper_list.append(paper_dict) 209 | continue 210 | the_title, second = res 211 | except: 212 | # print("Exception in get_title") 213 | paper_list.append(paper_dict) 214 | continue 215 | all_titles.append(the_title) 216 | all_seconds.append(second) 217 | 218 | # Second step. 219 | # Convert the pdf to txt format. This step if for getting 220 | # the references of the paper. 221 | target_txt = os.path.join(data_out_dir, "%s_analysed.txt" % (l)) 222 | cmd = ["pdf2txt.py", "-t", "text", "-o", target_txt] 223 | cmd.append(os.path.join(folder_name, l)) 224 | try: 225 | subprocess.check_call(" ".join(cmd), shell=True) 226 | except: 227 | print("Could not execute the call") 228 | paper_list.append(paper_dict) 229 | continue 230 | 231 | references = get_references(target_txt) 232 | paper_dict = { 233 | "file": l, 234 | "title": the_title, 235 | "second_title": second, 236 | "references": references, 237 | "success": True 238 | } 239 | paper_list.append(paper_dict) 240 | #print("Adding: %s ----- %s"%(the_title, second)) 241 | 242 | return paper_list 243 | 244 | 245 | def should_select_title(title): 246 | # Check if there is any characters with value greater than 0xff in first 247 | # If this is the case then we use the second highest font for title. 248 | total_above = 0 249 | for c in title: 250 | if ord(c) > 0xff: 251 | total_above += 1 252 | if total_above > 3: 253 | return False 254 | 255 | # Now check if "(cid:" is in the name:. If it is, 256 | # then we will use the second highest font for title. 
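    # pdf2txt emits placeholders such as "(cid:123)" for glyphs it cannot map
    # to text, so a candidate title containing "(cid:" is most likely garbled.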
257 | if "(cid:" in str(title): 258 | return False 259 | 260 | return True 261 | 262 | 263 | def write_to_json(results, target_directory=None): 264 | counter = 0 265 | for paper_dict in results: 266 | if paper_dict['success'] == False: 267 | print("####### %s" % (paper_dict['file'])) 268 | print("Unsuccessful") 269 | print("-" * 60) 270 | continue 271 | 272 | first = paper_dict['title'] 273 | second = paper_dict['second_title'] 274 | use_second = not should_select_title(first) 275 | 276 | # Create a json dictionary and write it to the file system 277 | json_dict = { 278 | "Title": first if use_second == False else second, 279 | "References": paper_dict['references'], 280 | "Year": "1999", 281 | "Authors": "David", 282 | "ReferenceType": "Automatically" 283 | } 284 | filepath = "json_dump_%d.json" % (counter) 285 | if target_directory != None: 286 | filepath = os.path.join(target_directory, filepath) 287 | with open(filepath, "w+") as jf: 288 | json.dump(json_dict, jf) 289 | counter += 1 290 | 291 | # Now print the content for convenience 292 | print("########## %s" % (paper_dict['file'])) 293 | print("[+] Title: %s" % (json_dict['Title'])) 294 | #print("[+] References: ") 295 | #print("%s"%(paper_dict['references'])) 296 | print("-" * 60) 297 | 298 | 299 | if __name__ == "__main__": 300 | results = convert_folder("papers") 301 | target_dir = "json_data" 302 | if os.path.isdir(target_dir): 303 | shutil.rmtree(target_dir) 304 | os.mkdir(target_dir) 305 | os.chdir(target_dir) 306 | write_to_json(results) 307 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | wheel 2 | graphviz 3 | matplotlib 4 | pdf2txt 5 | requests 6 | simplejson 7 | git+git://github.com/kermitt2/grobid_client_python@4bce8b574da363171079b97eb84fe5ecfce8cfdb#egg=grobid_client 8 | pymongo 9 | docopt 10 | bs4 11 | serde 12 | lxml --------------------------------------------------------------------------------