├── .gitignore
├── .gitmodules
├── LICENSE
├── README.md
├── __init__.py
├── alexandria
│   ├── README.md
│   ├── base.py
│   ├── grobid.json
│   ├── runner.py
│   └── visualization.py
├── docs
│   ├── larger_example.md
│   └── simple_example.md
├── example-images
│   ├── fuzz-barplot.png
│   ├── fuzz-wordcloud.png
│   └── paper-citation-graph.png
├── example-papers
│   ├── 1908.09204.pdf
│   └── 1908.10167.pdf
├── gen_wordcloud.py
├── install.sh
├── pq_format_reader.py
├── pq_main.py
├── pq_pdf_utility.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | pq-out/*
2 | paper-pdfs
3 | venv
4 | tmp
5 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/.gitmodules
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 AdaLogics Ltd
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Paper analyser
2 | The goal of this project is to enable quantitative analysis of
3 | academic papers.
4 |
5 | To achieve the goal, the project contains logic for
6 |
7 | 1. Parsing academic white papers into structured representations
8 | 1. Doing analysis on the structured representations
9 |
10 | #### Paper dependency graph
11 | The project as it currently stands focuses on the task of taking
12 | a list of arbitrary papers in the form of PDFs, and then creating
13 | a dependency graph of citations amongst these papers. This graph
14 | then shows how the PDFs reference each other. Paper analyser
15 | achieves this by going through the steps:
16 |
17 | 1. Parse the papers to extract relevant data
18 | 1. Read the PDF files to a format usable in Python
19 | 1. Extract title of a given paper
20 | 1. Extract the raw data of the "References" section
21 | 1. Parse the raw "References" section into individual references:
22 | 1. Extract the title and authors of the citation
23 | 1. Normalise the data of the extracted citations (see the sketch below)
24 | 1. Do dependency analysis based on the above citation extractions
25 |
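The title normalisation step is what makes it possible to match a citation in one paper against the title of another paper despite small formatting differences. A minimal sketch of the idea (the actual normalisation is done by the `pq_*` modules and may differ in detail) that broadly reproduces the normalised titles shown in the example reports:

```
import re

def normalise_title(title):
    # Lowercase and strip whitespace, hyphens, commas, periods and apostrophes
    # so that small formatting differences between citations do not matter.
    return re.sub(r"[\s\-,.']", "", title.lower())

print(normalise_title("System-Level Support for Intrusion Recovery"))
# -> systemlevelsupportforintrusionrecovery
```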
26 |
27 | ## Usage
28 | For a simple example of usage, look at [simple_example.md](/docs/simple_example.md)
29 |
30 | Paper analyser takes as input PDF files of academic papers and outputs data about these papers.
31 | For convenience we maintain a list of links to software analysis papers
32 | focused on software security in our sister repository [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list)
33 |
34 | To see an example of doing analysis on many papers, look at the explanation in [larger_example.md](/docs/larger_example.md)
35 |
36 | ## Example visualisation
37 |
38 | We have also created visualisations of the output of the paper
39 | analyser, which make it easy to rapidly understand the
40 | relationships between the academic papers in the data set.
41 |
42 | See the following link for an example of the visualisations:
43 | [https://adalogics.com/software-security-research-citations-visualiser](https://adalogics.com/software-security-research-citations-visualiser)
44 |
45 | These visualisations will be open sourced in the near future.
46 |
47 |
48 | ### Citation graph example:
49 |
50 | 
51 |
52 |
53 | ### Wordcloud of 85 fuzzing papers
54 | Example of a wordcloud generated from the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This wordcloud discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/)
55 | 
56 |
57 | ### Wordcount of 85 fuzzing papers
58 | A barplot of the word frequencies in the papers in the "Fuzzing" section of [software-security-paper-list](https://github.com/AdaLogics/software-security-paper-list). This plot discounts the 100 most common English words [https://www.espressoenglish.net/the-100-most-common-words-in-english/](https://www.espressoenglish.net/the-100-most-common-words-in-english/)
59 | 
60 |
61 | ## Installation
62 | ```
63 | git clone https://github.com/AdaLogics/paper-analyser
64 | cd paper-analyser
65 | ./install.sh
66 | ```
67 |
68 | ## Contribute
69 | We welcome contributions.
70 |
71 | Paper analyser is maintained by:
72 | * [David Korczynski](https://twitter.com/Davkorcz)
73 | * [Adam Korczynski](https://twitter.com/AdamKorcz4)
74 | * [Giovanni Cherubin](https://twitter.com/gchers)
75 |
76 | We are particularly interested in features for:
77 | 1. Improved parsing of the PDF files to get better structured output
78 | 1. Adding more data analysis to the project
79 |
80 |
81 | ### Feature suggestions
82 | If you would like to contribute but don't have a feature in mind, please see the list below for suggestions:
83 |
84 | * Extraction of authors from papers
85 | * Extraction of the actual text from the papers. This could be used for a lot of cool data analysis
86 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/__init__.py
--------------------------------------------------------------------------------
/alexandria/README.md:
--------------------------------------------------------------------------------
1 | # Installation
2 |
3 | Before using Alexandria, you need to have two services running:
4 | `grobid` (used for processing pdf files) and `mongodb`.
5 |
6 |
7 | ### Grobid
8 |
9 | You first need to have an instance of [`grobid`](https://grobid.readthedocs.io/en/latest/) running.
10 |
11 | Using Docker:
12 | ```
13 | docker run -t --rm --init -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.6.2
14 | ```
15 | The above command will download the image and run it.
16 | (Add `-d` to run in daemon mode.)
17 |
18 | Verify that `grobid` is running at http://localhost:8080 (the host port mapped by the Docker command above).
19 |
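A quick way to check this from Python (a minimal sketch; it assumes the port mapping from the Docker command above and that the `requests` package is installed):

```
import requests

# GROBID exposes a simple health endpoint; expect HTTP 200 and "true".
resp = requests.get("http://localhost:8080/api/isalive")
print(resp.status_code, resp.text)
```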
20 |
21 | If you configure `grobid` to listen on a different host/port, please adapt
22 | the `grobid.json` configuration.
23 |
24 | ### Mongodb
25 |
26 | Alexandria requires you to have a mongodb instance running
27 | locally at the standard port (`27017`).
28 |
29 | Using docker:
30 |
31 | ```
32 | docker run -p 27017:27017 mongo
33 | ```
34 | (Add `-d` to run in daemon mode.)
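
To confirm the connection from Python, a minimal check mirroring what `runner.py` does at startup (the short timeout is only there so the check fails fast):

```
import pymongo

client = pymongo.MongoClient("localhost", 27017, serverSelectionTimeoutMS=2000)
try:
    client.server_info()  # raises if the server cannot be reached
    print("MongoDB is up.")
except pymongo.errors.ServerSelectionTimeoutError:
    print("Failed to connect to MongoDB.")
```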
35 |
36 | ### Alexandria
37 |
38 | You're now ready to install Alexandria.
39 |
40 | ```
41 | git clone https://github.com/AdaLogics/paper-analyser
42 | cd paper-analyser/alexandria
43 | python3 -m venv venv
44 | . venv/bin/activate
45 | pip install -r ../requirements.txt
46 | ```
47 |
48 |
49 | # Usage
50 |
51 | To process the pdf files in `example-papers` launch:
52 |
53 | ```
54 | python runner.py ../example-papers
55 | ```
56 |
57 |
58 | The output will look like the following:
59 |
60 | ```
61 | Please check out the script for more info.
62 | GROBID server is up and running
63 | Connecting to db.
64 | Processing files in ../example-papers/
65 | Parsing 1908.09204.pdf
66 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction
67 | Authors: David Korczynski
68 | References:
69 | Title: System-Level Support for Intrusion Recovery
70 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos
71 | -------------------
72 | Title: WYSINWYX: What You See Is Not What You eXecute
73 | Authors: G Balakrishnan, T Reps, D Melski, T Teitelbaum
74 | -------------------
75 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code
76 | Authors: Bao Ti Any, Jonathan Burket, Maverick Woo, Rafael Turner, David Brumley
77 | -------------------
78 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps
79 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla
80 | -------------------
81 | Title: Bee Master: Detecting Host-Based Code Injection A acks
82 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla
83 | -------------------
84 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions
85 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel
86 | -------------------
87 | Title: Minemu: e World's Fastest Taint Tracker
88 | Authors: Erik Bosman, Asia Slowinska, Herbert Bos
89 | -------------------
90 | Title: Decoupling Dynamic Program Analysis from Execution in Virtual Environments
91 | Authors: Jim Chow, Peter Chen
92 | -------------------
93 | Title: Decompilation of Binary Programs. So w
94 | Authors: Cristina Cifuentes, K. John Gough
95 | -------------------
96 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions
97 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee
98 | -------------------
99 | Title: Repeatable Reverse Engineering with PANDA
100 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan
101 | -------------------
102 | Title: Graph-based comparison of executable objects (english version)
103 | Authors: Rolf Omas Dullien, Rolles
104 | -------------------
105 | Title: Signatures for Library Functions in Executable Files
106 | Authors: Mike Van Emmerik
107 | -------------------
108 | Title: Structural Comparison of Executable Objects
109 | Authors: ; Halvar Flake, G Sig Sidar, Workshop
110 | -------------------
111 | Title: A Study of the Packer Problem and Its Solutions
112 | Authors: Fanglu Guo, Peter Ferrie, Tzi-Cker Chiueh
113 | -------------------
114 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform
115 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant
116 | -------------------
117 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features
118 | Authors: Xin Hu, Sandeep Bhatkar, Kent Gri N, Kang Shin
119 | -------------------
120 | Title:
121 | Authors: Hungenberg, Eckert Ma Hias
122 | -------------------
123 | Title: malWASH: Washing Malware to Evade Dynamic Analysis
124 | Authors: K Kyriakos, Mathias Ispoglou, Payer
125 | -------------------
126 | Title: Labeling Library Functions in Stripped Binaries
127 | Authors: Emily Jacobson, Nathan Rosenblum, Barton Miller
128 | -------------------
129 | Title: Secure and advanced unpacking using computer emulation
130 | Authors: Sébastien Josse
131 | -------------------
132 | Title: Malware Dynamic Recompilation
133 | Authors: S Josse
134 | -------------------
135 | Title: Renovo: A Hidden Code Extractor for Packed Executables
136 | Authors: Min Kang, Pongsin Poosankam, Heng Yin
137 | -------------------
138 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation
139 | Authors: Yuhei Kawakoya, Makoto Iwamura, Jun Miyoshi
140 | -------------------
141 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis
142 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura
143 | -------------------
144 | Title: Jakstab: A Static Analysis Platform for Binaries
145 | Authors: Johannes Kinder, Helmut Veith
146 | -------------------
147 | Title: Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code
148 | Authors: Clemens Kolbitsch, Engin Kirda, Christopher Kruegel
149 | -------------------
150 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction
151 | Authors: David Korczynski
152 | -------------------
153 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks
154 | Authors: David Korczynski, Heng Yin
155 | -------------------
156 | Title: Polymorphic Worm Detection Using Structural Information of Executables
157 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, Giovanni Vigna
158 | -------------------
159 | Title: Static Disassembly of Obfuscated Binaries
160 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, Giovanni Vigna
161 | -------------------
162 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects
163 | Authors: Yujia Li, Chenjie Gu, Omas Dullien
164 | -------------------
165 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware
166 | Authors: L Martignoni, M Christodorescu, S Jha
167 | -------------------
168 | Title: Measuring and Defeating Anti-Instrumentation-Equipped Malware
169 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Lorenzo Stefano D'alessio, Fabio Fontana, Stefano Gri, Zanero
170 | -------------------
171 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks
172 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos
173 | -------------------
174 | Title:
175 | Authors: Symantec Security Response
176 | -------------------
177 | Title: Learning to Analyze Binary Computer Code
178 | Authors: Nathan Rosenblum, Xiaojin Zhu, Barton Miller, Karen Hunt
179 | -------------------
180 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware
181 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, Wenke Lee
182 | -------------------
183 | Title: Eureka: A Framework for Enabling Static Malware Analysis
184 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, Wenke Lee
185 | -------------------
186 | Title: Binary Translation
187 | Authors: Richard Sites, Anton Cherno, B Ma Hew, Maurice Kirk, Sco Marks, Robinson
188 | -------------------
189 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis
190 | Authors: Wei Song, Chang Heng Yin, Dawn Liu, Song
191 | -------------------
192 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers
193 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas
194 | -------------------
195 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis
196 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda
197 | -------------------
198 |
199 | Parsing 1908.10167.pdf
200 | Title: RePEconstruct: reconstructing binaries with self-modifying code and import address table destruction
201 | Authors: David Korczynski, Createprocessa, Createfilea, ,
202 | References:
203 | Title: System-Level Support for Intrusion Recovery
204 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, Herbert Bos
205 | -------------------
206 | Title: incy: Detecting Host-Based Code Injection A acks in Memory Dumps
207 | Authors: Niklas Omas Barabosch, Adrian Bergmann, Elmar Dombeck, Padilla
208 | -------------------
209 | Title: Bee Master: Detecting Host-Based Code Injection A acks
210 | Authors: Sebastian Barabosch, Elmar Eschweiler, Gerhards-Padilla
211 | -------------------
212 | Title: Host-based code injection a acks: A popular technique used by malware
213 | Authors: Elmar Omas Barabosch, Gerhards-Padilla
214 | -------------------
215 | Title: A View on Current Malware Behaviors
216 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzaro I, Engin Kirda
217 | -------------------
218 | Title: Dynamic Analysis of Malicious Code
219 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, Engin Kirda
220 | -------------------
221 | Title: Dridex's Cold War
222 | Authors: Magal Baz, Or Safran
223 | -------------------
224 | Title: QEMU, a Fast and Portable Dynamic Translator
225 | Authors: Fabrice Bellard
226 | -------------------
227 | Title: CoDisasm: Medium Scale Concatic Disassembly of Self-Modifying Binaries with Overlapping Instructions
228 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel
229 | -------------------
230 | Title: Understanding Linux Malware
231 | Authors: E Cozzi, M Graziano, Y Fratantonio, D Balzaro I
232 | -------------------
233 | Title: Understanding Linux malware
234 | Authors: Emanuele Cozzi, Mariano Graziano, Yanick Fratantonio, Davide Balzaro I
235 | -------------------
236 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions
237 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, Wenke Lee
238 | -------------------
239 | Title: Repeatable Reverse Engineering with PANDA
240 | Authors: Brendan Dolan-Gavi, Josh Hodosh, Patrick Hulin, Tim Leek, Ryan Whelan
241 | -------------------
242 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools
243 | Authors: Manuel Egele, Engin Eodoor Scholte, Christopher Kirda, Kruegel
244 | -------------------
245 | Title: A Survey of Mobile Malware in the Wild
246 | Authors: Adrienne Felt, Ma Hew Fini Er, Erika Chin, Steve Hanna, David Wagner
247 | -------------------
248 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform
249 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, Stephen Mccamant
250 | -------------------
251 | Title: Ten Process Injection Techniques: A Technical Survey Of Common And Trending Process Injection Techniques
252 | Authors: Ashkan Hosseini
253 | -------------------
254 | Title:
255 | Authors: Hungenberg, Eckert Ma Hias
256 | -------------------
257 | Title: malWASH: Washing Malware to Evade Dynamic Analysis
258 | Authors: K Kyriakos, Mathias Ispoglou, Payer
259 | -------------------
260 | Title: Renovo: A Hidden Code Extractor for Packed Executables
261 | Authors: Min Kang, Pongsin Poosankam, Heng Yin
262 | -------------------
263 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis
264 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura
265 | -------------------
266 | Title: RePEconstruct: reconstructing binaries with selfmodifying code and import address table destruction
267 | Authors: David Korczynski
268 | -------------------
269 | Title: Precise system-wide concatic malware unpacking. arXiv e-prints
270 | Authors: David Korczynski
271 | -------------------
272 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse A acks
273 | Authors: David Korczynski, Heng Yin
274 | -------------------
275 | Title:
276 | Authors: Pasquale Giulio De
277 | -------------------
278 | Title:
279 | Authors: Daniel Plohmann, Martin Clauß, Elmar Padilla
280 | -------------------
281 | Title: Argos: an Emulator for Fingerprinting Zero-Day A acks
282 | Authors: Georgios Portokalidis, Asia Slowinska, Herbert Bos
283 | -------------------
284 | Title: AVclass: A Tool for Massive Malware Labeling
285 | Authors: Marcos Sebastián, Richard Rivera, Platon Kotzias, Juan Caballero
286 | -------------------
287 | Title: Malrec: Compact Full-Trace Malware Recording for Retrospective Deep Analysis
288 | Authors: Giorgio Severi, Tim Leek, Brendan Dolan-Gavi
289 | -------------------
290 | Title: Nor Badrul Anuar, Rosli Salleh, and Lorenzo Cavallaro
291 | Authors: Kimberly Tam, Ali Feizollah
292 | -------------------
293 | Title: SoK: Deep Packer Inspection: A Longitudinal Study of the Complexity of Run-Time Packers
294 | Authors: Davide Xabier Ugarte-Pedrero, Igor Balzaro I, Pablo Santos, Bringas
295 | -------------------
296 | Title: Deep Ground Truth Analysis of Current Android Malware
297 | Authors: Fengguo Wei, Yuping Li, Sankardas Roy, Xinming Ou, Wu Zhou
298 | -------------------
299 | Title: Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis
300 | Authors: Dawn Heng Yin, Manuel Song, Christopher Egele, Engin Kruegel, Kirda
301 | -------------------
302 | Title: Dissecting Android Malware: Characterization and Evolution
303 | Authors: Yajin Zhou, Xuxian Jiang
304 | -------------------
305 | ```
306 |
307 |
308 | # Processing data
309 |
310 | A simple example of how to process the data: [visualization.py](visualization.py).
311 |
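As a smaller sketch (assuming articles have already been stored in MongoDB by `runner.py` as described above), the stored documents can be loaded back and summarised like this:

```
from pymongo import MongoClient

from base import Article

client = MongoClient("localhost", 27017)
db = client.alexandria

# Deserialise each stored document back into an Article and print a summary.
for doc in db.articles.find():
    Article.from_dict(doc).summary()
```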
--------------------------------------------------------------------------------
/alexandria/base.py:
--------------------------------------------------------------------------------
1 | """Contains the data structures and parsing functions to process
2 | and store data contained in an academic article.
3 | """
4 | import re
5 | from bs4 import BeautifulSoup
6 | from serde import Model, fields
7 |
8 |
9 | def text(field):
10 | return field.text.strip() if field else ""
11 |
12 |
13 | class Author(Model):
14 | """Contains information about an author.
15 | """
16 | name: fields.Str()
17 | surname: fields.Str()
18 | affiliation: fields.Optional(fields.List(fields.Str()))
19 |
20 | def __init__(self, soup):
21 | """Creates an `Author` by parsing a `soup` of type
22 | `BeautifulSoup`.
23 | """
24 | if not soup.persname:
25 | self.name = ""
26 | self.surname = ""
27 | else:
28 | self.name = text(soup.persname.forename)
29 | self.surname = text(soup.persname.surname)
30 | # TODO: better affiliation parsing.
31 | self.affiliation = list(map(text, soup.find_all("affiliation")))
32 |
33 | def __str__(self):
34 | s = ""
35 | if self.name:
36 | s += self.name + " "
37 | if self.surname:
38 | s += self.surname
39 |
40 | return s.strip()
41 |
42 |
43 | class _Article(Model):
44 | """This is required by serde for serialization, which
45 | is unable to take references to self as a field.
46 | For all purposes, refer to `Article`.
47 | """
48 | pass
49 |
50 |
51 | class Article(_Article):
52 | """Represents an academic article or a reference contained in
53 | an article.
54 |
55 | The data is parsed from a TEI XML file (`from_file()`) or
56 | directly from a `BeautifulSoup` object.
57 | """
58 | title: fields.Str()
59 | text: fields.Str()
60 | authors: fields.List(Author)
61 | year: fields.Optional(fields.Date())
62 | references: fields.Optional(fields.List(_Article))
63 |
64 | def __init__(self, soup, is_reference=False):
65 | """Create a new `Article` by parsing a `soup: BeautifulSoup`
66 | instance.
67 | The parameter `is_reference` specifies if the `soup` contains
68 | an entire article or just the content of a reference.
69 | """
70 | self.title = text(soup.title)
71 | self.doi = text(soup.idno)
72 | self.abstract = text(soup.abstract)
73 | self.text = soup.text.strip() if soup.text else ""
74 | # FIXME
75 | self.year = None
76 |
77 | if is_reference:
78 | self.authors = list(map(Author, soup.find_all("author")))
79 | self.references = []
80 | else:
81 | self.authors = list(map(Author, soup.analytic.find_all("author")))
82 | self.references = self._parse_biblio(soup)
83 |
84 | @staticmethod
85 | def from_file(tei_file):
86 | """Creates an `Article` by parsing a TEI XML file.
87 | """
88 | with open(tei_file) as f:
89 | soup = BeautifulSoup(f, "lxml")
90 |
91 | return Article(soup)
92 |
93 | def _parse_biblio(self, soup):
94 | """Parses the bibliography from an article.
95 | """
96 | references = []
97 | # NOTE: we could do this without the regex.
98 | bibs = soup.find_all("biblstruct", {"xml:id": re.compile(r"b[0-9]*")})
99 |
100 | for bib in bibs:
101 | if bib.analytic:
102 | references.append(Article(bib.analytic, is_reference=True))
103 | # NOTE: in this case, bib.monogr contains more info
104 | # about the manuscript where the paper was published.
105 | # Not parsing for now.
106 | elif bib.monogr:
107 | references.append(Article(bib.monogr, is_reference=True))
108 | else:
109 | print(f"Could not parse reference from {bib}")
110 |
111 | return references
112 |
113 | def __str__(self):
114 | return f"'{self.title}' - {' '.join(map(str, self.authors))}"
115 |
116 | def summary(self):
117 | """Prints a human-readable summary.
118 | """
119 | print(f"Title: {self.title}")
120 | print("Authors: " + ", ".join(map(str, self.authors)))
121 | if self.references:
122 | print("References:")
123 | for r in self.references:
124 | r.summary()
125 | print("-------------------")
126 |
--------------------------------------------------------------------------------
/alexandria/grobid.json:
--------------------------------------------------------------------------------
1 | {
2 | "grobid_server": "127.0.0.1",
3 | "grobid_port": "8080",
4 | "batch_size": 1000,
5 | "sleep_time": 5,
6 | "coordinates": [ "persName", "figure", "ref", "biblStruct", "formula" ]
7 | }
8 |
--------------------------------------------------------------------------------
/alexandria/runner.py:
--------------------------------------------------------------------------------
1 | """
2 | Usage:
3 | runner.py <pdf_dir>
4 | """
5 | import os
6 | import sys
7 | import pymongo
8 | from docopt import docopt
9 | from bs4 import BeautifulSoup
10 | from grobid_client.grobid_client import GrobidClient
11 |
12 | from base import Article
13 |
14 | # Valid grobid services
15 | FULL = "processFulltextDocument"
16 | HEADER = "processHeaderDocument"
17 | REFS = "processReferences"
18 |
19 |
20 | def parse_pdf(client, pdf_file):
21 | _, status, r = client.process_pdf(service=FULL,
22 | pdf_file=pdf_file,
23 | generateIDs=False,
24 | consolidate_header=True,
25 | consolidate_citations=False,
26 | include_raw_citations=False,
27 | include_raw_affiliations=False,
28 | teiCoordinates=True,
29 | )
30 |
31 | return Article(BeautifulSoup(r, "lxml"))
32 |
33 |
34 | if __name__ == "__main__":
35 | args = docopt(__doc__)
36 |
37 | grobid = GrobidClient("./grobid.json")
38 | client = pymongo.MongoClient("localhost", 27017)
39 | print("Connecting to db.")
40 | try:
41 | client.server_info()
42 | except pymongo.errors.ServerSelectionTimeoutError:
43 | print("Failed to connect to db.")
44 | sys.exit(1)
45 |
46 | db = client.alexandria
47 |
48 | print(f"Processing files in {args['<pdf_dir>']}")
49 |
50 | for pdf_file in os.listdir(args["<pdf_dir>"]):
51 | if not pdf_file.endswith(".pdf"):
52 | continue
53 | print(f"Parsing {pdf_file}")
54 | article = parse_pdf(grobid, os.path.join(args["<pdf_dir>"], pdf_file))
55 | # Can also do print(a.to_json())
56 | article.summary()
57 | print()
58 |
59 | db.articles.insert_one(article.to_dict())
60 |
--------------------------------------------------------------------------------
/alexandria/visualization.py:
--------------------------------------------------------------------------------
1 | import yake
2 | import matplotlib.pyplot as plt
3 | from pymongo import MongoClient
4 | from wordcloud import WordCloud
5 |
6 | from base import Article
7 |
8 |
9 | def find_keywords(articles, top_n=80, ngram_size=1):
10 | text = " ".join([a.text.lower() for a in articles])
11 |
12 | return yake.KeywordExtractor(top=top_n, n=ngram_size).extract_keywords(text)
13 |
14 | def wordcloud(articles, dst=None):
15 | # Get keywords.
16 | keywords, importance = zip(*find_keywords(articles))
17 | # Invert scores: yake assigns lower scores to more relevant keywords.
18 | importance = map(lambda x: 1-x, importance)
19 | keywords = dict(zip(keywords, importance))
20 |
21 | wordcloud = WordCloud(background_color="white",
22 | colormap="winter",
23 | #font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf",
24 | height=800,
25 | width=800,
26 | min_font_size=10,
27 | max_font_size=120,
28 | prefer_horizontal=3.0).generate_from_frequencies(keywords)
29 |
30 | plt.figure(figsize=(10.0, 10.0))
31 | plt.imshow(wordcloud, interpolation="bilinear")
32 | plt.axis("off")
33 | if dst:
34 | plt.savefig(dst)
35 |
36 |
37 | if __name__ == "__main__":
38 | client = MongoClient("localhost", 27017)
39 | db = client.alexandria
40 | articles = list(map(Article.from_dict, db.articles.find()))
41 |
42 | print(f"Fetched {len(articles)} articles.")
43 |
44 | keywords, _ = zip(*find_keywords(articles, top_n=5, ngram_size=2))
45 | print(f"Top 5 keywords: {keywords}")
46 |
47 | wordcloud(articles, "examples/wordcloud-final.png")
48 |
--------------------------------------------------------------------------------
/docs/larger_example.md:
--------------------------------------------------------------------------------
1 | # Example of larger analyses
2 |
3 | Paper analyser relies on PDF file representations of academic papers.
4 | As such, it is up to you to find these papers.
5 |
6 | For convenience we maintain a list of links to software analysis papers
7 | focused on software security in our sister repository [here](https://github.com/AdaLogics/software-security-paper-list)
8 |
9 | As an example of doing analysis on several Fuzzing papers, you can use the following commands:
10 |
11 | ```
12 | cd paper-analyser
13 | mkdir tmp && cd tmp
14 | git clone https://github.com/AdaLogics/software-security-paper-list
15 | cd software-security-paper-list
16 | python auto_download.py Fuzzing
17 | ```
18 |
19 | At this point you will see more than 80 papers in the directory `out/Fuzzing/`
20 |
21 | We continue to do analysis on these papers:
22 | ```
23 | cd ../..
24 | python3 pq_main.py -f ./tmp/software-security-paper-list/out/Fuzzing
25 | ```
26 |
27 |
--------------------------------------------------------------------------------
/docs/simple_example.md:
--------------------------------------------------------------------------------
1 | # Simple example
2 |
3 | We include two whitepapers in the repository as examples for using
4 | paper-analyser to parse papers and get results out.
5 |
6 | To try the tool, simply follow the commands:
7 | ```
8 | cd paper-analyser
9 | . venv/bin/activate
10 | python3 ./pq_main.py -f ./example-papers/
11 | ```
12 |
13 | At this point you will see results in `pq-out/out-0`
14 |
15 | Specifically, you will see:
16 | ```
17 | $ ls pq-out/out-0/
18 | data_out img json_data normalised_dependency_graph.json parsed_paper_data.json report.txt
19 | ```
20 |
21 | * `data_out` contains one `.txt` and one `.xml` file for each PDF. These `.txt` and `.xml` files are simply data representations of the content of the given PDF file.
22 | * `json_data` contains JSON data representations for each paper
23 | * `img` contains a `.png` image of a citation-dependency graph of the PDF files in the folder
24 | * `parsed_paper_data.json` is a single JSON file containing data about the papers analysed, such as the title of each paper as well as the papers cited by each paper (see the loading sketch at the end of this example).
25 | * `report.txt` contains a lot of information about the papers in the data set
26 |
27 | ```
28 | $ cat pq-out/out-0/report.txt
29 | ######################################
30 | Parsed papers summary
31 | paper:
32 | pq-out/out-0/json_data/json_dump_0.json
33 | b'A characterisation of system-wide propagation in the malware landscape '
34 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape
35 | References:
36 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos
37 | Title: System- Level Support for Intrusion Recovery
38 | Normalised: systemlevelsupportforintrusionrecovery
39 | -------------------
40 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla
41 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps
42 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps
43 | -------------------
44 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla
45 | Title: Bee Master: Detecting Host-Based Code Injection Attacks
46 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks
47 | -------------------
48 | Authors: Thomas Barabosch and Elmar Gerhards-Padilla
49 | Title: Host-based code injection attacks: A popular technique used by malware
50 | Normalised: hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware
51 | -------------------
52 | Authors: Ulrich Bayer, Imam Habibi, Davide Balzarotti, and Engin Kirda
53 | Title: A View on Current Malware Behaviors
54 | Normalised: aviewoncurrentmalwarebehaviors
55 | -------------------
56 | Authors: Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda
57 | Title: Dy- namic Analysis of Malicious Code
58 | Normalised: dynamicanalysisofmaliciouscode
59 | -------------------
60 | Authors: Magal Baz and Or Safran
61 | Title: Dridexs Cold War: Enter AtomBombing
62 | Normalised: dridexscoldwar:enteratombombing
63 | -------------------
64 | Authors: Fabrice Bellard
65 | Title: QEMU, a Fast and Portable Dynamic Translator
66 | Normalised: qemuafastandportabledynamictranslator
67 | -------------------
68 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry
69 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions
70 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions
71 | -------------------
72 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan
73 | Title: Repeatable Reverse Engineering with PANDA
74 | Normalised: repeatablereverseengineeringwithpanda
75 | -------------------
76 | Authors: Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel
77 | Title: A Survey on Automated Dynamic Malware-analysis Techniques and Tools
78 | Normalised: asurveyonautomateddynamicmalwareanalysistechniquesandtools
79 | -------------------
80 | Authors: Adrienne Porter Felt, Matthew Finifter, Erika Chin, Steve Hanna, and David Wagner
81 | Title: A Survey of Mobile Malware in the Wild
82 | Normalised: asurveyofmobilemalwareinthewild
83 | -------------------
84 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant
85 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform
86 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform
87 | -------------------
88 | Authors: Min Gyung Kang, Pongsin Poosankam, and Heng Yin
89 | Title: Renovo: A Hidden Code Extractor for Packed Executables
90 | Normalised: renovo:ahiddencodeextractorforpackedexecutables
91 | -------------------
92 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi
93 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis
94 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis
95 | -------------------
96 | Authors: David Korczynski
97 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction
98 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction
99 | -------------------
100 | Authors: David Korczynski
101 | Title: Precise system-wide concatic malware unpacking
102 | Normalised: precisesystemwideconcaticmalwareunpacking
103 | -------------------
104 | Authors: David Korczynski and Heng Yin
105 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks
106 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks
107 | -------------------
108 | Authors: Kimberly Tam, Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Lorenzo Caval- laro
109 | Title: The Evolution of Android Malware and Android Analysis Techniques
110 | Normalised: theevolutionofandroidmalwareandandroidanalysistechniques
111 | -------------------
112 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda
113 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis
114 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis
115 | -------------------
116 | Authors: Yajin Zhou and Xuxian Jiang
117 | Title: Dissecting Android Malware: Character- ization and Evolution
118 | Normalised: dissectingandroidmalware:characterizationandevolution
119 | -------------------
120 | paper:
121 | pq-out/out-0/json_data/json_dump_1.json
122 | b'Precise system-wide concatic malware unpacking '
123 | Normalised title: precisesystemwideconcaticmalwareunpacking
124 | References:
125 | Authors: Andrei Bacs, Remco Vermeulen, Asia Slowinska, and Herbert Bos
126 | Title: System- Level Support for Intrusion Recovery
127 | Normalised: systemlevelsupportforintrusionrecovery
128 | -------------------
129 | Authors: G. Balakrishnan, T. Reps, D. Melski, and T. Teitelbaum
130 | Title: WYSINWYX: What You See Is Not What You eXecute
131 | Normalised: wysinwyx:whatyouseeisnotwhatyouexecute
132 | -------------------
133 | Authors: Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley
134 | Title: BYTEWEIGHT: Learning to Recognize Functions in Binary Code
135 | Normalised: byteweight:learningtorecognizefunctionsinbinarycode
136 | -------------------
137 | Authors: Thomas Barabosch, Niklas Bergmann, Adrian Dombeck, and Elmar Padilla
138 | Title: Quincy: Detecting Host-Based Code Injection Attacks in Memory Dumps
139 | Normalised: quincy:detectinghostbasedcodeinjectionattacksinmemorydumps
140 | -------------------
141 | Authors: Thomas Barabosch, Sebastian Eschweiler, and Elmar Gerhards-Padilla
142 | Title: Bee Master: Detecting Host-Based Code Injection Attacks
143 | Normalised: beemaster:detectinghostbasedcodeinjectionattacks
144 | -------------------
145 | Authors: Guillaume Bonfante, Jose Fernandez, Jean-Yves Marion, Benjamin Rouxel, Fab- rice Sabatier, and Aurelien Thierry
146 | Title: CoDisasm: Medium Scale Con- catic Disassembly of Self-Modifying Binaries with Overlapping Instructions
147 | Normalised: codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions
148 | -------------------
149 | Authors: Erik Bosman, Asia Slowinska, and Herbert Bos
150 | Title: Minemu: The Worlds Fastest Taint Tracker
151 | Normalised: minemu:theworldsfastesttainttracker
152 | -------------------
153 | Could not parse
154 | -------------------
155 | Authors: Cristina Cifuentes and K. John Gough
156 | Title: Decompilation of Binary Pro- grams
157 | Normalised: decompilationofbinaryprograms
158 | -------------------
159 | Authors: Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee
160 | Title: Ether: Malware Analysis via Hardware Virtualization Extensions
161 | Normalised: ether:malwareanalysisviahardwarevirtualizationextensions
162 | -------------------
163 | Authors: Brendan Dolan-Gavitt, Josh Hodosh, Patrick Hulin, Tim Leek, and Ryan Whelan
164 | Title: Repeatable Reverse Engineering with PANDA
165 | Normalised: repeatablereverseengineeringwithpanda
166 | -------------------
167 | Authors: Thomas Dullien and Rolf Rolles
168 | Title: Graph-based comparison of executable objects (english version)
169 | Normalised: graphbasedcomparisonofexecutableobjects(englishversion)
170 | -------------------
171 | Authors: Mike Van Emmerik
172 | Title: Signatures for Library Functions in Executable Files
173 | Normalised: signaturesforlibraryfunctionsinexecutablefiles
174 | -------------------
175 | Authors: Halvar Flake
176 | Title: Structural Comparison of Executable Objects
177 | Normalised: structuralcomparisonofexecutableobjects
178 | -------------------
179 | Authors: Fanglu Guo, Peter Ferrie, and Tzi-cker Chiueh
180 | Title: A Study of the Packer Problem and Its Solutions
181 | Normalised: astudyofthepackerproblemanditssolutions
182 | -------------------
183 | Authors: Andrew Henderson, Lok-Kwong Yan, Xunchao Hu, Aravind Prakash, Heng Yin, and Stephen McCamant
184 | Title: DECAF: A Platform-Neutral Whole-System Dynamic Binary Analysis Platform
185 | Normalised: decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform
186 | -------------------
187 | Authors: Xin Hu, Sandeep Bhatkar, Kent Griffin, and Kang G. Shin
188 | Title: MutantX-S: Scalable Malware Clustering Based on Static Features
189 | Normalised: mutantxs:scalablemalwareclusteringbasedonstaticfeatures
190 | -------------------
191 | Authors: Thomas Hungenberg and Matthias Eckert
192 | Title: http://www
193 | Normalised: http://www
194 | -------------------
195 | Authors: Kyriakos K. Ispoglou and Mathias Payer
196 | Title: malWASH: Washing Malware to Evade Dynamic Analysis
197 | Normalised: malwash:washingmalwaretoevadedynamicanalysis
198 | -------------------
199 | Authors: Emily R. Jacobson, Nathan Rosenblum, and Barton P. Miller
200 | Title: Labeling Library Functions in Stripped Binaries
201 | Normalised: labelinglibraryfunctionsinstrippedbinaries
202 | -------------------
203 | Authors: Sebastien Josse
204 | Title: Secure and advanced unpacking using computer emulation
205 | Normalised: secureandadvancedunpackingusingcomputeremulation
206 | -------------------
207 | Authors: S. Josse
208 | Title: Malware Dynamic Recompilation
209 | Normalised: malwaredynamicrecompilation
210 | -------------------
211 | Authors: Min Gyung Kang, Pongsin Poosankam, and Heng Yin
212 | Title: Renovo: A Hidden Code Extractor for Packed Executables
213 | Normalised: renovo:ahiddencodeextractorforpackedexecutables
214 | -------------------
215 | Authors: Yuhei Kawakoya, Makoto Iwamura, and Jun Miyoshi
216 | Title: Taint-assisted IAT Reconstruction against Position Obfuscation
217 | Normalised: taintassistediatreconstructionagainstpositionobfuscation
218 | -------------------
219 | Authors: Yuhei Kawakoya, Eitaro Shioji, Makoto Iwamura, and Jun Miyoshi
220 | Title: API Chaser: Taint-Assisted Sandbox for Evasive Malware Analysis
221 | Normalised: apichaser:taintassistedsandboxforevasivemalwareanalysis
222 | -------------------
223 | Could not parse
224 | -------------------
225 | Authors: Clemens Kolbitsch, Engin Kirda, and Christopher Kruegel
226 | Title: The Power of Procrastination: Detection and Mitigation of Execution-stalling Malicious Code
227 | Normalised: thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode
228 | -------------------
229 | Authors: David Korczynski
230 | Title: RePEconstruct: reconstructing binaries with self- modifying code and import address table destruction
231 | Normalised: repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction
232 | -------------------
233 | Authors: David Korczynski and Heng Yin
234 | Title: Capturing Malware Propagations with Code Injections and Code-Reuse Attacks
235 | Normalised: capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks
236 | -------------------
237 | Authors: Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Gio- vanni Vigna
238 | Title: Polymorphic Worm Detection Using Structural Information of Executables
239 | Normalised: polymorphicwormdetectionusingstructuralinformationofexecutables
240 | -------------------
241 | Authors: Christopher Kruegel, William Robertson, Fredrik Valeur, and Giovanni Vigna
242 | Title: Static Disassembly of Obfuscated Binaries
243 | Normalised: staticdisassemblyofobfuscatedbinaries
244 | -------------------
245 | Authors: Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli
246 | Title: Graph Matching Networks for Learning the Similarity of Graph Structured Objects
247 | Normalised: graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects
248 | -------------------
249 | Authors: L. Martignoni, M. Christodorescu, and S. Jha
250 | Title: OmniUnpack: Fast, Generic, and Safe Unpacking of Malware
251 | Normalised: omniunpack:fastgenericandsafeunpackingofmalware
252 | -------------------
253 | Authors: Mario Polino, Andrea Continella, Sebastiano Mariani, Stefano DAlessio, Lorenzo Fontana, Fabio Gritti, and Stefano Zanero
254 | Title: Measuring and Defeating Anti- Instrumentation-Equipped Malware
255 | Normalised: measuringanddefeatingantiinstrumentationequippedmalware
256 | -------------------
257 | Authors: Georgios Portokalidis, Asia Slowinska, and Herbert Bos
258 | Title: Argos: an Emula- tor for Fingerprinting Zero-Day Attacks
259 | Normalised: argos:anemulatorforfingerprintingzerodayattacks
260 | -------------------
261 | Authors: Symantec Security Response
262 | Title: W32
263 | Normalised: w32
264 | -------------------
265 | Authors: Nathan E. Rosenblum, Xiaojin Zhu, Barton P. Miller, and Karen Hunt
266 | Title: Learning to Analyze Binary Computer Code
267 | Normalised: learningtoanalyzebinarycomputercode
268 | -------------------
269 | Authors: Paul Royal, Mitch Halpin, David Dagon, Robert Edmonds, and Wenke Lee
270 | Title: PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware
271 | Normalised: polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware
272 | -------------------
273 | Authors: Monirul Sharif, Vinod Yegneswaran, Hassen Saidi, Phillip Porras, and Wenke Lee
274 | Title: Eureka: A Framework for Enabling Static Malware Analysis
275 | Normalised: eureka:aframeworkforenablingstaticmalwareanalysis
276 | -------------------
277 | Authors: Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson
278 | Title: Binary Translation
279 | Normalised: binarytranslation
280 | -------------------
281 | Authors: Wei Song, Heng Yin, Chang Liu, and Dawn Song
282 | Title: DeepMem: Learning Graph Neural Network Models for Fast and Robust Memory Forensic Analysis
283 | Normalised: deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis
284 | -------------------
285 | Authors: Heng Yin, Dawn Song, Manuel Egele, Christopher Kruegel, and Engin Kirda
286 | Title: Panorama: Capturing System-wide Information Flow for Malware De- tection and Analysis
287 | Normalised: panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis
288 | -------------------
289 | ######################################
290 | Papers in the data set :
291 | Cited papers:
292 | b'Precise system-wide concatic malware unpacking '
293 | Noncited papers:
294 | b'A characterisation of system-wide propagation in the malware landscape '
295 | Succ: 2
296 | Fail: 0
297 | Succ sigs: 61
298 | Failed sigs: 2
299 | ######################################
300 | All Titles (normalised) of the papers in the data set:
301 | b'A characterisation of system-wide propagation in the malware landscape '
302 | b'Precise system-wide concatic malware unpacking '
303 | ######################################
304 | ######################################
305 | All references/citations (normalised) issued by these papers
306 | Name : Count
307 | hostbasedcodeinjectionattacks:apopulartechniqueusedbymalware : 1
308 | aviewoncurrentmalwarebehaviors : 1
309 | dynamicanalysisofmaliciouscode : 1
310 | dridexscoldwar:enteratombombing : 1
311 | qemuafastandportabledynamictranslator : 1
312 | asurveyonautomateddynamicmalwareanalysistechniquesandtools : 1
313 | asurveyofmobilemalwareinthewild : 1
314 | precisesystemwideconcaticmalwareunpacking : 1
315 | theevolutionofandroidmalwareandandroidanalysistechniques : 1
316 | dissectingandroidmalware:characterizationandevolution : 1
317 | wysinwyx:whatyouseeisnotwhatyouexecute : 1
318 | byteweight:learningtorecognizefunctionsinbinarycode : 1
319 | minemu:theworldsfastesttainttracker : 1
320 | decompilationofbinaryprograms : 1
321 | ether:malwareanalysisviahardwarevirtualizationextensions : 1
322 | graphbasedcomparisonofexecutableobjects(englishversion) : 1
323 | signaturesforlibraryfunctionsinexecutablefiles : 1
324 | structuralcomparisonofexecutableobjects : 1
325 | astudyofthepackerproblemanditssolutions : 1
326 | mutantxs:scalablemalwareclusteringbasedonstaticfeatures : 1
327 | http://www : 1
328 | malwash:washingmalwaretoevadedynamicanalysis : 1
329 | labelinglibraryfunctionsinstrippedbinaries : 1
330 | secureandadvancedunpackingusingcomputeremulation : 1
331 | malwaredynamicrecompilation : 1
332 | taintassistediatreconstructionagainstpositionobfuscation : 1
333 | thepowerofprocrastination:detectionandmitigationofexecutionstallingmaliciouscode : 1
334 | polymorphicwormdetectionusingstructuralinformationofexecutables : 1
335 | staticdisassemblyofobfuscatedbinaries : 1
336 | graphmatchingnetworksforlearningthesimilarityofgraphstructuredobjects : 1
337 | omniunpack:fastgenericandsafeunpackingofmalware : 1
338 | measuringanddefeatingantiinstrumentationequippedmalware : 1
339 | argos:anemulatorforfingerprintingzerodayattacks : 1
340 | w32 : 1
341 | learningtoanalyzebinarycomputercode : 1
342 | polyunpack:automatingthehiddencodeextractionofunpackexecutingmalware : 1
343 | eureka:aframeworkforenablingstaticmalwareanalysis : 1
344 | binarytranslation : 1
345 | deepmem:learninggraphneuralnetworkmodelsforfastandrobustmemoryforensicanalysis : 1
346 | systemlevelsupportforintrusionrecovery : 2
347 | quincy:detectinghostbasedcodeinjectionattacksinmemorydumps : 2
348 | beemaster:detectinghostbasedcodeinjectionattacks : 2
349 | codisasm:mediumscaleconcaticdisassemblyofselfmodifyingbinarieswithoverlappinginstructions : 2
350 | repeatablereverseengineeringwithpanda : 2
351 | decaf:aplatformneutralwholesystemdynamicbinaryanalysisplatform : 2
352 | renovo:ahiddencodeextractorforpackedexecutables : 2
353 | apichaser:taintassistedsandboxforevasivemalwareanalysis : 2
354 | repeconstruct:reconstructingbinarieswithselfmodifyingcodeandimportaddresstabledestruction : 2
355 | capturingmalwarepropagationswithcodeinjectionsandcodereuseattacks : 2
356 | panorama:capturingsystemwideinformationflowformalwaredetectionandanalysis : 2
357 | ######################################
358 | The total number of unique citations: 50
359 | ######################################
360 | Dependency graph based on citations
361 | Paper:
362 | pq-out/out-0/json_data/json_dump_0.json
363 | b'A characterisation of system-wide propagation in the malware landscape '
364 | Normalised title: acharacterisationofsystemwidepropagationinthemalwarelandscape
365 | Cited by:
366 | ---------------------------------
367 | Paper:
368 | pq-out/out-0/json_data/json_dump_1.json
369 | b'Precise system-wide concatic malware unpacking '
370 | Normalised title: precisesystemwideconcaticmalwareunpacking
371 | Cited by:
372 | b'A characterisation of system-wide propagation in the malware landscape '
373 | ---------------------------------
374 | ```
375 |
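The JSON outputs can also be loaded directly from Python. A minimal sketch (the exact schema of `parsed_paper_data.json` may differ between versions, so this only inspects the top-level structure):

```
import json

with open("pq-out/out-0/parsed_paper_data.json") as f:
    data = json.load(f)

# Show the top-level structure of the parsed paper data.
print(type(data).__name__)
if isinstance(data, dict):
    print(list(data.keys()))
```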
--------------------------------------------------------------------------------
/example-images/fuzz-barplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-barplot.png
--------------------------------------------------------------------------------
/example-images/fuzz-wordcloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/fuzz-wordcloud.png
--------------------------------------------------------------------------------
/example-images/paper-citation-graph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-images/paper-citation-graph.png
--------------------------------------------------------------------------------
/example-papers/1908.09204.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.09204.pdf
--------------------------------------------------------------------------------
/example-papers/1908.10167.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AdaLogics/paper-analyser/464daa29bcab2a35a8a4630751ca96023643a558/example-papers/1908.10167.pdf
--------------------------------------------------------------------------------
/gen_wordcloud.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | from wordcloud import WordCloud
4 |
5 | if len(sys.argv) != 2:
6 | print("usage: python gen_wordcloud.py TEXTFILE")
7 | sys.exit(1)
8 |
9 | whole_text = None
10 | #with open("copy-total-words", "r") as i_f:
11 | with open(sys.argv[1], "r") as i_f:
12 | whole_text = i_f.read()
13 |
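# Remove "cid:" artifacts that PDF text extraction leaves behind for
# unmapped glyphs.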
14 | whole_text = whole_text.replace("cid:", "")
15 |
16 | #wordcloud = WordCloud().generate(whole_text)
17 |
18 | import matplotlib.pyplot as plt
19 | #plt.imshow(wordcloud, interpolation="bilinear")
20 | #plt.axis("off")
21 |
22 | wordcloud = WordCloud(background_color="white",
23 | colormap="winter",
24 | font_path="Source_Sans_Pro/SourceSansPro-Bold.ttf",
25 | height=800,
26 | width=800,
27 | min_font_size=10,
28 | max_font_size=120,
29 | prefer_horizontal=3.0).generate(whole_text)
30 |
31 | plt.figure(figsize=(10.0, 10.0))
32 | plt.imshow(wordcloud, interpolation="bilinear")
33 | plt.axis("off")
34 | plt.savefig("wordcloud-final.png")
35 |
36 | # Now let's do word frequency
37 | word_dict = dict()
38 | split_words = whole_text.split(" ")
39 | total_number_of_words = len(split_words)
40 | print(total_number_of_words)
41 | total_sorted = 0
42 | for w in split_words:
43 | total_sorted += 1
44 | if w in word_dict:
45 | word_dict[w] = word_dict[w] + 1
46 | else:
47 | word_dict[w] = 1
48 | print("Total sorted: %d"%(total_sorted))
49 |
50 | listed_word_dict = []
51 | for key,value in word_dict.items():
52 | listed_word_dict.append((key, value, ))
53 |
54 | print("Length of listed word %d"%(len(listed_word_dict)))
55 | listed_word_dict.sort(key=lambda x:x[1])
56 | listed_word_dict.reverse()
57 | sorted_list = listed_word_dict
58 | print(sorted_list)
59 |
60 |
61 | # https://www.espressoenglish.net/the-100-most-common-words-in-english/
62 | avoid_words = { "the", "at", "there", "some", "my", "of", "be", "use", "her", "than", "and", "this", "an", "would", "first", "a", "have", "each", "make", "water", "to", "from", "which", "like", "been", "in", "or", "she", "him", "call", "is", "one", "do", "into", "who", "you", "had", "how", "time", "oil", "that", "by", "their", "has", "its", "it", "word", "if", "look", "now", "he", "but", "will", "two", "find", "was", "not", "up", "more", "long", "for", "what", "other", "write", "down", "on", "all", "about", "go", "day", "are", "were", "out", "see", "did", "as", "we", "many", "number", "get", "with", "when", "then", "no", "come", "his", "your", "them", "way", "made", "they", "can", "these", "could", "may", "I", "said", "so", "people", "part" }
63 |
64 | avoid_words.add("=")
65 | avoid_words.add(".")
66 | avoid_words.add(" ")
67 | avoid_words.add("")
68 | avoid_words.add(",")
69 |
70 | words = []
71 | freqs = []
72 | for key, value in sorted_list:
73 | if key.lower() in avoid_words:
74 | continue
75 | words.append(key)
76 | freqs.append(value)
77 | for entry in sorted_list[:140]:
78 | print("sorted_list: %s"%(str(entry)))
79 |
80 | Data = { "words: " : words, "freqs" : freqs }
81 |
82 | plt.clf()
83 | plt.figure(figsize=(10.0, 10.0))
84 | plt.barh(words[:50], freqs[:50])
85 |
86 | plt.title("Word frequency")
87 | #plt.show()
88 | plt.savefig("barplot.png")
89 |
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | # Install some needed packages
2 | sudo apt-get install -y python3 python3-pip
3 | sudo apt-get install -y graphviz
4 | python3 -m pip install virtualenv
5 |
6 | # Create a virtual environment
7 | virtualenv venv
8 |
9 | # Launch the virtual environment
10 | . venv/bin/activate
11 |
12 | # Install various packages
13 | pip install -r requirements.txt
14 |
--------------------------------------------------------------------------------
/pq_format_reader.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import simplejson as json
4 | import traceback
5 |
6 | import graphviz as gv
7 | import pylab
8 | import random
9 |
10 |
11 | def append_to_report(workdir, str_to_write, to_print=True):
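# Append str_to_write (followed by a newline) to <workdir>/report.txt,
# creating the file if necessary, and optionally echo it to stdout.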
12 | work_report = os.path.join(workdir, "report.txt")
13 | if not os.path.isfile(work_report):
14 | with open(work_report, "w+") as wr:
15 | wr.write(str_to_write)
16 | wr.write("\n")
17 | else:
18 | with open(work_report, "ab") as wr:
19 | wr.write(str_to_write.encode("utf-8"))
20 | wr.write("\n".encode("utf-8"))
21 |
22 | if to_print:
23 | print(str_to_write)
24 |
25 |
26 | def split_reference_section_into_references(all_content, to_print=True):
27 | '''
28 |     Takes a raw string that resembles a reference section and returns a
29 |     python list where each element is a dictionary with three keys:
30 |     - "number": the number of the reference
31 |     - "raw_content": the raw text of the reference
32 |     - "offset": the offset where the reference begins in the raw data
37 | '''
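   |     # Illustrative example: for the raw string
   |     #     "[1] A. Author. First title. [2] B. Author. Second title."
   |     # this returns two entries whose "number" fields are "1" and "2" and
   |     # whose "raw_content" fields hold the text of each reference.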
38 |
39 | collect_num = False
40 | curr_num = ""
41 | curr_content = ""
42 | idx_offset = 0
43 | begin_offset = 0
44 | end_offset = 0
45 |
46 | references = list()
47 |
48 | for c in all_content:
49 | if collect_num == True:
50 |             try:
51 |                 int(c)
52 |                 curr_num = "%s%s" % (curr_num, c)
53 |             except:
54 |                 pass
55 |
56 | if c == "[":
57 | # Make sure the next character is a number
58 | try:
59 | tmpi = int(all_content[idx_offset + 1])
60 | except:
61 | curr_content += c
62 | continue
63 |
64 | if curr_content != "":
65 | references.append({
66 | "number": curr_num,
67 | "raw_content": curr_content,
68 | "offset": begin_offset
69 | })
70 | #references.append((curr_num, curr_content, begin_offset, end_offset))
71 | curr_num = ""
72 | collect_num = True
73 | begin_offset = idx_offset
74 | elif c == "]":
75 | collect_num = False
76 | #print("Got a number: %d"%(int(curr_num)))
77 | #curr_num = ""
78 | curr_content = ""
79 | elif collect_num == False:
80 | curr_content += c
81 |
82 | idx_offset += 1
83 |
84 | # Add the final reference
85 | #references.append((curr_num, curr_content, begin_offset, end_offset))
86 | references.append({
87 | "number": curr_num,
88 | "raw_content": curr_content,
89 | "offset": begin_offset
90 | })
91 | print("Size of references: %d" % (len(references)))
92 |
93 | if to_print == True:
94 | for ref in references:
95 | print("Ref: %s" % (str((ref))))
96 |
97 | return references
98 |
99 |
100 | ####################################################################
101 | # Reference author routines
102 | #
103 | #
104 | # These routines are used to take raw reference data
105 | # and split it up between authors and title of reference.
106 | #
107 | # We need multiple routines for this since there are multiple
108 | # ways of writing references.
109 | ####################################################################
110 | def read_single_letter_references(r_num, r_con, r_beg, r_end):
111 | '''
112 | Parse a raw reference string and extract authors as well as title.
113 | This parsing routine is concerned with references where there is only a single
114 | name spelled out entirely per author, and the rest are delivered as single letters
115 | followed by periods.
116 |
117 | Examples of strings it parses:
118 | - T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs.
119 | - A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack
120 | - A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets.
121 |
122 |     Inputs:
123 |     - The number, raw content and offsets of a single raw reference, as
124 |       produced by split_reference_section_into_references
125 | 
126 |     Outputs:
127 |     - On failure: None
128 |     - On success: a dict with 'Authors', 'Title', 'rest' and 'num' keys
128 | '''
129 | # Now find the period that divides the authors from the title
130 | authors = None
131 | rest = None
132 | space_split = r_con.split(" ")
133 | for idx2 in range(len(space_split) - 4):
134 | try:
135 | # This will capture when there are multiple authors
136 | if (space_split[idx2].lower() == "and"
137 | and space_split[idx2 + 1][-1] == "."
138 | and space_split[1][-1] == "."):
139 | # If we have a double single-letter name
140 |
141 | # Example: T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of C programs.
142 | if space_split[idx2 + 2][-1] == "." and len(
143 | space_split[idx2 + 2]) == 2:
144 | print("Potentially last1: %s" % (space_split[idx2 + 4]))
145 | authors = " ".join(space_split[0:idx2 + 4])
146 | rest = " ".join(space_split[idx2 + 4:])
147 | # Example: A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack
148 | elif space_split[idx2 + 2][-1] != "." and space_split[
149 | idx2 + 3][-1] == ".":
150 | print("Potentially last2: %s" % (space_split[idx2 + 4]))
151 | authors = " ".join(space_split[0:idx2 + 4])
152 | rest = " ".join(space_split[idx2 + 4:])
153 | # Example:A. Bose, X. Hu, K. G. Shin, and T. Park. Behavioral detection of malware on mobile handsets.
154 | else:
155 | print("Potentially last3: %s" % (space_split[idx2 + 3]))
156 | authors = " ".join(space_split[0:idx2 + 3])
157 | rest = " ".join(space_split[idx2 + 3:])
158 | break
159 |         except:
160 |             pass
161 |
162 | if authors == None:
163 | for idx2 in range(len(space_split) - 4):
164 | try:
165 | # This will capture when there is only a single author
166 | if (len(space_split[idx2]) == 2
167 | and space_split[idx2][-1] == "."
168 | and len(space_split[idx2 + 2]) > 2):
169 | # If we have double single-letter naming
170 | if space_split[idx2 + 1][-1] == "." and len(
171 | space_split[idx2 + 1]) == 2:
172 | print("Potentially last single 1: %s" %
173 | (space_split[idx2 + 3]))
174 | authors = " ".join(space_split[0:idx2 + 3])
175 | rest = " ".join(space_split[idx2 + 3:])
176 | break
177 | else:
178 | print("Potentially last single 2: %s" %
179 | (space_split[idx2 + 2]))
180 | authors = " ".join(space_split[0:idx2 + 2])
181 | rest = " ".join(space_split[idx2 + 2:])
182 | break
183 |             except:
184 |                 pass
185 |
186 | # Do some post processing to ensure we got the right stuff.
187 | # First ensure we at least have two words in authors:
188 | if authors != None and len(authors.split(" ")) < 2:
189 | print("Breaking 1")
190 | return None
191 |
192 |     # Now ensure that the title is not just a number:
193 |     if authors != None:
194 |         try:
195 |             int(rest.split(".")[0])
196 |             # If the above is successful, then the title is just a number, which cannot be true
197 |             print("Breaking 2")
198 |             return None
199 |         except:
200 |             print("Could not break 2")
202 |
203 | if authors != None:
204 | r_title = rest.split(".")[0]
205 |
206 |         # Some titles are wrapped in the special “ ” quote symbols (0x201c and 0x201d, respectively).
207 | #if chr(0x201d) in r_title:
208 | # r_title = rest.split(chr(0x201d))[0]
209 | # r_title = r_title[1:-1]
210 | # rest = ".".join(rest.split(chr(0x201d))[1:])
211 |
212 | print("Authors: %s" % (authors))
213 | print("Title: %s" % (r_title))
214 | print("rest: %s" % (rest))
215 |
216 | # Create a directory
217 | ref_dict = dict()
218 | ref_dict['Authors'] = authors
219 | ref_dict['Title'] = r_title
220 | ref_dict['rest'] = rest
221 | ref_dict['num'] = r_num
222 | return ref_dict
223 | else:
224 | return None
225 | print("----------------")
226 |
227 |
228 | def read_full_author_references(r_num, r_con, r_beg, r_end):
229 | '''
230 |     This routine is focused on parsing references where the authors' names are
231 | spelled out fully, including all names of each author.
232 |
233 | Examples: Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing Mayhem on Binary Code. In IEEE Symposium on Security and Privacy (S&P).
234 | '''
235 | # Now find the period that divides the authors from the title
236 | authors = None
237 | rest = None
238 | space_split = r_con.split(".")
239 | print("Space split: %s" % (str(space_split)))
240 |
241 |     # Merge back together elements that look like initials (at most 2 characters, or ending in a single letter):
242 | new_space_split = list()
243 | curr_elem = ""
244 | for elem in space_split:
245 | if len(elem) <= 2:
246 | curr_elem += "%s." % (elem)
247 | elif len(elem) > 2 and elem[-2] == ' ' and elem[-1] != ' ':
248 | curr_elem += "%s." % (elem)
249 | else:
250 | new_elem = "%s%s" % (curr_elem, elem)
251 | print("New elem: %s" % (new_elem))
252 | new_space_split.append(new_elem)
253 | curr_elem = ""
254 | space_split = new_space_split
255 |
256 | print("Refined space split: %s" % (str(space_split)))
257 |
258 | # See if it is clean
259 | if len(space_split) == 4:
260 | try:
261 | year = int(space_split[1])
262 | print("Gotten the reference")
263 |
264 | ref_dict = dict()
265 | ref_dict['Authors'] = space_split[0]
266 | ref_dict['Title'] = space_split[2]
267 | ref_dict['Venue'] = space_split[3]
268 | ref_dict['Year'] = space_split[1]
269 | ref_dict['rest'] = space_split[3]
270 | ref_dict['num'] = r_num
271 | return ref_dict
272 |         except:
273 |             pass
274 |     # Fall back to a looser parse if the strict one above did not work
275 | try:
276 | year = int(space_split[1])
277 | print("Gotten the reference")
278 |
279 | ref_dict = dict()
280 | ref_dict['Authors'] = space_split[0]
281 | ref_dict['Year'] = space_split[1]
282 | ref_dict['Title'] = space_split[2]
283 | ref_dict['rest'] = ".".join(space_split[3:])
284 | ref_dict['num'] = r_num
285 | return ref_dict
286 |     except:
287 |         pass
288 |
289 | return None
290 |
291 |
292 | def normalise_title(title):
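   |     # e.g. normalise_title("Unleashing Mayhem on Binary Code")
   |     #      -> "unleashingmayhemonbinarycode"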
293 | return title.lower().replace(",", "").replace(" ", "").replace("-", "")
294 |
295 |
296 | ##########################################
297 | # End of citation parsing routines
298 | ##########################################
299 |
300 |
301 | def parse_raw_reference_list(refs):
302 | '''
303 | The goal of this function is to convert the raw references
304 | into more normalized references, and in particular split
305 | the raw reference into authors and title of reference.
306 |
307 |     An important aspect of this function is that it tries to parse
308 |     the reference in multiple ways, based on how the citations
309 |     are written.
310 | 
311 |     As such, this routine is somewhat implemented with a
312 |     test-and-check approach in mind.
313 | '''
314 | parse_funcs = [read_single_letter_references, read_full_author_references]
315 | parse_success = []
316 | for parse_func in parse_funcs:
317 | missing_refs = []
318 | refined = []
319 | for ref in refs:
320 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'],
321 | 0)
322 | if res == None:
323 | missing_refs.append(ref)
324 | else:
325 | refined.append(res)
326 | parse_success.append({
327 | 'missing_refs': missing_refs,
328 | 'refined': refined
329 | })
330 |
331 |     # Now pick the parse function with the highest number of successful parses
332 | if len(parse_success[0]['refined']) > len(parse_success[1]['refined']):
333 | parse_func = parse_funcs[0]
334 | else:
335 | parse_func = parse_funcs[1]
336 |
337 | # Now parse a last time and insert the parsed data into
338 | # the references dictionary.
339 | # This loop will modify the content of the dictionary.
340 | for ref in refs:
341 | res = parse_func(ref['number'], ref['raw_content'], ref['offset'], 0)
342 | ref['parsed'] = res
343 | if res != None:
344 | ref['normalised-title'] = normalise_title(res['Title'])
345 | else:
346 | ref['normalised-title'] = None
347 |
348 |
349 | ########################################
350 | # Utility functions
351 | ########################################
352 |
353 |
354 | def print_refined(refined):
355 | print("Refined lists:")
356 | #for rd in refined:
357 | # try:
358 | # print("\t%d"%(int(rd['num'])))
359 | # except:
360 | # None
361 | # print("\tTitle: %s"%(rd['Title']))
362 | # print("\tAuthors: %s"%(rd['Authors']))
363 | # print("\tRest: %s"%(rd['rest']))
364 | # print("#"*90)
365 |
366 |
367 | def print_missing(missings):
368 | print("Missing refs:")
369 | #for ms in missings:
370 | # print("\t%s"%(str(ms)))
371 | # print("<"*90)
372 |
373 |
374 | def read_decoded(file_path):
375 | '''
376 |     Reads a json file and encodes each value as an ASCII string. We do
377 |     the ASCII encoding because many of the papers have some form
378 |     of unusual UTF-8 encoding that we do not handle at the moment.
379 | '''
380 | with open(file_path, "r") as f:
381 | json_content = json.loads(str(f.read().replace('\r\n', '')),
382 | strict=False)
383 | dec_content = dict()
384 | for elem in json_content:
385 | #dec_content[elem] = json_content[elem].encode('utf-8')
386 | dec_content[elem] = json_content[elem].encode('ascii', 'ignore')
387 | #asciidata= dec.encode("ascii","ignore")
388 | return dec_content
389 |
390 |
391 | def parse_file(file_path):
392 | '''
393 | Parses a json file with the raw data of title and references
394 | and returns a list of citations made by this paper.
395 | '''
396 | # Read the json file with our raw data.
397 | dec_json = read_decoded(file_path)
398 |
399 | # Merge all lines in the file by converting newlines to spaces
400 | dec_json['References'] = dec_json['References'].decode("utf-8")
401 | print("Type: %s" % (str(type(dec_json['References']))))
402 | print(dec_json['References'])
403 | print(dec_json['References'].encode('ascii', 'ignore'))
404 | all_refs = dec_json['References'].replace(
405 | "\n", " ") #decode("utf-8").replace("\n", " ")
406 |
407 |     # Extract the references in a raw format. This is just splitting
408 |     # the raw text on the "[n]" reference markers.
409 | references = split_reference_section_into_references(all_refs)
410 |
411 | # Decipher the references
412 | print("References")
413 | print(references)
414 | #refined, missing_sigs = parse_raw_reference_list(references)
415 | parse_raw_reference_list(references)
416 | print("Refined refs:")
417 | print(references)
418 |
419 | #print_refined(refined)
420 | #print_missing(missing_sigs)
421 | #print("Refined sigs: %d"%(len(refined)))
422 |
423 | # Now create our dictionary where we will wrap all data in
424 | paper_dict = {}
425 | paper_dict['paper-title'] = dec_json['Title']
426 | paper_dict['paper-title-normalised'] = normalise_title(
427 | dec_json['Title'].decode("utf-8"))
428 | paper_dict['references'] = references
429 | #paper_dict['success-sigs'] = refined
430 | #paper_dict['missing-sigs'] = missing_sigs
431 |
432 | return paper_dict
433 | #return refined, missing_sigs, dec_json['Title']
434 |
435 |
436 | def read_json_file(filename):
437 | print("[+] Parsing file: %s" % (filename))
438 | ret = None
439 | try:
440 | ret = parse_file(filename)
441 | except:
442 | exc_type, exc_value, exc_traceback = sys.exc_info()
443 | traceback.print_tb(exc_traceback, limit=10, file=sys.stdout)
444 | traceback.print_exception(exc_type,
445 | exc_value,
446 | exc_traceback,
447 | limit=20,
448 | file=sys.stdout)
449 | traceback.print_exc()
450 | formatted_lines = traceback.format_exc().splitlines()
451 | print("[+] Completed parsing of %s" % (filename))
452 | return ret
453 |
454 |
455 | def identify_group_dependencies(workdir, parsed_papers):
456 |     """ Identifies which papers cite which others within a group of papers. """
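   |     # This produces two artifacts in the work directory: a rendered
   |     # citation graph at img/citation_graph.png (via graphviz) and a
   |     # normalised_dependency_graph.json file that lists, for each paper,
   |     # the papers in the set that cite it.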
457 | append_to_report(workdir, "######################################")
458 | append_to_report(workdir, "Dependency graph based on citations")
459 | #append_to_report(workdir, "######################################")
460 | g1 = gv.Digraph(format='png')
461 |
462 | MAX_STR_SIZE = 15
463 | idx = 0
464 | paper_dict = {}
465 | normalised_to_id = {}
466 |
467 | for src_paper in parsed_papers:
468 | #paper_dict[src_paper['result']['paper-title-normalised']] = idx
469 | if len(src_paper['result']['paper-title-normalised']) > MAX_STR_SIZE:
470 |
471 | target_d = src_paper['result']['paper-title-normalised'][
472 | 0:MAX_STR_SIZE]
473 | else:
474 | target_d = src_paper['result']['paper-title-normalised']
475 |
476 | #paper_dict[src_paper['result']['paper-title-normalised']] = src_paper['result']['paper-title-normalised']
477 | paper_dict[src_paper['result']['paper-title-normalised']] = target_d
478 | #g1.node(str(idx))
479 | g1.node(target_d)
480 | idx += 1
481 |
482 | #print("Paper dict")
483 | #print(paper_dict)
484 | #print("-"*50)
485 | #raw_dependency_graph_path = os.path.join(workdir, "raw_dependency_graph.json")
486 | #with open(raw_dependency_graph_path, "w+") as dependency_graph_json:
487 | # json.dump(paper_dict, dependency_graph_json)
488 |
489 | dependency_graph = []
490 | for src_paper in parsed_papers:
491 |
492 | cited_by = []
493 | for tmp_paper in parsed_papers:
494 | if tmp_paper == src_paper:
495 | continue
496 | # Now go through references of tmp_paper and
497 | # see if it cites the src paper
498 | cites = False
499 | for ref in tmp_paper['result']['references']:
500 | if ref['parsed'] != None:
501 | if ref['normalised-title'] == src_paper['result'][
502 | 'paper-title-normalised']:
503 | if len(tmp_paper['result']
504 | ['paper-title-normalised']) > MAX_STR_SIZE:
505 | src_d = tmp_paper['result'][
506 | 'paper-title-normalised'][0:MAX_STR_SIZE]
507 | else:
508 | src_d = tmp_paper['result'][
509 | 'paper-title-normalised']
510 |
511 | if len(src_paper['result']
512 | ['paper-title-normalised']) > MAX_STR_SIZE:
513 | dst_d = src_paper['result'][
514 | 'paper-title-normalised'][0:MAX_STR_SIZE]
515 | else:
516 | dst_d = src_paper['result'][
517 | 'paper-title-normalised']
518 | g1.edge(src_d, dst_d)
519 |
520 | #src_idx = paper_dict[tmp_paper['result']['paper-title-normalised']]
521 | #dst_idx = paper_dict[src_paper['result']['paper-title-normalised']]
522 | #g1.edge(str(src_idx), str(dst_idx))
523 | cites = True
524 | if cites == True:
525 | cited_by.append(tmp_paper)
526 | src_paper['cited_by'] = cited_by
527 | append_to_report(workdir, "Paper:")
528 | append_to_report(workdir, "\t%s" % (src_paper['filename']))
529 | append_to_report(workdir, "\t%s" % (src_paper['result']['paper-title']))
530 | append_to_report(workdir, "\tNormalised title: %s" %
531 | (src_paper['result']['paper-title-normalised']))
532 | append_to_report(workdir, "\tCited by: ")
533 | for cited_by_paper in cited_by:
534 | append_to_report(workdir, "\t\t%s" % (cited_by_paper['result']['paper-title']))
535 | append_to_report(workdir, "---------------------------------")
536 |
537 | paper_info = {}
538 | paper_info['name'] = src_paper['result']['paper-title'].decode("utf-8")
539 | paper_info['minimized-name'] = src_paper['result'][
540 | 'paper-title-normalised']
541 | paper_info['imports'] = []
542 | for cited_by_paper in cited_by:
543 | paper_info['imports'].append({
544 | "name":
545 | cited_by_paper['result']['paper-title'].decode("utf-8"),
546 | "minimized-name":
547 | cited_by_paper['result']['paper-title-normalised']
548 | })
549 | dependency_graph.append(paper_info)
550 |
551 | #append_to_report(workdir, "Done identifying references within the group")
552 | #g1.view()
553 |
554 | filename = g1.render(
555 | filename=os.path.join(workdir, "img", "citation_graph"))
556 | print("idx: %d" % (idx))
557 | #print(parsed_papers)
558 |
559 | dependency_graph_path = os.path.join(workdir,
560 | "normalised_dependency_graph.json")
561 | with open(dependency_graph_path, "w+") as dependency_graph_json:
562 | json.dump(dependency_graph, dependency_graph_json)
563 |
564 |
565 | def display_summary(workdir, parsed_papers_raw):
566 | succ = []
567 | fail = []
568 | succ_sigs = []
569 | miss_sigs = []
570 | all_titles = []
571 | append_to_report(workdir, "######################################")
572 | append_to_report(workdir, " Parsed papers summary ")
573 | #append_to_report(workdir, "######################################")
574 | all_normalised_citations = set()
575 |     for paper in parsed_papers_raw:
576 |         # Skip papers whose json could not be parsed at all
577 |         if paper['result'] == None:
578 |             fail.append(paper['filename'])
579 |             continue
580 |         append_to_report(workdir, "paper:")
581 |         append_to_report(workdir, "\t%s" % (paper['filename']))
582 |         append_to_report(workdir, "\t%s" % (paper['result']['paper-title']))
579 | append_to_report(workdir, "\tNormalised title: %s" %
580 | (paper['result']['paper-title-normalised']))
581 | append_to_report(workdir, "\tReferences:")
582 | for ref in paper['result']['references']:
583 | if ref['parsed'] != None:
584 | append_to_report(workdir, "\t\tAuthors: %s" % (ref['parsed']['Authors']))
585 | append_to_report(workdir, "\t\tTitle: %s" % (ref['parsed']['Title']))
586 | append_to_report(workdir, "\t\tNormalised: %s" % (ref['normalised-title']))
587 | all_normalised_citations.add(ref['normalised-title'])
588 | else:
589 | append_to_report(workdir, "\t\tCould not parse")
590 | append_to_report(workdir, "\t\t-------------------")
591 | #print("\t\t%s"%(str(ref['parsed'])))
592 |
593 |         #print("[+] %s"%(paper['title']))
594 |         # Papers with a failed parse were already skipped above
595 |         succ.append(paper['filename'])
598 |
599 | for ref in paper['result']['references']:
600 | if ref['parsed'] != None:
601 | succ_sigs.append(ref['parsed'])
602 | else:
603 | miss_sigs.append(ref['raw_content'])
604 |
605 | #succ_sigs.append(paper['result']['success-sigs'])
606 | #miss_sigs.append(paper['result']['missing-sigs'])
607 | all_titles.append(paper['result']['paper-title'])
608 |
609 | # Check which papers are in our set:
610 | append_to_report(workdir, "######################################")
611 | append_to_report(workdir, "Papers in the data set :")
612 | #append_to_report(workdir, "######################################")
613 | cited_papers = list()
614 | noncited_papers = list()
615 |     for paper in parsed_papers_raw:
616 |         if paper['result'] == None:
617 |             continue
618 |         print(paper['result']['paper-title'])
617 | cited_by_group = False
618 | for s in all_normalised_citations:
619 | if s == paper['result']['paper-title-normalised']:
620 | cited_by_group = True
621 | if cited_by_group:
622 | cited_papers.append(paper)
623 | else:
624 | noncited_papers.append(paper)
625 | #append_to_report(workdir, "######################################")
626 | append_to_report(workdir, "Cited papers:")
627 | #append_to_report(workdir, "######################################")
628 | for p in cited_papers:
629 | append_to_report(workdir, "\t%s" % (p['result']['paper-title']))
630 |
631 |
632 | #append_to_report(workdir, "######################################")
633 | append_to_report(workdir, "Noncited papers:")
634 | #append_to_report(workdir, "######################################")
635 | for p in noncited_papers:
636 | append_to_report(workdir, "\t%s" % (p['result']['paper-title']))
637 |
638 | #exit(0)
639 |
640 | # Now display the content.
641 | append_to_report(workdir, "Succ: %d" % (len(succ)))
642 | append_to_report(workdir, "Fail: %d" % (len(fail)))
643 | append_to_report(workdir, "Succ sigs: %d" % (len(succ_sigs)))
644 | append_to_report(workdir, "Failed sigs: %d" % (len(miss_sigs)))
645 |
646 | #append_to_report(workdir, "Summary of references:")
647 | # Now do some status on how many references we have in total
648 | all_refs = dict()
649 | #for siglist in succ_sigs:
650 | for sig in succ_sigs:
651 | # Normalise
652 | #append_to_report(workdir, "Normalising: %s" % (sig['Title']))
653 |         tmp_sig = normalise_title(sig['Title'])
656 | #append_to_report(workdir, "\tNormalised: %s" % (tmp_sig))
657 |
658 | #if "\xe2\x80\x9c" in tmp_sig:
659 | # s1_split = tmp_sig.split("\xe2\x80\x9c")
660 | # if "\xe2\x80\x9d" in s1_split[1]:
661 | # tmp_sig = s1_split[1].split("\xe2\x80\x9d")[0]
662 | # else:
663 | # tmp_sig=s1_split[1]
664 |
665 | if tmp_sig not in all_refs:
666 | all_refs[tmp_sig] = 0
667 | all_refs[tmp_sig] += 1
668 |
669 | #if sig['Title'].lower() not in all_refs:
670 | # all_refs[sig['Title'].lower().strip()] = 0
671 | #all_refs[sig['Title'].lower().strip()] += 1
672 |
673 | append_to_report(workdir, "######################################")
674 | append_to_report(workdir, "All Titles (normalised) of the papers in the data set:")
675 | #append_to_report(workdir, "######################################")
676 | for t in all_titles:
677 | append_to_report(workdir, "\t%s" % (t))
678 | append_to_report(workdir, "######################################")
679 |
680 | append_to_report(workdir, "######################################")
681 | append_to_report(workdir, "All references/citations (normalised) issued by these papers")
682 | #append_to_report(workdir, "######################################")
683 | append_to_report(workdir, "\tName : Count")
684 | sorted_list = list()
685 | for title in all_refs:
686 | sorted_list.append((title, all_refs[title]))
687 | sorted_list = sorted(sorted_list, key=lambda x: x[1])
688 | for title, counts in sorted_list:
689 | append_to_report(workdir, "\t%s : %d" % (title, counts))
690 |
691 | append_to_report(workdir, "######################################")
692 | append_to_report(workdir, "The total number of unique citations: %d" % (len(sorted_list)))
693 | #append_to_report(workdir, "######################################")
694 | return
695 | #exit(0)
696 |
697 | #all_missing_sigs = []
698 | #for missig in miss_sigs:
699 | # all_missing_sigs += missig
700 |
701 | #for miss in missig:
702 | # print("###\t%s"%(str(miss)))
703 |
704 | #print(
705 | # "[+] The references that we were unable to fully parse, but yet are cited by the papers:"
706 | #)
707 | #print(
708 | # "Total amount of references not parsed and thus not included in analysis: %d"
709 | # % (len(all_missing_sigs)))
710 | #print("All of these references:")
711 | #for missig in all_missing_sigs:
712 | # print("###\t%s" % (str(missig)))
713 |
714 |
715 | def parse_first_stage(workdir, target_dir):
716 | parsed_papers = []
717 | for json_f in os.listdir(target_dir):
718 | if ".json" in json_f:
719 | complete_filename = os.path.join(target_dir, json_f)
720 | print("Checking: %s" % (complete_filename))
721 | res = read_json_file(complete_filename)
722 | parsed_papers.append({
723 | "filename": complete_filename,
724 | "result": res
725 | })
726 | #display_summary(parsed_papers)
727 |
728 | parsed_papers_json = os.path.join(workdir, "parsed_paper_data.json")
729 | with open(parsed_papers_json, "w+", encoding='utf-8') as ppj:
730 | json.dump(parsed_papers, ppj)
731 |
732 | return parsed_papers
733 |
734 |
735 | if __name__ == "__main__":
736 |     # Simple manual test-run: parse all json files in the current
737 |     # directory and write a summary report there.
738 |     parsed_papers = []
739 |     for json_f in os.listdir("."):
740 |         if ".json" in json_f:
741 |             if len(sys.argv) == 2:
742 |                 if sys.argv[1] not in json_f:
743 |                     continue
744 |             print("Checking: %s" % (json_f))
745 | 
746 |             res = read_json_file(json_f)
747 |             if res == None:
748 |                 sys.stdout.flush()
749 |             parsed_papers.append({"filename": json_f, "result": res})
750 | 
751 |     succ = [p for p in parsed_papers if p['result'] != None]
752 |     fail = [p for p in parsed_papers if p['result'] == None]
753 |     print("Succ: %d" % (len(succ)))
754 |     print("Fail: %d" % (len(fail)))
755 |     display_summary(".", parsed_papers)
763 |
--------------------------------------------------------------------------------
/pq_main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import simplejson as json
4 | import argparse
5 | import shutil
6 |
7 | import pq_format_reader
8 | import pq_pdf_utility
9 |
10 |
11 | ### Utilities
12 | def create_working_dir(target_dir):
13 | if os.path.isdir(target_dir):
14 | shutil.rmtree(target_dir)
15 | os.mkdir(target_dir)
16 |
17 |
18 | def create_workdir():
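   |     # Each run gets its own numbered output directory, e.g. pq-out/out-0,
   |     # pq-out/out-1, ..., continuing from the highest index already present.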
19 | MDIR = "pq-out"
20 | if not os.path.isdir(MDIR):
21 | os.mkdir(MDIR)
22 |
23 | FNAME = "out-"
24 | max_idx = -1
25 | for l in os.listdir(MDIR):
26 | if FNAME in l:
27 | try:
28 | val = int(l.replace(FNAME, ""))
29 | if val > max_idx:
30 | max_idx = val
31 |             except:
32 |                 pass
33 | max_idx += 1
34 | new_workdir = os.path.join(MDIR, "%s%d" % (FNAME, max_idx))
35 | print("Making the work directory: %s" % (new_workdir))
36 | os.mkdir(new_workdir)
37 | return new_workdir
38 |
39 |
40 | ### Action functions
41 | def convert_pdfs_to_data(paper_dir, workdir, target_dir):
42 | # First extract data from the pdf files, such as
43 | # title as well as the references cited by each paper.
44 | # This will produce json files with the data that we need.
45 | filtered_pdf_list = pq_pdf_utility.convert_folder(workdir, paper_dir)
46 |
47 | # Now write it out to a json folder
48 | create_working_dir(target_dir)
49 | pq_pdf_utility.write_to_json(filtered_pdf_list, target_dir)
50 |
51 |
52 | def read_parsed_json_data(workdir, target_dir):
53 | # Reads the json files produced by "full-run", and extracts the
54 | # various forms of data.
55 | parsed_papers = pq_format_reader.parse_first_stage(workdir, target_dir)
56 |
57 | pq_format_reader.display_summary(workdir, parsed_papers)
58 |
59 | # Now extract the group dependencies
60 | pq_format_reader.identify_group_dependencies(workdir, parsed_papers)
61 |
62 |
63 | ### CLI
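   | # Example usage (illustrative):
   | #   python3 pq_main.py --folder ./example-papers
   | #   python3 pq_main.py --task json-only --wd pq-out/out-0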
64 | def parse_args():
65 | parser = argparse.ArgumentParser(
66 | "pq_main.py", description="Quantitative analysis of papers")
67 |
68 | parser.add_argument("-f", "--folder", help="Folder with PDFs")
69 | parser.add_argument("--task",
70 | help="Specify a given task",
71 | default="dependency-graph")
72 | parser.add_argument("--wd", help="Workdir")
73 |
74 | args = parser.parse_args()
75 | if args.folder != None:
76 |         print("Analysing the PDFs in folder: %s" % (args.folder))
77 |
78 | return args
79 |
80 |
81 | if __name__ == "__main__":
82 | args = parse_args()
83 | if args.task == "json-only":
84 | read_parsed_json_data(args.wd, os.path.join(args.wd, "json_data"))
85 | elif args.task == "dependency-graph":
86 | workdir = create_workdir()
87 | json_data_dir = os.path.join(workdir, "json_data")
88 |
89 | convert_pdfs_to_data(args.folder, workdir, json_data_dir)
90 | read_parsed_json_data(workdir, json_data_dir)
91 | else:
92 | print("Task not supported")
93 |
--------------------------------------------------------------------------------
/pq_pdf_utility.py:
--------------------------------------------------------------------------------
1 | import xml.etree.ElementTree as ET
2 | import subprocess
3 | import shutil
4 | import sys
5 | import os
6 | import simplejson as json
7 |
8 |
9 | def print_recursively(elem, depth):
10 | # The below is only for debugging
11 | #print("\t"*depth + "%s - %s"%(elem.tag, elem.attrib))
12 | #if "size" in elem.attrib:
13 | #sz = float(elem.attrib['size'])
14 | #print("We got a size: %f"%(sz))
15 | #if sz > 20.0:
16 | # print("Text: %s"%(elem.text))
17 | for child in elem:
18 | print_recursively(child, depth + 1)
19 |
20 |
21 | def get_all_texts_rec(elem):
22 | '''
23 |     Extracts all text elements from an XML
24 |     file generated by pdf2txt. As such, this function
25 |     is simply a partial XML parser.
26 | This is a recursive function where the recursion
27 | is used to capture recursive XML elements.
28 | '''
29 | global saw_page2
30 | all_texts = []
31 |
32 |     # If we have already passed the page limit, return an empty list
33 |     if saw_page2 == True:
34 |         return []
35 | 
36 |     # We don't want to move beyond the first few pages, as we
37 |     # assume the title must appear there.
38 |     if elem.tag == "page":
39 |         if int(elem.attrib['id']) > 3:
40 |             saw_page2 = True
41 |             return []
42 |
43 | # Extract all XML elements recursively
44 | for c in elem:
45 | tmp_txts = get_all_texts_rec(c)
46 | for t2 in tmp_txts:
47 | all_texts.append(t2)
48 |
49 | # If the current elem is a text element,
50 | # add this to our list of all texts.
51 | if elem.tag == "text":
52 | all_texts.append(elem)
53 | return all_texts
54 |
55 |
56 | def get_all_texts(elem):
57 | global saw_page2
58 | saw_page2 = False
59 | return get_all_texts_rec(elem)
60 |
61 |
62 | def get_title(the_file):
63 | '''
64 | Gets the title of the paper by:
65 | 1) Reading an xml representation of the PDF converted by pdf2txt
66 | 2) Searching for the top font size of page 1
67 |     3) Stopping once we move past the first few pages of the paper.
68 |
69 | returns
70 | @max_string which is the string with the max font in a paper
71 | @second_max_string which is the string with the second max font
72 | in a paper.
73 | '''
74 | print("[+] Getting title")
75 | print("Extracting content from %s" % (the_file))
76 | #print("current working directory: %s"%(os.getcwd()))
77 | try:
78 | tree = ET.parse(the_file)
79 | except:
80 | print("[+] Error when ET.parse of the file %s" % (the_file))
81 | return None
82 |
83 | root = tree.getroot()
84 | #print_recursively(root, 0)
85 | all_texts = get_all_texts(root)
86 | sizes = dict()
87 | latest_size = None
88 | for te in all_texts:
89 | #print("%s-%s"%(te.attrib, te.text))
90 | if 'size' not in te.attrib:
91 | if latest_size != None:
92 | sizes[latest_size].append(" ")
93 | continue
94 | sz = float(te.attrib['size'])
95 | latest_size = sz
96 | if sz not in sizes:
97 | sizes[sz] = list()
98 | sizes[sz].append(te.text)
99 |
100 | # We now have all the text elements and we can proceed
101 | # to extract the elements with the highest and second-highest
102 | # font sizes.
103 | sorted_sizes = sorted(sizes.keys())
104 |
105 | # Highest font size
106 | max_string = ""
107 | for c in sizes[sorted_sizes[-1]]:
108 | max_string += c
109 |
110 |     # Second highest font size (empty if only one font size was seen)
111 |     second_max_string = ""
112 |     if len(sorted_sizes) > 1:
113 |         for c in sizes[sorted_sizes[-2]]:
114 |             second_max_string += c
114 |
115 | # Log them
116 | #print("max %s"%(max_string))
117 | #print("second max %s"%(second_max_string))
118 |
119 | # Print the bytes: only used for debugging
120 | ords = list()
121 | for c in max_string:
122 | ords.append(hex(ord(c)))
123 | s1 = " ".join(ords)
124 | #print("\t\t%s"%(s1))
125 | #print("\tCompleted getting title")
126 | return max_string, second_max_string
127 |
128 |
129 | def get_references(text_file):
130 | '''
131 | Reads a txt file converted by pdf2txt and extracts
132 | the "references" section of the paper.
133 |
134 | This function assumes the references are at the end
135 | of a paper and that there might be an appendix after.
136 | As such, we read the text from "References" until end
137 | of the file, and only stop in case we hit an
138 | "appendix" keyword.
139 | '''
140 | references = ""
141 | reading_references = False
142 | with open(text_file, "r") as tf:
143 | for l in tf:
144 | if reading_references == True and "appendix" in l.lower():
145 | reading_references = False
146 |
147 | if reading_references:
148 | references += l
149 |
150 | if "references" in l.lower():
151 | #print("We have a line with references: %s"%(l))
152 | if len(l.split(" ")) < 3:
153 | reading_references = True
154 | #print("References: %s"%(references))
155 | return references
156 |
157 |
158 | def convert_folder(workdir, folder_name):
159 | '''
160 | Converts an entire folder of PDFs into representations
161 | where we have the:
162 | title
163 | references
164 | for each paper.
165 |
166 |
167 | returns a list of dictionaries, where each element
168 | in the dictionary holds data about a given paper.
169 | '''
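   |     # Note: this relies on an external "pdf2txt.py" command (for example
   |     # the one shipped with pdfminer) being available on the PATH.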
170 | all_titles = list()
171 | all_seconds = list()
172 | title_pairs = list()
173 |
174 | paper_list = []
175 | data_out_dir = os.path.join(workdir, "data_out")
176 | if not os.path.isdir(data_out_dir):
177 | os.mkdir(data_out_dir)
178 | for l in os.listdir(folder_name):
179 | if ".pdf" not in l:
180 | continue
181 |
182 | print("======================================")
183 | print("[+] working on %s" % (l))
184 | paper_dict = {
185 | "file": l,
186 | "title": None,
187 | "second_title": None,
188 | "references": None,
189 | "success": False
190 | }
191 |
192 | # First step.
193 | # Convert the pdf to xml format. This is for getting
194 | # the title of the paper.
195 | target_xml = os.path.join(data_out_dir, "%s_analysed.xml" % (l))
196 | cmd = ["pdf2txt.py", "-t", "xml", "-o", target_xml]
197 | cmd.append(os.path.join(folder_name, l))
198 | try:
199 | subprocess.check_call(" ".join(cmd), shell=True)
200 | except:
201 | paper_list.append(paper_dict)
202 | print("Could not execute the call")
203 | continue
204 |
205 | try:
206 | res = get_title(target_xml)
207 | if res == None:
208 | paper_list.append(paper_dict)
209 | continue
210 | the_title, second = res
211 | except:
212 | # print("Exception in get_title")
213 | paper_list.append(paper_dict)
214 | continue
215 | all_titles.append(the_title)
216 | all_seconds.append(second)
217 |
218 | # Second step.
219 |         # Convert the pdf to txt format. This step is for getting
220 | # the references of the paper.
221 | target_txt = os.path.join(data_out_dir, "%s_analysed.txt" % (l))
222 | cmd = ["pdf2txt.py", "-t", "text", "-o", target_txt]
223 | cmd.append(os.path.join(folder_name, l))
224 | try:
225 | subprocess.check_call(" ".join(cmd), shell=True)
226 | except:
227 | print("Could not execute the call")
228 | paper_list.append(paper_dict)
229 | continue
230 |
231 | references = get_references(target_txt)
232 | paper_dict = {
233 | "file": l,
234 | "title": the_title,
235 | "second_title": second,
236 | "references": references,
237 | "success": True
238 | }
239 | paper_list.append(paper_dict)
240 | #print("Adding: %s ----- %s"%(the_title, second))
241 |
242 | return paper_list
243 |
244 |
245 | def should_select_title(title):
246 |     # Check whether the first-choice title contains more than a few characters
247 |     # with a value greater than 0xff. If so, use the second highest font instead.
248 | total_above = 0
249 | for c in title:
250 | if ord(c) > 0xff:
251 | total_above += 1
252 | if total_above > 3:
253 | return False
254 |
255 |     # Now check if "(cid:" is in the name. If it is,
256 |     # then we will use the second highest font for the title.
257 | if "(cid:" in str(title):
258 | return False
259 |
260 | return True
261 |
262 |
263 | def write_to_json(results, target_directory=None):
264 | counter = 0
265 | for paper_dict in results:
266 | if paper_dict['success'] == False:
267 | print("####### %s" % (paper_dict['file']))
268 | print("Unsuccessful")
269 | print("-" * 60)
270 | continue
271 |
272 | first = paper_dict['title']
273 | second = paper_dict['second_title']
274 | use_second = not should_select_title(first)
275 |
276 | # Create a json dictionary and write it to the file system
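   |         # Year, Authors and ReferenceType are placeholder values for now;
   |         # only Title and References are used by the downstream analysis.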
277 | json_dict = {
278 | "Title": first if use_second == False else second,
279 | "References": paper_dict['references'],
280 | "Year": "1999",
281 | "Authors": "David",
282 | "ReferenceType": "Automatically"
283 | }
284 | filepath = "json_dump_%d.json" % (counter)
285 | if target_directory != None:
286 | filepath = os.path.join(target_directory, filepath)
287 | with open(filepath, "w+") as jf:
288 | json.dump(json_dict, jf)
289 | counter += 1
290 |
291 | # Now print the content for convenience
292 | print("########## %s" % (paper_dict['file']))
293 | print("[+] Title: %s" % (json_dict['Title']))
294 | #print("[+] References: ")
295 | #print("%s"%(paper_dict['references']))
296 | print("-" * 60)
297 |
298 |
299 | if __name__ == "__main__":
300 |     results = convert_folder(".", "papers")
301 | target_dir = "json_data"
302 | if os.path.isdir(target_dir):
303 | shutil.rmtree(target_dir)
304 | os.mkdir(target_dir)
305 | os.chdir(target_dir)
306 | write_to_json(results)
307 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | wheel
2 | graphviz
3 | matplotlib
4 | pdf2txt
5 | requests
6 | simplejson
7 | git+git://github.com/kermitt2/grobid_client_python@4bce8b574da363171079b97eb84fe5ecfce8cfdb#egg=grobid_client
8 | pymongo
9 | docopt
10 | bs4
11 | serde
12 | lxml
--------------------------------------------------------------------------------