├── .gitignore ├── LICENSE ├── README.md ├── aclpubcheck ├── __init__.py ├── __main__.py ├── copyright_signatures.py ├── formatchecker.py ├── googletools.py ├── metadatachecker.py └── name_check.py ├── aclpubcheck_additional_info.pdf ├── aclpubcheck_online.ipynb ├── example └── 2023.acl-tutorials.1.pdf ├── pdf_image.png ├── screenshot.png └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | aclpubcheck.egg-info 2 | .idea 3 | dist/*.egg 4 | dist/*.tar.gz 5 | build/ 6 | */__pycache__ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Association for Computational Linguistics 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ACL pubcheck 2 | ACL pubcheck is a Python tool that automatically detects font errors, author formatting errors, margin violations, outdated citations as well as many other common formatting errors in papers that are using the LaTeX sty file associated with ACL venues. The script can be used to check your papers before you submit to a conference. (We highly recommend running ACL pubcheck on your papers *pre-submission*—a well formatted paper helps keep the reviewers focused on the scientific content.) However, its main purpose is to ensure your accepted paper is properly formatted, i.e., it follows the venue's style guidelines. The script is used by the publication chairs at most ACL events to check for formatting issues. Indeed, running this script yourself and fixing errors before uploading the camera-ready version of your paper will often save you a personalized email from the publication chairs. 
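For the impatient, checking a paper boils down to two commands (shown here on the bundled example paper; both are explained in detail below):

```bash
pip3 install git+https://github.com/acl-org/aclpubcheck
aclpubcheck --paper_type long example/2023.acl-tutorials.1.pdf
```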
3 |
4 | ## Installation
5 |
6 | The simplest way to use `aclpubcheck` is to install it using `pip` directly from the GitHub repository (note that this is DIFFERENT from the `pypi` package):
7 |
8 | ```bash
9 | pip3 install git+https://github.com/acl-org/aclpubcheck
10 | ```
11 |
12 | Alternatively, you can clone the source and install it locally:
13 | ```bash
14 | # clone using ssh
15 | git clone git@github.com:acl-org/aclpubcheck.git
16 | # or http
17 | git clone https://github.com/acl-org/aclpubcheck.git
18 |
19 | cd aclpubcheck/
20 |
21 | # install locally
22 | pip install -e .
23 | ```
24 |
25 | ## Usage
26 |
27 | Once installed, you can apply it to a PDF:
28 |
29 | ```bash
30 | # Script execution
31 | aclpubcheck --paper_type PAPER_TYPE path/to/paper.pdf
32 |
33 | # Module execution (in case script execution does not work)
34 | python3 -m aclpubcheck --paper_type PAPER_TYPE path/to/paper.pdf
35 | ```
36 |
37 | Replace `PAPER_TYPE` with one of (1) `long`, (2) `short`, or (3) `demo`, depending on the type of paper that was accepted. Then, change `path/to/paper.pdf` to the path to your paper. For example:
38 |
39 | ```bash
40 | # -p is a shorthand for --paper_type
41 | python3 -m aclpubcheck -p long example/2023.acl-tutorials.1.pdf
42 | ```
43 |
44 | If ACL pubcheck reports a margin error caused by a figure that runs into the margin, you can often fix the problem by applying the [adjustbox package](https://ctan.org/pkg/adjustbox?lang=en). If the margin error is caused by an equation, it may help to break the equation over two lines.
45 |
46 | ACL pubcheck is meant to be run on the camera-ready version of the paper, not on the review version (i.e., the anonymous, line-numbered submission version). Running ACL pubcheck on a line-numbered version will result in a stream of spurious errors related to the numbers in the margins.
47 |
48 | **Note**: Additional info can be found in the PDF document ``aclpubcheck_additional_info.pdf`` included in this package.
49 |
50 | ## Page Numbering
51 |
52 | Typically, the space at the bottom of a paper should be left empty, as page numbers are added during the watermarking process of the proceedings. By default, ACL pubcheck checks that a margin of approximately 2 cm at the bottom of each page is left blank. If any text is detected in this area, such as mistakenly added page numbers, a warning is generated. However, if this area must contain information, or if you need to bypass this check for any reason, you can disable it with the `--disable_bottom_check` flag.
53 |
54 |
55 | ## Online Versions
56 |
57 | If you have trouble installing and using the Python toolkit directly, you can use:
58 | - a [**Colab notebook** to which you can directly upload your PDF and run aclpubcheck](https://colab.research.google.com/github/acl-org/aclpubcheck/blob/main/aclpubcheck_online.ipynb) without a local installation (thanks to Danilo Croce).
59 | - a **Hugging Face Space** at https://huggingface.co/spaces/teelinsan/aclpubcheck (thanks to Andrea Santilli). More info about this version can be found at https://github.com/teelinsan/aclpubcheck-gui
60 |
61 | ## Updating the names in citations
62 |
63 | ### Description
64 |
65 | Our toolkit also automatically checks your citations and warns you if you have used incorrect names or author lists. Please have a look [here](https://2021.naacl.org/blog/name-change-procedure/) to see why it is important to use updated citations.
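The checker can also be driven from Python instead of the command line. A minimal sketch (the `paper.pdf` path is a placeholder; `formatchecker` reads its two `--disable_*` flags from a module-level `args` object, which therefore has to be set when bypassing the CLI, and the citation check needs `curl` plus network access to the Scholarcy API):

```python
import argparse

from aclpubcheck import formatchecker

# store_false flags: True here means the corresponding check stays enabled
formatchecker.args = argparse.Namespace(
    disable_name_check=True, disable_bottom_check=True)

issues = formatchecker.Formatter().format_check(
    submission="paper.pdf", paper_type="long", check_references=True)
print(issues)  # dict mapping issue categories to messages; empty if no errors
```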
66 |
67 | A demo version of the PDF name checking is available [here](https://pdf-name-change-checking.herokuapp.com/).
68 |
69 | ### How it's done
70 |
71 | The bibliography from your PDF file is extracted using the [Scholarcy API](https://ref.scholarcy.com/api/). Each entry in this bib file is then updated by pulling information from the ACL Anthology, DBLP and arXiv, using fuzzy matching on the titles. After the entries are updated, the author names are compared, and a warning is issued for any mismatch.
72 |
73 | ![Procedure](pdf_image.png)
74 |
75 | ### Functionality
76 |
77 | The functions live in `aclpubcheck/name_check.py`; the class `PDFNameCheck` is used by `formatchecker.py`.
78 |
79 | ### Caveats
80 |
81 | Some of the warnings generated for citations may be spurious or inaccurate, due to parsing and indexing errors. We encourage you to double-check the citations and update them according to the latest source. If you believe that your citation is already updated and correct, please ignore those warnings. You can fix your bib files with a toolkit like [rebiber](https://github.com/yuchenlin/rebiber).
82 |
83 | ### Screenshots
84 |
85 | This is how the warnings appear for outdated names. You are directed to a URL where you can correct the citations. We do not show the name changes themselves, as that might out deadnames in the warnings.
86 |
87 | ![Screenshot](screenshot.png)
88 |
89 | ## Credits
90 | The original version of ACL pubcheck was written by Yichao Zhou, Iz Beltagy, Steven Bethard, Ryan Cotterell and Tanmoy Chakraborty in their role as publication chairs of [NAACL 2021](https://2021.naacl.org/organization/). The tool was improved by Ryan Cotterell and Danilo Croce in their role as publication chairs of [ACL 2022](https://www.2022.aclweb.org/organisers) and [NAACL 2022](https://2022.naacl.org/). Pranav A added the name-checking functions to this toolkit.
91 |
92 | ## Maintenance
93 | The tool is primarily maintained by Ryan Cotterell and Danilo Croce. More volunteers are welcome!
94 |
--------------------------------------------------------------------------------
/aclpubcheck/__init__.py:
--------------------------------------------------------------------------------
1 | # This file is needed in order for aclpubcheck/ to be treated as a Python package
--------------------------------------------------------------------------------
/aclpubcheck/__main__.py:
--------------------------------------------------------------------------------
1 | # __main__.py allows this library to be used with `python -m aclpubcheck`
2 | from .formatchecker import main
3 |
4 | if __name__ == "__main__":
5 |     main()
--------------------------------------------------------------------------------
/aclpubcheck/copyright_signatures.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import textwrap
3 | import pandas as pd
4 |
5 |
6 | def write_copyright_signatures(submissions_path):
7 |
8 |     def clean_str(value):
9 |         return '' if pd.isna(value) else value.strip()
10 |
11 |     # write all copyright signatures to a single file, noting any problems
12 |     with open("copyright-signatures.txt", "w") as output_file:
13 |         df = pd.read_csv(submissions_path, keep_default_na=False)
14 |         for index, row in df.iterrows():
15 |             submission_id = row["Submission ID"]
16 |
17 |             # NOTE: These were the names in the custom final submission form
18 |             # for NAACL 2021. Names and structure may be different depending
19 |             # on your final submission form.
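            # For reference, the columns consumed below are assumed to be named:
            #   Submission ID, Title, copyrightSig, orgName, orgAddress, jobTitle,
            #   "1: First Name", "1: Middle Name", "1: Last Name", "1: Affiliation", ...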
20 |             signature = clean_str(row["copyrightSig"])
21 |             org_name = clean_str(row["orgName"])
22 |             org_address = clean_str(row["orgAddress"])
23 |
24 |             # collect all authors and their affiliations
25 |             authors_parts = []
26 |             for i in range(1, 25):
27 |                 name_parts = [
28 |                     clean_str(row[f'{i}: {x} Name'])
29 |                     for x in ['First', 'Middle', 'Last']]
30 |                 name = ' '.join(x for x in name_parts if x)
31 |                 if name:
32 |                     affiliation = clean_str(row[f"{i}: Affiliation"])
33 |                     authors_parts.append(f'{name} ({affiliation})')
34 |             authors = '\n'.join(authors_parts)
35 |
36 |             # write out the copyright signature in the standard ACL format
37 |             indent = " " * 4
38 |             output_file.write(f"""
39 | Submission # {submission_id}
40 | Title: {row["Title"]}
41 | Authors:
42 | {textwrap.indent(authors, indent)}
43 | Signature: {signature}
44 | Your job title (if not one of the authors): {clean_str(row["jobTitle"])}
45 | Name and address of your organization:
46 | {textwrap.indent(org_name, indent)}
47 | {textwrap.indent(org_address, indent)}
48 |
49 | =================================================================
50 | """)
51 |
52 |
53 | if __name__ == "__main__":
54 |     parser = argparse.ArgumentParser()
55 |     parser.add_argument('--submissions', dest='submissions_path',
56 |                         default='Submission_Information.csv')
57 |     args = parser.parse_args()
58 |     write_copyright_signatures(**vars(args))
59 |
--------------------------------------------------------------------------------
/aclpubcheck/formatchecker.py:
--------------------------------------------------------------------------------
1 | '''
2 | python3 formatchecker.py [-h] [--paper_type {long,short,demo,other}] file_or_dir [file_or_dir ...]
3 | '''
4 |
5 | import argparse
6 | from argparse import Namespace
7 | import json
8 | from enum import Enum
9 | from collections import defaultdict
10 | from os import walk
11 | from os.path import isfile, join
12 | import pdfplumber
13 | from tqdm import tqdm
14 | from termcolor import colored
15 | import os
16 | import numpy as np
17 | import traceback
18 |
19 | from .name_check import PDFNameCheck
20 |
21 |
22 | class Error(Enum):
23 |     SIZE = "Size"
24 |     PARSING = "Parsing"
25 |     MARGIN = "Margin"
26 |     SPELLING = "Spelling"
27 |     FONT = "Font"
28 |     PAGELIMIT = "Page Limit"
29 |
30 |
31 | class Warn(Enum):
32 |     BIB = "Bibliography"
33 |
34 |
35 | class Page(Enum):
36 |     # 595 pixels (72ppi) = 21cm
37 |     WIDTH = 595
38 |     # 842 pixels (72ppi) = 29.7cm
39 |     HEIGHT = 842
40 |
41 |
42 | class Margin(Enum):
43 |     TOP = "top"
44 |     BOTTOM = "bottom"
45 |     RIGHT = "right"
46 |     LEFT = "left"
47 |
48 |
49 | class Formatter(object):
50 |
51 |     def __init__(self):
52 |         # TODO: these should be constants
53 |         self.right_offset = 4.5
54 |         self.left_offset = 2
55 |         self.top_offset = 1
56 |         self.bottom_offset = 1
57 |
58 |         # this is used to check whether an area outside the margin is a "false
59 |         # positive", i.e., an area containing only invisible symbols: when a
60 |         # candidate area outside the margin is found, it is cropped, and if all
61 |         # of its pixels equal the background color, it is skipped
62 |         self.background_color = 255
63 |         self.pdf_namecheck = PDFNameCheck()
64 |
65 |
66 |     def format_check(self, submission, paper_type, output_dir=".", print_only_errors=False, check_references=False):
67 |         """
68 |         Check the formatting of one submission; returns a dict mapping issue types to messages (empty if the paper passes).
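        Args (as used elsewhere in this repository):
            submission: path to the PDF file to check.
            paper_type: one of "short", "long", "demo" or "other".
            output_dir: directory where the JSON log and annotated page images are written.
            print_only_errors: when False, a JSON log file is written even if it is empty.
            check_references: when True, the bibliography checks are also run.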
69 | """ 70 | print(f"Checking {submission}") 71 | 72 | # TOOD: make this less of a hack 73 | self.number = submission.split("/")[-1].split("_")[0].replace(".pdf", "") 74 | self.pdf = pdfplumber.open(submission) 75 | self.logs = defaultdict(list) # reset log before calling the format-checking functions 76 | self.page_errors = set() 77 | self.pdfpath = submission 78 | 79 | # TODO: A few papers take hours to check. Consider using a timeout 80 | self.check_page_size() 81 | self.check_page_margin(output_dir) 82 | self.check_page_num(paper_type) 83 | self.check_font() 84 | 85 | if check_references: 86 | self.check_references() 87 | 88 | # TODO: put json dump back on 89 | output_file = "errors-{0}.json".format(self.number) 90 | # string conversion for json dump 91 | logs_json = {} 92 | for k, v in self.logs.items(): 93 | logs_json[str(k)] = v 94 | 95 | if self.logs: 96 | print(f"Errors. Check {output_file} for details.") 97 | 98 | errors, warnings = 0, 0 99 | if self.logs.items(): 100 | for e, ms in self.logs.items(): 101 | for m in ms: 102 | if isinstance(e, Error) and e != Error.PARSING: 103 | print(colored("Error ({0}):".format(e.value), "red")+" "+m) 104 | errors += 1 105 | elif e == Error.PARSING: 106 | print(colored("Parsing Error:".format(e.value), "yellow")+" "+m) 107 | else: 108 | print(colored("Warning ({0}):".format(e.value), "yellow")+" "+m) 109 | warnings += 1 110 | 111 | 112 | # English nominal morphology 113 | error_text = "errors" 114 | if errors == 1: 115 | error_text = "error" 116 | warning_text = "warnings" 117 | if warnings == 1: 118 | warning_text = "warning" 119 | 120 | 121 | if print_only_errors == False: 122 | json.dump(logs_json, open(os.path.join(output_dir,output_file), 'w')) # always write a log file even if it is empty 123 | 124 | # display to user 125 | print() 126 | print("We detected {0} {1} and {2} {3} in your paper.".format(*(errors, error_text, warnings, warning_text))) 127 | print("In general, it is required that you fix errors for your paper to be published. Fixing warnings is optional, but recommended.") 128 | print("Important: Some of the margin errors may be spurious. The library detects the location of images, but not whether they have a white background that blends in.") 129 | print("Important: Some of the warnings generated for citations may be spurious and inaccurate, due to parsing and indexing errors.") 130 | print("We encourage you to double check the citations and update them depending on the latest source. If you believe that your citation is updated and correct, then please ignore those warnings.") 131 | 132 | if errors >= 1: 133 | return logs_json 134 | else: 135 | return {} 136 | 137 | 138 | else: 139 | if print_only_errors == False: 140 | json.dump(logs_json, open(os.path.join(output_dir,output_file), 'w')) 141 | 142 | print(colored("All Clear!", "green")) 143 | return logs_json 144 | 145 | 146 | 147 | def check_page_size(self): 148 | """ Checks the paper size (A4) of each pages in the submission. """ 149 | 150 | pages = [] 151 | for i, page in enumerate(self.pdf.pages): 152 | 153 | if (round(page.width), round(page.height)) != (Page.WIDTH.value, Page.HEIGHT.value): 154 | pages.append(i+1) 155 | for page in pages: 156 | error = "Page #{} is not A4.".format(page) 157 | self.logs[Error.SIZE] += [error] 158 | self.page_errors.update(pages) 159 | 160 | 161 | def check_page_margin(self, output_dir): 162 | """ Checks if any text or figure is in the margin of pages. 
""" 163 | 164 | pages_image = defaultdict(list) 165 | pages_text = defaultdict(list) 166 | perror = [] 167 | for i, p in enumerate(self.pdf.pages): 168 | if i+1 in self.page_errors: 169 | continue 170 | try: 171 | # Parse images 172 | # 57 pixels (72ppi) = 2cm; 71 pixels (72ppi) = 2.5cm. 173 | for image in p.images: 174 | violation = None 175 | if int(image["bottom"]) > 0 and float(image["top"]) < (57-self.top_offset): 176 | violation = Margin.TOP 177 | elif int(image["x1"]) > 0 and float(image["x0"]) < (71-self.left_offset): 178 | violation = Margin.LEFT 179 | elif int(image["x0"]) < Page.WIDTH.value and Page.WIDTH.value-float(image["x1"]) < (71-self.right_offset): 180 | violation = Margin.RIGHT 181 | 182 | if violation: 183 | # if the image is completely white, it can be skipped 184 | 185 | # get the actual visible area 186 | x0 = max(0, int(image["x0"])) 187 | # check the intersection with the right margin to handle larger images 188 | # but with an "overflow" that is of the same color of the backgrond 189 | if violation == Margin.RIGHT: 190 | x0 = max(x0, Page.WIDTH.value - 71 + self.right_offset) 191 | 192 | x1 = min(int(image["x1"]), Page.WIDTH.value) 193 | if violation == Margin.LEFT: 194 | x1 = min(x1, 71 - self.right_offset) 195 | 196 | y0 = max(0, int(image["top"])) 197 | 198 | y1 = min(int(image["bottom"]), Page.HEIGHT.value) 199 | if violation == Margin.TOP: 200 | y1 = min(y1, 57-self.top_offset) 201 | 202 | bbox = (x0, y0, x1, y1) 203 | 204 | # avoid problems in cropping images too small 205 | if x1 - x0 <= 1 or y1 - y0 <= 1: 206 | continue 207 | 208 | # cropping the image to check if it is white 209 | # i.e., all pixels set to 255 210 | cropped_page = p.crop(bbox) 211 | try: 212 | image_obj = cropped_page.to_image(resolution=100) 213 | if np.mean(image_obj.original) != self.background_color: 214 | pages_image[i] += [(image, violation)] 215 | # if there are some errors during cropping, it is better to check 216 | except: 217 | pages_image[i] += [(image, violation)] 218 | 219 | # Parse texts 220 | for j, word in enumerate(p.extract_words(extra_attrs=["non_stroking_color", "stroking_color"])): 221 | violation = None 222 | 223 | #if word["non_stroking_color"] == (0, 0, 0) or word["non_stroking_color"] == 0 or word["stroking_color"] == 0: 224 | if word["non_stroking_color"] == (0, 0, 0) or word["non_stroking_color"] == [0]: 225 | continue 226 | 227 | if word["non_stroking_color"] is None and word["stroking_color"] is None: 228 | continue 229 | 230 | if int(word["bottom"]) > 0 and float(word["top"]) < (57-self.top_offset): 231 | violation = Margin.TOP 232 | elif int(word["x1"]) > 0 and float(word["x0"]) < (71-self.left_offset): 233 | violation = Margin.LEFT 234 | elif int(word["x0"]) < Page.WIDTH.value and Page.WIDTH.value-float(word["x1"]) < (71-self.right_offset): 235 | violation = Margin.RIGHT 236 | 237 | if violation and int(word["x0"]) < Page.WIDTH.value and int(word["x1"]) >= 0 and int(word["bottom"]) >= 0: 238 | # if the area image is completely white, it can be skipped 239 | # get the actual visible area 240 | x0 = max(0, int(word["x0"])) 241 | # check the intersection with the right margin to handle larger images 242 | # but with an "overflow" that is of the same color of the backgrond 243 | if violation == Margin.RIGHT: 244 | x0 = max(x0, Page.WIDTH.value - 71 + self.right_offset) 245 | 246 | x1 = min(int(word["x1"]), Page.WIDTH.value) 247 | if violation == Margin.LEFT: 248 | x1 = min(x1, 71 - self.right_offset) 249 | 250 | y0 = max(0, int(word["top"])) 251 | 252 | y1 = 
min(int(word["bottom"]), Page.HEIGHT.value) 253 | if violation == Margin.TOP: 254 | y1 = min(y1, 57-self.top_offset) 255 | 256 | bbox = (x0, y0, x1, y1) 257 | 258 | # avoid problems in cropping images too small 259 | if x1 - x0 <= 1 or y1 - y0 <= 1: 260 | continue 261 | 262 | # cropping the image to check if it is white 263 | # i.e., all pixels set to 255 264 | try: 265 | cropped_page = p.crop(bbox) 266 | image_obj = cropped_page.to_image(resolution=100) 267 | if np.mean(image_obj.original) != self.background_color: 268 | print("Found text violation:\t" + str(violation) + "\t" + str(word)) 269 | pages_text[i] += [(word, violation)] 270 | except: 271 | # if there are some errors during cropping, it is better to check 272 | pages_image[i] += [(word, violation)] 273 | 274 | # CHECK THE AREA BELOW THE TEXT, it should be empty as it is expected to 275 | # be populated with watermark and pages during the construction of the 276 | # proceedings 277 | if args.disable_bottom_check: 278 | bpixels = 62 279 | bbox = (0, Page.HEIGHT.value - bpixels, Page.WIDTH.value - self.bottom_offset, Page.HEIGHT.value - self.bottom_offset) 280 | word = {"top": bbox[1], "bottom": bbox[3]} 281 | 282 | # cropping the image to check if it is white 283 | # i.e., all pixels set to 255 284 | try: 285 | cropped_page = p.crop(bbox) 286 | image_obj = cropped_page.to_image(resolution=100) 287 | if np.mean(image_obj.original) != self.background_color: 288 | print("Found text violation:\t" + str(Margin.BOTTOM) + "\t" + str(word)) 289 | pages_text[i] += [(word, Margin.BOTTOM)] 290 | except: 291 | # if there are some errors during cropping, it is better to check 292 | pages_image[i] += [(word, Margin.BOTTOM)] 293 | traceback.print_exc() 294 | 295 | except: 296 | traceback.print_exc() 297 | perror.append(i+1) 298 | 299 | if perror: 300 | self.page_errors.update(perror) 301 | self.logs[Error.PARSING] = ["Error occurs when parsing page {}.".format(perror)] 302 | 303 | if pages_text or pages_image: 304 | pages = sorted(set(pages_text.keys()).union(set((pages_image.keys())))) 305 | for page in pages: 306 | im = self.pdf.pages[page].to_image(resolution=150) 307 | for (word, violation) in pages_text[page]: 308 | 309 | bbox = None 310 | if violation == Margin.RIGHT: 311 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the right margin.".format(page+1)] 312 | bbox = (Page.WIDTH.value-80, int(word["top"]-20), Page.WIDTH.value-20, int(word["bottom"]+20)) 313 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 314 | elif violation == Margin.LEFT: 315 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the left margin.".format(page+1)] 316 | bbox = (20, int(word["top"]-20), 80, int(word["bottom"]+20)) 317 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 318 | elif violation == Margin.TOP: 319 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the top margin.".format(page+1)] 320 | bbox = (20, int(word["top"]-20), 80, int(word["bottom"]+20)) 321 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 322 | elif violation == Margin.BOTTOM: 323 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the bottom margin. 
It should be empty (e.g., without page number) and populated when building the proceedings.".format(page+1)] 324 | bbox = (0, int(word["top"]), Page.WIDTH.value, int(word["bottom"])) 325 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 326 | else: 327 | # TODO: add bottom margin violations 328 | pass 329 | 330 | 331 | for (image, violation) in pages_image[page]: 332 | 333 | self.logs[Error.MARGIN] += ["An image on page {} bleeds into the margin.".format(page+1)] 334 | bbox = (image["x0"], image["top"], image["x1"], image["bottom"]) 335 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 336 | 337 | png_file_name = "errors-{0}-page-{1}.png".format(*(self.number, page+1)) 338 | im.save(os.path.join(output_dir, png_file_name), format="PNG") 339 | #+ "Specific text: "+str([v for k, v in pages_text.values()])] 340 | 341 | 342 | def check_page_num(self, paper_type): 343 | """Check if the paper exceeds the page limit.""" 344 | 345 | # TODO: Enable uploading a paper_type file to include all papers' types. 346 | 347 | # thresholds for different types of papers 348 | standards = {"short": 5, "long": 9, "demo": 7, "other": float("inf")} 349 | page_threshold = standards[paper_type.lower()] 350 | candidates = {"References", "Acknowledgments", "Acknowledgement", "Acknowledgment", "EthicsStatement", "EthicalConsiderations", "Ethicalconsiderations", "BroaderImpact", "EthicalConcerns", "EthicalStatement", "EthicalDeclaration", "Limitations", "Limitation"} 351 | #acks = {"Acknowledgment", "Acknowledgement"} 352 | 353 | # Find (references, acknowledgements, ethics). 354 | marker = None 355 | if len(self.pdf.pages) <= page_threshold: 356 | return 357 | 358 | for i, page in enumerate(self.pdf.pages): 359 | if i+1 in self.page_errors: 360 | continue 361 | text = page.extract_text().split('\n') 362 | for j, line in enumerate(text): 363 | if marker is None and any(x in line for x in candidates): 364 | marker = (i+1, j+1) 365 | #if "Acknowl" in line and all(x not in line for x in acks): 366 | # self.logs[Error.SPELLING] = ["'Acknowledgments' was misspelled."] 367 | 368 | # if the first marker appears after the first line of page 10, 369 | # there is high probability the paper exceeds the page limit. 370 | 371 | # If we reached this state and that marker is still None it means all pages already have errors 372 | # We can return here to print the already existing errors 373 | if marker is None: 374 | return 375 | 376 | if marker > (page_threshold + 1, 1): 377 | page, line = marker 378 | self.logs[Error.PAGELIMIT] = [f"Paper exceeds the page limit " 379 | f"because first (References, " 380 | f"Acknowledgments, Ethics Statement) was found on " 381 | f"page {page}, line {line}."] 382 | 383 | 384 | def check_font(self): 385 | """ Checks the fonts. 
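        The most frequent font must be one of the Times-like fonts required by
        the ACL style (see correct_fontnames below) and must cover at least 35%
        of the characters in the paper.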
""" 386 | 387 | correct_fontnames = set(["NimbusRomNo9L-Regu", 388 | "TeXGyreTermesX-Regular", 389 | "TimesNewRomanPSMT", 390 | "ICWANT+STIXGeneral-Regular", 391 | "ICZIZQ+Inconsolatazi4-Regular" 392 | ]) 393 | 394 | fonts = defaultdict(int) 395 | for i, page in enumerate(self.pdf.pages): 396 | try: 397 | for char in page.chars: 398 | fonts[char['fontname']] += 1 399 | except: 400 | self.logs[Error.FONT] += [f"Can't parse page #{i+1}"] 401 | 402 | max_font_count, max_font_name = max((count, name) for name, count in fonts.items()) # find most used font 403 | sum_char_count = sum(fonts.values()) 404 | 405 | # TODO: make this a command line argument 406 | if max_font_count / sum_char_count < 0.35: # the most used font should be used more than 35% of the time 407 | self.logs[Error.FONT] += ["Can't find the main font"] 408 | 409 | if not any([max_font_name.endswith(correct_fontname) for correct_fontname in correct_fontnames]): # the most used font should be `correct_fontname` 410 | self.logs[Error.FONT] += [f"Wrong font. The main font used is {max_font_name} when it should a font in {correct_fontnames}."] 411 | 412 | def make_name_check_config(self): 413 | """Configure the name checking parameters""" 414 | 415 | config_dict = { 416 | 'file': self.pdfpath, 417 | 'show_names': False, # Show how the name is changed 418 | 'whole_name': False, # Consider the whole name changes 419 | 'first_name': True, # Consider only first name changes 420 | 'last_name': True, # Consider only last name changes 421 | 'ref_string': 'References', # How the bibilography starts 422 | 'mode': 'ensemble', # The mode for scholarcy, ensemble worked the best for ACL papers 423 | 'initials': True # Allow abbreviating first names to initials only. 424 | } 425 | 426 | return Namespace(**config_dict) 427 | 428 | 429 | def check_references(self): 430 | """ Check that citations have URLs, and that they have venues (not just arXiv ids). """ 431 | 432 | found_references = False 433 | arxiv_word_count = 0 434 | doi_url_count = 0 435 | arxiv_url_count = 0 436 | all_url_count = 0 437 | 438 | for i, page in enumerate(self.pdf.pages): 439 | try: 440 | page_text = page.extract_text() 441 | except: 442 | page_text = "" 443 | self.logs[Warn.BIB] += [f"Can't parse page #{i+1}"] 444 | 445 | lines = page_text.split('\n') 446 | for j, line in enumerate(lines): 447 | if "References" in line: 448 | found_references = True 449 | break 450 | if found_references: 451 | arxiv_word_count += page_text.lower().count('arxiv') 452 | urls = [h['uri'] for h in page.hyperlinks] 453 | urls = set(urls) # When link text spans more than one line, it returns the same url multiple times 454 | for url in urls: 455 | if 'doi.org' in url: 456 | doi_url_count += 1 457 | elif 'arxiv.org' in url: 458 | arxiv_url_count += 1 459 | all_url_count += 1 460 | 461 | # The following checks fail in ~60% of the papers. TODO: relax them a bit 462 | 463 | if args.disable_name_check: 464 | config = self.make_name_check_config() 465 | output_strings = self.pdf_namecheck.execute(config) 466 | self.logs[Warn.BIB] += output_strings 467 | 468 | if doi_url_count < 3: 469 | self.logs[Warn.BIB] += [f"Bibliography should use ACL Anthology DOIs whenever possible. Only {doi_url_count} references do."] 470 | 471 | if arxiv_url_count > 0.2 * all_url_count: # only 20% of the links are allowed to be arXiv links 472 | self.logs[Warn.BIB] += [f"It appears you are using arXiv links more than you should ({arxiv_url_count}/{all_url_count}). 
Consider using ACL Anthology DOIs instead."]
473 |
474 |         if all_url_count < 5:
475 |             self.logs[Warn.BIB] += [f"It appears most of the references are not using paper links. Only {all_url_count} links found."]
476 |
477 |         if arxiv_word_count > 10:
478 |             self.logs[Warn.BIB] += [f"It appears you are using arXiv references more than you should ({arxiv_word_count} found). Consider using ACL Anthology references instead."]
479 |
480 |         if not found_references:
481 |             self.logs[Warn.BIB] += ["Couldn't find any references."]
482 |
483 |
484 | args = None
485 | def worker(pdf_path, paper_type):
486 |     """ Process one PDF. """
487 |     return Formatter().format_check(submission=pdf_path, paper_type=paper_type)
488 |
489 |
490 | def main():
491 |     global args
492 |     parser = argparse.ArgumentParser()
493 |     parser.add_argument('submission_paths', metavar='file_or_dir', nargs='+',
494 |                         default=[])
495 |     parser.add_argument('-p', '--paper_type', choices={"short", "long", "demo", "other"},
496 |                         default='long', help="type of the paper; used by the page-limit check")
497 |     parser.add_argument('--num_workers', type=int, default=1)
498 |     parser.add_argument('--disable_name_check', action='store_false', help="disable the citation name check (enabled by default)")
499 |     parser.add_argument('--disable_bottom_check', action='store_false', help="disable the empty-bottom-margin check (enabled by default)")
500 |
501 |
502 |     args = parser.parse_args()
503 |
504 |
505 |     # retrieve file paths
506 |     paths = {join(root, file_name)
507 |              for path in args.submission_paths
508 |              for root, _, file_names in walk(path)
509 |              for file_name in file_names}
510 |     paths.update(args.submission_paths)
511 |
512 |     # retrieve files
513 |     fileset = sorted([p for p in paths if isfile(p) and p.endswith(".pdf")])
514 |
515 |     if not fileset:
516 |         print(f"No PDF files found in {paths}")
517 |
518 |     if args.num_workers > 1:
519 |         from functools import partial
520 |         from multiprocessing.pool import Pool
521 |         with Pool(args.num_workers) as p:
522 |             # bind paper_type, since imap only supplies the pdf path
523 |             list(tqdm(p.imap(partial(worker, paper_type=args.paper_type), fileset), total=len(fileset)))
524 |     else:
525 |         # TODO: make the tqdm togglable
526 |         #for submission in tqdm(fileset):
527 |         for submission in fileset:
528 |             worker(submission, args.paper_type)
529 |
530 | if __name__ == "__main__":
531 |     main()
532 |
--------------------------------------------------------------------------------
/aclpubcheck/googletools.py:
--------------------------------------------------------------------------------
1 | import os.path
2 |
3 | import google_auth_oauthlib.flow
4 | import google.auth.transport.requests
5 | import google.oauth2.credentials
6 | import googleapiclient.discovery
7 |
8 |
9 | def sheets_service():
10 |     """Loads credentials and opens a Google Sheets API client.
11 |
12 |     A credentials.json file should be in the current directory.
13 |     https://developers.google.com/workspace/guides/create-credentials
14 |     A token.json file will be written to the current directory to avoid
15 |     repeatedly asking the user to log in.
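    Assumes the google-auth, google-auth-oauthlib and google-api-python-client
    packages are installed; note that setup.py does not list them, so they must
    be installed separately before using this module.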
16 | 17 | :return: the Google Sheets API client 18 | """ 19 | scopes = ['https://www.googleapis.com/auth/spreadsheets'] 20 | creds = None 21 | if os.path.exists('token.json'): 22 | creds = google.oauth2.credentials.Credentials.from_authorized_user_file( 23 | 'token.json', scopes) 24 | if not creds or not creds.valid: 25 | if creds and creds.expired and creds.refresh_token: 26 | creds.refresh(google.auth.transport.requests.Request()) 27 | else: 28 | iaf = google_auth_oauthlib.flow.InstalledAppFlow 29 | flow = iaf.from_client_secrets_file('credentials.json', scopes) 30 | creds = flow.run_local_server(port=0) 31 | with open('token.json', 'w') as token: 32 | token.write(creds.to_json()) 33 | return googleapiclient.discovery.build('sheets', 'v4', credentials=creds) 34 | -------------------------------------------------------------------------------- /aclpubcheck/metadatachecker.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import collections 3 | import itertools 4 | import os 5 | import os.path 6 | import regex as re 7 | import unicodedata 8 | import textwrap 9 | 10 | import pandas as pd 11 | import pdfplumber 12 | import unidecode 13 | 14 | from . import googletools 15 | 16 | 17 | def _clean_str(value): 18 | if pd.isna(value): 19 | return '' 20 | # uncurl all quotes 21 | value = re.sub(r'[\u2018\u2019]', "'", value) 22 | value = re.sub(r'[\u201C\u201D]', '"', value) 23 | # use simple dashes 24 | value = re.sub(r'[\u2013\u2014]', "-", value) 25 | # not exactly sure why, but this has to be done iteratively 26 | old_value = None 27 | value = value.strip() 28 | while old_value != value: 29 | old_value = value 30 | # strip space before accent; PDF seems to introduce these 31 | value = re.sub(r'\p{Zs}+(\p{Mn})', r'\1', value) 32 | # combine accents with characters 33 | value = unicodedata.normalize('NFKC', value) 34 | return value 35 | 36 | 37 | def yield_author_problems(names, text): 38 | # check for author names in the expected order, allowing for 39 | # punctuation, affiliations, etc. 
between names 40 | # NOTE: only removed or re-ordered (not added) authors will be caught 41 | match = re.search('.*?'.join(names), text, re.DOTALL) 42 | if not match: 43 | 44 | # check if there is a match when ignoring case, punctuation, accents 45 | # since this is the most common type of error 46 | allowed_chars = r'[\p{Zs}\p{p}\p{Mn}]' 47 | match_ignoring_case_punct_accent = re.search( 48 | '.*?'.join( 49 | fr'{allowed_chars}*'.join(unidecode.unidecode(c) for c in p) 50 | for part in names for p in re.split(allowed_chars, part)), 51 | unidecode.unidecode(text), 52 | re.DOTALL | re.IGNORECASE) 53 | if match_ignoring_case_punct_accent: 54 | problem = 'AUTHOR-MISMATCH-CASE-PUNCT-ACCENT' 55 | # these offsets may be slightly incorrect because unidecode may 56 | # change the number of characters, but it should be close enough 57 | start, end = match_ignoring_case_punct_accent.span() 58 | in_text = text[start: end] 59 | else: 60 | problem = 'AUTHOR-MISMATCH' 61 | in_text = text 62 | yield problem, f"meta=\"{' '.join(names)}\"\npdf =\"{in_text}\"" 63 | 64 | 65 | def yield_title_problems(title, text): 66 | # ignore spaces and some LaTeX-isms 67 | title_chars = re.sub(r'[\s{}$^]', '', title.replace('--', '-')) 68 | title_regex = r'\s*'.join(re.escape(c) for c in title_chars) 69 | 70 | # ignore differences in case; LaTeX \sc comes out as caps in PDF 71 | match = re.search(title_regex, text, re.IGNORECASE) 72 | if not match: 73 | yield 'TITLE', f"meta=\"{title}\"\npdf =\"{text}\"" 74 | 75 | 76 | def yield_copyright_problems(signature, org_name, org_address): 77 | if not signature: 78 | yield "COPYRIGHT", "The signature is missing." 79 | elif signature == "NA": 80 | yield "COPYRIGHT", f'The signature "{signature}" must be accompanied ' \ 81 | f'by a "License to Publish" or equivalent.' 82 | elif len(signature) < 3 or len(signature.split()) < 2: 83 | yield "COPYRIGHT", f'The signature "{signature}" does not appear to ' \ 84 | f'be a full name.' 85 | if not org_name: 86 | yield "COPYRIGHT", "The organization name is missing." 87 | elif len(org_name) < 5 and org_name not in {'IBM'}: 88 | yield "COPYRIGHT", f'The organization name "{org_name}" does not ' \ 89 | f'appear to be a full name. ' 90 | if not org_address: 91 | yield "COPYRIGHT", "The organization address is missing." 92 | elif len(org_address) < 3 or len(org_address.split()) < 2: 93 | org_address_simple = org_address.replace("\n", " ") 94 | yield "COPYRIGHT", f'The organization address "{org_address_simple}" ' \ 95 | f'does not appear to be a complete physical address.' 96 | 97 | 98 | def check_metadata( 99 | submissions_path, 100 | pdfs_dir, 101 | spreadsheet_id, 102 | sheet_id, 103 | id_column, 104 | problem_column, 105 | post=False): 106 | 107 | # map submission IDs to PDF paths 108 | id_to_pdf = {} 109 | for root, _, filenames in os.walk(pdfs_dir): 110 | for filename in filenames: 111 | if filename.endswith("_Paper.pdf"): 112 | submission_id, _ = filename.split("_", 1) 113 | id_to_pdf[int(submission_id)] = os.path.join(root, filename) 114 | 115 | id_to_sheet_row = {} 116 | problems = collections.defaultdict(lambda: collections.defaultdict(list)) 117 | 118 | df = pd.read_csv(submissions_path, keep_default_na=False) 119 | for index, row in df.iterrows(): 120 | submission_id = row["Submission ID"] 121 | title = _clean_str(row["Title"]) 122 | 123 | # NOTE: These were the names in the custom final submission form 124 | # for NAACL 2021. Names and structure may be different depending 125 | # on your final submission form. 
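            # (empty or implausibly short values here are flagged later by
            # yield_copyright_problems)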
126 | signature = _clean_str(row["copyrightSig"]) 127 | org_name = _clean_str(row["orgName"]) 128 | org_address = _clean_str(row["orgAddress"]) 129 | 130 | # row in the spreadsheet is 1-based and first row is the header 131 | id_to_sheet_row[submission_id] = index + 2 132 | 133 | # open the PDF 134 | pdf_path = id_to_pdf[submission_id] 135 | pdf = pdfplumber.open(pdf_path) 136 | 137 | # assumes metadata can be found in the first 500 characters 138 | text = _clean_str(pdf.pages[0].extract_text()[:500]) 139 | 140 | # collect all authors and their affiliations 141 | names = [] 142 | for i in range(1, 25): 143 | for x in ['First', 'Middle', 'Last']: 144 | name_part = _clean_str(row[f'{i}: {x} Name']) 145 | if name_part: 146 | names.extend(name_part.split()) 147 | 148 | # collect all problems 149 | for problem_type, problem_text in itertools.chain( 150 | yield_author_problems(names, text), 151 | yield_title_problems(title, text), 152 | yield_copyright_problems(signature, org_name, org_address)): 153 | problems[submission_id][problem_type].append(problem_text) 154 | 155 | # print all problems, grouped by type of problem 156 | for submission_id in sorted(problems): 157 | for problem_type in sorted(problems[submission_id]): 158 | problem_text = '\n'.join(problems[submission_id][problem_type]) 159 | problem_text = textwrap.indent(problem_text, ' ') 160 | print(f'{submission_id}:{problem_type}:\n{problem_text}\n') 161 | 162 | # report overall problem statistics 163 | print(f"{len(problems)} submissions failed:") 164 | problem_counts = collections.Counter( 165 | problem_type 166 | for type_texts in problems.values() 167 | for problem_type in type_texts.keys() 168 | ) 169 | for problem_type in sorted(problem_counts.keys()): 170 | print(f" {problem_counts[problem_type]} {problem_type}") 171 | 172 | # if requested, post problems to the Google Sheet 173 | if post: 174 | values = googletools.sheets_service().spreadsheets().values() 175 | 176 | # get the number of rows 177 | id_range = f'{sheet_id}!{id_column}2:{id_column}' 178 | request = values.get(spreadsheetId=spreadsheet_id, range=id_range) 179 | submission_ids = {int(value) for [value] in request.execute()['values']} 180 | if submission_ids != id_to_sheet_row.keys(): 181 | raise ValueError(f'in Google sheet only: ' 182 | f'{submission_ids - id_to_sheet_row.keys()}; ' 183 | f'in START sheet only: ' 184 | f'{id_to_sheet_row.keys() - submission_ids}') 185 | n_rows = len(submission_ids) + 1 186 | 187 | sheet_row_to_problems = collections.defaultdict(list) 188 | for submission_id, type_texts in problems.items(): 189 | for problem_type, texts in type_texts.items(): 190 | problems = '\n'.join(texts) 191 | sheet_row_to_problems[id_to_sheet_row[submission_id]].append( 192 | f'{problem_type}:\n{problems}') 193 | 194 | # fill in the problem column 195 | request = values.update( 196 | spreadsheetId=spreadsheet_id, 197 | range=f'{sheet_id}!{problem_column}2:{problem_column}', 198 | valueInputOption='RAW', 199 | body={'values': [['\n'.join(sheet_row_to_problems.get(i, []))] 200 | for i in range(2, n_rows)]}) 201 | request.execute() 202 | 203 | 204 | if __name__ == "__main__": 205 | parser = argparse.ArgumentParser() 206 | parser.add_argument('--submissions', dest='submissions_path', 207 | default='Submission_Information.csv') 208 | parser.add_argument('--pdfs', dest='pdfs_dir', default='final') 209 | parser.add_argument('--post', action='store_true') 210 | parser.add_argument('--spreadsheet-id', 211 | default='1lQyGZNBEBwukf8-mgPzIH57xUX9y4o2OUCzpEvNpW9A') 212 | 
parser.add_argument('--sheet-id', default='Sheet1') 213 | parser.add_argument('--id-column', default='A') 214 | parser.add_argument('--problem-column', default='E') 215 | args = parser.parse_args() 216 | check_metadata(**vars(args)) 217 | -------------------------------------------------------------------------------- /aclpubcheck/name_check.py: -------------------------------------------------------------------------------- 1 | import os 2 | import rebiber 3 | from pylatexenc.latex2text import LatexNodes2Text 4 | from pybtex.database import parse_file 5 | import contextlib 6 | from unidecode import unidecode 7 | import re 8 | 9 | 10 | class PDFNameCheck: 11 | 12 | def __init__(self): 13 | # Generate and update the bib list from various conferences 14 | filepath = os.path.abspath(rebiber.__file__).replace("__init__.py", "") 15 | bib_list_path = os.path.join(filepath, "bib_list.txt") 16 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 17 | self.bib_db = rebiber.construct_bib_db( 18 | bib_list_path, start_dir=filepath) 19 | 20 | def execute_curl(self, config): 21 | # The curl string to convert the PDF to bib. 22 | # I have used scholarcy API here. 23 | # See the link here: https://ref.scholarcy.com/api/ 24 | # I used the POST curl for download 25 | 26 | self.filename = config.file.split('.')[0] 27 | temp_name = self.filename.split('/')[-1] 28 | os.makedirs('temp', exist_ok=True) 29 | 30 | curl_string = 'curl --silent -X \'POST\'' \ 31 | ' \'https://ref.scholarcy.com/api/references/download\'' \ 32 | ' -H \'accept: application/json\'' \ 33 | ' -H \'Authorization: Bearer \'' \ 34 | ' -H \'Content-Type: multipart/form-data\'' \ 35 | f' -F \'file=@{config.file};type=application/pdf\'' \ 36 | ' -F \'document_type=full_paper\'' \ 37 | f' -F \'references={config.ref_string}\'' \ 38 | f' -F \'reference_style={config.mode}\'' \ 39 | ' -F \'reference_format=bibtex\'' \ 40 | ' -F \'parser=v2\'' \ 41 | f' -F \'engine=v1\' > temp/before-rebiber-{temp_name}.bib' 42 | 43 | # Execute that curl string 44 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 45 | os.system(curl_string) 46 | 47 | def apply_rebiber(self): 48 | # The curl string generates a bib file called 'before rebiber' 49 | # Pass it to rebiber 50 | temp_name = self.filename.split('/')[-1] 51 | all_bib_entries = rebiber.load_bib_file(f'temp/before-rebiber-{temp_name}.bib') 52 | 53 | # Update the bib file using rebiber and call it 'after rebiber' 54 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 55 | rebiber.normalize_bib( 56 | self.bib_db, all_bib_entries, f'temp/after-rebiber-{temp_name}.bib') 57 | 58 | def extract_names(self): 59 | # Parse both bib files 60 | temp_name = self.filename.split('/')[-1] 61 | old_bib_data = parse_file(f'temp/before-rebiber-{temp_name}.bib') 62 | new_bib_data = parse_file(f'temp/after-rebiber-{temp_name}.bib') 63 | 64 | name_list = {} 65 | 66 | paper_keys = list(new_bib_data.entries.keys()) 67 | 68 | # Here old means before updating 69 | # Here new means after updating 70 | # We will collect the author names before and after the bib updates 71 | for paper in paper_keys: 72 | old_paper_authors = [] 73 | new_paper_authors = [] 74 | if 'author' in old_bib_data.entries[paper].persons: 75 | old_key = old_bib_data.entries[paper].persons['author'] 76 | new_key = new_bib_data.entries[paper].persons['author'] 77 | old_length = len(old_bib_data.entries[paper].persons['author']) 78 | new_length = len(new_bib_data.entries[paper].persons['author']) 79 | additional = False 80 | 
for i in range(new_length): 81 | if i < old_length: 82 | # Bugfix: Sometimes names with dots are being parsed as full names 83 | if ' '.join(old_key[i].bibtex_first_names).replace('.', '') == \ 84 | ' '.join(new_key[i].bibtex_first_names + new_key[i].last_names): 85 | old_key[i] = new_key[i] 86 | 87 | old_name = old_key[i].bibtex_first_names + \ 88 | old_key[i].last_names 89 | new_name = new_key[i].bibtex_first_names + \ 90 | new_key[i].last_names 91 | 92 | # Bugfix: Sometimes there are two names in a name 93 | if old_key[i].last_names == new_key[i].bibtex_first_names + new_key[i].last_names: 94 | additional = i 95 | new_name = [LatexNodes2Text().latex_to_text(name) 96 | for name in new_name] 97 | old_paper_authors.append(old_name) 98 | new_paper_authors.append(new_name) 99 | else: 100 | # Sometimes authors tend to cite only n authors 101 | new_name = new_key[i].first_names + \ 102 | new_key[i].last_names 103 | new_paper_authors.append(new_name) 104 | 105 | # Bugfix: Sometimes there are two names in a name 106 | if additional: 107 | old_paper_authors[additional] = new_paper_authors[additional] 108 | if additional+1 < len(new_paper_authors): 109 | old_paper_authors.insert( 110 | additional+1, new_paper_authors[additional+1]) 111 | name_list[paper] = {} 112 | name_list[paper]['old'] = old_paper_authors 113 | name_list[paper]['new'] = new_paper_authors 114 | name_list[paper]['title'] = LatexNodes2Text().latex_to_text( 115 | new_bib_data.entries[paper].fields['title']) 116 | 117 | if 'url' in new_bib_data.entries[paper].fields: 118 | name_list[paper]['url'] = new_bib_data.entries[paper].fields['url'] 119 | 120 | return name_list 121 | 122 | def if_equal(self, string_a, string_b): 123 | ''' 124 | Do a basic cleanup to tell whether the names are same or not 125 | ''' 126 | # remove spaces and lowercase 127 | string_a = ('').join(string_a).lower() 128 | string_b = ('').join(string_b).lower() 129 | # remove punctuations 130 | string_a = re.sub(r'\W+', '', string_a) 131 | string_b = re.sub(r'\W+', '', string_b) 132 | # remove accents 133 | string_a = unidecode(string_a) 134 | string_b = unidecode(string_b) 135 | return string_a == string_b 136 | 137 | def compare_changes(self, name_list, config): 138 | 139 | warnings = [] 140 | error_count = 1 141 | 142 | for paper in name_list: 143 | output_strings = [] 144 | old = name_list[paper]['old'] 145 | new = name_list[paper]['new'] 146 | title = name_list[paper]['title'] 147 | if 'url' in name_list[paper]: 148 | url = name_list[paper]['url'] 149 | else: 150 | url = '' 151 | old_length = len(old) 152 | new_length = len(new) 153 | # Citation error: Cites do not contain every author 154 | if old_length != new_length: 155 | error_count += 1 156 | output_strings.append( 157 | f'Number of authors in the title `{title}` is incorrect.') 158 | output_strings.append( 159 | f'The number of authors should be {new_length}, not {old_length}.') 160 | if url: 161 | output_strings.append( 162 | f'Please correct the citation by visiting this url: {url}') 163 | if old_length == new_length: 164 | already_warned = False 165 | for i in range(old_length): 166 | # If you wanna check the full name 167 | if config.whole_name: 168 | # Check if names are sanme 169 | if self.if_equal(old[i], new[i]) is False: 170 | # If not, check if we have warned them already 171 | if already_warned is False: 172 | error_count += 1 173 | output_strings.append( 174 | f'Your citation for `{title}` might have incorrect author names.') 175 | if url: 176 | output_strings.append( 177 | f'Please correct 
the citation by visiting this url: {url}') 178 | already_warned = True 179 | # If you wanna show the names 180 | if config.show_names: 181 | old_name = ' '.join(name_list[paper]['old'][i]) 182 | new_name = ' '.join(name_list[paper]['new'][i]) 183 | output_strings.append( 184 | f'The name should be {new_name} not {old_name}.') 185 | else: 186 | # If you wanna check only the first name 187 | if config.first_name: 188 | if config.initials and \ 189 | (re.search(r'^[A-Z]\.', old[i][0]) or re.search(r'^[A-Z]\.', new[i][0])): 190 | old_first_name = re.sub( 191 | r'[^A-Z]', '', old[i][0]) 192 | new_first_name = re.sub( 193 | r'[^A-Z]', '', new[i][0]) 194 | else: 195 | old_first_name = old[i][0] 196 | new_first_name = new[i][0] 197 | if self.if_equal(old_first_name, new_first_name) is False: 198 | if already_warned is False: 199 | error_count += 1 200 | output_strings.append( 201 | f'Your citation for `{title}` might have incorrect author names.') 202 | if url: 203 | output_strings.append( 204 | f'Please correct the citation by visiting this url: {url}') 205 | already_warned = True 206 | if config.show_names: 207 | old_name = ' '.join( 208 | name_list[paper]['old'][i]) 209 | new_name = ' '.join( 210 | name_list[paper]['new'][i]) 211 | first_author_id = i 212 | output_strings.append( 213 | f'The author #{first_author_id} name should be {new_name} not {old_name}.') 214 | # If you wanna check only the last name 215 | if config.last_name: 216 | if self.if_equal(old[i][-1], new[i][-1]) is False: 217 | if already_warned is False: 218 | error_count += 1 219 | output_strings.append( 220 | f'Your citation for `{title}` might have incorrect author names.') 221 | if url: 222 | output_strings.append( 223 | f'Please correct the citation by visiting this url: {url}') 224 | already_warned = True 225 | last_author_id = i 226 | if config.show_names and already_warned and (first_author_id != last_author_id): 227 | old_name = ' '.join( 228 | name_list[paper]['old'][i]) 229 | new_name = ' '.join( 230 | name_list[paper]['new'][i]) 231 | output_strings.append( 232 | f'The author #{last_author_id} name should be {new_name} not {old_name}.') 233 | 234 | if len(output_strings) > 0: 235 | warning = ' '.join(output_strings) 236 | warnings.append(' '.join(output_strings)) 237 | 238 | return warnings 239 | 240 | def execute(self, config): 241 | self.execute_curl(config) 242 | self.apply_rebiber() 243 | name_list = self.extract_names() 244 | output_strings = self.compare_changes(name_list, config) 245 | return output_strings 246 | -------------------------------------------------------------------------------- /aclpubcheck_additional_info.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/aclpubcheck_additional_info.pdf -------------------------------------------------------------------------------- /aclpubcheck_online.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "widgets": { 16 | "application/vnd.jupyter.widget-state+json": { 17 | "f6fed292d0ef485bbb83ad238dffe0b4": { 18 | "model_module": "@jupyter-widgets/controls", 19 | "model_name": "DropdownModel", 20 | "model_module_version": "1.5.0", 21 | "state": { 22 
| "_dom_classes": [], 23 | "_model_module": "@jupyter-widgets/controls", 24 | "_model_module_version": "1.5.0", 25 | "_model_name": "DropdownModel", 26 | "_options_labels": [ 27 | "long", 28 | "short", 29 | "demo" 30 | ], 31 | "_view_count": null, 32 | "_view_module": "@jupyter-widgets/controls", 33 | "_view_module_version": "1.5.0", 34 | "_view_name": "DropdownView", 35 | "description": "Paper type:", 36 | "description_tooltip": null, 37 | "disabled": false, 38 | "index": 0, 39 | "layout": "IPY_MODEL_4e6ae31e5e51498c924eea02c65f9cea", 40 | "style": "IPY_MODEL_57e8b8127118436884f916a243091017" 41 | } 42 | }, 43 | "4e6ae31e5e51498c924eea02c65f9cea": { 44 | "model_module": "@jupyter-widgets/base", 45 | "model_name": "LayoutModel", 46 | "model_module_version": "1.2.0", 47 | "state": { 48 | "_model_module": "@jupyter-widgets/base", 49 | "_model_module_version": "1.2.0", 50 | "_model_name": "LayoutModel", 51 | "_view_count": null, 52 | "_view_module": "@jupyter-widgets/base", 53 | "_view_module_version": "1.2.0", 54 | "_view_name": "LayoutView", 55 | "align_content": null, 56 | "align_items": null, 57 | "align_self": null, 58 | "border": null, 59 | "bottom": null, 60 | "display": null, 61 | "flex": null, 62 | "flex_flow": null, 63 | "grid_area": null, 64 | "grid_auto_columns": null, 65 | "grid_auto_flow": null, 66 | "grid_auto_rows": null, 67 | "grid_column": null, 68 | "grid_gap": null, 69 | "grid_row": null, 70 | "grid_template_areas": null, 71 | "grid_template_columns": null, 72 | "grid_template_rows": null, 73 | "height": null, 74 | "justify_content": null, 75 | "justify_items": null, 76 | "left": null, 77 | "margin": null, 78 | "max_height": null, 79 | "max_width": null, 80 | "min_height": null, 81 | "min_width": null, 82 | "object_fit": null, 83 | "object_position": null, 84 | "order": null, 85 | "overflow": null, 86 | "overflow_x": null, 87 | "overflow_y": null, 88 | "padding": null, 89 | "right": null, 90 | "top": null, 91 | "visibility": null, 92 | "width": null 93 | } 94 | }, 95 | "57e8b8127118436884f916a243091017": { 96 | "model_module": "@jupyter-widgets/controls", 97 | "model_name": "DescriptionStyleModel", 98 | "model_module_version": "1.5.0", 99 | "state": { 100 | "_model_module": "@jupyter-widgets/controls", 101 | "_model_module_version": "1.5.0", 102 | "_model_name": "DescriptionStyleModel", 103 | "_view_count": null, 104 | "_view_module": "@jupyter-widgets/base", 105 | "_view_module_version": "1.2.0", 106 | "_view_name": "StyleView", 107 | "description_width": "" 108 | } 109 | } 110 | } 111 | } 112 | }, 113 | "cells": [ 114 | { 115 | "cell_type": "markdown", 116 | "source": [ 117 | "#ACL Pubcheck @ colab\n", 118 | "\n", 119 | "ACL pubcheck is a Python tool that automatically detects author formatting errors, margin violations as well as many other common formatting errors in papers that are using the LaTeX sty file associated with ACL venues. The script can be used to check your papers before you submit to a conference. (We highly recommend running ACL pubcheck on your papers pre-submission—a well formatted paper helps keep the reviewers focused on the scientific content.) However, its main purpose is to ensure your accepted paper is properly formatted, i.e., it follows the venue's style guidelines. The script is used by the publication chairs at most ACL events to check for formatting issues. 
Indeed, running this script yourself and fixing errors before uploading the camera-ready version of your paper will often save you a personalized email from the publication chairs.\n", 120 | "\n", 121 | "**NOTICE**: ACL pubcheck is meant to be run on the **camera ready** version of the paper, not on the review version (e.g. anonymous, line-numbered submission version). Running ACL pubcheck on a line-numbered version will result in a stream of spurious errors related to the numbers in the margins.\n", 122 | "\n", 123 | "More info can be found at: https://github.com/acl-org/aclpubcheck/blob/main/aclpubcheck_additional_info.pdf\n", 124 | "\n", 125 | "##What do you have to do?\n", 126 | "\n", 127 | "1. Install `aclpubcheck`\n", 128 | "2. Are you checking a long or short paper?\n", 129 | "3. Upload your PDF file\n", 130 | "4. Run `aclpubcheck` and see the outcomes\n", 131 | "5. (Hopefully not required:) fix the errors and re-run the code\n", 132 | "\n", 133 | "Let's check!" 134 | ], 135 | "metadata": { 136 | "id": "_tCawJsGR6RE" 137 | } 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "source": [ 142 | "## 1. Install `aclpubcheck` and import libraries\n", 143 | "\n", 144 | "Run the code in this block to installl ACL pubcheck." 145 | ], 146 | "metadata": { 147 | "id": "Vsja4xipT4iz" 148 | } 149 | }, 150 | { 151 | "cell_type": "code", 152 | "source": [ 153 | "!pip install -q git+https://github.com/acl-org/aclpubcheck" 154 | ], 155 | "metadata": { 156 | "id": "yNEDRLWvQ3NZ" 157 | }, 158 | "execution_count": null, 159 | "outputs": [] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "source": [ 164 | "from ipywidgets import Dropdown\n", 165 | "from google.colab import files\n", 166 | "import os" 167 | ], 168 | "metadata": { 169 | "id": "2rplVJym54Y1" 170 | }, 171 | "execution_count": null, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "source": [ 177 | "##2. Are you checking a long or short paper?" 178 | ], 179 | "metadata": { 180 | "id": "0yiTvBIgTbpo" 181 | } 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "source": [ 186 | "Please select the correct paper type: Long, short or demo. This will help us check whether you have the correct paper length." 
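,
        "\n",
        "\n",
        "For reference, `formatchecker.py` allows up to 9 content pages for `long` papers, 5 for `short` and 7 for `demo`; the references/acknowledgements/ethics/limitations sections must begin no later than the first line of the page after that limit."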
187 | ], 188 | "metadata": { 189 | "id": "Pu1LRCSYahJ8" 190 | } 191 | }, 192 | { 193 | "cell_type": "code", 194 | "source": [ 195 | "# Define the list of options\n", 196 | "options = [\"long\", \"short\", \"demo\"]\n", 197 | "\n", 198 | "# Create the dropdown widget\n", 199 | "dropdown = Dropdown(\n", 200 | " options=options, value=\"long\", description=\"Paper type:\"\n", 201 | ")\n", 202 | "\n", 203 | "# Display the dropdown\n", 204 | "display(dropdown)" 205 | ], 206 | "metadata": { 207 | "id": "BePpvMG75sO2", 208 | "outputId": "1f2204e9-91f6-41a0-b2c6-b18cb6775880", 209 | "colab": { 210 | "base_uri": "https://localhost:8080/", 211 | "height": 49, 212 | "referenced_widgets": [ 213 | "f6fed292d0ef485bbb83ad238dffe0b4", 214 | "4e6ae31e5e51498c924eea02c65f9cea", 215 | "57e8b8127118436884f916a243091017" 216 | ] 217 | } 218 | }, 219 | "execution_count": null, 220 | "outputs": [ 221 | { 222 | "output_type": "display_data", 223 | "data": { 224 | "text/plain": [ 225 | "Dropdown(description='Paper type:', options=('long', 'short', 'demo'), value='long')" 226 | ], 227 | "application/vnd.jupyter.widget-view+json": { 228 | "version_major": 2, 229 | "version_minor": 0, 230 | "model_id": "f6fed292d0ef485bbb83ad238dffe0b4" 231 | } 232 | }, 233 | "metadata": {} 234 | } 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "source": [ 240 | "## 3. Upload your PDF file" 241 | ], 242 | "metadata": { 243 | "id": "9pOjnKu3Tmtm" 244 | } 245 | }, 246 | { 247 | "cell_type": "code", 248 | "source": [ 249 | "paper_type = dropdown.value\n", 250 | "uploaded = files.upload()\n", 251 | "filename = list(uploaded.keys())[0]\n", 252 | "length = len(uploaded[filename])\n", 253 | "os.rename(filename, \"paper.pdf\")" 254 | ], 255 | "metadata": { 256 | "id": "0KyiVz9gQRqa" 257 | }, 258 | "execution_count": null, 259 | "outputs": [] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "source": [ 264 | "# 4. Run `aclpubcheck` and see the outcomes\n", 265 | "\n", 266 | "Please, see the output of this code block to read the output of the analysis.\n", 267 | "\n", 268 | "**Notice**: if the tool finds any issue, it will show the problematic page(s)." 
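,
        "\n",
        "\n",
        "If issues are found, the tool also writes a JSON log (`errors-paper.json`) and, for margin problems, annotated images of the offending pages (`errors-paper-page-N.png`, where `N` is the page number) into the working directory."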
269 | ], 270 | "metadata": { 271 | "id": "eFjTQ1nWT_h3" 272 | } 273 | }, 274 | { 275 | "cell_type": "code", 276 | "source": [ 277 | "!aclpubcheck --paper_type $paper_type paper.pdf" 278 | ], 279 | "metadata": { 280 | "id": "5KmieUUJTBv7", 281 | "outputId": "fc63f552-8469-4de9-9a34-1c04f81a2f79", 282 | "colab": { 283 | "base_uri": "https://localhost:8080/" 284 | } 285 | }, 286 | "execution_count": null, 287 | "outputs": [ 288 | { 289 | "output_type": "stream", 290 | "name": "stdout", 291 | "text": [ 292 | "Checking paper.pdf\n", 293 | "\u001b[32mAll Clear!\u001b[0m\n" 294 | ] 295 | } 296 | ] 297 | } 298 | ] 299 | } -------------------------------------------------------------------------------- /example/2023.acl-tutorials.1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/example/2023.acl-tutorials.1.pdf -------------------------------------------------------------------------------- /pdf_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/pdf_image.png -------------------------------------------------------------------------------- /screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/screenshot.png -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | install_requires = [ 5 | "tqdm", 6 | "termcolor", 7 | "pandas", 8 | "pdfplumber", 9 | "rebiber<2.0.0", # 2.0 introduces breaking changes 10 | "pybtex", 11 | "pylatexenc", 12 | "setuptools", 13 | "Unidecode", 14 | "tsv" 15 | ] 16 | 17 | 18 | setup( 19 | name="aclpubcheck", 20 | install_requires=install_requires, 21 | version="0.1", 22 | scripts=[], 23 | packages=find_packages(include=["aclpubcheck*"]), 24 | entry_points = { 25 | 'console_scripts': [ 26 | "aclpubcheck=aclpubcheck.__main__:main", 27 | ], 28 | }, 29 | ) 30 | --------------------------------------------------------------------------------