├── .gitignore ├── LICENSE ├── README.md ├── aclpubcheck ├── __init__.py ├── __main__.py ├── copyright_signatures.py ├── formatchecker.py ├── googletools.py ├── metadatachecker.py └── name_check.py ├── aclpubcheck_additional_info.pdf ├── aclpubcheck_online.ipynb ├── example └── 2023.acl-tutorials.1.pdf ├── pdf_image.png ├── screenshot.png └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | aclpubcheck.egg-info 2 | .idea 3 | dist/*.egg 4 | dist/*.tar.gz 5 | build/ 6 | */__pycache__ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Association for Computational Linguistics 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ACL pubcheck 2 | ACL pubcheck is a Python tool that automatically detects font errors, author formatting errors, margin violations, outdated citations as well as many other common formatting errors in papers that are using the LaTeX sty file associated with ACL venues. The script can be used to check your papers before you submit to a conference. (We highly recommend running ACL pubcheck on your papers *pre-submission*—a well formatted paper helps keep the reviewers focused on the scientific content.) However, its main purpose is to ensure your accepted paper is properly formatted, i.e., it follows the venue's style guidelines. The script is used by the publication chairs at most ACL events to check for formatting issues. Indeed, running this script yourself and fixing errors before uploading the camera-ready version of your paper will often save you a personalized email from the publication chairs. 
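For the impatient, checking a paper boils down to two commands (shown here on the bundled example paper; both are explained in detail below):

```bash
pip3 install git+https://github.com/acl-org/aclpubcheck
aclpubcheck --paper_type long example/2023.acl-tutorials.1.pdf
```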
3 |
4 | ## Installation
5 |
6 | The simplest way to use `aclpubcheck` is to install it using `pip` directly from the GitHub repository (note that this is DIFFERENT from the `pypi` package):
7 |
8 | ```bash
9 | pip3 install git+https://github.com/acl-org/aclpubcheck
10 | ```
11 |
12 | Alternatively, you can clone the source and install it locally:
13 | ```bash
14 | # clone using ssh
15 | git clone git@github.com:acl-org/aclpubcheck.git
16 | # or http
17 | git clone https://github.com/acl-org/aclpubcheck.git
18 |
19 | cd aclpubcheck/
20 |
21 | # install locally
22 | pip install -e .
23 | ```
24 |
25 | ## Usage
26 |
27 | Once installed, you can apply it to a PDF:
28 |
29 | ```bash
30 | # Script execution
31 | aclpubcheck --paper_type PAPER_TYPE path/to/paper.pdf
32 |
33 | # Module execution (in case script execution does not work)
34 | python3 -m aclpubcheck --paper_type PAPER_TYPE path/to/paper.pdf
35 | ```
36 |
37 | Replace `PAPER_TYPE` with one of (1) `long`, (2) `short`, or (3) `demo`, depending on the type of paper that was accepted. Then, change `path/to/paper.pdf` to the path to your paper. For example:
38 |
39 | ```bash
40 | # -p is a shorthand for --paper_type
41 | python3 -m aclpubcheck -p long example/2023.acl-tutorials.1.pdf
42 | ```
43 |
44 | If ACL pubcheck reports a margin error caused by a figure that runs into the margin, you can often fix the problem by applying the [adjustbox package](https://ctan.org/pkg/adjustbox?lang=en). If the margin error is caused by an equation, it may help to break the equation over two lines.
45 |
46 | ACL pubcheck is meant to be run on the camera-ready version of the paper, not on the review version (i.e., the anonymous, line-numbered submission version). Running ACL pubcheck on a line-numbered version will result in a stream of spurious errors related to the numbers in the margins.
47 |
48 | **Note**: Additional info can be found in the PDF document ``aclpubcheck_additional_info.pdf`` included in this package.
49 |
50 | ## Page Numbering
51 |
52 | Typically, the space at the bottom of a paper should be left empty, as page numbers are added during the watermarking process of the proceedings. By default, ACL pubcheck checks that a margin of approximately 2 cm at the bottom of each page is left blank. If any text is detected in this area, such as mistakenly added page numbers, a warning is generated. However, if this area must contain information, or if you need to bypass this check for any reason, you can disable it with the `--disable_bottom_check` flag.
53 |
54 |
55 | ## Online Versions
56 |
57 | If you have trouble installing and using the Python toolkit directly, you can use:
58 | - a [**Colab notebook** to which you can directly upload your PDF and run aclpubcheck](https://colab.research.google.com/github/acl-org/aclpubcheck/blob/main/aclpubcheck_online.ipynb) without a local installation (thanks to Danilo Croce).
59 | - a **Hugging Face Space** at https://huggingface.co/spaces/teelinsan/aclpubcheck (thanks to Andrea Santilli). More info about this version can be found at https://github.com/teelinsan/aclpubcheck-gui
60 |
61 | ## Updating the names in citations
62 |
63 | ### Description
64 |
65 | Our toolkit also automatically checks your citations and warns you if you have used incorrect names or author lists. Please have a look [here](https://2021.naacl.org/blog/name-change-procedure/) to see why it is important to use updated citations.
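The checker can also be driven from Python instead of the command line. A minimal sketch (the `paper.pdf` path is a placeholder; `formatchecker` reads its two `--disable_*` flags from a module-level `args` object, which therefore has to be set when bypassing the CLI, and the citation check needs `curl` plus network access to the Scholarcy API):

```python
import argparse

from aclpubcheck import formatchecker

# store_false flags: True here means the corresponding check stays enabled
formatchecker.args = argparse.Namespace(
    disable_name_check=True, disable_bottom_check=True)

issues = formatchecker.Formatter().format_check(
    submission="paper.pdf", paper_type="long", check_references=True)
print(issues)  # dict mapping issue categories to messages; empty if no errors
```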
66 |
67 | A demo version of the PDF name checking is available [here](https://pdf-name-change-checking.herokuapp.com/).
68 |
69 | ### How it's done
70 |
71 | The bibliography from your PDF file is extracted using the [Scholarcy API](https://ref.scholarcy.com/api/). Each entry in this bib file is then updated by pulling information from the ACL Anthology, DBLP and arXiv, using fuzzy matching on the titles. After the entries are updated, the author names are compared, and a warning is issued for any mismatch.
72 |
73 | ![Procedure](pdf_image.png)
74 |
75 | ### Functionality
76 |
77 | The functions live in `aclpubcheck/name_check.py`; the class `PDFNameCheck` is used by `formatchecker.py`.
78 |
79 | ### Caveats
80 |
81 | Some of the warnings generated for citations may be spurious or inaccurate, due to parsing and indexing errors. We encourage you to double-check the citations and update them according to the latest source. If you believe that your citation is already updated and correct, please ignore those warnings. You can fix your bib files with a toolkit like [rebiber](https://github.com/yuchenlin/rebiber).
82 |
83 | ### Screenshots
84 |
85 | This is how the warnings appear for outdated names. You are directed to a URL where you can correct the citations. We do not show the name changes themselves, as that might out deadnames in the warnings.
86 |
87 | ![Screenshot](screenshot.png)
88 |
89 | ## Credits
90 | The original version of ACL pubcheck was written by Yichao Zhou, Iz Beltagy, Steven Bethard, Ryan Cotterell and Tanmoy Chakraborty in their role as publication chairs of [NAACL 2021](https://2021.naacl.org/organization/). The tool was improved by Ryan Cotterell and Danilo Croce in their role as publication chairs of [ACL 2022](https://www.2022.aclweb.org/organisers) and [NAACL 2022](https://2022.naacl.org/). Pranav A added the name-checking functions to this toolkit.
91 |
92 | ## Maintenance
93 | The tool is primarily maintained by Ryan Cotterell and Danilo Croce. More volunteers are welcome!
94 |
--------------------------------------------------------------------------------
/aclpubcheck/__init__.py:
--------------------------------------------------------------------------------
1 | # This file is needed in order for aclpubcheck/ to be treated as a Python package
--------------------------------------------------------------------------------
/aclpubcheck/__main__.py:
--------------------------------------------------------------------------------
1 | # __main__.py allows this library to be used with `python -m aclpubcheck`
2 | from .formatchecker import main
3 |
4 | if __name__ == "__main__":
5 |     main()
--------------------------------------------------------------------------------
/aclpubcheck/copyright_signatures.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import textwrap
3 | import pandas as pd
4 |
5 |
6 | def write_copyright_signatures(submissions_path):
7 |
8 |     def clean_str(value):
9 |         return '' if pd.isna(value) else value.strip()
10 |
11 |     # write all copyright signatures to a single file, noting any problems
12 |     with open("copyright-signatures.txt", "w") as output_file:
13 |         df = pd.read_csv(submissions_path, keep_default_na=False)
14 |         for index, row in df.iterrows():
15 |             submission_id = row["Submission ID"]
16 |
17 |             # NOTE: These were the names in the custom final submission form
18 |             # for NAACL 2021. Names and structure may be different depending
19 |             # on your final submission form.
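            # For reference, the columns consumed below are assumed to be named:
            #   Submission ID, Title, copyrightSig, orgName, orgAddress, jobTitle,
            #   "1: First Name", "1: Middle Name", "1: Last Name", "1: Affiliation", ...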
20 |             signature = clean_str(row["copyrightSig"])
21 |             org_name = clean_str(row["orgName"])
22 |             org_address = clean_str(row["orgAddress"])
23 |
24 |             # collect all authors and their affiliations
25 |             authors_parts = []
26 |             for i in range(1, 25):
27 |                 name_parts = [
28 |                     clean_str(row[f'{i}: {x} Name'])
29 |                     for x in ['First', 'Middle', 'Last']]
30 |                 name = ' '.join(x for x in name_parts if x)
31 |                 if name:
32 |                     affiliation = clean_str(row[f"{i}: Affiliation"])
33 |                     authors_parts.append(f'{name} ({affiliation})')
34 |             authors = '\n'.join(authors_parts)
35 |
36 |             # write out the copyright signature in the standard ACL format
37 |             indent = " " * 4
38 |             output_file.write(f"""
39 | Submission # {submission_id}
40 | Title: {row["Title"]}
41 | Authors:
42 | {textwrap.indent(authors, indent)}
43 | Signature: {signature}
44 | Your job title (if not one of the authors): {clean_str(row["jobTitle"])}
45 | Name and address of your organization:
46 | {textwrap.indent(org_name, indent)}
47 | {textwrap.indent(org_address, indent)}
48 |
49 | =================================================================
50 | """)
51 |
52 |
53 | if __name__ == "__main__":
54 |     parser = argparse.ArgumentParser()
55 |     parser.add_argument('--submissions', dest='submissions_path',
56 |                         default='Submission_Information.csv')
57 |     args = parser.parse_args()
58 |     write_copyright_signatures(**vars(args))
59 |
--------------------------------------------------------------------------------
/aclpubcheck/formatchecker.py:
--------------------------------------------------------------------------------
1 | '''
2 | python3 formatchecker.py [-h] [--paper_type {long,short,demo,other}] file_or_dir [file_or_dir ...]
3 | '''
4 |
5 | import argparse
6 | from argparse import Namespace
7 | import json
8 | from enum import Enum
9 | from collections import defaultdict
10 | from os import walk
11 | from os.path import isfile, join
12 | import pdfplumber
13 | from tqdm import tqdm
14 | from termcolor import colored
15 | import os
16 | import numpy as np
17 | import traceback
18 |
19 | from .name_check import PDFNameCheck
20 |
21 |
22 | class Error(Enum):
23 |     SIZE = "Size"
24 |     PARSING = "Parsing"
25 |     MARGIN = "Margin"
26 |     SPELLING = "Spelling"
27 |     FONT = "Font"
28 |     PAGELIMIT = "Page Limit"
29 |
30 |
31 | class Warn(Enum):
32 |     BIB = "Bibliography"
33 |
34 |
35 | class Page(Enum):
36 |     # 595 pixels (72ppi) = 21cm
37 |     WIDTH = 595
38 |     # 842 pixels (72ppi) = 29.7cm
39 |     HEIGHT = 842
40 |
41 |
42 | class Margin(Enum):
43 |     TOP = "top"
44 |     BOTTOM = "bottom"
45 |     RIGHT = "right"
46 |     LEFT = "left"
47 |
48 |
49 | class Formatter(object):
50 |
51 |     def __init__(self):
52 |         # TODO: these should be constants
53 |         self.right_offset = 4.5
54 |         self.left_offset = 2
55 |         self.top_offset = 1
56 |         self.bottom_offset = 1
57 |
58 |         # this is used to check whether an area outside the margin is a "false
59 |         # positive", i.e., an area containing only invisible symbols: when a
60 |         # candidate area outside the margin is found, it is cropped, and if all
61 |         # of its pixels equal the background color, it is skipped
62 |         self.background_color = 255
63 |         self.pdf_namecheck = PDFNameCheck()
64 |
65 |
66 |     def format_check(self, submission, paper_type, output_dir=".", print_only_errors=False, check_references=False):
67 |         """
68 |         Check the formatting of one submission; returns a dict mapping issue types to messages (empty if the paper passes).
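        Args (as used elsewhere in this repository):
            submission: path to the PDF file to check.
            paper_type: one of "short", "long", "demo" or "other".
            output_dir: directory where the JSON log and annotated page images are written.
            print_only_errors: when False, a JSON log file is written even if it is empty.
            check_references: when True, the bibliography checks are also run.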
69 | """ 70 | print(f"Checking {submission}") 71 | 72 | # TOOD: make this less of a hack 73 | self.number = submission.split("/")[-1].split("_")[0].replace(".pdf", "") 74 | self.pdf = pdfplumber.open(submission) 75 | self.logs = defaultdict(list) # reset log before calling the format-checking functions 76 | self.page_errors = set() 77 | self.pdfpath = submission 78 | 79 | # TODO: A few papers take hours to check. Consider using a timeout 80 | self.check_page_size() 81 | self.check_page_margin(output_dir) 82 | self.check_page_num(paper_type) 83 | self.check_font() 84 | 85 | if check_references: 86 | self.check_references() 87 | 88 | # TODO: put json dump back on 89 | output_file = "errors-{0}.json".format(self.number) 90 | # string conversion for json dump 91 | logs_json = {} 92 | for k, v in self.logs.items(): 93 | logs_json[str(k)] = v 94 | 95 | if self.logs: 96 | print(f"Errors. Check {output_file} for details.") 97 | 98 | errors, warnings = 0, 0 99 | if self.logs.items(): 100 | for e, ms in self.logs.items(): 101 | for m in ms: 102 | if isinstance(e, Error) and e != Error.PARSING: 103 | print(colored("Error ({0}):".format(e.value), "red")+" "+m) 104 | errors += 1 105 | elif e == Error.PARSING: 106 | print(colored("Parsing Error:".format(e.value), "yellow")+" "+m) 107 | else: 108 | print(colored("Warning ({0}):".format(e.value), "yellow")+" "+m) 109 | warnings += 1 110 | 111 | 112 | # English nominal morphology 113 | error_text = "errors" 114 | if errors == 1: 115 | error_text = "error" 116 | warning_text = "warnings" 117 | if warnings == 1: 118 | warning_text = "warning" 119 | 120 | 121 | if print_only_errors == False: 122 | json.dump(logs_json, open(os.path.join(output_dir,output_file), 'w')) # always write a log file even if it is empty 123 | 124 | # display to user 125 | print() 126 | print("We detected {0} {1} and {2} {3} in your paper.".format(*(errors, error_text, warnings, warning_text))) 127 | print("In general, it is required that you fix errors for your paper to be published. Fixing warnings is optional, but recommended.") 128 | print("Important: Some of the margin errors may be spurious. The library detects the location of images, but not whether they have a white background that blends in.") 129 | print("Important: Some of the warnings generated for citations may be spurious and inaccurate, due to parsing and indexing errors.") 130 | print("We encourage you to double check the citations and update them depending on the latest source. If you believe that your citation is updated and correct, then please ignore those warnings.") 131 | 132 | if errors >= 1: 133 | return logs_json 134 | else: 135 | return {} 136 | 137 | 138 | else: 139 | if print_only_errors == False: 140 | json.dump(logs_json, open(os.path.join(output_dir,output_file), 'w')) 141 | 142 | print(colored("All Clear!", "green")) 143 | return logs_json 144 | 145 | 146 | 147 | def check_page_size(self): 148 | """ Checks the paper size (A4) of each pages in the submission. """ 149 | 150 | pages = [] 151 | for i, page in enumerate(self.pdf.pages): 152 | 153 | if (round(page.width), round(page.height)) != (Page.WIDTH.value, Page.HEIGHT.value): 154 | pages.append(i+1) 155 | for page in pages: 156 | error = "Page #{} is not A4.".format(page) 157 | self.logs[Error.SIZE] += [error] 158 | self.page_errors.update(pages) 159 | 160 | 161 | def check_page_margin(self, output_dir): 162 | """ Checks if any text or figure is in the margin of pages. 
""" 163 | 164 | pages_image = defaultdict(list) 165 | pages_text = defaultdict(list) 166 | perror = [] 167 | for i, p in enumerate(self.pdf.pages): 168 | if i+1 in self.page_errors: 169 | continue 170 | try: 171 | # Parse images 172 | # 57 pixels (72ppi) = 2cm; 71 pixels (72ppi) = 2.5cm. 173 | for image in p.images: 174 | violation = None 175 | if int(image["bottom"]) > 0 and float(image["top"]) < (57-self.top_offset): 176 | violation = Margin.TOP 177 | elif int(image["x1"]) > 0 and float(image["x0"]) < (71-self.left_offset): 178 | violation = Margin.LEFT 179 | elif int(image["x0"]) < Page.WIDTH.value and Page.WIDTH.value-float(image["x1"]) < (71-self.right_offset): 180 | violation = Margin.RIGHT 181 | 182 | if violation: 183 | # if the image is completely white, it can be skipped 184 | 185 | # get the actual visible area 186 | x0 = max(0, int(image["x0"])) 187 | # check the intersection with the right margin to handle larger images 188 | # but with an "overflow" that is of the same color of the backgrond 189 | if violation == Margin.RIGHT: 190 | x0 = max(x0, Page.WIDTH.value - 71 + self.right_offset) 191 | 192 | x1 = min(int(image["x1"]), Page.WIDTH.value) 193 | if violation == Margin.LEFT: 194 | x1 = min(x1, 71 - self.right_offset) 195 | 196 | y0 = max(0, int(image["top"])) 197 | 198 | y1 = min(int(image["bottom"]), Page.HEIGHT.value) 199 | if violation == Margin.TOP: 200 | y1 = min(y1, 57-self.top_offset) 201 | 202 | bbox = (x0, y0, x1, y1) 203 | 204 | # avoid problems in cropping images too small 205 | if x1 - x0 <= 1 or y1 - y0 <= 1: 206 | continue 207 | 208 | # cropping the image to check if it is white 209 | # i.e., all pixels set to 255 210 | cropped_page = p.crop(bbox) 211 | try: 212 | image_obj = cropped_page.to_image(resolution=100) 213 | if np.mean(image_obj.original) != self.background_color: 214 | pages_image[i] += [(image, violation)] 215 | # if there are some errors during cropping, it is better to check 216 | except: 217 | pages_image[i] += [(image, violation)] 218 | 219 | # Parse texts 220 | for j, word in enumerate(p.extract_words(extra_attrs=["non_stroking_color", "stroking_color"])): 221 | violation = None 222 | 223 | #if word["non_stroking_color"] == (0, 0, 0) or word["non_stroking_color"] == 0 or word["stroking_color"] == 0: 224 | if word["non_stroking_color"] == (0, 0, 0) or word["non_stroking_color"] == [0]: 225 | continue 226 | 227 | if word["non_stroking_color"] is None and word["stroking_color"] is None: 228 | continue 229 | 230 | if int(word["bottom"]) > 0 and float(word["top"]) < (57-self.top_offset): 231 | violation = Margin.TOP 232 | elif int(word["x1"]) > 0 and float(word["x0"]) < (71-self.left_offset): 233 | violation = Margin.LEFT 234 | elif int(word["x0"]) < Page.WIDTH.value and Page.WIDTH.value-float(word["x1"]) < (71-self.right_offset): 235 | violation = Margin.RIGHT 236 | 237 | if violation and int(word["x0"]) < Page.WIDTH.value and int(word["x1"]) >= 0 and int(word["bottom"]) >= 0: 238 | # if the area image is completely white, it can be skipped 239 | # get the actual visible area 240 | x0 = max(0, int(word["x0"])) 241 | # check the intersection with the right margin to handle larger images 242 | # but with an "overflow" that is of the same color of the backgrond 243 | if violation == Margin.RIGHT: 244 | x0 = max(x0, Page.WIDTH.value - 71 + self.right_offset) 245 | 246 | x1 = min(int(word["x1"]), Page.WIDTH.value) 247 | if violation == Margin.LEFT: 248 | x1 = min(x1, 71 - self.right_offset) 249 | 250 | y0 = max(0, int(word["top"])) 251 | 252 | y1 = 
min(int(word["bottom"]), Page.HEIGHT.value) 253 | if violation == Margin.TOP: 254 | y1 = min(y1, 57-self.top_offset) 255 | 256 | bbox = (x0, y0, x1, y1) 257 | 258 | # avoid problems in cropping images too small 259 | if x1 - x0 <= 1 or y1 - y0 <= 1: 260 | continue 261 | 262 | # cropping the image to check if it is white 263 | # i.e., all pixels set to 255 264 | try: 265 | cropped_page = p.crop(bbox) 266 | image_obj = cropped_page.to_image(resolution=100) 267 | if np.mean(image_obj.original) != self.background_color: 268 | print("Found text violation:\t" + str(violation) + "\t" + str(word)) 269 | pages_text[i] += [(word, violation)] 270 | except: 271 | # if there are some errors during cropping, it is better to check 272 | pages_image[i] += [(word, violation)] 273 | 274 | # CHECK THE AREA BELOW THE TEXT, it should be empty as it is expected to 275 | # be populated with watermark and pages during the construction of the 276 | # proceedings 277 | if args.disable_bottom_check: 278 | bpixels = 62 279 | bbox = (0, Page.HEIGHT.value - bpixels, Page.WIDTH.value - self.bottom_offset, Page.HEIGHT.value - self.bottom_offset) 280 | word = {"top": bbox[1], "bottom": bbox[3]} 281 | 282 | # cropping the image to check if it is white 283 | # i.e., all pixels set to 255 284 | try: 285 | cropped_page = p.crop(bbox) 286 | image_obj = cropped_page.to_image(resolution=100) 287 | if np.mean(image_obj.original) != self.background_color: 288 | print("Found text violation:\t" + str(Margin.BOTTOM) + "\t" + str(word)) 289 | pages_text[i] += [(word, Margin.BOTTOM)] 290 | except: 291 | # if there are some errors during cropping, it is better to check 292 | pages_image[i] += [(word, Margin.BOTTOM)] 293 | traceback.print_exc() 294 | 295 | except: 296 | traceback.print_exc() 297 | perror.append(i+1) 298 | 299 | if perror: 300 | self.page_errors.update(perror) 301 | self.logs[Error.PARSING] = ["Error occurs when parsing page {}.".format(perror)] 302 | 303 | if pages_text or pages_image: 304 | pages = sorted(set(pages_text.keys()).union(set((pages_image.keys())))) 305 | for page in pages: 306 | im = self.pdf.pages[page].to_image(resolution=150) 307 | for (word, violation) in pages_text[page]: 308 | 309 | bbox = None 310 | if violation == Margin.RIGHT: 311 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the right margin.".format(page+1)] 312 | bbox = (Page.WIDTH.value-80, int(word["top"]-20), Page.WIDTH.value-20, int(word["bottom"]+20)) 313 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 314 | elif violation == Margin.LEFT: 315 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the left margin.".format(page+1)] 316 | bbox = (20, int(word["top"]-20), 80, int(word["bottom"]+20)) 317 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 318 | elif violation == Margin.TOP: 319 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the top margin.".format(page+1)] 320 | bbox = (20, int(word["top"]-20), 80, int(word["bottom"]+20)) 321 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 322 | elif violation == Margin.BOTTOM: 323 | self.logs[Error.MARGIN] += ["Text on page {} bleeds into the bottom margin. 
It should be empty (e.g., without page number) and populated when building the proceedings.".format(page+1)] 324 | bbox = (0, int(word["top"]), Page.WIDTH.value, int(word["bottom"])) 325 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 326 | else: 327 | # TODO: add bottom margin violations 328 | pass 329 | 330 | 331 | for (image, violation) in pages_image[page]: 332 | 333 | self.logs[Error.MARGIN] += ["An image on page {} bleeds into the margin.".format(page+1)] 334 | bbox = (image["x0"], image["top"], image["x1"], image["bottom"]) 335 | im.draw_rect(bbox, fill=None, stroke="red", stroke_width=5) 336 | 337 | png_file_name = "errors-{0}-page-{1}.png".format(*(self.number, page+1)) 338 | im.save(os.path.join(output_dir, png_file_name), format="PNG") 339 | #+ "Specific text: "+str([v for k, v in pages_text.values()])] 340 | 341 | 342 | def check_page_num(self, paper_type): 343 | """Check if the paper exceeds the page limit.""" 344 | 345 | # TODO: Enable uploading a paper_type file to include all papers' types. 346 | 347 | # thresholds for different types of papers 348 | standards = {"short": 5, "long": 9, "demo": 7, "other": float("inf")} 349 | page_threshold = standards[paper_type.lower()] 350 | candidates = {"References", "Acknowledgments", "Acknowledgement", "Acknowledgment", "EthicsStatement", "EthicalConsiderations", "Ethicalconsiderations", "BroaderImpact", "EthicalConcerns", "EthicalStatement", "EthicalDeclaration", "Limitations", "Limitation"} 351 | #acks = {"Acknowledgment", "Acknowledgement"} 352 | 353 | # Find (references, acknowledgements, ethics). 354 | marker = None 355 | if len(self.pdf.pages) <= page_threshold: 356 | return 357 | 358 | for i, page in enumerate(self.pdf.pages): 359 | if i+1 in self.page_errors: 360 | continue 361 | text = page.extract_text().split('\n') 362 | for j, line in enumerate(text): 363 | if marker is None and any(x in line for x in candidates): 364 | marker = (i+1, j+1) 365 | #if "Acknowl" in line and all(x not in line for x in acks): 366 | # self.logs[Error.SPELLING] = ["'Acknowledgments' was misspelled."] 367 | 368 | # if the first marker appears after the first line of page 10, 369 | # there is high probability the paper exceeds the page limit. 370 | 371 | # If we reached this state and that marker is still None it means all pages already have errors 372 | # We can return here to print the already existing errors 373 | if marker is None: 374 | return 375 | 376 | if marker > (page_threshold + 1, 1): 377 | page, line = marker 378 | self.logs[Error.PAGELIMIT] = [f"Paper exceeds the page limit " 379 | f"because first (References, " 380 | f"Acknowledgments, Ethics Statement) was found on " 381 | f"page {page}, line {line}."] 382 | 383 | 384 | def check_font(self): 385 | """ Checks the fonts. 
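        The most frequent font must be one of the Times-like fonts required by
        the ACL style (see correct_fontnames below) and must cover at least 35%
        of the characters in the paper.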
""" 386 | 387 | correct_fontnames = set(["NimbusRomNo9L-Regu", 388 | "TeXGyreTermesX-Regular", 389 | "TimesNewRomanPSMT", 390 | "ICWANT+STIXGeneral-Regular", 391 | "ICZIZQ+Inconsolatazi4-Regular" 392 | ]) 393 | 394 | fonts = defaultdict(int) 395 | for i, page in enumerate(self.pdf.pages): 396 | try: 397 | for char in page.chars: 398 | fonts[char['fontname']] += 1 399 | except: 400 | self.logs[Error.FONT] += [f"Can't parse page #{i+1}"] 401 | 402 | max_font_count, max_font_name = max((count, name) for name, count in fonts.items()) # find most used font 403 | sum_char_count = sum(fonts.values()) 404 | 405 | # TODO: make this a command line argument 406 | if max_font_count / sum_char_count < 0.35: # the most used font should be used more than 35% of the time 407 | self.logs[Error.FONT] += ["Can't find the main font"] 408 | 409 | if not any([max_font_name.endswith(correct_fontname) for correct_fontname in correct_fontnames]): # the most used font should be `correct_fontname` 410 | self.logs[Error.FONT] += [f"Wrong font. The main font used is {max_font_name} when it should a font in {correct_fontnames}."] 411 | 412 | def make_name_check_config(self): 413 | """Configure the name checking parameters""" 414 | 415 | config_dict = { 416 | 'file': self.pdfpath, 417 | 'show_names': False, # Show how the name is changed 418 | 'whole_name': False, # Consider the whole name changes 419 | 'first_name': True, # Consider only first name changes 420 | 'last_name': True, # Consider only last name changes 421 | 'ref_string': 'References', # How the bibilography starts 422 | 'mode': 'ensemble', # The mode for scholarcy, ensemble worked the best for ACL papers 423 | 'initials': True # Allow abbreviating first names to initials only. 424 | } 425 | 426 | return Namespace(**config_dict) 427 | 428 | 429 | def check_references(self): 430 | """ Check that citations have URLs, and that they have venues (not just arXiv ids). """ 431 | 432 | found_references = False 433 | arxiv_word_count = 0 434 | doi_url_count = 0 435 | arxiv_url_count = 0 436 | all_url_count = 0 437 | 438 | for i, page in enumerate(self.pdf.pages): 439 | try: 440 | page_text = page.extract_text() 441 | except: 442 | page_text = "" 443 | self.logs[Warn.BIB] += [f"Can't parse page #{i+1}"] 444 | 445 | lines = page_text.split('\n') 446 | for j, line in enumerate(lines): 447 | if "References" in line: 448 | found_references = True 449 | break 450 | if found_references: 451 | arxiv_word_count += page_text.lower().count('arxiv') 452 | urls = [h['uri'] for h in page.hyperlinks] 453 | urls = set(urls) # When link text spans more than one line, it returns the same url multiple times 454 | for url in urls: 455 | if 'doi.org' in url: 456 | doi_url_count += 1 457 | elif 'arxiv.org' in url: 458 | arxiv_url_count += 1 459 | all_url_count += 1 460 | 461 | # The following checks fail in ~60% of the papers. TODO: relax them a bit 462 | 463 | if args.disable_name_check: 464 | config = self.make_name_check_config() 465 | output_strings = self.pdf_namecheck.execute(config) 466 | self.logs[Warn.BIB] += output_strings 467 | 468 | if doi_url_count < 3: 469 | self.logs[Warn.BIB] += [f"Bibliography should use ACL Anthology DOIs whenever possible. Only {doi_url_count} references do."] 470 | 471 | if arxiv_url_count > 0.2 * all_url_count: # only 20% of the links are allowed to be arXiv links 472 | self.logs[Warn.BIB] += [f"It appears you are using arXiv links more than you should ({arxiv_url_count}/{all_url_count}). 
Consider using ACL Anthology DOIs instead."]
473 |
474 |         if all_url_count < 5:
475 |             self.logs[Warn.BIB] += [f"It appears most of the references are not using paper links. Only {all_url_count} links found."]
476 |
477 |         if arxiv_word_count > 10:
478 |             self.logs[Warn.BIB] += [f"It appears you are using arXiv references more than you should ({arxiv_word_count} found). Consider using ACL Anthology references instead."]
479 |
480 |         if not found_references:
481 |             self.logs[Warn.BIB] += ["Couldn't find any references."]
482 |
483 |
484 | args = None
485 | def worker(pdf_path, paper_type):
486 |     """ Process one PDF. """
487 |     return Formatter().format_check(submission=pdf_path, paper_type=paper_type)
488 |
489 |
490 | def main():
491 |     global args
492 |     parser = argparse.ArgumentParser()
493 |     parser.add_argument('submission_paths', metavar='file_or_dir', nargs='+',
494 |                         default=[])
495 |     parser.add_argument('-p', '--paper_type', choices={"short", "long", "demo", "other"},
496 |                         default='long', help="type of the paper; used by the page-limit check")
497 |     parser.add_argument('--num_workers', type=int, default=1)
498 |     parser.add_argument('--disable_name_check', action='store_false', help="disable the citation name check (enabled by default)")
499 |     parser.add_argument('--disable_bottom_check', action='store_false', help="disable the empty-bottom-margin check (enabled by default)")
500 |
501 |
502 |     args = parser.parse_args()
503 |
504 |
505 |     # retrieve file paths
506 |     paths = {join(root, file_name)
507 |              for path in args.submission_paths
508 |              for root, _, file_names in walk(path)
509 |              for file_name in file_names}
510 |     paths.update(args.submission_paths)
511 |
512 |     # retrieve files
513 |     fileset = sorted([p for p in paths if isfile(p) and p.endswith(".pdf")])
514 |
515 |     if not fileset:
516 |         print(f"No PDF files found in {paths}")
517 |
518 |     if args.num_workers > 1:
519 |         from functools import partial
520 |         from multiprocessing.pool import Pool
521 |         with Pool(args.num_workers) as p:
522 |             # bind paper_type, since imap only supplies the pdf path
523 |             list(tqdm(p.imap(partial(worker, paper_type=args.paper_type), fileset), total=len(fileset)))
524 |     else:
525 |         # TODO: make the tqdm togglable
526 |         #for submission in tqdm(fileset):
527 |         for submission in fileset:
528 |             worker(submission, args.paper_type)
529 |
530 | if __name__ == "__main__":
531 |     main()
532 |
--------------------------------------------------------------------------------
/aclpubcheck/googletools.py:
--------------------------------------------------------------------------------
1 | import os.path
2 |
3 | import google_auth_oauthlib.flow
4 | import google.auth.transport.requests
5 | import google.oauth2.credentials
6 | import googleapiclient.discovery
7 |
8 |
9 | def sheets_service():
10 |     """Loads credentials and opens a Google Sheets API client.
11 |
12 |     A credentials.json file should be in the current directory.
13 |     https://developers.google.com/workspace/guides/create-credentials
14 |     A token.json file will be written to the current directory to avoid
15 |     repeatedly asking the user to log in.
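    Assumes the google-auth, google-auth-oauthlib and google-api-python-client
    packages are installed; note that setup.py does not list them, so they must
    be installed separately before using this module.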
16 | 17 | :return: the Google Sheets API client 18 | """ 19 | scopes = ['https://www.googleapis.com/auth/spreadsheets'] 20 | creds = None 21 | if os.path.exists('token.json'): 22 | creds = google.oauth2.credentials.Credentials.from_authorized_user_file( 23 | 'token.json', scopes) 24 | if not creds or not creds.valid: 25 | if creds and creds.expired and creds.refresh_token: 26 | creds.refresh(google.auth.transport.requests.Request()) 27 | else: 28 | iaf = google_auth_oauthlib.flow.InstalledAppFlow 29 | flow = iaf.from_client_secrets_file('credentials.json', scopes) 30 | creds = flow.run_local_server(port=0) 31 | with open('token.json', 'w') as token: 32 | token.write(creds.to_json()) 33 | return googleapiclient.discovery.build('sheets', 'v4', credentials=creds) 34 | -------------------------------------------------------------------------------- /aclpubcheck/metadatachecker.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import collections 3 | import itertools 4 | import os 5 | import os.path 6 | import regex as re 7 | import unicodedata 8 | import textwrap 9 | 10 | import pandas as pd 11 | import pdfplumber 12 | import unidecode 13 | 14 | from . import googletools 15 | 16 | 17 | def _clean_str(value): 18 | if pd.isna(value): 19 | return '' 20 | # uncurl all quotes 21 | value = re.sub(r'[\u2018\u2019]', "'", value) 22 | value = re.sub(r'[\u201C\u201D]', '"', value) 23 | # use simple dashes 24 | value = re.sub(r'[\u2013\u2014]', "-", value) 25 | # not exactly sure why, but this has to be done iteratively 26 | old_value = None 27 | value = value.strip() 28 | while old_value != value: 29 | old_value = value 30 | # strip space before accent; PDF seems to introduce these 31 | value = re.sub(r'\p{Zs}+(\p{Mn})', r'\1', value) 32 | # combine accents with characters 33 | value = unicodedata.normalize('NFKC', value) 34 | return value 35 | 36 | 37 | def yield_author_problems(names, text): 38 | # check for author names in the expected order, allowing for 39 | # punctuation, affiliations, etc. 
between names 40 | # NOTE: only removed or re-ordered (not added) authors will be caught 41 | match = re.search('.*?'.join(names), text, re.DOTALL) 42 | if not match: 43 | 44 | # check if there is a match when ignoring case, punctuation, accents 45 | # since this is the most common type of error 46 | allowed_chars = r'[\p{Zs}\p{p}\p{Mn}]' 47 | match_ignoring_case_punct_accent = re.search( 48 | '.*?'.join( 49 | fr'{allowed_chars}*'.join(unidecode.unidecode(c) for c in p) 50 | for part in names for p in re.split(allowed_chars, part)), 51 | unidecode.unidecode(text), 52 | re.DOTALL | re.IGNORECASE) 53 | if match_ignoring_case_punct_accent: 54 | problem = 'AUTHOR-MISMATCH-CASE-PUNCT-ACCENT' 55 | # these offsets may be slightly incorrect because unidecode may 56 | # change the number of characters, but it should be close enough 57 | start, end = match_ignoring_case_punct_accent.span() 58 | in_text = text[start: end] 59 | else: 60 | problem = 'AUTHOR-MISMATCH' 61 | in_text = text 62 | yield problem, f"meta=\"{' '.join(names)}\"\npdf =\"{in_text}\"" 63 | 64 | 65 | def yield_title_problems(title, text): 66 | # ignore spaces and some LaTeX-isms 67 | title_chars = re.sub(r'[\s{}$^]', '', title.replace('--', '-')) 68 | title_regex = r'\s*'.join(re.escape(c) for c in title_chars) 69 | 70 | # ignore differences in case; LaTeX \sc comes out as caps in PDF 71 | match = re.search(title_regex, text, re.IGNORECASE) 72 | if not match: 73 | yield 'TITLE', f"meta=\"{title}\"\npdf =\"{text}\"" 74 | 75 | 76 | def yield_copyright_problems(signature, org_name, org_address): 77 | if not signature: 78 | yield "COPYRIGHT", "The signature is missing." 79 | elif signature == "NA": 80 | yield "COPYRIGHT", f'The signature "{signature}" must be accompanied ' \ 81 | f'by a "License to Publish" or equivalent.' 82 | elif len(signature) < 3 or len(signature.split()) < 2: 83 | yield "COPYRIGHT", f'The signature "{signature}" does not appear to ' \ 84 | f'be a full name.' 85 | if not org_name: 86 | yield "COPYRIGHT", "The organization name is missing." 87 | elif len(org_name) < 5 and org_name not in {'IBM'}: 88 | yield "COPYRIGHT", f'The organization name "{org_name}" does not ' \ 89 | f'appear to be a full name. ' 90 | if not org_address: 91 | yield "COPYRIGHT", "The organization address is missing." 92 | elif len(org_address) < 3 or len(org_address.split()) < 2: 93 | org_address_simple = org_address.replace("\n", " ") 94 | yield "COPYRIGHT", f'The organization address "{org_address_simple}" ' \ 95 | f'does not appear to be a complete physical address.' 96 | 97 | 98 | def check_metadata( 99 | submissions_path, 100 | pdfs_dir, 101 | spreadsheet_id, 102 | sheet_id, 103 | id_column, 104 | problem_column, 105 | post=False): 106 | 107 | # map submission IDs to PDF paths 108 | id_to_pdf = {} 109 | for root, _, filenames in os.walk(pdfs_dir): 110 | for filename in filenames: 111 | if filename.endswith("_Paper.pdf"): 112 | submission_id, _ = filename.split("_", 1) 113 | id_to_pdf[int(submission_id)] = os.path.join(root, filename) 114 | 115 | id_to_sheet_row = {} 116 | problems = collections.defaultdict(lambda: collections.defaultdict(list)) 117 | 118 | df = pd.read_csv(submissions_path, keep_default_na=False) 119 | for index, row in df.iterrows(): 120 | submission_id = row["Submission ID"] 121 | title = _clean_str(row["Title"]) 122 | 123 | # NOTE: These were the names in the custom final submission form 124 | # for NAACL 2021. Names and structure may be different depending 125 | # on your final submission form. 
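            # (empty or implausibly short values here are flagged later by
            # yield_copyright_problems)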
126 | signature = _clean_str(row["copyrightSig"]) 127 | org_name = _clean_str(row["orgName"]) 128 | org_address = _clean_str(row["orgAddress"]) 129 | 130 | # row in the spreadsheet is 1-based and first row is the header 131 | id_to_sheet_row[submission_id] = index + 2 132 | 133 | # open the PDF 134 | pdf_path = id_to_pdf[submission_id] 135 | pdf = pdfplumber.open(pdf_path) 136 | 137 | # assumes metadata can be found in the first 500 characters 138 | text = _clean_str(pdf.pages[0].extract_text()[:500]) 139 | 140 | # collect all authors and their affiliations 141 | names = [] 142 | for i in range(1, 25): 143 | for x in ['First', 'Middle', 'Last']: 144 | name_part = _clean_str(row[f'{i}: {x} Name']) 145 | if name_part: 146 | names.extend(name_part.split()) 147 | 148 | # collect all problems 149 | for problem_type, problem_text in itertools.chain( 150 | yield_author_problems(names, text), 151 | yield_title_problems(title, text), 152 | yield_copyright_problems(signature, org_name, org_address)): 153 | problems[submission_id][problem_type].append(problem_text) 154 | 155 | # print all problems, grouped by type of problem 156 | for submission_id in sorted(problems): 157 | for problem_type in sorted(problems[submission_id]): 158 | problem_text = '\n'.join(problems[submission_id][problem_type]) 159 | problem_text = textwrap.indent(problem_text, ' ') 160 | print(f'{submission_id}:{problem_type}:\n{problem_text}\n') 161 | 162 | # report overall problem statistics 163 | print(f"{len(problems)} submissions failed:") 164 | problem_counts = collections.Counter( 165 | problem_type 166 | for type_texts in problems.values() 167 | for problem_type in type_texts.keys() 168 | ) 169 | for problem_type in sorted(problem_counts.keys()): 170 | print(f" {problem_counts[problem_type]} {problem_type}") 171 | 172 | # if requested, post problems to the Google Sheet 173 | if post: 174 | values = googletools.sheets_service().spreadsheets().values() 175 | 176 | # get the number of rows 177 | id_range = f'{sheet_id}!{id_column}2:{id_column}' 178 | request = values.get(spreadsheetId=spreadsheet_id, range=id_range) 179 | submission_ids = {int(value) for [value] in request.execute()['values']} 180 | if submission_ids != id_to_sheet_row.keys(): 181 | raise ValueError(f'in Google sheet only: ' 182 | f'{submission_ids - id_to_sheet_row.keys()}; ' 183 | f'in START sheet only: ' 184 | f'{id_to_sheet_row.keys() - submission_ids}') 185 | n_rows = len(submission_ids) + 1 186 | 187 | sheet_row_to_problems = collections.defaultdict(list) 188 | for submission_id, type_texts in problems.items(): 189 | for problem_type, texts in type_texts.items(): 190 | problems = '\n'.join(texts) 191 | sheet_row_to_problems[id_to_sheet_row[submission_id]].append( 192 | f'{problem_type}:\n{problems}') 193 | 194 | # fill in the problem column 195 | request = values.update( 196 | spreadsheetId=spreadsheet_id, 197 | range=f'{sheet_id}!{problem_column}2:{problem_column}', 198 | valueInputOption='RAW', 199 | body={'values': [['\n'.join(sheet_row_to_problems.get(i, []))] 200 | for i in range(2, n_rows)]}) 201 | request.execute() 202 | 203 | 204 | if __name__ == "__main__": 205 | parser = argparse.ArgumentParser() 206 | parser.add_argument('--submissions', dest='submissions_path', 207 | default='Submission_Information.csv') 208 | parser.add_argument('--pdfs', dest='pdfs_dir', default='final') 209 | parser.add_argument('--post', action='store_true') 210 | parser.add_argument('--spreadsheet-id', 211 | default='1lQyGZNBEBwukf8-mgPzIH57xUX9y4o2OUCzpEvNpW9A') 212 | 
parser.add_argument('--sheet-id', default='Sheet1') 213 | parser.add_argument('--id-column', default='A') 214 | parser.add_argument('--problem-column', default='E') 215 | args = parser.parse_args() 216 | check_metadata(**vars(args)) 217 | -------------------------------------------------------------------------------- /aclpubcheck/name_check.py: -------------------------------------------------------------------------------- 1 | import os 2 | import rebiber 3 | from pylatexenc.latex2text import LatexNodes2Text 4 | from pybtex.database import parse_file 5 | import contextlib 6 | from unidecode import unidecode 7 | import re 8 | 9 | 10 | class PDFNameCheck: 11 | 12 | def __init__(self): 13 | # Generate and update the bib list from various conferences 14 | filepath = os.path.abspath(rebiber.__file__).replace("__init__.py", "") 15 | bib_list_path = os.path.join(filepath, "bib_list.txt") 16 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 17 | self.bib_db = rebiber.construct_bib_db( 18 | bib_list_path, start_dir=filepath) 19 | 20 | def execute_curl(self, config): 21 | # The curl string to convert the PDF to bib. 22 | # I have used scholarcy API here. 23 | # See the link here: https://ref.scholarcy.com/api/ 24 | # I used the POST curl for download 25 | 26 | self.filename = config.file.split('.')[0] 27 | temp_name = self.filename.split('/')[-1] 28 | os.makedirs('temp', exist_ok=True) 29 | 30 | curl_string = 'curl --silent -X \'POST\'' \ 31 | ' \'https://ref.scholarcy.com/api/references/download\'' \ 32 | ' -H \'accept: application/json\'' \ 33 | ' -H \'Authorization: Bearer \'' \ 34 | ' -H \'Content-Type: multipart/form-data\'' \ 35 | f' -F \'file=@{config.file};type=application/pdf\'' \ 36 | ' -F \'document_type=full_paper\'' \ 37 | f' -F \'references={config.ref_string}\'' \ 38 | f' -F \'reference_style={config.mode}\'' \ 39 | ' -F \'reference_format=bibtex\'' \ 40 | ' -F \'parser=v2\'' \ 41 | f' -F \'engine=v1\' > temp/before-rebiber-{temp_name}.bib' 42 | 43 | # Execute that curl string 44 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 45 | os.system(curl_string) 46 | 47 | def apply_rebiber(self): 48 | # The curl string generates a bib file called 'before rebiber' 49 | # Pass it to rebiber 50 | temp_name = self.filename.split('/')[-1] 51 | all_bib_entries = rebiber.load_bib_file(f'temp/before-rebiber-{temp_name}.bib') 52 | 53 | # Update the bib file using rebiber and call it 'after rebiber' 54 | with open(os.devnull, "w") as f, contextlib.redirect_stdout(f): 55 | rebiber.normalize_bib( 56 | self.bib_db, all_bib_entries, f'temp/after-rebiber-{temp_name}.bib') 57 | 58 | def extract_names(self): 59 | # Parse both bib files 60 | temp_name = self.filename.split('/')[-1] 61 | old_bib_data = parse_file(f'temp/before-rebiber-{temp_name}.bib') 62 | new_bib_data = parse_file(f'temp/after-rebiber-{temp_name}.bib') 63 | 64 | name_list = {} 65 | 66 | paper_keys = list(new_bib_data.entries.keys()) 67 | 68 | # Here old means before updating 69 | # Here new means after updating 70 | # We will collect the author names before and after the bib updates 71 | for paper in paper_keys: 72 | old_paper_authors = [] 73 | new_paper_authors = [] 74 | if 'author' in old_bib_data.entries[paper].persons: 75 | old_key = old_bib_data.entries[paper].persons['author'] 76 | new_key = new_bib_data.entries[paper].persons['author'] 77 | old_length = len(old_bib_data.entries[paper].persons['author']) 78 | new_length = len(new_bib_data.entries[paper].persons['author']) 79 | additional = False 80 | 
for i in range(new_length): 81 | if i < old_length: 82 | # Bugfix: Sometimes names with dots are being parsed as full names 83 | if ' '.join(old_key[i].bibtex_first_names).replace('.', '') == \ 84 | ' '.join(new_key[i].bibtex_first_names + new_key[i].last_names): 85 | old_key[i] = new_key[i] 86 | 87 | old_name = old_key[i].bibtex_first_names + \ 88 | old_key[i].last_names 89 | new_name = new_key[i].bibtex_first_names + \ 90 | new_key[i].last_names 91 | 92 | # Bugfix: Sometimes there are two names in a name 93 | if old_key[i].last_names == new_key[i].bibtex_first_names + new_key[i].last_names: 94 | additional = i 95 | new_name = [LatexNodes2Text().latex_to_text(name) 96 | for name in new_name] 97 | old_paper_authors.append(old_name) 98 | new_paper_authors.append(new_name) 99 | else: 100 | # Sometimes authors tend to cite only n authors 101 | new_name = new_key[i].first_names + \ 102 | new_key[i].last_names 103 | new_paper_authors.append(new_name) 104 | 105 | # Bugfix: Sometimes there are two names in a name 106 | if additional: 107 | old_paper_authors[additional] = new_paper_authors[additional] 108 | if additional+1 < len(new_paper_authors): 109 | old_paper_authors.insert( 110 | additional+1, new_paper_authors[additional+1]) 111 | name_list[paper] = {} 112 | name_list[paper]['old'] = old_paper_authors 113 | name_list[paper]['new'] = new_paper_authors 114 | name_list[paper]['title'] = LatexNodes2Text().latex_to_text( 115 | new_bib_data.entries[paper].fields['title']) 116 | 117 | if 'url' in new_bib_data.entries[paper].fields: 118 | name_list[paper]['url'] = new_bib_data.entries[paper].fields['url'] 119 | 120 | return name_list 121 | 122 | def if_equal(self, string_a, string_b): 123 | ''' 124 | Do a basic cleanup to tell whether the names are same or not 125 | ''' 126 | # remove spaces and lowercase 127 | string_a = ('').join(string_a).lower() 128 | string_b = ('').join(string_b).lower() 129 | # remove punctuations 130 | string_a = re.sub(r'\W+', '', string_a) 131 | string_b = re.sub(r'\W+', '', string_b) 132 | # remove accents 133 | string_a = unidecode(string_a) 134 | string_b = unidecode(string_b) 135 | return string_a == string_b 136 | 137 | def compare_changes(self, name_list, config): 138 | 139 | warnings = [] 140 | error_count = 1 141 | 142 | for paper in name_list: 143 | output_strings = [] 144 | old = name_list[paper]['old'] 145 | new = name_list[paper]['new'] 146 | title = name_list[paper]['title'] 147 | if 'url' in name_list[paper]: 148 | url = name_list[paper]['url'] 149 | else: 150 | url = '' 151 | old_length = len(old) 152 | new_length = len(new) 153 | # Citation error: Cites do not contain every author 154 | if old_length != new_length: 155 | error_count += 1 156 | output_strings.append( 157 | f'Number of authors in the title `{title}` is incorrect.') 158 | output_strings.append( 159 | f'The number of authors should be {new_length}, not {old_length}.') 160 | if url: 161 | output_strings.append( 162 | f'Please correct the citation by visiting this url: {url}') 163 | if old_length == new_length: 164 | already_warned = False 165 | for i in range(old_length): 166 | # If you wanna check the full name 167 | if config.whole_name: 168 | # Check if names are sanme 169 | if self.if_equal(old[i], new[i]) is False: 170 | # If not, check if we have warned them already 171 | if already_warned is False: 172 | error_count += 1 173 | output_strings.append( 174 | f'Your citation for `{title}` might have incorrect author names.') 175 | if url: 176 | output_strings.append( 177 | f'Please correct 
the citation by visiting this url: {url}') 178 | already_warned = True 179 | # If you wanna show the names 180 | if config.show_names: 181 | old_name = ' '.join(name_list[paper]['old'][i]) 182 | new_name = ' '.join(name_list[paper]['new'][i]) 183 | output_strings.append( 184 | f'The name should be {new_name} not {old_name}.') 185 | else: 186 | # If you wanna check only the first name 187 | if config.first_name: 188 | if config.initials and \ 189 | (re.search(r'^[A-Z]\.', old[i][0]) or re.search(r'^[A-Z]\.', new[i][0])): 190 | old_first_name = re.sub( 191 | r'[^A-Z]', '', old[i][0]) 192 | new_first_name = re.sub( 193 | r'[^A-Z]', '', new[i][0]) 194 | else: 195 | old_first_name = old[i][0] 196 | new_first_name = new[i][0] 197 | if self.if_equal(old_first_name, new_first_name) is False: 198 | if already_warned is False: 199 | error_count += 1 200 | output_strings.append( 201 | f'Your citation for `{title}` might have incorrect author names.') 202 | if url: 203 | output_strings.append( 204 | f'Please correct the citation by visiting this url: {url}') 205 | already_warned = True 206 | if config.show_names: 207 | old_name = ' '.join( 208 | name_list[paper]['old'][i]) 209 | new_name = ' '.join( 210 | name_list[paper]['new'][i]) 211 | first_author_id = i 212 | output_strings.append( 213 | f'The author #{first_author_id} name should be {new_name} not {old_name}.') 214 | # If you wanna check only the last name 215 | if config.last_name: 216 | if self.if_equal(old[i][-1], new[i][-1]) is False: 217 | if already_warned is False: 218 | error_count += 1 219 | output_strings.append( 220 | f'Your citation for `{title}` might have incorrect author names.') 221 | if url: 222 | output_strings.append( 223 | f'Please correct the citation by visiting this url: {url}') 224 | already_warned = True 225 | last_author_id = i 226 | if config.show_names and already_warned and (first_author_id != last_author_id): 227 | old_name = ' '.join( 228 | name_list[paper]['old'][i]) 229 | new_name = ' '.join( 230 | name_list[paper]['new'][i]) 231 | output_strings.append( 232 | f'The author #{last_author_id} name should be {new_name} not {old_name}.') 233 | 234 | if len(output_strings) > 0: 235 | warning = ' '.join(output_strings) 236 | warnings.append(' '.join(output_strings)) 237 | 238 | return warnings 239 | 240 | def execute(self, config): 241 | self.execute_curl(config) 242 | self.apply_rebiber() 243 | name_list = self.extract_names() 244 | output_strings = self.compare_changes(name_list, config) 245 | return output_strings 246 | -------------------------------------------------------------------------------- /aclpubcheck_additional_info.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/aclpubcheck_additional_info.pdf -------------------------------------------------------------------------------- /aclpubcheck_online.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | }, 15 | "widgets": { 16 | "application/vnd.jupyter.widget-state+json": { 17 | "f6fed292d0ef485bbb83ad238dffe0b4": { 18 | "model_module": "@jupyter-widgets/controls", 19 | "model_name": "DropdownModel", 20 | "model_module_version": "1.5.0", 21 | "state": { 22 
| "_dom_classes": [], 23 | "_model_module": "@jupyter-widgets/controls", 24 | "_model_module_version": "1.5.0", 25 | "_model_name": "DropdownModel", 26 | "_options_labels": [ 27 | "long", 28 | "short", 29 | "demo" 30 | ], 31 | "_view_count": null, 32 | "_view_module": "@jupyter-widgets/controls", 33 | "_view_module_version": "1.5.0", 34 | "_view_name": "DropdownView", 35 | "description": "Paper type:", 36 | "description_tooltip": null, 37 | "disabled": false, 38 | "index": 0, 39 | "layout": "IPY_MODEL_4e6ae31e5e51498c924eea02c65f9cea", 40 | "style": "IPY_MODEL_57e8b8127118436884f916a243091017" 41 | } 42 | }, 43 | "4e6ae31e5e51498c924eea02c65f9cea": { 44 | "model_module": "@jupyter-widgets/base", 45 | "model_name": "LayoutModel", 46 | "model_module_version": "1.2.0", 47 | "state": { 48 | "_model_module": "@jupyter-widgets/base", 49 | "_model_module_version": "1.2.0", 50 | "_model_name": "LayoutModel", 51 | "_view_count": null, 52 | "_view_module": "@jupyter-widgets/base", 53 | "_view_module_version": "1.2.0", 54 | "_view_name": "LayoutView", 55 | "align_content": null, 56 | "align_items": null, 57 | "align_self": null, 58 | "border": null, 59 | "bottom": null, 60 | "display": null, 61 | "flex": null, 62 | "flex_flow": null, 63 | "grid_area": null, 64 | "grid_auto_columns": null, 65 | "grid_auto_flow": null, 66 | "grid_auto_rows": null, 67 | "grid_column": null, 68 | "grid_gap": null, 69 | "grid_row": null, 70 | "grid_template_areas": null, 71 | "grid_template_columns": null, 72 | "grid_template_rows": null, 73 | "height": null, 74 | "justify_content": null, 75 | "justify_items": null, 76 | "left": null, 77 | "margin": null, 78 | "max_height": null, 79 | "max_width": null, 80 | "min_height": null, 81 | "min_width": null, 82 | "object_fit": null, 83 | "object_position": null, 84 | "order": null, 85 | "overflow": null, 86 | "overflow_x": null, 87 | "overflow_y": null, 88 | "padding": null, 89 | "right": null, 90 | "top": null, 91 | "visibility": null, 92 | "width": null 93 | } 94 | }, 95 | "57e8b8127118436884f916a243091017": { 96 | "model_module": "@jupyter-widgets/controls", 97 | "model_name": "DescriptionStyleModel", 98 | "model_module_version": "1.5.0", 99 | "state": { 100 | "_model_module": "@jupyter-widgets/controls", 101 | "_model_module_version": "1.5.0", 102 | "_model_name": "DescriptionStyleModel", 103 | "_view_count": null, 104 | "_view_module": "@jupyter-widgets/base", 105 | "_view_module_version": "1.2.0", 106 | "_view_name": "StyleView", 107 | "description_width": "" 108 | } 109 | } 110 | } 111 | } 112 | }, 113 | "cells": [ 114 | { 115 | "cell_type": "markdown", 116 | "source": [ 117 | "#ACL Pubcheck @ colab\n", 118 | "\n", 119 | "ACL pubcheck is a Python tool that automatically detects author formatting errors, margin violations as well as many other common formatting errors in papers that are using the LaTeX sty file associated with ACL venues. The script can be used to check your papers before you submit to a conference. (We highly recommend running ACL pubcheck on your papers pre-submission—a well formatted paper helps keep the reviewers focused on the scientific content.) However, its main purpose is to ensure your accepted paper is properly formatted, i.e., it follows the venue's style guidelines. The script is used by the publication chairs at most ACL events to check for formatting issues. 
Indeed, running this script yourself and fixing errors before uploading the camera-ready version of your paper will often save you a personalized email from the publication chairs.\n", 120 | "\n", 121 | "**NOTICE**: ACL pubcheck is meant to be run on the **camera ready** version of the paper, not on the review version (e.g. anonymous, line-numbered submission version). Running ACL pubcheck on a line-numbered version will result in a stream of spurious errors related to the numbers in the margins.\n", 122 | "\n", 123 | "More info can be found at: https://github.com/acl-org/aclpubcheck/blob/main/aclpubcheck_additional_info.pdf\n", 124 | "\n", 125 | "##What do you have to do?\n", 126 | "\n", 127 | "1. Install `aclpubcheck`\n", 128 | "2. Are you checking a long or short paper?\n", 129 | "3. Upload your PDF file\n", 130 | "4. Run `aclpubcheck` and see the outcomes\n", 131 | "5. (Hopefully not required:) fix the errors and re-run the code\n", 132 | "\n", 133 | "Let's check!" 134 | ], 135 | "metadata": { 136 | "id": "_tCawJsGR6RE" 137 | } 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "source": [ 142 | "## 1. Install `aclpubcheck` and import libraries\n", 143 | "\n", 144 | "Run the code in this block to installl ACL pubcheck." 145 | ], 146 | "metadata": { 147 | "id": "Vsja4xipT4iz" 148 | } 149 | }, 150 | { 151 | "cell_type": "code", 152 | "source": [ 153 | "!pip install -q git+https://github.com/acl-org/aclpubcheck" 154 | ], 155 | "metadata": { 156 | "id": "yNEDRLWvQ3NZ" 157 | }, 158 | "execution_count": null, 159 | "outputs": [] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "source": [ 164 | "from ipywidgets import Dropdown\n", 165 | "from google.colab import files\n", 166 | "import os" 167 | ], 168 | "metadata": { 169 | "id": "2rplVJym54Y1" 170 | }, 171 | "execution_count": null, 172 | "outputs": [] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "source": [ 177 | "##2. Are you checking a long or short paper?" 178 | ], 179 | "metadata": { 180 | "id": "0yiTvBIgTbpo" 181 | } 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "source": [ 186 | "Please select the correct paper type: Long, short or demo. This will help us check whether you have the correct paper length." 
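,
        "\n",
        "\n",
        "For reference, `formatchecker.py` allows up to 9 content pages for `long` papers, 5 for `short` and 7 for `demo`; the references/acknowledgements/ethics/limitations sections must begin no later than the first line of the page after that limit."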
187 | ], 188 | "metadata": { 189 | "id": "Pu1LRCSYahJ8" 190 | } 191 | }, 192 | { 193 | "cell_type": "code", 194 | "source": [ 195 | "# Define the list of options\n", 196 | "options = [\"long\", \"short\", \"demo\"]\n", 197 | "\n", 198 | "# Create the dropdown widget\n", 199 | "dropdown = Dropdown(\n", 200 | " options=options, value=\"long\", description=\"Paper type:\"\n", 201 | ")\n", 202 | "\n", 203 | "# Display the dropdown\n", 204 | "display(dropdown)" 205 | ], 206 | "metadata": { 207 | "id": "BePpvMG75sO2", 208 | "outputId": "1f2204e9-91f6-41a0-b2c6-b18cb6775880", 209 | "colab": { 210 | "base_uri": "https://localhost:8080/", 211 | "height": 49, 212 | "referenced_widgets": [ 213 | "f6fed292d0ef485bbb83ad238dffe0b4", 214 | "4e6ae31e5e51498c924eea02c65f9cea", 215 | "57e8b8127118436884f916a243091017" 216 | ] 217 | } 218 | }, 219 | "execution_count": null, 220 | "outputs": [ 221 | { 222 | "output_type": "display_data", 223 | "data": { 224 | "text/plain": [ 225 | "Dropdown(description='Paper type:', options=('long', 'short', 'demo'), value='long')" 226 | ], 227 | "application/vnd.jupyter.widget-view+json": { 228 | "version_major": 2, 229 | "version_minor": 0, 230 | "model_id": "f6fed292d0ef485bbb83ad238dffe0b4" 231 | } 232 | }, 233 | "metadata": {} 234 | } 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "source": [ 240 | "## 3. Upload your PDF file" 241 | ], 242 | "metadata": { 243 | "id": "9pOjnKu3Tmtm" 244 | } 245 | }, 246 | { 247 | "cell_type": "code", 248 | "source": [ 249 | "paper_type = dropdown.value\n", 250 | "uploaded = files.upload()\n", 251 | "filename = list(uploaded.keys())[0]\n", 252 | "length = len(uploaded[filename])\n", 253 | "os.rename(filename, \"paper.pdf\")" 254 | ], 255 | "metadata": { 256 | "id": "0KyiVz9gQRqa" 257 | }, 258 | "execution_count": null, 259 | "outputs": [] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "source": [ 264 | "# 4. Run `aclpubcheck` and see the outcomes\n", 265 | "\n", 266 | "Please, see the output of this code block to read the output of the analysis.\n", 267 | "\n", 268 | "**Notice**: if the tool finds any issue, it will show the problematic page(s)." 
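,
        "\n",
        "\n",
        "If issues are found, the tool also writes a JSON log (`errors-paper.json`) and, for margin problems, annotated images of the offending pages (`errors-paper-page-N.png`, where `N` is the page number) into the working directory."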
269 | ], 270 | "metadata": { 271 | "id": "eFjTQ1nWT_h3" 272 | } 273 | }, 274 | { 275 | "cell_type": "code", 276 | "source": [ 277 | "!aclpubcheck --paper_type $paper_type paper.pdf" 278 | ], 279 | "metadata": { 280 | "id": "5KmieUUJTBv7", 281 | "outputId": "fc63f552-8469-4de9-9a34-1c04f81a2f79", 282 | "colab": { 283 | "base_uri": "https://localhost:8080/" 284 | } 285 | }, 286 | "execution_count": null, 287 | "outputs": [ 288 | { 289 | "output_type": "stream", 290 | "name": "stdout", 291 | "text": [ 292 | "Checking paper.pdf\n", 293 | "\u001b[32mAll Clear!\u001b[0m\n" 294 | ] 295 | } 296 | ] 297 | } 298 | ] 299 | } -------------------------------------------------------------------------------- /example/2023.acl-tutorials.1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/example/2023.acl-tutorials.1.pdf -------------------------------------------------------------------------------- /pdf_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/pdf_image.png -------------------------------------------------------------------------------- /screenshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/acl-org/aclpubcheck/a340fc0a1a7d1c9808f08ab1dab1d228f63af405/screenshot.png -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | 4 | install_requires = [ 5 | "tqdm", 6 | "termcolor", 7 | "pandas", 8 | "pdfplumber", 9 | "rebiber<2.0.0", # 2.0 introduces breaking changes 10 | "pybtex", 11 | "pylatexenc", 12 | "setuptools", 13 | "Unidecode", 14 | "tsv" 15 | ] 16 | 17 | 18 | setup( 19 | name="aclpubcheck", 20 | install_requires=install_requires, 21 | version="0.1", 22 | scripts=[], 23 | packages=find_packages(include=["aclpubcheck*"]), 24 | entry_points = { 25 | 'console_scripts': [ 26 | "aclpubcheck=aclpubcheck.__main__:main", 27 | ], 28 | }, 29 | ) 30 | --------------------------------------------------------------------------------