├── .gitignore
├── LICENSE
├── README.md
├── TODO
├── inputs
└── lazy_astroph.py

/.gitignore:
--------------------------------------------------------------------------------
*~
\#*
*.pyc

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2015, Michael Zingale
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of pyjournal nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# about lazy_astroph.py

This is a simple script to get the latest papers from astro-ph/arXiv,
search their abstracts and titles for keywords, and report the
results. Reports are sent either through e-mail or by posting to
Slack channels using webhooks. This way, if we forget to read
astro-ph for a bit, we can at least get notified of the papers deemed
important to us (and discuss them on Slack).

Note: this requires Python 3.

## usage

```
./lazy_astroph.py [-m e-mail-address] [-w slack-webhook-file] inputs
```

where `inputs` is a file of (case-insensitive) keywords, one per
line. Ending a keyword with "-" makes sure that the keyword is
matched only as a whole word, and not embedded inside another
keyword. Adding a "NOT:" clause to a keyword line, followed by
comma-separated terms, results in a match only if none of the terms
following "NOT:" are found.

E.g.,

```
supernova NOT: dark energy, interstellar medium, ISM
nova-
xrb
```

will return the papers containing "supernova", so long as they don't
also contain "dark energy", "interstellar medium", or "ISM". It will
also return papers that contain "nova" distinct from "supernova"
(since `"nova" in "supernova"` is `True` in Python), and it will
return those papers containing "xrb".
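
To make the two matching modes concrete, here is a small sketch of the
tests the script applies (simplified from the logic in `lazy_astroph.py`;
the variable names are illustrative):

```
text = "the supernova was observed"

# a plain keyword is a substring test, so "nova" matches "supernova"
print("nova" in text)                            # True

# a keyword ending in "-" is a whole-word test after stripping
# punctuation, so "nova" does not match "supernova"
words = [w.strip('":.,!?') for w in text.split()]
print("nova" in words)                           # False
```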

Slack channels are indicated by a line beginning with "#", with an
optional "require=N", where N is the number of keywords a paper must
match before it is posted to that Slack channel.

You need to create a webhook via Slack. Put the URL into a file
(just the URL, nothing else) and then pass the name of that
file to `lazy_astroph.py` using the `-w` parameter.


## automating

To have this run nightly, add a line to your crontab. Assuming that
you've put the `lazy_astroph.py` script and an `inputs` file in your
`~/bin/`, do:

```
crontab -e
```

and add a new line with:

```
00 04 * * * /home/username/bin/lazy_astroph.py -m me@something.edu /home/username/bin/inputs
```

replacing the e-mail address with your e-mail and `username` with
your username.


## arXiv appearance dates

Articles appear according to the following schedule:

```
submitted              appear

Th 20:00 -> F 19:59    M
F 20:00  -> M 19:59    Tu
M 20:00  -> Tu 19:59   W
Tu 20:00 -> W 19:59    Th
W 20:00  -> Th 19:59   F
```

but note that holidays can throw this off by a day or so.

To deal with this, the script stores the id of the last paper it
reported (in the `~/.lazy_astroph` file), requests up to 1000 papers
spanning the last 10 days each time it runs, and reports only those
papers newer than the stored id.


## inspiration

* The basic feedparser mechanism for interacting with the arXiv API
  came from this example:

  https://github.com/zonca/python-parse-arxiv/blob/master/python_arXiv_parsing_example.py

* The instructions on how to search by date range came from the arXiv
  API Google group:

  https://groups.google.com/forum/#!msg/arxiv-api/I95YLIPesSE/3MZ83Pq_cugJ

  https://groups.google.com/forum/#!searchin/arxiv-api/lastupdateddate/arxiv-api/GkTlg6tikps/C-E5noLbu94J

--------------------------------------------------------------------------------
/TODO:
--------------------------------------------------------------------------------
-- we could encode the e-mails as Unicode instead of the ASCII default
   by passing 'utf-8' to the MIMEText constructor

-- we need to deal with line breaks more smartly. Currently we replace
   "\n" with " ", but that potentially breaks up words hyphenated
   across a line break; replacing with "" instead would concatenate
   separate words.
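
   one possible sketch (untested; it assumes a "-" right before a
   newline always marks a word wrapped across lines, which is not
   guaranteed):

       import re

       def flatten(text):
           # rejoin words hyphenated across a line break, then turn
           # the remaining newlines into spaces
           text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
           return text.replace("\n", " ")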

--------------------------------------------------------------------------------
/inputs:
--------------------------------------------------------------------------------
#sne require=2
supernova NOT: dark energy, interstellar medium, ISM, redshift, cosmic microwave, CMB, cosmology, cosmologies
type Ia supernova
SNIa
SN Ia
SNe Ia
Chandrasekhar
degenerate NOT: CMB, lensing, doped
progenitor NOT: asteroid, basaltic, jupiter
flame
deflagration
detonation
shell NOT: lens, refractive
helium detonation
urca-
thermonuclear
collision

#wdmerger
white dwarf merger
WD merger
double degenerate
tidal disruption

#xrb
x-ray burst
X-ray bursts
xrb
thin shell instability
thin-shell instability
nova-
superburst
quasi-periodic oscillations
low-mass X-ray binaries
QPOs

#gpu
CUDA
gpu
openacc
KNL

#hydro require=2
flash
castro
maestro
hydro NOT: galaxy, galaxies, cosmology, large scale structure, large-scale structure, dust, cloud, interstellar medium, ISM, interstellar, jets, general relativity, general relativistic, hydrogen, hydrostatic, Lyman
magnetohydrodynamics
MHD
anelastic
low mach
adaptive mesh refinement
AMR
RHD
radiation hydrodynamics
convection
convective NOT: dust, cloud
rayleigh-taylor
kelvin-helmholtz
turbulence NOT: cluster, interstellar, cloud, galactic, nebula, adaptive optics, seeing, guide star, molecular, disks, dust
shock
well-balanced
rotation NOT: galaxy, cluster, disk, hubble, quantum, spectrum, asteroid
rotating NOT: galaxy, cluster, disk
fluid dynamics
CFL
CFD
nonlinear
boundary conditions
plasma
simulation
converge

#random require=2
code NOT: photometry, telescope, spectrograph, spectra, galaxy, galaxies, dark matter, dark energy, large scale structure, large-scale structure
software NOT: surveys
python
open source
jupyter
julia
reproducibility

--------------------------------------------------------------------------------
/lazy_astroph.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3


import argparse
import datetime as dt
import json
import os
import platform
import shlex
import smtplib
import subprocess
import sys
from collections import defaultdict
from email.mime.text import MIMEText

# this script requires python 3 (see the README), so we can import
# urlopen directly instead of falling back to the python 2 urllib2
from urllib.request import urlopen

import feedparser


class Paper:
    """a Paper is a single paper listed on arXiv.  In addition to the
    paper's title, ID, and URL (obtained from arXiv), we also store
    which keywords it matched and which Slack channel it should go
    to"""

    def __init__(self, arxiv_id, title, url, keywords_by_channel):
        self.arxiv_id = arxiv_id
        # strip apostrophes -- they would break the quoting of the
        # curl payload used for the Slack post
        self.title = title.replace("'", "")
        self.url = url
        self.keywords_by_channel = keywords_by_channel
        self.keywords = []
        for kws in keywords_by_channel.values():
            self.keywords.extend(kws)
        self.posted_to_slack = False

    def __str__(self):
        t = " ".join(self.title.split())  # remove extra spaces
        return f"{self.arxiv_id} : {t}\n {self.url}\n"

    def kw_str(self):
        """ return the union of keywords """
        return ", ".join(self.keywords)

    def __lt__(self, other):
        """we compare Papers by the number of keywords, and then
        alphabetically by the union of their keywords"""

        if len(self.keywords) == len(other.keywords):
            return self.kw_str() < other.kw_str()

        return len(self.keywords) < len(other.keywords)


class Keyword:
    """a Keyword includes: the text we should match, how the matching
    should be done (unique or any), which words, if present, negate
    the match, and which Slack channel this keyword is associated with"""

    def __init__(self, name, matching="any", channel=None, excludes=None):
        self.name = name
        self.matching = matching
        self.channel = channel
        # guard against excludes=None -- set(None) raises a TypeError
        self.excludes = list(set(excludes)) if excludes else []

    def __str__(self):
        return f"{self.name}: matching={self.matching}, channel={self.channel}, NOTs={self.excludes}"


class AstrophQuery:
    """ a class to define a query to the arXiv astro-ph papers """

    def __init__(self, start_date, end_date, max_papers, old_id=None):
        self.start_date = start_date
        self.end_date = end_date
        self.max_papers = max_papers
        self.old_id = old_id

        self.base_url = "http://export.arxiv.org/api/query?"
        self.sort_query = f"max_results={self.max_papers}&sortBy=submittedDate&sortOrder=descending"

        self.subcat = ["GA", "CO", "EP", "HE", "IM", "SR"]
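
    # for reference, get_url() below assembles a query URL of the form
    # (a sketch; the category list is abbreviated and the date range is
    # just an example):
    #
    #   http://export.arxiv.org/api/query?search_query=
    #       %28astro-ph.GA+OR+...+OR+astro-ph.SR%29
    #       +AND+lastUpdatedDate:[201501012000+TO+201501102000]
    #       &max_results=1000&sortBy=submittedDate&sortOrder=descending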

    def get_cat_query(self):
        """ create the category portion of the astro-ph query """

        cat_query = "%28"  # open parenthesis
        for n, s in enumerate(self.subcat):
            cat_query += f"astro-ph.{s}"
            if n < len(self.subcat)-1:
                cat_query += "+OR+"
            else:
                cat_query += "%29"  # close parenthesis

        return cat_query

    def get_range_query(self):
        """ get the query string for the date range """

        # the 2000 appended to each date is 8:00 pm -- arXiv's
        # submission cutoff
        start = self.start_date.strftime("%Y%m%d")
        end = self.end_date.strftime("%Y%m%d")
        range_str = f"[{start}2000+TO+{end}2000]"
        range_query = f"lastUpdatedDate:{range_str}"
        return range_query

    def get_url(self):
        """ create the URL we will use to query arXiv """

        cat_query = self.get_cat_query()
        range_query = self.get_range_query()

        full_query = f"search_query={cat_query}+AND+{range_query}&{self.sort_query}"

        return self.base_url + full_query

    def do_query(self, keywords=None, old_id=None):
        """ perform the actual query """

        # note: in python 3 this will be bytes, not str
        response = urlopen(self.get_url()).read()

        # renaming the author tags sidesteps feedparser's special
        # handling of <author> entries
        response = response.replace(b"author", b"contributor")

        # this feedparser magic comes from the example of Julius Lucks / Andrea Zonca
        # https://github.com/zonca/python-parse-arxiv/blob/master/python_arXiv_parsing_example.py
        feed = feedparser.parse(response)

        # feedparser returns the opensearch fields as strings
        if int(feed.feed.opensearch_totalresults) == 0:
            sys.exit("no results found")

        results = []

        latest_id = None

        for e in feed.entries:

            arxiv_id = e.id.split("/abs/")[-1]
            title = e.title.replace("\n", " ")

            # the papers are sorted such that the first is the most
            # recent -- we store this id so that the next time we run
            # the script, we can pick up from here
            if latest_id is None:
                latest_id = arxiv_id

            # now check if we are at or behind the old_id -- this is
            # where we left off last time.  Note things may not be in
            # id order, so we keep looking through the entire list of
            # returned results.
            if old_id is not None:
                if arxiv_id <= old_id:
                    continue

            # find the link to the abstract page
            url = e.id
            for l in e.links:
                if l.rel == "alternate":
                    url = l.href

            abstract = e.summary

            # any keyword matches?  we do two types of matches here.
            # If the keyword has the "any" qualifier, then we don't
            # care how it appears in the text, but if it has "unique",
            # then we want to make sure only that word matches, i.e.,
            # "nova" and not "supernova".  If any of the exclude words
            # associated with the keyword are present, then we reject
            # the match.
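            # e.g. the inputs line "supernova NOT: dark energy, ISM"
            # matches a paper mentioning "supernova" only if neither
            # "dark energy" nor "ism" appears in its title or abstract
            # (the inputs file is lowercased when it is read in)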
            keys_matched = defaultdict(list)
            for k in keywords:
                # first check the "NOT"s
                excluded = False
                for n in k.excludes:
                    if n in abstract.lower().replace("\n", " ") or n in title.lower():
                        # we've matched one of the excludes
                        excluded = True

                if excluded:
                    continue

                if k.matching == "any":
                    if k.name in abstract.lower().replace("\n", " ") or k.name in title.lower():
                        keys_matched[k.channel].append(k.name)

                elif k.matching == "unique":
                    qa = [w.lower().strip('":.,!?') for w in abstract.split()]
                    qt = [w.lower().strip('":.,!?') for w in title.split()]
                    if k.name in qa + qt:
                        keys_matched[k.channel].append(k.name)

            if keys_matched:
                results.append(Paper(arxiv_id, title, url, dict(keys_matched)))

        return results, latest_id


def report(body, subject, sender, receiver):
    """ send an e-mail """

    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = receiver

    try:
        sm = smtplib.SMTP('localhost')
        sm.sendmail(sender, receiver, msg.as_string())
    except smtplib.SMTPException:
        sys.exit("ERROR sending mail")


def search_astroph(keywords, old_id=None):
    """ do the actual search through astro-ph by first querying astro-ph
    for the latest papers and then looking for keyword matches"""

    today = dt.date.today()
    day = dt.timedelta(days=1)

    max_papers = 1000

    # we pick a wide-enough search range to ensure we catch papers
    # even if there is a holiday

    # also, something weird happens -- the arXiv ids appear to be in
    # descending order if you look at the "pastweek" listing, but the
    # submission dates can vary wildly.  It seems that some papers
    # are held for a week or more before appearing.
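    # querying 10 days back and filtering against old_id in do_query()
    # keeps papers from being reported twice while still catching the
    # late-appearing ones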
    q = AstrophQuery(today - 10*day, today, max_papers, old_id=old_id)
    print(q.get_url())

    papers, last_id = q.do_query(keywords=keywords, old_id=old_id)

    papers.sort(reverse=True)

    return papers, last_id


def send_email(papers, mail=None):
    """ e-mail the report of matched papers if an address was given,
    otherwise print it """

    # compose the body of our e-mail
    body = ""

    # group the papers by their keyword matches
    current_kw = None
    for p in papers:
        if p.kw_str() != current_kw:
            current_kw = p.kw_str()
            body += f"\nkeywords: {current_kw}\n\n"

        body += f"{p}\n"

    if papers:
        if mail is not None:
            report(body, "astro-ph papers of interest",
                   f"lazy-astroph@{platform.node()}", mail)
        else:
            print(body)


def run(string):
    """ run a UNIX command """

    # shlex.split will preserve inner quotes
    prog = shlex.split(string)
    p0 = subprocess.Popen(prog, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)

    stdout0, stderr0 = p0.communicate()
    rc = p0.returncode
    p0.stdout.close()

    return stdout0, stderr0, rc


def slack_post(papers, channel_req, username=None, icon_emoji=None, webhook=None):
    """ post the papers matching each channel to Slack """

    # loop by channel
    for c in channel_req:
        channel_body = ""
        for p in papers:
            if not p.posted_to_slack:
                if c in p.keywords_by_channel:
                    if len(p.keywords_by_channel[c]) >= channel_req[c]:
                        keywds = ", ".join(p.keywords).strip()
                        channel_body += f"{p} [{keywds}]\n\n"
                        p.posted_to_slack = True

        if webhook is None:
            # no webhook -- just show what we would have posted
            print(f"channel: {c}")
            print(channel_body)
            continue

        payload = {}
        payload["channel"] = c
        if username is not None:
            payload["username"] = username
        if icon_emoji is not None:
            payload["icon_emoji"] = icon_emoji
        payload["text"] = channel_body

        cmd = f"curl -X POST --data-urlencode 'payload={json.dumps(payload)}' {webhook}"
        run(cmd)


def doit():
    """ the main driver for the lazy-astroph script """

    # parse runtime parameters
    parser = argparse.ArgumentParser()

    parser.add_argument("-m", help="e-mail address to send report to",
                        type=str, default=None)
    parser.add_argument("inputs", help="inputs file containing keywords",
                        type=str, nargs=1)
    parser.add_argument("-w", help="file containing slack webhook URL",
                        type=str, default=None)
    parser.add_argument("-u", help="slack username appearing in post",
                        type=str, default=None)
    parser.add_argument("-e", help="slack icon_emoji appearing in post",
                        type=str, default=None)
    parser.add_argument("--dry_run",
                        help="don't send any mail or slack posts and don't update the marker where we left off",
                        action="store_true")
    args = parser.parse_args()

    # get the keywords
    keywords = []
    try:
        f = open(args.inputs[0])
    except OSError:
        sys.exit("ERROR: unable to open inputs file")

    channel = None
    channel_req = {}
    for line in f:
        l = line.lower().rstrip()

        if l == "":
            continue

        if l.startswith("#") or l.startswith("@"):
            # this line defines a channel
            ch = l.split()
            channel = ch[0]
            if len(ch) == 2:
                requires = int(ch[1].split("=")[1])
            else:
                requires = 1
            channel_req[channel] = requires
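            # e.g. the inputs line "#hydro require=2" creates channel
            # "#hydro" with requires = 2; a bare "#gpu" line defaults
            # to requires = 1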
        else:
            # this line has a keyword (and optional NOT keywords)
            if "not:" in l:
                kw, nots = l.split("not:")
                kw = kw.strip()
                excludes = [x.strip() for x in nots.split(",")]
            else:
                kw = l.strip()
                excludes = []

            if kw.endswith("-"):
                matching = "unique"
                kw = kw[:-1]
            else:
                matching = "any"

            keywords.append(Keyword(kw, matching=matching,
                                    channel=channel, excludes=excludes))

    f.close()

    # have we done this before?  if so, read the .lazy_astroph file to
    # get the id of the paper we left off with
    param_file = os.path.expanduser("~/.lazy_astroph")
    try:
        f = open(param_file)
    except OSError:
        old_id = None
    else:
        old_id = f.readline().rstrip()
        f.close()

    papers, last_id = search_astroph(keywords, old_id=old_id)

    if not args.dry_run:
        send_email(papers, mail=args.m)

        if args.w is not None:
            try:
                f = open(args.w)
            except OSError:
                sys.exit("ERROR: unable to open webhook file")

            # strip the trailing newline so the URL can be handed to curl
            webhook = f.readline().strip()
            f.close()
        else:
            webhook = None

        slack_post(papers, channel_req, icon_emoji=args.e, username=args.u, webhook=webhook)

        print("writing param_file", param_file)

        try:
            f = open(param_file, "w")
        except OSError:
            sys.exit("ERROR: unable to open parameter file for writing")

        f.write(last_id)
        f.close()
    else:
        send_email(papers, mail=None)


if __name__ == "__main__":
    doit()

--------------------------------------------------------------------------------