├── README.md ├── contributing.md ├── download_latest.py ├── output-1696514831.9555178-bacon.md ├── output-1696514935.4285033-quantum computing.md ├── output.md └── weekly ├── 2023-10-05 arxiv dump.txt └── 2023-10-05 benchmarking.md /README.md: -------------------------------------------------------------------------------- 1 | # weekly_arxiv 2 | 3 | - Main Export Link: https://export.arxiv.org/api/query?search_query=all:llms&sortBy=lastUpdatedDate&sortOrder=descending&max_results=200 4 | 5 | ## Usage 6 | 7 | - run `python download_latest.py` 8 | - it will update `output.md` 9 | 10 | You can modify the link in the script to change the search term, sorting, and limits in the URL 11 | 12 | 13 | ## Claude Prompting 14 | 15 | Claude has 100k token window, which makes it useful for surveying the resulting documents 16 | 17 | ### Identifying trends 18 | 19 | Copy and paste the Markdown into Claude and then use this to get a list of trends 20 | 21 | ```text 22 | Can you please characterize the major trends in the latest LLM research. ONLY use the material I have given you here. The goal is to summarize the last week of LLM research. 23 | ``` 24 | 25 | ### Listing out papers for specific trends 26 | 27 | Once you identify trends, you can use this example. In this case I was asking about benchmarks. 28 | 29 | ```text 30 | Please write a summary of the benchmarking trends as elucidated by the text I gave you. The summary should be in the format of a complete paragraph followed by a list of key innovations. Each item in the list should include the paper link and paper title, along with a brief description of the innovation. Make sure you only pull from the information I gave you in this conversation. 31 | ``` 32 | -------------------------------------------------------------------------------- /contributing.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | NO >:-( 4 | -------------------------------------------------------------------------------- /download_latest.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from time import time 3 | from xml.etree import ElementTree as ET 4 | 5 | 6 | # get search term 7 | a = input("Whatcha wanna lookup? ") 8 | 9 | 10 | # URL of the XML object 11 | url = "https://export.arxiv.org/api/query?search_query=all:%s&sortBy=lastUpdatedDate&sortOrder=descending&max_results=200" % a.lower().replace(' ','%20') 12 | 13 | # Send a GET request to the URL 14 | response = requests.get(url) 15 | 16 | # Parse the XML response 17 | root = ET.fromstring(response.content) 18 | 19 | # Namespace dictionary to find elements 20 | namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} 21 | 22 | # Open the output file with UTF-8 encoding 23 | with open("output-%s-%s.md" % (time(), a), "w", encoding='utf-8') as file: 24 | # Iterate over each entry in the XML data 25 | for entry in root.findall('atom:entry', namespaces): 26 | # Extract and write the title 27 | title = entry.find('atom:title', namespaces).text 28 | title = ' '.join(title.split()) # Replace newlines and superfluous whitespace with a single space 29 | file.write(f"# {title}\n\n") 30 | 31 | # Extract and write the link to the paper 32 | id = entry.find('atom:id', namespaces).text 33 | file.write(f"[Link to the paper]({id})\n\n") 34 | 35 | # Extract and write the authors 36 | authors = entry.findall('atom:author', namespaces) 37 | file.write("## Authors\n") 38 | for author in authors: 39 | name = author.find('atom:name', namespaces).text 40 | file.write(f"- {name}\n") 41 | file.write("\n") 42 | 43 | # Extract and write the summary 44 | summary = entry.find('atom:summary', namespaces).text 45 | file.write(f"## Summary\n{summary}\n\n") 46 | -------------------------------------------------------------------------------- /weekly/2023-10-05 benchmarking.md: -------------------------------------------------------------------------------- 1 | Recent research has led to the introduction of numerous benchmarks aimed at systematically evaluating the capabilities of large language models (LLMs) across diverse domains and tasks. These benchmarks enable the community to gain a more comprehensive understanding of the strengths and weaknesses of LLMs. 2 | 3 | Key benchmarking innovations include: 4 | 5 | - [L-Eval: Instituting Standardized Evaluation for Long Context Language Models](http://arxiv.org/abs/2307.11088v3) - Proposes benchmark suite and metrics to evaluate LLM performance on long input contexts. 6 | 7 | - [TRAM: Benchmarking Temporal Reasoning for Large Language Models](http://arxiv.org/abs/2310.00835v2) - Introduces benchmark composed of datasets covering various aspects of temporal reasoning. 8 | 9 | - [MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts](http://arxiv.org/abs/2310.02255v1) - Constructs benchmark amalgamating math and visual QA datasets requiring compositional reasoning. 10 | 11 | - [ARN: A Comprehensive Framework and Dataset for Analogical Reasoning on Narratives](http://arxiv.org/abs/2310.01074v1) - Develops benchmark and framework for evaluating analogical reasoning in narratives using LLMs. 12 | 13 | - [LLM-grounded Video Diffusion Models](http://arxiv.org/abs/2309.17444v2) - Benchmarks video generation from text descriptions using LLMs to guide video diffusion models. 14 | 15 | - [FELM: Benchmarking Factuality Evaluation of Large Language Models](http://arxiv.org/abs/2310.00741v1) - Introduces benchmark annotating LLM responses for evaluating factuality models. 16 | 17 | - [L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models](http://arxiv.org/abs/2309.17446v2) - Presents systematic evaluation of LLMs on 7 language-to-code generation tasks. 18 | 19 | The introduction of these rigorous benchmarks has been pivotal in furthering LLM research and enabling the community to track progress. However, there remains ample room for constructing additional benchmarks evaluating LLMs across even more diverse settings and applications. --------------------------------------------------------------------------------