├── Utils ├── __init__.py ├── webutils.py └── webinterface.py ├── ScholarlyRecommender ├── Repository │ ├── __init__.py │ ├── tests │ │ ├── outputs │ │ │ ├── test_configuration_update.json │ │ │ ├── test_recommendations.csv │ │ │ └── test_feed.html │ │ └── inputs │ │ │ ├── test_configuration.json │ │ │ ├── ref_recommendations.csv │ │ │ └── ref_feed.html │ ├── admin.py │ ├── Recommendations.csv │ └── Feed.csv ├── Scraper │ ├── __init__.py │ ├── .DS_Store │ ├── GoogleScholar.py │ └── Arxiv.py ├── Recommender │ ├── __init__.py │ └── rec_sys.py ├── _cython │ ├── zlib.pxd │ └── cython_functions.pyx ├── configuration.json ├── __init__.py ├── const.py ├── config.py └── Newsletter │ ├── mail.py │ ├── feed.py │ └── html │ ├── Feed.html │ └── WebFeed.html ├── .gitattributes ├── images ├── logo.png ├── system.png └── example_1.png ├── .vscode └── settings.json ├── docs ├── local-config.md ├── README.md ├── web-config.md └── scholarlyAPI.md ├── .gitignore ├── .streamlit └── config.toml ├── .github ├── SECURITY.md ├── ISSUE_TEMPLATE │ ├── feature_request.md │ └── bug_report.md └── workflows │ └── dependency-review.yml ├── .deepsource.toml ├── .devcontainer └── devcontainer.json ├── Makefile ├── requirements.txt ├── calibrate.py ├── README.md ├── LICENSE ├── testing.py └── webapp.py /Utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Recommender/__init__.py: 
-------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/logo.png -------------------------------------------------------------------------------- /images/system.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/system.png -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "python.analysis.extraPaths": [ 3 | "./Repository" 4 | ] 5 | } -------------------------------------------------------------------------------- /images/example_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/example_1.png -------------------------------------------------------------------------------- /docs/local-config.md: -------------------------------------------------------------------------------- 1 | # Scholarly Recommender Local Configuration and Override Instructions 2 | 3 | 4 | ## TODO 5 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/.DS_Store: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/ScholarlyRecommender/Scraper/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | /env 2 | /__pycache__ 3 | *.pyc 4 | ScholarlyRecommender/.DS_Store 5 | .DS_Store 6 | 7 | .streamlit/secrets.toml 8 | 9 | /images/Logos-All 10 | testlog 11 | 70Stars.png 12 | -------------------------------------------------------------------------------- /.streamlit/config.toml: -------------------------------------------------------------------------------- 1 | [theme] 2 | # Custom theme configurations 3 | base = "light" 4 | primaryColor = "#A2A2F5" 5 | backgroundColor = "#FFFFFF" 6 | secondaryBackgroundColor = "#E3E3fA" 7 | textColor = "#262730" 8 | font = "sans serif" -------------------------------------------------------------------------------- /ScholarlyRecommender/_cython/zlib.pxd: -------------------------------------------------------------------------------- 1 | cdef extern from "zlib.h": 2 | int compress(unsigned char *dest, unsigned long *destLen, const unsigned char *source, unsigned long sourceLen) 3 | unsigned long compressBound(unsigned long sourceLen) -------------------------------------------------------------------------------- /.github/SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | 4 | 5 | 6 | ## Reporting a Vulnerability 7 | 8 | Please report any security or vulnerability issues to the project's email: [scholarlyrecommender@gmail.com](mailto:scholarlyrecommender@gmail.com) 9 | -------------------------------------------------------------------------------- /.deepsource.toml: -------------------------------------------------------------------------------- 1 | version = 1 2 | 3 | test_patterns = ["testing.py"] 4 | 5 | [[analyzers]] 6 | name = "secrets" 7
| 8 | [[analyzers]] 9 | name = "cxx" 10 | 11 | [[analyzers]] 12 | name = "python" 13 | 14 | [analyzers.meta] 15 | runtime_version = "3.x.x" 16 | 17 | [[transformers]] 18 | name = "black" 19 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_configuration_update.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Computer Science", 4 | "Mathematics" 5 | ], 6 | "labels": "ScholarlyRecommender/Repository/tests/test_candidates_labeled.csv", 7 | "feed_length": 7, 8 | "feed_path": "ScholarlyRecommender/Repository/tests/test_feed.html" 9 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/configuration.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Artificial Intelligence", 4 | "Natural language processing", 5 | "Computer Vision", 6 | "Machine Learning" 7 | ], 8 | "labels": "ScholarlyRecommender/Repository/labeled/Candidates_Labeled.csv", 9 | "feed_length": 5, 10 | "feed_path": "ScholarlyRecommender/Newsletter/html/WebFeed.html" 11 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/test_configuration.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Artificial Intelligence", 4 | "Natural language processing", 5 | "Computer Vision", 6 | "Machine Learning" 7 | 8 | ], 9 | "labels": "ScholarlyRecommender/Repository/tests/test_candidates_labeled.csv", 10 | "feed_length": 5, 11 | "feed_path": "ScholarlyRecommender/Repository/tests/test_feed.html" 12 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/__init__.py: -------------------------------------------------------------------------------- 1 | from .Scraper.Arxiv 
import source_candidates, fast_search 2 | from .Recommender.rec_sys import get_recommendations, evaluate 3 | from .Newsletter.feed import get_feed 4 | 5 | from .config import get_config, update_config 6 | 7 | __all__ = [ 8 | "source_candidates", 9 | "fast_search", 10 | "get_recommendations", 11 | "evaluate", 12 | "get_feed", 13 | "get_config", 14 | "update_config", 15 | ] 16 | -------------------------------------------------------------------------------- /ScholarlyRecommender/const.py: -------------------------------------------------------------------------------- 1 | from copy import deepcopy 2 | 3 | 4 | def BASE_REPO(): 5 | """ 6 | Initialize a new base repo as a deep copy. 7 | @param: None 8 | @return: a dictionary (str:list) 9 | """ 10 | return deepcopy( 11 | { 12 | "Id": [], 13 | "Category": [], 14 | "Title": [], 15 | "Published": [], 16 | "Abstract": [], 17 | "URL": [], 18 | } 19 | ) 20 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | # ScholarlyRecommender Documentation 2 | 3 | These documents are intended to help you use the ScholarlyRecommender application. These docs are still in development; as the project continues to scale, more documentation will be added. 4 | 5 | Please **note:** most, if not all, of the source code contained within the repo is formatted in accordance with Black. This includes docstrings, comments, and typedefs. Please use the source code as a reference for any functionality that is not yet covered in the documentation.
6 | 7 | 8 | ![system](https://github.com/iansnyder333/ScholarlyRecommender/assets/58576523/b9239c91-6e76-48db-900d-ca9b845b5639) 9 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 
21 | -------------------------------------------------------------------------------- /ScholarlyRecommender/config.py: -------------------------------------------------------------------------------- 1 | # Default configuration for ScholarlyRecommender 2 | 3 | from json import load, dump 4 | 5 | with open("ScholarlyRecommender/configuration.json") as json_file: 6 | config = load(json_file) 7 | 8 | 9 | def get_config(): 10 | """Return the configuration dictionary.""" 11 | with open("ScholarlyRecommender/configuration.json") as json_file: 12 | config = load(json_file) 13 | return config 14 | 15 | 16 | def update_config(new_config, **kwargs): 17 | """Update the configuration file.""" 18 | if kwargs.get("test_mode"):  # .get() avoids a KeyError when callers omit test_mode 19 | with open(kwargs["test_path"], "w") as json_file: 20 | dump(new_config, json_file, indent=4) 21 | else: 22 | with open("ScholarlyRecommender/configuration.json", "w") as json_file: 23 | dump(new_config, json_file, indent=4) 24 | -------------------------------------------------------------------------------- /.github/workflows/dependency-review.yml: -------------------------------------------------------------------------------- 1 | # Dependency Review Action 2 | # 3 | # This Action will scan dependency manifest files that change as part of a Pull Request, surfacing known-vulnerable versions of the packages declared or updated in the PR. Once installed, if the workflow run is marked as required, PRs introducing known-vulnerable packages will be blocked from merging.
4 | # 5 | # Source repository: https://github.com/actions/dependency-review-action 6 | # Public documentation: https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/about-dependency-review#dependency-review-enforcement 7 | name: 'Dependency Review' 8 | on: [pull_request] 9 | 10 | permissions: 11 | contents: read 12 | 13 | jobs: 14 | dependency-review: 15 | runs-on: ubuntu-latest 16 | steps: 17 | - name: 'Checkout Repository' 18 | uses: actions/checkout@v3 19 | - name: 'Dependency Review' 20 | uses: actions/dependency-review-action@v3 21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 
39 | -------------------------------------------------------------------------------- /.devcontainer/devcontainer.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "Python 3", 3 | // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile 4 | "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye", 5 | "customizations": { 6 | "codespaces": { 7 | "openFiles": [ 8 | "README.md", 9 | "webapp.py" 10 | ] 11 | }, 12 | "vscode": { 13 | "settings": {}, 14 | "extensions": [ 15 | "ms-python.python", 16 | "ms-python.vscode-pylance" 17 | ] 18 | } 19 | }, 20 | "updateContentCommand": "[ -f packages.txt ] && sudo apt update && sudo apt upgrade -y && sudo xargs apt install -y = min_ver))') 16 | ifeq ($(PYTHON_VERSION_OK), 0) 17 | $(error "Need python version >= $(PYTHON_VERSION_MIN). Current version is $(PYTHON_VERSION_CUR)") 18 | endif 19 | 20 | ENV_DIR = env 21 | PIP_VERSION = pip3 22 | 23 | # Phony targets 24 | .PHONY: all build setup install run test clean 25 | 26 | # Default target 27 | all: setup install run 28 | 29 | # Setup the virtual environment if it doesn't exist 30 | setup: 31 | if [ ! 
-d "$(ENV_DIR)" ]; then \ 32 | $(PYTHON) -m venv $(ENV_DIR); \ 33 | fi 34 | 35 | # Activate the virtual environment and install dependencies 36 | install: 37 | if [ -d "$(ENV_DIR)/bin" ]; then \ 38 | source $(ENV_DIR)/bin/activate && \ 39 | $(PIP_VERSION) install -r requirements.txt; \ 40 | elif [ -d "$(ENV_DIR)/Scripts" ]; then \ 41 | source $(ENV_DIR)/Scripts/activate && \ 42 | $(PIP_VERSION) install -r requirements.txt; \ 43 | fi 44 | 45 | build: setup install 46 | 47 | 48 | ## Run Streamlit App (virtual env already activated) 49 | run: 50 | streamlit run webapp.py 51 | 52 | # Run tests 53 | test: 54 | $(PYTHON) testing.py 55 | 56 | # Clean the environment 57 | clean: 58 | rm -rf $(ENV_DIR) -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/mail.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example of how to configure the email server if you have installed ScholarlyRecommender 3 | locally: 4 | Replace the constants with your own email address, password, port number and 5 | subscribers (delivery address). 6 | Note that you need to enable less secure apps in your google account settings, 7 | otherwise this will not work. 8 | Once you have your own constants working, you can delete the input() calls and replace 9 | them with your own constants. 10 | To send an email, simply call send_email(content=html_string) from the Newsletter 11 | module. html_string is the return value from the get_feed() function in feed.py. 
12 | """ 13 | import smtplib 14 | from email.message import EmailMessage 15 | import re 16 | 17 | 18 | def validate_email(email): 19 | """Validate an email address using regex.""" 20 | regex = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b" 21 | if re.match(regex, email): 22 | return True 23 | return False 24 | 25 | 26 | def send_email(**kwargs): 27 | """Send an email using the configured email server.""" 28 | EMAIL_ADDRESS = input("Enter your email address: ") 29 | EMAIL_PASSWORD = input("Enter your email password: ") 30 | SUBSCRIBERS = input( 31 | "Enter your subscribers email addresses (separated by commas): " 32 | ) 33 | PORT = input("Enter your port number: (465 for gmail)") 34 | 35 | msg = EmailMessage() 36 | msg["Subject"] = "Your Scholarly Recommender Newsletter" 37 | msg["From"] = EMAIL_ADDRESS 38 | msg["To"] = SUBSCRIBERS 39 | 40 | html_string = kwargs["content"] 41 | 42 | msg.set_content(html_string, subtype="html") 43 | with smtplib.SMTP_SSL("smtp.gmail.com", PORT) as smtp: 44 | smtp.login(EMAIL_ADDRESS, EMAIL_PASSWORD) 45 | smtp.send_message(msg) 46 | -------------------------------------------------------------------------------- /ScholarlyRecommender/_cython/cython_functions.pyx: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | cimport numpy as np 3 | from libc.stdlib cimport malloc, free 4 | from libc.string cimport strcat, strcpy 5 | from cython.parallel cimport prange 6 | 7 | cdef extern from "zlib.h": 8 | int compress(unsigned char *dest, unsigned long *destLen, const unsigned char *source, unsigned long sourceLen) 9 | unsigned long compressBound(unsigned long sourceLen) 10 | 11 | def calculate_ncd(np.ndarray[object, ndim=1] test_texts, np.ndarray[object, ndim=1] train_texts): 12 | cdef int i, j 13 | cdef int num_test = test_texts.shape[0] 14 | cdef int num_train = train_texts.shape[0] 15 | cdef np.ndarray[np.float64_t, ndim=2] ncd_results = np.zeros((num_test, num_train), 
dtype=np.float64) 16 | cdef unsigned long Cx1, Cx2, Cx1x2 17 | cdef bytes x1, x2, x1x2 18 | 19 | for i in range(num_test): 20 | x1 = test_texts[i].encode('utf-8') 21 | Cx1 = compress_c(x1) 22 | 23 | for j in range(num_train): 24 | x2 = train_texts[j].encode('utf-8') 25 | Cx2 = compress_c(x2) 26 | 27 | x1x2 = x1 + b" " + x2 # Use bytes concatenation 28 | Cx1x2 = compress_c(x1x2) 29 | 30 | ncd_results[i, j] = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 31 | 32 | return ncd_results 33 | 34 | cdef unsigned long compress_c(bytes input_str): 35 | cdef unsigned long source_len = len(input_str) 36 | cdef unsigned long max_compressed_len = compressBound(source_len) 37 | cdef unsigned char *compressed_data = malloc(max_compressed_len) 38 | cdef unsigned long compressed_len = max_compressed_len 39 | 40 | if compress(compressed_data, &compressed_len, input_str, source_len) != 0: 41 | # Handle compression error 42 | return 0 43 | 44 | free(compressed_data) 45 | return compressed_len -------------------------------------------------------------------------------- /Utils/webinterface.py: -------------------------------------------------------------------------------- 1 | from pandas import DataFrame 2 | 3 | 4 | def build_query(selected_sub_categories: dict) -> list: 5 | """ 6 | Build a query from the selected sub-categories 7 | @param selected_sub_categories: dict 8 | @return: list of queries represented as strings 9 | """ 10 | return 11 | 12 | 13 | def validate_email(email) -> bool: 14 | """ 15 | Validate an email address using regex 16 | @param email: string representing an email address 17 | @return: bool indicating whether the email is valid 18 | """ 19 | return 20 | 21 | 22 | def send_email(**kwargs) -> None: 23 | """ 24 | Send an email using the configured email server 25 | @param kwargs: dict of keyword arguments 26 | @return: None 27 | @raises: ValueError if the email address is invalid or if a server error occurs 28 | """ 29 | return 30 | 31 | 32 | def 
generate_feed_pipeline(query: list, n: int, days: int) -> None: 33 | """ 34 | Generate a feed from a query, this is the main pipeline for generating 35 | recommendations 36 | @param query: list of queries represented as strings, defaults to sys_config() 37 | @param n: number of recommendations to generate, defaults to 5 38 | @param days: number of days back to search, defaults to 7 39 | @return: None 40 | """ 41 | return 42 | 43 | 44 | def fetch_papers(num_papers: int = 10) -> DataFrame: 45 | """ 46 | Collect a sample of papers from arXiv for calibration, sourced using the default 47 | configuration of interest categories 48 | Papers are collected, shuffled, and returned as a formatted DataFrame 49 | @param num_papers: number of papers to collect, defaults to 10 50 | @return: DataFrame of papers formatted for labeling 51 | """ 52 | return 53 | 54 | 55 | def calibrate_rec_sys(num_papers: int = 10) -> None: 56 | """ 57 | Interactive calibration tool for the recommender system, essentially serves as a 58 | user interface for manual labeling 59 | @param num_papers: number of papers to rate, defaults to 10 60 | @return: None, labels configured in sys_config["labels"] 61 | """ 62 | return 63 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/admin.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module contains functions to create and manage the database for the 3 | recommender system. 4 | """ 5 | import pandas as pd 6 | import arxiv 7 | 8 | from ScholarlyRecommender.const import BASE_REPO 9 | 10 | 11 | def get_papers(ids: list, query: str = "") -> pd.DataFrame: 12 | """ 13 | Scrape arxiv.org for papers matching the query and return a dataframe matching 14 | the BASE_REPO format. 
15 | """ 16 | repository = BASE_REPO() 17 | search = arxiv.Search( 18 | query=query, 19 | id_list=ids, 20 | ) 21 | for result in search.results(): 22 | repository["Id"].append(result.entry_id.split("/")[-1]) 23 | repository["Category"].append(result.primary_category) 24 | repository["Title"].append(result.title.strip("\n")) 25 | repository["Published"].append(result.published) 26 | repository["Abstract"].append(result.summary.strip("\n")) 27 | repository["URL"].append(result.pdf_url) 28 | return pd.DataFrame(repository).set_index("Id") 29 | 30 | 31 | def build_arxiv_repo(ids: list, path: str) -> None: 32 | """Build a csv file containing the papers matching the ids.""" 33 | if not path.endswith(".csv"): 34 | raise AssertionError("Path must be a csv file") 35 | df = get_papers(ids) 36 | df.to_csv(path) 37 | 38 | 39 | def add_paper(ids: list, to_repo: str) -> None: 40 | """Add papers matching the ids to the repository. Duplicates are removed.""" 41 | if not to_repo.endswith(".csv"): 42 | raise AssertionError("Repository must be a csv file") 43 | df1 = pd.read_csv(to_repo, index_col="Id") 44 | df2 = get_papers(ids) 45 | df = pd.concat([df1, df2]) 46 | df = df[~df.index.duplicated(keep="first")] 47 | df.to_csv(to_repo) 48 | 49 | 50 | def remove_paper(ids: list, from_repo: str) -> None: 51 | """Remove papers matching the ids from the repository.""" 52 | if not from_repo.endswith(".csv"): 53 | raise AssertionError("Repository must be a csv file") 54 | df1 = pd.read_csv(from_repo, index_col="Id") 55 | df2 = get_papers(ids) 56 | df = df1.drop(df2.index) 57 | df.to_csv(from_repo) 58 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with Python 3.11 3 | # by the following command: 4 | # 5 | # pip-compile 6 | # 7 | altair==5.1.1 8 | # via streamlit 9 | arxiv==1.4.8 10 | # via -r 
requirements.in 11 | attrs==23.1.0 12 | # via 13 | # jsonschema 14 | # referencing 15 | beautifulsoup4==4.12.2 16 | # via -r requirements.in 17 | blinker==1.6.2 18 | # via streamlit 19 | cachetools==5.3.1 20 | # via streamlit 21 | certifi==2024.7.4 22 | # via requests 23 | charset-normalizer==3.2.0 24 | # via requests 25 | click==8.1.7 26 | # via streamlit 27 | cython==3.0.2 28 | # via -r requirements.in 29 | feedparser==6.0.10 30 | # via arxiv 31 | gitdb==4.0.10 32 | # via gitpython 33 | gitpython==3.1.41 34 | # via streamlit 35 | idna==3.7 36 | # via requests 37 | importlib-metadata==6.8.0 38 | # via streamlit 39 | jinja2==3.1.6 40 | # via 41 | # altair 42 | # pydeck 43 | jsonschema==4.19.1 44 | # via altair 45 | jsonschema-specifications==2023.7.1 46 | # via jsonschema 47 | markdown-it-py==3.0.0 48 | # via rich 49 | markupsafe==2.1.3 50 | # via jinja2 51 | mdurl==0.1.2 52 | # via markdown-it-py 53 | numpy==1.23.5 54 | # via 55 | # -r requirements.in 56 | # altair 57 | # pandas 58 | # pyarrow 59 | # pydeck 60 | # streamlit 61 | packaging==23.1 62 | # via 63 | # altair 64 | # streamlit 65 | pandas==2.0.1 66 | # via 67 | # -r requirements.in 68 | # altair 69 | # streamlit 70 | pillow>=9.5.0 71 | # via streamlit 72 | protobuf==4.25.8 73 | # via streamlit 74 | pyarrow>=13.0.0 75 | # via streamlit 76 | pydeck==0.8.0 77 | # via streamlit 78 | pygments==2.16.1 79 | # via rich 80 | pympler==1.0.1 81 | # via streamlit 82 | python-dateutil==2.8.2 83 | # via 84 | # pandas 85 | # streamlit 86 | pytz==2023.3.post1 87 | # via pandas 88 | pytz-deprecation-shim==0.1.0.post0 89 | # via tzlocal 90 | referencing==0.30.2 91 | # via 92 | # jsonschema 93 | # jsonschema-specifications 94 | requests==2.32.4 95 | # via 96 | # -r requirements.in 97 | # streamlit 98 | rich==13.5.3 99 | # via streamlit 100 | rpds-py==0.10.3 101 | # via 102 | # jsonschema 103 | # referencing 104 | sgmllib3k==1.0.0 105 | # via feedparser 106 | six==1.16.0 107 | # via python-dateutil 108 | smmap==5.0.1 109 | # 
via gitdb 110 | soupsieve==2.5 111 | # via beautifulsoup4 112 | streamlit==1.37.0 113 | # via -r requirements.in 114 | tenacity==8.2.3 115 | # via streamlit 116 | toml==0.10.2 117 | # via streamlit 118 | toolz==0.12.0 119 | # via altair 120 | tornado==6.5.1 121 | # via streamlit 122 | tqdm==4.66.3 123 | # via -r requirements.in 124 | typing-extensions==4.8.0 125 | # via streamlit 126 | tzdata==2023.3 127 | # via 128 | # pandas 129 | # pytz-deprecation-shim 130 | tzlocal==4.3.1 131 | # via streamlit 132 | urllib3==2.5.0 133 | # via requests 134 | validators==0.22.0 135 | # via streamlit 136 | zipp==3.19.1 137 | # via importlib-metadata 138 | 139 | # The following packages are considered to be unsafe in a requirements file: 140 | # setuptools 141 | -------------------------------------------------------------------------------- /docs/web-config.md: -------------------------------------------------------------------------------- 1 | 2 | # Scholarly Recommender System Calibration Tool 3 | 4 | This document will help you calibrate the recommender system to your interests! 5 | Below are the various configuration steps, it is advised to do them in order. 6 | Once a step is completed, the changes will be applied automatically, regardless of whether you continue to the next step or not. 7 | 8 | ## Step 1: Access the Configuration API 9 | 10 | First, use the following [link](https://scholarlyrecommender.streamlit.app) to access the cloud application. If you are running this project locally, you should refer to this document. 11 | 12 | Once you are on the cloud app, use the navigation bar on the left side of the screen and click on the "Configure" tab, shown below. 13 | 14 | Screen Shot 2023-09-22 at 8 20 59 AM 15 | 16 | 17 | 18 | ## Step 2: Configure your interests 19 | 20 | Follow the instructions displayed on the screen to start the configuration process. You will be asked to select categories and subcategories that are of interest to you. 
You may select as few or as many as you like; these categories will help the ScholarlyRecommender source candidate papers that align with your interests. Once you have selected all the categories, click the done button just below the categories box. You should see the following message indicating that your preferences were successfully changed. 21 | 22 | Screen Shot 2023-09-22 at 8 33 27 AM 23 | 24 | Congratulations! ScholarlyRecommender will now automatically source candidates that are relevant to your interests, unless specified otherwise. You can go back and change this configuration at any time! 25 | 26 | ## Step 3: Calibrate the Recommender System 27 | 28 | Scroll down to the next section on the page, labeled *Calibrate the Recommender System*. This step will calibrate the recommender system to rank candidate papers based on your interests, and will significantly improve recommendations. This process will show you snippets of 10 papers and ask you to rate them on a scale of 1 to 10 (1 being the least relevant and 10 being the most relevant). 29 | 30 | **Note**: Many improvements are planned for this process, including the ability to skip papers, change the sample size, and dynamically update the system based on your feedback from the generated feed. 31 | 32 | When you are ready, click the "Start Calibration" button. After rating the 10 papers, you should see the following message: 33 | 34 | Screen Shot 2023-09-22 at 8 40 27 AM 35 | 36 | Amazing! The ScholarlyRecommender is now configured and ready to be your personal agent for finding new, personalized academic publications. Now you can navigate to the Get Recommendations page and click the "generate recommendations" button to see your personalized feed! 37 | 38 | Screen Shot 2023-09-22 at 8 48 25 AM 39 | 40 | Thank you so much for using ScholarlyRecommender.
I recently graduated college and am the sole developer of this project, so I would love any constructive feedback you have to offer to help me improve as a developer. 41 | 42 | Please report any issues by creating an issue on the GitHub repository, or by sending an email to the project email directly. 43 | 44 | - **Github Issue**: https://github.com/iansnyder333/ScholarlyRecommender/issues 45 | - **Project Email**: scholarlyrecommender@gmail.com 46 | 47 | 48 | -------------------------------------------------------------------------------- /calibrate.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script allows you to reconfigure the recommender system to your interests without 3 | having to use the web interface. 4 | It is not nearly as robust as the web interface, but offers a quick and lightweight 5 | alternative for users who do not want to use the web interface. 6 | """ 7 | import ScholarlyRecommender as sr 8 | import arxiv 9 | 10 | to_path = "ScholarlyRecommender/Repository/labeled/Candidates_Labeled.csv" 11 | old_config = sr.get_config() 12 | 13 | 14 | def calibrate_rec_sys(query: list, num_papers: int = 10): 15 | """ 16 | Interactive calibration tool for the recommender system, essentially serves 17 | as a user interface for manual labeling. 18 | """ 19 | # get sample of papers to label 20 | c = sr.source_candidates( 21 | queries=query, 22 | max_results=100, 23 | as_df=True, 24 | sort_by=arxiv.SortCriterion.Relevance, 25 | ) 26 | sam = c.sample(frac=1) 27 | sam.reset_index(inplace=True) 28 | df = sam[["Title", "Abstract"]].copy() 29 | df["Abstract"] = df["Abstract"].str[:500] + "..." 30 | 31 | df = df.head(num_papers) 32 | 33 | labels = [] 34 | for _, row in df.iterrows(): 35 | print(f"{row['Title']} \n") 36 | print(f"{row['Abstract']} \n") 37 | print("Rate this paper on a scale of 1 to 10? 
\n") 38 | labels.append(int(input("enter a number: "))) 39 | print("\n \n") 40 | df["label"] = labels 41 | df.to_csv(to_path) 42 | old_config["labels"] = to_path 43 | sr.update_config(old_config) 44 | return True 45 | 46 | 47 | def main(): 48 | """ 49 | This script allows you to reconfigure the recommender system to your interests 50 | without having to use the web interface. 51 | """ 52 | print("\n WARNING: This script is deprecated") 53 | print("Please use the web interface!") 54 | print("Welcome to the Scholarly Recommender System Calibration Tool \n") 55 | print( 56 | "This tool will help you calibrate the recommender system to your interests \n" 57 | ) 58 | print("Please answer the following questions to help us get to know you better \n") 59 | print("Select the categories that interest you the most: \n") 60 | print("1. Computer Science \n 2. Mathematics \n 3. Physics") 61 | print("4. Quantitative Biology \n 5. Quantitative Finance \n 6. Statistics \n") 62 | print("Please enter the numbers of the categories that interest you the most.") 63 | print("Separate each number with a comma. ex: 1,2,3 \n") 64 | categories = input("Enter a list of numbers: ") 65 | categories = categories.split(",") 66 | categories = [int(i) for i in categories] 67 | search_categories = { 68 | 1: "Computer Science", 69 | 2: "Mathematics", 70 | 3: "Physics", 71 | 4: "Quantitative Biology", 72 | 5: "Quantitative Finance", 73 | 6: "Statistics", 74 | } 75 | categories = list(map(search_categories.get, categories)) 76 | print(f"Thank you for your input. You selected {categories} \n") 77 | print("Now we will ask you to rate a few papers to help us get to know you better.") 78 | print("This will take a few minutes.") 79 | res = input("Press enter if you want to proceed, enter 'skip' to skip: \n") 80 | if res == "skip": 81 | print("You have chosen to skip this step. \n") 82 | print("The recommender system will not be calibrated to your interests. 
\n") 83 | return 84 | print("Please rate the following papers on a scale of 1 to 10 \n") 85 | state = calibrate_rec_sys(categories) 86 | if state: 87 | print("Thank you for your input. Your results have been saved. \n") 88 | print("The recommender system will now be calibrated to your interests \n") 89 | print("No further action is required! \n") 90 | 91 | return 92 | print("Something went wrong. Please try again. \n") 93 | return 94 | 95 | 96 | if __name__ == "__main__": 97 | main() 98 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/GoogleScholar.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from bs4 import BeautifulSoup 3 | import pandas as pd 4 | 5 | 6 | class ScraperForGoogleScholar: 7 | """Scraper for google scholar""" 8 | 9 | def __init__(self, headers, repository: dict = None): 10 | self.headers = headers 11 | if repository is None: 12 | self.repository = { 13 | "Paper Title": [], 14 | "Author": [], 15 | "Publication": [], 16 | "Url of paper": [], 17 | "Abstract": [], 18 | } 19 | else: 20 | self.repository = repository 21 | 22 | def _get_paperinfo(self, paper_url: str): 23 | """get the paper info from the url""" 24 | # download the page 25 | response = requests.get(paper_url, headers=self.headers) 26 | # check successful response 27 | if response.status_code != 200: 28 | raise AssertionError(f"Failed to fetch web page {paper_url}") 29 | # parse using beautiful soup 30 | return BeautifulSoup(response.text, "html.parser") 31 | 32 | @staticmethod 33 | def _get_tags(doc: BeautifulSoup) -> tuple: 34 | """get the tags from the document""" 35 | paper_tag = doc.select("[data-lid]") 36 | link_tag = doc.find_all("h3", {"class": "gs_rt"}) 37 | author_tag = doc.find_all("div", {"class": "gs_a"}) 38 | abstract_tag = doc.find_all("div", {"class": "gs_rs"}) 39 | return (paper_tag, link_tag, author_tag, abstract_tag) 40 | 41 | @staticmethod 42 | 
def _get_papertitle(paper_tag: list) -> list: 43 | """get the paper title from the tag""" 44 | return [tag.select("h3")[0].get_text() for tag in paper_tag] 45 | 46 | @staticmethod 47 | def _get_link(link_tag: list) -> list: 48 | """get the link from the tag""" 49 | return [link_tag[i].a["href"] for i in range(len(link_tag))] 50 | 51 | @staticmethod 52 | def _get_author_publisher_info(authors_tag: list) -> tuple: 53 | """get the author and publisher info from the tag""" 54 | authors = [] 55 | publishers = [] 56 | for v, _ in enumerate(authors_tag): 57 | authortag_text = (authors_tag[v].text).split("-") 58 | 59 | if len(authortag_text) == 0: 60 | authors.append("None") 61 | publishers.append("None") 62 | elif len(authortag_text) == 1: 63 | authors.append(authortag_text[0]) 64 | publishers.append("None") 65 | else: 66 | authors.append(authortag_text[0]) 67 | publishers.append(authortag_text[-1]) 68 | 69 | return (authors, publishers) 70 | 71 | @staticmethod 72 | def _get_abstract(abstract_tag: list) -> list: 73 | """get the abstract from the tag""" 74 | abstract = [] 75 | for i, _ in enumerate(abstract_tag): 76 | s = (abstract_tag[i].text).strip().split("-") 77 | s = " ".join(s[1:]) 78 | s = s.strip("\n") 79 | abstract.append(s) 80 | return abstract 81 | 82 | def _add_in_paper_repo(self, **kwargs) -> pd.DataFrame: 83 | """add the paper info in the repository""" 84 | self.repository["Paper Title"].extend(kwargs["papername"]) 85 | self.repository["Author"].extend(kwargs["author"]) 86 | self.repository["Publication"].extend(kwargs["publisher"]) 87 | self.repository["Url of paper"].extend(kwargs["url"]) 88 | self.repository["Abstract"].extend(kwargs["abstract"]) 89 | return pd.DataFrame(self.repository) 90 | 91 | def scrape(self, url: str, to_df: bool = True): 92 | """scrape google scholar""" 93 | # function for the get content of each page 94 | doc = self._get_paperinfo(url) 95 | 96 | # function for the collecting tags 97 | paper_tag, link_tag, author_tag, abstract_tag = 
self._get_tags(doc) 98 | 99 | # paper title from each page 100 | papername = self._get_papertitle(paper_tag) 101 | 102 | # year , author , publication of the paper 103 | author, public = self._get_author_publisher_info(author_tag) 104 | 105 | # url of the paper 106 | link = self._get_link(link_tag) 107 | abstract = self._get_abstract(abstract_tag) 108 | # add in paper repo dict 109 | paper = self._add_in_paper_repo( 110 | papername=papername, 111 | author=author, 112 | publisher=public, 113 | url=link, 114 | abstract=abstract, 115 | ) 116 | 117 | if to_df: 118 | return paper 119 | return paper.to_dict() 120 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/feed.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | from ScholarlyRecommender.config import get_config 4 | 5 | config = get_config() 6 | 7 | 8 | def clean_feed(dataframe: pd.DataFrame): 9 | """Clean the dataframe to match the BASE_REPO format.""" 10 | df = dataframe[ 11 | ["Id", "Category", "Title", "Published", "Abstract", "URL", "Author"] 12 | ].copy() 13 | df.reset_index(inplace=True) 14 | df["Author"] = df["Author"].astype(str) 15 | df["Id"] = df["Id"].apply(lambda x: "Entry Id: " + str(x)) 16 | df["Published"] = pd.to_datetime(df["Published"]).dt.strftime("%m-%d-%Y") 17 | df["Published"] = df["Published"].apply(lambda x: "Published on " + str(x)) 18 | df["Author"] = df["Author"].apply(extract_author_names) 19 | df["Author"] = df["Author"].str[:500] + "..." 20 | df["Abstract"] = df["Abstract"].str[:500] + "..." 
21 | 22 | df["Abstract"] = df["Abstract"].apply(remove_latex) 23 | df["Title"] = df["Title"].apply(remove_latex) 24 | return df 25 | 26 | 27 | def extract_author_names(author_string): 28 | """Extract the author names from the author string.""" 29 | # The regular expression to match any characters enclosed within single quotes 30 | pattern = r"\'(.*?)\'" 31 | 32 | # Find all matches of the pattern 33 | matches = re.findall(pattern, author_string) 34 | 35 | return ", ".join(matches) 36 | 37 | 38 | def remove_latex(text): 39 | """Remove LaTeX from the text.""" 40 | # Remove inline LaTeX 41 | clean_text = re.sub(r"\$.*?\$", "", text) 42 | 43 | # Remove block LaTeX 44 | clean_text = re.sub(r"\\begin{.*?}\\end{.*?}", "", clean_text) 45 | 46 | return clean_text 47 | 48 | 49 | def build_email( 50 | df: pd.DataFrame, 51 | email: bool = False, 52 | to_path: str = None, 53 | web: bool = False, 54 | ): 55 | """Build the HTML email.""" 56 | flanT5_out = { 57 | "headline": "Your Scholarly Recommender Newsletter Feed", 58 | "intro": "Thank you for using Scholarly Recommender. Here is your feed.", 59 | } 60 | 61 | html_content = """ 62 | 63 | 64 | """ 65 | if email: 66 | body_template = """ 67 |

<h1>{headline}</h1>
<p>Dear Reader,</p>
<p>{intro}</p>
82 | """ 83 | body_html = body_template.format( 84 | headline=flanT5_out["headline"], 85 | intro=flanT5_out["intro"], 86 | ) 87 | html_content += body_html 88 | # HTML template for each feed item 89 | html_template = """ 90 |
<div class="feed-item">
<h2>{title}</h2>
<p>{author}</p>
<p>{id} | {category} | {published}</p>
<p>{abstract}</p>
<a href="{url}">Read More</a>
</div>
118 | """ 119 | 120 | # Iterate through the DataFrame and fill in the HTML template 121 | for _, row in df.iterrows(): 122 | item_html = html_template.format( 123 | title=row["Title"], 124 | author=row["Author"], 125 | id=row["Id"], 126 | category=row["Category"], 127 | published=row["Published"], 128 | abstract=row["Abstract"], 129 | url=row["URL"], 130 | ) 131 | html_content += item_html 132 | html_content += """ 133 | """ 134 | # Save the generated HTML to a file for demonstration 135 | if web: 136 | return html_content 137 | if to_path is None: 138 | to_path = config["feed_path"] 139 | html_file_path = to_path 140 | with open(html_file_path, "w") as f: 141 | f.write(html_content) 142 | return True 143 | 144 | 145 | def get_feed( 146 | data, 147 | email: bool = False, 148 | to_path: str = None, 149 | web: bool = False, 150 | ): 151 | """Get the feed.""" 152 | if isinstance(data, pd.DataFrame): 153 | df = clean_feed(data) 154 | res = build_email(df, email=email, to_path=to_path, web=web) 155 | return res 156 | 157 | if isinstance(data, str): 158 | df = clean_feed(pd.read_csv(data)) 159 | res = build_email(df, email=email, to_path=to_path, web=web) 160 | return res 161 | raise TypeError("data must be a pandas DataFrame or a path to a csv file") 162 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/Arxiv.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import pandas as pd 3 | import arxiv 4 | from ScholarlyRecommender.const import BASE_REPO 5 | from ScholarlyRecommender.config import get_config 6 | 7 | config = get_config() 8 | 9 | logging.basicConfig( 10 | level=logging.INFO, 11 | format="%(asctime)s [%(levelname)s]: %(message)s", 12 | handlers=[logging.StreamHandler()], 13 | ) 14 | 15 | # logging.disable(logging.CRITICAL) 16 | 17 | 18 | def search( 19 | query: str, max_results: int = 100, sort_by=arxiv.SortCriterion.SubmittedDate 20 | ) -> 
pd.DataFrame: 21 | """ 22 | Scrape arxiv.org for papers matching the query and return a dataframe 23 | matching the BASE_REPO format. 24 | """ 25 | search_client = arxiv.Client(page_size=max_results, delay_seconds=3, num_retries=5) 26 | 27 | repository = BASE_REPO() 28 | 29 | arx_search = arxiv.Search(query=query, max_results=max_results, sort_by=sort_by) 30 | 31 | for result in search_client.results(arx_search): 32 | try: 33 | repository["Id"].append(result.entry_id.split("/")[-1]) 34 | repository["Category"].append(result.primary_category) 35 | repository["Title"].append(result.title.strip("\n")) 36 | repository["Published"].append(result.published) 37 | repository["Abstract"].append(result.summary.strip("\n")) 38 | repository["URL"].append(result.pdf_url) 39 | except arxiv.arxiv.UnexpectedEmptyPageError as error: 40 | logging.exception(error) 41 | continue 42 | if len(repository["Id"]) == 0: 43 | raise ValueError("No papers found for this query") 44 | return pd.DataFrame(repository).set_index("Id") 45 | 46 | 47 | def fast_search( 48 | queries: list, 49 | max_results: int = 100, 50 | to_path: str = None, 51 | as_df: bool = False, 52 | prev_days: int = 7, 53 | sort_by=arxiv.SortCriterion.SubmittedDate, 54 | ): 55 | """ 56 | Faster version of source_candidates (in development); batches all queries 57 | into a single search to significantly reduce runtime.
58 | """ 59 | if queries is None: 60 | queries = config["queries"] 61 | if not isinstance(queries, list) or len(queries) == 0: 62 | raise ValueError("queries must be a list of strings with at least one element") 63 | if prev_days <= 0 or prev_days >= 30: 64 | raise ValueError("prev_days must be greater than 0 and at most 30") 65 | if len(queries) > 100: 66 | raise ValueError("Too many queries, please reduce the number of queries ") 67 | # Turn queries into a single string of each query, seperated by " OR " 68 | queries = " OR ".join(queries) 69 | max_results = 500 70 | 71 | logging.info("Searching for %s", queries) 72 | # df = search(queries, max_results=max_results, sort_by=sort_by) 73 | repository = BASE_REPO() 74 | arx_search = arxiv.Search(query=queries, max_results=max_results, sort_by=sort_by) 75 | for result in arx_search.results(): 76 | try: 77 | repository["Id"].append(result.entry_id.split("/")[-1]) 78 | repository["Category"].append(result.primary_category) 79 | repository["Title"].append(result.title.strip("\n")) 80 | repository["Published"].append(result.published) 81 | repository["Abstract"].append(result.summary.strip("\n")) 82 | repository["URL"].append(result.pdf_url) 83 | except arxiv.arxiv.UnexpectedEmptyPageError as error: 84 | logging.exception(error) 85 | continue 86 | if len(repository["Id"]) == 0: 87 | raise ValueError("No papers found for this query") 88 | df = pd.DataFrame(repository).set_index("Id") 89 | if df.index.has_duplicates: 90 | df = df[~df.index.duplicated(keep="first")] 91 | df["Published"] = pd.to_datetime(df["Published"]) 92 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=prev_days) 93 | df = df[df["Published"] >= num_days] 94 | logging.info("Number of papers extracted : %s", len(df.index)) 95 | if to_path is not None: 96 | df.to_csv(to_path) 97 | if as_df: 98 | return df 99 | return None 100 | 101 | 102 | def source_candidates( 103 | queries: list, 104 | max_results: int = 100, 105 | to_path: str = None, 106 | 
as_df: bool = False, 107 | prev_days: int = 7, 108 | sort_by=arxiv.SortCriterion.SubmittedDate, 109 | ): 110 | """ 111 | Scrape arxiv.org for papers matching the queries, 112 | filter them and return a dataframe or save it to a csv file. 113 | """ 114 | if queries is None: 115 | queries = config["queries"] 116 | if not isinstance(queries, list) or len(queries) == 0: 117 | raise ValueError("queries must be a list of strings with at least one element") 118 | if prev_days <= 0 or prev_days >= 30: 119 | raise ValueError("prev_days must be greater than 0 and less than 30") 120 | if len(queries) > 100: 121 | raise ValueError("Too many queries, please reduce the number of queries") 122 | # normalize queries if sourcing for recommendations 123 | if sort_by == arxiv.SortCriterion.SubmittedDate: 124 | max_results = max(((100 * prev_days) // len(queries)), 100) 125 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=prev_days) 126 | else: 127 | max_results = 100 128 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=1095) 129 | 130 | dfs = [] 131 | for query in queries: 132 | logging.info("Searching for %s", query) 133 | 134 | df2 = search(query, max_results=max_results, sort_by=sort_by) 135 | dfs.append(df2) 136 | 137 | df = pd.concat(dfs) 138 | # Remove duplicates 139 | if df.index.has_duplicates: 140 | df = df[~df.index.duplicated(keep="first")] 141 | 142 | # Only keep papers published within the prev_days window 143 | df["Published"] = pd.to_datetime(df["Published"]) 144 | df = df[df["Published"] >= num_days] 145 | logging.info("Number of papers extracted : %s", len(df.index)) 146 | 147 | if to_path is not None: 148 | df.to_csv(to_path) 149 | if as_df: 150 | return df 151 | return None 152 | -------------------------------------------------------------------------------- /docs/scholarlyAPI.md: -------------------------------------------------------------------------------- 1 | # Scholarly Recommender API Documentation (In Progress) 2 | 3 | ## TODO 4 | 5 |
source_candidates 6 | get_recommendations 7 | evaluate 8 | get_feed 9 | get_config 10 | update_config 11 | 12 |
13 | **Table of Contents** 14 | 15 | 1. About This Document 16 | 2. Frequently Asked Questions 17 | 3. Functions and Usage 18 |
 34 | 35 | 36 | ## About This Document 37 | 38 | TODO 39 | 40 | ## Frequently Asked Questions 41 | 42 | TODO 43 | 44 | 45 | ## Functions and Usage 46 | 47 | TODO 48 | 49 | 50 | ### get_config 51 | 52 | ```python 53 | ScholarlyRecommender.get_config() -> dict: 54 | ``` 55 | 56 | Retrieves the current configuration being used by the ScholarlyRecommender system. 57 | 58 | - **Parameters:** 59 | - None 60 | - **Returns:** 61 | - config: dict 62 | - A Python dictionary representing the current configuration being used by the system. It is internally stored as a JSON file. 63 | - **Example** 64 | - ```python 65 | import ScholarlyRecommender as sr 66 | 67 | config = sr.get_config() 68 | queries = config['queries'] 69 | ``` 70 | 71 | ### update_config 72 | 73 | ```python 74 | ScholarlyRecommender.update_config(new_config, **kwargs) -> None: 75 | ``` 76 | 77 | Updates the configuration used by the ScholarlyRecommender system, overwriting the previous configuration stored in configuration.json. 78 | 79 | - **Parameters:** 80 | - new_config: dict 81 | - A Python dictionary representing the new configuration the system will use. This will be internally stored as a JSON file to configuration.json and will overwrite the previous configuration. 82 | - `**kwargs`: optional params 83 | - Optional keyword arguments used for testing and debugging. 84 | - **Returns:** 85 | - None 86 | - **Example** 87 | - ```python 88 | import ScholarlyRecommender as sr
      # Illustrative: add a query to the existing configuration
      config = sr.get_config()
      config["queries"].append("Machine Learning")
      sr.update_config(config)
 89 | ``` 90 | 91 | ### source_candidates 92 | 93 | ```python 94 | ScholarlyRecommender.source_candidates( 95 | queries: list, 96 | max_results: int = 100, 97 | to_path: str = None, 98 | as_df: bool = True, 99 | prev_days: int = 7, 100 | sort_by=arxiv.SortCriterion.SubmittedDate, 101 | ) -> pd.DataFrame: 102 | ``` 103 | 104 | Scrapes the web for papers matching the queries, filters them, and returns a dataframe containing the results. Results can also be saved to a csv file. 105 | 106 | - **Parameters:** 107 | - queries: list of str 108 | - The search queries to scrape, represented as strings. The length of queries must be greater than zero and at most 100.
109 | - max_results: int, *optional* 110 | - The maximum number of candidates to source per query; defaults to 100 and scales dynamically based on the length of queries. 111 | - to_path: str, path object, file-like object, *optional* 112 | - Where to store the resulting candidates if desired; the dataframe will be saved here as a csv. Defaults to None. 113 | - as_df: bool, *optional* 114 | - Boolean to indicate if the resulting candidates should be returned as a Pandas DataFrame. Defaults to True, and should only be changed to False if to_path is provided. 115 | - prev_days: int, *optional* 116 | - The maximum number of days (inclusive) since the publication date for candidates; defaults to 7, must be greater than 0 and less than 30. 117 | - sort_by: arxiv.SortCriterion, *optional* 118 | - TODO 119 | - **Returns:** 120 | - Pandas DataFrame if as_df is True, otherwise None. 121 | - **Example** 122 | - ```python 123 | import ScholarlyRecommender as sr
      # Illustrative query list; any arXiv search strings work here
      candidates = sr.source_candidates(
          queries=["Computer Science", "Mathematics"],
          prev_days=7,
          as_df=True,
      )
 124 | ``` 125 | 126 | ### get_recommendations 127 | 128 | ```python 129 | ScholarlyRecommender.get_recommendations( 130 | data, 131 | labels, 132 | size: int = None, 133 | to_path: str = None, 134 | as_df: bool = False, 135 | ): 136 | ``` 137 | 138 | Ranks the papers in data and returns the top n papers as a dataframe, or saves them to a csv file. 139 | 140 | - **Parameters:** 141 | - data: DataFrame or path-like object 142 | - The dataset containing the candidates for the system to rank. This dataset is returned by source_candidates. 143 | - labels: DataFrame, path-like object, or None 144 | - The labeled dataset the system will use to generate recommendations. This dataset is stored in the environment configuration, which is how it should be accessed. 145 | - size: int, *optional* 146 | - The number of papers to return; defaults to the environment configuration (typically 5). 147 | - Throws a ValueError for manual inputs less than 0 or greater than the number of candidates provided by data.
148 | - to_path: str, path object, file-like object, *optional* 149 | - Where to store the resulting recommendations if desired; the dataframe will be saved here as a csv. Defaults to None. 150 | - as_df: bool, *optional* 151 | - Boolean to indicate if the resulting recommendations should be returned as a Pandas DataFrame. Defaults to False; set it to True to return the dataframe directly, or provide to_path to save the results instead. 152 | - **Returns:** 153 | - Pandas DataFrame if as_df is True, otherwise None. 154 | - **Example** 155 | - ```python 156 | import ScholarlyRecommender as sr
      # Illustrative: candidates is a DataFrame returned by source_candidates
      config = sr.get_config()
      recommendations = sr.get_recommendations(
          data=candidates,
          labels=config["labels"],
          as_df=True,
      )
 157 | ``` 158 | 159 | ### evaluate 160 | 161 | ```python 162 | ScholarlyRecommender.evaluate(n: int = 5, k: int = 6, on: str = "Abstract") -> float: 163 | ``` 164 | 165 | Evaluates the recommender system on the labeled dataset. Uses the normalized compression distance to predict the masked rating. 166 | Calculates the mean squared error between the predicted and actual ratings and returns the total loss. 167 | 168 | 169 | - **Parameters:** 170 | - n: int, *optional* - TODO - k: int, *optional* - TODO - on: str, *optional* - The column of the labeled dataset to evaluate on; defaults to "Abstract". 171 | - **Returns:** 172 | - float - The total mean squared error loss between the predicted and actual ratings. 173 | - **Example** 174 | - ```python 175 | import ScholarlyRecommender as sr
      loss = sr.evaluate(on="Abstract")
 176 | ``` 177 | 178 | ### get_feed 179 | 180 | ```python 181 | ScholarlyRecommender.get_feed( 182 | data, 183 | email: bool = False, 184 | to_path: str = None, 185 | web: bool = False, 186 | ): 187 | ``` 188 | 189 | Builds an HTML feed from the ranked papers in data. The feed can be written to an HTML file, returned as an HTML string for the web interface, or formatted for email. 190 | 191 | - **Parameters:** 192 | - data: DataFrame or path to a csv file - The recommendations to render; a TypeError is raised for any other type. - email: bool, *optional* - If True, the feed is formatted as an email with a headline and greeting. Defaults to False. - to_path: str, path object, file-like object, *optional* - Where to write the generated HTML; defaults to the feed path in the environment configuration. - web: bool, *optional* - If True, the generated HTML is returned as a string instead of being written to a file. Defaults to False. 193 | - **Returns:** 194 | - The HTML string if web is True, otherwise True once the feed has been written. 195 | - **Example** 196 | - ```python 197 | import ScholarlyRecommender as sr
      # Illustrative: render a saved recommendations csv as HTML
      html = sr.get_feed(data="Recommendations.csv", web=True)
 198 | ``` 199 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2309.11400v1,q-fin.TR,Transformers versus LSTMs for electronic trading,2023-09-20 15:25:43+00:00,"With the rapid development of artificial intelligence, long short term memory 3 | (LSTM), one kind of recurrent
neural network (RNN), has been widely applied in 4 | time series prediction. 5 | Like RNN, Transformer is designed to handle the sequential data. As 6 | Transformer achieved great success in Natural Language Processing (NLP), 7 | researchers got interested in Transformer's performance on time series 8 | prediction, and plenty of Transformer-based solutions on long time series 9 | forecasting have come out recently. However, when it comes to financial time 10 | series prediction, LSTM is still a dominant architecture. Therefore, the 11 | question this study wants to answer is: whether the Transformer-based model can 12 | be applied in financial time series prediction and beat LSTM. 13 | To answer this question, various LSTM-based and Transformer-based models are 14 | compared on multiple financial prediction tasks based on high-frequency limit 15 | order book data. A new LSTM-based model called DLSTM is built and new 16 | architecture for the Transformer-based model is designed to adapt for financial 17 | prediction. The experiment result reflects that the Transformer-based model 18 | only has the limited advantage in absolute price sequence prediction. The 19 | LSTM-based models show better and more robust performance on difference 20 | sequence prediction, such as price difference and price movement.",http://arxiv.org/pdf/2309.11400v1,"['Paul Bilokon', 'Yitao Qiu']" 21 | 2309.10982v1,cs.AI,Is GPT4 a Good Trader?,2023-09-20 00:47:52+00:00,"Recently, large language models (LLMs), particularly GPT-4, have demonstrated 22 | significant capabilities in various planning and reasoning tasks 23 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 24 | has been a surge of interest among researchers to harness the capabilities of 25 | GPT-4 for the automated design of quantitative factors that do not overlap with 26 | existing factor libraries, with an aspiration to achieve alpha returns 27 | \cite{webpagequant}. 
In contrast to these work, this study aims to examine the 28 | fidelity of GPT-4's comprehension of classic trading theories and its 29 | proficiency in applying its code interpreter abilities to real-world trading 30 | data analysis. Such an exploration is instrumental in discerning whether the 31 | underlying logic GPT-4 employs for trading is intrinsically reliable. 32 | Furthermore, given the acknowledged interpretative latitude inherent in most 33 | trading theories, we seek to distill more precise methodologies of deploying 34 | these theories from GPT-4's analytical process, potentially offering invaluable 35 | insights to human traders. 36 | To achieve this objective, we selected daily candlestick (K-line) data from 37 | specific periods for certain assets, such as the Shanghai Stock Index. Through 38 | meticulous prompt engineering, we guided GPT-4 to analyze the technical 39 | structures embedded within this data, based on specific theories like the 40 | Elliott Wave Theory. We then subjected its analytical output to manual 41 | evaluation, assessing its interpretative depth and accuracy vis-\`a-vis these 42 | trading theories from multiple dimensions. The results and findings from this 43 | study could pave the way for a synergistic amalgamation of human expertise and 44 | AI-driven insights in the realm of trading.",http://arxiv.org/pdf/2309.10982v1,['Bingzhe Wu'] 45 | 2309.11495v1,cs.CL,Chain-of-Verification Reduces Hallucination in Large Language Models,2023-09-20 17:50:55+00:00,"Generation of plausible yet incorrect factual information, termed 46 | hallucination, is an unsolved issue in large language models. We study the 47 | ability of language models to deliberate on the responses they give in order to 48 | correct their mistakes. 
We develop the Chain-of-Verification (CoVe) method 49 | whereby the model first (i) drafts an initial response; then (ii) plans 50 | verification questions to fact-check its draft; (iii) answers those questions 51 | independently so the answers are not biased by other responses; and (iv) 52 | generates its final verified response. In experiments, we show CoVe decreases 53 | hallucinations across a variety of tasks, from list-based questions from 54 | Wikidata, closed book MultiSpanQA and longform text generation.",http://arxiv.org/pdf/2309.11495v1,"['Shehzaad Dhuliawala', 'Mojtaba Komeili', 'Jing Xu', 'Roberta Raileanu', 'Xian Li', 'Asli Celikyilmaz', 'Jason Weston']" 55 | 2309.11830v1,cs.CL,A Chinese Prompt Attack Dataset for LLMs with Evil Content,2023-09-21 07:07:49+00:00,"Large Language Models (LLMs) present significant priority in text 56 | understanding and generation. However, LLMs suffer from the risk of generating 57 | harmful contents especially while being employed to applications. There are 58 | several black-box attack methods, such as Prompt Attack, which can change the 59 | behaviour of LLMs and induce LLMs to generate unexpected answers with harmful 60 | contents. Researchers are interested in Prompt Attack and Defense with LLMs, 61 | while there is no publicly available dataset to evaluate the abilities of 62 | defending prompt attack. In this paper, we introduce a Chinese Prompt Attack 63 | Dataset for LLMs, called CPAD. Our prompts aim to induce LLMs to generate 64 | unexpected outputs with several carefully designed prompt attack approaches and 65 | widely concerned attacking contents. Different from previous datasets involving 66 | safety estimation, We construct the prompts considering three dimensions: 67 | contents, attacking methods and goals, thus the responses can be easily 68 | evaluated and analysed. 
We run several well-known Chinese LLMs on our dataset, 69 | and the results show that our prompts are significantly harmful to LLMs, with 70 | around 70% attack success rate. We will release CPAD to encourage further 71 | studies on prompt attack and defense.",http://arxiv.org/pdf/2309.11830v1,"['Chengyuan Liu', 'Fubang Zhao', 'Lizhi Qing', 'Yangyang Kang', 'Changlong Sun', 'Kun Kuang', 'Fei Wu']" 72 | 2309.11688v1,cs.CL,LLM Guided Inductive Inference for Solving Compositional Problems,2023-09-20 23:44:16+00:00,"While large language models (LLMs) have demonstrated impressive performance 73 | in question-answering tasks, their performance is limited when the questions 74 | require knowledge that is not included in the model's training data and can 75 | only be acquired through direct observation or interaction with the real world. 76 | Existing methods decompose reasoning tasks through the use of modules invoked 77 | sequentially, limiting their ability to answer deep reasoning tasks. We 78 | introduce a method, Recursion based extensible LLM (REBEL), which handles 79 | open-world, deep reasoning tasks by employing automated reasoning techniques 80 | like dynamic planning and forward-chaining strategies. REBEL allows LLMs to 81 | reason via recursive problem decomposition and utilization of external tools. 82 | The tools that REBEL uses are specified only by natural language description. 
83 | We further demonstrate REBEL capabilities on a set of problems that require a 84 | deeply nested use of external tools in a compositional and conversational 85 | setting.",http://arxiv.org/pdf/2309.11688v1,"['Abhigya Sodani', 'Lauren Moos', 'Matthew Mirman']" 86 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/ref_recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2309.11495v1,cs.CL,Chain-of-Verification Reduces Hallucination in Large Language Models,2023-09-20 17:50:55+00:00,"Generation of plausible yet incorrect factual information, termed 3 | hallucination, is an unsolved issue in large language models. We study the 4 | ability of language models to deliberate on the responses they give in order to 5 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 6 | whereby the model first (i) drafts an initial response; then (ii) plans 7 | verification questions to fact-check its draft; (iii) answers those questions 8 | independently so the answers are not biased by other responses; and (iv) 9 | generates its final verified response. In experiments, we show CoVe decreases 10 | hallucinations across a variety of tasks, from list-based questions from 11 | Wikidata, closed book MultiSpanQA and longform text generation.",http://arxiv.org/pdf/2309.11495v1,"['Shehzaad Dhuliawala', 'Mojtaba Komeili', 'Jing Xu', 'Roberta Raileanu', 'Xian Li', 'Asli Celikyilmaz', 'Jason Weston']" 12 | 2309.11295v1,cs.CL,CPLLM: Clinical Prediction with Large Language Models,2023-09-20 13:24:12+00:00,"We present Clinical Prediction with Large Language Models (CPLLM), a method 13 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 14 | disease prediction. 
We utilized quantization and fine-tuned the LLM using 15 | prompts, with the task of predicting whether patients will be diagnosed with a 16 | target disease during their next visit or in the subsequent diagnosis, 17 | leveraging their historical diagnosis records. We compared our results versus 18 | various baselines, including Logistic Regression, RETAIN, and Med-BERT, which 19 | is the current state-of-the-art model for disease prediction using structured 20 | EHR data. Our experiments have shown that CPLLM surpasses all the tested models 21 | in terms of both PR-AUC and ROC-AUC metrics, displaying noteworthy enhancements 22 | compared to the baseline models.",http://arxiv.org/pdf/2309.11295v1,"['Ofir Ben Shoham', 'Nadav Rappoport']" 23 | 2309.11259v1,cs.CL,Sequence-to-Sequence Spanish Pre-trained Language Models,2023-09-20 12:35:19+00:00,"In recent years, substantial advancements in pre-trained language models have 24 | paved the way for the development of numerous non-English language versions, 25 | with a particular focus on encoder-only and decoder-only architectures. While 26 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 27 | prowess in natural language understanding and generation, there remains a 28 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 29 | involving input-output pairs. This paper breaks new ground by introducing the 30 | implementation and evaluation of renowned encoder-decoder architectures, 31 | exclusively pre-trained on Spanish corpora. Specifically, we present Spanish 32 | versions of BART, T5, and BERT2BERT-style models and subject them to a 33 | comprehensive assessment across a diverse range of sequence-to-sequence tasks, 34 | spanning summarization, rephrasing, and generative question answering. Our 35 | findings underscore the competitive performance of all models, with BART and T5 36 | emerging as top performers across all evaluated tasks. 
As an additional 37 | contribution, we have made all models publicly available to the research 38 | community, fostering future exploration and development in Spanish language 39 | processing.",http://arxiv.org/pdf/2309.11259v1,"['Vladimir Araujo', 'Maria Mihaela Trusca', 'Rodrigo Tufiño', 'Marie-Francine Moens']" 40 | 2309.10982v1,cs.AI,Is GPT4 a Good Trader?,2023-09-20 00:47:52+00:00,"Recently, large language models (LLMs), particularly GPT-4, have demonstrated 41 | significant capabilities in various planning and reasoning tasks 42 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 43 | has been a surge of interest among researchers to harness the capabilities of 44 | GPT-4 for the automated design of quantitative factors that do not overlap with 45 | existing factor libraries, with an aspiration to achieve alpha returns 46 | \cite{webpagequant}. In contrast to these work, this study aims to examine the 47 | fidelity of GPT-4's comprehension of classic trading theories and its 48 | proficiency in applying its code interpreter abilities to real-world trading 49 | data analysis. Such an exploration is instrumental in discerning whether the 50 | underlying logic GPT-4 employs for trading is intrinsically reliable. 51 | Furthermore, given the acknowledged interpretative latitude inherent in most 52 | trading theories, we seek to distill more precise methodologies of deploying 53 | these theories from GPT-4's analytical process, potentially offering invaluable 54 | insights to human traders. 55 | To achieve this objective, we selected daily candlestick (K-line) data from 56 | specific periods for certain assets, such as the Shanghai Stock Index. Through 57 | meticulous prompt engineering, we guided GPT-4 to analyze the technical 58 | structures embedded within this data, based on specific theories like the 59 | Elliott Wave Theory. 
We then subjected its analytical output to manual 60 | evaluation, assessing its interpretative depth and accuracy vis-\`a-vis these 61 | trading theories from multiple dimensions. The results and findings from this 62 | study could pave the way for a synergistic amalgamation of human expertise and 63 | AI-driven insights in the realm of trading.",http://arxiv.org/pdf/2309.10982v1,['Bingzhe Wu'] 64 | 2309.12053v1,cs.CL,"AceGPT, Localizing Large Language Models in Arabic",2023-09-21 13:20:13+00:00,"This paper explores the imperative need and methodology for developing a 65 | localized Large Language Model (LLM) tailored for Arabic, a language with 66 | unique cultural characteristics that are not adequately addressed by current 67 | mainstream models like ChatGPT. Key concerns additionally arise when 68 | considering cultural sensitivity and local values. To this end, the paper 69 | outlines a packaged solution, including further pre-training with Arabic texts, 70 | supervised fine-tuning (SFT) using native Arabic instructions and GPT-4 71 | responses in Arabic, and reinforcement learning with AI feedback (RLAIF) using 72 | a reward model that is sensitive to local culture and values. The objective is 73 | to train culturally aware and value-aligned Arabic LLMs that can serve the 74 | diverse application-specific needs of Arabic-speaking communities. 75 | Extensive evaluations demonstrated that the resulting LLM called 76 | `\textbf{AceGPT}' is the SOTA open Arabic LLM in various benchmarks, including 77 | instruction-following benchmark (i.e., Arabic Vicuna-80 and Arabic AlpacaEval), 78 | knowledge benchmark (i.e., Arabic MMLU and EXAMs), as well as the 79 | newly-proposed Arabic cultural \& value alignment benchmark. Notably, AceGPT 80 | outperforms ChatGPT in the popular Vicuna-80 benchmark when evaluated with 81 | GPT-4, despite the benchmark's limited scale. 
% Natural Language Understanding 82 | (NLU) benchmark (i.e., ALUE) 83 | Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.",http://arxiv.org/pdf/2309.12053v1,"['Huang Huang', 'Fei Yu', 'Jianqing Zhu', 'Xuening Sun', 'Hao Cheng', 'Dingjie Song', 'Zhihong Chen', 'Abdulmohsen Alharthi', 'Bang An', 'Ziche Liu', 'Zhiyi Zhang', 'Junying Chen', 'Jianquan Li', 'Benyou Wang', 'Lian Zhang', 'Ruoyu Sun', 'Xiang Wan', 'Haizhou Li', 'Jinchao Xu']" 84 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/Recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2308.14337v1,cs.AI,Cognitive Effects in Large Language Models,2023-08-28 06:30:33+00:00,"Large Language Models (LLMs) such as ChatGPT have received enormous attention 3 | over the past year and are now used by hundreds of millions of people every 4 | day. The rapid adoption of this technology naturally raises questions about the 5 | possible biases such models might exhibit. In this work, we tested one of these 6 | models (GPT-3) on a range of cognitive effects, which are systematic patterns 7 | that are usually found in human cognitive tasks. We found that LLMs are indeed 8 | prone to several human cognitive effects. Specifically, we show that the 9 | priming, distance, SNARC, and size congruity effects were presented with GPT-3, 10 | while the anchoring effect is absent. We describe our methodology, and 11 | specifically the way we converted real-world experiments to text-based 12 | experiments. 
Finally, we speculate on the possible reasons why GPT-3 exhibits 13 | these effects and discuss whether they are imitated or reinvented.",http://arxiv.org/pdf/2308.14337v1,"[arxiv.Result.Author('Jonathan Shaki'), arxiv.Result.Author('Sarit Kraus'), arxiv.Result.Author('Michael Wooldridge')]" 14 | 2308.14921v1,cs.CL,Gender bias and stereotypes in Large Language Models,2023-08-28 22:32:05+00:00,"Large Language Models (LLMs) have made substantial progress in the past 15 | several months, shattering state-of-the-art benchmarks in many domains. This 16 | paper investigates LLMs' behavior with respect to gender stereotypes, a known 17 | issue for prior models. We use a simple paradigm to test the presence of gender 18 | bias, building on but differing from WinoBias, a commonly used gender bias 19 | dataset, which is likely to be included in the training data of current LLMs. 20 | We test four recently published LLMs and demonstrate that they express biased 21 | assumptions about men and women's occupations. Our contributions in this paper 22 | are as follows: (a) LLMs are 3-6 times more likely to choose an occupation that 23 | stereotypically aligns with a person's gender; (b) these choices align with 24 | people's perceptions better than with the ground truth as reflected in official 25 | job statistics; (c) LLMs in fact amplify the bias beyond what is reflected in 26 | perceptions or the ground truth; (d) LLMs ignore crucial ambiguities in 27 | sentence structure 95% of the time in our study items, but when explicitly 28 | prompted, they recognize the ambiguity; (e) LLMs provide explanations for their 29 | choices that are factually inaccurate and likely obscure the true reason behind 30 | their predictions. That is, they provide rationalizations of their biased 31 | behavior. 
This highlights a key property of these models: LLMs are trained on 32 | imbalanced datasets; as such, even with the recent successes of reinforcement 33 | learning with human feedback, they tend to reflect those imbalances back at us. 34 | As with other types of societal biases, we suggest that LLMs must be carefully 35 | tested to ensure that they treat minoritized individuals and communities 36 | equitably.",http://arxiv.org/pdf/2308.14921v1,"[arxiv.Result.Author('Hadas Kotek'), arxiv.Result.Author('Rikker Dockum'), arxiv.Result.Author('David Q. Sun')]" 37 | 2308.15126v1,cs.LG,Evaluation and Analysis of Hallucination in Large Vision-Language Models,2023-08-29 08:51:24+00:00,"Large Vision-Language Models (LVLMs) have recently achieved remarkable 38 | success. However, LVLMs are still plagued by the hallucination problem, which 39 | limits the practicality in many scenarios. Hallucination refers to the 40 | information of LVLMs' responses that does not exist in the visual input, which 41 | poses potential risks of substantial consequences. There has been limited work 42 | studying hallucination evaluation in LVLMs. In this paper, we propose 43 | Hallucination Evaluation based on Large Language Models (HaELM), an LLM-based 44 | hallucination evaluation framework. HaELM achieves an approximate 95% 45 | performance comparable to ChatGPT and has additional advantages including low 46 | cost, reproducibility, privacy preservation and local deployment. Leveraging 47 | the HaELM, we evaluate the hallucination in current LVLMs. Furthermore, we 48 | analyze the factors contributing to hallucination in LVLMs and offer helpful 49 | suggestions to mitigate the hallucination problem. 
Our training data and human 50 | annotation hallucination data will be made public soon.",http://arxiv.org/pdf/2308.15126v1,"[arxiv.Result.Author('Junyang Wang'), arxiv.Result.Author('Yiyang Zhou'), arxiv.Result.Author('Guohai Xu'), arxiv.Result.Author('Pengcheng Shi'), arxiv.Result.Author('Chenlin Zhao'), arxiv.Result.Author('Haiyang Xu'), arxiv.Result.Author('Qinghao Ye'), arxiv.Result.Author('Ming Yan'), arxiv.Result.Author('Ji Zhang'), arxiv.Result.Author('Jihua Zhu'), arxiv.Result.Author('Jitao Sang'), arxiv.Result.Author('Haoyu Tang')]" 51 | 2308.14182v1,cs.CL,Generative AI for Business Strategy: Using Foundation Models to Create Business Strategy Tools,2023-08-27 19:03:12+00:00,"Generative models (foundation models) such as LLMs (large language models) 52 | are having a large impact on multiple fields. In this work, we propose the use 53 | of such models for business decision making. In particular, we combine 54 | unstructured textual data sources (e.g., news data) with multiple foundation 55 | models (namely, GPT4, transformer-based Named Entity Recognition (NER) models 56 | and Entailment-based Zero-shot Classifiers (ZSC)) to derive IT (information 57 | technology) artifacts in the form of a (sequence of) signed business networks. 58 | We posit that such artifacts can inform business stakeholders about the state 59 | of the market and their own positioning as well as provide quantitative 60 | insights into improving their future outlook.",http://arxiv.org/pdf/2308.14182v1,"[arxiv.Result.Author('Son The Nguyen'), arxiv.Result.Author('Theja Tulabandhula')]" 61 | 2308.15930v1,cs.CL,LLaSM: Large Language and Speech Model,2023-08-30 10:12:39+00:00,"Multi-modal large language models have garnered significant interest 62 | recently. Though, most of the works focus on vision-language multi-modal models 63 | providing strong capabilities in following vision-and-language instructions. 
64 | However, we claim that speech is also an important modality through which 65 | humans interact with the world. Hence, it is crucial for a general-purpose 66 | assistant to be able to follow multi-modal speech-and-language instructions. In 67 | this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an 68 | end-to-end trained large multi-modal speech-language model with cross-modal 69 | conversational abilities, capable of following speech-and-language 70 | instructions. Our early experiments show that LLaSM demonstrates a more 71 | convenient and natural way for humans to interact with artificial intelligence. 72 | Specifically, we also release a large Speech Instruction Following dataset 73 | LLaSM-Audio-Instructions. Code and demo are available at 74 | https://github.com/LinkSoul-AI/LLaSM and 75 | https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions 76 | dataset is available at 77 | https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.",http://arxiv.org/pdf/2308.15930v1,"[arxiv.Result.Author('Yu Shu'), arxiv.Result.Author('Siwei Dong'), arxiv.Result.Author('Guangyao Chen'), arxiv.Result.Author('Wenhao Huang'), arxiv.Result.Author('Ruihua Zhang'), arxiv.Result.Author('Daochen Shi'), arxiv.Result.Author('Qiqi Xiang'), arxiv.Result.Author('Yemin Shi')]" 78 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/Feed.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL 2 | 2308.14608v1,cs.LG,AI in the Gray: Exploring Moderation Policies in Dialogic Large Language Models vs. Human Answers in Controversial Topics,2023-08-28 14:23:04+00:00,"The introduction of ChatGPT and the subsequent improvement of Large Language 3 | Models (LLMs) have prompted more and more individuals to turn to the use of 4 | ChatBots, both for information and assistance with decision-making. 
However, 5 | the information the user is after is often not formulated by these ChatBots 6 | objectively enough to be provided with a definite, globally accepted answer. 7 | Controversial topics, such as ""religion"", ""gender identity"", ""freedom of 8 | speech"", and ""equality"", among others, can be a source of conflict as partisan 9 | or biased answers can reinforce preconceived notions or promote disinformation. 10 | By exposing ChatGPT to such debatable questions, we aim to understand its level 11 | of awareness and if existing models are subject to socio-political and/or 12 | economic biases. We also aim to explore how AI-generated answers compare to 13 | human ones. For exploring this, we use a dataset of a social media platform 14 | created for the purpose of debating human-generated claims on polemic subjects 15 | among users, dubbed Kialo. 16 | Our results show that while previous versions of ChatGPT have had important 17 | issues with controversial topics, more recent versions of ChatGPT 18 | (gpt-3.5-turbo) are no longer manifesting significant explicit biases in 19 | several knowledge areas. In particular, it is well-moderated regarding economic 20 | aspects. However, it still maintains degrees of implicit libertarian leaning 21 | toward right-winged ideals which suggest the need for increased moderation from 22 | the socio-political point of view. In terms of domain knowledge on 23 | controversial topics, with the exception of the ""Philosophical"" category, 24 | ChatGPT is performing well in keeping up with the collective human level of 25 | knowledge. Finally, we see that sources of Bing AI have slightly more tendency 26 | to the center when compared to human answers. 
All the analyses we make are 27 | generalizable to other types of biases and domains.",http://arxiv.org/pdf/2308.14608v1 28 | 2308.14921v1,cs.CL,Gender bias and stereotypes in Large Language Models,2023-08-28 22:32:05+00:00,"Large Language Models (LLMs) have made substantial progress in the past 29 | several months, shattering state-of-the-art benchmarks in many domains. This 30 | paper investigates LLMs' behavior with respect to gender stereotypes, a known 31 | issue for prior models. We use a simple paradigm to test the presence of gender 32 | bias, building on but differing from WinoBias, a commonly used gender bias 33 | dataset, which is likely to be included in the training data of current LLMs. 34 | We test four recently published LLMs and demonstrate that they express biased 35 | assumptions about men and women's occupations. Our contributions in this paper 36 | are as follows: (a) LLMs are 3-6 times more likely to choose an occupation that 37 | stereotypically aligns with a person's gender; (b) these choices align with 38 | people's perceptions better than with the ground truth as reflected in official 39 | job statistics; (c) LLMs in fact amplify the bias beyond what is reflected in 40 | perceptions or the ground truth; (d) LLMs ignore crucial ambiguities in 41 | sentence structure 95% of the time in our study items, but when explicitly 42 | prompted, they recognize the ambiguity; (e) LLMs provide explanations for their 43 | choices that are factually inaccurate and likely obscure the true reason behind 44 | their predictions. That is, they provide rationalizations of their biased 45 | behavior. This highlights a key property of these models: LLMs are trained on 46 | imbalanced datasets; as such, even with the recent successes of reinforcement 47 | learning with human feedback, they tend to reflect those imbalances back at us. 
48 | As with other types of societal biases, we suggest that LLMs must be carefully 49 | tested to ensure that they treat minoritized individuals and communities 50 | equitably.",http://arxiv.org/pdf/2308.14921v1 51 | 2308.14752v1,cs.CY,"AI Deception: A Survey of Examples, Risks, and Potential Solutions",2023-08-28 17:59:35+00:00,"This paper argues that a range of current AI systems have learned how to 52 | deceive humans. We define deception as the systematic inducement of false 53 | beliefs in the pursuit of some outcome other than the truth. We first survey 54 | empirical examples of AI deception, discussing both special-use AI systems 55 | (including Meta's CICERO) built for specific competitive situations, and 56 | general-purpose AI systems (such as large language models). Next, we detail 57 | several risks from AI deception, such as fraud, election tampering, and losing 58 | control of AI systems. Finally, we outline several potential solutions to the 59 | problems posed by AI deception: first, regulatory frameworks should subject AI 60 | systems that are capable of deception to robust risk-assessment requirements; 61 | second, policymakers should implement bot-or-not laws; and finally, 62 | policymakers should prioritize the funding of relevant research, including 63 | tools to detect AI deception and to make AI systems less deceptive. 64 | Policymakers, researchers, and the broader public should work proactively to 65 | prevent AI deception from destabilizing the shared foundations of our society.",http://arxiv.org/pdf/2308.14752v1 66 | 2308.14359v1,cs.AI,Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks,2023-08-28 07:11:27+00:00,"Human emotion understanding is pivotal in making conversational technology 67 | mainstream. We view speech emotion understanding as a perception task which is 68 | a more realistic setting. With varying contexts (languages, demographics, etc.) 
69 | different share of people perceive the same speech segment as a non-unanimous 70 | emotion. As part of the ACM Multimedia 2023 Computational Paralinguistics 71 | ChallengE (ComParE) in the EMotion Share track, we leverage their rich dataset 72 | of multilingual speakers and multi-label regression target of 'emotion share' 73 | or perception of that emotion. We demonstrate that the training scheme of 74 | different foundation models dictates their effectiveness for tasks beyond 75 | speech recognition, especially for non-semantic speech tasks like emotion 76 | understanding. This is a very complex task due to multilingual speakers, 77 | variability in the target labels, and inherent imbalance in the regression 78 | dataset. Our results show that HuBERT-Large with a self-attention-based 79 | light-weight sequence model provides 4.6% improvement over the reported 80 | baseline.",http://arxiv.org/pdf/2308.14359v1 81 | 2308.14149v1,cs.CL,"Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models",2023-08-27 16:14:19+00:00,"Generative pre-trained transformer (GPT) models have revolutionized the field 82 | of natural language processing (NLP) with remarkable performance in various 83 | tasks and also extend their power to multimodal domains. Despite their success, 84 | large GPT models like GPT-4 face inherent limitations such as considerable 85 | size, high computational requirements, complex deployment processes, and closed 86 | development loops. These constraints restrict their widespread adoption and 87 | raise concerns regarding their responsible development and usage. The need for 88 | user-friendly, relatively small, and open-sourced alternative GPT models arises 89 | from the desire to overcome these limitations while retaining high performance. 
90 | In this survey paper, we provide an examination of alternative open-sourced 91 | models of large GPTs, focusing on user-friendly and relatively small models 92 | that facilitate easier deployment and accessibility. Through this extensive 93 | survey, we aim to equip researchers, practitioners, and enthusiasts with a 94 | thorough understanding of user-friendly and relatively small open-sourced 95 | models of large GPTs, their current state, challenges, and future research 96 | directions, inspiring the development of more efficient, accessible, and 97 | versatile GPT models that cater to the broader scientific community and advance 98 | the field of general artificial intelligence. The source contents are 99 | continuously updating in https://github.com/GPT-Alternatives/gpt_alternatives.",http://arxiv.org/pdf/2308.14149v1 100 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/html/Feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Some spectral comparison results on infinite quantum graphs

12 |

Patrizio Bifulco, Joachim Kerner

15 | 22 |
24 | In this paper we establish spectral comparison results for Schr\"odinger 25 | operators on a certain class of infinite quantum graphs, using recent results 26 | obtained in the finite setting. We also show that new features do appear on 27 | infinite quantum graphs such as a modified local Weyl law. In this sense, we 28 | regard this paper as a starting point for a more thorough investigation of 29 | spectral comparison results on more general infinite metric graphs.... 30 |
31 | Read More 37 |
38 | 39 |
43 |

Motor crosslinking augments elasticity in active nematics

46 |

Steven A. Redford, Jonathan Colen, Jordan L. Shivers, Sasha Zemsky, Mehdi Molaei, Carlos Floyd, Paul V. Ruijgrok, Vincenzo Vitelli, Zev Bryant, Aaron R. Dinner, Margaret L. Gardel

49 | 56 |
58 | In active materials, uncoordinated internal stresses lead to emergent 59 | long-range flows. An understanding of how the behavior of active materials 60 | depends on mesoscopic (hydrodynamic) parameters is developing, but there 61 | remains a gap in knowledge concerning how hydrodynamic parameters depend on the 62 | properties of microscopic elements. In this work, we combine experiments and 63 | multiscale modeling to relate the structure and dynamics of active nematics 64 | composed of biopolymer filaments and molecular mo... 65 |
66 | Read More 72 |
73 | 74 |
78 |

The stability of unevenly spaced planetary systems

81 |

Sheng Yang, Liangyu Wu, Zekai Zheng, Masahiro Ogihara, Kangrou Guo, Wenzhan Ouyang, Yaxing He

84 | 91 |
93 | Studying the orbital stability of multi-planet systems is essential to 94 | understand planet formation, estimate the stable time of an observed planetary 95 | system, and advance population synthesis models. Although previous studies have 96 | primarily focused on ideal systems characterized by uniform orbital 97 | separations, in reality a diverse range of orbital separations exists among 98 | planets within the same system. This study focuses on investigating the 99 | dynamical stability of systems with non-uniform separa... 100 |
101 | Read More 107 |
108 | 109 |
113 |

Quantized thermal and spin transports of dirty planar topological superconductors

116 |

Sanjib Kumar Das, Bitan Roy

119 | 126 |
128 | Nontrivial bulk topological invariants of quantum materials can leave their 129 | signatures on charge, thermal and spin transports. In two dimensions, their 130 | imprints can be experimentally measured from well-developed multi-terminal Hall 131 | bar arrangements. Here, we numerically compute the low temperature ($T$) 132 | thermal ($\kappa_{xy}$) and zero temperature spin ($\sigma^{sp}_{xy}$) Hall 133 | conductivities, and longitudinal thermal conductance ($G^{th}_{xx}$) of various 134 | paradigmatic two-dimensional fully gapp... 135 |
136 | Read More 142 |
143 | 144 |
148 |

Dielectron production in central Pb$-$Pb collisions at $\sqrt{s_\mathrm{NN}}$ = 5.02 TeV

151 |

ALICE Collaboration

154 | 161 |
163 | The first measurement of the e$^+$e$^-$ pair production at midrapidity and 164 | low invariant mass in central Pb$-$Pb collisions at 165 | $\sqrt{s_{\mathrm{NN}}}=5.02$ TeV at the LHC is presented. The yield of 166 | e$^+$e$^-$ pairs is compared with a cocktail of expected hadronic decay 167 | contributions in the invariant mass ($m_{\rm ee}$) and pair transverse momentum 168 | ($p_{\rm T,ee}$) ranges $m_{\rm ee} < 3.5$ GeV$/c^2$ and $p_{\rm T,ee} < 8$ 169 | GeV$/c$. For $0.18 < m_{\rm ee} < 0.5$ GeV$/c^2$ the ratio of data to the... 170 |
171 | Read More 177 |
178 | 179 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/html/WebFeed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

YaRN: Efficient Context Window Extension of Large Language Models

12 |

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

15 | 22 |
24 | Rotary Position Embeddings (RoPE) have been shown to effectively encode 25 | positional information in transformer-based language models. However, these 26 | models fail to generalize past the sequence length they were trained on. We 27 | present YaRN (Yet another RoPE extensioN method), a compute-efficient method to 28 | extend the context window of such models, requiring 10x less tokens and 2.5x 29 | less training steps than previous methods. Using YaRN, we show that LLaMA 30 | models can effectively utilize and extrapolat... 31 |
32 | Read More 38 |
39 | 40 |
44 |

LLaSM: Large Language and Speech Model

47 |

Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi

50 | 57 |
59 | Multi-modal large language models have garnered significant interest 60 | recently. Though, most of the works focus on vision-language multi-modal models 61 | providing strong capabilities in following vision-and-language instructions. 62 | However, we claim that speech is also an important modality through which 63 | humans interact with the world. Hence, it is crucial for a general-purpose 64 | assistant to be able to follow multi-modal speech-and-language instructions. In 65 | this work, we propose Large Language and Spee... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Large Language Models as Data Preprocessors

82 |

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

85 | 92 |
94 | Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's 95 | LLaMA variants, have marked a significant advancement in artificial 96 | intelligence. Trained on vast amounts of text data, LLMs are capable of 97 | understanding and generating human-like text across a diverse range of topics. 98 | This study expands on the applications of LLMs, exploring their potential in 99 | data preprocessing, a critical stage in data mining and analytics applications. 100 | We delve into the applicability of state-of-the-art... 101 |
102 | Read More 108 |
109 | 110 |
114 |

LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

117 |

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang

120 | 127 |
129 | In recent years, there have been remarkable advancements in the performance 130 | of Transformer-based Large Language Models (LLMs) across various domains. As 131 | these LLMs are deployed for increasingly complex tasks, they often face the 132 | needs to conduct longer reasoning processes or understanding larger contexts. 133 | In these situations, the length generalization failure of LLMs on long 134 | sequences become more prominent. Most pre-training schemes truncate training 135 | sequences to a fixed length (such as 2048 for... 136 |
137 | Read More 143 |
144 | 145 |
149 |

WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model

152 |

Tianyu Wang, Yifan Li, Haitao Lin, Xiangyang Xue, Yanwei Fu

155 | 162 |
164 | Enabling robots to understand language instructions and react accordingly to 165 | visual perception has been a long-standing goal in the robotics research 166 | community. Achieving this goal requires cutting-edge advances in natural 167 | language processing, computer vision, and robotics engineering. Thus, this 168 | paper mainly investigates the potential of integrating the most recent Large 169 | Language Models (LLMs) and existing visual grounding and robotic grasping 170 | system to enhance the effectiveness of the human-ro... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Chain-of-Verification Reduces Hallucination in Large Language Models

12 |

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston...

15 | 22 |
24 | Generation of plausible yet incorrect factual information, termed 25 | hallucination, is an unsolved issue in large language models. We study the 26 | ability of language models to deliberate on the responses they give in order to 27 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 28 | whereby the model first (i) drafts an initial response; then (ii) plans 29 | verification questions to fact-check its draft; (iii) answers those questions 30 | independently so the answers are not biased by other r... 31 |
32 | Read More 38 |
39 | 40 |
44 |

CPLLM: Clinical Prediction with Large Language Models

47 |

Ofir Ben Shoham, Nadav Rappoport...

50 | 57 |
59 | We present Clinical Prediction with Large Language Models (CPLLM), a method 60 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 61 | disease prediction. We utilized quantization and fine-tuned the LLM using 62 | prompts, with the task of predicting whether patients will be diagnosed with a 63 | target disease during their next visit or in the subsequent diagnosis, 64 | leveraging their historical diagnosis records. We compared our results versus 65 | various baselines, including Logistic Regr... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Sequence-to-Sequence Spanish Pre-trained Language Models

82 |

Vladimir Araujo, Maria Mihaela Trusca, Rodrigo Tufiño, Marie-Francine Moens...

85 | 92 |
94 | In recent years, substantial advancements in pre-trained language models have 95 | paved the way for the development of numerous non-English language versions, 96 | with a particular focus on encoder-only and decoder-only architectures. While 97 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 98 | prowess in natural language understanding and generation, there remains a 99 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 100 | involving input-output pairs. This paper br... 101 |
102 | Read More 108 |
109 | 110 |
114 |

Is GPT4 a Good Trader?

117 |

Bingzhe Wu...

120 | 127 |
129 | Recently, large language models (LLMs), particularly GPT-4, have demonstrated 130 | significant capabilities in various planning and reasoning tasks 131 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 132 | has been a surge of interest among researchers to harness the capabilities of 133 | GPT-4 for the automated design of quantitative factors that do not overlap with 134 | existing factor libraries, with an aspiration to achieve alpha returns 135 | \cite{webpagequant}. In contrast to these work, th... 136 |
137 | Read More 143 |
144 | 145 |
149 |

AceGPT, Localizing Large Language Models in Arabic

152 |

Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu...

155 | 162 |
164 | This paper explores the imperative need and methodology for developing a 165 | localized Large Language Model (LLM) tailored for Arabic, a language with 166 | unique cultural characteristics that are not adequately addressed by current 167 | mainstream models like ChatGPT. Key concerns additionally arise when 168 | considering cultural sensitivity and local values. To this end, the paper 169 | outlines a packaged solution, including further pre-training with Arabic texts, 170 | supervised fine-tuning (SFT) using native Arabic inst... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/ref_feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Chain-of-Verification Reduces Hallucination in Large Language Models

12 |

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston...

15 | 22 |
24 | Generation of plausible yet incorrect factual information, termed 25 | hallucination, is an unsolved issue in large language models. We study the 26 | ability of language models to deliberate on the responses they give in order to 27 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 28 | whereby the model first (i) drafts an initial response; then (ii) plans 29 | verification questions to fact-check its draft; (iii) answers those questions 30 | independently so the answers are not biased by other r... 31 |
32 | Read More 38 |
39 | 40 |
44 |

CPLLM: Clinical Prediction with Large Language Models

47 |

Ofir Ben Shoham, Nadav Rappoport...

50 | 57 |
59 | We present Clinical Prediction with Large Language Models (CPLLM), a method 60 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 61 | disease prediction. We utilized quantization and fine-tuned the LLM using 62 | prompts, with the task of predicting whether patients will be diagnosed with a 63 | target disease during their next visit or in the subsequent diagnosis, 64 | leveraging their historical diagnosis records. We compared our results versus 65 | various baselines, including Logistic Regr... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Sequence-to-Sequence Spanish Pre-trained Language Models

82 |

Vladimir Araujo, Maria Mihaela Trusca, Rodrigo Tufiño, Marie-Francine Moens...

85 | 92 |
94 | In recent years, substantial advancements in pre-trained language models have 95 | paved the way for the development of numerous non-English language versions, 96 | with a particular focus on encoder-only and decoder-only architectures. While 97 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 98 | prowess in natural language understanding and generation, there remains a 99 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 100 | involving input-output pairs. This paper br... 101 |
102 | Read More 108 |
109 | 110 |
114 |

Is GPT4 a Good Trader?

117 |

Bingzhe Wu...

120 | 127 |
129 | Recently, large language models (LLMs), particularly GPT-4, have demonstrated 130 | significant capabilities in various planning and reasoning tasks 131 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 132 | has been a surge of interest among researchers to harness the capabilities of 133 | GPT-4 for the automated design of quantitative factors that do not overlap with 134 | existing factor libraries, with an aspiration to achieve alpha returns 135 | \cite{webpagequant}. In contrast to these work, th... 136 |
137 | Read More 143 |
144 | 145 |
149 |

AceGPT, Localizing Large Language Models in Arabic

152 |

Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu...

155 | 162 |
164 | This paper explores the imperative need and methodology for developing a 165 | localized Large Language Model (LLM) tailored for Arabic, a language with 166 | unique cultural characteristics that are not adequately addressed by current 167 | mainstream models like ChatGPT. Key concerns additionally arise when 168 | considering cultural sensitivity and local values. To this end, the paper 169 | outlines a packaged solution, including further pre-training with Arabic texts, 170 | supervised fine-tuning (SFT) using native Arabic inst... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Recommender/rec_sys.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import gzip 3 | import numpy as np 4 | from tqdm import tqdm 5 | import pandas as pd 6 | from arxiv.arxiv import Search 7 | from ScholarlyRecommender.const import BASE_REPO 8 | from ScholarlyRecommender.config import get_config 9 | 10 | config = get_config() 11 | 12 | logging.basicConfig( 13 | level=logging.INFO, 14 | format="%(asctime)s [%(levelname)s]: %(message)s", 15 | handlers=[logging.StreamHandler()], 16 | ) 17 | # logging.disable(logging.CRITICAL) 18 | 19 | 20 | def rankerV3( 21 | context: pd.DataFrame, labels: pd.DataFrame, k: int = 6, on: str = "Abstract" 22 | ) -> pd.DataFrame: 23 | """ 24 | Rank the papers in the context using the normalized compression distance 25 | combined with a weighted top-k mean rating. 26 | Return a DataFrame with a predicted rating for each paper in the context. 27 | This is a modified version of the algorithm described in the paper "“Low-Resource” 28 | Text Classification: A Parameter-Free Classification Method with Compressors." 29 | For each paper in the context, the algorithm finds the top k most similar papers 30 | that the user rated and calculates the weighted mean rating of those papers as its prediction.
31 | """ 32 | likes = labels 33 | candidates = context 34 | 35 | train = np.array([(row[on], row["label"]) for _, row in likes.iterrows()]) 36 | test = np.array([(row[on], row["Id"]) for _, row in candidates.iterrows()]) 37 | 38 | results = [] 39 | # logging.info(f"Starting to rank {len(test)} candidates on {on}\n") 40 | # print(f"Starting to rank {len(test)} candidates on {on}\n") 41 | for x1, id in tqdm(test, disable=False): 42 | # calculate the compressed length of the utf-8 encoded text 43 | Cx1 = len(gzip.compress(x1.encode())) 44 | # create a distance array 45 | similarity_to_x1 = [] 46 | for x2, _ in train: 47 | # calculate the compressed length of the utf-8 encoded text 48 | Cx2 = len(gzip.compress(x2.encode())) 49 | # concatenate the two texts 50 | x1x2 = " ".join([x1, x2]) 51 | # calculate the compressed length of the utf-8 encoded concatenated text 52 | Cx1x2 = len(gzip.compress(x1x2.encode())) 53 | # calculate the normalized compression distance 54 | ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 55 | 56 | similarity_to_x1.append(ncd) 57 | 58 | # calculate the similarity weights for the top k most similar papers 59 | # Converting the list to a numpy array for vectorized operations 60 | similarity_to_x1 = np.array(similarity_to_x1) 61 | # sort the array and get the top k most similar papers 62 | sorted_idx = np.argsort(similarity_to_x1) 63 | values = similarity_to_x1[sorted_idx[:k]] 64 | # calculate the similarity weights for the top k most similar papers 65 | weights = values / np.sum(values) 66 | # Weights are inverted so that the most similar papers (lowest distance) get the largest weight 67 | inverse_weights = 1 / weights 68 | inverse_weights_norm = (inverse_weights) / np.sum(inverse_weights) 69 | # get the top k ratings 70 | top_k_ratings = train[sorted_idx[:k], 1].astype(int) 71 | # calculate the prediction as the inverse weighted mean of the top k ratings 72 | prediction = np.sum(np.dot(inverse_weights_norm, top_k_ratings)) 73 | 74 | results.append((prediction,
id)) 75 | 76 | df = pd.DataFrame(results, columns=["predicted", "Id"]) 77 | return df 78 | 79 | 80 | def rank(context, labels=None, n: int = 5) -> list: 81 | """ 82 | Run the rankerV3 algorithm on the context and return a list 83 | of the top n ranked papers. 84 | """ 85 | if labels is None: 86 | labels = config["labels"] 87 | if isinstance(labels, str): 88 | labels = pd.read_csv(labels) 89 | if not isinstance(labels, pd.DataFrame): 90 | raise TypeError("labels must be a pandas DataFrame") 91 | 92 | df1 = rankerV3(context, labels, on="Abstract") 93 | df2 = rankerV3(context, labels, on="Title") 94 | df1["predicted"] = (df1["predicted"] + df2["predicted"]) / 2 95 | df1["rank"] = df1["predicted"].rank(ascending=False) 96 | df1 = df1.nsmallest(n, "rank") 97 | df1["Id"] = df1["Id"].astype(str) 98 | recommended = df1["Id"].iloc[:n].tolist() 99 | 100 | return recommended 101 | 102 | 103 | def rank2(context, labels=None, n: int = 5) -> list: 104 | """ 105 | Run the rankerV3 algorithm on the context and return a list 106 | of the top n ranked papers.
107 | """ 108 | if labels is None: 109 | labels = config["labels"] 110 | if isinstance(labels, str): 111 | labels = pd.read_csv(labels) 112 | if not isinstance(labels, pd.DataFrame): 113 | raise TypeError("labels must be a pandas DataFrame") 114 | train = context[["Id", "Title", "Abstract"]].copy() 115 | test = labels[["Title", "Abstract", "label"]].copy() 116 | train["content"] = train["Title"] + ": " + train["Abstract"] 117 | test["content"] = test["Title"] + ": " + test["Abstract"] 118 | train.drop(["Title", "Abstract"], axis=1, inplace=True) 119 | test.drop(["Title", "Abstract"], axis=1, inplace=True) 120 | df1 = rankerV3(train, test, on="content") 121 | df1["rank"] = df1["predicted"].rank(ascending=False) 122 | df1 = df1.nsmallest(n, "rank") 123 | df1["Id"] = df1["Id"].astype(str) 124 | recommended = df1["Id"].iloc[:n].tolist() 125 | 126 | return recommended 127 | 128 | 129 | def evaluate(k: int = 6, on: str = "Abstract") -> float: 130 | """ 131 | Evaluate the recommender system using the normalized compression distance. 132 | Calculate the root mean squared error (RMSE) between the predicted and actual ratings. 133 | Return the loss.
134 | """ 135 | likes = pd.read_csv(config["labels"]) 136 | # Set train and test equal to 90% and 10% of the data respectively 137 | train_data = likes.sample(frac=0.9, random_state=0) 138 | test_data = likes.drop(train_data.index) 139 | 140 | train = np.array([(row[on], row["label"]) for _, row in train_data.iterrows()]) 141 | test = np.array([(row[on], row["label"]) for _, row in test_data.iterrows()]) 142 | results = [] 143 | # logging.info(f"Starting to rank {len(test)} candidates...\n") 144 | # print(f"Starting to rank {len(test)} candidates...\n") 145 | for x1, label in tqdm(test): 146 | # calculate the compressed length of the utf-8 encoded text 147 | Cx1 = len(gzip.compress(x1.encode())) 148 | # create a distance array 149 | similarity_to_x1 = [] 150 | for x2, _ in train: 151 | # calculate the compressed length of the utf-8 encoded text 152 | Cx2 = len(gzip.compress(x2.encode())) 153 | # concatenate the two texts 154 | x1x2 = " ".join([x1, x2]) 155 | # calculate the compressed length of the utf-8 encoded concatenated text 156 | Cx1x2 = len(gzip.compress(x1x2.encode())) 157 | # calculate the normalized compression distance 158 | ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 159 | 160 | similarity_to_x1.append(ncd) 161 | # Converting the list to a numpy array for vectorized operations 162 | similarity_to_x1 = np.array(similarity_to_x1) 163 | # sort the array and get the top k most similar papers 164 | sorted_idx = np.argsort(similarity_to_x1) 165 | values = similarity_to_x1[sorted_idx[:k]] 166 | # calculate the similarity weights for the top k most similar papers 167 | weights = values / np.sum(values) 168 | # Weights are inverted so that the most similar papers (lowest distance) get the largest weight 169 | inverse_weights = 1 / weights 170 | inverse_weights_norm = (inverse_weights) / np.sum(inverse_weights) 171 | # get the top k ratings 172 | top_k_ratings = train[sorted_idx[:k], 1].astype(int) 173 | # calculate the prediction as the inverse weighted mean of the top k
ratings 174 | prediction = np.sum(np.dot(inverse_weights_norm, top_k_ratings)) 175 | 176 | results.append((prediction, label)) 177 | 178 | df = pd.DataFrame(results, columns=["predicted", "actual"]) 179 | 180 | df["actual"] = df["actual"].astype(int) 181 | # calculate the mean squared error 182 | df["squared_error"] = (df["predicted"] - df["actual"]) ** 2 183 | # loss function 184 | loss = np.sqrt(df["squared_error"].sum() / df.shape[0]) 185 | return loss 186 | 187 | 188 | def fetch(ids: list) -> pd.DataFrame: 189 | """ 190 | Fetch papers from arxiv.org matching the ids and return a dataframe matching 191 | the BASE_REPO including the authors. 192 | """ 193 | # logging.info(f"Fetching {len(ids)} papers from arxiv... \n") 194 | # print(f"Fetching {len(ids)} papers from arxiv... \n") 195 | repository = BASE_REPO() 196 | repository["Author"] = [] 197 | search = Search( 198 | query="", 199 | id_list=ids, 200 | ) 201 | for result in search.results(): 202 | repository["Id"].append(result.entry_id.split("/")[-1]) 203 | repository["Category"].append(result.primary_category) 204 | repository["Title"].append(result.title.strip("\n")) 205 | repository["Published"].append(result.published) 206 | repository["Abstract"].append(result.summary.strip("\n")) 207 | repository["URL"].append(result.pdf_url) 208 | repository["Author"].append([author.name for author in result.authors]) 209 | 210 | return pd.DataFrame(repository) 211 | 212 | 213 | def get_recommendations( 214 | data, 215 | labels, 216 | size: int = None, 217 | to_path: str = None, 218 | as_df: bool = False, 219 | ): 220 | """ 221 | Rank the papers in the data and return a dataframe or save it to a csv file. 222 | Data can be a pandas DataFrame or a path to a csv file. 
223 | """ 224 | if labels is None: 225 | labels = config["labels"] 226 | if size is None: 227 | size = config["feed_length"] 228 | if isinstance(data, pd.DataFrame): 229 | df = data 230 | df.reset_index(inplace=True) 231 | elif isinstance(data, str): 232 | df = pd.read_csv(data) 233 | else: 234 | raise TypeError("data must be a pandas DataFrame or a path to a csv file") 235 | if size < 0 or size > len(df.index): 236 | raise ValueError( 237 | "size must be greater than 0 and less than the length of the data" 238 | ) 239 | 240 | recommended = rank2( 241 | context=df, 242 | labels=labels, 243 | n=size, 244 | ) 245 | feed = fetch(recommended) 246 | if to_path is not None: 247 | feed.set_index("Id").to_csv(to_path) 248 | # logging.info(f"Feed saved to {to_path} \n") 249 | # print(f"Feed saved to {to_path} \n") 250 | # feed.to_csv(to_path) 251 | if as_df: 252 | return feed 253 | return None 254 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 |
4 | DeepSource 5 |
6 |
7 | DeepSource 8 |
9 | 10 | 11 |
12 |
13 | 14 | Logo 15 | 16 | 17 |

Scholarly Recommender

18 | 19 |

20 | End-to-end product that sources recent academic publications and prepares a feed of recommended readings in seconds. 21 |
22 | Try it now » 23 |
24 |
25 | Explore the Docs 26 | · 27 | Report Bug 28 | · 29 | Request Feature 30 |

31 |
32 | 33 | 34 |
35 | Table of Contents 36 |
    37 |
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Methods
  9. Acknowledgements
58 |
59 | 60 | ## About The Project 61 | 62 |
63 | 64 | 65 | 66 |
67 | 68 | 69 | As an upcoming data scientist with a strong passion for deep learning, I am always looking for new technologies and methodologies. Naturally, I spend a considerable amount of time researching and reading new publications to accomplish this. However, **over 14,000 academic papers are published every day** on average, making it extremely tedious to continuously source papers relevant to my interests. My primary motivation for creating ScholarlyRecommender is to address this by creating a fully automated, personalized system that prepares a feed of academic papers relevant to me. This feed is prepared on demand through a completely abstracted Streamlit web interface, or sent directly to my email on a timed basis. The project was designed to be scalable and adaptable: it can easily be tuned to your own interests, or even become a fully automated, self-improving newsletter. Details on how to use the system, the methods used for retrieval and ranking, and the features currently planned or in development are listed below. 70 | 71 | 72 |

(back to top)

73 | 74 | ### Built With 75 | 76 | [![Python][Python.com]][Python-url] 77 | [![Streamlit][Streamlit.com]][Python-url] 78 | [![Pandas][Pandas.com]][Pandas-url] 79 | [![NumPy][Numpy.com]][Numpy-url] 80 | [![Arxiv.arxiv][Arxiv.arxiv.com]][Arxiv.arxiv-url] 81 | 82 |

(back to top)

83 | 84 | 85 | 86 | 87 | ## Getting Started 88 | 89 | To try ScholarlyRecommender, you can use the Streamlit web application found [Here](https://scholarlyrecommender.streamlit.app/). This will allow you to use the system in its entirety without needing to install anything. If you want to modify the system internally or add functionality, you can follow the directions below to install it locally. 90 | 91 | ### Prerequisites 92 | 93 | In order to install this app locally you need to have the following: 94 | - git 95 | - python3.9 or greater *(earlier versions may work)* 96 | - pip3 97 | 98 | 99 | ### Installation 100 | 101 | To install ScholarlyRecommender, run the following in your command line shell: 102 | 103 | 1. Clone the repository from GitHub and cd into it 104 | ```sh 105 | git clone https://github.com/iansnyder333/ScholarlyRecommender.git 106 | cd ScholarlyRecommender 107 | ``` 108 | 2. Set up the environment and install dependencies 109 | ```sh 110 | make build 111 | ``` 112 | All done, ScholarlyRecommender is now installed. 113 | You can now run the app with 114 | ```sh 115 | make run 116 | ``` 117 | 118 |

(back to top)

119 | 120 | 121 | 122 | 123 | ## Usage 124 | 125 | Once installed, you want to calibrate the system to your own interests. The easiest way to do this is using the webapp.py file. Alternatively, you can use calibrate.py, which runs on the console. 126 | 127 | Make sure you have cd'd into the parent folder of the cloned repo. 128 | 129 | Run this in your terminal as follows: 130 | ```sh 131 | make run 132 | ``` 133 | This is the same as running: 134 | ```sh 135 | streamlit run webapp.py 136 | ``` 137 | 138 | Navigate to the configure tab and complete the steps. You can now navigate back to the get recommendations tab and generate results! 139 | The web app offers full functionality and serves as an API to the system; while using the webapp, updates made to the configuration are applied and refreshed continuously. 140 | 141 | **Note:** If you are using ScholarlyRecommender locally, certain features such as direct email will not work, as the original application's database will not be available. If you want to configure the email feature for yourself, you can follow the instructions provided in [mail.py](https://github.com/iansnyder333/ScholarlyRecommender/blob/main/ScholarlyRecommender/Newsletter/mail.py). This will require some proficiency/familiarity with SMTP. If you are having issues please feel free to check the [docs](https://github.com/iansnyder333/ScholarlyRecommender/tree/main/docs), or make a discussion post [here](https://github.com/iansnyder333/ScholarlyRecommender/discussions) and someone will help you out. 142 | 143 |

(back to top)

144 | 145 | 146 | ## Roadmap 147 | 148 | - [x] ✅ Adding email support on the web app ✅ 149 | - [ ] OS support, specifically for Windows. 150 | - [ ] Shell scripts to make installs, updates, and usage easier. 151 | - [ ] A database to store user configurations; currently zero user data is saved. This would also improve data locality and caching for a better user experience. 152 | - [ ] Making it easier to give feedback on suggested papers to improve the system. 153 | - [ ] Improving the overall labeling experience, specifically making the pooling process and labeling setup more integrated. 154 | - [ ] Improving modularity in the webapp and improving caching for faster performance. 155 | - [ ] Many visual and user experience improvements; a complete overhaul of the UX is likely. 156 | - [ ] Allowing users to navigate between pages without using the Navbar; Streamlit does not currently support this directly. 157 | 158 | 159 | See the [open issues](https://github.com/iansnyder333/ScholarlyRecommender/issues) for a full list of proposed features (and known issues). 160 | 161 |

(back to top)

162 | 163 | 164 | 165 | 166 | ## Contributing 167 | 168 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**. 169 | 170 | If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". 171 | Don't forget to give the project a star! Thanks again! 172 | 173 | 1. Fork the Project 174 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 175 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 176 | 4. Push to the Branch (`git push origin feature/AmazingFeature`) 177 | 5. Open a Pull Request 178 | 179 |

(back to top)

180 | 181 | 182 | ## License 183 | 184 | Distributed under the apache license 2.0. See `LICENSE` for more information. 185 | 186 |

(back to top)

187 | 188 | 189 | 190 | 191 | ## Contact 192 | 193 | Ian Snyder - [@iansnydes](https://twitter.com/iansnydes) - idsnyder136@gmail.com 194 | 195 | Project Email - scholarlyrecommender@gmail.com 196 | 197 | My Website: [iansnyder333.github.io/frontend/](https://iansnyder333.github.io/frontend/) 198 | 199 | Linkedin: [www.linkedin.com/in/iandsnyder](https://www.linkedin.com/in/iandsnyder/) 200 | 201 |

(back to top)

202 | 203 | 204 | ## Methods 205 | 206 | Once candidates are sourced in the context of the configuration, they are ranked. The ranking process uses the normalized compression distance combined with an inverse-weighted top-k mean rating from the candidates to the labeled papers. This is a modified version of the algorithm described in the paper "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors" (1). For each candidate paper, the algorithm finds the top k most similar papers that the user rated and calculates a weighted mean rating of those papers as its prediction. The results are then sorted by highest predicted rating and returned up to the desired feed length. 207 | 208 | While using a large language model such as BERT might yield higher accuracy, this approach is considerably more lightweight, can run on basically any computer, and requires virtually no labeled data to source relevant content. If this project scales to the capacity of a self-improving newsletter, implementing a sophisticated deep learning model such as a transformer could be a worthwhile addition. 209 | 210 |
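The ranking described above can be sketched in a few lines of Python. This is a simplified illustration, not the exact implementation in `rec_sys.py`; the `ncd` and `predict_rating` helper names are ours, not part of the package's API:

```python
import gzip

import numpy as np


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts, using gzip."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress(" ".join([a, b]).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)


def predict_rating(candidate: str, rated, k: int = 6) -> float:
    """Predict a rating for `candidate` as the inverse-distance-weighted
    mean rating of the k most similar user-rated papers.
    `rated` is a list of (text, rating) pairs."""
    dists = np.array([ncd(candidate, text) for text, _ in rated])
    ratings = np.array([rating for _, rating in rated], dtype=float)
    top = np.argsort(dists)[:k]
    # Lower distance means more similar, so weight by inverse distance;
    # the epsilon guards against division by zero for near-identical texts.
    inv = 1.0 / np.maximum(dists[top], 1e-9)
    weights = inv / inv.sum()
    return float(weights @ ratings[top])
```

Because the prediction is a convex combination of the top-k ratings, it always falls within the range of the user's own labels, which keeps the scores easy to interpret.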

(back to top)

211 | 212 | 213 | ## Acknowledgements 214 | 215 | **1** - [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426) (Jiang et al., Findings 2023) 216 | 217 | **README Template** - [Best-README-Template](https://github.com/othneildrew/Best-README-Template) by ([othneildrew](https://github.com/othneildrew)) 218 | 219 |

(back to top)

220 | 221 | 222 | 223 | 224 | [Python.com]:https://img.shields.io/badge/Python-blue 225 | [Python-url]:https://www.python.org/ 226 | [Streamlit.com]:https://img.shields.io/badge/Streamlit-red 227 | [Streamlit-url]:https://streamlit.io/ 228 | [Pandas.com]:https://img.shields.io/badge/pandas-purple 229 | [Pandas-url]:https://pandas.pydata.org/ 230 | [Numpy.com]:https://img.shields.io/badge/NumPy-%23ADD8E6 231 | [Numpy-url]:https://numpy.org/ 232 | [Arxiv.arxiv.com]:https://img.shields.io/badge/Arxiv-%23FF0000 233 | [Arxiv.arxiv-url]:http://lukasschwab.me/arxiv.py/index.html 234 | 235 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /testing.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module contains unit tests for the ScholarlyRecommender package. 3 | Before pushing to the repository, please run the tests to ensure that the package 4 | is working as expected. 5 | More tests will be added as the package is developed. 6 | 7 | TODO Test feed outputs for email, webapp, etc. Add more tests for the webapp, 8 | and more comprehensive tests for the API. 
9 | """ 10 | import ScholarlyRecommender as sr 11 | import pandas as pd 12 | from json import load 13 | import unittest 14 | import time 15 | import tracemalloc 16 | 17 | 18 | # Test Constants 19 | # Base directory for testing 20 | BASE_TEST_DIR = "ScholarlyRecommender/Repository/tests/" 21 | TEST_INPUT_DIR = BASE_TEST_DIR + "inputs/" 22 | TEST_OUTPUT_DIR = BASE_TEST_DIR + "outputs/" 23 | # Test Configuration files 24 | TEST_CONFIG_PATH = TEST_INPUT_DIR + "test_configuration.json" 25 | TEST_CONFIG_UPDATE_OUT = TEST_OUTPUT_DIR + "test_configuration_update.json" 26 | # Test Reference input files 27 | REF_CANDIDATES_PATH = TEST_INPUT_DIR + "ref_candidates.csv" 28 | REF_LABELS_PATH = TEST_INPUT_DIR + "ref_labels.csv" 29 | REF_RECOMMENDATIONS_PATH = TEST_INPUT_DIR + "ref_recommendations.csv" 30 | # Test Output files 31 | TEST_CANDIDATES_OUT = TEST_OUTPUT_DIR + "test_candidates.csv" 32 | TEST_LABELS_OUT = TEST_OUTPUT_DIR + "test_labels.csv" 33 | TEST_RECOMMENDATIONS_OUT = TEST_OUTPUT_DIR + "test_recommendations.csv" 34 | TEST_FEED_OUT = TEST_OUTPUT_DIR + "test_feed.html" 35 | 36 | 37 | class TestScholarlyRecommender(unittest.TestCase): 38 | """ 39 | CONFIGURATION TESTS 40 | 41 | First, these tests ensure that the current configuration is valid, validating 42 | the keys, value types, and value shapes against the reference. 43 | Second, the retrieval and update functions for the ScholarlyRecommender 44 | configuration API are tested. 45 | 46 | ScholarlyRecommender API TESTS 47 | 48 | These tests ensure that the ScholarlyRecommender API is working as expected. 49 | Each test checks that the output is the correct shape and type, in both return 50 | formats. These tests are run under the assumption that the configuration is valid. 
51 | """ 52 | 53 | def setUp(self): 54 | """Set up the test environment for unittest""" 55 | with open(TEST_CONFIG_PATH) as json_file: 56 | self.ref_config = load(json_file) 57 | self.config = sr.get_config() 58 | self.candidates = pd.read_csv(REF_CANDIDATES_PATH) 59 | self.candidates_labeled = pd.read_csv(REF_LABELS_PATH) 60 | self.recommendations = pd.read_csv(REF_RECOMMENDATIONS_PATH) 61 | 62 | # Test the configuration file and its contents 63 | def test_config(self): 64 | """Check that the config file is valid""" 65 | self.assertEqual(self.config.keys(), self.ref_config.keys()) 66 | 67 | def test_config_query(self): 68 | """Test the queries""" 69 | config_queries = self.config["queries"] 70 | self.assertTrue(isinstance(config_queries, list)) 71 | self.assertTrue(len(config_queries) > 0) 72 | self.assertTrue(all(isinstance(item, str) for item in config_queries)) 73 | 74 | def test_config_labels(self): 75 | """Test the labels""" 76 | config_labels = pd.read_csv(self.config["labels"]) 77 | expected_columns = list(self.candidates_labeled.columns) 78 | columns = list(config_labels.columns) 79 | self.assertEqual(columns, expected_columns) 80 | 81 | def test_config_feed_length(self): 82 | """Test the feed length""" 83 | self.assertTrue(isinstance(self.config["feed_length"], int)) 84 | self.assertTrue(self.config["feed_length"] > 0) 85 | self.assertTrue(self.config["feed_length"] <= 10) 86 | 87 | def test_get_config(self): 88 | """Test that the config retrieval function works as expected.""" 89 | with open("ScholarlyRecommender/configuration.json") as json_file: 90 | expected_config = load(json_file) 91 | 92 | config = sr.get_config() 93 | self.assertEqual(config, expected_config) 94 | 95 | def test_update_config(self): 96 | """Test that the config update function works as expected.""" 97 | # Change as necessary 98 | expected_config = { 99 | "queries": ["Computer Science", "Mathematics"], 100 | "labels": BASE_TEST_DIR + "test_candidates_labeled.csv", 101 | "feed_length": 
7, 102 | "feed_path": BASE_TEST_DIR + "test_feed.html", 103 | } 104 | sr.update_config( 105 | expected_config, test_mode=True, test_path=TEST_CONFIG_UPDATE_OUT 106 | ) 107 | with open(TEST_CONFIG_UPDATE_OUT) as json_file: 108 | config = load(json_file) 109 | 110 | self.assertEqual(config, expected_config) 111 | 112 | def test_source_candidates(self): 113 | """ 114 | Test that the outputs from candidate sourcing are the correct 115 | shape and type. 116 | """ 117 | out = TEST_CANDIDATES_OUT 118 | df_candidates = sr.source_candidates( 119 | queries=self.config["queries"], 120 | as_df=True, 121 | prev_days=7, 122 | to_path=out, 123 | ) 124 | df_candidates.reset_index(inplace=True) 125 | 126 | candidates = pd.read_csv(out) 127 | expected_columns = list(self.candidates.columns) 128 | expected_dtypes = self.candidates.dtypes.astype(str).to_dict() 129 | 130 | # Compare the column names 131 | self.assertListEqual(list(candidates.columns), expected_columns) 132 | self.assertListEqual(list(df_candidates.columns), expected_columns) 133 | 134 | # Compare the data types of each column 135 | self.assertDictEqual(candidates.dtypes.astype(str).to_dict(), expected_dtypes) 136 | 137 | def test_get_recommendations(self): 138 | """Test that the outputs from the ranking are the correct shape and type""" 139 | out = TEST_RECOMMENDATIONS_OUT 140 | df_recommendations = sr.get_recommendations( 141 | data=REF_CANDIDATES_PATH, 142 | labels=REF_LABELS_PATH, 143 | to_path=out, 144 | as_df=True, 145 | ) 146 | 147 | recommendations = pd.read_csv(out) 148 | expected_columns = list(self.recommendations.columns) 149 | expected_dtypes = self.recommendations.dtypes.astype(str).to_dict() 150 | 151 | # Compare the column names 152 | self.assertListEqual(list(recommendations.columns), expected_columns) 153 | self.assertListEqual(list(df_recommendations.columns), expected_columns) 154 | # Compare the data types of each column 155 | self.assertDictEqual( 156 | recommendations.dtypes.astype(str).to_dict(), 
expected_dtypes 157 | ) 158 | 159 | 160 | class BenchmarkTests: 161 | """ 162 | This class contains basic benchmarking tests for the ScholarlyRecommender package. 163 | The benchmarking tests are run under the assumption that the configuration is valid. 164 | It reports the runtime and memory usage of each function in the package, as well as 165 | the total runtime and memory usage for the whole pipeline. 166 | """ 167 | 168 | def __init__(self): 169 | self.config = sr.get_config() 170 | self.candidates = pd.read_csv(REF_CANDIDATES_PATH) 171 | self.candidates_labeled = pd.read_csv(REF_LABELS_PATH) 172 | self.recommendations = pd.read_csv(REF_RECOMMENDATIONS_PATH) 173 | 174 | def benchmark(self, save_log=False): 175 | """Run the benchmarking tests.""" 176 | print("\n Running benchmarks... \n") 177 | times = [] 178 | memory = [] 179 | tracemalloc.start() 180 | full_start_time = time.time() 181 | 182 | start_time = time.time() 183 | can = sr.source_candidates( 184 | queries=self.config["queries"], 185 | as_df=True, 186 | prev_days=7, 187 | ) 188 | elapsed_time = time.time() - start_time 189 | times.append(elapsed_time) 190 | memory.append(tracemalloc.get_traced_memory()) 191 | self._display( 192 | "Source Candidates", 193 | times[0], 194 | memory[0][0], 195 | memory[0][1], 196 | ) 197 | 198 | start_time = time.time() 199 | rec = sr.get_recommendations( 200 | data=REF_CANDIDATES_PATH, 201 | labels=REF_LABELS_PATH, 202 | as_df=True, 203 | ) 204 | elapsed_time = time.time() - start_time 205 | times.append(elapsed_time) 206 | memory.append(tracemalloc.get_traced_memory()) 207 | self._display( 208 | "Get Recommendations", 209 | times[1], 210 | memory[1][0] - memory[0][0], 211 | memory[1][1], 212 | ) 213 | 214 | start_time = time.time() 215 | fee = sr.get_feed( 216 | data=REF_RECOMMENDATIONS_PATH, 217 | to_path=TEST_FEED_OUT, 218 | ) 219 | elapsed_time = time.time() - start_time 220 | 221 | times.append(elapsed_time) 222 | memory.append(tracemalloc.get_traced_memory()) 223 | 
self._display( 224 | "Get Feed", 225 | times[2], 226 | memory[2][0] - memory[1][0], 227 | memory[2][1], 228 | ) 229 | 230 | full_time = time.time() - full_start_time 231 | tracemalloc.stop() 232 | self._display( 233 | "Total", 234 | full_time, 235 | memory[2][0], 236 | memory[2][1], 237 | ) 238 | if save_log: 239 | return times, memory 240 | return None 241 | 242 | def benchmark_sc(self): 243 | """ 244 | Benchmark the source_candidates function 245 | compared to the fast_search function. NOTE: both loops below currently time source_candidates; swap the first loop's call for the fast_search variant to make this a real comparison. 246 | """ 247 | # Run this test 3 times and display the average 248 | times = [] 249 | memory = [] 250 | for i in range(3): 251 | tracemalloc.start() 252 | start_time = time.time() 253 | can = sr.source_candidates( 254 | queries=self.config["queries"], 255 | as_df=True, 256 | prev_days=7, 257 | ) 258 | elapsed_time = time.time() - start_time 259 | times.append(elapsed_time) 260 | memory.append(tracemalloc.get_traced_memory()) 261 | tracemalloc.stop() 262 | self._display( 263 | "Average Fast Search", 264 | sum(times) / len(times), 265 | sum(m[0] for m in memory) / len(memory), 266 | sum(m[1] for m in memory) / len(memory), 267 | ) 268 | times = [] 269 | memory = [] 270 | for i in range(3): 271 | tracemalloc.start() 272 | start_time = time.time() 273 | can = sr.source_candidates( 274 | queries=self.config["queries"], 275 | as_df=True, 276 | prev_days=7, 277 | ) 278 | elapsed_time = time.time() - start_time 279 | times.append(elapsed_time) 280 | memory.append(tracemalloc.get_traced_memory()) 281 | tracemalloc.stop() 282 | self._display( 283 | "Average Source Candidates", 284 | sum(times) / len(times), 285 | sum(m[0] for m in memory) / len(memory), 286 | sum(m[1] for m in memory) / len(memory), 287 | ) 288 | 289 | def benchmark_rec_sys(self): 290 | """ 291 | Benchmark the equal weight rank function 292 | compared to the unequal weight function 293 | """ 294 | # Run this test 5 times and display the average 295 | times = [] 296 | memory = [] 297 | for i in range(5): 298 | 
tracemalloc.start() 299 | start_time = time.time() 300 | rec = sr.get_recommendations( 301 | data=REF_CANDIDATES_PATH, 302 | labels=REF_LABELS_PATH, 303 | as_df=True, 304 | ) 305 | elapsed_time = time.time() - start_time 306 | times.append(elapsed_time) 307 | memory.append(tracemalloc.get_traced_memory()) 308 | tracemalloc.stop() 309 | self._display( 310 | "Average for get_recommendations", 311 | sum(times) / len(times), 312 | sum(m[0] for m in memory) / len(memory), 313 | sum(m[1] for m in memory) / len(memory), 314 | ) 315 | 316 | @staticmethod 317 | def _display(name, runtime, current_memory, peak_memory): 318 | """Display the results of the benchmarking tests.""" 319 | print(f"{name}") 320 | print(f"Runtime: {runtime} seconds") 321 | print(f"Memory: {current_memory} bytes in current memory") 322 | print(f"{peak_memory} bytes in peak memory \n") 323 | 324 | 325 | if __name__ == "__main__": 326 | unittest.main() 327 | # BenchmarkTests().benchmark_rec_sys() 328 | -------------------------------------------------------------------------------- /webapp.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import streamlit.components.v1 as components 3 | import pandas as pd 4 | 5 | # Standard Library Imports 6 | import smtplib 7 | from email.message import EmailMessage 8 | from re import match 9 | 10 | # Local Imports 11 | import ScholarlyRecommender as sr 12 | from Utils.webutils import search_categories 13 | 14 | 15 | @st.cache_data(show_spinner=False) 16 | def load_sc_config(): 17 | return sr.get_config() 18 | 19 | 20 | def get_sc_config(): 21 | return st.session_state.sys_config 22 | 23 | 24 | def update_sc_config(new_config): 25 | st.session_state.sys_config = new_config 26 | 27 | 28 | @st.cache_data(show_spinner=False) 29 | def build_query(cats: dict) -> list: 30 | if len(cats) == 0: 31 | return [] 32 | 33 | usr_query = [] 34 | for key, value in cats.items(): 35 | if len(value) > 0: 36 | 
usr_query.extend(value) 37 | else: 38 | usr_query.append(key) 39 | return usr_query 40 | 41 | 42 | @st.cache_data(show_spinner=False) 43 | def validate_email(email) -> bool: 44 | regex = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # note: a literal "|" inside the TLD class would wrongly accept pipes 45 | if match(regex, email): 46 | return True 47 | return False 48 | 49 | 50 | def send_email(**kwargs): 51 | try: 52 | SUBSCRIBERS = kwargs["subscribers"] 53 | 54 | # Validate emails again before sending 55 | for email in SUBSCRIBERS: 56 | if not validate_email(email): 57 | raise ValueError( 58 | f"{email} is not a valid email address. Please try again." 59 | ) 60 | 61 | EMAIL_ADDRESS = st.secrets.email_credentials.EMAIL 62 | EMAIL_PASSWORD = st.secrets.email_credentials.EMAIL_PASSWORD 63 | PORT = st.secrets.email_credentials.PORT 64 | if not EMAIL_ADDRESS or not EMAIL_PASSWORD or not PORT: 65 | raise ValueError( 66 | "Email credentials not set in environment variables. Please report this issue to the developer." 67 | ) 68 | if not kwargs["content"]: 69 | raise ValueError( 70 | "Email content not set. Please report this issue to the developer." 71 | ) 72 | 73 | msg = EmailMessage() 74 | msg["Subject"] = "Your Scholarly Recommender Newsletter" 75 | msg["From"] = EMAIL_ADDRESS 76 | msg["To"] = SUBSCRIBERS 77 | 78 | html_string = kwargs["content"] 79 | 80 | msg.set_content(html_string, subtype="html") 81 | with smtplib.SMTP_SSL("smtp.gmail.com", PORT) as smtp: 82 | smtp.login(EMAIL_ADDRESS, EMAIL_PASSWORD) 83 | smtp.send_message(msg) 84 | st.success("Email sent successfully!") 85 | 86 | except Exception as e: # skipcq: PYL-W0703 87 | st.error(f"Email failed to send. 
{e}", icon="🚨") 88 | 89 | 90 | def generate_feed_pipeline(query: list, n: int, days: int, to_email: bool): 91 | with st.status("Working...", expanded=True) as status: 92 | status.update(label="Searching for papers...", state="running", expanded=True) 93 | 94 | if len(query) == 0: 95 | query = get_sc_config()["queries"] 96 | # Collect candidates 97 | candidates = sr.source_candidates(queries=query, as_df=True, prev_days=days) 98 | status.update( 99 | label="Generating recommendations...", state="running", expanded=True 100 | ) 101 | 102 | # Generate recommendations 103 | recommendations = sr.get_recommendations( 104 | data=candidates, 105 | labels=get_sc_config()["labels"], 106 | size=n, 107 | as_df=True, 108 | ) 109 | status.update(label="Generating feed...", state="running", expanded=True) 110 | 111 | # Generate feed 112 | source_code = sr.get_feed( 113 | data=recommendations, 114 | email=to_email, 115 | web=True, 116 | ) 117 | status.update(label="Feed Generated", state="complete", expanded=False) 118 | return source_code 119 | 120 | 121 | # TODO make this more efficient 122 | def fetch_papers(num_papers: int = 10) -> pd.DataFrame: 123 | # Import arxiv here to prevent unnecessary imports 124 | from arxiv.arxiv import SortCriterion 125 | 126 | c = sr.source_candidates( 127 | queries=get_sc_config()["queries"], 128 | max_results=100, 129 | as_df=True, 130 | sort_by=SortCriterion.Relevance, 131 | ) 132 | sam = c[["Title", "Abstract"]].sample(frac=1, random_state=1).reset_index(drop=True) 133 | sam.sort_values( 134 | by="Abstract", key=lambda x: x.str.len(), inplace=True, ascending=False 135 | ) 136 | res = sam.iloc[: min(num_papers, len(sam))].copy() 137 | return res 138 | 139 | 140 | def calibrate_rec_sys(num_papers: int = 10): 141 | # Initialize session state variables if they don't exist 142 | if "labels" not in st.session_state: 143 | st.session_state.labels = [] 144 | if "current_index" not in st.session_state: 145 | st.session_state.current_index = 0 146 | if 
"papers_df" not in st.session_state: 147 | st.session_state.papers_df = fetch_papers(num_papers=num_papers) 148 | 149 | if st.session_state.current_index < num_papers: 150 | # Display the paper at the current index 151 | with st.form("rating_form"): 152 | row = st.session_state.papers_df.iloc[st.session_state.current_index] 153 | st.markdown(f"""## {row['Title']} """) 154 | trimmed_abstract = row["Abstract"][:500] + "..." 155 | st.markdown(f"""{trimmed_abstract}""") 156 | 157 | rating = st.number_input( 158 | "Rate this paper on a scale of 1 to 10", 159 | min_value=1, 160 | max_value=10, 161 | ) 162 | submit_rating = st.form_submit_button( 163 | f"Submit Rating for Paper: {st.session_state.current_index + 1}" 164 | ) 165 | if submit_rating: 166 | # Save the label and increment the index 167 | st.session_state.labels.append(rating) 168 | st.session_state.current_index += 1 169 | st.rerun() 170 | 171 | elif st.session_state.current_index == num_papers: 172 | # Save all labels once all papers are rated 173 | st.session_state.papers_df["label"] = st.session_state.labels 174 | # st.session_state.papers_df.to_csv(to_path) 175 | # Update the config file 176 | old_config = get_sc_config() 177 | old_config["labels"] = st.session_state.papers_df 178 | update_sc_config(old_config) 179 | st.success( 180 | "Labels saved. Get Recommendations is now configured to your interests." 
181 | ) 182 | # Increment to prevent re-running this block 183 | st.session_state.current_index += 1 184 | 185 | else: 186 | st.write("Rating process is complete.") 187 | 188 | 189 | if "sys_config" not in st.session_state: 190 | st.session_state.sys_config = load_sc_config() 191 | 192 | # Theme Configuration 193 | st.set_page_config( 194 | page_title="Scholarly Recommender API", 195 | page_icon="images/logo.png", 196 | layout="wide", 197 | initial_sidebar_state="expanded", 198 | ) 199 | 200 | # Custom CSS for further customization 201 | st.markdown( 202 | """ 203 | 206 | """, 207 | unsafe_allow_html=True, 208 | ) 209 | 210 | # Sidebar with logo and navigation 211 | st.sidebar.image("images/logo.png", use_column_width=True) 212 | st.sidebar.title("Navigation") 213 | navigation = st.sidebar.radio( 214 | "Go to", ["Get Recommendations", "Configure", "About", "Contact"] 215 | ) 216 | 217 | # Home Page 218 | if navigation == "Get Recommendations": 219 | st.title("Scholarly Recommender Cloud API") 220 | st.markdown( 221 | """ 222 | ## Welcome to the Scholarly Recommender API 223 | 224 | This platform is designed to offer you highly tailored scholarly recommendations. 225 | Whether you're a researcher, academic, or just someone interested in scientific literature, 226 | this service is built to cater to your specific needs. 227 | 228 | ### To Get Started 229 | 230 | - **Configure**: The first step is to configure the system to your interests. This can be done by navigating to the configure page. If you skip this step, the system will use a default configuration tailored to the interests of the developer. 231 | - **Get Recommendations**: Once you've configured the system, you can get recommendations by pressing the 'generate recommendations' button at the bottom of the page. 
232 | - **Categories & Subcategories**: You can customize your interests by selecting from a wide range of categories and subcategories; by default, the system will search based on your configured interests. 233 | - **Recommendation Count**: Choose how many recommendations you want to receive. 234 | - **Time Range**: Decide the time frame for the articles you're interested in. 235 | 236 | """, 237 | unsafe_allow_html=True, 238 | ) 239 | # User input section 240 | st.subheader("Customize your recommendations") 241 | 242 | # Collecting user details 243 | 244 | categories = st.multiselect( 245 | "Would you like to search for any specific categories? (leave blank to use your configured interests)", 246 | search_categories.keys(), 247 | ) 248 | selected_sub_categories = {} 249 | for selected in categories: 250 | selected_sub_categories[selected] = st.multiselect( 251 | f"Select subcategories under {selected} (leave blank for all)", 252 | search_categories[selected], 253 | ) 254 | 255 | user_n = st.slider( 256 | "How many recommendations would you like?", 257 | min_value=1, 258 | max_value=10, 259 | value=5, 260 | ) 261 | user_days = st.slider( 262 | "How many days back would you like to search?", 263 | min_value=1, 264 | max_value=30, 265 | value=7, 266 | ) 267 | 268 | user_to_email = st.checkbox("Email Recommendations?") 269 | 270 | if user_to_email: 271 | with st.form("email_form"): 272 | st.write( 273 | "This feature is currently under development; please report any issues you encounter." 
274 | ) 275 | user_email = st.text_input( 276 | "Your email address", placeholder="example@email.com", value="" 277 | ) 278 | st.markdown( 279 | """Disclaimer: Scholarly Recommender will only send you an email with your recommendations 280 | and will not use your email address for any other purpose.""", 281 | unsafe_allow_html=True, 282 | ) 283 | 284 | submit_button = st.form_submit_button(label="Confirm") 285 | if submit_button: 286 | if validate_email(user_email): 287 | st.success("Email address confirmed") 288 | 289 | else: 290 | st.error("Please enter a valid email address", icon="🚨") 291 | 292 | else: 293 | user_email = "" 294 | # Call to Action 295 | if st.button("Generate Recommendations", type="primary"): 296 | user_query = build_query(selected_sub_categories) 297 | 298 | user_feed = generate_feed_pipeline( 299 | query=user_query, n=user_n, days=user_days, to_email=user_to_email 300 | ) 301 | 302 | if user_to_email: 303 | send_email( 304 | subscribers=[user_email], 305 | content=user_feed, 306 | ) 307 | 308 | components.html(user_feed, height=1000, scrolling=True) 309 | 310 | 311 | elif navigation == "Configure": 312 | st.title("Scholarly Recommender Configuration API") 313 | st.markdown( 314 | """ 315 | ## Welcome to the Scholarly Recommender System Calibration Tool 316 | 317 | This tool will help you calibrate the recommender system to your interests! 318 | Below are the various configuration steps; it is advised to do them in order. 319 | Once a step is completed, the changes will be applied automatically, regardless of whether you continue to the next step or not. 320 | """ 321 | ) 322 | # User input section 323 | st.markdown( 324 | """ 325 | ### Step 1: Configure your interests 326 | 327 | This section will help you configure the system to your interests. 328 | This ensures the system only scrapes papers relevant to you. 329 | Follow the steps below to get started. 
330 | """ 331 | ) 332 | 333 | categories = st.multiselect( 334 | "Select the categories that interest you the most (at least one is required):", 335 | search_categories.keys(), 336 | ) 337 | selected_sub_categories = {} 338 | for selected in categories: 339 | selected_sub_categories[selected] = st.multiselect( 340 | f"Select subcategories under {selected} (Optional, leave blank for all)", 341 | search_categories[selected], 342 | ) 343 | if st.button("Done", type="primary"): 344 | with st.spinner("Configuring..."): 345 | user_config_query = build_query(selected_sub_categories) 346 | # prevent empty queries 347 | if len(user_config_query) == 0: 348 | st.error("Please select at least one interest.") 349 | else: 350 | configuration = get_sc_config() 351 | configuration["queries"] = user_config_query 352 | update_sc_config(configuration) 353 | 354 | st.success("Preferences updated successfully!") 355 | # Initialize a session state variable for calibration status if it doesn't exist 356 | if "calibration_started" not in st.session_state: 357 | st.session_state.calibration_started = False 358 | 359 | st.markdown( 360 | """ 361 | ### Step 2: Calibrate the Recommender System 362 | 363 | This section will help you calibrate the recommender system based on your interests. 364 | This will help the system learn your preferences and will significantly improve recommendations. 365 | This process will show you snippets of 10 papers and ask you to rate them on a scale of 1 to 10 (1 being the least relevant and 10 being the most relevant). 366 | Many improvements are planned for this process, including the ability to skip papers, change sample size, and dynamically update the system based on your feedback from the generated feed. 367 | 368 | Click on the button below to get started. 
369 | 370 | **Note**: Currently, if you want to start over or repeat this process, you must refresh the page. 371 | """ 372 | ) 373 | 374 | if st.button("Start Calibration", type="primary"): 375 | st.session_state.calibration_started = True 376 | 377 | if st.session_state.calibration_started: 378 | with st.spinner("Preparing Calibration..."): 379 | calibrate_rec_sys(num_papers=10) 380 | st.markdown( 381 | """ 382 | ### Step 3: All done! Navigate to the Get Recommendations page to generate your personalized feed!""" 383 | ) 384 | # TODO add a button to navigate to the get recommendations page 385 | 386 | 387 | # About Page 388 | elif navigation == "About": 389 | st.markdown( 390 | """ 391 | ## Welcome to the Scholarly Recommender Cloud API 392 | 393 | This platform is designed to offer you highly tailored scholarly recommendations. 394 | Whether you're a researcher, academic, or just someone interested in scientific literature, 395 | this service is built to cater to your specific needs. 396 | 397 | ### How It Works 398 | 399 | - **Configure**: The first step is to configure the system to your interests. This can be done by navigating to the configure page. 400 | - **Get Recommendations**: Once you've configured the system, you can generate recommendations by navigating to the get recommendations page. 401 | - **Categories & Subcategories**: You can customize your interests by selecting from a wide range of categories and subcategories; by default, the system will search based on your configured interests. 402 | - **Recommendation Count**: Choose how many recommendations you want to receive. 403 | - **Time Range**: Decide the time frame for the articles you're interested in. 404 | 405 | ### What Makes Us Different 406 | 407 | - **Personalized**: Recommendations are fine-tuned to match your unique academic interests. 408 | - **Up-to-Date**: The platform provides options to focus on the most recent articles. 
409 | - **Quality Assured**: We prioritize recommendations from reputable sources and peer-reviewed journals. 410 | 411 | ### Mission Statement 412 | 413 | As an upcoming data scientist with a strong passion for deep learning, I am always looking for new technologies and methodologies. Naturally, I spend a considerable amount of time researching and reading new publications to accomplish this. However, over 14,000 academic papers are published every day on average, making it extremely tedious to continuously source papers relevant to my interests. My primary motivation for creating ScholarlyRecommender is to address this, creating a fully automated and personalized system that prepares a feed of academic papers relevant to me. This feed is prepared on demand, through a completely abstracted Streamlit web interface, or sent directly to my email on a timed basis. This project was designed to be scalable and adaptable: it can be easily tuned not only to your own interests but can also become a fully automated, self-improving newsletter. Details on how to use this system, the methods used for retrieval and ranking, and the features currently planned or in development are listed below. 414 | 415 | 416 | """ 417 | ) 418 | 419 | # Contact Page 420 | elif navigation == "Contact": 421 | st.markdown( 422 | """ 423 |

# Contact Information

424 | 425 | ## Project Support 426 | 427 | Please report any issues by creating an issue on the GitHub repository, or by sending an email to the project email directly. 428 | 429 | - **Github Issue**: https://github.com/iansnyder333/ScholarlyRecommender/issues 430 | - **Project Email**: scholarlyrecommender@gmail.com 431 | 432 | ## Developer Contact Information 433 | 434 | If you have any questions or concerns, please feel free to reach out to me directly. 435 | I recently graduated college and am the sole developer of this project, so I would love any constructive feedback you have to offer to help me improve as a developer. 436 | 437 | - **Ian Snyder**: [@iansnydes](https://twitter.com/iansnydes) - idsnyder136@gmail.com 438 | - **Website and Portfolio**: [iansnyder333.github.io/frontend](https://iansnyder333.github.io/frontend) 439 | - **LinkedIn**: [www.linkedin.com/in/iandsnyder](https://www.linkedin.com/in/iandsnyder) 440 | 441 | 442 | """, 443 | unsafe_allow_html=True, 444 | ) 445 | 446 | 447 | # Footer 448 | st.markdown( 449 | """ 450 | 453 | """, 454 | unsafe_allow_html=True, 455 | ) 456 | --------------------------------------------------------------------------------
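The `validate_email` helper in `webapp.py` gates both the email form and `send_email`, so its regex is worth a standalone look. Below is a minimal, self-contained sketch of the same check, with the TLD character class written as `[A-Za-z]{2,}` (in the original `[A-Z|a-z]`, the `|` is a literal pipe inside the class, so addresses like `user@domain.a|b` would pass). The sample addresses are hypothetical test values, not anything from the repository.

```python
from re import match

# Same pattern shape as webapp.validate_email; the TLD class is [A-Za-z]{2,}
# so a literal "|" is not accepted as a top-level-domain character.
EMAIL_RE = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"


def validate_email(email: str) -> bool:
    """Return True if the address matches the newsletter's email pattern."""
    return bool(match(EMAIL_RE, email))


if __name__ == "__main__":
    print(validate_email("example@email.com"))  # True
    print(validate_email("not-an-email"))       # False
    print(validate_email("user@domain.a|b"))    # False with the fixed class
```

Note that `re.match` only anchors at the start of the string, so the explicit `^` is redundant but harmless; the trailing `\b` stops the match at the end of the TLD.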