├── Utils ├── __init__.py ├── webutils.py └── webinterface.py ├── ScholarlyRecommender ├── Repository │ ├── __init__.py │ ├── tests │ │ ├── outputs │ │ │ ├── test_configuration_update.json │ │ │ ├── test_recommendations.csv │ │ │ └── test_feed.html │ │ └── inputs │ │ │ ├── test_configuration.json │ │ │ ├── ref_recommendations.csv │ │ │ └── ref_feed.html │ ├── admin.py │ ├── Recommendations.csv │ └── Feed.csv ├── Scraper │ ├── __init__.py │ ├── .DS_Store │ ├── GoogleScholar.py │ └── Arxiv.py ├── Recommender │ ├── __init__.py │ └── rec_sys.py ├── _cython │ ├── zlib.pxd │ └── cython_functions.pyx ├── configuration.json ├── __init__.py ├── const.py ├── config.py └── Newsletter │ ├── mail.py │ ├── feed.py │ └── html │ ├── Feed.html │ └── WebFeed.html ├── .gitattributes ├── images ├── logo.png ├── system.png └── example_1.png ├── .vscode └── settings.json ├── docs ├── local-config.md ├── README.md ├── web-config.md └── scholarlyAPI.md ├── .gitignore ├── .streamlit └── config.toml ├── .github ├── SECURITY.md ├── ISSUE_TEMPLATE │ ├── feature_request.md │ └── bug_report.md └── workflows │ └── dependency-review.yml ├── .deepsource.toml ├── .devcontainer └── devcontainer.json ├── Makefile ├── requirements.txt ├── calibrate.py ├── README.md ├── LICENSE ├── testing.py └── webapp.py /Utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Recommender/__init__.py: 
-------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /images/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/logo.png -------------------------------------------------------------------------------- /images/system.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/system.png -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "python.analysis.extraPaths": [ 3 | "./Repository" 4 | ] 5 | } -------------------------------------------------------------------------------- /images/example_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/images/example_1.png -------------------------------------------------------------------------------- /docs/local-config.md: -------------------------------------------------------------------------------- 1 | # Scholarly Recommender Local Configuration and Override Instructions 2 | 3 | 4 | ## TODO 5 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/.DS_Store: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/iansnyder333/ScholarlyRecommender/HEAD/ScholarlyRecommender/Scraper/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | /env 2 | /__pycache__ 3 | *.pyc 4 | ScholarlyRecommender/.DS_Store 5 | .DS_Store 6 | 7 | .streamlit/secrets.toml 8 | 9 | /images/Logos-All 10 | testlog 11 | 70Stars.png 12 | -------------------------------------------------------------------------------- /.streamlit/config.toml: -------------------------------------------------------------------------------- 1 | [theme] 2 | # Custom theme configurations 3 | base = "light" 4 | primaryColor = "#A2A2F5" 5 | backgroundColor = "#FFFFFF" 6 | secondaryBackgroundColor = "#E3E3fA" 7 | textColor = "#262730" 8 | font = "sans serif" -------------------------------------------------------------------------------- /ScholarlyRecommender/_cython/zlib.pxd: -------------------------------------------------------------------------------- 1 | cdef extern from "zlib.h": 2 | int compress(unsigned char *dest, unsigned long *destLen, const unsigned char *source, unsigned long sourceLen) 3 | unsigned long compressBound(unsigned long sourceLen) -------------------------------------------------------------------------------- /.github/SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | 4 | 5 | 6 | ## Reporting a Vulnerability 7 | 8 | Please report any security or vulnerability issues to the project's email: [scholarlyrecommender@gmail.com](mailto:scholarlyrecommender@gmail.com) 9 | -------------------------------------------------------------------------------- /.deepsource.toml: -------------------------------------------------------------------------------- 1 | version = 1 2 | 3 | test_patterns = ["testing.py"] 4 | 5 | [[analyzers]] 6 | name = "secrets" 7
| 8 | [[analyzers]] 9 | name = "cxx" 10 | 11 | [[analyzers]] 12 | name = "python" 13 | 14 | [analyzers.meta] 15 | runtime_version = "3.x.x" 16 | 17 | [[transformers]] 18 | name = "black" 19 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_configuration_update.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Computer Science", 4 | "Mathematics" 5 | ], 6 | "labels": "ScholarlyRecommender/Repository/tests/test_candidates_labeled.csv", 7 | "feed_length": 7, 8 | "feed_path": "ScholarlyRecommender/Repository/tests/test_feed.html" 9 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/configuration.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Artificial Intelligence", 4 | "Natural language processing", 5 | "Computer Vision", 6 | "Machine Learning" 7 | ], 8 | "labels": "ScholarlyRecommender/Repository/labeled/Candidates_Labeled.csv", 9 | "feed_length": 5, 10 | "feed_path": "ScholarlyRecommender/Newsletter/html/WebFeed.html" 11 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/test_configuration.json: -------------------------------------------------------------------------------- 1 | { 2 | "queries": [ 3 | "Artificial Intelligence", 4 | "Natural language processing", 5 | "Computer Vision", 6 | "Machine Learning" 7 | 8 | ], 9 | "labels": "ScholarlyRecommender/Repository/tests/test_candidates_labeled.csv", 10 | "feed_length": 5, 11 | "feed_path": "ScholarlyRecommender/Repository/tests/test_feed.html" 12 | } -------------------------------------------------------------------------------- /ScholarlyRecommender/__init__.py: -------------------------------------------------------------------------------- 1 | from .Scraper.Arxiv 
import source_candidates, fast_search 2 | from .Recommender.rec_sys import get_recommendations, evaluate 3 | from .Newsletter.feed import get_feed 4 | 5 | from .config import get_config, update_config 6 | 7 | __all__ = [ 8 | "source_candidates", 9 | "fast_search", 10 | "get_recommendations", 11 | "evaluate", 12 | "get_feed", 13 | "get_config", 14 | "update_config", 15 | ] 16 | -------------------------------------------------------------------------------- /ScholarlyRecommender/const.py: -------------------------------------------------------------------------------- 1 | from copy import deepcopy 2 | 3 | 4 | def BASE_REPO(): 5 | """ 6 | Initialize a new base repo as a deep copy. 7 | @param: None 8 | @return: a dictionary (str:list) 9 | """ 10 | return deepcopy( 11 | { 12 | "Id": [], 13 | "Category": [], 14 | "Title": [], 15 | "Published": [], 16 | "Abstract": [], 17 | "URL": [], 18 | } 19 | ) 20 | -------------------------------------------------------------------------------- /docs/README.md: -------------------------------------------------------------------------------- 1 | # ScholarlyRecommender Documentation 2 | 3 | These documents are intended to help you use the ScholarlyRecommender application. These docs are still in development; as the project continues to scale, more documentation will be added. 4 | 5 | Please **note:** most, if not all, of the source code contained within the repo is formatted in accordance with Black. This includes docstrings, comments, and typedefs. Please use the source code as a reference for any functionality that is not yet covered in the documentation.
6 | 7 | 8 | ![system](https://github.com/iansnyder333/ScholarlyRecommender/assets/58576523/b9239c91-6e76-48db-900d-ca9b845b5639) 9 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 
21 | -------------------------------------------------------------------------------- /ScholarlyRecommender/config.py: -------------------------------------------------------------------------------- 1 | # Default configuration for ScholarlyRecommender 2 | 3 | from json import load, dump 4 | 5 | with open("ScholarlyRecommender/configuration.json") as json_file: 6 | config = load(json_file) 7 | 8 | 9 | def get_config(): 10 | """Return the configuration dictionary.""" 11 | with open("ScholarlyRecommender/configuration.json") as json_file: 12 | config = load(json_file) 13 | return config 14 | 15 | 16 | def update_config(new_config, **kwargs): 17 | """Update the configuration file.""" 18 | if kwargs.get("test_mode"):  # .get() avoids a KeyError when callers omit test_mode 19 | with open(kwargs["test_path"], "w") as json_file: 20 | dump(new_config, json_file, indent=4) 21 | else: 22 | with open("ScholarlyRecommender/configuration.json", "w") as json_file: 23 | dump(new_config, json_file, indent=4) 24 | -------------------------------------------------------------------------------- /.github/workflows/dependency-review.yml: -------------------------------------------------------------------------------- 1 | # Dependency Review Action 2 | # 3 | # This Action will scan dependency manifest files that change as part of a Pull Request, surfacing known-vulnerable versions of the packages declared or updated in the PR. Once installed, if the workflow run is marked as required, PRs introducing known-vulnerable packages will be blocked from merging.
4 | # 5 | # Source repository: https://github.com/actions/dependency-review-action 6 | # Public documentation: https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/about-dependency-review#dependency-review-enforcement 7 | name: 'Dependency Review' 8 | on: [pull_request] 9 | 10 | permissions: 11 | contents: read 12 | 13 | jobs: 14 | dependency-review: 15 | runs-on: ubuntu-latest 16 | steps: 17 | - name: 'Checkout Repository' 18 | uses: actions/checkout@v3 19 | - name: 'Dependency Review' 20 | uses: actions/dependency-review-action@v3 21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 
39 | -------------------------------------------------------------------------------- /.devcontainer/devcontainer.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "Python 3", 3 | // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile 4 | "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bullseye", 5 | "customizations": { 6 | "codespaces": { 7 | "openFiles": [ 8 | "README.md", 9 | "webapp.py" 10 | ] 11 | }, 12 | "vscode": { 13 | "settings": {}, 14 | "extensions": [ 15 | "ms-python.python", 16 | "ms-python.vscode-pylance" 17 | ] 18 | } 19 | }, 20 | "updateContentCommand": "[ -f packages.txt ] && sudo apt update && sudo apt upgrade -y && sudo xargs apt install -y = min_ver))') 16 | ifeq ($(PYTHON_VERSION_OK), 0) 17 | $(error "Need python version >= $(PYTHON_VERSION_MIN). Current version is $(PYTHON_VERSION_CUR)") 18 | endif 19 | 20 | ENV_DIR = env 21 | PIP_VERSION = pip3 22 | 23 | # Phony targets 24 | .PHONY: all build setup install run test clean 25 | 26 | # Default target 27 | all: setup install run 28 | 29 | # Setup the virtual environment if it doesn't exist 30 | setup: 31 | if [ ! 
-d "$(ENV_DIR)" ]; then \ 32 | $(PYTHON) -m venv $(ENV_DIR); \ 33 | fi 34 | 35 | # Activate the virtual environment and install dependencies 36 | install: 37 | if [ -d "$(ENV_DIR)/bin" ]; then \ 38 | source $(ENV_DIR)/bin/activate && \ 39 | $(PIP_VERSION) install -r requirements.txt; \ 40 | elif [ -d "$(ENV_DIR)/Scripts" ]; then \ 41 | source $(ENV_DIR)/Scripts/activate && \ 42 | $(PIP_VERSION) install -r requirements.txt; \ 43 | fi 44 | 45 | build: setup install 46 | 47 | 48 | ## Run Streamlit App (virtual env already activated) 49 | run: 50 | streamlit run webapp.py 51 | 52 | # Run tests 53 | test: 54 | $(PYTHON) testing.py 55 | 56 | # Clean the environment 57 | clean: 58 | rm -rf $(ENV_DIR) -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/mail.py: -------------------------------------------------------------------------------- 1 | """ 2 | Example of how to configure the email server if you have installed ScholarlyRecommender 3 | locally: 4 | Replace the constants with your own email address, password, port number and 5 | subscribers (delivery address). 6 | Note that you need to enable less secure apps in your google account settings, 7 | otherwise this will not work. 8 | Once you have your own constants working, you can delete the input() calls and replace 9 | them with your own constants. 10 | To send an email, simply call send_email(content=html_string) from the Newsletter 11 | module. html_string is the return value from the get_feed() function in feed.py. 
12 | """ 13 | import smtplib 14 | from email.message import EmailMessage 15 | import re 16 | 17 | 18 | def validate_email(email): 19 | """Validate an email address using regex.""" 20 | regex = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b" 21 | if re.match(regex, email): 22 | return True 23 | return False 24 | 25 | 26 | def send_email(**kwargs): 27 | """Send an email using the configured email server.""" 28 | EMAIL_ADDRESS = input("Enter your email address: ") 29 | EMAIL_PASSWORD = input("Enter your email password: ") 30 | SUBSCRIBERS = input( 31 | "Enter your subscribers email addresses (separated by commas): " 32 | ) 33 | PORT = input("Enter your port number: (465 for gmail)") 34 | 35 | msg = EmailMessage() 36 | msg["Subject"] = "Your Scholarly Recommender Newsletter" 37 | msg["From"] = EMAIL_ADDRESS 38 | msg["To"] = SUBSCRIBERS 39 | 40 | html_string = kwargs["content"] 41 | 42 | msg.set_content(html_string, subtype="html") 43 | with smtplib.SMTP_SSL("smtp.gmail.com", PORT) as smtp: 44 | smtp.login(EMAIL_ADDRESS, EMAIL_PASSWORD) 45 | smtp.send_message(msg) 46 | -------------------------------------------------------------------------------- /ScholarlyRecommender/_cython/cython_functions.pyx: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | cimport numpy as np 3 | from libc.stdlib cimport malloc, free 4 | from libc.string cimport strcat, strcpy 5 | from cython.parallel cimport prange 6 | 7 | cdef extern from "zlib.h": 8 | int compress(unsigned char *dest, unsigned long *destLen, const unsigned char *source, unsigned long sourceLen) 9 | unsigned long compressBound(unsigned long sourceLen) 10 | 11 | def calculate_ncd(np.ndarray[object, ndim=1] test_texts, np.ndarray[object, ndim=1] train_texts): 12 | cdef int i, j 13 | cdef int num_test = test_texts.shape[0] 14 | cdef int num_train = train_texts.shape[0] 15 | cdef np.ndarray[np.float64_t, ndim=2] ncd_results = np.zeros((num_test, num_train), 
dtype=np.float64) 16 | cdef unsigned long Cx1, Cx2, Cx1x2 17 | cdef bytes x1, x2, x1x2 18 | 19 | for i in range(num_test): 20 | x1 = test_texts[i].encode('utf-8') 21 | Cx1 = compress_c(x1) 22 | 23 | for j in range(num_train): 24 | x2 = train_texts[j].encode('utf-8') 25 | Cx2 = compress_c(x2) 26 | 27 | x1x2 = x1 + b" " + x2 # Use bytes concatenation 28 | Cx1x2 = compress_c(x1x2) 29 | 30 | ncd_results[i, j] = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 31 | 32 | return ncd_results 33 | 34 | cdef unsigned long compress_c(bytes input_str): 35 | cdef unsigned long source_len = len(input_str) 36 | cdef unsigned long max_compressed_len = compressBound(source_len) 37 | cdef unsigned char *compressed_data = malloc(max_compressed_len) 38 | cdef unsigned long compressed_len = max_compressed_len 39 | 40 | if compress(compressed_data, &compressed_len, input_str, source_len) != 0: 41 | # Handle compression error 42 | return 0 43 | 44 | free(compressed_data) 45 | return compressed_len -------------------------------------------------------------------------------- /Utils/webinterface.py: -------------------------------------------------------------------------------- 1 | from pandas import DataFrame 2 | 3 | 4 | def build_query(selected_sub_categories: dict) -> list: 5 | """ 6 | Build a query from the selected sub-categories 7 | @param selected_sub_categories: dict 8 | @return: list of queries represented as strings 9 | """ 10 | return 11 | 12 | 13 | def validate_email(email) -> bool: 14 | """ 15 | Validate an email address using regex 16 | @param email: string representing an email address 17 | @return: bool indicating whether the email is valid 18 | """ 19 | return 20 | 21 | 22 | def send_email(**kwargs) -> None: 23 | """ 24 | Send an email using the configured email server 25 | @param kwargs: dict of keyword arguments 26 | @return: None 27 | @raises: ValueError if the email address is invalid or if a server error occurs 28 | """ 29 | return 30 | 31 | 32 | def 
generate_feed_pipeline(query: list, n: int, days: int) -> None: 33 | """ 34 | Generate a feed from a query, this is the main pipeline for generating 35 | recommendations 36 | @param query: list of queries represented as strings, defaults to sys_config() 37 | @param n: number of recommendations to generate, defaults to 5 38 | @param days: number of days back to search, defaults to 7 39 | @return: None 40 | """ 41 | return 42 | 43 | 44 | def fetch_papers(num_papers: int = 10) -> DataFrame: 45 | """ 46 | Collect a sample of papers from arXiv for calibration, sourced using the default 47 | configuration of interest categories 48 | Papers are collected, shuffled, and returned as a formatted DataFrame 49 | @param num_papers: number of papers to collect, defaults to 10 50 | @return: DataFrame of papers formatted for labeling 51 | """ 52 | return 53 | 54 | 55 | def calibrate_rec_sys(num_papers: int = 10) -> None: 56 | """ 57 | Interactive calibration tool for the recommender system, essentially serves as a 58 | user interface for manual labeling 59 | @param num_papers: number of papers to rate, defaults to 10 60 | @return: None, labels configured in sys_config["labels"] 61 | """ 62 | return 63 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/admin.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module contains functions to create and manage the database for the 3 | recommender system. 4 | """ 5 | import pandas as pd 6 | import arxiv 7 | 8 | from ScholarlyRecommender.const import BASE_REPO 9 | 10 | 11 | def get_papers(ids: list, query: str = "") -> pd.DataFrame: 12 | """ 13 | Scrape arxiv.org for papers matching the query and return a dataframe matching 14 | the BASE_REPO format. 
15 | """ 16 | repository = BASE_REPO() 17 | search = arxiv.Search( 18 | query=query, 19 | id_list=ids, 20 | ) 21 | for result in search.results(): 22 | repository["Id"].append(result.entry_id.split("/")[-1]) 23 | repository["Category"].append(result.primary_category) 24 | repository["Title"].append(result.title.strip("\n")) 25 | repository["Published"].append(result.published) 26 | repository["Abstract"].append(result.summary.strip("\n")) 27 | repository["URL"].append(result.pdf_url) 28 | return pd.DataFrame(repository).set_index("Id") 29 | 30 | 31 | def build_arxiv_repo(ids: list, path: str) -> None: 32 | """Build a csv file containing the papers matching the ids.""" 33 | if not path.endswith(".csv"): 34 | raise AssertionError("Path must be a csv file") 35 | df = get_papers(ids) 36 | df.to_csv(path) 37 | 38 | 39 | def add_paper(ids: list, to_repo: str) -> None: 40 | """Add papers matching the ids to the repository. Duplicates are removed.""" 41 | if not to_repo.endswith(".csv"): 42 | raise AssertionError("Repository must be a csv file") 43 | df1 = pd.read_csv(to_repo, index_col="Id") 44 | df2 = get_papers(ids) 45 | df = pd.concat([df1, df2]) 46 | df = df[~df.index.duplicated(keep="first")] 47 | df.to_csv(to_repo) 48 | 49 | 50 | def remove_paper(ids: list, from_repo: str) -> None: 51 | """Remove papers matching the ids from the repository.""" 52 | if not from_repo.endswith(".csv"): 53 | raise AssertionError("Repository must be a csv file") 54 | df1 = pd.read_csv(from_repo, index_col="Id") 55 | df2 = get_papers(ids) 56 | df = df1.drop(df2.index) 57 | df.to_csv(from_repo) 58 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with Python 3.11 3 | # by the following command: 4 | # 5 | # pip-compile 6 | # 7 | altair==5.1.1 8 | # via streamlit 9 | arxiv==1.4.8 10 | # via -r 
requirements.in 11 | attrs==23.1.0 12 | # via 13 | # jsonschema 14 | # referencing 15 | beautifulsoup4==4.12.2 16 | # via -r requirements.in 17 | blinker==1.6.2 18 | # via streamlit 19 | cachetools==5.3.1 20 | # via streamlit 21 | certifi==2024.7.4 22 | # via requests 23 | charset-normalizer==3.2.0 24 | # via requests 25 | click==8.1.7 26 | # via streamlit 27 | cython==3.0.2 28 | # via -r requirements.in 29 | feedparser==6.0.10 30 | # via arxiv 31 | gitdb==4.0.10 32 | # via gitpython 33 | gitpython==3.1.41 34 | # via streamlit 35 | idna==3.7 36 | # via requests 37 | importlib-metadata==6.8.0 38 | # via streamlit 39 | jinja2==3.1.6 40 | # via 41 | # altair 42 | # pydeck 43 | jsonschema==4.19.1 44 | # via altair 45 | jsonschema-specifications==2023.7.1 46 | # via jsonschema 47 | markdown-it-py==3.0.0 48 | # via rich 49 | markupsafe==2.1.3 50 | # via jinja2 51 | mdurl==0.1.2 52 | # via markdown-it-py 53 | numpy==1.23.5 54 | # via 55 | # -r requirements.in 56 | # altair 57 | # pandas 58 | # pyarrow 59 | # pydeck 60 | # streamlit 61 | packaging==23.1 62 | # via 63 | # altair 64 | # streamlit 65 | pandas==2.0.1 66 | # via 67 | # -r requirements.in 68 | # altair 69 | # streamlit 70 | pillow>=9.5.0 71 | # via streamlit 72 | protobuf==4.25.8 73 | # via streamlit 74 | pyarrow>=13.0.0 75 | # via streamlit 76 | pydeck==0.8.0 77 | # via streamlit 78 | pygments==2.16.1 79 | # via rich 80 | pympler==1.0.1 81 | # via streamlit 82 | python-dateutil==2.8.2 83 | # via 84 | # pandas 85 | # streamlit 86 | pytz==2023.3.post1 87 | # via pandas 88 | pytz-deprecation-shim==0.1.0.post0 89 | # via tzlocal 90 | referencing==0.30.2 91 | # via 92 | # jsonschema 93 | # jsonschema-specifications 94 | requests==2.32.4 95 | # via 96 | # -r requirements.in 97 | # streamlit 98 | rich==13.5.3 99 | # via streamlit 100 | rpds-py==0.10.3 101 | # via 102 | # jsonschema 103 | # referencing 104 | sgmllib3k==1.0.0 105 | # via feedparser 106 | six==1.16.0 107 | # via python-dateutil 108 | smmap==5.0.1 109 | # 
via gitdb 110 | soupsieve==2.5 111 | # via beautifulsoup4 112 | streamlit==1.37.0 113 | # via -r requirements.in 114 | tenacity==8.2.3 115 | # via streamlit 116 | toml==0.10.2 117 | # via streamlit 118 | toolz==0.12.0 119 | # via altair 120 | tornado==6.5.1 121 | # via streamlit 122 | tqdm==4.66.3 123 | # via -r requirements.in 124 | typing-extensions==4.8.0 125 | # via streamlit 126 | tzdata==2023.3 127 | # via 128 | # pandas 129 | # pytz-deprecation-shim 130 | tzlocal==4.3.1 131 | # via streamlit 132 | urllib3==2.5.0 133 | # via requests 134 | validators==0.22.0 135 | # via streamlit 136 | zipp==3.19.1 137 | # via importlib-metadata 138 | 139 | # The following packages are considered to be unsafe in a requirements file: 140 | # setuptools 141 | -------------------------------------------------------------------------------- /docs/web-config.md: -------------------------------------------------------------------------------- 1 | 2 | # Scholarly Recommender System Calibration Tool 3 | 4 | This document will help you calibrate the recommender system to your interests! 5 | Below are the various configuration steps, it is advised to do them in order. 6 | Once a step is completed, the changes will be applied automatically, regardless of whether you continue to the next step or not. 7 | 8 | ## Step 1: Access the Configuration API 9 | 10 | First, use the following [link](https://scholarlyrecommender.streamlit.app) to access the cloud application. If you are running this project locally, you should refer to this document. 11 | 12 | Once you are on the cloud app, use the navigation bar on the left side of the screen and click on the "Configure" tab, shown below. 13 | 14 | Screen Shot 2023-09-22 at 8 20 59 AM 15 | 16 | 17 | 18 | ## Step 2: Configure your interests 19 | 20 | Follow the instructions displayed on the screen to start the configuration process. You will be asked to select categories and subcategories that are of interest to you. 
You may select as few or as many as you like; these categories will help the ScholarlyRecommender source candidate papers that align with your interests. Once you have selected all the categories, click the done button just below the categories box. You should see the following message indicating that your preferences were successfully changed. 21 | 22 | Screen Shot 2023-09-22 at 8 33 27 AM 23 | 24 | Congratulations! ScholarlyRecommender will now automatically source candidates that are relevant to your interests, unless specified otherwise. You can go back and change this configuration at any time! 25 | 26 | ## Step 3: Calibrate the Recommender System 27 | 28 | Scroll down to the next section on the page, labeled *Calibrate the Recommender System*. This step will calibrate the recommender system to rank candidate papers based on your interests, and will significantly improve recommendations. This process will show you snippets of 10 papers and ask you to rate them on a scale of 1 to 10 (1 being the least relevant and 10 being the most relevant). 29 | 30 | **Note**: Many improvements are planned for this process, including the ability to skip papers, change the sample size, and dynamically update the system based on your feedback from the generated feed. 31 | 32 | When you are ready, click the "Start Calibration" button. After rating the 10 papers, you should see the following message: 33 | 34 | Screen Shot 2023-09-22 at 8 40 27 AM 35 | 36 | Amazing! The ScholarlyRecommender is now configured and ready to be your personal agent for finding new, personalized academic publications. Now you can navigate to the Get Recommendations page and click the "generate recommendations" button to see your personalized feed! 37 | 38 | Screen Shot 2023-09-22 at 8 48 25 AM 39 | 40 | Thank you so much for using ScholarlyRecommender.
I recently graduated college and am the sole developer of this project, so I would love any constructive feedback you have to offer to help me improve as a developer. 41 | 42 | Please report any issues by creating an issue on the GitHub repository, or by sending an email to the project email directly. 43 | 44 | - **Github Issue**: https://github.com/iansnyder333/ScholarlyRecommender/issues 45 | - **Project Email**: scholarlyrecommender@gmail.com 46 | 47 | 48 | -------------------------------------------------------------------------------- /calibrate.py: -------------------------------------------------------------------------------- 1 | """ 2 | This script allows you to reconfigure the recommender system to your interests without 3 | having to use the web interface. 4 | It is not nearly as robust as the web interface, but offers a quick and lightweight 5 | alternative for users who do not want to use the web interface. 6 | """ 7 | import ScholarlyRecommender as sr 8 | import arxiv 9 | 10 | to_path = "ScholarlyRecommender/Repository/labeled/Candidates_Labeled.csv" 11 | old_config = sr.get_config() 12 | 13 | 14 | def calibrate_rec_sys(query: list, num_papers: int = 10): 15 | """ 16 | Interactive calibration tool for the recommender system, essentially serves 17 | as a user interface for manual labeling. 18 | """ 19 | # get sample of papers to label 20 | c = sr.source_candidates( 21 | queries=query, 22 | max_results=100, 23 | as_df=True, 24 | sort_by=arxiv.SortCriterion.Relevance, 25 | ) 26 | sam = c.sample(frac=1) 27 | sam.reset_index(inplace=True) 28 | df = sam[["Title", "Abstract"]].copy() 29 | df["Abstract"] = df["Abstract"].str[:500] + "..." 30 | 31 | df = df.head(num_papers) 32 | 33 | labels = [] 34 | for _, row in df.iterrows(): 35 | print(f"{row['Title']} \n") 36 | print(f"{row['Abstract']} \n") 37 | print("Rate this paper on a scale of 1 to 10? 
\n") 38 | labels.append(int(input("enter a number: "))) 39 | print("\n \n") 40 | df["label"] = labels 41 | df.to_csv(to_path) 42 | old_config["labels"] = to_path 43 | sr.update_config(old_config) 44 | return True 45 | 46 | 47 | def main(): 48 | """ 49 | This script allows you to reconfigure the recommender system to your interests 50 | without having to use the web interface. 51 | """ 52 | print("\n WARNING: This script is deprecated") 53 | print("Please use the web interface!") 54 | print("Welcome to the Scholarly Recommender System Calibration Tool \n") 55 | print( 56 | "This tool will help you calibrate the recommender system to your interests \n" 57 | ) 58 | print("Please answer the following questions to help us get to know you better \n") 59 | print("Select the categories that interest you the most: \n") 60 | print("1. Computer Science \n 2. Mathematics \n 3. Physics") 61 | print("4. Quantitative Biology \n 5. Quantitative Finance \n 6. Statistics \n") 62 | print("Please enter the numbers of the categories that interest you the most.") 63 | print("Separate each number with a comma. ex: 1,2,3 \n") 64 | categories = input("Enter a list of numbers: ") 65 | categories = categories.split(",") 66 | categories = [int(i) for i in categories] 67 | search_categories = { 68 | 1: "Computer Science", 69 | 2: "Mathematics", 70 | 3: "Physics", 71 | 4: "Quantitative Biology", 72 | 5: "Quantitative Finance", 73 | 6: "Statistics", 74 | } 75 | categories = list(map(search_categories.get, categories)) 76 | print(f"Thank you for your input. You selected {categories} \n") 77 | print("Now we will ask you to rate a few papers to help us get to know you better.") 78 | print("This will take a few minutes.") 79 | res = input("Press enter if you want to proceed, enter 'skip' to skip: \n") 80 | if res == "skip": 81 | print("You have chosen to skip this step. \n") 82 | print("The recommender system will not be calibrated to your interests. 
\n") 83 | return 84 | print("Please rate the following papers on a scale of 1 to 10 \n") 85 | state = calibrate_rec_sys(categories) 86 | if state: 87 | print("Thank you for your input. Your results have been saved. \n") 88 | print("The recommender system will now be calibrated to your interests \n") 89 | print("No further action is required! \n") 90 | 91 | return 92 | print("Something went wrong. Please try again. \n") 93 | return 94 | 95 | 96 | if __name__ == "__main__": 97 | main() 98 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/GoogleScholar.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from bs4 import BeautifulSoup 3 | import pandas as pd 4 | 5 | 6 | class ScraperForGoogleScholar: 7 | """Scraper for google scholar""" 8 | 9 | def __init__(self, headers, repository: dict = None): 10 | self.headers = headers 11 | if repository is None: 12 | self.repository = { 13 | "Paper Title": [], 14 | "Author": [], 15 | "Publication": [], 16 | "Url of paper": [], 17 | "Abstract": [], 18 | } 19 | else: 20 | self.repository = repository 21 | 22 | def _get_paperinfo(self, paper_url: str): 23 | """get the paper info from the url""" 24 | # download the page 25 | response = requests.get(paper_url, headers=self.headers) 26 | # check successful response 27 | if response.status_code != 200: 28 | raise AssertionError(f"Failed to fetch web page {paper_url}") 29 | # parse using beautiful soup 30 | return BeautifulSoup(response.text, "html.parser") 31 | 32 | @staticmethod 33 | def _get_tags(doc: BeautifulSoup) -> tuple: 34 | """get the tags from the document""" 35 | paper_tag = doc.select("[data-lid]") 36 | link_tag = doc.find_all("h3", {"class": "gs_rt"}) 37 | author_tag = doc.find_all("div", {"class": "gs_a"}) 38 | abstract_tag = doc.find_all("div", {"class": "gs_rs"}) 39 | return (paper_tag, link_tag, author_tag, abstract_tag) 40 | 41 | @staticmethod 42 | 
def _get_papertitle(paper_tag: list) -> list: 43 | """get the paper title from the tag""" 44 | return [tag.select("h3")[0].get_text() for tag in paper_tag] 45 | 46 | @staticmethod 47 | def _get_link(link_tag: list) -> list: 48 | """get the link from the tag""" 49 | return [link_tag[i].a["href"] for i in range(len(link_tag))] 50 | 51 | @staticmethod 52 | def _get_author_publisher_info(authors_tag: list) -> tuple: 53 | """get the author and publisher info from the tag""" 54 | authors = [] 55 | publishers = [] 56 | for v, _ in enumerate(authors_tag): 57 | authortag_text = (authors_tag[v].text).split("-") 58 | 59 | if len(authortag_text) == 0: 60 | authors.append("None") 61 | publishers.append("None") 62 | elif len(authortag_text) == 1: 63 | authors.append(authortag_text[0]) 64 | publishers.append("None") 65 | else: 66 | authors.append(authortag_text[0]) 67 | publishers.append(authortag_text[-1]) 68 | 69 | return (authors, publishers) 70 | 71 | @staticmethod 72 | def _get_abstract(abstract_tag: list) -> list: 73 | """get the abstract from the tag""" 74 | abstract = [] 75 | for i, _ in enumerate(abstract_tag): 76 | s = (abstract_tag[i].text).strip().split("-") 77 | s = " ".join(s[1:]) 78 | s = s.strip("\n") 79 | abstract.append(s) 80 | return abstract 81 | 82 | def _add_in_paper_repo(self, **kwargs) -> pd.DataFrame: 83 | """add the paper info in the repository""" 84 | self.repository["Paper Title"].extend(kwargs["papername"]) 85 | self.repository["Author"].extend(kwargs["author"]) 86 | self.repository["Publication"].extend(kwargs["publisher"]) 87 | self.repository["Url of paper"].extend(kwargs["url"]) 88 | self.repository["Abstract"].extend(kwargs["abstract"]) 89 | return pd.DataFrame(self.repository) 90 | 91 | def scrape(self, url: str, to_df: bool = True): 92 | """scrape google scholar""" 93 | # function for the get content of each page 94 | doc = self._get_paperinfo(url) 95 | 96 | # function for the collecting tags 97 | paper_tag, link_tag, author_tag, abstract_tag = 
self._get_tags(doc) 98 | 99 | # paper title from each page 100 | papername = self._get_papertitle(paper_tag) 101 | 102 | # year , author , publication of the paper 103 | author, public = self._get_author_publisher_info(author_tag) 104 | 105 | # url of the paper 106 | link = self._get_link(link_tag) 107 | abstract = self._get_abstract(abstract_tag) 108 | # add in paper repo dict 109 | paper = self._add_in_paper_repo( 110 | papername=papername, 111 | author=author, 112 | publisher=public, 113 | url=link, 114 | abstract=abstract, 115 | ) 116 | 117 | if to_df: 118 | return paper 119 | return paper.to_dict() 120 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/feed.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import re 3 | from ScholarlyRecommender.config import get_config 4 | 5 | config = get_config() 6 | 7 | 8 | def clean_feed(dataframe: pd.DataFrame): 9 | """Clean the dataframe to match the BASE_REPO format.""" 10 | df = dataframe[ 11 | ["Id", "Category", "Title", "Published", "Abstract", "URL", "Author"] 12 | ].copy() 13 | df.reset_index(inplace=True) 14 | df["Author"] = df["Author"].astype(str) 15 | df["Id"] = df["Id"].apply(lambda x: "Entry Id: " + str(x)) 16 | df["Published"] = pd.to_datetime(df["Published"]).dt.strftime("%m-%d-%Y") 17 | df["Published"] = df["Published"].apply(lambda x: "Published on " + str(x)) 18 | df["Author"] = df["Author"].apply(extract_author_names) 19 | df["Author"] = df["Author"].str[:500] + "..." 20 | df["Abstract"] = df["Abstract"].str[:500] + "..." 
21 | 22 | df["Abstract"] = df["Abstract"].apply(remove_latex) 23 | df["Title"] = df["Title"].apply(remove_latex) 24 | return df 25 | 26 | 27 | def extract_author_names(author_string): 28 | """Extract the author names from the author string.""" 29 | # The regular expression to match any characters enclosed within single quotes 30 | pattern = r"\'(.*?)\'" 31 | 32 | # Find all matches of the pattern 33 | matches = re.findall(pattern, author_string) 34 | 35 | return ", ".join(matches) 36 | 37 | 38 | def remove_latex(text): 39 | """Remove LaTeX from the text.""" 40 | # Remove inline LaTeX 41 | clean_text = re.sub(r"\$.*?\$", "", text) 42 | 43 | # Remove block LaTeX 44 | clean_text = re.sub(r"\\begin{.*?}\\end{.*?}", "", clean_text) 45 | 46 | return clean_text 47 | 48 | 49 | def build_email( 50 | df: pd.DataFrame, 51 | email: bool = False, 52 | to_path: str = None, 53 | web: bool = False, 54 | ): 55 | """Build the HTML email.""" 56 | flanT5_out = { 57 | "headline": "Your Scholarly Recommender Newsletter Feed", 58 | "intro": "Thank you for using Scholarly Recommender. Here is your feed.", 59 | } 60 | 61 | html_content = """ 62 | 63 | 64 | """ 65 | if email: 66 | body_template = """ 67 |

<h1>{headline}</h1>
<p>Dear Reader,</p>
<p>{intro}</p>
82 | """ 83 | body_html = body_template.format( 84 | headline=flanT5_out["headline"], 85 | intro=flanT5_out["intro"], 86 | ) 87 | html_content += body_html 88 | # HTML template for each feed item 89 | html_template = """ 90 |
<div class="feed-item">
<h2>{title}</h2>
<p>{author}</p>
<p>{id} | {category} | {published}</p>
<p>{abstract}</p>
<a href="{url}">Read More</a>
</div>
118 | """ 119 | 120 | # Iterate through the DataFrame and fill in the HTML template 121 | for _, row in df.iterrows(): 122 | item_html = html_template.format( 123 | title=row["Title"], 124 | author=row["Author"], 125 | id=row["Id"], 126 | category=row["Category"], 127 | published=row["Published"], 128 | abstract=row["Abstract"], 129 | url=row["URL"], 130 | ) 131 | html_content += item_html 132 | html_content += """ 133 | """ 134 | # Save the generated HTML to a file for demonstration 135 | if web: 136 | return html_content 137 | if to_path is None: 138 | to_path = config["feed_path"] 139 | html_file_path = to_path 140 | with open(html_file_path, "w") as f: 141 | f.write(html_content) 142 | return True 143 | 144 | 145 | def get_feed( 146 | data, 147 | email: bool = False, 148 | to_path: str = None, 149 | web: bool = False, 150 | ): 151 | """Get the feed.""" 152 | if isinstance(data, pd.DataFrame): 153 | df = clean_feed(data) 154 | res = build_email(df, email=email, to_path=to_path, web=web) 155 | return res 156 | 157 | if isinstance(data, str): 158 | df = clean_feed(pd.read_csv(data)) 159 | res = build_email(df, email=email, to_path=to_path, web=web) 160 | return res 161 | raise TypeError("data must be a pandas DataFrame or a path to a csv file") 162 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Scraper/Arxiv.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import pandas as pd 3 | import arxiv 4 | from ScholarlyRecommender.const import BASE_REPO 5 | from ScholarlyRecommender.config import get_config 6 | 7 | config = get_config() 8 | 9 | logging.basicConfig( 10 | level=logging.INFO, 11 | format="%(asctime)s [%(levelname)s]: %(message)s", 12 | handlers=[logging.StreamHandler()], 13 | ) 14 | 15 | # logging.disable(logging.CRITICAL) 16 | 17 | 18 | def search( 19 | query: str, max_results: int = 100, sort_by=arxiv.SortCriterion.SubmittedDate 20 | ) -> 
pd.DataFrame: 21 | """ 22 | Scrape arxiv.org for papers matching the query and return a dataframe 23 | matching the BASE_REPO format. 24 | """ 25 | search_client = arxiv.Client(page_size=max_results, delay_seconds=3, num_retries=5) 26 | 27 | repository = BASE_REPO() 28 | 29 | arx_search = arxiv.Search(query=query, max_results=max_results, sort_by=sort_by) 30 | 31 | for result in search_client.results(arx_search): 32 | try: 33 | repository["Id"].append(result.entry_id.split("/")[-1]) 34 | repository["Category"].append(result.primary_category) 35 | repository["Title"].append(result.title.strip("\n")) 36 | repository["Published"].append(result.published) 37 | repository["Abstract"].append(result.summary.strip("\n")) 38 | repository["URL"].append(result.pdf_url) 39 | except arxiv.arxiv.UnexpectedEmptyPageError as error: 40 | logging.exception(error) 41 | continue 42 | if len(repository["Id"]) == 0: 43 | raise ValueError("No papers found for this query") 44 | return pd.DataFrame(repository).set_index("Id") 45 | 46 | 47 | def fast_search( 48 | queries: list, 49 | max_results: int = 100, 50 | to_path: str = None, 51 | as_df: bool = False, 52 | prev_days: int = 7, 53 | sort_by=arxiv.SortCriterion.SubmittedDate, 54 | ): 55 | """ 56 | Faster version of source_candidates (in development); batches all queries 57 | into a single search to significantly reduce runtime.
58 | """ 59 | if queries is None: 60 | queries = config["queries"] 61 | if not isinstance(queries, list) or len(queries) == 0: 62 | raise ValueError("queries must be a list of strings with at least one element") 63 | if prev_days <= 0 or prev_days >= 30: 64 | raise ValueError("prev_days must be greater than 0 and at most 30") 65 | if len(queries) > 100: 66 | raise ValueError("Too many queries, please reduce the number of queries ") 67 | # Turn queries into a single string of each query, seperated by " OR " 68 | queries = " OR ".join(queries) 69 | max_results = 500 70 | 71 | logging.info("Searching for %s", queries) 72 | # df = search(queries, max_results=max_results, sort_by=sort_by) 73 | repository = BASE_REPO() 74 | arx_search = arxiv.Search(query=queries, max_results=max_results, sort_by=sort_by) 75 | for result in arx_search.results(): 76 | try: 77 | repository["Id"].append(result.entry_id.split("/")[-1]) 78 | repository["Category"].append(result.primary_category) 79 | repository["Title"].append(result.title.strip("\n")) 80 | repository["Published"].append(result.published) 81 | repository["Abstract"].append(result.summary.strip("\n")) 82 | repository["URL"].append(result.pdf_url) 83 | except arxiv.arxiv.UnexpectedEmptyPageError as error: 84 | logging.exception(error) 85 | continue 86 | if len(repository["Id"]) == 0: 87 | raise ValueError("No papers found for this query") 88 | df = pd.DataFrame(repository).set_index("Id") 89 | if df.index.has_duplicates: 90 | df = df[~df.index.duplicated(keep="first")] 91 | df["Published"] = pd.to_datetime(df["Published"]) 92 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=prev_days) 93 | df = df[df["Published"] >= num_days] 94 | logging.info("Number of papers extracted : %s", len(df.index)) 95 | if to_path is not None: 96 | df.to_csv(to_path) 97 | if as_df: 98 | return df 99 | return None 100 | 101 | 102 | def source_candidates( 103 | queries: list, 104 | max_results: int = 100, 105 | to_path: str = None, 106 | 
as_df: bool = False, 107 | prev_days: int = 7, 108 | sort_by=arxiv.SortCriterion.SubmittedDate, 109 | ): 110 | """ 111 | Scrape arxiv.org for papers matching the queries, 112 | filter them and return a dataframe or save it to a csv file. 113 | """ 114 | if queries is None: 115 | queries = config["queries"] 116 | if not isinstance(queries, list) or len(queries) == 0: 117 | raise ValueError("queries must be a list of strings with at least one element") 118 | if prev_days <= 0 or prev_days >= 30: 119 | raise ValueError("prev_days must be greater than 0 and less than 30") 120 | if len(queries) > 100: 121 | raise ValueError("Too many queries, please reduce the number of queries") 122 | # normalize queries if sourcing for recommendations 123 | if sort_by == arxiv.SortCriterion.SubmittedDate: 124 | max_results = max(((100 * prev_days) // len(queries)), 100) 125 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=prev_days) 126 | else: 127 | max_results = 100 128 | num_days = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=1095) 129 | 130 | dfs = [] 131 | for query in queries: 132 | logging.info("Searching for %s", query) 133 | 134 | df2 = search(query, max_results=max_results, sort_by=sort_by) 135 | dfs.append(df2) 136 | 137 | df = pd.concat(dfs) 138 | # Remove duplicates 139 | if df.index.has_duplicates: 140 | df = df[~df.index.duplicated(keep="first")] 141 | 142 | # Only keep papers published within the prev_days window 143 | df["Published"] = pd.to_datetime(df["Published"]) 144 | df = df[df["Published"] >= num_days] 145 | logging.info("Number of papers extracted : %s", len(df.index)) 146 | 147 | if to_path is not None: 148 | df.to_csv(to_path) 149 | if as_df: 150 | return df 151 | return None 152 | -------------------------------------------------------------------------------- /docs/scholarlyAPI.md: -------------------------------------------------------------------------------- 1 | # Scholarly Recommender API Documentation (In Progress) 2 | 3 | ## TODO 4 | 5 |
source_candidates 6 | get_recommendations 7 | evaluate 8 | get_feed 9 | get_config 10 | update_config 11 | 12 |
13 | **Table of Contents** 14 | 15 | 1. About This Document 16 | 2. Frequently Asked Questions 17 | 3. Functions and Usage 18 |
 34 | 35 | 36 | ## About This Document 37 | 38 | TODO 39 | 40 | ## Frequently Asked Questions 41 | 42 | TODO 43 | 44 | 45 | ## Functions and Usage 46 | 47 | TODO 48 | 49 | 50 | ### get_config 51 | 52 | ```python 53 | ScholarlyRecommender.get_config() -> dict: 54 | ``` 55 | 56 | Retrieves the current configuration being used by the ScholarlyRecommender system. 57 | 58 | - **Parameters:** 59 | - None 60 | - **Returns:** 61 | - config: dict 62 | - A Python dictionary representing the current configuration being used by the system. It is internally stored as a JSON file. 63 | - **Example** 64 | - ```python 65 | import ScholarlyRecommender as sr 66 | 67 | config = sr.get_config() 68 | queries = config['queries'] 69 | ``` 70 | 71 | ### update_config 72 | 73 | ```python 74 | ScholarlyRecommender.update_config(new_config, **kwargs) -> None: 75 | ``` 76 | 77 | Updates the configuration used by the ScholarlyRecommender system, overwriting the previous configuration stored in configuration.json. 78 | 79 | - **Parameters:** 80 | - new_config: dict 81 | - A Python dictionary representing the new configuration the system will use. This will be internally stored as a JSON file to configuration.json and will overwrite the previous configuration. 82 | - `**kwargs`: optional params 83 | - Optional keyword arguments used for testing and debugging. 84 | - **Returns:** 85 | - None 86 | - **Example** 87 | - ```python 88 | import ScholarlyRecommender as sr
      # Illustrative: add a query to the existing configuration
      config = sr.get_config()
      config["queries"].append("Machine Learning")
      sr.update_config(config)
 89 | ``` 90 | 91 | ### source_candidates 92 | 93 | ```python 94 | ScholarlyRecommender.source_candidates( 95 | queries: list, 96 | max_results: int = 100, 97 | to_path: str = None, 98 | as_df: bool = True, 99 | prev_days: int = 7, 100 | sort_by=arxiv.SortCriterion.SubmittedDate, 101 | ) -> pd.DataFrame: 102 | ``` 103 | 104 | Scrapes the web for papers matching the queries, filters them, and returns a dataframe containing the results. Results can also be saved to a csv file. 105 | 106 | - **Parameters:** 107 | - queries: list of str 108 | - The search queries to scrape, represented as strings. The length of queries must be greater than zero and at most 100.
109 | - max_results: int, *optional* 110 | - The maximum number of candidates to source per query; defaults to 100 and scales dynamically based on the length of queries. 111 | - to_path: str, path object, file-like object, *optional* 112 | - Where to store the resulting candidates if desired; the dataframe will be saved here as a csv. Defaults to None. 113 | - as_df: bool, *optional* 114 | - Boolean to indicate if the resulting candidates should be returned as a Pandas DataFrame. Defaults to True, and should only be changed to False if to_path is provided. 115 | - prev_days: int, *optional* 116 | - The maximum number of days (inclusive) since the publication date for candidates; defaults to 7, must be greater than 0 and less than 30. 117 | - sort_by: arxiv.SortCriterion, *optional* 118 | - TODO 119 | - **Returns:** 120 | - Pandas DataFrame if as_df is True, otherwise None. 121 | - **Example** 122 | - ```python 123 | import ScholarlyRecommender as sr
      # Illustrative query list; any arXiv search strings work here
      candidates = sr.source_candidates(
          queries=["Computer Science", "Mathematics"],
          prev_days=7,
          as_df=True,
      )
 124 | ``` 125 | 126 | ### get_recommendations 127 | 128 | ```python 129 | ScholarlyRecommender.get_recommendations( 130 | data, 131 | labels, 132 | size: int = None, 133 | to_path: str = None, 134 | as_df: bool = False, 135 | ): 136 | ``` 137 | 138 | Ranks the papers in data and returns the top n papers as a dataframe, or saves them to a csv file. 139 | 140 | - **Parameters:** 141 | - data: DataFrame or path-like object 142 | - The dataset containing the candidates for the system to rank. This dataset is returned by source_candidates. 143 | - labels: DataFrame, path-like object, or None 144 | - The labeled dataset the system will use to generate recommendations. This dataset is stored in the environment configuration, which is how it should be accessed. 145 | - size: int, *optional* 146 | - The number of papers to return; defaults to the environment configuration (typically 5). 147 | - Throws a ValueError for manual inputs less than 0 or greater than the number of candidates provided by data.
148 | - to_path: str, path object, file-like object, *optional* 149 | - Where to store the resulting recommendations if desired; the dataframe will be saved here as a csv. Defaults to None. 150 | - as_df: bool, *optional* 151 | - Boolean to indicate if the resulting recommendations should be returned as a Pandas DataFrame. Defaults to False; set it to True to return the dataframe directly, or provide to_path to save the results instead. 152 | - **Returns:** 153 | - Pandas DataFrame if as_df is True, otherwise None. 154 | - **Example** 155 | - ```python 156 | import ScholarlyRecommender as sr
      # Illustrative: candidates is a DataFrame returned by source_candidates
      config = sr.get_config()
      recommendations = sr.get_recommendations(
          data=candidates,
          labels=config["labels"],
          as_df=True,
      )
 157 | ``` 158 | 159 | ### evaluate 160 | 161 | ```python 162 | ScholarlyRecommender.evaluate(n: int = 5, k: int = 6, on: str = "Abstract") -> float: 163 | ``` 164 | 165 | Evaluates the recommender system on the labeled dataset. Uses the normalized compression distance to predict the masked rating. 166 | Calculates the mean squared error between the predicted and actual ratings and returns the total loss. 167 | 168 | 169 | - **Parameters:** 170 | - n: int, *optional* - TODO - k: int, *optional* - TODO - on: str, *optional* - The column of the labeled dataset to evaluate on; defaults to "Abstract". 171 | - **Returns:** 172 | - float - The total mean squared error loss between the predicted and actual ratings. 173 | - **Example** 174 | - ```python 175 | import ScholarlyRecommender as sr
      loss = sr.evaluate(on="Abstract")
 176 | ``` 177 | 178 | ### get_feed 179 | 180 | ```python 181 | ScholarlyRecommender.get_feed( 182 | data, 183 | email: bool = False, 184 | to_path: str = None, 185 | web: bool = False, 186 | ): 187 | ``` 188 | 189 | Builds an HTML feed from the ranked papers in data. The feed can be written to an HTML file, returned as an HTML string for the web interface, or formatted for email. 190 | 191 | - **Parameters:** 192 | - data: DataFrame or path to a csv file - The recommendations to render; a TypeError is raised for any other type. - email: bool, *optional* - If True, the feed is formatted as an email with a headline and greeting. Defaults to False. - to_path: str, path object, file-like object, *optional* - Where to write the generated HTML; defaults to the feed path in the environment configuration. - web: bool, *optional* - If True, the generated HTML is returned as a string instead of being written to a file. Defaults to False. 193 | - **Returns:** 194 | - The HTML string if web is True, otherwise True once the feed has been written. 195 | - **Example** 196 | - ```python 197 | import ScholarlyRecommender as sr
      # Illustrative: render a saved recommendations csv as HTML
      html = sr.get_feed(data="Recommendations.csv", web=True)
 198 | ``` 199 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2309.11400v1,q-fin.TR,Transformers versus LSTMs for electronic trading,2023-09-20 15:25:43+00:00,"With the rapid development of artificial intelligence, long short term memory 3 | (LSTM), one kind of recurrent
neural network (RNN), has been widely applied in 4 | time series prediction. 5 | Like RNN, Transformer is designed to handle the sequential data. As 6 | Transformer achieved great success in Natural Language Processing (NLP), 7 | researchers got interested in Transformer's performance on time series 8 | prediction, and plenty of Transformer-based solutions on long time series 9 | forecasting have come out recently. However, when it comes to financial time 10 | series prediction, LSTM is still a dominant architecture. Therefore, the 11 | question this study wants to answer is: whether the Transformer-based model can 12 | be applied in financial time series prediction and beat LSTM. 13 | To answer this question, various LSTM-based and Transformer-based models are 14 | compared on multiple financial prediction tasks based on high-frequency limit 15 | order book data. A new LSTM-based model called DLSTM is built and new 16 | architecture for the Transformer-based model is designed to adapt for financial 17 | prediction. The experiment result reflects that the Transformer-based model 18 | only has the limited advantage in absolute price sequence prediction. The 19 | LSTM-based models show better and more robust performance on difference 20 | sequence prediction, such as price difference and price movement.",http://arxiv.org/pdf/2309.11400v1,"['Paul Bilokon', 'Yitao Qiu']" 21 | 2309.10982v1,cs.AI,Is GPT4 a Good Trader?,2023-09-20 00:47:52+00:00,"Recently, large language models (LLMs), particularly GPT-4, have demonstrated 22 | significant capabilities in various planning and reasoning tasks 23 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 24 | has been a surge of interest among researchers to harness the capabilities of 25 | GPT-4 for the automated design of quantitative factors that do not overlap with 26 | existing factor libraries, with an aspiration to achieve alpha returns 27 | \cite{webpagequant}. 
In contrast to these work, this study aims to examine the 28 | fidelity of GPT-4's comprehension of classic trading theories and its 29 | proficiency in applying its code interpreter abilities to real-world trading 30 | data analysis. Such an exploration is instrumental in discerning whether the 31 | underlying logic GPT-4 employs for trading is intrinsically reliable. 32 | Furthermore, given the acknowledged interpretative latitude inherent in most 33 | trading theories, we seek to distill more precise methodologies of deploying 34 | these theories from GPT-4's analytical process, potentially offering invaluable 35 | insights to human traders. 36 | To achieve this objective, we selected daily candlestick (K-line) data from 37 | specific periods for certain assets, such as the Shanghai Stock Index. Through 38 | meticulous prompt engineering, we guided GPT-4 to analyze the technical 39 | structures embedded within this data, based on specific theories like the 40 | Elliott Wave Theory. We then subjected its analytical output to manual 41 | evaluation, assessing its interpretative depth and accuracy vis-\`a-vis these 42 | trading theories from multiple dimensions. The results and findings from this 43 | study could pave the way for a synergistic amalgamation of human expertise and 44 | AI-driven insights in the realm of trading.",http://arxiv.org/pdf/2309.10982v1,['Bingzhe Wu'] 45 | 2309.11495v1,cs.CL,Chain-of-Verification Reduces Hallucination in Large Language Models,2023-09-20 17:50:55+00:00,"Generation of plausible yet incorrect factual information, termed 46 | hallucination, is an unsolved issue in large language models. We study the 47 | ability of language models to deliberate on the responses they give in order to 48 | correct their mistakes. 
We develop the Chain-of-Verification (CoVe) method 49 | whereby the model first (i) drafts an initial response; then (ii) plans 50 | verification questions to fact-check its draft; (iii) answers those questions 51 | independently so the answers are not biased by other responses; and (iv) 52 | generates its final verified response. In experiments, we show CoVe decreases 53 | hallucinations across a variety of tasks, from list-based questions from 54 | Wikidata, closed book MultiSpanQA and longform text generation.",http://arxiv.org/pdf/2309.11495v1,"['Shehzaad Dhuliawala', 'Mojtaba Komeili', 'Jing Xu', 'Roberta Raileanu', 'Xian Li', 'Asli Celikyilmaz', 'Jason Weston']" 55 | 2309.11830v1,cs.CL,A Chinese Prompt Attack Dataset for LLMs with Evil Content,2023-09-21 07:07:49+00:00,"Large Language Models (LLMs) present significant priority in text 56 | understanding and generation. However, LLMs suffer from the risk of generating 57 | harmful contents especially while being employed to applications. There are 58 | several black-box attack methods, such as Prompt Attack, which can change the 59 | behaviour of LLMs and induce LLMs to generate unexpected answers with harmful 60 | contents. Researchers are interested in Prompt Attack and Defense with LLMs, 61 | while there is no publicly available dataset to evaluate the abilities of 62 | defending prompt attack. In this paper, we introduce a Chinese Prompt Attack 63 | Dataset for LLMs, called CPAD. Our prompts aim to induce LLMs to generate 64 | unexpected outputs with several carefully designed prompt attack approaches and 65 | widely concerned attacking contents. Different from previous datasets involving 66 | safety estimation, We construct the prompts considering three dimensions: 67 | contents, attacking methods and goals, thus the responses can be easily 68 | evaluated and analysed. 
We run several well-known Chinese LLMs on our dataset, 69 | and the results show that our prompts are significantly harmful to LLMs, with 70 | around 70% attack success rate. We will release CPAD to encourage further 71 | studies on prompt attack and defense.",http://arxiv.org/pdf/2309.11830v1,"['Chengyuan Liu', 'Fubang Zhao', 'Lizhi Qing', 'Yangyang Kang', 'Changlong Sun', 'Kun Kuang', 'Fei Wu']" 72 | 2309.11688v1,cs.CL,LLM Guided Inductive Inference for Solving Compositional Problems,2023-09-20 23:44:16+00:00,"While large language models (LLMs) have demonstrated impressive performance 73 | in question-answering tasks, their performance is limited when the questions 74 | require knowledge that is not included in the model's training data and can 75 | only be acquired through direct observation or interaction with the real world. 76 | Existing methods decompose reasoning tasks through the use of modules invoked 77 | sequentially, limiting their ability to answer deep reasoning tasks. We 78 | introduce a method, Recursion based extensible LLM (REBEL), which handles 79 | open-world, deep reasoning tasks by employing automated reasoning techniques 80 | like dynamic planning and forward-chaining strategies. REBEL allows LLMs to 81 | reason via recursive problem decomposition and utilization of external tools. 82 | The tools that REBEL uses are specified only by natural language description. 
83 | We further demonstrate REBEL capabilities on a set of problems that require a 84 | deeply nested use of external tools in a compositional and conversational 85 | setting.",http://arxiv.org/pdf/2309.11688v1,"['Abhigya Sodani', 'Lauren Moos', 'Matthew Mirman']" 86 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/ref_recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2309.11495v1,cs.CL,Chain-of-Verification Reduces Hallucination in Large Language Models,2023-09-20 17:50:55+00:00,"Generation of plausible yet incorrect factual information, termed 3 | hallucination, is an unsolved issue in large language models. We study the 4 | ability of language models to deliberate on the responses they give in order to 5 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 6 | whereby the model first (i) drafts an initial response; then (ii) plans 7 | verification questions to fact-check its draft; (iii) answers those questions 8 | independently so the answers are not biased by other responses; and (iv) 9 | generates its final verified response. In experiments, we show CoVe decreases 10 | hallucinations across a variety of tasks, from list-based questions from 11 | Wikidata, closed book MultiSpanQA and longform text generation.",http://arxiv.org/pdf/2309.11495v1,"['Shehzaad Dhuliawala', 'Mojtaba Komeili', 'Jing Xu', 'Roberta Raileanu', 'Xian Li', 'Asli Celikyilmaz', 'Jason Weston']" 12 | 2309.11295v1,cs.CL,CPLLM: Clinical Prediction with Large Language Models,2023-09-20 13:24:12+00:00,"We present Clinical Prediction with Large Language Models (CPLLM), a method 13 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 14 | disease prediction. 
We utilized quantization and fine-tuned the LLM using 15 | prompts, with the task of predicting whether patients will be diagnosed with a 16 | target disease during their next visit or in the subsequent diagnosis, 17 | leveraging their historical diagnosis records. We compared our results versus 18 | various baselines, including Logistic Regression, RETAIN, and Med-BERT, which 19 | is the current state-of-the-art model for disease prediction using structured 20 | EHR data. Our experiments have shown that CPLLM surpasses all the tested models 21 | in terms of both PR-AUC and ROC-AUC metrics, displaying noteworthy enhancements 22 | compared to the baseline models.",http://arxiv.org/pdf/2309.11295v1,"['Ofir Ben Shoham', 'Nadav Rappoport']" 23 | 2309.11259v1,cs.CL,Sequence-to-Sequence Spanish Pre-trained Language Models,2023-09-20 12:35:19+00:00,"In recent years, substantial advancements in pre-trained language models have 24 | paved the way for the development of numerous non-English language versions, 25 | with a particular focus on encoder-only and decoder-only architectures. While 26 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 27 | prowess in natural language understanding and generation, there remains a 28 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 29 | involving input-output pairs. This paper breaks new ground by introducing the 30 | implementation and evaluation of renowned encoder-decoder architectures, 31 | exclusively pre-trained on Spanish corpora. Specifically, we present Spanish 32 | versions of BART, T5, and BERT2BERT-style models and subject them to a 33 | comprehensive assessment across a diverse range of sequence-to-sequence tasks, 34 | spanning summarization, rephrasing, and generative question answering. Our 35 | findings underscore the competitive performance of all models, with BART and T5 36 | emerging as top performers across all evaluated tasks. 
As an additional 37 | contribution, we have made all models publicly available to the research 38 | community, fostering future exploration and development in Spanish language 39 | processing.",http://arxiv.org/pdf/2309.11259v1,"['Vladimir Araujo', 'Maria Mihaela Trusca', 'Rodrigo Tufiño', 'Marie-Francine Moens']" 40 | 2309.10982v1,cs.AI,Is GPT4 a Good Trader?,2023-09-20 00:47:52+00:00,"Recently, large language models (LLMs), particularly GPT-4, have demonstrated 41 | significant capabilities in various planning and reasoning tasks 42 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 43 | has been a surge of interest among researchers to harness the capabilities of 44 | GPT-4 for the automated design of quantitative factors that do not overlap with 45 | existing factor libraries, with an aspiration to achieve alpha returns 46 | \cite{webpagequant}. In contrast to these work, this study aims to examine the 47 | fidelity of GPT-4's comprehension of classic trading theories and its 48 | proficiency in applying its code interpreter abilities to real-world trading 49 | data analysis. Such an exploration is instrumental in discerning whether the 50 | underlying logic GPT-4 employs for trading is intrinsically reliable. 51 | Furthermore, given the acknowledged interpretative latitude inherent in most 52 | trading theories, we seek to distill more precise methodologies of deploying 53 | these theories from GPT-4's analytical process, potentially offering invaluable 54 | insights to human traders. 55 | To achieve this objective, we selected daily candlestick (K-line) data from 56 | specific periods for certain assets, such as the Shanghai Stock Index. Through 57 | meticulous prompt engineering, we guided GPT-4 to analyze the technical 58 | structures embedded within this data, based on specific theories like the 59 | Elliott Wave Theory. 
We then subjected its analytical output to manual 60 | evaluation, assessing its interpretative depth and accuracy vis-\`a-vis these 61 | trading theories from multiple dimensions. The results and findings from this 62 | study could pave the way for a synergistic amalgamation of human expertise and 63 | AI-driven insights in the realm of trading.",http://arxiv.org/pdf/2309.10982v1,['Bingzhe Wu'] 64 | 2309.12053v1,cs.CL,"AceGPT, Localizing Large Language Models in Arabic",2023-09-21 13:20:13+00:00,"This paper explores the imperative need and methodology for developing a 65 | localized Large Language Model (LLM) tailored for Arabic, a language with 66 | unique cultural characteristics that are not adequately addressed by current 67 | mainstream models like ChatGPT. Key concerns additionally arise when 68 | considering cultural sensitivity and local values. To this end, the paper 69 | outlines a packaged solution, including further pre-training with Arabic texts, 70 | supervised fine-tuning (SFT) using native Arabic instructions and GPT-4 71 | responses in Arabic, and reinforcement learning with AI feedback (RLAIF) using 72 | a reward model that is sensitive to local culture and values. The objective is 73 | to train culturally aware and value-aligned Arabic LLMs that can serve the 74 | diverse application-specific needs of Arabic-speaking communities. 75 | Extensive evaluations demonstrated that the resulting LLM called 76 | `\textbf{AceGPT}' is the SOTA open Arabic LLM in various benchmarks, including 77 | instruction-following benchmark (i.e., Arabic Vicuna-80 and Arabic AlpacaEval), 78 | knowledge benchmark (i.e., Arabic MMLU and EXAMs), as well as the 79 | newly-proposed Arabic cultural \& value alignment benchmark. Notably, AceGPT 80 | outperforms ChatGPT in the popular Vicuna-80 benchmark when evaluated with 81 | GPT-4, despite the benchmark's limited scale. 
% Natural Language Understanding 82 | (NLU) benchmark (i.e., ALUE) 83 | Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.",http://arxiv.org/pdf/2309.12053v1,"['Huang Huang', 'Fei Yu', 'Jianqing Zhu', 'Xuening Sun', 'Hao Cheng', 'Dingjie Song', 'Zhihong Chen', 'Abdulmohsen Alharthi', 'Bang An', 'Ziche Liu', 'Zhiyi Zhang', 'Junying Chen', 'Jianquan Li', 'Benyou Wang', 'Lian Zhang', 'Ruoyu Sun', 'Xiang Wan', 'Haizhou Li', 'Jinchao Xu']" 84 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/Recommendations.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL,Author 2 | 2308.14337v1,cs.AI,Cognitive Effects in Large Language Models,2023-08-28 06:30:33+00:00,"Large Language Models (LLMs) such as ChatGPT have received enormous attention 3 | over the past year and are now used by hundreds of millions of people every 4 | day. The rapid adoption of this technology naturally raises questions about the 5 | possible biases such models might exhibit. In this work, we tested one of these 6 | models (GPT-3) on a range of cognitive effects, which are systematic patterns 7 | that are usually found in human cognitive tasks. We found that LLMs are indeed 8 | prone to several human cognitive effects. Specifically, we show that the 9 | priming, distance, SNARC, and size congruity effects were presented with GPT-3, 10 | while the anchoring effect is absent. We describe our methodology, and 11 | specifically the way we converted real-world experiments to text-based 12 | experiments. 
Finally, we speculate on the possible reasons why GPT-3 exhibits 13 | these effects and discuss whether they are imitated or reinvented.",http://arxiv.org/pdf/2308.14337v1,"[arxiv.Result.Author('Jonathan Shaki'), arxiv.Result.Author('Sarit Kraus'), arxiv.Result.Author('Michael Wooldridge')]" 14 | 2308.14921v1,cs.CL,Gender bias and stereotypes in Large Language Models,2023-08-28 22:32:05+00:00,"Large Language Models (LLMs) have made substantial progress in the past 15 | several months, shattering state-of-the-art benchmarks in many domains. This 16 | paper investigates LLMs' behavior with respect to gender stereotypes, a known 17 | issue for prior models. We use a simple paradigm to test the presence of gender 18 | bias, building on but differing from WinoBias, a commonly used gender bias 19 | dataset, which is likely to be included in the training data of current LLMs. 20 | We test four recently published LLMs and demonstrate that they express biased 21 | assumptions about men and women's occupations. Our contributions in this paper 22 | are as follows: (a) LLMs are 3-6 times more likely to choose an occupation that 23 | stereotypically aligns with a person's gender; (b) these choices align with 24 | people's perceptions better than with the ground truth as reflected in official 25 | job statistics; (c) LLMs in fact amplify the bias beyond what is reflected in 26 | perceptions or the ground truth; (d) LLMs ignore crucial ambiguities in 27 | sentence structure 95% of the time in our study items, but when explicitly 28 | prompted, they recognize the ambiguity; (e) LLMs provide explanations for their 29 | choices that are factually inaccurate and likely obscure the true reason behind 30 | their predictions. That is, they provide rationalizations of their biased 31 | behavior. 
This highlights a key property of these models: LLMs are trained on 32 | imbalanced datasets; as such, even with the recent successes of reinforcement 33 | learning with human feedback, they tend to reflect those imbalances back at us. 34 | As with other types of societal biases, we suggest that LLMs must be carefully 35 | tested to ensure that they treat minoritized individuals and communities 36 | equitably.",http://arxiv.org/pdf/2308.14921v1,"[arxiv.Result.Author('Hadas Kotek'), arxiv.Result.Author('Rikker Dockum'), arxiv.Result.Author('David Q. Sun')]" 37 | 2308.15126v1,cs.LG,Evaluation and Analysis of Hallucination in Large Vision-Language Models,2023-08-29 08:51:24+00:00,"Large Vision-Language Models (LVLMs) have recently achieved remarkable 38 | success. However, LVLMs are still plagued by the hallucination problem, which 39 | limits the practicality in many scenarios. Hallucination refers to the 40 | information of LVLMs' responses that does not exist in the visual input, which 41 | poses potential risks of substantial consequences. There has been limited work 42 | studying hallucination evaluation in LVLMs. In this paper, we propose 43 | Hallucination Evaluation based on Large Language Models (HaELM), an LLM-based 44 | hallucination evaluation framework. HaELM achieves an approximate 95% 45 | performance comparable to ChatGPT and has additional advantages including low 46 | cost, reproducibility, privacy preservation and local deployment. Leveraging 47 | the HaELM, we evaluate the hallucination in current LVLMs. Furthermore, we 48 | analyze the factors contributing to hallucination in LVLMs and offer helpful 49 | suggestions to mitigate the hallucination problem. 
Our training data and human 50 | annotation hallucination data will be made public soon.",http://arxiv.org/pdf/2308.15126v1,"[arxiv.Result.Author('Junyang Wang'), arxiv.Result.Author('Yiyang Zhou'), arxiv.Result.Author('Guohai Xu'), arxiv.Result.Author('Pengcheng Shi'), arxiv.Result.Author('Chenlin Zhao'), arxiv.Result.Author('Haiyang Xu'), arxiv.Result.Author('Qinghao Ye'), arxiv.Result.Author('Ming Yan'), arxiv.Result.Author('Ji Zhang'), arxiv.Result.Author('Jihua Zhu'), arxiv.Result.Author('Jitao Sang'), arxiv.Result.Author('Haoyu Tang')]" 51 | 2308.14182v1,cs.CL,Generative AI for Business Strategy: Using Foundation Models to Create Business Strategy Tools,2023-08-27 19:03:12+00:00,"Generative models (foundation models) such as LLMs (large language models) 52 | are having a large impact on multiple fields. In this work, we propose the use 53 | of such models for business decision making. In particular, we combine 54 | unstructured textual data sources (e.g., news data) with multiple foundation 55 | models (namely, GPT4, transformer-based Named Entity Recognition (NER) models 56 | and Entailment-based Zero-shot Classifiers (ZSC)) to derive IT (information 57 | technology) artifacts in the form of a (sequence of) signed business networks. 58 | We posit that such artifacts can inform business stakeholders about the state 59 | of the market and their own positioning as well as provide quantitative 60 | insights into improving their future outlook.",http://arxiv.org/pdf/2308.14182v1,"[arxiv.Result.Author('Son The Nguyen'), arxiv.Result.Author('Theja Tulabandhula')]" 61 | 2308.15930v1,cs.CL,LLaSM: Large Language and Speech Model,2023-08-30 10:12:39+00:00,"Multi-modal large language models have garnered significant interest 62 | recently. Though, most of the works focus on vision-language multi-modal models 63 | providing strong capabilities in following vision-and-language instructions. 
64 | However, we claim that speech is also an important modality through which 65 | humans interact with the world. Hence, it is crucial for a general-purpose 66 | assistant to be able to follow multi-modal speech-and-language instructions. In 67 | this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an 68 | end-to-end trained large multi-modal speech-language model with cross-modal 69 | conversational abilities, capable of following speech-and-language 70 | instructions. Our early experiments show that LLaSM demonstrates a more 71 | convenient and natural way for humans to interact with artificial intelligence. 72 | Specifically, we also release a large Speech Instruction Following dataset 73 | LLaSM-Audio-Instructions. Code and demo are available at 74 | https://github.com/LinkSoul-AI/LLaSM and 75 | https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions 76 | dataset is available at 77 | https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.",http://arxiv.org/pdf/2308.15930v1,"[arxiv.Result.Author('Yu Shu'), arxiv.Result.Author('Siwei Dong'), arxiv.Result.Author('Guangyao Chen'), arxiv.Result.Author('Wenhao Huang'), arxiv.Result.Author('Ruihua Zhang'), arxiv.Result.Author('Daochen Shi'), arxiv.Result.Author('Qiqi Xiang'), arxiv.Result.Author('Yemin Shi')]" 78 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/Feed.csv: -------------------------------------------------------------------------------- 1 | Id,Category,Title,Published,Abstract,URL 2 | 2308.14608v1,cs.LG,AI in the Gray: Exploring Moderation Policies in Dialogic Large Language Models vs. Human Answers in Controversial Topics,2023-08-28 14:23:04+00:00,"The introduction of ChatGPT and the subsequent improvement of Large Language 3 | Models (LLMs) have prompted more and more individuals to turn to the use of 4 | ChatBots, both for information and assistance with decision-making. 
However, 5 | the information the user is after is often not formulated by these ChatBots 6 | objectively enough to be provided with a definite, globally accepted answer. 7 | Controversial topics, such as ""religion"", ""gender identity"", ""freedom of 8 | speech"", and ""equality"", among others, can be a source of conflict as partisan 9 | or biased answers can reinforce preconceived notions or promote disinformation. 10 | By exposing ChatGPT to such debatable questions, we aim to understand its level 11 | of awareness and if existing models are subject to socio-political and/or 12 | economic biases. We also aim to explore how AI-generated answers compare to 13 | human ones. For exploring this, we use a dataset of a social media platform 14 | created for the purpose of debating human-generated claims on polemic subjects 15 | among users, dubbed Kialo. 16 | Our results show that while previous versions of ChatGPT have had important 17 | issues with controversial topics, more recent versions of ChatGPT 18 | (gpt-3.5-turbo) are no longer manifesting significant explicit biases in 19 | several knowledge areas. In particular, it is well-moderated regarding economic 20 | aspects. However, it still maintains degrees of implicit libertarian leaning 21 | toward right-winged ideals which suggest the need for increased moderation from 22 | the socio-political point of view. In terms of domain knowledge on 23 | controversial topics, with the exception of the ""Philosophical"" category, 24 | ChatGPT is performing well in keeping up with the collective human level of 25 | knowledge. Finally, we see that sources of Bing AI have slightly more tendency 26 | to the center when compared to human answers. 
All the analyses we make are 27 | generalizable to other types of biases and domains.",http://arxiv.org/pdf/2308.14608v1 28 | 2308.14921v1,cs.CL,Gender bias and stereotypes in Large Language Models,2023-08-28 22:32:05+00:00,"Large Language Models (LLMs) have made substantial progress in the past 29 | several months, shattering state-of-the-art benchmarks in many domains. This 30 | paper investigates LLMs' behavior with respect to gender stereotypes, a known 31 | issue for prior models. We use a simple paradigm to test the presence of gender 32 | bias, building on but differing from WinoBias, a commonly used gender bias 33 | dataset, which is likely to be included in the training data of current LLMs. 34 | We test four recently published LLMs and demonstrate that they express biased 35 | assumptions about men and women's occupations. Our contributions in this paper 36 | are as follows: (a) LLMs are 3-6 times more likely to choose an occupation that 37 | stereotypically aligns with a person's gender; (b) these choices align with 38 | people's perceptions better than with the ground truth as reflected in official 39 | job statistics; (c) LLMs in fact amplify the bias beyond what is reflected in 40 | perceptions or the ground truth; (d) LLMs ignore crucial ambiguities in 41 | sentence structure 95% of the time in our study items, but when explicitly 42 | prompted, they recognize the ambiguity; (e) LLMs provide explanations for their 43 | choices that are factually inaccurate and likely obscure the true reason behind 44 | their predictions. That is, they provide rationalizations of their biased 45 | behavior. This highlights a key property of these models: LLMs are trained on 46 | imbalanced datasets; as such, even with the recent successes of reinforcement 47 | learning with human feedback, they tend to reflect those imbalances back at us. 
48 | As with other types of societal biases, we suggest that LLMs must be carefully 49 | tested to ensure that they treat minoritized individuals and communities 50 | equitably.",http://arxiv.org/pdf/2308.14921v1 51 | 2308.14752v1,cs.CY,"AI Deception: A Survey of Examples, Risks, and Potential Solutions",2023-08-28 17:59:35+00:00,"This paper argues that a range of current AI systems have learned how to 52 | deceive humans. We define deception as the systematic inducement of false 53 | beliefs in the pursuit of some outcome other than the truth. We first survey 54 | empirical examples of AI deception, discussing both special-use AI systems 55 | (including Meta's CICERO) built for specific competitive situations, and 56 | general-purpose AI systems (such as large language models). Next, we detail 57 | several risks from AI deception, such as fraud, election tampering, and losing 58 | control of AI systems. Finally, we outline several potential solutions to the 59 | problems posed by AI deception: first, regulatory frameworks should subject AI 60 | systems that are capable of deception to robust risk-assessment requirements; 61 | second, policymakers should implement bot-or-not laws; and finally, 62 | policymakers should prioritize the funding of relevant research, including 63 | tools to detect AI deception and to make AI systems less deceptive. 64 | Policymakers, researchers, and the broader public should work proactively to 65 | prevent AI deception from destabilizing the shared foundations of our society.",http://arxiv.org/pdf/2308.14752v1 66 | 2308.14359v1,cs.AI,Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks,2023-08-28 07:11:27+00:00,"Human emotion understanding is pivotal in making conversational technology 67 | mainstream. We view speech emotion understanding as a perception task which is 68 | a more realistic setting. With varying contexts (languages, demographics, etc.) 
69 | different share of people perceive the same speech segment as a non-unanimous 70 | emotion. As part of the ACM Multimedia 2023 Computational Paralinguistics 71 | ChallengE (ComParE) in the EMotion Share track, we leverage their rich dataset 72 | of multilingual speakers and multi-label regression target of 'emotion share' 73 | or perception of that emotion. We demonstrate that the training scheme of 74 | different foundation models dictates their effectiveness for tasks beyond 75 | speech recognition, especially for non-semantic speech tasks like emotion 76 | understanding. This is a very complex task due to multilingual speakers, 77 | variability in the target labels, and inherent imbalance in the regression 78 | dataset. Our results show that HuBERT-Large with a self-attention-based 79 | light-weight sequence model provides 4.6% improvement over the reported 80 | baseline.",http://arxiv.org/pdf/2308.14359v1 81 | 2308.14149v1,cs.CL,"Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models",2023-08-27 16:14:19+00:00,"Generative pre-trained transformer (GPT) models have revolutionized the field 82 | of natural language processing (NLP) with remarkable performance in various 83 | tasks and also extend their power to multimodal domains. Despite their success, 84 | large GPT models like GPT-4 face inherent limitations such as considerable 85 | size, high computational requirements, complex deployment processes, and closed 86 | development loops. These constraints restrict their widespread adoption and 87 | raise concerns regarding their responsible development and usage. The need for 88 | user-friendly, relatively small, and open-sourced alternative GPT models arises 89 | from the desire to overcome these limitations while retaining high performance. 
90 | In this survey paper, we provide an examination of alternative open-sourced 91 | models of large GPTs, focusing on user-friendly and relatively small models 92 | that facilitate easier deployment and accessibility. Through this extensive 93 | survey, we aim to equip researchers, practitioners, and enthusiasts with a 94 | thorough understanding of user-friendly and relatively small open-sourced 95 | models of large GPTs, their current state, challenges, and future research 96 | directions, inspiring the development of more efficient, accessible, and 97 | versatile GPT models that cater to the broader scientific community and advance 98 | the field of general artificial intelligence. The source contents are 99 | continuously updating in https://github.com/GPT-Alternatives/gpt_alternatives.",http://arxiv.org/pdf/2308.14149v1 100 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/html/Feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Some spectral comparison results on infinite quantum graphs

12 |

Patrizio Bifulco, Joachim Kerner

15 | 22 |
24 | In this paper we establish spectral comparison results for Schr\"odinger 25 | operators on a certain class of infinite quantum graphs, using recent results 26 | obtained in the finite setting. We also show that new features do appear on 27 | infinite quantum graphs such as a modified local Weyl law. In this sense, we 28 | regard this paper as a starting point for a more thorough investigation of 29 | spectral comparison results on more general infinite metric graphs.... 30 |
31 | Read More 37 |
38 | 39 |
43 |

Motor crosslinking augments elasticity in active nematics

46 |

Steven A. Redford, Jonathan Colen, Jordan L. Shivers, Sasha Zemsky, Mehdi Molaei, Carlos Floyd, Paul V. Ruijgrok, Vincenzo Vitelli, Zev Bryant, Aaron R. Dinner, Margaret L. Gardel

49 | 56 |
58 | In active materials, uncoordinated internal stresses lead to emergent 59 | long-range flows. An understanding of how the behavior of active materials 60 | depends on mesoscopic (hydrodynamic) parameters is developing, but there 61 | remains a gap in knowledge concerning how hydrodynamic parameters depend on the 62 | properties of microscopic elements. In this work, we combine experiments and 63 | multiscale modeling to relate the structure and dynamics of active nematics 64 | composed of biopolymer filaments and molecular mo... 65 |
66 | Read More 72 |
73 | 74 |
78 |

The stability of unevenly spaced planetary systems

81 |

Sheng Yang, Liangyu Wu, Zekai Zheng, Masahiro Ogihara, Kangrou Guo, Wenzhan Ouyang, Yaxing He

84 | 91 |
93 | Studying the orbital stability of multi-planet systems is essential to 94 | understand planet formation, estimate the stable time of an observed planetary 95 | system, and advance population synthesis models. Although previous studies have 96 | primarily focused on ideal systems characterized by uniform orbital 97 | separations, in reality a diverse range of orbital separations exists among 98 | planets within the same system. This study focuses on investigating the 99 | dynamical stability of systems with non-uniform separa... 100 |
101 | Read More 107 |
108 | 109 |
113 |

Quantized thermal and spin transports of dirty planar topological superconductors

116 |

Sanjib Kumar Das, Bitan Roy

119 | 126 |
128 | Nontrivial bulk topological invariants of quantum materials can leave their 129 | signatures on charge, thermal and spin transports. In two dimensions, their 130 | imprints can be experimentally measured from well-developed multi-terminal Hall 131 | bar arrangements. Here, we numerically compute the low temperature ($T$) 132 | thermal ($\kappa_{xy}$) and zero temperature spin ($\sigma^{sp}_{xy}$) Hall 133 | conductivities, and longitudinal thermal conductance ($G^{th}_{xx}$) of various 134 | paradigmatic two-dimensional fully gapp... 135 |
136 | Read More 142 |
143 | 144 |
148 |

Dielectron production in central Pb$-$Pb collisions at $\sqrt{s_\mathrm{NN}}$ = 5.02 TeV

151 |

ALICE Collaboration

154 | 161 |
163 | The first measurement of the e$^+$e$^-$ pair production at midrapidity and 164 | low invariant mass in central Pb$-$Pb collisions at 165 | $\sqrt{s_{\mathrm{NN}}}=5.02$ TeV at the LHC is presented. The yield of 166 | e$^+$e$^-$ pairs is compared with a cocktail of expected hadronic decay 167 | contributions in the invariant mass ($m_{\rm ee}$) and pair transverse momentum 168 | ($p_{\rm T,ee}$) ranges $m_{\rm ee} < 3.5$ GeV$/c^2$ and $p_{\rm T,ee} < 8$ 169 | GeV$/c$. For $0.18 < m_{\rm ee} < 0.5$ GeV$/c^2$ the ratio of data to the... 170 |
171 | Read More 177 |
178 | 179 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Newsletter/html/WebFeed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

YaRN: Efficient Context Window Extension of Large Language Models

12 |

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

15 | 22 |
24 | Rotary Position Embeddings (RoPE) have been shown to effectively encode 25 | positional information in transformer-based language models. However, these 26 | models fail to generalize past the sequence length they were trained on. We 27 | present YaRN (Yet another RoPE extensioN method), a compute-efficient method to 28 | extend the context window of such models, requiring 10x less tokens and 2.5x 29 | less training steps than previous methods. Using YaRN, we show that LLaMA 30 | models can effectively utilize and extrapolat... 31 |
32 | Read More 38 |
39 | 40 |
44 |

LLaSM: Large Language and Speech Model

47 |

Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi

50 | 57 |
59 | Multi-modal large language models have garnered significant interest 60 | recently. Though, most of the works focus on vision-language multi-modal models 61 | providing strong capabilities in following vision-and-language instructions. 62 | However, we claim that speech is also an important modality through which 63 | humans interact with the world. Hence, it is crucial for a general-purpose 64 | assistant to be able to follow multi-modal speech-and-language instructions. In 65 | this work, we propose Large Language and Spee... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Large Language Models as Data Preprocessors

82 |

Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

85 | 92 |
94 | Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's 95 | LLaMA variants, have marked a significant advancement in artificial 96 | intelligence. Trained on vast amounts of text data, LLMs are capable of 97 | understanding and generating human-like text across a diverse range of topics. 98 | This study expands on the applications of LLMs, exploring their potential in 99 | data preprocessing, a critical stage in data mining and analytics applications. 100 | We delve into the applicability of state-of-the-art... 101 |
102 | Read More 108 |
109 | 110 |
114 |

LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models

117 |

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang

120 | 127 |
129 | In recent years, there have been remarkable advancements in the performance 130 | of Transformer-based Large Language Models (LLMs) across various domains. As 131 | these LLMs are deployed for increasingly complex tasks, they often face the 132 | needs to conduct longer reasoning processes or understanding larger contexts. 133 | In these situations, the length generalization failure of LLMs on long 134 | sequences become more prominent. Most pre-training schemes truncate training 135 | sequences to a fixed length (such as 2048 for... 136 |
137 | Read More 143 |
144 | 145 |
149 |

WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model

152 |

Tianyu Wang, Yifan Li, Haitao Lin, Xiangyang Xue, Yanwei Fu

155 | 162 |
164 | Enabling robots to understand language instructions and react accordingly to 165 | visual perception has been a long-standing goal in the robotics research 166 | community. Achieving this goal requires cutting-edge advances in natural 167 | language processing, computer vision, and robotics engineering. Thus, this 168 | paper mainly investigates the potential of integrating the most recent Large 169 | Language Models (LLMs) and existing visual grounding and robotic grasping 170 | system to enhance the effectiveness of the human-ro... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/outputs/test_feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Chain-of-Verification Reduces Hallucination in Large Language Models

12 |

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston...

15 | 22 |
24 | Generation of plausible yet incorrect factual information, termed 25 | hallucination, is an unsolved issue in large language models. We study the 26 | ability of language models to deliberate on the responses they give in order to 27 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 28 | whereby the model first (i) drafts an initial response; then (ii) plans 29 | verification questions to fact-check its draft; (iii) answers those questions 30 | independently so the answers are not biased by other r... 31 |
32 | Read More 38 |
39 | 40 |
44 |

CPLLM: Clinical Prediction with Large Language Models

47 |

Ofir Ben Shoham, Nadav Rappoport...

50 | 57 |
59 | We present Clinical Prediction with Large Language Models (CPLLM), a method 60 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 61 | disease prediction. We utilized quantization and fine-tuned the LLM using 62 | prompts, with the task of predicting whether patients will be diagnosed with a 63 | target disease during their next visit or in the subsequent diagnosis, 64 | leveraging their historical diagnosis records. We compared our results versus 65 | various baselines, including Logistic Regr... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Sequence-to-Sequence Spanish Pre-trained Language Models

82 |

Vladimir Araujo, Maria Mihaela Trusca, Rodrigo Tufiño, Marie-Francine Moens...

85 | 92 |
94 | In recent years, substantial advancements in pre-trained language models have 95 | paved the way for the development of numerous non-English language versions, 96 | with a particular focus on encoder-only and decoder-only architectures. While 97 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 98 | prowess in natural language understanding and generation, there remains a 99 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 100 | involving input-output pairs. This paper br... 101 |
102 | Read More 108 |
109 | 110 |
114 |

Is GPT4 a Good Trader?

117 |

Bingzhe Wu...

120 | 127 |
129 | Recently, large language models (LLMs), particularly GPT-4, have demonstrated 130 | significant capabilities in various planning and reasoning tasks 131 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 132 | has been a surge of interest among researchers to harness the capabilities of 133 | GPT-4 for the automated design of quantitative factors that do not overlap with 134 | existing factor libraries, with an aspiration to achieve alpha returns 135 | \cite{webpagequant}. In contrast to these work, th... 136 |
137 | Read More 143 |
144 | 145 |
149 |

AceGPT, Localizing Large Language Models in Arabic

152 |

Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu...

155 | 162 |
164 | This paper explores the imperative need and methodology for developing a 165 | localized Large Language Model (LLM) tailored for Arabic, a language with 166 | unique cultural characteristics that are not adequately addressed by current 167 | mainstream models like ChatGPT. Key concerns additionally arise when 168 | considering cultural sensitivity and local values. To this end, the paper 169 | outlines a packaged solution, including further pre-training with Arabic texts, 170 | supervised fine-tuning (SFT) using native Arabic inst... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Repository/tests/inputs/ref_feed.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |
9 |

Chain-of-Verification Reduces Hallucination in Large Language Models

12 |

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston...

15 | 22 |
24 | Generation of plausible yet incorrect factual information, termed 25 | hallucination, is an unsolved issue in large language models. We study the 26 | ability of language models to deliberate on the responses they give in order to 27 | correct their mistakes. We develop the Chain-of-Verification (CoVe) method 28 | whereby the model first (i) drafts an initial response; then (ii) plans 29 | verification questions to fact-check its draft; (iii) answers those questions 30 | independently so the answers are not biased by other r... 31 |
32 | Read More 38 |
39 | 40 |
44 |

CPLLM: Clinical Prediction with Large Language Models

47 |

Ofir Ben Shoham, Nadav Rappoport...

50 | 57 |
59 | We present Clinical Prediction with Large Language Models (CPLLM), a method 60 | that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical 61 | disease prediction. We utilized quantization and fine-tuned the LLM using 62 | prompts, with the task of predicting whether patients will be diagnosed with a 63 | target disease during their next visit or in the subsequent diagnosis, 64 | leveraging their historical diagnosis records. We compared our results versus 65 | various baselines, including Logistic Regr... 66 |
67 | Read More 73 |
74 | 75 |
79 |

Sequence-to-Sequence Spanish Pre-trained Language Models

82 |

Vladimir Araujo, Maria Mihaela Trusca, Rodrigo Tufiño, Marie-Francine Moens...

85 | 92 |
94 | In recent years, substantial advancements in pre-trained language models have 95 | paved the way for the development of numerous non-English language versions, 96 | with a particular focus on encoder-only and decoder-only architectures. While 97 | Spanish language models encompassing BERT, RoBERTa, and GPT have exhibited 98 | prowess in natural language understanding and generation, there remains a 99 | scarcity of encoder-decoder models designed for sequence-to-sequence tasks 100 | involving input-output pairs. This paper br... 101 |
102 | Read More 108 |
109 | 110 |
114 |

Is GPT4 a Good Trader?

117 |

Bingzhe Wu...

120 | 127 |
129 | Recently, large language models (LLMs), particularly GPT-4, have demonstrated 130 | significant capabilities in various planning and reasoning tasks 131 | \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there 132 | has been a surge of interest among researchers to harness the capabilities of 133 | GPT-4 for the automated design of quantitative factors that do not overlap with 134 | existing factor libraries, with an aspiration to achieve alpha returns 135 | \cite{webpagequant}. In contrast to these work, th... 136 |
137 | Read More 143 |
144 | 145 |
149 |

AceGPT, Localizing Large Language Models in Arabic

152 |

Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu...

155 | 162 |
164 | This paper explores the imperative need and methodology for developing a 165 | localized Large Language Model (LLM) tailored for Arabic, a language with 166 | unique cultural characteristics that are not adequately addressed by current 167 | mainstream models like ChatGPT. Key concerns additionally arise when 168 | considering cultural sensitivity and local values. To this end, the paper 169 | outlines a packaged solution, including further pre-training with Arabic texts, 170 | supervised fine-tuning (SFT) using native Arabic inst... 171 |
172 | Read More 178 |
179 | 180 | -------------------------------------------------------------------------------- /ScholarlyRecommender/Recommender/rec_sys.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import gzip 3 | import numpy as np 4 | from tqdm import tqdm 5 | import pandas as pd 6 | from arxiv.arxiv import Search 7 | from ScholarlyRecommender.const import BASE_REPO 8 | from ScholarlyRecommender.config import get_config 9 | 10 | config = get_config() 11 | 12 | logging.basicConfig( 13 | level=logging.INFO, 14 | format="%(asctime)s [%(levelname)s]: %(message)s", 15 | handlers=[logging.StreamHandler()], 16 | ) 17 | # logging.disable(logging.CRITICAL) 18 | 19 | 20 | def rankerV3( 21 | context: pd.DataFrame, labels: pd.DataFrame, k: int = 6, on: str = "Abstract" 22 | ) -> pd.DataFrame: 23 | """ 24 | Rank the papers in the context using the normalized compression distance 25 | combined with a weighted top-k mean rating. 26 | Return a DataFrame with a predicted rating for each paper in the context. 27 | This is a modified version of the algorithm described in the paper "“Low-Resource” 28 | Text Classification: A Parameter-Free Classification Method with Compressors." 29 | For each paper in the context, the algorithm finds the top k most similar papers 30 | that the user rated and calculates the weighted mean rating of those papers as its prediction.
31 | """ 32 | likes = labels 33 | candidates = context 34 | 35 | train = np.array([(row[on], row["label"]) for _, row in likes.iterrows()]) 36 | test = np.array([(row[on], row["Id"]) for _, row in candidates.iterrows()]) 37 | 38 | results = [] 39 | # logging.info(f"Starting to rank {len(test)} candidates on {on}\n") 40 | # print(f"Starting to rank {len(test)} candidates on {on}\n") 41 | for x1, id in tqdm(test, disable=False): 42 | # calculate the compressed length of the utf-8 encoded text 43 | Cx1 = len(gzip.compress(x1.encode())) 44 | # create a distance array 45 | similarity_to_x1 = [] 46 | for x2, _ in train: 47 | # calculate the compressed length of the utf-8 encoded text 48 | Cx2 = len(gzip.compress(x2.encode())) 49 | # concatenate the two texts 50 | x1x2 = " ".join([x1, x2]) 51 | # calculate the compressed length of the utf-8 encoded concatenated text 52 | Cx1x2 = len(gzip.compress(x1x2.encode())) 53 | # calculate the normalized compression distance 54 | ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 55 | 56 | similarity_to_x1.append(ncd) 57 | 58 | # calculate the similarity weights for the top k most similar papers 59 | # Converting the list to a numpy array for vectorized operations 60 | similarity_to_x1 = np.array(similarity_to_x1) 61 | # sort the array and get the top k most similar papers 62 | sorted_idx = np.argsort(similarity_to_x1) 63 | values = similarity_to_x1[sorted_idx[:k]] 64 | # calculate the similarity weights for the top k most similar papers 65 | weights = values / np.sum(values) 66 | # Weights are inverted so that the most similar papers (lowest distance) get the largest weight 67 | inverse_weights = 1 / weights 68 | inverse_weights_norm = (inverse_weights) / np.sum(inverse_weights) 69 | # get the top k ratings 70 | top_k_ratings = train[sorted_idx[:k], 1].astype(int) 71 | # calculate the prediction as the inverse weighted mean of the top k ratings 72 | prediction = np.sum(np.dot(inverse_weights_norm, top_k_ratings)) 73 | 74 | results.append((prediction,
id)) 75 | 76 | df = pd.DataFrame(results, columns=["predicted", "Id"]) 77 | return df 78 | 79 | 80 | def rank(context, labels=None, n: int = 5) -> list: 81 | """ 82 | Run the rankerV3 algorithm on the context and return a list 83 | of the top n ranked papers. 84 | """ 85 | if labels is None: 86 | labels = config["labels"] 87 | if isinstance(labels, str): 88 | labels = pd.read_csv(labels) 89 | if not isinstance(labels, pd.DataFrame): 90 | raise TypeError("labels must be a pandas DataFrame") 91 | 92 | df1 = rankerV3(context, labels, on="Abstract") 93 | df2 = rankerV3(context, labels, on="Title") 94 | df1["predicted"] = (df1["predicted"] + df2["predicted"]) / 2 95 | df1["rank"] = df1["predicted"].rank(ascending=False) 96 | df1 = df1.nsmallest(n, "rank") 97 | df1["Id"] = df1["Id"].astype(str) 98 | recommended = df1["Id"].iloc[:n].tolist() 99 | 100 | return recommended 101 | 102 | 103 | def rank2(context, labels=None, n: int = 5) -> list: 104 | """ 105 | Run the rankerV3 algorithm on the context and return a list 106 | of the top n ranked papers.
107 | """ 108 | if labels is None: 109 | labels = config["labels"] 110 | if isinstance(labels, str): 111 | labels = pd.read_csv(labels) 112 | if not isinstance(labels, pd.DataFrame): 113 | raise TypeError("labels must be a pandas DataFrame") 114 | train = context[["Id", "Title", "Abstract"]].copy() 115 | test = labels[["Title", "Abstract", "label"]].copy() 116 | train["content"] = train["Title"] + ": " + train["Abstract"] 117 | test["content"] = test["Title"] + ": " + test["Abstract"] 118 | train.drop(["Title", "Abstract"], axis=1, inplace=True) 119 | test.drop(["Title", "Abstract"], axis=1, inplace=True) 120 | df1 = rankerV3(train, test, on="content") 121 | df1["rank"] = df1["predicted"].rank(ascending=False) 122 | df1 = df1.nsmallest(n, "rank") 123 | df1["Id"] = df1["Id"].astype(str) 124 | recommended = df1["Id"].iloc[:n].tolist() 125 | 126 | return recommended 127 | 128 | 129 | def evaluate(k: int = 6, on: str = "Abstract") -> float: 130 | """ 131 | Evaluate the recommender system using the normalized compression distance. 132 | Calculate the root mean squared error (RMSE) between the predicted and actual ratings. 133 | Return the loss.
134 | """ 135 | likes = pd.read_csv(config["labels"]) 136 | # Set train and test equal to 90% and 10% of the data respectively 137 | train_data = likes.sample(frac=0.9, random_state=0) 138 | test_data = likes.drop(train_data.index) 139 | 140 | train = np.array([(row[on], row["label"]) for _, row in train_data.iterrows()]) 141 | test = np.array([(row[on], row["label"]) for _, row in test_data.iterrows()]) 142 | results = [] 143 | # logging.info(f"Starting to rank {len(test)} candidates...\n") 144 | # print(f"Starting to rank {len(test)} candidates...\n") 145 | for x1, label in tqdm(test): 146 | # calculate the compressed length of the utf-8 encoded text 147 | Cx1 = len(gzip.compress(x1.encode())) 148 | # create a distance array 149 | similarity_to_x1 = [] 150 | for x2, _ in train: 151 | # calculate the compressed length of the utf-8 encoded text 152 | Cx2 = len(gzip.compress(x2.encode())) 153 | # concatenate the two texts 154 | x1x2 = " ".join([x1, x2]) 155 | # calculate the compressed length of the utf-8 encoded concatenated text 156 | Cx1x2 = len(gzip.compress(x1x2.encode())) 157 | # calculate the normalized compression distance 158 | ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2) 159 | 160 | similarity_to_x1.append(ncd) 161 | # Converting the list to a numpy array for vectorized operations 162 | similarity_to_x1 = np.array(similarity_to_x1) 163 | # sort the array and get the top k most similar papers 164 | sorted_idx = np.argsort(similarity_to_x1) 165 | values = similarity_to_x1[sorted_idx[:k]] 166 | # calculate the similarity weights for the top k most similar papers 167 | weights = values / np.sum(values) 168 | # Weights are inverted so that the most similar papers (lowest distance) get the largest weight 169 | inverse_weights = 1 / weights 170 | inverse_weights_norm = (inverse_weights) / np.sum(inverse_weights) 171 | # get the top k ratings 172 | top_k_ratings = train[sorted_idx[:k], 1].astype(int) 173 | # calculate the prediction as the inverse weighted mean of the top k
ratings 174 | prediction = np.sum(np.dot(inverse_weights_norm, top_k_ratings)) 175 | 176 | results.append((prediction, label)) 177 | 178 | df = pd.DataFrame(results, columns=["predicted", "actual"]) 179 | 180 | df["actual"] = df["actual"].astype(int) 181 | # calculate the mean squared error 182 | df["squared_error"] = (df["predicted"] - df["actual"]) ** 2 183 | # loss function 184 | loss = np.sqrt(df["squared_error"].sum() / df.shape[0]) 185 | return loss 186 | 187 | 188 | def fetch(ids: list) -> pd.DataFrame: 189 | """ 190 | Fetch papers from arxiv.org matching the ids and return a dataframe matching 191 | the BASE_REPO including the authors. 192 | """ 193 | # logging.info(f"Fetching {len(ids)} papers from arxiv... \n") 194 | # print(f"Fetching {len(ids)} papers from arxiv... \n") 195 | repository = BASE_REPO() 196 | repository["Author"] = [] 197 | search = Search( 198 | query="", 199 | id_list=ids, 200 | ) 201 | for result in search.results(): 202 | repository["Id"].append(result.entry_id.split("/")[-1]) 203 | repository["Category"].append(result.primary_category) 204 | repository["Title"].append(result.title.strip("\n")) 205 | repository["Published"].append(result.published) 206 | repository["Abstract"].append(result.summary.strip("\n")) 207 | repository["URL"].append(result.pdf_url) 208 | repository["Author"].append([author.name for author in result.authors]) 209 | 210 | return pd.DataFrame(repository) 211 | 212 | 213 | def get_recommendations( 214 | data, 215 | labels, 216 | size: int = None, 217 | to_path: str = None, 218 | as_df: bool = False, 219 | ): 220 | """ 221 | Rank the papers in the data and return a dataframe or save it to a csv file. 222 | Data can be a pandas DataFrame or a path to a csv file. 
223 | """ 224 | if labels is None: 225 | labels = config["labels"] 226 | if size is None: 227 | size = config["feed_length"] 228 | if isinstance(data, pd.DataFrame): 229 | df = data 230 | df.reset_index(inplace=True) 231 | elif isinstance(data, str): 232 | df = pd.read_csv(data) 233 | else: 234 | raise TypeError("data must be a pandas DataFrame or a path to a csv file") 235 | if size < 0 or size > len(df.index): 236 | raise ValueError( 237 | "size must be greater than 0 and less than the length of the data" 238 | ) 239 | 240 | recommended = rank2( 241 | context=df, 242 | labels=labels, 243 | n=size, 244 | ) 245 | feed = fetch(recommended) 246 | if to_path is not None: 247 | feed.set_index("Id").to_csv(to_path) 248 | # logging.info(f"Feed saved to {to_path} \n") 249 | # print(f"Feed saved to {to_path} \n") 250 | # feed.to_csv(to_path) 251 | if as_df: 252 | return feed 253 | return None 254 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 |
4 | DeepSource 5 |
6 |
7 | DeepSource 8 |
9 | 10 | 11 |
12 |
13 | 14 | Logo 15 | 16 | 17 |

Scholarly Recommender

18 | 19 |

20 | End-to-end product that sources recent academic publications and prepares a feed of recommended readings in seconds. 21 |
22 | Try it now » 23 |
24 |
25 | Explore the Docs 26 | · 27 | Report Bug 28 | · 29 | Request Feature 30 |

31 |
32 | 33 | 34 |
35 | Table of Contents 36 |
    37 |
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Methods
  9. Acknowledgements
58 |
59 | 60 | ## About The Project 61 | 62 |
63 | 64 | 65 | 66 |
67 | 68 | 69 | As an upcoming data scientist with a strong passion for deep learning, I am always looking for new technologies and methodologies. Naturally, I spend a considerable amount of time researching and reading new publications to accomplish this. However, **over 14,000 academic papers are published every day** on average, making it extremely tedious to continuously source papers relevant to my interests. My primary motivation for creating ScholarlyRecommender is to address this by creating a fully automated, personalized system that prepares a feed of academic papers relevant to me. This feed is prepared on demand through a completely abstracted Streamlit web interface, or sent directly to my email on a timed basis. The project was designed to be scalable and adaptable: it can easily be tuned to your own interests, or even become a fully automated, self-improving newsletter. Details on how to use the system, the methods used for retrieval and ranking, and the features currently planned or in development are listed below. 70 | 71 | 72 |

(back to top)

73 | 74 | ### Built With 75 | 76 | [![Python][Python.com]][Python-url] 77 | [![Streamlit][Streamlit.com]][Python-url] 78 | [![Pandas][Pandas.com]][Pandas-url] 79 | [![NumPy][Numpy.com]][Numpy-url] 80 | [![Arxiv.arxiv][Arxiv.arxiv.com]][Arxiv.arxiv-url] 81 | 82 |

(back to top)

83 | 84 | 85 | 86 | 87 | ## Getting Started 88 | 89 | To try ScholarlyRecommender, you can use the Streamlit web application found [Here](https://scholarlyrecommender.streamlit.app/). This will allow you to use the system in its entirety without needing to install anything. If you want to modify the system internally or add functionality, you can follow the directions below to install it locally. 90 | 91 | ### Prerequisites 92 | 93 | In order to install this app locally you need to have the following: 94 | - git 95 | - python3.9 or greater *(earlier versions may work)* 96 | - pip3 97 | 98 | 99 | ### Installation 100 | 101 | To install ScholarlyRecommender, run the following in your command line shell: 102 | 103 | 1. Clone the repository from GitHub and cd into it 104 | ```sh 105 | git clone https://github.com/iansnyder333/ScholarlyRecommender.git 106 | cd ScholarlyRecommender 107 | ``` 108 | 2. Set up the environment and install dependencies 109 | ```sh 110 | make build 111 | ``` 112 | All done, ScholarlyRecommender is now installed. 113 | You can now run the app with 114 | ```sh 115 | make run 116 | ``` 117 | 118 |

(back to top)

119 | 120 | 121 | 122 | 123 | ## Usage 124 | 125 | Once installed, you want to calibrate the system to your own interests. The easiest way to do this is using the webapp.py file. Alternatively, you can use calibrate.py, which runs on the console. 126 | 127 | Make sure you have cd'd into the parent folder of the cloned repo. 128 | 129 | Run this in your terminal as follows: 130 | ```sh 131 | make run 132 | ``` 133 | This is the same as running: 134 | ```sh 135 | streamlit run webapp.py 136 | ``` 137 | 138 | Navigate to the configure tab and complete the steps. You can now navigate back to the get recommendations tab and generate results! 139 | The web app offers full functionality and serves as an API to the system; while using the webapp, updates made to the configuration are applied and refreshed continuously. 140 | 141 | **Note:** If you are using ScholarlyRecommender locally, certain features such as direct email will not work, as the original application's database will not be available. If you want to configure the email feature for yourself, you can follow the instructions provided in [mail.py](https://github.com/iansnyder333/ScholarlyRecommender/blob/main/ScholarlyRecommender/Newsletter/mail.py). This will require some proficiency/familiarity with SMTP. If you are having issues please feel free to check the [docs](https://github.com/iansnyder333/ScholarlyRecommender/tree/main/docs), or make a discussion post [here](https://github.com/iansnyder333/ScholarlyRecommender/discussions) and someone will help you out. 142 | 143 |

(back to top)

144 | 145 | 146 | ## Roadmap 147 | 148 | - [x] ✅ Adding email support on the web app ✅ 149 | - [ ] OS support, specifically for Windows. 150 | - [ ] Shell scripts to make installs, updates, and usage easier. 151 | - [ ] A database to store user configurations; currently zero user data is saved. This would also improve data locality and caching for a better user experience. 152 | - [ ] Making it easier to give feedback on suggested papers to improve the system. 153 | - [ ] Improving the overall labeling experience, specifically making the pooling process and labeling setup more integrated. 154 | - [ ] Improving modularity in the webapp and improving caching for faster performance. 155 | - [ ] Many visual and user experience improvements; a complete overhaul of the UX is likely. 156 | - [ ] Allowing users to navigate between pages without using the Navbar; Streamlit does not currently support this directly. 157 | 158 | 159 | See the [open issues](https://github.com/iansnyder333/ScholarlyRecommender/issues) for a full list of proposed features (and known issues). 160 | 161 |

(back to top)

162 | 163 | 164 | 165 | 166 | ## Contributing 167 | 168 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**. 169 | 170 | If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". 171 | Don't forget to give the project a star! Thanks again! 172 | 173 | 1. Fork the Project 174 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 175 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 176 | 4. Push to the Branch (`git push origin feature/AmazingFeature`) 177 | 5. Open a Pull Request 178 | 179 |

(back to top)

180 | 181 | 182 | ## License 183 | 184 | Distributed under the apache license 2.0. See `LICENSE` for more information. 185 | 186 |

(back to top)

187 | 188 | 189 | 190 | 191 | ## Contact 192 | 193 | Ian Snyder - [@iansnydes](https://twitter.com/iansnydes) - idsnyder136@gmail.com 194 | 195 | Project Email - scholarlyrecommender@gmail.com 196 | 197 | My Website: [iansnyder333.github.io/frontend/](https://iansnyder333.github.io/frontend/) 198 | 199 | Linkedin: [www.linkedin.com/in/iandsnyder](https://www.linkedin.com/in/iandsnyder/) 200 | 201 |

(back to top)

202 | 203 | 204 | ## Methods 205 | 206 | Once candidates are sourced in the context of the configuration, they are ranked. The ranking process uses the normalized compression distance combined with an inverse-weighted top-k mean rating from the candidates to the labeled papers. This is a modified version of the algorithm described in the paper "“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors" (1). For each candidate paper, the algorithm finds the top k most similar papers that the user rated and calculates a weighted mean rating of those papers as its prediction. The results are then sorted by highest predicted rating and returned up to the desired feed length. 207 | 208 | While using a large language model such as BERT might yield higher accuracy, this approach is considerably more lightweight, can run on basically any computer, and requires virtually no labeled data to source relevant content. If this project scales to the capacity of a self-improving newsletter, implementing a sophisticated deep learning model such as a transformer could be a worthwhile addition. 209 | 210 |
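The ranking described above can be sketched in a few lines of Python. This is a simplified illustration, not the exact implementation in `rec_sys.py`; the `ncd` and `predict_rating` helper names are ours, not part of the package's API:

```python
import gzip

import numpy as np


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two texts, using gzip."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress(" ".join([a, b]).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)


def predict_rating(candidate: str, rated, k: int = 6) -> float:
    """Predict a rating for `candidate` as the inverse-distance-weighted
    mean rating of the k most similar user-rated papers.
    `rated` is a list of (text, rating) pairs."""
    dists = np.array([ncd(candidate, text) for text, _ in rated])
    ratings = np.array([rating for _, rating in rated], dtype=float)
    top = np.argsort(dists)[:k]
    # Lower distance means more similar, so weight by inverse distance;
    # the epsilon guards against division by zero for near-identical texts.
    inv = 1.0 / np.maximum(dists[top], 1e-9)
    weights = inv / inv.sum()
    return float(weights @ ratings[top])
```

Because the prediction is a convex combination of the top-k ratings, it always falls within the range of the user's own labels, which keeps the scores easy to interpret.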

(back to top)

211 | 212 | 213 | ## Acknowledgements 214 | 215 | **1** - [“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors](https://aclanthology.org/2023.findings-acl.426) (Jiang et al., Findings 2023) 216 | 217 | **README Template** - [Best-README-Template](https://github.com/othneildrew/Best-README-Template) by ([othneildrew](https://github.com/othneildrew)) 218 | 219 |

(back to top)

220 | 221 | 222 | 223 | 224 | [Python.com]:https://img.shields.io/badge/Python-blue 225 | [Python-url]:https://www.python.org/ 226 | [Streamlit.com]:https://img.shields.io/badge/Streamlit-red 227 | [Streamlit-url]:https://streamlit.io/ 228 | [Pandas.com]:https://img.shields.io/badge/pandas-purple 229 | [Pandas-url]:https://pandas.pydata.org/ 230 | [Numpy.com]:https://img.shields.io/badge/NumPy-%23ADD8E6 231 | [Numpy-url]:https://numpy.org/ 232 | [Arxiv.arxiv.com]:https://img.shields.io/badge/Arxiv-%23FF0000 233 | [Arxiv.arxiv-url]:http://lukasschwab.me/arxiv.py/index.html 234 | 235 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /testing.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module contains unit tests for the ScholarlyRecommender package. 3 | Before pushing to the repository, please run the tests to ensure that the package 4 | is working as expected. 5 | More tests will be added as the package is developed. 6 | 7 | TODO Test feed outputs for email, webapp, etc. Add more tests for the webapp, 8 | and more comprehensive tests for the API. 
9 | """ 10 | import ScholarlyRecommender as sr 11 | import pandas as pd 12 | from json import load 13 | import unittest 14 | import time 15 | import tracemalloc 16 | 17 | 18 | # Test Constants 19 | # Base directory for testing 20 | BASE_TEST_DIR = "ScholarlyRecommender/Repository/tests/" 21 | TEST_INPUT_DIR = BASE_TEST_DIR + "inputs/" 22 | TEST_OUTPUT_DIR = BASE_TEST_DIR + "outputs/" 23 | # Test Configuration files 24 | TEST_CONFIG_PATH = TEST_INPUT_DIR + "test_configuration.json" 25 | TEST_CONFIG_UPDATE_OUT = TEST_OUTPUT_DIR + "test_configuration_update.json" 26 | # Test Reference input files 27 | REF_CANDIDATES_PATH = TEST_INPUT_DIR + "ref_candidates.csv" 28 | REF_LABELS_PATH = TEST_INPUT_DIR + "ref_labels.csv" 29 | REF_RECOMMENDATIONS_PATH = TEST_INPUT_DIR + "ref_recommendations.csv" 30 | # Test Output files 31 | TEST_CANDIDATES_OUT = TEST_OUTPUT_DIR + "test_candidates.csv" 32 | TEST_LABELS_OUT = TEST_OUTPUT_DIR + "test_labels.csv" 33 | TEST_RECOMMENDATIONS_OUT = TEST_OUTPUT_DIR + "test_recommendations.csv" 34 | TEST_FEED_OUT = TEST_OUTPUT_DIR + "test_feed.html" 35 | 36 | 37 | class TestScholarlyRecommender(unittest.TestCase): 38 | """ 39 | CONFIGURATION TESTS 40 | 41 | First, these tests ensure that the current configuration is valid, validating 42 | the keys, value types, and value shapes against the reference. 43 | Second, the retrieval and update functions for the ScholarlyRecommender 44 | configuration API are tested. 45 | 46 | ScholarlyRecommender API TESTS 47 | 48 | These tests ensure that the ScholarlyRecommender API is working as expected. 49 | Each test checks that the output is the correct shape and type, in both return 50 | formats. These tests are run under the assumption that the configuration is valid. 
51 | """ 52 | 53 | def setUp(self): 54 | """Set up the test environment for unittest""" 55 | with open(TEST_CONFIG_PATH) as json_file: 56 | self.ref_config = load(json_file) 57 | self.config = sr.get_config() 58 | self.candidates = pd.read_csv(REF_CANDIDATES_PATH) 59 | self.candidates_labeled = pd.read_csv(REF_LABELS_PATH) 60 | self.recommendations = pd.read_csv(REF_RECOMMENDATIONS_PATH) 61 | 62 | # Test the configuration file and its contents 63 | def test_config(self): 64 | """Check that the config file is valid""" 65 | self.assertEqual(self.config.keys(), self.ref_config.keys()) 66 | 67 | def test_config_query(self): 68 | """Test the queries""" 69 | config_queries = self.config["queries"] 70 | self.assertTrue(isinstance(config_queries, list)) 71 | self.assertTrue(len(config_queries) > 0) 72 | self.assertTrue(all(isinstance(item, str) for item in config_queries)) 73 | 74 | def test_config_labels(self): 75 | """Test the labels""" 76 | config_labels = pd.read_csv(self.config["labels"]) 77 | expected_columns = list(self.candidates_labeled.columns) 78 | columns = list(config_labels.columns) 79 | self.assertEqual(columns, expected_columns) 80 | 81 | def test_config_feed_length(self): 82 | """Test the feed length""" 83 | self.assertTrue(isinstance(self.config["feed_length"], int)) 84 | self.assertTrue(self.config["feed_length"] > 0) 85 | self.assertTrue(self.config["feed_length"] <= 10) 86 | 87 | def test_get_config(self): 88 | """Test that the config retrieval function works as expected.""" 89 | with open("ScholarlyRecommender/configuration.json") as json_file: 90 | expected_config = load(json_file) 91 | 92 | config = sr.get_config() 93 | self.assertEqual(config, expected_config) 94 | 95 | def test_update_config(self): 96 | """Test that the config update function works as expected.""" 97 | # Change as necessary 98 | expected_config = { 99 | "queries": ["Computer Science", "Mathematics"], 100 | "labels": BASE_TEST_DIR + "test_candidates_labeled.csv", 101 | "feed_length": 
7, 102 | "feed_path": BASE_TEST_DIR + "test_feed.html", 103 | } 104 | sr.update_config( 105 | expected_config, test_mode=True, test_path=TEST_CONFIG_UPDATE_OUT 106 | ) 107 | with open(TEST_CONFIG_UPDATE_OUT) as json_file: 108 | config = load(json_file) 109 | 110 | self.assertEqual(config, expected_config) 111 | 112 | def test_source_candidates(self): 113 | """ 114 | Test that the outputs from candidate sourcing are the correct 115 | shape and type. 116 | """ 117 | out = TEST_CANDIDATES_OUT 118 | df_candidates = sr.source_candidates( 119 | queries=self.config["queries"], 120 | as_df=True, 121 | prev_days=7, 122 | to_path=out, 123 | ) 124 | df_candidates.reset_index(inplace=True) 125 | 126 | candidates = pd.read_csv(out) 127 | expected_columns = list(self.candidates.columns) 128 | expected_dtypes = self.candidates.dtypes.astype(str).to_dict() 129 | 130 | # Compare the column names 131 | self.assertListEqual(list(candidates.columns), expected_columns) 132 | self.assertListEqual(list(df_candidates.columns), expected_columns) 133 | 134 | # Compare the data types of each column 135 | self.assertDictEqual(candidates.dtypes.astype(str).to_dict(), expected_dtypes) 136 | 137 | def test_get_recommendations(self): 138 | """Test that the outputs from the ranking are the correct shape and type""" 139 | out = TEST_RECOMMENDATIONS_OUT 140 | df_recommendations = sr.get_recommendations( 141 | data=REF_CANDIDATES_PATH, 142 | labels=REF_LABELS_PATH, 143 | to_path=out, 144 | as_df=True, 145 | ) 146 | 147 | recommendations = pd.read_csv(out) 148 | expected_columns = list(self.recommendations.columns) 149 | expected_dtypes = self.recommendations.dtypes.astype(str).to_dict() 150 | 151 | # Compare the column names 152 | self.assertListEqual(list(recommendations.columns), expected_columns) 153 | self.assertListEqual(list(df_recommendations.columns), expected_columns) 154 | # Compare the data types of each column 155 | self.assertDictEqual( 156 | recommendations.dtypes.astype(str).to_dict(), 
expected_dtypes 157 | ) 158 | 159 | 160 | class BenchmarkTests: 161 | """ 162 | This class contains basic benchmarking tests for the ScholarlyRecommender package. 163 | The benchmarking tests are run under the assumption that the configuration is valid. 164 | It reports the runtime and memory usage of each function in the package, as well as 165 | the total runtime and memory usage for the whole pipeline. 166 | """ 167 | 168 | def __init__(self): 169 | self.config = sr.get_config() 170 | self.candidates = pd.read_csv(REF_CANDIDATES_PATH) 171 | self.candidates_labeled = pd.read_csv(REF_LABELS_PATH) 172 | self.recommendations = pd.read_csv(REF_RECOMMENDATIONS_PATH) 173 | 174 | def benchmark(self, save_log=False): 175 | """Run the benchmarking tests.""" 176 | print("\n Running benchmarks... \n") 177 | times = [] 178 | memory = [] 179 | tracemalloc.start() 180 | full_start_time = time.time() 181 | 182 | start_time = time.time() 183 | can = sr.source_candidates( 184 | queries=self.config["queries"], 185 | as_df=True, 186 | prev_days=7, 187 | ) 188 | elapsed_time = time.time() - start_time 189 | times.append(elapsed_time) 190 | memory.append(tracemalloc.get_traced_memory()) 191 | self._display( 192 | "Source Candidates", 193 | times[0], 194 | memory[0][0], 195 | memory[0][1], 196 | ) 197 | 198 | start_time = time.time() 199 | rec = sr.get_recommendations( 200 | data=REF_CANDIDATES_PATH, 201 | labels=REF_LABELS_PATH, 202 | as_df=True, 203 | ) 204 | elapsed_time = time.time() - start_time 205 | times.append(elapsed_time) 206 | memory.append(tracemalloc.get_traced_memory()) 207 | self._display( 208 | "Get Recommendations", 209 | times[1], 210 | memory[1][0] - memory[0][0], 211 | memory[1][1], 212 | ) 213 | 214 | start_time = time.time() 215 | fee = sr.get_feed( 216 | data=REF_RECOMMENDATIONS_PATH, 217 | to_path=TEST_FEED_OUT, 218 | ) 219 | elapsed_time = time.time() - start_time 220 | 221 | times.append(elapsed_time) 222 | memory.append(tracemalloc.get_traced_memory()) 223 | 
self._display( 224 | "Get Feed", 225 | times[2], 226 | memory[2][0] - memory[1][0], 227 | memory[2][1], 228 | ) 229 | 230 | full_time = time.time() - full_start_time 231 | tracemalloc.stop() 232 | self._display( 233 | "Total", 234 | full_time, 235 | memory[2][0], 236 | memory[2][1], 237 | ) 238 | if save_log: 239 | return times, memory 240 | return None 241 | 242 | def benchmark_sc(self): 243 | """ 244 | Benchmark the source_candidates function 245 | compared to the fast_search function. NOTE: both loops below currently time source_candidates; swap the first loop's call for the fast_search variant to make this a real comparison. 246 | """ 247 | # Run this test 3 times and display the average 248 | times = [] 249 | memory = [] 250 | for i in range(3): 251 | tracemalloc.start() 252 | start_time = time.time() 253 | can = sr.source_candidates( 254 | queries=self.config["queries"], 255 | as_df=True, 256 | prev_days=7, 257 | ) 258 | elapsed_time = time.time() - start_time 259 | times.append(elapsed_time) 260 | memory.append(tracemalloc.get_traced_memory()) 261 | tracemalloc.stop() 262 | self._display( 263 | "Average Fast Search", 264 | sum(times) / len(times), 265 | sum(m[0] for m in memory) / len(memory), 266 | sum(m[1] for m in memory) / len(memory), 267 | ) 268 | times = [] 269 | memory = [] 270 | for i in range(3): 271 | tracemalloc.start() 272 | start_time = time.time() 273 | can = sr.source_candidates( 274 | queries=self.config["queries"], 275 | as_df=True, 276 | prev_days=7, 277 | ) 278 | elapsed_time = time.time() - start_time 279 | times.append(elapsed_time) 280 | memory.append(tracemalloc.get_traced_memory()) 281 | tracemalloc.stop() 282 | self._display( 283 | "Average Source Candidates", 284 | sum(times) / len(times), 285 | sum(m[0] for m in memory) / len(memory), 286 | sum(m[1] for m in memory) / len(memory), 287 | ) 288 | 289 | def benchmark_rec_sys(self): 290 | """ 291 | Benchmark the equal weight rank function 292 | compared to the unequal weight function 293 | """ 294 | # Run this test 5 times and display the average 295 | times = [] 296 | memory = [] 297 | for i in range(5): 298 | 
tracemalloc.start() 299 | start_time = time.time() 300 | rec = sr.get_recommendations( 301 | data=REF_CANDIDATES_PATH, 302 | labels=REF_LABELS_PATH, 303 | as_df=True, 304 | ) 305 | elapsed_time = time.time() - start_time 306 | times.append(elapsed_time) 307 | memory.append(tracemalloc.get_traced_memory()) 308 | tracemalloc.stop() 309 | self._display( 310 | "Average for get_recommendations", 311 | sum(times) / len(times), 312 | sum(m[0] for m in memory) / len(memory), 313 | sum(m[1] for m in memory) / len(memory), 314 | ) 315 | 316 | @staticmethod 317 | def _display(name, runtime, current_memory, peak_memory): 318 | """Display the results of the benchmarking tests.""" 319 | print(f"{name}") 320 | print(f"Runtime: {runtime} seconds") 321 | print(f"Memory: {current_memory} bytes in current memory") 322 | print(f"{peak_memory} bytes in peak memory \n") 323 | 324 | 325 | if __name__ == "__main__": 326 | unittest.main() 327 | # BenchmarkTests().benchmark_rec_sys() 328 | -------------------------------------------------------------------------------- /webapp.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import streamlit.components.v1 as components 3 | import pandas as pd 4 | 5 | # Standard Library Imports 6 | import smtplib 7 | from email.message import EmailMessage 8 | from re import match 9 | 10 | # Local Imports 11 | import ScholarlyRecommender as sr 12 | from Utils.webutils import search_categories 13 | 14 | 15 | @st.cache_data(show_spinner=False) 16 | def load_sc_config(): 17 | return sr.get_config() 18 | 19 | 20 | def get_sc_config(): 21 | return st.session_state.sys_config 22 | 23 | 24 | def update_sc_config(new_config): 25 | st.session_state.sys_config = new_config 26 | 27 | 28 | @st.cache_data(show_spinner=False) 29 | def build_query(cats: dict) -> list: 30 | if len(cats) == 0: 31 | return [] 32 | 33 | usr_query = [] 34 | for key, value in cats.items(): 35 | if len(value) > 0: 36 | 
usr_query.extend(value) 37 | else: 38 | usr_query.append(key) 39 | return usr_query 40 | 41 | 42 | @st.cache_data(show_spinner=False) 43 | def validate_email(email) -> bool: 44 | regex = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"  # note: a literal "|" inside the TLD class would wrongly accept pipes 45 | if match(regex, email): 46 | return True 47 | return False 48 | 49 | 50 | def send_email(**kwargs): 51 | try: 52 | SUBSCRIBERS = kwargs["subscribers"] 53 | 54 | # Validate emails again before sending 55 | for email in SUBSCRIBERS: 56 | if not validate_email(email): 57 | raise ValueError( 58 | f"{email} is not a valid email address. Please try again." 59 | ) 60 | 61 | EMAIL_ADDRESS = st.secrets.email_credentials.EMAIL 62 | EMAIL_PASSWORD = st.secrets.email_credentials.EMAIL_PASSWORD 63 | PORT = st.secrets.email_credentials.PORT 64 | if not EMAIL_ADDRESS or not EMAIL_PASSWORD or not PORT: 65 | raise ValueError( 66 | "Email credentials not set in environment variables. Please report this issue to the developer." 67 | ) 68 | if not kwargs["content"]: 69 | raise ValueError( 70 | "Email content not set. Please report this issue to the developer." 71 | ) 72 | 73 | msg = EmailMessage() 74 | msg["Subject"] = "Your Scholarly Recommender Newsletter" 75 | msg["From"] = EMAIL_ADDRESS 76 | msg["To"] = SUBSCRIBERS 77 | 78 | html_string = kwargs["content"] 79 | 80 | msg.set_content(html_string, subtype="html") 81 | with smtplib.SMTP_SSL("smtp.gmail.com", PORT) as smtp: 82 | smtp.login(EMAIL_ADDRESS, EMAIL_PASSWORD) 83 | smtp.send_message(msg) 84 | st.success("Email sent successfully!") 85 | 86 | except Exception as e: # skipcq: PYL-W0703 87 | st.error(f"Email failed to send. 
{e}", icon="🚨") 88 | 89 | 90 | def generate_feed_pipeline(query: list, n: int, days: int, to_email: bool): 91 | with st.status("Working...", expanded=True) as status: 92 | status.update(label="Searching for papers...", state="running", expanded=True) 93 | 94 | if len(query) == 0: 95 | query = get_sc_config()["queries"] 96 | # Collect candidates 97 | candidates = sr.source_candidates(queries=query, as_df=True, prev_days=days) 98 | status.update( 99 | label="Generating recommendations...", state="running", expanded=True 100 | ) 101 | 102 | # Generate recommendations 103 | recommendations = sr.get_recommendations( 104 | data=candidates, 105 | labels=get_sc_config()["labels"], 106 | size=n, 107 | as_df=True, 108 | ) 109 | status.update(label="Generating feed...", state="running", expanded=True) 110 | 111 | # Generate feed 112 | source_code = sr.get_feed( 113 | data=recommendations, 114 | email=to_email, 115 | web=True, 116 | ) 117 | status.update(label="Feed Generated", state="complete", expanded=False) 118 | return source_code 119 | 120 | 121 | # TODO make this more efficient 122 | def fetch_papers(num_papers: int = 10) -> pd.DataFrame: 123 | # Import arxiv here to prevent unnecessary imports 124 | from arxiv.arxiv import SortCriterion 125 | 126 | c = sr.source_candidates( 127 | queries=get_sc_config()["queries"], 128 | max_results=100, 129 | as_df=True, 130 | sort_by=SortCriterion.Relevance, 131 | ) 132 | sam = c[["Title", "Abstract"]].sample(frac=1, random_state=1).reset_index(drop=True) 133 | sam.sort_values( 134 | by="Abstract", key=lambda x: x.str.len(), inplace=True, ascending=False 135 | ) 136 | res = sam.iloc[: min(num_papers, len(sam))].copy() 137 | return res 138 | 139 | 140 | def calibrate_rec_sys(num_papers: int = 10): 141 | # Initialize session state variables if they don't exist 142 | if "labels" not in st.session_state: 143 | st.session_state.labels = [] 144 | if "current_index" not in st.session_state: 145 | st.session_state.current_index = 0 146 | if 
"papers_df" not in st.session_state: 147 | st.session_state.papers_df = fetch_papers(num_papers=num_papers) 148 | 149 | if st.session_state.current_index < num_papers: 150 | # Display the paper at the current index 151 | with st.form("rating_form"): 152 | row = st.session_state.papers_df.iloc[st.session_state.current_index] 153 | st.markdown(f"""## {row['Title']} """) 154 | trimmed_abstract = row["Abstract"][:500] + "..." 155 | st.markdown(f"""{trimmed_abstract}""") 156 | 157 | rating = st.number_input( 158 | "Rate this paper on a scale of 1 to 10", 159 | min_value=1, 160 | max_value=10, 161 | ) 162 | submit_rating = st.form_submit_button( 163 | f"Submit Rating for Paper: {st.session_state.current_index + 1}" 164 | ) 165 | if submit_rating: 166 | # Save the label and increment the index 167 | st.session_state.labels.append(rating) 168 | st.session_state.current_index += 1 169 | st.rerun() 170 | 171 | elif st.session_state.current_index == num_papers: 172 | # Save all labels once all papers are rated 173 | st.session_state.papers_df["label"] = st.session_state.labels 174 | # st.session_state.papers_df.to_csv(to_path) 175 | # Update the config file 176 | old_config = get_sc_config() 177 | old_config["labels"] = st.session_state.papers_df 178 | update_sc_config(old_config) 179 | st.success( 180 | "Labels saved. Get Recommendations is now configured to your interests." 
181 | ) 182 | # Increment to prevent re-running this block 183 | st.session_state.current_index += 1 184 | 185 | else: 186 | st.write("Rating process is complete.") 187 | 188 | 189 | if "sys_config" not in st.session_state: 190 | st.session_state.sys_config = load_sc_config() 191 | 192 | # Theme Configuration 193 | st.set_page_config( 194 | page_title="Scholarly Recommender API", 195 | page_icon="images/logo.png", 196 | layout="wide", 197 | initial_sidebar_state="expanded", 198 | ) 199 | 200 | # Custom CSS for further customization 201 | st.markdown( 202 | """ 203 | 206 | """, 207 | unsafe_allow_html=True, 208 | ) 209 | 210 | # Sidebar with logo and navigation 211 | st.sidebar.image("images/logo.png", use_column_width=True) 212 | st.sidebar.title("Navigation") 213 | navigation = st.sidebar.radio( 214 | "Go to", ["Get Recommendations", "Configure", "About", "Contact"] 215 | ) 216 | 217 | # Home Page 218 | if navigation == "Get Recommendations": 219 | st.title("Scholarly Recommender Cloud API") 220 | st.markdown( 221 | """ 222 | ## Welcome to the Scholarly Recommender API 223 | 224 | This platform is designed to offer you highly tailored scholarly recommendations. 225 | Whether you're a researcher, academic, or just someone interested in scientific literature, 226 | this service is built to cater to your specific needs. 227 | 228 | ### To Get Started 229 | 230 | - **Configure**: The first step is to configure the system to your interests. This can be done by navigating to the configure page. If you skip this step, the system will use a default configuration tailored to the interests of the developer. 231 | - **Get Recommendations**: Once you've configured the system, you can get recommendations by pressing the 'generate recommendations' button at the bottom of the page. 
232 | - **Categories & Subcategories**: You can customize your interests by selecting from a wide range of categories and subcategories; by default, the system will search based on your configured interests. 233 | - **Recommendation Count**: Choose how many recommendations you want to receive. 234 | - **Time Range**: Decide the time frame for the articles you're interested in. 235 | 236 | """, 237 | unsafe_allow_html=True, 238 | ) 239 | # User input section 240 | st.subheader("Customize your recommendations") 241 | 242 | # Collecting user details 243 | 244 | categories = st.multiselect( 245 | "Would you like to search for any specific categories? (leave blank to use your configured interests)", 246 | search_categories.keys(), 247 | ) 248 | selected_sub_categories = {} 249 | for selected in categories: 250 | selected_sub_categories[selected] = st.multiselect( 251 | f"Select subcategories under {selected} (leave blank for all)", 252 | search_categories[selected], 253 | ) 254 | 255 | user_n = st.slider( 256 | "How many recommendations would you like?", 257 | min_value=1, 258 | max_value=10, 259 | value=5, 260 | ) 261 | user_days = st.slider( 262 | "How many days back would you like to search?", 263 | min_value=1, 264 | max_value=30, 265 | value=7, 266 | ) 267 | 268 | user_to_email = st.checkbox("Email Recommendations?") 269 | 270 | if user_to_email: 271 | with st.form("email_form"): 272 | st.write( 273 | "This feature is currently under development; please report any issues you encounter." 
274 | ) 275 | user_email = st.text_input( 276 | "Your email address", placeholder="example@email.com", value="" 277 | ) 278 | st.markdown( 279 | """Disclaimer: Scholarly Recommender will only send you an email with your recommendations 280 | and will not use your email address for any other purpose.""", 281 | unsafe_allow_html=True, 282 | ) 283 | 284 | submit_button = st.form_submit_button(label="Confirm") 285 | if submit_button: 286 | if validate_email(user_email): 287 | st.success("Email address confirmed") 288 | 289 | else: 290 | st.error("Please enter a valid email address", icon="🚨") 291 | 292 | else: 293 | user_email = "" 294 | # Call to Action 295 | if st.button("Generate Recommendations", type="primary"): 296 | user_query = build_query(selected_sub_categories) 297 | 298 | user_feed = generate_feed_pipeline( 299 | query=user_query, n=user_n, days=user_days, to_email=user_to_email 300 | ) 301 | 302 | if user_to_email: 303 | send_email( 304 | subscribers=[user_email], 305 | content=user_feed, 306 | ) 307 | 308 | components.html(user_feed, height=1000, scrolling=True) 309 | 310 | 311 | elif navigation == "Configure": 312 | st.title("Scholarly Recommender Configuration API") 313 | st.markdown( 314 | """ 315 | ## Welcome to the Scholarly Recommender System Calibration Tool 316 | 317 | This tool will help you calibrate the recommender system to your interests! 318 | Below are the various configuration steps; it is advised to do them in order. 319 | Once a step is completed, the changes will be applied automatically, regardless of whether you continue to the next step or not. 320 | """ 321 | ) 322 | # User input section 323 | st.markdown( 324 | """ 325 | ### Step 1: Configure your interests 326 | 327 | This section will help you configure the system to your interests. 328 | This ensures the system only scrapes papers relevant to you. 329 | Follow the steps below to get started. 
330 | """ 331 | ) 332 | 333 | categories = st.multiselect( 334 | "Select the categories that interest you the most (at least one is required):", 335 | search_categories.keys(), 336 | ) 337 | selected_sub_categories = {} 338 | for selected in categories: 339 | selected_sub_categories[selected] = st.multiselect( 340 | f"Select subcategories under {selected} (Optional, leave blank for all)", 341 | search_categories[selected], 342 | ) 343 | if st.button("Done", type="primary"): 344 | with st.spinner("Configuring..."): 345 | user_config_query = build_query(selected_sub_categories) 346 | # prevent empty queries 347 | if len(user_config_query) == 0: 348 | st.error("Please select at least one interest.") 349 | else: 350 | configuration = get_sc_config() 351 | configuration["queries"] = user_config_query 352 | update_sc_config(configuration) 353 | 354 | st.success("Preferences updated successfully!") 355 | # Initialize a session state variable for calibration status if it doesn't exist 356 | if "calibration_started" not in st.session_state: 357 | st.session_state.calibration_started = False 358 | 359 | st.markdown( 360 | """ 361 | ### Step 2: Calibrate the Recommender System 362 | 363 | This section will help you calibrate the recommender system based on your interests. 364 | This will help the system learn your preferences and will significantly improve recommendations. 365 | This process will show you snippets of 10 papers and ask you to rate them on a scale of 1 to 10 (1 being the least relevant and 10 being the most relevant). 366 | Many improvements are planned for this process, including the ability to skip papers, change sample size, and dynamically update the system based on your feedback from the generated feed. 367 | 368 | Click on the button below to get started. 
369 | 370 | **Note**: Currently, if you want to start over or repeat this process, you must refresh the page. 371 | """ 372 | ) 373 | 374 | if st.button("Start Calibration", type="primary"): 375 | st.session_state.calibration_started = True 376 | 377 | if st.session_state.calibration_started: 378 | with st.spinner("Preparing Calibration..."): 379 | calibrate_rec_sys(num_papers=10) 380 | st.markdown( 381 | """ 382 | ### Step 3: All done! Navigate to the Get Recommendations page to generate your personalized feed!""" 383 | ) 384 | # TODO add a button to navigate to the get recommendations page 385 | 386 | 387 | # About Page 388 | elif navigation == "About": 389 | st.markdown( 390 | """ 391 | ## Welcome to the Scholarly Recommender Cloud API 392 | 393 | This platform is designed to offer you highly tailored scholarly recommendations. 394 | Whether you're a researcher, academic, or just someone interested in scientific literature, 395 | this service is built to cater to your specific needs. 396 | 397 | ### How It Works 398 | 399 | - **Configure**: The first step is to configure the system to your interests. This can be done by navigating to the configure page. 400 | - **Get Recommendations**: Once you've configured the system, you can generate recommendations by navigating to the get recommendations page. 401 | - **Categories & Subcategories**: You can customize your interests by selecting from a wide range of categories and subcategories; by default, the system will search based on your configured interests. 402 | - **Recommendation Count**: Choose how many recommendations you want to receive. 403 | - **Time Range**: Decide the time frame for the articles you're interested in. 404 | 405 | ### What Makes Us Different 406 | 407 | - **Personalized**: Recommendations are fine-tuned to match your unique academic interests. 408 | - **Up-to-Date**: The platform provides options to focus on the most recent articles. 
409 | - **Quality Assured**: We prioritize recommendations from reputable sources and peer-reviewed journals. 410 | 411 | ### Mission Statement 412 | 413 | As an upcoming data scientist with a strong passion for deep learning, I am always looking for new technologies and methodologies. Naturally, I spend a considerable amount of time researching and reading new publications to accomplish this. However, over 14,000 academic papers are published every day on average, making it extremely tedious to continuously source papers relevant to my interests. My primary motivation for creating ScholarlyRecommender is to address this, creating a fully automated and personalized system that prepares a feed of academic papers relevant to me. This feed is prepared on demand, through a completely abstracted Streamlit web interface, or sent directly to my email on a timed basis. This project was designed to be scalable and adaptable: it can be easily tuned not only to your own interests but can also become a fully automated, self-improving newsletter. Details on how to use this system, the methods used for retrieval and ranking, and the features currently planned or in development are listed below. 414 | 415 | 416 | """ 417 | ) 418 | 419 | # Contact Page 420 | elif navigation == "Contact": 421 | st.markdown( 422 | """ 423 |

# Contact Information

424 | 425 | ## Project Support 426 | 427 | Please report any issues by creating an issue on the GitHub repository, or by sending an email to the project email directly. 428 | 429 | - **Github Issue**: https://github.com/iansnyder333/ScholarlyRecommender/issues 430 | - **Project Email**: scholarlyrecommender@gmail.com 431 | 432 | ## Developer Contact Information 433 | 434 | If you have any questions or concerns, please feel free to reach out to me directly. 435 | I recently graduated college and am the sole developer of this project, so I would love any constructive feedback you have to offer to help me improve as a developer. 436 | 437 | - **Ian Snyder**: [@iansnydes](https://twitter.com/iansnydes) - idsnyder136@gmail.com 438 | - **Website and Portfolio**: [iansnyder333.github.io/frontend](https://iansnyder333.github.io/frontend) 439 | - **LinkedIn**: [www.linkedin.com/in/iandsnyder](https://www.linkedin.com/in/iandsnyder) 440 | 441 | 442 | """, 443 | unsafe_allow_html=True, 444 | ) 445 | 446 | 447 | # Footer 448 | st.markdown( 449 | """ 450 | 453 | """, 454 | unsafe_allow_html=True, 455 | ) 456 | --------------------------------------------------------------------------------
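The `validate_email` helper in `webapp.py` gates both the email form and `send_email`, so its regex is worth a standalone look. Below is a minimal, self-contained sketch of the same check, with the TLD character class written as `[A-Za-z]{2,}` (in the original `[A-Z|a-z]`, the `|` is a literal pipe inside the class, so addresses like `user@domain.a|b` would pass). The sample addresses are hypothetical test values, not anything from the repository.

```python
from re import match

# Same pattern shape as webapp.validate_email; the TLD class is [A-Za-z]{2,}
# so a literal "|" is not accepted as a top-level-domain character.
EMAIL_RE = r"^\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"


def validate_email(email: str) -> bool:
    """Return True if the address matches the newsletter's email pattern."""
    return bool(match(EMAIL_RE, email))


if __name__ == "__main__":
    print(validate_email("example@email.com"))  # True
    print(validate_email("not-an-email"))       # False
    print(validate_email("user@domain.a|b"))    # False with the fixed class
```

Note that `re.match` only anchors at the start of the string, so the explicit `^` is redundant but harmless; the trailing `\b` stops the match at the end of the TLD.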