├── .gitignore
├── LICENSE
├── README.md
├── README_zh.md
├── arxiv_bot
│   ├── __init__.py
│   ├── agent.py
│   ├── ai_service.py
│   ├── config.py
│   ├── fetcher.py
│   ├── messenger.py
│   └── storage.py
├── assert
│   └── ARIES.webp
├── config.yaml
├── main.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python
2 | __pycache__/
3 | *.py[cod]
4 | *.egg-info/
5 |
6 | .DS_Store
7 |
8 | # Environment
9 | .env
10 |
11 | # Storage
12 | paper_history.json
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2025-present
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
20 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ♈️ Aries: ArXiv Research Intelligent Efficient Summary
2 | ## Arxiv Paper to Feishu
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 | [中文文档](README_zh.md)
12 | ---
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 | ## 🎉 Introduction
25 |
26 | A tool that automatically fetches the latest LLM-related papers from arXiv and pushes them through Feishu bots. The bot uses Deepseek AI for intelligent filtering and summarization, helping you stay up-to-date with the latest research developments.
27 |
28 | ---
29 |
30 | ## ✨ Features
31 |
32 | - 🤖 **Auto Paper Fetching**: Scrapes the latest LLM-related papers from arXiv.
33 | - 🧠 **Smart Filtering & Summarization**: Uses Deepseek AI for high-quality filtering and summarization.
34 | - 📱 **Multi-bot Support**: Configure multiple Feishu bots to push to different groups.
35 | - ⏰ **Scheduled Tasks**: Runs daily at configurable times (the sample config schedules 09:00 and 18:00).
36 | - ⚙️ **Flexible Configuration**: Customize paper types, filtering rules, and push methods via config file.
37 | - 📊 **History Tracking**: Pushed papers are recorded in `paper_history.json` to avoid duplicates.
38 |
39 | ---
40 |
41 | ## 🚀 Quick Start
42 |
43 | 1. Install dependencies:
44 | ```bash
45 | pip install -r requirements.txt
46 | ```
47 | 2. Configure environment variables:
48 | - Create a `.env` file and fill in as follows:
49 | ```
50 | DEEPSEEK_API_KEY=your_deepseek_api_key
51 | WEBHOOK_URL_1=https://your_first_webhook_url
52 | WEBHOOK_URL_2=https://your_second_webhook_url
53 | ```
54 |
55 | 3. Configure `config.yaml` to customize paper types and filtering rules.
56 |
57 | 4. Run the script:
58 | ```bash
59 | python main.py
60 | ```
61 |
62 | ---
63 |
64 | ## ⚙️ Configuration Guide
65 |
66 | ### Environment Variables
67 |
68 | - `DEEPSEEK_API_KEY`: **Required**, for calling Deepseek API.
69 | - `WEBHOOK_URL_[n]`: **Required**, Feishu bot webhook URLs; number them consecutively from 1 (`WEBHOOK_URL_1`, `WEBHOOK_URL_2`, ...), as reading stops at the first gap.
70 |
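The `WEBHOOK_URL_[n]` variables are read in ascending order, and collection stops at the first missing index, mirroring `_load_env_vars` in `arxiv_bot/config.py`. A minimal sketch (URLs illustrative):

```python
import os

def collect_webhook_urls():
    """Read WEBHOOK_URL_1, WEBHOOK_URL_2, ... until the first gap."""
    urls, i = [], 1
    while (url := os.getenv(f"WEBHOOK_URL_{i}")):
        urls.append(url)
        i += 1
    return urls

os.environ["WEBHOOK_URL_1"] = "https://example.com/hook1"
os.environ["WEBHOOK_URL_2"] = "https://example.com/hook2"
print(collect_webhook_urls())  # both URLs, in order
```

Because collection stops at the first gap, defining `WEBHOOK_URL_1` and `WEBHOOK_URL_3` without `WEBHOOK_URL_2` would silently drop the third URL.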
71 | ### `config.yaml` Settings
72 |
73 | - **`paper_types`**: Settings for each paper type.
74 |   - **`enabled`**: Whether the paper type is active (`true`/`false`).
75 |   - **`search_query`**: arXiv search query; supports boolean operators and keyword combinations.
76 |   - **`keywords`**: Keyword list used to pre-filter papers.
77 |   - **`prompt`**: Prompt Deepseek uses to judge relevance; auto-generated from `search_query` and `keywords` if omitted, and can be overridden.
78 |   - **`max_papers`**: Maximum number of papers per push (default 5).
79 |
80 | - **`general`**: Global configuration.
81 |   - **`max_search_results`**: Maximum number of papers fetched per search batch.
82 |   - **`schedule_time`**: Daily run time(s) in 24-hour format; a single `"HH:MM"` string or a list of them.
83 |
84 | ---
85 |
86 | ## ❗ Important Notes
87 |
88 | 1. Ensure sufficient Deepseek API Key quota.
89 | 2. Feishu Webhook URLs should be obtained from bot settings in Feishu groups.
90 | 3. Recommended to deploy on a server for continuous operation.
91 | 4. New paper types can be added or existing configurations adjusted via `config.yaml`.
92 |
93 | ---
94 |
95 | ## ❓ FAQ
96 |
97 | 1. **API Call Failure**
98 | - Check if the Deepseek API Key is correct.
99 | 2. **Message Push Failure**
100 | - Verify if the Webhook URL is valid.
101 | 3. **Test Push**
102 | - Uncomment `agent.run()` in the `main` function to run directly.
103 |
104 | ---
105 |
106 | ## 📝 TODO List
107 |
108 | - 📚 Paper Collection & Management
109 | - [x] Automatic arXiv paper fetching
110 | - [ ] Paper history storage
111 | - [ ] Related paper correlation analysis
112 | - [ ] Paper archiving system
113 |
114 | - 🔍 Intelligent Paper Processing
115 | - [x] Auto Summary: Paper summarization
116 | - [ ] Auto Review: Paper review generation
117 | - [ ] Auto Survey: Field survey generation
118 |
119 | - 📢 Multi-platform Distribution
120 | - [x] Feishu bot integration
121 | - [ ] WeChat bot integration
122 | - [ ] Xiaohongshu content publishing
123 |
124 | ---
125 |
126 | ## 📄 License
127 |
128 | This project is licensed under the [MIT License](https://opensource.org/license/mit).
--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
1 | # ♈️ Aries: ArXiv Research Intelligent Efficient Summary
2 | ## Arxiv 论文推送到飞书
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 | [English](README.md)
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 | ## 介绍
22 |
23 | 一个自动获取 arXiv 最新 LLM 相关论文,并通过飞书机器人推送的工具。该机器人使用 Deepseek AI 对论文进行智能筛选和总结,帮助您快速了解最新研究动态。
24 |
25 | ---
26 |
27 | ## 功能特点
28 |
29 | - 🤖 **自动获取最新论文**:从 arXiv 抓取最新的 LLM 相关论文。
30 | - 🧠 **智能筛选与总结**:使用 Deepseek AI 提供高质量的筛选和总结。
31 | - 📱 **支持多机器人推送**:可配置多个飞书机器人,分别推送到不同群组。
32 | - ⏰ **定时任务支持**:每天按配置的时间定时推送(示例配置为 09:00 和 18:00)。
33 | - ⚙️ **灵活配置**:通过配置文件自定义论文类型、筛选规则和推送方式。
34 | - 📊 **历史记录**:推送过的论文会记录在 `paper_history.json` 文件中,下次推送时会跳过。
35 |
36 | ---
37 |
38 | ## 快速开始
39 |
40 | 1. 安装依赖:
41 | ```bash
42 | pip install -r requirements.txt
43 | ```
44 | 2. 配置环境变量:
45 | - 创建一个 `.env` 文件,按照下方样例填写:
46 | ```
47 | DEEPSEEK_API_KEY=your_deepseek_api_key
48 | WEBHOOK_URL_1=https://your_first_webhook_url
49 | WEBHOOK_URL_2=https://your_second_webhook_url
50 | ```
51 |
52 | 3. 配置 `config.yaml`,自定义论文类型和筛选规则。
53 |
54 | 4. 启动脚本:
55 | ```bash
56 | python main.py
57 | ```
58 |
59 | ---
60 |
61 | ## 配置说明
62 |
63 | ### 环境变量配置
64 |
65 | - `DEEPSEEK_API_KEY`:**必填**,用于调用 Deepseek API。
66 | - `WEBHOOK_URL_[n]`:**必填**,飞书机器人的 Webhook 地址,可配置多个。
67 |
68 | ### `config.yaml` 配置说明
69 |
70 | - **`paper_types`**:定义每种论文类型的具体设置。
71 |   - **`enabled`**:是否启用该论文类型(`true`/`false`)。
72 |   - **`search_query`**:在 arXiv 上使用的搜索查询,支持逻辑条件和关键词组合。
73 |   - **`keywords`**:用于筛选论文的关键词列表。
74 |   - **`prompt`**:Deepseek 判断论文相关性的提示词;若未提供,会根据 `search_query` 和 `keywords` 自动生成,用户可自行覆盖。
75 |   - **`max_papers`**:单次推送的最大论文数量(默认 5,可自行修改)。
76 |
77 | - **`general`**:全局配置。
78 |   - **`max_search_results`**:单次搜索返回的最大论文数量。
79 |   - **`schedule_time`**:每天定时任务运行时间(24 小时制),可填单个 `"HH:MM"` 或列表。
80 |
81 | ---
82 |
83 | ## 注意事项
84 |
85 | 1. 确保 Deepseek API Key 额度充足。
86 | 2. 飞书 Webhook 地址需要在飞书群中通过机器人设置获取。
87 | 3. 建议部署在服务器上持续运行。
88 | 4. 可通过编辑 `config.yaml` 添加新的论文类型或调整现有配置。
89 |
90 | ---
91 |
92 | ## 常见问题
93 |
94 | 1. **API 调用失败**
95 | - 请检查 Deepseek API Key 是否正确。
96 | 2. **消息推送失败**
97 | - 请确认 Webhook 地址是否有效。
98 | 3. **测试推送**
99 | - 可取消 `main` 函数中 `agent.run()` 的注释,直接运行一次推送。
100 |
101 | ---
102 |
103 | ## 📝 待办事项
104 |
105 | - 📚 文章收集与管理
106 | - [x] arXiv论文自动抓取
107 | - [ ] 文章历史记忆存储
108 | - [ ] 相关论文关联分析
109 | - [ ] 论文分类归档系统
110 |
111 | - 🔍 文章智能处理
112 | - [x] Auto Summary: 论文自动摘要
113 | - [ ] Auto Review: 论文自动点评
114 | - [ ] Auto Survey: 领域综述生成
115 |
116 | - 📢 多平台推送
117 | - [x] 飞书机器人推送
118 | - [ ] 微信机器人集成
119 | - [ ] 小红书内容发布
120 |
121 | ---
122 |
123 | ## License
124 |
125 | 本项目使用 [MIT License](https://opensource.org/license/mit)。
126 |
--------------------------------------------------------------------------------
/arxiv_bot/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LAMDASZ-ML/Aries/fb266eddd4e021f1f7db26853dd5ea303e671663/arxiv_bot/__init__.py
--------------------------------------------------------------------------------
/arxiv_bot/agent.py:
--------------------------------------------------------------------------------
1 | from typing import List, Dict
2 | from .config import Config
3 | from .ai_service import AIService
4 | from .fetcher import PaperFetcher
5 | from .messenger import FeishuMessenger
6 |
7 | class ArxivPaperAgent:
8 | def __init__(self, config_path: str = "config.yaml"):
9 | self.config = Config(config_path)
10 | self.ai_service = AIService(self.config.api_key)
11 | self.fetcher = PaperFetcher(self.ai_service, self.config)
12 | self.messenger = FeishuMessenger(self.config.webhook_urls)
13 |
14 | def run(self):
15 | """运行主流程"""
16 | for paper_type, type_config in self.config.config['paper_types'].items():
17 | if not type_config['enabled']:
18 | continue
19 |
20 | papers = self.fetcher.fetch_papers(paper_type)
21 | if not papers:
22 | continue
23 |
24 | summaries = self._process_papers(papers)
25 | self.messenger.send_message(summaries, paper_type, type_config)
26 |
27 | def _process_papers(self, papers: List[Dict]) -> List[Dict]:
28 | summaries = []
29 | for paper in papers:
30 | summary = self.ai_service.summarize(
31 | paper['abstract'],
32 | self.config.get_general_config()['summary_prompt']
33 | )
34 | summaries.append({
35 | 'title': paper['title'],
36 | 'summary': summary,
37 | 'url': paper['url']
38 | })
39 | return summaries
--------------------------------------------------------------------------------
/arxiv_bot/ai_service.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from typing import Dict
3 | import json
4 |
5 | class AIService:
6 | def __init__(self, api_key: str):
7 | self.api_key = api_key
8 | self.api_url = "https://api.deepseek.com/chat/completions"
9 |
10 | def _call_api(self, prompt: str) -> str:
11 | headers = {
12 | "Authorization": f"Bearer {self.api_key}",
13 | "Content-Type": "application/json"
14 | }
15 |
16 | data = {
17 | "model": "deepseek-chat",
18 | "messages": [
19 | {"role": "system", "content": "你是一个专业的学术论文助手,善于总结和分析论文内容。"},
20 | {"role": "user", "content": prompt}
21 | ],
22 | "temperature": 0.7
23 | }
24 |
25 | try:
26 | response = requests.post(self.api_url, headers=headers, json=data, timeout=60)
27 | response.raise_for_status()
28 | result = response.json()
29 |
30 | if 'choices' in result and len(result['choices']) > 0:
31 | output = result['choices'][0]['message']['content']
32 | return output
33 | else:
34 | print(f"API response format error: {json.dumps(result, ensure_ascii=False)}")
35 | return ""
36 |
37 | except Exception as e:
38 | print(f"Unexpected error: {e}")
39 | return ""
40 |
41 | def check_relevance(self, title: str, abstract: str, prompt_template: str) -> bool:
42 | prompt = prompt_template.format(title=title, abstract=abstract)
43 | answer = self._call_api(prompt).strip().lower()
44 | return not ("否" in answer or "no" in answer)
45 |
46 | def summarize(self, abstract: str, prompt_template: str) -> str:
47 | prompt = prompt_template.format(abstract=abstract)
48 | result = self._call_api(prompt)
49 | if not result:
50 | return "抱歉,生成摘要时出现错误。"
51 | return result
--------------------------------------------------------------------------------
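`check_relevance` above treats any answer containing "否" or "no" (case-insensitively) as irrelevant, which is deliberately coarse. A standalone sketch of that post-processing step (sample answers made up):

```python
def parse_relevance(answer: str) -> bool:
    # Mirrors AIService.check_relevance: any "否"/"no" substring marks the
    # paper irrelevant; every other answer counts as relevant.
    a = answer.strip().lower()
    return not ("否" in a or "no" in a)

print(parse_relevance("是"))  # True
print(parse_relevance("否"))  # False
```

Note that substring matching means hedged answers like "not sure" also read as irrelevant, biasing the filter toward precision over recall.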
/arxiv_bot/config.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, Any
2 | import os
3 | import yaml
4 | from dotenv import load_dotenv, dotenv_values
5 |
6 | load_dotenv(override=True)
7 |
8 | class Config:
9 | def __init__(self, config_path: str = "config.yaml"):
10 | self._load_config(config_path)
11 | self._load_env_vars()
12 |
13 | def _load_config(self, config_path: str):
14 | with open(config_path, 'r', encoding='utf-8') as f:
15 | self.config = yaml.safe_load(f)
16 | for paper_type, details in self.config['paper_types'].items():
17 | if 'title' not in details:
18 | details['title'] = f"今日{paper_type}论文更新".upper()
19 | if 'prompt' not in details:
20 | details['prompt'] = self._generate_prompt(paper_type, details)
21 | if 'max_papers' not in details:
22 | details['max_papers'] = 5
23 |
24 | if 'summary_prompt' not in self.config['general']:
25 | self.config['general']['summary_prompt'] = "请根据摘要用一句话总结这篇文章的核心内容: {abstract}"
26 |
27 | def _generate_prompt(self, paper_type: str, details: Dict[str, Any]) -> str:
28 | """
29 | 根据 paper_type 和 config.yaml 中的配置动态生成 prompt。
30 | """
31 | keywords = "、".join(details.get('keywords', []))
32 | search_query = details.get('search_query', '未定义搜索条件')
33 |
34 | prompt = (
35 | f"请批判性地判断这篇论文是否与以下主题相关:{keywords}。\n\n"
36 | f"标题: {{title}}\n"
37 | f"摘要: {{abstract}}\n\n"
38 | f"请只回答\"是\"或\"否\"。如果论文主要研究与以下关键词相关的主题:{keywords},或符合搜索条件:{search_query},回答\"是\";\n"
39 | f"否则请回答\"否\"。"
40 | )
41 | return prompt
42 |
43 | def _load_env_vars(self):
44 | self.webhook_urls = []
45 | i = 1
46 | while True:
47 | webhook_url = os.getenv(f'WEBHOOK_URL_{i}')
48 | if webhook_url:
49 | self.webhook_urls.append(webhook_url)
50 | i += 1
51 | else:
52 | break
53 |
54 | if not self.webhook_urls:
55 | raise ValueError("No webhook URLs found in environment variables")
56 |
57 | # load_dotenv(override=True) 已在模块加载时合并 .env,直接读进程环境即可
58 | self.api_key = os.getenv('DEEPSEEK_API_KEY')
59 | if not self.api_key:
60 | raise ValueError("DEEPSEEK_API_KEY not found in environment variables")
61 |
62 | def get_paper_type_config(self, paper_type: str) -> Dict[str, Any]:
63 | return self.config['paper_types'][paper_type]
64 |
65 | def get_general_config(self) -> Dict[str, Any]:
66 | return self.config['general']
--------------------------------------------------------------------------------
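In `_generate_prompt` above, `{{title}}` and `{{abstract}}` are doubled so the f-string emits literal `{title}`/`{abstract}` placeholders, which `AIService.check_relevance` fills later via `str.format`. A compact illustration of that two-stage fill (keywords and paper values made up):

```python
keywords = "、".join(["reasoning", "LLM Reasoning"])
# The f-string fills {keywords} now; single-brace fields survive for later.
template = (
    f"请判断这篇论文是否与以下主题相关:{keywords}。\n"
    "标题: {title}\n摘要: {abstract}\n"
    "请只回答\"是\"或\"否\"。"
)
prompt = template.format(title="Fast and Slow Thinking in LLMs",
                         abstract="We study reasoning in large language models.")
print(prompt)
```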
/arxiv_bot/fetcher.py:
--------------------------------------------------------------------------------
1 | import arxiv
2 | from typing import List, Dict
3 | from tqdm import tqdm
4 | from .storage import PaperStorage
5 | import time
6 |
7 | class PaperFetcher:
8 | def __init__(self, ai_service, config):
9 | self.ai_service = ai_service
10 | self.config = config
11 | self.storage = PaperStorage()
12 |
13 | def fetch_papers(self, paper_type: str):
14 | total_start = time.time()
15 |
16 | type_config = self.config.get_paper_type_config(paper_type)
17 | general_config = self.config.get_general_config()
18 |
19 | print(f"\n📚 开始获取 {paper_type} 类型的论文...")
20 | print(f"🔍 搜索查询: {type_config['search_query']}")
21 | print(f"📋 目标论文数量: {type_config['max_papers']}\n")
22 |
23 | papers = []
24 | max_papers = type_config['max_papers']
25 | max_results_per_request = general_config['max_search_results']
26 | search_query = type_config['search_query']
27 | max_attempts = 3
28 | retry_delay = 5 # 重试等待时间(秒)
29 |
30 | search = arxiv.Search(
31 | query=search_query,
32 | max_results=general_config['max_search_results'] * max_attempts,
33 | sort_by=arxiv.SortCriterion.SubmittedDate
34 | )
35 |
36 | results_iterator = iter(arxiv.Client().results(search))
37 | current_batch = 0
38 |
39 | while len(papers) < max_papers:
40 | try:
41 | print(f"\n📥 正在获取第 {current_batch + 1} 批论文")
42 |
43 | # 获取当前批次的论文
44 | current_papers = []
45 | for _ in range(max_results_per_request):
46 | try:
47 | current_papers.append(next(results_iterator))
48 | except StopIteration:
49 | print("\n📢 没有更多论文结果")
50 | break
51 |
52 | if not current_papers:
53 | break
54 |
55 | # 获取历史记录中最新的和最旧的论文ID
56 | latest_paper_id, oldest_paper_id = self.storage.get_latest_and_oldest_paper_id(paper_type)
57 | # 过滤已经推送过的论文
58 | if latest_paper_id:
59 | current_papers = [r for r in current_papers if self._is_valid_paper(r.entry_id, latest_paper_id, oldest_paper_id)]
60 |
61 | relevant_count = 0
62 | for result in tqdm(current_papers, desc=f"🔍 正在分析论文相关性", unit="篇"):
63 | if len(papers) >= max_papers:
64 | break
65 |
66 | if self.storage.is_paper_exists(paper_type, result.entry_id):
67 | continue
68 |
69 | if self._is_relevant_paper(result.title, result.summary, type_config):
70 | paper = {
71 | 'title': result.title,
72 | 'abstract': result.summary,
73 | 'url': result.entry_id
74 | }
75 | papers.append(paper)
76 | self.storage.add_paper(paper_type, result.entry_id)
77 | relevant_count += 1
78 |
79 | print(f"📊 其中相关论文: {relevant_count} 篇")
80 | current_batch += 1
81 |
82 | except Exception as e:
83 | print(f"\n❌ 发生错误: {str(e)}")
84 | time.sleep(retry_delay)
85 | continue
86 |
87 | total_time = time.time() - total_start
88 | print(f"\n✅ 论文获取完成!")
89 | print(f"⏱️ 总耗时: {total_time:.2f} 秒")
90 | print(f"📝 共获取相关论文: {len(papers)} 篇\n")
91 |
92 | return papers
93 |
94 |
95 | def _is_relevant_paper(self, title: str, abstract: str, type_config: Dict) -> bool:
96 | # 关键词检查
97 | for keyword in type_config['keywords']:
98 | if keyword.lower() in title.lower() or keyword.lower() in abstract.lower():
99 | return True
100 |
101 | # AI 相关性检查
102 | return self.ai_service.check_relevance(title, abstract, type_config['prompt'])
103 |
104 | def _is_valid_paper(self, current_id: str, latest_id: str, oldest_id: str) -> bool:
105 | """
106 | 比较论文ID,判断当前论文是否比历史中最新的论文更新,或比最旧的论文更旧(即落在已推送区间之外)
107 | arxiv ID格式例如: 2403.12345v1
108 | """
109 | current_version = float(current_id.split('/')[-1].split('v')[0])
110 | latest_version = float(latest_id)
111 | oldest_version = float(oldest_id)
112 | return current_version > latest_version or current_version < oldest_version
--------------------------------------------------------------------------------
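`_is_valid_paper` and `storage.get_latest_and_oldest_paper_id` share one trick: a new-style arXiv ID (`YYMM.NNNNN`, zero-padded to a fixed width) parses to a float that preserves chronological order. A standalone sketch (IDs made up):

```python
def arxiv_numeric_id(entry_id: str) -> float:
    # 'http://arxiv.org/abs/2403.12345v1' -> 2403.12345
    return float(entry_id.split('/')[-1].split('v')[0])

print(arxiv_numeric_id("http://arxiv.org/abs/2403.12345v1"))
```

The numeric ordering only holds for same-width new-style IDs (post-2007 format, 5-digit suffixes since 2015); `storage.py` guards the parse with a try/except for anything unexpected.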
/arxiv_bot/messenger.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from datetime import datetime
3 | from typing import List, Dict
4 |
5 | class FeishuMessenger:
6 | def __init__(self, webhook_urls: List[str]):
7 | self.webhook_urls = webhook_urls
8 |
9 | def send_message(self, summaries: List[Dict], paper_type: str, type_config: Dict) -> bool:
10 | message = self._format_message(summaries, paper_type, type_config)
11 | return self._send_to_webhooks(message)
12 |
13 | def _format_message(self, summaries: List[Dict], paper_type: str, type_config: Dict) -> Dict:
14 | return {
15 | "msg_type": "post",
16 | "content": {
17 | "post": {
18 | "zh_cn": {
19 | "title": f"{type_config['title']} - {datetime.now().strftime('%Y-%m-%d')}",
20 | "content": [
21 | [{
22 | "tag": "text",
23 | "text": f"📑 {paper['title']}\n"
24 | f"💡 总结: {paper['summary']}\n"
25 | f"🔗 链接: {paper['url']}\n\n"
26 | }] for paper in summaries
27 | ]
28 | }
29 | }
30 | }
31 | }
32 |
33 | def _send_to_webhooks(self, message: Dict) -> bool:
34 | results = []
35 | for webhook_url in self.webhook_urls:
36 | try:
37 | response = requests.post(webhook_url, json=message, timeout=10)
38 | success = response.status_code == 200
39 | results.append(success)
40 | print(f"Sent to webhook {webhook_url}: {'Success' if success else 'Failed'}")
41 | except Exception as e:
42 | print(f"Error sending to webhook {webhook_url}: {e}")
43 | results.append(False)
44 |
45 | return any(results)
--------------------------------------------------------------------------------
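The Feishu "post" payload built in `_format_message` nests paragraphs inside a `zh_cn` locale block: each paper becomes one paragraph, itself a list of text segments. A self-contained sketch of the same shape (sample values):

```python
from datetime import date

def format_post(summaries, title):
    # One inner list per paper; each inner list holds that paragraph's segments.
    paragraphs = [[{"tag": "text",
                    "text": f"📑 {p['title']}\n💡 总结: {p['summary']}\n🔗 链接: {p['url']}\n\n"}]
                  for p in summaries]
    return {"msg_type": "post",
            "content": {"post": {"zh_cn": {
                "title": f"{title} - {date.today().isoformat()}",
                "content": paragraphs}}}}

msg = format_post([{"title": "T", "summary": "S", "url": "U"}], "今日论文")
print(msg["msg_type"])  # post
```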
/arxiv_bot/storage.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from typing import Set, Dict
4 | from datetime import datetime
5 |
6 | class PaperStorage:
7 | def __init__(self, storage_file: str = "paper_history.json"):
8 | self.storage_file = storage_file
9 | self.paper_history = self._load_history()
10 |
11 | def _load_history(self) -> Dict[str, Set[str]]:
12 | if os.path.exists(self.storage_file):
13 | with open(self.storage_file, 'r', encoding='utf-8') as f:
14 | history_dict = json.load(f)
15 | # 将列表转换为集合以提高查找效率
16 | return {k: set(v) for k, v in history_dict.items()}
17 | return {}
18 |
19 | def _save_history(self):
20 | # 将集合转换回列表以便JSON序列化
21 | history_dict = {k: list(v) for k, v in self.paper_history.items()}
22 | with open(self.storage_file, 'w', encoding='utf-8') as f:
23 | json.dump(history_dict, f, ensure_ascii=False, indent=2)
24 |
25 | def is_paper_exists(self, paper_type: str, paper_url: str) -> bool:
26 | return paper_url in self.paper_history.get(paper_type, set())
27 |
28 | def add_paper(self, paper_type: str, paper_url: str):
29 | if paper_type not in self.paper_history:
30 | self.paper_history[paper_type] = set()
31 | self.paper_history[paper_type].add(paper_url)
32 | self._save_history()
33 |
34 | def get_latest_and_oldest_paper_id(self, paper_type: str) -> tuple:
35 | """
36 | 获取指定类型中最新的和最旧的论文ID
37 | """
38 | papers = self.paper_history.get(paper_type, set())
39 | if not papers:
40 | return float('-inf'), float('inf')
41 |
42 | try:
43 | paper_ids = []
44 | for pid in papers:
45 | id_part = pid.split('/')[-1].split('v')[0]
46 | paper_ids.append((pid, float(id_part)))
47 |
48 | latest = max(paper_ids, key=lambda x: x[1])
49 | oldest = min(paper_ids, key=lambda x: x[1])
50 | return latest[1], oldest[1]
51 | except (ValueError, IndexError):
52 | return float('-inf'), float('inf')
--------------------------------------------------------------------------------
/assert/ARIES.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LAMDASZ-ML/Aries/fb266eddd4e021f1f7db26853dd5ea303e671663/assert/ARIES.webp
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | paper_types:
2 | reasoning:
3 | enabled: true
4 | search_query: "cat:cs.AI AND (LLM OR 'Large Language Model Reasoning' OR 'LLM Reasoning' OR 'Neuro-Symbolic' OR 'Fast and Slow Thinking')"
5 | keywords:
6 | - reasoning
7 | - fast and slow thinking
8 | - Neuro-Symbolic
9 | - LLM Reasoning
10 | - Large Language Model Reasoning
11 | mllm:
12 | enabled: true
13 | search_query: "cat:cs.AI AND (LLM OR 'Multimodal Large Language Models')"
14 | keywords:
15 | - multimodal Large Language Models
16 | - cross-modal Large Language Models
17 | - MLLM
18 |
19 | general:
20 | max_search_results: 100
21 | schedule_time:
22 | - "09:00"
23 | - "18:00"
--------------------------------------------------------------------------------
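`schedule_time` is a list here, but `main.py` also accepts a single string; the normalization it applies before registering jobs boils down to:

```python
def normalize_schedule_times(value):
    # A lone "HH:MM" string becomes a one-element list; lists pass through.
    return [value] if isinstance(value, str) else list(value)

print(normalize_schedule_times(["09:00", "18:00"]))
```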
/main.py:
--------------------------------------------------------------------------------
1 | import schedule
2 | import time
3 | from arxiv_bot.agent import ArxivPaperAgent
4 |
5 | def main():
6 | agent = ArxivPaperAgent()
7 | # agent.run() # 取消注释以立即运行一次
8 |
9 | schedule_times = agent.config.get_general_config()['schedule_time']
10 | if isinstance(schedule_times, str):
11 | schedule_times = [schedule_times]
12 |
13 | for schedule_time in schedule_times:
14 | schedule.every().day.at(schedule_time).do(agent.run)
15 | print(f"已设置定时任务:每天 {schedule_time} 运行")
16 |
17 | while True:
18 | schedule.run_pending()
19 | time.sleep(60)
20 |
21 | if __name__ == "__main__":
22 | main()
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | arxiv
2 | requests
3 | python-dotenv
4 | schedule
5 | tqdm
6 | PyYAML
--------------------------------------------------------------------------------