├── .idea
│   ├── .gitignore
│   ├── dictionaries
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── gpt-langchain-pdf-chat.iml
├── files
│   └── Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf
├── README.md
├── gpt-langchain-pdf-chat-demo.ipynb
└── .ipynb_checkpoints
    └── gpt-langchain-pdf-chat-demo-checkpoint.ipynb
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 |
--------------------------------------------------------------------------------
/.idea/dictionaries:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jackley-dev/gpt-langchain-pdf-chat/HEAD/files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/profiles_settings.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
--------------------------------------------------------------------------------
/.idea/gpt-langchain-pdf-chat.iml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
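1 | # Project Overview
2 | A local knowledge-base question-answering demo built with LangChain on top of the ChatGPT API and the Pinecone vector database. It reads the PDF documents in a local directory, embeds them, stores the vectors in Pinecone, and answers questions using the domain-specific knowledge held in that database.
3 |
4 | # Usage
5 | 1. Sign up for a Pinecone trial at pinecone.io to obtain a Pinecone API key and the related environment settings.
6 | 2. Update the following parameters in the demo with your actual keys and environment values: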
7 | ```python
8 | PINECONE_API_KEY='xx'
9 | PINECONE_ENV='xx'
10 | os.environ['OPENAI_API_KEY']='xx'
11 | PINECONE_INDEX='xx'
12 | ```
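
Hard-coded keys are easy to commit by accident, so one option is to read the same four values from environment variables and fail early if any is missing. The sketch below is only an illustration: `load_settings` is a hypothetical helper that is not part of the demo, and reading the `PINECONE_*` values from the environment is an assumption (the notebook assigns them as plain Python variables).

```python
import os

# Names mirror the demo's configuration. Reading PINECONE_* from the
# environment is an assumption; the notebook assigns them as variables.
REQUIRED_KEYS = ("PINECONE_API_KEY", "PINECONE_ENV", "OPENAI_API_KEY", "PINECONE_INDEX")


def load_settings(env=None):
    """Return the required settings, raising if any is missing or still 'xx'."""
    env = os.environ if env is None else env
    settings = {}
    missing = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "").strip()
        # 'xx' is the placeholder used in the demo, so treat it as unset.
        if not value or value == "xx":
            missing.append(key)
        else:
            settings[key] = value
    if missing:
        raise RuntimeError("Missing configuration: " + ", ".join(missing))
    return settings
```

Calling `load_settings()` at the top of the notebook would then replace the four assignments above, provided the variables are exported before Jupyter starts.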
13 |
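14 | # Overall Approach
15 |
16 | ## 1. Read PDFs from a local directory and split them
17 | 1. Use langchain to read every PDF file in the local file_folder directory
18 | 2. Use langchain to split the PDF text into small chunks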
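
To make the chunking step concrete, here is a minimal standard-library sketch of overlapping character windows, using the same sizes as the notebook (1000 characters with a 100-character overlap). It is a simplified stand-in, not langchain's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as newlines, whereas the hypothetical `split_text` below cuts at fixed offsets.

```python
def split_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghij" * 250)  # 2500 characters of sample text
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, which helps a sentence that straddles a boundary stay retrievable from at least one chunk.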
19 |
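20 | ## 2. Embed the content and store it in a vector database
21 | 1. Convert the documents into vectors through OpenAI's embedding endpoint
22 | 2. Store the resulting vectors in the Pinecone vector database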
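
The embed-then-store idea can be tried offline with a toy example. Everything in this sketch is an assumption made for illustration: `toy_embed` is a hashed bag-of-words vector, not OpenAI's embedding model, and `InMemoryIndex` is a hypothetical list-backed stand-in, not the Pinecone client.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hash each word into one of `dim` buckets and L2-normalise the counts."""
    vector = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


class InMemoryIndex:
    """Hypothetical stand-in for a vector index: keeps (vector, text, metadata)."""

    def __init__(self):
        self.records = []

    def upsert(self, text, metadata=None):
        # Embed once at write time so queries only need to embed the question.
        self.records.append((toy_embed(text), text, metadata or {}))

    def __len__(self):
        return len(self.records)


index = InMemoryIndex()
sample_chunks = ["ChatGPT for machine translation", "pivot prompting for distant languages"]
for page, chunk in enumerate(sample_chunks):
    index.upsert(chunk, {"page": page})
print(len(index))
```

Storing the source text and metadata next to each vector mirrors what the notebook's search output shows, where every hit carries `page_content` plus `source` and `page` metadata.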
23 |
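24 | ## 3. Search the vector database for content similar to the query and pass it to GPT to answer
25 | 1. Use the similarity_search function to find content similar to the query
26 | 2. Pass the query and the retrieved content to the chain returned by langchain's load_qa_chain function to get an answer grounded in the knowledge base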
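
The two retrieval steps can be sketched without any external service. The code below is a rough illustration under stated assumptions: it ranks chunks by cosine similarity over plain word counts (the demo uses OpenAI embeddings and Pinecone), and `build_stuff_prompt` only mimics the idea behind the "stuff" chain type, which places all retrieved chunks into a single prompt. The prompt wording is invented here and is not langchain's template.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_stuff_prompt(query, context_chunks):
    """Concatenate the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "ChatGPT performs competitively with Google Translate on high-resource languages",
    "Pinecone stores vectors",
    "Jupyter kernels can be switched from the Kernel menu",
]
query = "does ChatGPT translate better than Google Translate"
best = top_k(query, chunks, k=1)
prompt = build_stuff_prompt(query, best)
print(prompt)
```

Because the whole context goes into one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window, which is one reason the notebook retrieves only three chunks.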
27 |
28 | ## 4. Compare with the result returned directly by OpenAI
29 | 1. Call OpenAI's native API directly and use its answer for comparison
30 |
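31 | # Environment Setup
32 | 1. A recent Python version: the demo was run on Python 3.11.3
33 | 2. jupyter notebook
34 | 3. pip install langchain pinecone-client
35 | 4. Install any other required libraries with pip as they turn out to be missing at run time; for example, the notebook also uses openai, tiktoken, and python-dotenv
36 | 5. OpenAI API key: a GPT-3.5 key is sufficient
37 | 6. A Pinecone vector database trial account (pinecone.io): API key, environment, and index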
38 |
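39 | # Pitfalls
40 | 1. The Python and numpy versions used by Jupyter Notebook should both be up to date; otherwise running the demo code may fail with the following error: ```ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'```
41 | 2. What if the latest Python is installed on the system but Jupyter uses an older one? In Jupyter's menu, go to Kernel > Change kernel and select the latest Python version.
42 | 3. What if switching the Python version in Jupyter fails? Start Jupyter first, then run ```$ python -m pip install ipykernel``` and ```$ python -m ipykernel install --user```, and then select the new Python version from the Kernel menu.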
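
A quick way to confirm which interpreter a kernel is actually running is to ask it from inside a notebook cell, as the demo does with `sys.version` and `sys.executable`. The helper below is a small sketch of that check; `describe_interpreter` is a hypothetical name, and the `(3, 11)` minimum is an assumption taken from the Python 3.11.3 version mentioned above.

```python
import sys


def describe_interpreter(minimum=(3, 11)):
    """Report which Python the kernel runs and whether it meets `minimum`."""
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "meets_minimum": sys.version_info[:2] >= minimum,
    }


print(describe_interpreter())
```

If `executable` points at a different installation than the one where the packages were installed, that mismatch is the likely cause of import errors such as the numpy one above.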
43 |
--------------------------------------------------------------------------------
/gpt-langchain-pdf-chat-demo.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "dd846391",
6 | "metadata": {},
7 | "source": [
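8 | "## 1. Read local PDFs and split them\n",
9 | "Generate code that does the following:\n",
10 | "1. Use langchain to read every PDF file in the local file_folder directory\n",
11 | "2. Use langchain to split the PDF text into small chunks"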
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 26,
17 | "id": "e8c82387",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "from IPython.display import clear_output\n",
22 | "clear_output(wait=True)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 28,
28 | "id": "4e636a39",
29 | "metadata": {
30 | "scrolled": true
31 | },
32 | "outputs": [
33 | {
34 | "name": "stdout",
35 | "output_type": "stream",
36 | "text": [
37 | "3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)]\n",
38 | "/opt/homebrew/opt/python@3.11/bin/python3.11\n"
39 | ]
40 | }
41 | ],
42 | "source": [
43 | "# Package installation\n",
44 | "\n",
45 | "# !pip3 install langchain\n",
46 | "# !pip install pinecone-client\n",
47 | "\n",
48 | "# !pip install --upgrade langchain\n",
49 | "# !pip install tiktoken\n",
50 | "\n",
51 | "import os\n",
52 | "# print(\"PYTHONPATH:\", os.environ.get('PYTHONPATH'))\n",
53 | "# print(\"PATH:\", os.environ.get('PATH'))\n",
54 | "\n",
55 | "import sys\n",
56 | "print(sys.version)\n",
57 | "print(sys.executable)"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 30,
63 | "id": "fe989f04",
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "import sys\n",
68 | "sys.path.append('/opt/homebrew/lib/python3.11/site-packages')\n",
69 | "# print(sys.path)\n",
70 | "\n",
71 | "# %ls -lrt ./files/\n",
72 | "# %pwd"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 32,
78 | "id": "636866e6",
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "\n",
86 | "page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-\\nformance significantly. As for the translation\\nrobustness, ChatGPT does not perform as well\\nas the commercial systems on biomedical ab-\\nstracts or Reddit comments but exhibits good\\nresults on spoken language. With the launch\\nof the GPT-4 engine, the translation perfor-\\nmance of ChatGPT is significantly boosted, be-\\ncoming comparable to commercial translation\\nproducts, even for distant languages. 
In other\\nwords, ChatGPT has already become a good\\ntranslator .\\n1 Introduction\\nChatGPT2is an intelligent chatting machine devel-\\noped by OpenAI upon the InstructGPT (Ouyang\\net al., 2022), which is trained to follow an instruc-\\ntion in a prompt and provide a detailed response.\\nAccording to the official statement, ChatGPT is\\nable to answer followup questions, admit its mis-\\ntakes, challenge incorrect premises, and reject in-\\nappropriate requests due to the dialogue format.\\n\\x03Correspondence: joelwxjiao@tencent.com\\n1Scripts and data: https://github.com/wxjiao/\\nIs-ChatGPT-A-Good-Translator\\n2https://chat.openai.com\\nFigure 1: Prompts advised by ChatGPT for machine\\ntranslation (Date: 2022.12.16).\\nIt integrates various abilities of natural language\\nprocessing, including question answering, story-\\ntelling, logic reasoning, code debugging, machine\\ntranslation, and so on. We are particularly inter-\\nested in how ChatGPT performs for machine trans-\\nlation tasks, especially the gap between ChatGPT\\nand commercial translation products (e.g., Google\\nTranslate, DeepL Translate).\\nIn this report, we provide a preliminary study of\\nChatGPT on machine translation to gain a better\\nunderstanding of it. Specifically, we focus on three\\naspects:\\n•Translation Prompt : ChatGPT is essentially a\\nlarge language model, which needs prompts as\\nguidance to trigger its translation ability. The\\nstyle of prompts may affect the quality of trans-\\nlation outputs. 
For example, how to mention\\nthe source or target language information mat-\\nters in multilingual machine translation models,\\nwhich is usually solved by attaching language\\ntokens (Johnson et al., 2017; Fan et al., 2021).\\n•Multilingual Translation : ChatGPT is a single\\nmodel handling various NLP tasks and covering\\ndifferent languages, which can be considered a\\nunified multilingual machine translation model.\\nThus, we are curious about how ChatGPT per-\\nforms on different language pairs consideringarXiv:2301.08745v3 [cs.CL] 19 Mar 2023' metadata={'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf', 'page': 0}\n",
87 | "8\n"
88 | ]
89 | }
90 | ],
91 | "source": [
92 | "from langchain.document_loaders import PyPDFDirectoryLoader\n",
93 | "\n",
94 | "# Use PyPDFDirectoryLoader to read every PDF file in the local folder\n",
95 | "file_folder='./files'\n",
96 | "loader = PyPDFDirectoryLoader(file_folder)\n",
97 | "# docs is a list\n",
98 | "docs = loader.load()\n",
99 | "\n",
100 | "print(type(docs[0]))\n",
101 | "print(docs[0])\n",
102 | "print(len(docs))"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 33,
108 | "id": "f0ad8e0e",
109 | "metadata": {},
110 | "outputs": [
111 | {
112 | "data": {
113 | "text/plain": [
114 | "Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-', metadata={'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf', 'page': 0})"
115 | ]
116 | },
117 | "execution_count": 33,
118 | "metadata": {},
119 | "output_type": "execute_result"
120 | }
121 | ],
122 | "source": [
123 | "# Use langchain to split the PDF text into small documents\n",
124 | "\n",
125 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
126 | "text_splitter = RecursiveCharacterTextSplitter(\n",
127 | "    chunk_size = 1000,\n",
128 | "    chunk_overlap = 100,\n",
129 | "    length_function = len,\n",
130 | ")\n",
131 | "docs = text_splitter.split_documents(docs)\n",
132 | "docs[0]"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "id": "60269828",
138 | "metadata": {},
139 | "source": [
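140 | "## 2. Embed the content and store it in a vector database\n",
141 | "1. Convert the documents into vectors through OpenAI's embedding endpoint\n",
142 | "2. Store the resulting vectors in the Pinecone vector database"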
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 20,
148 | "id": "bf029a89",
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# API configuration for OpenAI and Pinecone\n",
153 | "import os\n",
154 | "import getpass\n",
155 | "\n",
156 | "# PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')\n",
157 | "# PINECONE_ENV = getpass.getpass('Pinecone Environment:')\n",
158 | "# os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n",
159 | "\n",
160 | "PINECONE_API_KEY='xx'\n",
161 | "PINECONE_ENV='xx'\n",
162 | "os.environ['OPENAI_API_KEY']='xx'\n",
163 | "PINECONE_INDEX='xx'"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 34,
169 | "id": "e4a77944",
170 | "metadata": {
171 | "scrolled": false
172 | },
173 | "outputs": [],
174 | "source": [
175 | "# Convert the documents into vectors with OpenAI's embedding endpoint and store them in Pinecone\n",
176 | "from langchain.embeddings.openai import OpenAIEmbeddings\n",
177 | "from langchain.text_splitter import CharacterTextSplitter\n",
178 | "from langchain.vectorstores import Pinecone\n",
179 | "from langchain.document_loaders import TextLoader\n",
180 | "import pinecone \n",
181 | "\n",
182 | "embeddings = OpenAIEmbeddings()\n",
183 | "\n",
184 | "# initialize pinecone\n",
185 | "pinecone.init(\n",
186 | "    api_key=PINECONE_API_KEY, # find at app.pinecone.io\n",
187 | "    environment=PINECONE_ENV # next to api key in console\n",
188 | ")\n",
189 | "\n",
190 | "index_name = PINECONE_INDEX\n",
191 | "\n",
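192 | "# Run this on the first import: the index only needs to be populated once\n",
193 | "# On later runs there is no need to import again, so this line can be commented out\n",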
194 | "docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "id": "3ac3bb11",
200 | "metadata": {},
201 | "source": [
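202 | "## 3. Search the vector database for content similar to the query and pass it to GPT to answer\n",
203 | "\n",
204 | "1. Use the similarity_search function to find content similar to the query\n",
205 | "2. Pass the query and the retrieved content to the chain returned by langchain's load_qa_chain function to get an answer grounded in the knowledge base"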
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 35,
211 | "id": "8166a55e",
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "[Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-', metadata={'page': 0.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'}), Document(page_content='resource pivot language before into the target\\nlanguage, which improves the translation per-\\nformance significantly. As for the translation\\nrobustness, ChatGPT does not perform as well\\nas the commercial systems on biomedical ab-\\nstracts or Reddit comments but exhibits good\\nresults on spoken language. With the launch\\nof the GPT-4 engine, the translation perfor-\\nmance of ChatGPT is significantly boosted, be-\\ncoming comparable to commercial translation\\nproducts, even for distant languages. 
In other\\nwords, ChatGPT has already become a good\\ntranslator .\\n1 Introduction\\nChatGPT2is an intelligent chatting machine devel-\\noped by OpenAI upon the InstructGPT (Ouyang\\net al., 2022), which is trained to follow an instruc-\\ntion in a prompt and provide a detailed response.\\nAccording to the official statement, ChatGPT is\\nable to answer followup questions, admit its mis-\\ntakes, challenge incorrect premises, and reject in-\\nappropriate requests due to the dialogue format.', metadata={'page': 0.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'}), Document(page_content='out-of-distribution data. However, these may not\\nbe done in ChatGPT.\\nAn interesting finding is that ChatGPT outper-\\nforms Google Translate and DeepL Translate sig-\\nnificantly on WMT20 Rob3 test set that contains\\na crowdsourced speech recognition corpus. It sug-\\ngests that ChatGPT, which is essentially an artifi-\\ncial intelligent chatting machine, is capable of gen-\\nerating more natural spoken languages than these\\ncommercial translation systems. We provide some\\nexamples in Table 8.\\n3 Conclusion\\nWe present a preliminary study of ChatGPT for\\nmachine translation, including translation prompt,\\nmultilingual translation, and translation robustness.\\nBy evaluating on a number of benchmark test\\nsets, we find that ChatGPT performs competitively\\nwith commercial translation products (e.g., Google\\nTranslate) on high-resource European languages\\nbut lags behind significantly on low-resource or\\ndistant languages. For distant languages, we ex-', metadata={'page': 5.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'})]\n"
219 | ]
220 | }
221 | ],
222 | "source": [
223 | "# Query the vector database for similar documents\n",
224 | "# if you already have an index, you can load it like this\n",
225 | "docsearch = Pinecone.from_existing_index(index_name, embeddings)\n",
226 | "\n",
227 | "query = \"does chatgpt translates better than google translation?\"\n",
228 | "docs = docsearch.similarity_search(query, 3)\n",
229 | "print(docs)"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 36,
235 | "id": "f793bc54",
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "from langchain.llms import OpenAI\n",
240 | "# Initialize the OpenAI LLM that the question-answering chain will use\n",
241 | "llm = OpenAI(openai_api_key=os.environ['OPENAI_API_KEY'], temperature=0)\n"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 37,
247 | "id": "ebb34dd0",
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/plain": [
253 | "' Yes, ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages.'"
254 | ]
255 | },
256 | "execution_count": 37,
257 | "metadata": {},
258 | "output_type": "execute_result"
259 | }
260 | ],
261 | "source": [
262 | "from langchain.chains.question_answering import load_qa_chain\n",
263 | "chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
264 | "chain.run(input_documents=docs, question=query)"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "id": "d75386b1",
270 | "metadata": {},
271 | "source": [
272 | "## 4. Compare with the result returned directly by OpenAI"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 3,
278 | "id": "efe6c0e8",
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "import openai\n",
283 | "import os\n",
284 | "\n",
285 | "from dotenv import load_dotenv, find_dotenv\n",
286 | "_ = load_dotenv(find_dotenv())\n",
287 | "\n",
288 | "openai.api_key = os.getenv('OPENAI_API_KEY')\n",
289 | "\n",
290 | "def get_completion(prompt, model=\"gpt-3.5-turbo\"):\n",
291 | "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
292 | "    response = openai.ChatCompletion.create(\n",
293 | "        model=model,\n",
294 | "        messages=messages,\n",
295 | "        temperature=0, # this is the degree of randomness of the model's output\n",
296 | "    )\n",
297 | "    return response.choices[0].message[\"content\"]"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 5,
303 | "id": "0105eebd",
304 | "metadata": {},
305 | "outputs": [
306 | {
307 | "name": "stdout",
308 | "output_type": "stream",
309 | "text": [
310 | "As an AI language model, I do not have personal preferences or opinions. However, both ChatGPT and Google Translation use different algorithms and techniques to translate text from one language to another. The accuracy of the translation depends on various factors such as the complexity of the text, the language pair, and the context. Both ChatGPT and Google Translation have their strengths and limitations, and the quality of the translation may vary depending on the specific use case. It is recommended to compare the translations from both tools and choose the one that best suits your needs.\n"
311 | ]
312 | }
313 | ],
314 | "source": [
315 | "prompt = \"does chatgpt translates better than google translation?\"\n",
316 | "response = get_completion(prompt)\n",
317 | "print(response)"
318 | ]
319 | }
320 | ],
321 | "metadata": {
322 | "kernelspec": {
323 | "display_name": "Python 3.11.3 (myenv)",
324 | "language": "python",
325 | "name": "myenv"
326 | },
327 | "language_info": {
328 | "codemirror_mode": {
329 | "name": "ipython",
330 | "version": 3
331 | },
332 | "file_extension": ".py",
333 | "mimetype": "text/x-python",
334 | "name": "python",
335 | "nbconvert_exporter": "python",
336 | "pygments_lexer": "ipython3",
337 | "version": "3.11.3"
338 | }
339 | },
340 | "nbformat": 4,
341 | "nbformat_minor": 5
342 | }
343 |
--------------------------------------------------------------------------------
/.ipynb_checkpoints/gpt-langchain-pdf-chat-demo-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "dd846391",
6 | "metadata": {},
7 | "source": [
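8 | "## 1. Read local PDFs and split them\n",
9 | "Generate code that does the following:\n",
10 | "1. Use langchain to read every PDF file in the local file_folder directory\n",
11 | "2. Use langchain to split the PDF text into small chunks"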
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 26,
17 | "id": "e8c82387",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "from IPython.display import clear_output\n",
22 | "clear_output(wait=True)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 28,
28 | "id": "4e636a39",
29 | "metadata": {
30 | "scrolled": true
31 | },
32 | "outputs": [
33 | {
34 | "name": "stdout",
35 | "output_type": "stream",
36 | "text": [
37 | "3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)]\n",
38 | "/opt/homebrew/opt/python@3.11/bin/python3.11\n"
39 | ]
40 | }
41 | ],
42 | "source": [
43 | "# Package installation\n",
44 | "\n",
45 | "# !pip3 install langchain\n",
46 | "# !pip install pinecone-client\n",
47 | "\n",
48 | "# !pip install --upgrade langchain\n",
49 | "# !pip install tiktoken\n",
50 | "\n",
51 | "import os\n",
52 | "# print(\"PYTHONPATH:\", os.environ.get('PYTHONPATH'))\n",
53 | "# print(\"PATH:\", os.environ.get('PATH'))\n",
54 | "\n",
55 | "import sys\n",
56 | "print(sys.version)\n",
57 | "print(sys.executable)"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 30,
63 | "id": "fe989f04",
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "import sys\n",
68 | "sys.path.append('/opt/homebrew/lib/python3.11/site-packages')\n",
69 | "# print(sys.path)\n",
70 | "\n",
71 | "# %ls -lrt ./files/\n",
72 | "# %pwd"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 32,
78 | "id": "636866e6",
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "\n",
86 | "page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-\\nformance significantly. As for the translation\\nrobustness, ChatGPT does not perform as well\\nas the commercial systems on biomedical ab-\\nstracts or Reddit comments but exhibits good\\nresults on spoken language. With the launch\\nof the GPT-4 engine, the translation perfor-\\nmance of ChatGPT is significantly boosted, be-\\ncoming comparable to commercial translation\\nproducts, even for distant languages. 
In other\\nwords, ChatGPT has already become a good\\ntranslator .\\n1 Introduction\\nChatGPT2is an intelligent chatting machine devel-\\noped by OpenAI upon the InstructGPT (Ouyang\\net al., 2022), which is trained to follow an instruc-\\ntion in a prompt and provide a detailed response.\\nAccording to the official statement, ChatGPT is\\nable to answer followup questions, admit its mis-\\ntakes, challenge incorrect premises, and reject in-\\nappropriate requests due to the dialogue format.\\n\\x03Correspondence: joelwxjiao@tencent.com\\n1Scripts and data: https://github.com/wxjiao/\\nIs-ChatGPT-A-Good-Translator\\n2https://chat.openai.com\\nFigure 1: Prompts advised by ChatGPT for machine\\ntranslation (Date: 2022.12.16).\\nIt integrates various abilities of natural language\\nprocessing, including question answering, story-\\ntelling, logic reasoning, code debugging, machine\\ntranslation, and so on. We are particularly inter-\\nested in how ChatGPT performs for machine trans-\\nlation tasks, especially the gap between ChatGPT\\nand commercial translation products (e.g., Google\\nTranslate, DeepL Translate).\\nIn this report, we provide a preliminary study of\\nChatGPT on machine translation to gain a better\\nunderstanding of it. Specifically, we focus on three\\naspects:\\n•Translation Prompt : ChatGPT is essentially a\\nlarge language model, which needs prompts as\\nguidance to trigger its translation ability. The\\nstyle of prompts may affect the quality of trans-\\nlation outputs. 
For example, how to mention\\nthe source or target language information mat-\\nters in multilingual machine translation models,\\nwhich is usually solved by attaching language\\ntokens (Johnson et al., 2017; Fan et al., 2021).\\n•Multilingual Translation : ChatGPT is a single\\nmodel handling various NLP tasks and covering\\ndifferent languages, which can be considered a\\nunified multilingual machine translation model.\\nThus, we are curious about how ChatGPT per-\\nforms on different language pairs consideringarXiv:2301.08745v3 [cs.CL] 19 Mar 2023' metadata={'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf', 'page': 0}\n",
87 | "8\n"
88 | ]
89 | }
90 | ],
91 | "source": [
92 | "from langchain.document_loaders import PyPDFDirectoryLoader\n",
93 | "\n",
94 | "# Use PyPDFDirectoryLoader to read every PDF file in the local folder\n",
95 | "file_folder='./files'\n",
96 | "loader = PyPDFDirectoryLoader(file_folder)\n",
97 | "# docs is a list\n",
98 | "docs = loader.load()\n",
99 | "\n",
100 | "print(type(docs[0]))\n",
101 | "print(docs[0])\n",
102 | "print(len(docs))"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 33,
108 | "id": "f0ad8e0e",
109 | "metadata": {},
110 | "outputs": [
111 | {
112 | "data": {
113 | "text/plain": [
114 | "Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-', metadata={'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf', 'page': 0})"
115 | ]
116 | },
117 | "execution_count": 33,
118 | "metadata": {},
119 | "output_type": "execute_result"
120 | }
121 | ],
122 | "source": [
123 | "# Use langchain to split the PDF text into small documents\n",
124 | "\n",
125 | "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
126 | "text_splitter = RecursiveCharacterTextSplitter(\n",
127 | "    chunk_size = 1000,\n",
128 | "    chunk_overlap = 100,\n",
129 | "    length_function = len,\n",
130 | ")\n",
131 | "docs = text_splitter.split_documents(docs)\n",
132 | "docs[0]"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "id": "60269828",
138 | "metadata": {},
139 | "source": [
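140 | "## 2. Embed the content and store it in a vector database\n",
141 | "1. Convert the documents into vectors through OpenAI's embedding endpoint\n",
142 | "2. Store the resulting vectors in the Pinecone vector database"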
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": 20,
148 | "id": "bf029a89",
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# API configuration for OpenAI and Pinecone\n",
153 | "import os\n",
154 | "import getpass\n",
155 | "\n",
156 | "# PINECONE_API_KEY = getpass.getpass('Pinecone API Key:')\n",
157 | "# PINECONE_ENV = getpass.getpass('Pinecone Environment:')\n",
158 | "# os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n",
159 | "\n",
160 | "PINECONE_API_KEY='xx'\n",
161 | "PINECONE_ENV='xx'\n",
162 | "os.environ['OPENAI_API_KEY']='xx'\n",
163 | "PINECONE_INDEX='xx'"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 34,
169 | "id": "e4a77944",
170 | "metadata": {
171 | "scrolled": false
172 | },
173 | "outputs": [],
174 | "source": [
175 | "# Convert the documents into vectors with OpenAI's embedding endpoint and store them in Pinecone\n",
176 | "from langchain.embeddings.openai import OpenAIEmbeddings\n",
177 | "from langchain.text_splitter import CharacterTextSplitter\n",
178 | "from langchain.vectorstores import Pinecone\n",
179 | "from langchain.document_loaders import TextLoader\n",
180 | "import pinecone \n",
181 | "\n",
182 | "embeddings = OpenAIEmbeddings()\n",
183 | "\n",
184 | "# initialize pinecone\n",
185 | "pinecone.init(\n",
186 | "    api_key=PINECONE_API_KEY, # find at app.pinecone.io\n",
187 | "    environment=PINECONE_ENV # next to api key in console\n",
188 | ")\n",
189 | "\n",
190 | "index_name = PINECONE_INDEX\n",
191 | "\n",
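192 | "# Run this on the first import: the index only needs to be populated once\n",
193 | "# On later runs there is no need to import again, so this line can be commented out\n",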
194 | "docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "id": "3ac3bb11",
200 | "metadata": {},
201 | "source": [
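202 | "## 3. Search the vector database for content similar to the query and pass it to GPT to answer\n",
203 | "\n",
204 | "1. Use the similarity_search function to find content similar to the query\n",
205 | "2. Pass the query and the retrieved content to the chain returned by langchain's load_qa_chain function to get an answer grounded in the knowledge base"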
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 35,
211 | "id": "8166a55e",
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "[Document(page_content='Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine\\nWenxiang Jiao\\x03Wenxuan Wang Jen-tse Huang Xing Wang Zhaopeng Tu\\nTencent AI Lab\\nAbstract\\nThis report provides a preliminary evaluation\\nof ChatGPT for machine translation, includ-\\ning translation prompt, multilingual transla-\\ntion, and translation robustness. We adopt\\nthe prompts advised by ChatGPT to trigger\\nits translation ability and find that the candi-\\ndate prompts generally work well and show\\nminor performance differences. By evalu-\\nating on a number of benchmark test sets1,\\nwe find that ChatGPT performs competitively\\nwith commercial translation products (e.g.,\\nGoogle Translate) on high-resource European\\nlanguages but lags behind significantly on low-\\nresource or distant languages. For distant\\nlanguages, we explore an interesting strategy\\nnamed pivot prompting that asks ChatGPT\\nto translate the source sentence into a high-\\nresource pivot language before into the target\\nlanguage, which improves the translation per-', metadata={'page': 0.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'}), Document(page_content='resource pivot language before into the target\\nlanguage, which improves the translation per-\\nformance significantly. As for the translation\\nrobustness, ChatGPT does not perform as well\\nas the commercial systems on biomedical ab-\\nstracts or Reddit comments but exhibits good\\nresults on spoken language. With the launch\\nof the GPT-4 engine, the translation perfor-\\nmance of ChatGPT is significantly boosted, be-\\ncoming comparable to commercial translation\\nproducts, even for distant languages. 
In other\\nwords, ChatGPT has already become a good\\ntranslator .\\n1 Introduction\\nChatGPT2is an intelligent chatting machine devel-\\noped by OpenAI upon the InstructGPT (Ouyang\\net al., 2022), which is trained to follow an instruc-\\ntion in a prompt and provide a detailed response.\\nAccording to the official statement, ChatGPT is\\nable to answer followup questions, admit its mis-\\ntakes, challenge incorrect premises, and reject in-\\nappropriate requests due to the dialogue format.', metadata={'page': 0.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'}), Document(page_content='out-of-distribution data. However, these may not\\nbe done in ChatGPT.\\nAn interesting finding is that ChatGPT outper-\\nforms Google Translate and DeepL Translate sig-\\nnificantly on WMT20 Rob3 test set that contains\\na crowdsourced speech recognition corpus. It sug-\\ngests that ChatGPT, which is essentially an artifi-\\ncial intelligent chatting machine, is capable of gen-\\nerating more natural spoken languages than these\\ncommercial translation systems. We provide some\\nexamples in Table 8.\\n3 Conclusion\\nWe present a preliminary study of ChatGPT for\\nmachine translation, including translation prompt,\\nmultilingual translation, and translation robustness.\\nBy evaluating on a number of benchmark test\\nsets, we find that ChatGPT performs competitively\\nwith commercial translation products (e.g., Google\\nTranslate) on high-resource European languages\\nbut lags behind significantly on low-resource or\\ndistant languages. For distant languages, we ex-', metadata={'page': 5.0, 'source': 'files/Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine.pdf'})]\n"
219 | ]
220 | }
221 | ],
222 | "source": [
223 | "# Query the vector database for similar documents\n",
224 | "# if you already have an index, you can load it like this\n",
225 | "docsearch = Pinecone.from_existing_index(index_name, embeddings)\n",
226 | "\n",
227 | "query = \"does chatgpt translates better than google translation?\"\n",
228 | "docs = docsearch.similarity_search(query, 3)\n",
229 | "print(docs)"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 36,
235 | "id": "f793bc54",
236 | "metadata": {},
237 | "outputs": [],
238 | "source": [
239 | "from langchain.llms import OpenAI\n",
240 | "# Initialize the OpenAI LLM that the question-answering chain will use\n",
241 | "llm = OpenAI(openai_api_key=os.environ['OPENAI_API_KEY'], temperature=0)\n"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 37,
247 | "id": "ebb34dd0",
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/plain": [
253 | "' Yes, ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages.'"
254 | ]
255 | },
256 | "execution_count": 37,
257 | "metadata": {},
258 | "output_type": "execute_result"
259 | }
260 | ],
261 | "source": [
262 | "from langchain.chains.question_answering import load_qa_chain\n",
263 | "chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
264 | "chain.run(input_documents=docs, question=query)"
265 | ]
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "id": "d75386b1",
270 | "metadata": {},
271 | "source": [
272 | "## 4. Comparison: the answer OpenAI returns directly, without retrieval"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 3,
278 | "id": "efe6c0e8",
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "import openai\n",
283 | "import os\n",
284 | "\n",
285 | "from dotenv import load_dotenv, find_dotenv\n",
286 | "_ = load_dotenv(find_dotenv())\n",
287 | "\n",
288 | "openai.api_key = os.getenv('OPENAI_API_KEY')\n",
289 | "\n",
290 | "def get_completion(prompt, model=\"gpt-3.5-turbo\"):\n",
291 | " messages = [{\"role\": \"user\", \"content\": prompt}]\n",
292 | " response = openai.ChatCompletion.create(\n",
293 | " model=model,\n",
294 | " messages=messages,\n",
295 | " temperature=0, # this is the degree of randomness of the model's output\n",
296 | " )\n",
297 | " return response.choices[0].message[\"content\"]"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 5,
303 | "id": "0105eebd",
304 | "metadata": {},
305 | "outputs": [
306 | {
307 | "name": "stdout",
308 | "output_type": "stream",
309 | "text": [
310 | "As an AI language model, I do not have personal preferences or opinions. However, both ChatGPT and Google Translation use different algorithms and techniques to translate text from one language to another. The accuracy of the translation depends on various factors such as the complexity of the text, the language pair, and the context. Both ChatGPT and Google Translation have their strengths and limitations, and the quality of the translation may vary depending on the specific use case. It is recommended to compare the translations from both tools and choose the one that best suits your needs.\n"
311 | ]
312 | }
313 | ],
314 | "source": [
315 | "prompt = \"does chatgpt translates better than google translation?\"\n",
316 | "response = get_completion(prompt)\n",
317 | "print(response)"
318 | ]
319 | }
320 | ],
321 | "metadata": {
322 | "kernelspec": {
323 | "display_name": "Python 3.11.3 (myenv)",
324 | "language": "python",
325 | "name": "myenv"
326 | },
327 | "language_info": {
328 | "codemirror_mode": {
329 | "name": "ipython",
330 | "version": 3
331 | },
332 | "file_extension": ".py",
333 | "mimetype": "text/x-python",
334 | "name": "python",
335 | "nbconvert_exporter": "python",
336 | "pygments_lexer": "ipython3",
337 | "version": "3.11.3"
338 | }
339 | },
340 | "nbformat": 4,
341 | "nbformat_minor": 5
342 | }
343 |
--------------------------------------------------------------------------------
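The retrieve-then-answer flow in section 3 of the notebook above can be sketched without any external service. This is a minimal illustration, not the notebook's implementation: the notebook uses OpenAI embeddings, Pinecone, and LangChain's `load_qa_chain` with `chain_type="stuff"`, whereas the sketch below swaps in a toy word-count embedding and an in-memory list. The names `embed`, `cosine`, `similarity_search`, and `build_stuff_prompt`, the prompt wording, and the sample chunks are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_search(chunks, query, k=3):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]


def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into a single prompt (the "stuff" strategy)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical sample chunks standing in for the indexed PDF pages.
chunks = [
    "ChatGPT performs competitively with Google Translate on high-resource European languages.",
    "Pinecone is a managed vector database.",
    "ChatGPT lags behind on low-resource or distant languages.",
]
query = "does chatgpt translate better than google translate?"

top = similarity_search(chunks, query, k=2)
prompt = build_stuff_prompt(top, query)
print(prompt)
```

The printed prompt is what would be sent to the language model in a "stuff"-style chain: every retrieved chunk is placed in one request, which is simple but only works while the combined chunks fit in the model's context window.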