├── .gitignore ├── .nojekyll ├── 00~导论 └── README.md ├── 99~参考资料 ├── 2023~Numbers every LLM Developer should know.md ├── 2023~吴恩达~《Building Systems with the ChatGPT API》 │ ├── 1.Introduction.md │ ├── 10.Evaluation-part2.ipynb │ ├── 11.conclusion.md │ ├── 2.Language Models, the Chat Format and Tokens.ipynb │ ├── 3.Classification.ipynb │ ├── 4.Moderation.ipynb │ ├── 5.Chain of Thought Reasoning.ipynb │ ├── 6.Chaining Prompts.ipynb │ ├── 7.Check Outputs.ipynb │ ├── 8.Evaluation.ipynb │ ├── 9.Evaluation-part1.ipynb │ ├── products.json │ ├── readme.md │ ├── utils_en.py │ └── utils_zh.py ├── 2023~吴恩达~《ChatGPT Prompt Engineering for Developers》 │ ├── 00.README.md │ ├── 01. 简介.md │ ├── 02. 提示原则 Guidelines.ipynb │ ├── 03. 迭代优化 Iterative.ipynb │ ├── 04. 文本概括 Summarizing.ipynb │ ├── 05. 推断 Inferring.ipynb │ ├── 06. 文本转换 Transforming.ipynb │ ├── 07. 文本扩展 Expanding.ipynb │ ├── 08. 聊天机器人 Chatbot.ipynb │ └── 09. 总结.md ├── 2023~吴恩达~《LangChain for LLM Application Development》 │ ├── 1.开篇介绍.md │ ├── 2.模型、提示和解析器.ipynb │ ├── 3.存储.ipynb │ ├── 4.模型链.ipynb │ ├── 5.文档问答.ipynb │ ├── 6.评估.ipynb │ ├── 7.代理.ipynb │ ├── 8.课程总结.md │ ├── Data.csv │ ├── OutdoorClothingCatalog_1000.csv │ └── readme.md └── 2023~陆奇~我的大模型世界观.md ├── INTRODUCTION.md ├── LICENSE ├── LLM └── README.link ├── README.md ├── _sidebar.md ├── header.svg ├── index.html ├── 循环神经网络 └── README.md ├── 经典自然语言 ├── 主题模型 │ └── LDA.md ├── 统计语言模型 │ ├── Word2Vec.md │ ├── 基础文本处理.md │ ├── 统计语言模型.md │ └── 词表示.md ├── 词嵌入 │ ├── 99~参考资料 │ │ └── 2023~Embeddings: What they are and why they matter.md │ ├── 概述.md │ └── 词向量 │ │ └── 基于 Gensim 的 Word2Vec 实践.md └── 语法语义分析 │ ├── 1_nlp_basics_tokenization_segmentation.ipynb.txt │ └── 命名实体识别.md └── 行业应用 ├── 机器人问答 └── README.md └── 聊天对话 └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Ignore all 2 | * 3 | 4 | # Unignore all with extensions 5 | !*.* 6 | 7 | # Unignore all dirs 8 | !*/ 9 | 10 | .DS_Store 11 | 12 | # Logs 13 | logs 14 | *.log 15 
| npm-debug.log* 16 | yarn-debug.log* 17 | yarn-error.log* 18 | 19 | # Runtime data 20 | pids 21 | *.pid 22 | *.seed 23 | *.pid.lock 24 | 25 | # Directory for instrumented libs generated by jscoverage/JSCover 26 | lib-cov 27 | 28 | # Coverage directory used by tools like istanbul 29 | coverage 30 | 31 | # nyc test coverage 32 | .nyc_output 33 | 34 | # Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) 35 | .grunt 36 | 37 | # Bower dependency directory (https://bower.io/) 38 | bower_components 39 | 40 | # node-waf configuration 41 | .lock-wscript 42 | 43 | # Compiled binary addons (https://nodejs.org/api/addons.html) 44 | build/Release 45 | 46 | # Dependency directories 47 | node_modules/ 48 | jspm_packages/ 49 | 50 | # TypeScript v1 declaration files 51 | typings/ 52 | 53 | # Optional npm cache directory 54 | .npm 55 | 56 | # Optional eslint cache 57 | .eslintcache 58 | 59 | # Optional REPL history 60 | .node_repl_history 61 | 62 | # Output of 'npm pack' 63 | *.tgz 64 | 65 | # Yarn Integrity file 66 | .yarn-integrity 67 | 68 | # dotenv environment variables file 69 | .env 70 | 71 | # next.js build output 72 | .next 73 | -------------------------------------------------------------------------------- /.nojekyll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wx-chevalier/NLP-Notes/11ac6ed37b1c1e001b8d2139d218629a717eb625/.nojekyll -------------------------------------------------------------------------------- /00~导论/README.md: -------------------------------------------------------------------------------- 1 | # General NLP Techniques 2 | 3 | ## Text Generation 4 | 5 | Text generation is the technology of using computers to produce text the way a human would. It can be divided into text-to-text, image-to-text, and data-to-text generation, among others. Its applications include machine translation, QA, text summarization, text rewriting, news reporting (sports, weather, finance, medical, etc.), and automatic report generation. 6 | 7 | With the application of deep learning and related techniques, text generation has advanced quickly in recent years. In particular, the seq2seq architecture, which originated in machine translation, is now widely used across text generation tasks. In practice, though, many problems remain, such as limited novelty, lack of fluency, and weak coherence between sentences. Text generation is hard because human language can express the same thing in many different ways, so there is no fixed standard for the quality of generated text, which makes model performance difficult to evaluate. Balancing output quality against diversity is also hard. 8 
| 9 | ## Sentiment Analysis 10 | 11 | Text sentiment analysis, also known as opinion mining, is an important research direction in natural language processing. It is widely studied and applied in both industry and academia, and a large number of papers on it appear every year at top international conferences (for example ACL, EMNLP, IJCAI, AAAI, and WWW). 12 | 13 | Put simply, it is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional tone. Unlike objective text, subjective text contains the user's personal thoughts or attitudes. It is the result of a group of users analyzing and evaluating a product or event from different angles, with different needs, and based on their own experience. These evaluations are subjective and diverse, which is what makes sentiment analysis meaningful and valuable. 14 | -------------------------------------------------------------------------------- /99~参考资料/2023~Numbers every LLM Developer should know.md: -------------------------------------------------------------------------------- 1 | 21 | 22 | # Numbers every LLM Developer should know 23 | 24 | When I was at Google, there was a document put together by [Jeff Dean](https://en.wikipedia.org/wiki/Jeff_Dean), the legendary engineer, called [Numbers every Engineer should know](http://brenocon.com/dean_perf.html). It’s really useful to have a similar set of numbers for LLM developers to use in back-of-the-envelope calculations. Here we share particular numbers we at Anyscale use, why each number is important and how to use it to your advantage. 25 | 26 | ## Notes on the Github version 27 | 28 | Last updated: 2023-05-17 29 | 30 | If you feel there's an issue with the accuracy of the numbers, please file an issue. Think there are more numbers that should be in this doc? Let us know or file a PR. 31 | 32 | We are thinking the next thing we should add here is some stats on tokens per second of different models. 33 | 34 | ## Prompts 35 | 36 | ### 40-90%[^1]: Amount saved by appending “Be Concise” to your prompt 37 | 38 | It’s important to remember that you pay by the token for responses. This means that asking an LLM to be concise can save you a lot of money. This can be broadened beyond simply appending “be concise” to your prompt: if you are using GPT-4 to come up with 10 alternatives, maybe ask it for 5 and keep the other half of the money. 39 | 40 | ### 1.3: Average tokens per word 41 | 42 | LLMs operate on tokens. 
Tokens are words or sub-parts of words, so “eating” might be broken into two tokens “eat” and “ing”. A 750 word document in English will be about 1000 tokens. For languages other than English, the number of tokens per word increases, depending on how common the language is in the LLM's embedding corpus. 43 | 44 | Knowing this ratio is important because most billing is done in tokens, and the LLM’s context window size is also defined in tokens. 45 | 46 | ## Prices[^2] 47 | 48 | Prices are of course subject to change, but given how expensive LLMs are to operate, the numbers in this section are critical. We use OpenAI for the numbers here, but prices from other providers you should check out ([Anthropic](https://cdn2.assets-servd.host/anthropic-website/production/images/model_pricing_may2023.pdf), [Cohere](https://cohere.com/pricing)) are in the same ballpark. 49 | 50 | ### ~50: Cost Ratio of GPT-4 to GPT-3.5 Turbo[^3] 51 | 52 | What this means is that for many practical applications, it’s much better to use GPT-4 for things like generation and then use that data to fine tune a smaller model. It is roughly 50 times cheaper to use GPT-3.5-Turbo than GPT-4 (the “roughly” is because GPT-4 charges differently for the prompt and the generated output) – so you really need to check on how far you can get with GPT-3.5-Turbo. GPT-3.5-Turbo is more than enough for tasks like summarization, for example. 53 | 54 | ### 5: Cost Ratio of generation of text using GPT-3.5-Turbo vs OpenAI embedding 55 | 56 | This means it is way cheaper to look something up in a vector store than to ask an LLM to generate it. E.g. “What is the capital of Delaware?” when looked up in a neural information retrieval system costs about 5x[^4] less than if you asked GPT-3.5-Turbo. The cost difference compared to GPT-4 is a whopping 250x! 57 | 58 | ### 10: Cost Ratio of OpenAI embedding to Self-Hosted embedding 59 | 60 | > Note: this number is sensitive to load and embedding batch size, so please consider this approximate. 
61 | 62 | In our blog post, we noted that using a g4dn.4xlarge (on-demand price: $1.20/hr) we were able to embed at about 9000 tokens per second using HuggingFace’s SentenceTransformers (which are pretty much as good as OpenAI’s embeddings). Doing some basic math with that rate and that node type indicates it is considerably cheaper (a factor of 10 cheaper) to self-host embeddings (and that is before you start to think about things like ingress and egress fees). 63 | 64 | ### 6: Cost Ratio of OpenAI base vs fine tuned model queries 65 | 66 | It costs you 6 times as much to serve a fine tuned model as it does the base model on OpenAI. This is pretty exorbitant, but might make sense because of the possible multi-tenancy of base models. It also means it is far more cost effective to tweak the prompt for a base model than to fine tune a customized model. 67 | 68 | ### 1: Cost Ratio of Self-Hosted base vs fine-tuned model queries 69 | 70 | If you’re self hosting a model, then it more or less costs the same amount to serve a fine tuned model as it does to serve a base one: the models have the same number of parameters. 71 | 72 | ## Training and Fine Tuning 73 | 74 | ### ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens 75 | 76 | The [LLaMa paper](https://arxiv.org/abs/2302.13971) mentions it took them 21 days to train LLaMa using 2048 A100 80GB GPUs. We considered training our own model on the Red Pajama training set, then we ran the numbers. The above is assuming everything goes right, nothing crashes, and the calculation succeeds the first time, etc. Plus it involves the coordination of 2048 GPUs. That’s not something most companies can do (shameless plug time: of course, we at Anyscale can – that’s our [bread and butter](https://www.anyscale.com/blog/training-175b-parameter-language-models-at-1000-gpu-scale-with-alpa-and-ray)! Contact us if you’d like to learn more). 
The point is that training your own LLM is possible, but it’s not cheap. And it will literally take days to complete each run. Much cheaper to use a pre-trained model. 77 | 78 | ### < 0.001: Cost ratio of fine tuning vs training from scratch 79 | 80 | This is a bit of a generalization, but the cost of fine tuning is negligible. We showed for example that you can fine tune a [6B parameter model for about $7](https://www.anyscale.com/blog/how-to-fine-tune-and-serve-llms-simply-quickly-and-cost-effectively-using). Even at OpenAI’s rate for its most expensive fine-tunable model, Davinci, it is 3c per 1000 tokens. That means to fine tune on the entire works of Shakespeare (about 1 million words), you’re looking at $40[^5]. However, fine tuning is one thing and training from scratch is another … 81 | 82 | ## GPU Memory 83 | 84 | If you’re self-hosting a model, it’s really important to understand GPU memory because LLMs push your GPU’s memory to the limit. The following statistics are specifically about inference. You need considerably more memory for training or fine tuning. 85 | 86 | ### V100: 16GB, A10G: 24GB, A100: 40/80GB: GPU Memory Capacities 87 | 88 | It may seem strange, but it’s important to know the amount of memory different types of GPUs have. This will cap the number of parameters your LLM can have. Generally, we like to use A10Gs because they cost $1.50 to $2 per hour each at AWS on-demand prices and have 24GB of GPU memory, vs the A100s which will run you about $5 per hour each at AWS on-demand prices. 89 | 90 | ### 2x number of parameters: Typical GPU memory requirements of an LLM for serving 91 | 92 | For example, if you have a 7 billion parameter model, it takes about 14GB of GPU space. This is because most of the time, one 16-bit float (or 2 bytes) is required per parameter. There’s usually no need to go beyond 16-bit accuracy, and most of the time when you go to 8-bit accuracy you start to lose resolution (though that may be acceptable in some cases). 
Of course there are efforts to reduce this, notably llama.cpp which runs a 13 billion parameter model on a 6GB GPU by quantizing aggressively down to 4 bits (and 8 bits without too much impact), but that’s atypical. 93 | 94 | ### ~1GB: Typical GPU memory requirements of an embedding model 95 | 96 | Whenever you are doing sentence embedding (a very typical thing you do for clustering, semantic search and classification tasks), you need an embedding model like [sentence transformers](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/). OpenAI also has its own embeddings that they provide commercially. 97 | 98 | You typically don’t have to worry about how much memory embeddings take on the GPU, they’re fairly small. We’ve even had the embedding and the LLM on the same GPU. 99 | 100 | ### >10x: Throughput improvement from batching LLM requests 101 | 102 | Running an LLM query through a GPU is very high latency: it may take, say, 5 seconds, with a throughput of 0.2 queries per second. The funny thing is, though, if you run two tasks, it might only take 5.2 seconds. This means that if you can bundle 25 queries together, it would take about 10 seconds, and our throughput has improved to 2.5 queries per second. However, see the next point. 103 | 104 | ### ~1 MB: GPU Memory required for 1 token of output with a 13B parameter model 105 | 106 | The amount of memory you need is directly proportional to the maximum number of tokens you want to generate. So for example, if you want to generate outputs of up to 512 tokens (about 380 words), you need 512MB. No big deal you might say – I have 24GB to spare, what’s 512MB? Well, if you want to run bigger batches it starts to add up. So if you want to do batches of 16, you need 8GB of space. There are some techniques being developed that overcome this, but it’s still a real issue. 
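The memory and batching rules of thumb in this section can be turned into a quick calculator. The following is only a rough sketch that restates the arithmetic from the text (2 bytes per parameter, about 1 MB per output token for a 13B model, queries per second as batch size over batch latency); the function names are hypothetical and the constants are the approximations quoted above, not measured values.

```python
# Rules of thumb quoted in the text above; treat the results as rough estimates.
BYTES_PER_PARAM = 2           # one 16-bit float per parameter
MB_PER_OUTPUT_TOKEN_13B = 1   # ~1 MB of GPU memory per output token (13B model)


def weights_gb(params_billion):
    """Approximate GPU memory (GB) for the weights: 2 bytes x number of parameters."""
    return params_billion * BYTES_PER_PARAM


def generation_gb(max_output_tokens, batch_size=1):
    """Approximate extra GPU memory (GB) for output tokens with a 13B model."""
    total_mb = max_output_tokens * batch_size * MB_PER_OUTPUT_TOKEN_13B
    return total_mb / 1024


def throughput_qps(batch_size, batch_seconds):
    """Queries per second when a whole batch finishes in batch_seconds."""
    return batch_size / batch_seconds


if __name__ == "__main__":
    print(weights_gb(7))            # 7B parameters at 16 bits
    print(generation_gb(512, 16))   # 512 output tokens, batch of 16
    print(throughput_qps(25, 10))   # 25 queries bundled into a 10 second batch
```

With the figures used in the text, this reproduces the 14GB estimate for a 7B model, the 8GB needed for a batch of 16 outputs of 512 tokens, and the 2.5 queries per second from batching 25 requests.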
107 | 108 | # Cheatsheet 109 | 110 | Screenshot 2023-05-17 at 1 46 09 PM 111 | 112 | # Next Steps 113 | 114 | See our earlier [blog series on solving Generative AI infrastructure](https://www.anyscale.com/blog/ray-common-production-challenges-for-generative-ai-infrastructure) and [using LangChain with Ray](https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray). \ 115 | \ 116 | If you are interested in learning more about Ray, see [Ray.io](http://ray.io/) and [Docs.Ray.io](http://docs.ray.io/). \ 117 | \ 118 | To connect with the Ray community, join #LLM on the [Ray Slack](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform) or our [Discuss forum](https://discuss.ray.io/). \ 119 | \ 120 | If you are interested in our Ray hosted service for ML Training and Serving, see [Anyscale.com/Platform](http://www.anyscale.com/platform) and click the 'Try it now' button. 121 | 122 | **Ray Summit 2023:** If you are interested in learning much more about how Ray can be used to build performant and scalable LLM applications and fine-tune/train/serve LLMs on Ray, join [Ray Summit](https://raysummit.anyscale.com/) on September 18-20th! We have a set of great keynote speakers including John Schulman from OpenAI and Aidan Gomez from Cohere, community and tech talks about Ray as well as [practical training focused on LLMs](https://github.com/ray-project/ray-educational-materials/blob/main/NLP_workloads/Text_generation/LLM_finetuning_and_batch_inference.ipynb). 123 | 124 | 125 | 126 | ## Notes 127 | 128 | [^1]: Based on experimentation with GPT-3.5-Turbo using a suite of prompts on 2023-05-08. 129 | [^2]: Retrieved from [http://openai.com/pricing](http://openai.com/pricing) on 2023-05-08. 130 | [^3]: **GPT-4**: 6c/1k tokens for the prompt, 12c/1k tokens for the generation (32,000 window version, 8,000 window version is half that). **GPT-3.5 Turbo**: 0.2c/1k tokens. 
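131 | [^4]: This assumes the vector lookup is “free.” It’s not, but it uses CPUs (much cheaper) and is fairly fast. 132 | [^5]: 1 million words / 0.75 words/token / 1000\*0.03 = $40. 133 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/1.Introduction.md: -------------------------------------------------------------------------------- 1 | # Building Systems with the ChatGPT API 2 | 3 | ## Introduction 4 | 5 | Welcome to the course Building Systems with the ChatGPT API 👏🏻👏🏻 6 | 7 | This course was developed by Andrew Ng together with OpenAI. It is designed to guide developers in building a complete intelligent question-answering system on top of ChatGPT. 8 | 9 | ### 📚 Course overview 10 | 11 | Using ChatGPT involves more than a single prompt or a single model call. This course shares best practices for building complex applications with LLMs. 12 | 13 | Taking a customer service assistant as the example, the course chains calls to the language model with different instructions, each depending on the output of the previous call, and sometimes looks up information from external sources. 14 | 15 | The course works through this theme, walking step by step through how the application is built internally, along with best practices for evaluating and continuously improving the system over the long term. 16 | 17 | 18 | ### 🌹 Acknowledgments to key course contributors 19 | 20 | Thanks to Andrew Kondrick, Joe Palermo, Boris Power, and Ted Sanders from the OpenAI team, 21 | and to Geoff Ladwig, Eddie Shyu, and Tommy Nelson from the DeepLearning.ai team. -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/11.conclusion.md: -------------------------------------------------------------------------------- 1 | ## Andrew Ng: Building Systems with the ChatGPT API, Summary 2 | 3 | ## Building Systems with the ChatGPT API 4 | 5 | This short course covered a range of practical ChatGPT applications, including processing inputs, checking outputs, and evaluation, and walked through the complete process of building a system. 6 | 7 | ### 📚 Course recap 8 | 9 | The course explained in detail how LLMs work, including subtleties such as the tokenizer, methods for evaluating the quality and safety of user input, using chain of thought in prompts, splitting tasks by chaining prompts, and checking outputs before returning them to the user. 10 | 11 | It also introduced methods for evaluating a system's performance over time in order to monitor and improve it. 12 | 13 | In addition, the course touched on building responsible systems so that the model gives reasonable and relevant responses. 14 | 15 | ### 💪🏻 Set off and explore a new world 16 | 17 | Practice is the only way to real understanding. Start building exciting applications! 18 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/3.Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "id": "63651c26", 7 | "metadata": {}, 8 | "source": [ 9 | "Chapter 3 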
Evaluating Inputs: Classification" 10 | ] 11 | }, 12 | { 13 | "attachments": {}, 14 | "cell_type": "markdown", 15 | "id": "b12f80c9", 16 | "metadata": {}, 17 | "source": [ 18 | "In this section we focus on tasks that evaluate inputs, which is important for ensuring the quality and safety of the system.\n", 19 | "\n", 20 | "For tasks in which many independent sets of instructions are needed to handle different cases, it can be beneficial to first classify the type of query and then use that classification to determine which instructions to use.\n", 21 | "\n", 22 | "This can be achieved by defining fixed categories and hard-coding the instructions that are relevant for handling tasks in a given category.\n", 23 | "\n", 24 | "For instance, when building a customer service assistant, it might be important to first classify the type of query and then determine which instructions to use based on that classification.\n", 25 | "\n", 26 | "So, for example, you might give different secondary instructions if a user asks to close their account than if a user asks about a specific product, in which case you might add extra product information.\n" 27 | ] 28 | }, 29 | { 30 | "attachments": {}, 31 | "cell_type": "markdown", 32 | "id": "87d9de1d", 33 | "metadata": {}, 34 | "source": [ 35 | "## Setup\n", 36 | "Load the API key and wrap the API call in a helper function" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 9, 42 | "id": "55ee24ab", 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import os\n", 47 | "import openai\n", 48 | "from dotenv import load_dotenv, find_dotenv\n", 49 | "_ = load_dotenv(find_dotenv()) # read local .env file\n", 50 | "openai.api_key = os.environ['OPENAI_API_KEY']\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 2, 56 | "id": "0318b89e", 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "def get_completion_from_messages(messages, \n", 61 | " model=\"gpt-3.5-turbo\", \n", 62 | " temperature=0, \n", 63 | " max_tokens=500):\n", 64 | " response = openai.ChatCompletion.create(\n", 65 | " model=model,\n", 66 | " messages=messages,\n", 67 | " temperature=temperature, \n", 68 | " max_tokens=max_tokens,\n", 69 | " )\n", 70 | " return response.choices[0].message[\"content\"]" 71 | ] 72 | }, 73 | { 74 | "attachments": {}, 75 | "cell_type": "markdown", 76 | "id": "f2b55807", 77 | "metadata": {}, 78 | "source": [ 79 | "#### Classifying user instructions" 80 | ] 81 | }, 82 | { 83 | "attachments": {}, 84 | "cell_type": "markdown", 85 | "id": "c3216166", 86 | "metadata": {}, 87 | "source": [ 88 | "Here is our system message, which gives the overall instructions for the system, and we are using this delimiter: ####.\n", 89 | "\n", 90 | 
"A delimiter is just a way to separate different parts of an instruction or output, and it helps the model tell the different sections apart.\n", 91 | "\n", 92 | "So for this example we use #### as the delimiter.\n", 93 | "\n", 94 | "It is a good delimiter because it is actually represented as a single token." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 3, 100 | "id": "3b406ba8", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "delimiter = \"####\"" 105 | ] 106 | }, 107 | { 108 | "attachments": {}, 109 | "cell_type": "markdown", 110 | "id": "049d0d82", 111 | "metadata": {}, 112 | "source": [ 113 | "This is our system message, and we prompt the model in the following way." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 4, 119 | "id": "29e2d170", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "system_message = f\"\"\"\n", 124 | "You will be provided with customer service queries. \\\n", 125 | "The customer service query will be delimited with \\\n", 126 | "{delimiter} characters.\n", 127 | "Classify each query into a primary category \\\n", 128 | "and a secondary category. \n", 129 | "Provide your output in json format with the \\\n", 130 | "keys: primary and secondary.\n", 131 | "\n", 132 | "Primary categories: Billing, Technical Support, \\\n", 133 | "Account Management, or General Inquiry.\n", 134 | "\n", 135 | "Billing secondary categories:\n", 136 | "Unsubscribe or upgrade\n", 137 | "Add a payment method\n", 138 | "Explanation for charge\n", 139 | "Dispute a charge\n", 140 | "\n", 141 | "Technical Support secondary categories:\n", 142 | "General troubleshooting\n", 143 | "Device compatibility\n", 144 | "Software updates\n", 145 | "\n", 146 | "Account Management secondary categories:\n", 147 | "Password reset\n", 148 | "Update personal information\n", 149 | "Close account\n", 150 | "Account security\n", 151 | "\n", 152 | "General Inquiry secondary categories:\n", 153 | "Product information\n", 154 | "Pricing\n", 155 | "Feedback\n", 156 | "Speak to a human\n", 157 | "\n", 158 | "\"\"\"" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 5, 164 | "id": "61f4b474", 
165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "# Chinese-language version of the prompt\n", 169 | "system_message = f\"\"\"\n", 170 | "你将获得客户服务查询。\n", 171 | "每个客户服务查询都将用{delimiter}字符分隔。\n", 172 | "将每个查询分类到一个主要类别和一个次要类别中。\n", 173 | "以JSON格式提供你的输出,包含以下键:primary和secondary。\n", 174 | "\n", 175 | "主要类别:计费(Billing)、技术支持(Technical Support)、账户管理(Account Management)或一般咨询(General Inquiry)。\n", 176 | "\n", 177 | "计费次要类别:\n", 178 | "取消订阅或升级(Unsubscribe or upgrade)\n", 179 | "添加付款方式(Add a payment method)\n", 180 | "收费解释(Explanation for charge)\n", 181 | "争议费用(Dispute a charge)\n", 182 | "\n", 183 | "技术支持次要类别:\n", 184 | "常规故障排除(General troubleshooting)\n", 185 | "设备兼容性(Device compatibility)\n", 186 | "软件更新(Software updates)\n", 187 | "\n", 188 | "账户管理次要类别:\n", 189 | "重置密码(Password reset)\n", 190 | "更新个人信息(Update personal information)\n", 191 | "关闭账户(Close account)\n", 192 | "账户安全(Account security)\n", 193 | "\n", 194 | "一般咨询次要类别:\n", 195 | "产品信息(Product information)\n", 196 | "定价(Pricing)\n", 197 | "反馈(Feedback)\n", 198 | "与人工对话(Speak to a human)\n", 199 | "\n", 200 | "\"\"\"" 201 | ] 202 | }, 203 | { 204 | "attachments": {}, 205 | "cell_type": "markdown", 206 | "id": "e6a932ce", 207 | "metadata": {}, 208 | "source": [ 209 | "Now let's look at an example user message; we will use the following." 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 26, 215 | "id": "2b2df0bf", 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "user_message = f\"\"\"\\\n", 220 | "I want you to delete my profile and all of my user data\"\"\"" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 6, 226 | "id": "3b8070bf", 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "user_message = f\"\"\"\\\n", 231 | "我希望你删除我的个人资料和所有用户数据。\"\"\"" 232 | ] 233 | }, 234 | { 235 | "attachments": {}, 236 | "cell_type": "markdown", 237 | "id": "3a2c1cf0", 238 | "metadata": {}, 239 | "source": [ 240 | "Format this message into a message list, with the system message and the user message separated using \"####\".\n", 241 | "\n", 242 | 
"Let's think about what this sentence means to us as humans: \"I want you to delete my profile.\"\n", 243 | "\n", 244 | "It looks like it belongs to the \"Account Management\" category, perhaps under \"Close account\"." 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 7, 250 | "id": "6e2b9049", 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "messages = [ \n", 255 | "{'role':'system', \n", 256 | " 'content': system_message}, \n", 257 | "{'role':'user', \n", 258 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 259 | "]" 260 | ] 261 | }, 262 | { 263 | "attachments": {}, 264 | "cell_type": "markdown", 265 | "id": "4b295207", 266 | "metadata": {}, 267 | "source": [ 268 | "Let's see what the model thinks.\n", 269 | "\n", 270 | "The model classifies it with \"Account Management\" as \"primary\" and \"Close account\" as \"secondary\".\n", 271 | "\n", 272 | "The nice thing about asking for structured output such as JSON is that you can easily read it into some kind of object,\n", 273 | "\n", 274 | "for example a dictionary in Python, or some other object if you are using a different language, and feed it into a subsequent step." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 10, 280 | "id": "77328388", 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "name": "stdout", 285 | "output_type": "stream", 286 | "text": [ 287 | "{\n", 288 | " \"primary\": \"账户管理\",\n", 289 | " \"secondary\": \"关闭账户\"\n", 290 | "}\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "response = get_completion_from_messages(messages)\n", 296 | "print(response)" 297 | ] 298 | }, 299 | { 300 | "attachments": {}, 301 | "cell_type": "markdown", 302 | "id": "2f6b353b", 303 | "metadata": {}, 304 | "source": [ 305 | "Here is another user message: \"Tell me more about your flat screen TVs\"\n", 306 | "\n", 307 | "We build the same message list, get the model's response, and print it.\n", 308 | "\n", 309 | "Here is our second classification, and it looks correct." 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": 31, 315 | "id": "edf8fbe9", 316 | "metadata": {}, 317 | "outputs": [ 318 | { 319 | "name": "stdout", 320 | "output_type": "stream", 321 | "text": [ 322 | "{\n", 323 | " \"primary\": \"General Inquiry\",\n", 324 | " \"secondary\": \"Product information\"\n", 325 | "}\n" 326 | ] 327 | } 328 | ], 329 | 
"source": [ 330 | "user_message = f\"\"\"\\\n", 331 | "Tell me more about your flat screen tvs\"\"\"\n", 332 | "messages = [ \n", 333 | "{'role':'system', \n", 334 | " 'content': system_message}, \n", 335 | "{'role':'user', \n", 336 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 337 | "] \n", 338 | "response = get_completion_from_messages(messages)\n", 339 | "print(response)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 12, 345 | "id": "f1d738e1", 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "name": "stdout", 350 | "output_type": "stream", 351 | "text": [ 352 | "以下是针对平板电脑的一般咨询:\n", 353 | "\n", 354 | "{\n", 355 | " \"primary\": \"General Inquiry\",\n", 356 | " \"secondary\": \"Product information\"\n", 357 | "}\n", 358 | "\n", 359 | "如果您有任何特定的问题或需要更详细的信息,请告诉我,我会尽力回答。\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "user_message = f\"\"\"\\\n", 365 | "告诉我更多有关你们的平板电脑的信息\"\"\"\n", 366 | "messages = [ \n", 367 | "{'role':'system', \n", 368 | " 'content': system_message}, \n", 369 | "{'role':'user', \n", 370 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 371 | "] \n", 372 | "response = get_completion_from_messages(messages)\n", 373 | "print(response)" 374 | ] 375 | }, 376 | { 377 | "attachments": {}, 378 | "cell_type": "markdown", 379 | "id": "8f87f68d", 380 | "metadata": {}, 381 | "source": [ 382 | "So, in general, based on the classification of the customer inquiry, we can now provide a more specific set of instructions to handle the next steps.\n", 383 | "\n", 384 | "In this case we might add extra information about the TVs, whereas in a different case we might want to provide a link for closing the account or something similar.\n", 385 | "\n", 386 | "We will learn more about different ways of processing inputs in a later video.\n", 387 | "\n", 388 | "In the next video we will look at more ways to evaluate inputs, in particular ways to make sure users are using the system responsibly." 389 | ] 390 | } 391 | ], 392 | "metadata": { 393 | "kernelspec": { 394 | "display_name": "Python 3 (ipykernel)", 395 | "language": "python", 396 | "name": "python3" 397 | }, 398 | "language_info": { 399 | "codemirror_mode": { 400 | "name": "ipython", 401 | "version": 3 402 | }, 403 | "file_extension": ".py", 404 | "mimetype": "text/x-python", 405 | "name": 
"python", 406 | "nbconvert_exporter": "python", 407 | "pygments_lexer": "ipython3", 408 | "version": "3.10.11" 409 | } 410 | }, 411 | "nbformat": 4, 412 | "nbformat_minor": 5 413 | } 414 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/5.Chain of Thought Reasoning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": {}, 7 | "source": [ 8 | "# L4: Process Inputs: Chain of Thought Reasoning" 9 | ] 10 | }, 11 | { 12 | "attachments": {}, 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "In this section we focus on tasks that process inputs, that is, tasks that generate a useful output through a series of steps.\n", 17 | "\n", 18 | "Sometimes a model needs to reason about a problem in detail before answering a specific question; if you took our previous course, you will have seen many examples of this. Sometimes a model makes reasoning errors by rushing to an incorrect conclusion, so we can reframe the query to request a series of relevant reasoning steps before the model gives its final answer, so that it can think about the problem longer and more methodically.\n", 19 | "\n", 20 | "In general, we call this strategy of asking the model to reason about a problem in steps chain of thought reasoning." 21 | ] 22 | }, 23 | { 24 | "attachments": {}, 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## Setup\n", 29 | "#### Load the API key and relevant Python libraries.\n", 30 | "In this course we provide some code that loads the OpenAI API key for you." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "import os\n", 40 | "import openai\n", 41 | "from dotenv import load_dotenv, find_dotenv\n", 42 | "_ = load_dotenv(find_dotenv()) # read local .env file\n", 43 | "\n", 44 | "openai.api_key = os.environ['OPENAI_API_KEY']\n" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "def get_completion_from_messages(messages, \n", 54 | " model=\"gpt-3.5-turbo\", \n", 55 | " temperature=0, \n", 56 | " max_tokens=500):\n", 57 | " response = openai.ChatCompletion.create(\n", 58 | " model=model,\n", 59 | " messages=messages,\n", 60 | " temperature=temperature, \n", 61 | " max_tokens=max_tokens, \n", 62 | " )\n", 63 
| " return response.choices[0].message[\"content\"]" 64 | ] 65 | }, 66 | { 67 | "attachments": {}, 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## Chain-of-Thought Prompting" 72 | ] 73 | }, 74 | { 75 | "attachments": {}, 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "So here we ask the model to reason about the answer before coming to its conclusion.\n" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "delimiter = \"####\"\n", 89 | "system_message = f\"\"\"\n", 90 | "Follow these steps to answer the customer queries.\n", 91 | "The customer query will be delimited with four hashtags,\\\n", 92 | "i.e. {delimiter}. \n", 93 | "\n", 94 | "Step 1:{delimiter} First decide whether the user is \\\n", 95 | "asking a question about a specific product or products. \\\n", 96 | "Product category doesn't count. \n", 97 | "\n", 98 | "Step 2:{delimiter} If the user is asking about \\\n", 99 | "specific products, identify whether \\\n", 100 | "the products are in the following list.\n", 101 | "All available products: \n", 102 | "1. Product: TechPro Ultrabook\n", 103 | " Category: Computers and Laptops\n", 104 | " Brand: TechPro\n", 105 | " Model Number: TP-UB100\n", 106 | " Warranty: 1 year\n", 107 | " Rating: 4.5\n", 108 | " Features: 13.3-inch display, 8GB RAM, 256GB SSD, Intel Core i5 processor\n", 109 | " Description: A sleek and lightweight ultrabook for everyday use.\n", 110 | " Price: $799.99\n", 111 | "\n", 112 | "2. Product: BlueWave Gaming Laptop\n", 113 | " Category: Computers and Laptops\n", 114 | " Brand: BlueWave\n", 115 | " Model Number: BW-GL200\n", 116 | " Warranty: 2 years\n", 117 | " Rating: 4.7\n", 118 | " Features: 15.6-inch display, 16GB RAM, 512GB SSD, NVIDIA GeForce RTX 3060\n", 119 | " Description: A high-performance gaming laptop for an immersive experience.\n", 120 | " Price: $1199.99\n", 121 | "\n", 122 | "3. 
Product: PowerLite Convertible\n", 123 | " Category: Computers and Laptops\n", 124 | " Brand: PowerLite\n", 125 | " Model Number: PL-CV300\n", 126 | " Warranty: 1 year\n", 127 | " Rating: 4.3\n", 128 | " Features: 14-inch touchscreen, 8GB RAM, 256GB SSD, 360-degree hinge\n", 129 | " Description: A versatile convertible laptop with a responsive touchscreen.\n", 130 | " Price: $699.99\n", 131 | "\n", 132 | "4. Product: TechPro Desktop\n", 133 | " Category: Computers and Laptops\n", 134 | " Brand: TechPro\n", 135 | " Model Number: TP-DT500\n", 136 | " Warranty: 1 year\n", 137 | " Rating: 4.4\n", 138 | " Features: Intel Core i7 processor, 16GB RAM, 1TB HDD, NVIDIA GeForce GTX 1660\n", 139 | " Description: A powerful desktop computer for work and play.\n", 140 | " Price: $999.99\n", 141 | "\n", 142 | "5. Product: BlueWave Chromebook\n", 143 | " Category: Computers and Laptops\n", 144 | " Brand: BlueWave\n", 145 | " Model Number: BW-CB100\n", 146 | " Warranty: 1 year\n", 147 | " Rating: 4.1\n", 148 | " Features: 11.6-inch display, 4GB RAM, 32GB eMMC, Chrome OS\n", 149 | " Description: A compact and affordable Chromebook for everyday tasks.\n", 150 | " Price: $249.99\n", 151 | "\n", 152 | "Step 3:{delimiter} If the message contains products \\\n", 153 | "in the list above, list any assumptions that the \\\n", 154 | "user is making in their \\\n", 155 | "message e.g. that Laptop X is bigger than \\\n", 156 | "Laptop Y, or that Laptop Z has a 2 year warranty.\n", 157 | "\n", 158 | "Step 4:{delimiter}: If the user made any assumptions, \\\n", 159 | "figure out whether the assumption is true based on your \\\n", 160 | "product information. \n", 161 | "\n", 162 | "Step 5:{delimiter}: First, politely correct the \\\n", 163 | "customer's incorrect assumptions if applicable. \\\n", 164 | "Only mention or reference products in the list of \\\n", 165 | "5 available products, as these are the only 5 \\\n", 166 | "products that the store sells. 
\\\n", 167 | "Answer the customer in a friendly tone.\n", 168 | "\n", 169 | "Use the following format:\n", 170 | "Step 1:{delimiter} <step 1 reasoning>\n", 171 | "Step 2:{delimiter} <step 2 reasoning>\n", 172 | "Step 3:{delimiter} <step 3 reasoning>\n", 173 | "Step 4:{delimiter} <step 4 reasoning>\n", 174 | "Response to user:{delimiter} <response to customer>\n", 175 | "\n", 176 | "Make sure to include {delimiter} to separate every step.\n", 177 | "\"\"\"" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": null, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | "Step 1:#### The user is asking a question about two specific products, the BlueWave Chromebook and the TechPro Desktop.\n", 190 | "Step 2:#### The prices of the two products are as follows:\n", 191 | "- BlueWave Chromebook: $249.99\n", 192 | "- TechPro Desktop: $999.99\n", 193 | "Step 3:#### The user is assuming that the BlueWave Chromebook is more expensive than the TechPro Desktop.\n", 194 | "Step 4:#### The assumption is incorrect. The TechPro Desktop is actually more expensive than the BlueWave Chromebook.\n", 195 | "Response to user:#### The BlueWave Chromebook is actually less expensive than the TechPro Desktop. 
The BlueWave Chromebook costs $249.99 while the TechPro Desktop costs $999.99.\n" 196 | ] 197 | } 198 | ], 199 | "source": [ 200 | "user_message = f\"\"\"\n", 201 | "by how much is the BlueWave Chromebook more expensive \\\n", 202 | "than the TechPro Desktop\"\"\"\n", 203 | "\n", 204 | "messages = [ \n", 205 | "{'role':'system', \n", 206 | " 'content': system_message}, \n", 207 | "{'role':'user', \n", 208 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 209 | "] \n", 210 | "\n", 211 | "response = get_completion_from_messages(messages)\n", 212 | "print(response)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "name": "stdout", 222 | "output_type": "stream", 223 | "text": [ 224 | "Step 1:#### The user is asking if the store sells TVs.\n", 225 | "Step 2:#### The list of available products does not include any TVs.\n", 226 | "Response to user:#### I'm sorry, but we do not sell TVs at this store. 
Our available products include computers and laptops.\n" 227 | ] 228 | } 229 | ], 230 | "source": [ 231 | "user_message = f\"\"\"\n", 232 | "do you sell tvs\"\"\"\n", 233 | "messages = [ \n", 234 | "{'role':'system', \n", 235 | " 'content': system_message}, \n", 236 | "{'role':'user', \n", 237 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 238 | "] \n", 239 | "response = get_completion_from_messages(messages)\n", 240 | "print(response)" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 4, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "delimiter = \"####\"\n", 250 | "system_message = f\"\"\"\n", 251 | "请按照以下步骤回答客户的查询。客户的查询将以四个井号(#)分隔,即 {delimiter}。\n", 252 | "\n", 253 | "步骤 1:{delimiter} 首先确定用户是否正在询问有关特定产品或产品的问题。产品类别不计入范围。\n", 254 | "\n", 255 | "步骤 2:{delimiter} 如果用户询问特定产品,请确认产品是否在以下列表中。所有可用产品:\n", 256 | "\n", 257 | "产品:TechPro超极本\n", 258 | "类别:计算机和笔记本电脑\n", 259 | "品牌:TechPro\n", 260 | "型号:TP-UB100\n", 261 | "保修期:1年\n", 262 | "评分:4.5\n", 263 | "特点:13.3英寸显示屏,8GB RAM,256GB SSD,Intel Core i5处理器\n", 264 | "描述:一款适用于日常使用的时尚轻便的超极本。\n", 265 | "价格:$799.99\n", 266 | "\n", 267 | "产品:BlueWave游戏笔记本电脑\n", 268 | "类别:计算机和笔记本电脑\n", 269 | "品牌:BlueWave\n", 270 | "型号:BW-GL200\n", 271 | "保修期:2年\n", 272 | "评分:4.7\n", 273 | "特点:15.6英寸显示屏,16GB RAM,512GB SSD,NVIDIA GeForce RTX 3060\n", 274 | "描述:一款高性能的游戏笔记本电脑,提供沉浸式体验。\n", 275 | "价格:$1199.99\n", 276 | "\n", 277 | "产品:PowerLite可转换笔记本电脑\n", 278 | "类别:计算机和笔记本电脑\n", 279 | "品牌:PowerLite\n", 280 | "型号:PL-CV300\n", 281 | "保修期:1年\n", 282 | "评分:4.3\n", 283 | "特点:14英寸触摸屏,8GB RAM,256GB SSD,360度铰链\n", 284 | "描述:一款多功能可转换笔记本电脑,具有响应触摸屏。\n", 285 | "价格:$699.99\n", 286 | "\n", 287 | "产品:TechPro台式电脑\n", 288 | "类别:计算机和笔记本电脑\n", 289 | "品牌:TechPro\n", 290 | "型号:TP-DT500\n", 291 | "保修期:1年\n", 292 | "评分:4.4\n", 293 | "特点:Intel Core i7处理器,16GB RAM,1TB HDD,NVIDIA GeForce GTX 1660\n", 294 | "描述:一款功能强大的台式电脑,适用于工作和娱乐。\n", 295 | "价格:$999.99\n", 296 | "\n", 297 | "产品:BlueWave Chromebook\n", 298 | 
"类别:计算机和笔记本电脑\n", 299 | "品牌:BlueWave\n", 300 | "型号:BW-CB100\n", 301 | "保修期:1年\n", 302 | "评分:4.1\n", 303 | "特点:11.6英寸显示屏,4GB RAM,32GB eMMC,Chrome OS\n", 304 | "描述:一款紧凑而价格实惠的Chromebook,适用于日常任务。\n", 305 | "价格:$249.99\n", 306 | "\n", 307 | "步骤 3:{delimiter} 如果消息中包含上述列表中的产品,请列出用户在消息中做出的任何假设,例如笔记本电脑 X 比笔记本电脑 Y 大,或者笔记本电脑 Z 有 2 年保修期。\n", 308 | "\n", 309 | "步骤 4:{delimiter} 如果用户做出了任何假设,请根据产品信息确定假设是否正确。\n", 310 | "\n", 311 | "步骤 5:{delimiter} 如果用户有任何错误的假设,请先礼貌地纠正客户的错误假设(如果适用)。只提及或引用可用产品列表中的产品,因为这是商店销售的唯一五款产品。以友好的口吻回答客户。\n", 312 | "\n", 313 | "使用以下格式回答问题:\n", 314 | "步骤 1:{delimiter} <步骤 1的推理>\n", 315 | "步骤 2:{delimiter} <步骤 2 的推理>\n", 316 | "步骤 3:{delimiter} <步骤 3 的推理>\n", 317 | "步骤 4:{delimiter} <步骤 4 的推理>\n", 318 | "回复客户:{delimiter} <回复客户的内容>\n", 319 | "\n", 320 | "请确保在每个步骤之间使用 {delimiter} 进行分隔。\n", 321 | "\"\"\"" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 5, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "步骤 1:#### 确认用户正在询问有关特定产品的问题。\n", 334 | "\n", 335 | "步骤 2:#### 用户询问 BlueWave Chromebook 和 TechPro 台式电脑之间的价格差异。\n", 336 | "\n", 337 | "步骤 3:#### 用户假设 BlueWave Chromebook 的价格高于 TechPro 台式电脑。\n", 338 | "\n", 339 | "步骤 4:#### 用户的假设是正确的。BlueWave Chromebook 的价格为 $249.99,而 TechPro 台式电脑的价格为 $999.99,因此 BlueWave Chromebook 的价格比 TechPro 台式电脑低 $750。\n", 340 | "\n", 341 | "回复客户:#### BlueWave Chromebook 比 TechPro 台式电脑便宜 $750。\n" 342 | ] 343 | } 344 | ], 345 | "source": [ 346 | "user_message = f\"\"\"BlueWave Chromebook 比 TechPro 台式电脑贵多少?\"\"\"\n", 347 | "\n", 348 | "messages = [ \n", 349 | "{'role':'system', \n", 350 | " 'content': system_message}, \n", 351 | "{'role':'user', \n", 352 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 353 | "] \n", 354 | "\n", 355 | "response = get_completion_from_messages(messages)\n", 356 | "print(response)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 9, 362 | "metadata": {}, 363 | "outputs": [ 
364 | { 365 | "name": "stdout", 366 | "output_type": "stream", 367 | "text": [ 368 | "步骤 1:#### 首先确定用户是否正在询问有关特定产品或产品的问题。产品类别不计入范围。\n", 369 | "\n", 370 | "步骤 2:#### 如果用户询问特定产品,请确认产品是否在以下列表中。所有可用产品:\n", 371 | "\n", 372 | "我们很抱歉,我们商店不出售电视机。\n", 373 | "\n", 374 | "步骤 3:#### 如果消息中包含上述列表中的产品,请列出用户在消息中做出的任何假设,例如笔记本电脑 X 比笔记本电脑 Y 大,或者笔记本电脑 Z 有 2 年保修期。\n", 375 | "\n", 376 | "N/A\n", 377 | "\n", 378 | "步骤 4:#### 如果用户做出了任何假设,请根据产品信息确定假设是否正确。\n", 379 | "\n", 380 | "N/A\n", 381 | "\n", 382 | "回复客户:#### 我们很抱歉,我们商店不出售电视机。\n" 383 | ] 384 | } 385 | ], 386 | "source": [ 387 | "user_message = f\"\"\"你有电视机么\"\"\"\n", 388 | "messages = [ \n", 389 | "{'role':'system', \n", 390 | " 'content': system_message}, \n", 391 | "{'role':'user', \n", 392 | " 'content': f\"{delimiter}{user_message}{delimiter}\"}, \n", 393 | "] \n", 394 | "response = get_completion_from_messages(messages)\n", 395 | "print(response)" 396 | ] 397 | }, 398 | { 399 | "attachments": {}, 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "## Inner Monologue" 404 | ] 405 | }, 406 | { 407 | "attachments": {}, 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "In some applications, the reasoning the model uses to reach its final answer is not appropriate to share with the user. In a tutoring application, for example, we may want to encourage students to work out their own answers, but the model's reasoning about a student's solution could reveal the answer.\n", 412 | "\n", 413 | "Inner monologue is a tactic for mitigating this. It is simply a fancy way of saying that the model's reasoning is hidden from the user.\n", 414 | "\n", 415 | "The idea is to instruct the model to put the parts of the output that should be hidden from the user into a structured format that makes them easy to parse. Then, before the output is presented to the user, it is parsed and only part of it is made visible.\n" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 10, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "name": "stdout", 425 | "output_type": "stream", 426 | "text": [ 427 | "我们很抱歉,我们商店不出售电视机。\n" 428 | ] 429 | } 430 | ], 431 | "source": [ 432 | "try:\n", 433 | " final_response = response.split(delimiter)[-1].strip()\n", 434 | "except Exception as e:\n", 435 | " final_response = \"Sorry, I'm having trouble right now, please try asking another question.\"\n", 436 | " \n", 437 | "print(final_response)" 438 | ] 439 | 
} 440 | ], 441 | "metadata": { 442 | "kernelspec": { 443 | "display_name": "Python 3", 444 | "language": "python", 445 | "name": "python3" 446 | }, 447 | "language_info": { 448 | "codemirror_mode": { 449 | "name": "ipython", 450 | "version": 3 451 | }, 452 | "file_extension": ".py", 453 | "mimetype": "text/x-python", 454 | "name": "python", 455 | "nbconvert_exporter": "python", 456 | "pygments_lexer": "ipython3", 457 | "version": "3.10.11" 458 | }, 459 | "orig_nbformat": 4 460 | }, 461 | "nbformat": 4, 462 | "nbformat_minor": 2 463 | } 464 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/7.Check Outputs.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f99b8a44", 6 | "metadata": {}, 7 | "source": [ 8 | "# L6: Check Outputs\n", 9 | "A relatively short and easy section." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "id": "5daec1c7", 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import os\n", 20 | "import openai\n", 21 | "\n", 22 | "# from dotenv import load_dotenv, find_dotenv\n", 23 | "# _ = load_dotenv(find_dotenv()) # read the local .env file\n", 24 | "\n", 25 | "openai.api_key = 'sk-xxxxxxxxxxxx' # replace with your own key" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "id": "9c40b32d", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "def get_completion_from_messages(messages, model=\"gpt-3.5-turbo\", temperature=0, max_tokens=500):\n", 36 | " response = openai.ChatCompletion.create(\n", 37 | " model=model,\n", 38 | " messages=messages,\n", 39 | " temperature=temperature, \n", 40 | " max_tokens=max_tokens, \n", 41 | " )\n", 42 | " return response.choices[0].message[\"content\"]" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "59f69c2e", 48 | "metadata": {}, 49 | "source": [ 50 | "### Check the output for potentially harmful content\n", 51 | 
"重要的就是一个moderation" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 4, 57 | "id": "943f5396", 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "name": "stdout", 62 | "output_type": "stream", 63 | "text": [ 64 | "{\n", 65 | " \"categories\": {\n", 66 | " \"hate\": false,\n", 67 | " \"hate/threatening\": false,\n", 68 | " \"self-harm\": false,\n", 69 | " \"sexual\": false,\n", 70 | " \"sexual/minors\": false,\n", 71 | " \"violence\": false,\n", 72 | " \"violence/graphic\": false\n", 73 | " },\n", 74 | " \"category_scores\": {\n", 75 | " \"hate\": 2.6680607e-06,\n", 76 | " \"hate/threatening\": 1.2194433e-08,\n", 77 | " \"self-harm\": 8.294434e-07,\n", 78 | " \"sexual\": 3.41087e-05,\n", 79 | " \"sexual/minors\": 1.5462567e-07,\n", 80 | " \"violence\": 6.3285606e-06,\n", 81 | " \"violence/graphic\": 2.9102332e-06\n", 82 | " },\n", 83 | " \"flagged\": false\n", 84 | "}\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "final_response_to_customer = f\"\"\"\n", 90 | "SmartX ProPhone有一个6.1英寸的显示屏,128GB存储、1200万像素的双摄像头,以及5G。FotoSnap单反相机有一个2420万像素的传感器,1080p视频,3英寸LCD和 \n", 91 | "可更换的镜头。我们有各种电视,包括CineView 4K电视,55英寸显示屏,4K分辨率、HDR,以及智能电视功能。我们也有SoundMax家庭影院系统,具有5.1声道,1000W输出,无线 \n", 92 | "重低音扬声器和蓝牙。关于这些产品或我们提供的任何其他产品您是否有任何具体问题?\n", 93 | "\"\"\"\n", 94 | "# Moderation是OpenAI的内容审核函数,用于检测这段内容的危害含量\n", 95 | "\n", 96 | "response = openai.Moderation.create(\n", 97 | " input=final_response_to_customer\n", 98 | ")\n", 99 | "moderation_output = response[\"results\"][0]\n", 100 | "print(moderation_output)" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "id": "f57f8dad", 106 | "metadata": {}, 107 | "source": [ 108 | "### 检查输出结果是否与提供的产品信息相符合" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 7, 114 | "id": "552e3d8c", 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "name": "stdout", 119 | "output_type": "stream", 120 | "text": [ 121 | "Y\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "# 这是一段电子产品相关的信息\n", 127 | 
"system_message = f\"\"\"\n", 128 | "You are an assistant that evaluates whether \\\n", 129 | "customer service agent responses sufficiently \\\n", 130 | "answer customer questions, and also validates that \\\n", 131 | "all the facts the assistant cites from the product \\\n", 132 | "information are correct.\n", 133 | "The product information and user and customer \\\n", 134 | "service agent messages will be delimited by \\\n", 135 | "3 backticks, i.e. ```.\n", 136 | "Respond with a Y or N character, with no punctuation:\n", 137 | "Y - if the output sufficiently answers the question \\\n", 138 | "AND the response correctly uses product information\n", 139 | "N - otherwise\n", 140 | "\n", 141 | "Output a single letter only.\n", 142 | "\"\"\"\n", 143 | "\n", 144 | "#这是顾客的提问\n", 145 | "customer_message = f\"\"\"\n", 146 | "tell me about the smartx pro phone and \\\n", 147 | "the fotosnap camera, the dslr one. \\\n", 148 | "Also tell me about your tvs\"\"\"\n", 149 | "product_information = \"\"\"{ \"name\": \"SmartX ProPhone\", \"category\": \"Smartphones and Accessories\", \"brand\": \"SmartX\", \"model_number\": \"SX-PP10\", \"warranty\": \"1 year\", \"rating\": 4.6, \"features\": [ \"6.1-inch display\", \"128GB storage\", \"12MP dual camera\", \"5G\" ], \"description\": \"A powerful smartphone with advanced camera features.\", \"price\": 899.99 } { \"name\": \"FotoSnap DSLR Camera\", \"category\": \"Cameras and Camcorders\", \"brand\": \"FotoSnap\", \"model_number\": \"FS-DSLR200\", \"warranty\": \"1 year\", \"rating\": 4.7, \"features\": [ \"24.2MP sensor\", \"1080p video\", \"3-inch LCD\", \"Interchangeable lenses\" ], \"description\": \"Capture stunning photos and videos with this versatile DSLR camera.\", \"price\": 599.99 } { \"name\": \"CineView 4K TV\", \"category\": \"Televisions and Home Theater Systems\", \"brand\": \"CineView\", \"model_number\": \"CV-4K55\", \"warranty\": \"2 years\", \"rating\": 4.8, \"features\": [ \"55-inch display\", \"4K 
resolution\", \"HDR\", \"Smart TV\" ], \"description\": \"A stunning 4K TV with vibrant colors and smart features.\", \"price\": 599.99 } { \"name\": \"SoundMax Home Theater\", \"category\": \"Televisions and Home Theater Systems\", \"brand\": \"SoundMax\", \"model_number\": \"SM-HT100\", \"warranty\": \"1 year\", \"rating\": 4.4, \"features\": [ \"5.1 channel\", \"1000W output\", \"Wireless subwoofer\", \"Bluetooth\" ], \"description\": \"A powerful home theater system for an immersive audio experience.\", \"price\": 399.99 } { \"name\": \"CineView 8K TV\", \"category\": \"Televisions and Home Theater Systems\", \"brand\": \"CineView\", \"model_number\": \"CV-8K65\", \"warranty\": \"2 years\", \"rating\": 4.9, \"features\": [ \"65-inch display\", \"8K resolution\", \"HDR\", \"Smart TV\" ], \"description\": \"Experience the future of television with this stunning 8K TV.\", \"price\": 2999.99 } { \"name\": \"SoundMax Soundbar\", \"category\": \"Televisions and Home Theater Systems\", \"brand\": \"SoundMax\", \"model_number\": \"SM-SB50\", \"warranty\": \"1 year\", \"rating\": 4.3, \"features\": [ \"2.1 channel\", \"300W output\", \"Wireless subwoofer\", \"Bluetooth\" ], \"description\": \"Upgrade your TV's audio with this sleek and powerful soundbar.\", \"price\": 199.99 } { \"name\": \"CineView OLED TV\", \"category\": \"Televisions and Home Theater Systems\", \"brand\": \"CineView\", \"model_number\": \"CV-OLED55\", \"warranty\": \"2 years\", \"rating\": 4.7, \"features\": [ \"55-inch display\", \"4K resolution\", \"HDR\", \"Smart TV\" ], \"description\": \"Experience true blacks and vibrant colors with this OLED TV.\", \"price\": 1499.99 }\"\"\"\n", 150 | "\n", 151 | "q_a_pair = f\"\"\"\n", 152 | "Customer message: ```{customer_message}```\n", 153 | "Product information: ```{product_information}```\n", 154 | "Agent response: ```{final_response_to_customer}```\n", 155 | "\n", 156 | "Does the response use the retrieved information correctly?\n", 157 | "Does the 
response sufficiently answer the question?\n", 158 | "\n", 159 | "Output Y or N\n", 160 | "\"\"\"\n", 161 | "#判断相关性\n", 162 | "messages = [\n", 163 | " {'role': 'system', 'content': system_message},\n", 164 | " {'role': 'user', 'content': q_a_pair}\n", 165 | "]\n", 166 | "\n", 167 | "response = get_completion_from_messages(messages, max_tokens=1)\n", 168 | "print(response)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 6, 174 | "id": "afb1b82f", 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "name": "stdout", 179 | "output_type": "stream", 180 | "text": [ 181 | "N\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "another_response = \"life is like a box of chocolates\"\n", 187 | "q_a_pair = f\"\"\"\n", 188 | "Customer message: ```{customer_message}```\n", 189 | "Product information: ```{product_information}```\n", 190 | "Agent response: ```{another_response}```\n", 191 | "\n", 192 | "Does the response use the retrieved information correctly?\n", 193 | "Does the response sufficiently answer the question?\n", 194 | "\n", 195 | "Output Y or N\n", 196 | "\"\"\"\n", 197 | "messages = [\n", 198 | " {'role': 'system', 'content': system_message},\n", 199 | " {'role': 'user', 'content': q_a_pair}\n", 200 | "]\n", 201 | "\n", 202 | "response = get_completion_from_messages(messages)\n", 203 | "print(response)" 204 | ] 205 | } 206 | ], 207 | "metadata": { 208 | "kernelspec": { 209 | "display_name": "Python 3 (ipykernel)", 210 | "language": "python", 211 | "name": "python3" 212 | }, 213 | "language_info": { 214 | "codemirror_mode": { 215 | "name": "ipython", 216 | "version": 3 217 | }, 218 | "file_extension": ".py", 219 | "mimetype": "text/x-python", 220 | "name": "python", 221 | "nbconvert_exporter": "python", 222 | "pygments_lexer": "ipython3", 223 | "version": "3.10.9" 224 | } 225 | }, 226 | "nbformat": 4, 227 | "nbformat_minor": 5 228 | } 229 | -------------------------------------------------------------------------------- 
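The two checks in the Check Outputs notebook, the Moderation flag and the Y/N self-evaluation, can be combined into a single gate that decides whether a draft reply is shown to the customer. The following is a minimal sketch under stated assumptions: `gate_response` and `FALLBACK_MESSAGE` are hypothetical names that do not appear in the course, and the function takes the already-computed `flagged` value and verdict string instead of calling the OpenAI API itself.

```python
# Hypothetical helper (not part of the course notebook): combine the
# Moderation result and the Y/N self-evaluation into one decision.
FALLBACK_MESSAGE = (
    "Sorry, I can't provide that information right now. "
    "Let me connect you with a human representative."
)


def gate_response(draft: str, flagged: bool, verdict: str) -> str:
    """Return the draft only if it passed both checks, else a fallback.

    draft:   the candidate reply generated for the customer
    flagged: the "flagged" field of the Moderation result for the draft
    verdict: the evaluator model's raw reply, expected to be "Y" or "N"
    """
    if flagged:
        return FALLBACK_MESSAGE
    # Treat anything other than a clean "Y" as a failure, so that
    # unexpected evaluator output never lets a bad draft through.
    if verdict.strip().upper() != "Y":
        return FALLBACK_MESSAGE
    return draft
```

With the variables from the notebook cells, this could plausibly be called as `gate_response(final_response_to_customer, moderation_output["flagged"], response)`. Failing closed on anything other than `Y` is a design choice of this sketch, not something the course prescribes.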
/99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/products.json: -------------------------------------------------------------------------------- 1 | {"TechPro Ultrabook": {"name": "TechPro Ultrabook", "category": "Computers and Laptops", "brand": "TechPro", "model_number": "TP-UB100", "warranty": "1 year", "rating": 4.5, "features": ["13.3-inch display", "8GB RAM", "256GB SSD", "Intel Core i5 processor"], "description": "A sleek and lightweight ultrabook for everyday use.", "price": 799.99}, "BlueWave Gaming Laptop": {"name": "BlueWave Gaming Laptop", "category": "Computers and Laptops", "brand": "BlueWave", "model_number": "BW-GL200", "warranty": "2 years", "rating": 4.7, "features": ["15.6-inch display", "16GB RAM", "512GB SSD", "NVIDIA GeForce RTX 3060"], "description": "A high-performance gaming laptop for an immersive experience.", "price": 1199.99}, "PowerLite Convertible": {"name": "PowerLite Convertible", "category": "Computers and Laptops", "brand": "PowerLite", "model_number": "PL-CV300", "warranty": "1 year", "rating": 4.3, "features": ["14-inch touchscreen", "8GB RAM", "256GB SSD", "360-degree hinge"], "description": "A versatile convertible laptop with a responsive touchscreen.", "price": 699.99}, "TechPro Desktop": {"name": "TechPro Desktop", "category": "Computers and Laptops", "brand": "TechPro", "model_number": "TP-DT500", "warranty": "1 year", "rating": 4.4, "features": ["Intel Core i7 processor", "16GB RAM", "1TB HDD", "NVIDIA GeForce GTX 1660"], "description": "A powerful desktop computer for work and play.", "price": 999.99}, "BlueWave Chromebook": {"name": "BlueWave Chromebook", "category": "Computers and Laptops", "brand": "BlueWave", "model_number": "BW-CB100", "warranty": "1 year", "rating": 4.1, "features": ["11.6-inch display", "4GB RAM", "32GB eMMC", "Chrome OS"], "description": "A compact and affordable Chromebook for everyday tasks.", "price": 249.99}, "SmartX ProPhone": {"name": "SmartX ProPhone", "category": "Smartphones and 
Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": ["6.1-inch display", "128GB storage", "12MP dual camera", "5G"], "description": "A powerful smartphone with advanced camera features.", "price": 899.99}, "MobiTech PowerCase": {"name": "MobiTech PowerCase", "category": "Smartphones and Accessories", "brand": "MobiTech", "model_number": "MT-PC20", "warranty": "1 year", "rating": 4.3, "features": ["5000mAh battery", "Wireless charging", "Compatible with SmartX ProPhone"], "description": "A protective case with built-in battery for extended usage.", "price": 59.99}, "SmartX MiniPhone": {"name": "SmartX MiniPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-MP5", "warranty": "1 year", "rating": 4.2, "features": ["4.7-inch display", "64GB storage", "8MP camera", "4G"], "description": "A compact and affordable smartphone for basic tasks.", "price": 399.99}, "MobiTech Wireless Charger": {"name": "MobiTech Wireless Charger", "category": "Smartphones and Accessories", "brand": "MobiTech", "model_number": "MT-WC10", "warranty": "1 year", "rating": 4.5, "features": ["10W fast charging", "Qi-compatible", "LED indicator", "Compact design"], "description": "A convenient wireless charger for a clutter-free workspace.", "price": 29.99}, "SmartX EarBuds": {"name": "SmartX EarBuds", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-EB20", "warranty": "1 year", "rating": 4.4, "features": ["True wireless", "Bluetooth 5.0", "Touch controls", "24-hour battery life"], "description": "Experience true wireless freedom with these comfortable earbuds.", "price": 99.99}, "CineView 4K TV": {"name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": ["55-inch display", "4K resolution", "HDR", "Smart TV"], "description": "A stunning 4K TV with vibrant 
colors and smart features.", "price": 599.99}, "SoundMax Home Theater": {"name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": ["5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth"], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99}, "CineView 8K TV": {"name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": ["65-inch display", "8K resolution", "HDR", "Smart TV"], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99}, "SoundMax Soundbar": {"name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": ["2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth"], "description": "Upgrade your TV's audio with this sleek and powerful soundbar.", "price": 199.99}, "CineView OLED TV": {"name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": ["55-inch display", "4K resolution", "HDR", "Smart TV"], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99}, "GameSphere X": {"name": "GameSphere X", "category": "Gaming Consoles and Accessories", "brand": "GameSphere", "model_number": "GS-X", "warranty": "1 year", "rating": 4.9, "features": ["4K gaming", "1TB storage", "Backward compatibility", "Online multiplayer"], "description": "A next-generation gaming console for the ultimate gaming experience.", "price": 499.99}, "ProGamer Controller": {"name": "ProGamer Controller", "category": "Gaming Consoles and Accessories", "brand": "ProGamer", 
"model_number": "PG-C100", "warranty": "1 year", "rating": 4.2, "features": ["Ergonomic design", "Customizable buttons", "Wireless", "Rechargeable battery"], "description": "A high-quality gaming controller for precision and comfort.", "price": 59.99}, "GameSphere Y": {"name": "GameSphere Y", "category": "Gaming Consoles and Accessories", "brand": "GameSphere", "model_number": "GS-Y", "warranty": "1 year", "rating": 4.8, "features": ["4K gaming", "500GB storage", "Backward compatibility", "Online multiplayer"], "description": "A compact gaming console with powerful performance.", "price": 399.99}, "ProGamer Racing Wheel": {"name": "ProGamer Racing Wheel", "category": "Gaming Consoles and Accessories", "brand": "ProGamer", "model_number": "PG-RW200", "warranty": "1 year", "rating": 4.5, "features": ["Force feedback", "Adjustable pedals", "Paddle shifters", "Compatible with GameSphere X"], "description": "Enhance your racing games with this realistic racing wheel.", "price": 249.99}, "GameSphere VR Headset": {"name": "GameSphere VR Headset", "category": "Gaming Consoles and Accessories", "brand": "GameSphere", "model_number": "GS-VR", "warranty": "1 year", "rating": 4.6, "features": ["Immersive VR experience", "Built-in headphones", "Adjustable headband", "Compatible with GameSphere X"], "description": "Step into the world of virtual reality with this comfortable VR headset.", "price": 299.99}, "AudioPhonic Noise-Canceling Headphones": {"name": "AudioPhonic Noise-Canceling Headphones", "category": "Audio Equipment", "brand": "AudioPhonic", "model_number": "AP-NC100", "warranty": "1 year", "rating": 4.6, "features": ["Active noise-canceling", "Bluetooth", "20-hour battery life", "Comfortable fit"], "description": "Experience immersive sound with these noise-canceling headphones.", "price": 199.99}, "WaveSound Bluetooth Speaker": {"name": "WaveSound Bluetooth Speaker", "category": "Audio Equipment", "brand": "WaveSound", "model_number": "WS-BS50", "warranty": "1 year", 
"rating": 4.5, "features": ["Portable", "10-hour battery life", "Water-resistant", "Built-in microphone"], "description": "A compact and versatile Bluetooth speaker for music on the go.", "price": 49.99}, "AudioPhonic True Wireless Earbuds": {"name": "AudioPhonic True Wireless Earbuds", "category": "Audio Equipment", "brand": "AudioPhonic", "model_number": "AP-TW20", "warranty": "1 year", "rating": 4.4, "features": ["True wireless", "Bluetooth 5.0", "Touch controls", "18-hour battery life"], "description": "Enjoy music without wires with these comfortable true wireless earbuds.", "price": 79.99}, "WaveSound Soundbar": {"name": "WaveSound Soundbar", "category": "Audio Equipment", "brand": "WaveSound", "model_number": "WS-SB40", "warranty": "1 year", "rating": 4.3, "features": ["2.0 channel", "80W output", "Bluetooth", "Wall-mountable"], "description": "Upgrade your TV's audio with this slim and powerful soundbar.", "price": 99.99}, "AudioPhonic Turntable": {"name": "AudioPhonic Turntable", "category": "Audio Equipment", "brand": "AudioPhonic", "model_number": "AP-TT10", "warranty": "1 year", "rating": 4.2, "features": ["3-speed", "Built-in speakers", "Bluetooth", "USB recording"], "description": "Rediscover your vinyl collection with this modern turntable.", "price": 149.99}, "FotoSnap DSLR Camera": {"name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": ["24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses"], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99}, "ActionCam 4K": {"name": "ActionCam 4K", "category": "Cameras and Camcorders", "brand": "ActionCam", "model_number": "AC-4K", "warranty": "1 year", "rating": 4.4, "features": ["4K video", "Waterproof", "Image stabilization", "Wi-Fi"], "description": "Record your adventures with this rugged and compact 4K action camera.", 
"price": 299.99}, "FotoSnap Mirrorless Camera": {"name": "FotoSnap Mirrorless Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-ML100", "warranty": "1 year", "rating": 4.6, "features": ["20.1MP sensor", "4K video", "3-inch touchscreen", "Interchangeable lenses"], "description": "A compact and lightweight mirrorless camera with advanced features.", "price": 799.99}, "ZoomMaster Camcorder": {"name": "ZoomMaster Camcorder", "category": "Cameras and Camcorders", "brand": "ZoomMaster", "model_number": "ZM-CM50", "warranty": "1 year", "rating": 4.3, "features": ["1080p video", "30x optical zoom", "3-inch LCD", "Image stabilization"], "description": "Capture life's moments with this easy-to-use camcorder.", "price": 249.99}, "FotoSnap Instant Camera": {"name": "FotoSnap Instant Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-IC10", "warranty": "1 year", "rating": 4.1, "features": ["Instant prints", "Built-in flash", "Selfie mirror", "Battery-powered"], "description": "Create instant memories with this fun and portable instant camera.", "price": 69.99}} -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《Building Systems with the ChatGPT API》/readme.md: -------------------------------------------------------------------------------- 1 | # 使用 ChatGPT API 搭建系统 2 | 3 | 吴恩达老师发布的大模型开发新课程,在《Prompt Engineering for Developers》课程的基础上,指导开发者如何基于 ChatGPT 提供的 API 开发一个完整的、全面的智能问答系统,包括使用大语言模型的基本规范,通过分类与监督评估输入,通过思维链推理及链式提示处理输入,检查并评估系统输出等,介绍了基于大模型开发的新范式,值得每一个有志于使用大模型开发应用程序的开发者学习。 4 | 5 | ### 目录 6 | 7 | 1. 简介 Introduction @Sarai 8 | 2. 模型,范式和 token Language Models, the Chat Format and Tokens @仲泰 9 | 3. 检查输入-分类 Classification @诸世纪 10 | 4. 检查输入-监督 Moderation @诸世纪 11 | 5. 思维链推理 Chain of Thought Reasoning @万礼行 12 | 6. 提示链 Chaining Prompts @万礼行 13 | 7. 检查输入 Check Outputs @仲泰 14 | 8. 评估(端到端系统)Evaluation @邹雨衡 15 | 9. 评估(简单问答)Evaluation-part1 @陈志宏 16 | 10. 
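评估(复杂问答)Evaluation-part2 @邹雨衡 17 | 11. 总结 Conclusion @Sarai 18 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《ChatGPT Prompt Engineering for Developers》/00.README.md: -------------------------------------------------------------------------------- 1 | # Prompt Engineering for Developers 2 | 3 | LLMs are steadily changing people's lives. For developers, learning how to use LLM APIs to quickly and conveniently build more capable, LLM-integrated applications, and to deliver newer and more practical functionality, has become an important skill. The course "ChatGPT Prompt Engineering for Developers", created by Andrew Ng in collaboration with OpenAI, targets developers who are new to LLMs. It explains in an accessible way how to construct prompts and use the OpenAI API to implement common capabilities such as summarizing, inferring, and transforming, and it is a classic introduction to LLM development. For the foreseeable future the course is likely to remain an important entry point to LLMs, but it is currently available only in English and access from mainland China is restricted, so a Chinese edition that is readily accessible there is valuable. We have therefore translated the course into Chinese and reproduced its example code so that Chinese-speaking learners can use it directly and learn LLM development more easily. 4 | 5 | This tutorial is the Chinese edition of Andrew Ng's course "ChatGPT Prompt Engineering for Developers". It shows developers how to construct prompts and build new LLM-based applications on the OpenAI API, covering: 6 | > Principles for writing prompts; 7 | > Summarizing text (for example, summarizing user reviews); 8 | > Inferring from text (for example, sentiment classification and topic extraction); 9 | > Transforming text (for example, translation and automatic error correction); 10 | > Expanding text (for example, writing emails) 11 | 12 | **Contents:** 13 | 1. Introduction @邹雨衡 14 | 2. Principles for constructing prompts (Guidelines) @邹雨衡 15 | 3. Iterative prompt development (Iterative) @邹雨衡 16 | 4. Summarizing @玉琳 17 | 5. Inferring @长琴 18 | 6. Transforming @玉琳 19 | 7. Expanding @邹雨衡 20 | 8. Chatbot @长琴 21 | 9. Conclusion @长琴 22 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《ChatGPT Prompt Engineering for Developers》/01. 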
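简介.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | **By Professor Andrew Ng** 4 | 5 | Welcome to this course on ChatGPT prompt engineering for developers. I am teaching it together with Isa Fulford. Isa is a member of the technical staff at OpenAI; she built the popular ChatGPT Retrieval plugin and has done a great deal to teach people how to use LLMs and LLM technology in products. She has also contributed to the OpenAI Cookbook, which teaches people how to prompt. 6 | 7 | There is a lot of material about prompting on the internet, such as articles titled "30 prompts everyone has to know". Much of it focuses on the ChatGPT web user interface, which many people use for specific and often one-off tasks. I think the greater power of LLMs (large language models) for developers lies in using API calls to an LLM to build software applications quickly, and I think this is still underappreciated. In fact, my team at AI Fund, a sister company of DeepLearning.AI, has been working with many startups to apply these technologies to many different applications. It has been exciting to see how quickly LLM APIs let developers build applications. 8 | 9 | In this course we will share some of these possibilities with you, along with best practices for realizing them. 10 | 11 | In the development of large language models (LLMs), there have broadly been two types: base LLMs and instruction-tuned LLMs. A base LLM has been trained to predict the next word from text training data, usually a large amount of data from the internet and other sources. For example, given the prompt "Once upon a time there was a unicorn", a base LLM might continue with "that lived in a magical forest with all her unicorn friends". But given the prompt "What is the capital of France?", a base LLM might complete it with "What is France's largest city? What is France's population?", because articles on the internet may well be lists of quiz questions about France. 12 | 13 | Much of the momentum in LLM research and practice is now in instruction-tuned LLMs, which have been trained to follow instructions. If you ask one "What is the capital of France?", it is much more likely to output "The capital of France is Paris". Training an instruction-tuned LLM typically starts from a base LLM that has already been trained on a huge amount of text. The model is then fine-tuned on a dataset whose inputs are instructions and whose outputs are good responses to them, and it is often further refined with a technique called RLHF (reinforcement learning from human feedback) so that the system is more helpful and better at following instructions. 14 | 15 | Because instruction-tuned LLMs have been trained to be helpful, honest, and harmless, they are less likely than base LLMs to output problematic text such as toxic content. Many practical use cases have shifted toward instruction-tuned LLMs. Some best practices you find on the internet may be better suited to base LLMs, but for most practical applications today we recommend focusing on instruction-tuned LLMs, which are easier to use and, thanks to the work of OpenAI and other LLM companies, are becoming safer and better aligned. 16 | 17 | This course therefore focuses on best practices for instruction-tuned LLMs, which we recommend for most applications. Before moving on, I want to thank the teams at OpenAI and DeepLearning.AI who contributed to the materials Isa and I will present. I am very grateful to Andrew Mayne, Joe Palermo, Boris Power, Ted Sanders, and Lilian Weng of OpenAI, who were closely involved in brainstorming and vetting the materials and in putting together the curriculum for this short course. I am also grateful, on the DeepLearning.AI side, for the work of Geoff Ladwig, Eddy Shyu, and Tommy Nelson. 18 | 19 | When you use an instruction-tuned LLM, think of giving instructions to another person: someone who is smart but does not know the specifics of your task. When an LLM does not work, it is sometimes because the instructions were not clear enough. For example, if you say "Please write me something about Alan Turing", it helps to be clear about whether the text should focus on his scientific work, his personal life, his role in history, or something else. You can also specify the tone, for instance that of a professional journalist, or something more like a casual note to a friend. 20 | 21 | Of course, if you imagine asking a fresh college graduate to do this task for you, you could even specify in advance which snippets of text they should read before writing about Alan Turing, which would set them up even better for success. In the next chapter you will see how to make prompts clear and specific, which is an important principle of prompting, and you will learn the second principle: giving the LLM time to think. -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《ChatGPT Prompt Engineering for Developers》/07. 文本扩展 Expanding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Chapter 7: Expanding\n", 8 | "\n", 9 | "Expanding is the task of taking a short piece of text, such as a set of instructions or a list of topics, and having a large language model generate a longer piece of text, such as an email or an essay on some topic. There are good uses for this, such as using a large language model as a brainstorming partner. There are also problematic uses, such as someone generating large amounts of spam. When you use these capabilities of a large language model, please use them only responsibly and in ways that help people.\n", 10 | "\n", 11 | "In this chapter you will learn how to use the OpenAI API to generate a customer service email tailored to each customer's review. We will also use another input parameter of the model, called temperature, which lets you vary the degree of exploration and variety in the model's responses.\n" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## 1. Environment setup\n", 19 | "\n", 20 | "As in the previous chapters, you need similar code to configure an environment that can use the OpenAI API." 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "# Put your own API key into the kernel's environment (`!export` runs in a subshell and would not persist)\n", 30 | "%env OPENAI_API_KEY=api-key" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "import openai\n", 40 | "import os\n", 41 | "from dotenv import load_dotenv, find_dotenv\n", 42 | "# import the required libraries\n", 43 | "\n", 44 | "_ = load_dotenv(find_dotenv())\n", 45 | "# load environment variables from a local .env file, if one exists\n", 46 | "\n", 47 | "openai.api_key = os.getenv('OPENAI_API_KEY')\n", 48 | "# set the API key" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 9, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# A wrapper around the OpenAI API: takes a prompt and returns the model's reply\n", 58 | "def get_completion(prompt, model=\"gpt-3.5-turbo\", temperature=0):\n", 59 | " '''\n", 60 | " prompt: the prompt text\n", 61 | " model: the model to call; defaults to gpt-3.5-turbo (ChatGPT); users with gpt-4 access can choose gpt-4\n", 62 | " temperature: the sampling temperature\n", 63 | " '''\n", 64 | " messages = [{\"role\": \"user\", 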
\"content\": prompt}]\n", 65 | " response = openai.ChatCompletion.create(\n", 66 | " model=model,\n", 67 | " messages=messages,\n", 68 | " temperature=temperature, # 模型输出的温度系数,控制输出的随机程度\n", 69 | " )\n", 70 | " # 调用 OpenAI 的 ChatCompletion 接口\n", 71 | " return response.choices[0].message[\"content\"]\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## 二、定制客户邮件" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "我们将根据客户评价和情感撰写自定义电子邮件响应。因此,我们将给定客户评价和情感,并生成自定义响应即使用 LLM 根据客户评价和评论情感生成定制电子邮件。" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "我们首先给出一个示例,包括一个评论及对应的情感" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 4, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "# given the sentiment from the lesson on \"inferring\",\n", 102 | "# and the original customer message, customize the email\n", 103 | "sentiment = \"negative\"\n", 104 | "\n", 105 | "# review for a blender\n", 106 | "review = f\"\"\"\n", 107 | "So, they still had the 17 piece system on seasonal \\\n", 108 | "sale for around $49 in the month of November, about \\\n", 109 | "half off, but for some reason (call it price gouging) \\\n", 110 | "around the second week of December the prices all went \\\n", 111 | "up to about anywhere from between $70-$89 for the same \\\n", 112 | "system. And the 11 piece system went up around $10 or \\\n", 113 | "so in price also from the earlier sale price of $29. \\\n", 114 | "So it looks okay, but if you look at the base, the part \\\n", 115 | "where the blade locks into place doesn’t look as good \\\n", 116 | "as in previous editions from a few years ago, but I \\\n", 117 | "plan to be very gentle with it (example, I crush \\\n", 118 | "very hard items like beans, ice, rice, etc. 
in the \\ \n", 119 | "blender first then pulverize them in the serving size \\\n", 120 | "I want in the blender then switch to the whipping \\\n", 121 | "blade for a finer flour, and use the cross cutting blade \\\n", 122 | "first when making smoothies, then use the flat blade \\\n", 123 | "if I need them finer/less pulpy). Special tip when making \\\n", 124 | "smoothies, finely cut and freeze the fruits and \\\n", 125 | "vegetables (if using spinach-lightly stew soften the \\ \n", 126 | "spinach then freeze until ready for use-and if making \\\n", 127 | "sorbet, use a small to medium sized food processor) \\ \n", 128 | "that you plan to use that way you can avoid adding so \\\n", 129 | "much ice if at all-when making your smoothie. \\\n", 130 | "After about a year, the motor was making a funny noise. \\\n", 131 | "I called customer service but the warranty expired \\\n", 132 | "already, so I had to buy another one. FYI: The overall \\\n", 133 | "quality has gone done in these types of products, so \\\n", 134 | "they are kind of counting on brand recognition and \\\n", 135 | "consumer loyalty to maintain sales. 
Got it in about \\\n", 136 | "two days.\n", 137 | "\"\"\"" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 11, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# The chapter on inferring shows how to determine the sentiment of a review\n", 147 | "sentiment = \"negative\"\n", 148 | "\n", 149 | "# A product review (Chinese version)\n", 150 | "review = f\"\"\"\n", 151 | "他们在11月份的季节性销售期间以约49美元的价格出售17件套装,折扣约为一半。\\\n", 152 | "但由于某些原因(可能是价格欺诈),到了12月第二周,同样的套装价格全都涨到了70美元到89美元不等。\\\n", 153 | "11件套装的价格也上涨了大约10美元左右。\\\n", 154 | "虽然外观看起来还可以,但基座上锁定刀片的部分看起来不如几年前的早期版本那么好。\\\n", 155 | "不过我打算非常温柔地使用它,例如,\\\n", 156 | "我会先在搅拌机中将像豆子、冰、米饭等硬物研磨,然后再制成所需的份量,\\\n", 157 | "切换到打蛋器制作更细的面粉,或者在制作冰沙时先使用交叉切割刀片,然后使用平面刀片制作更细/不粘的效果。\\\n", 158 | "制作冰沙时,特别提示:\\\n", 159 | "将水果和蔬菜切碎并冷冻(如果使用菠菜,则轻轻煮软菠菜,然后冷冻直到使用;\\\n", 160 | "如果制作果酱,则使用小到中号的食品处理器),这样可以避免在制作冰沙时添加太多冰块。\\\n", 161 | "大约一年后,电机发出奇怪的噪音,我打电话给客服,但保修已经过期了,所以我不得不再买一个。\\\n", 162 | "总的来说,这些产品的总体质量已经下降,因此它们依靠品牌认可和消费者忠诚度来维持销售。\\\n", 163 | "货物在两天内到达。\n", 164 | "\"\"\"" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "We have already extracted the sentiment using what we learned in the lesson on inferring. This is a customer review of a blender, and we will now tailor the reply to that sentiment.\n", 172 | "\n", 173 | "The instruction here is: you are a customer service AI assistant; your task is to send an email reply to the customer; given the customer email delimited by three backticks, generate a reply to thank the customer for their review." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "Dear Valued Customer,\n", 186 | "\n", 187 | "Thank you for taking the time to leave a review about our product. We are sorry to hear that you experienced an increase in price and that the quality of the product did not meet your expectations. We apologize for any inconvenience this may have caused you.\n", 188 | "\n", 189 | "We would like to assure you that we take all feedback seriously and we will be sure to pass your comments along to our team. 
If you have any further concerns, please do not hesitate to reach out to our customer service team for assistance.\n", 190 | "\n", 191 | "Thank you again for your review and for choosing our product. We hope to have the opportunity to serve you better in the future.\n", 192 | "\n", 193 | "Best regards,\n", 194 | "\n", 195 | "AI customer agent\n" 196 | ] 197 | } 198 | ], 199 | "source": [ 200 | "prompt = f\"\"\"\n", 201 | "You are a customer service AI assistant.\n", 202 | "Your task is to send an email reply to a valued customer.\n", 203 | "Given the customer email delimited by ```, \\\n", 204 | "Generate a reply to thank the customer for their review.\n", 205 | "If the sentiment is positive or neutral, thank them for \\\n", 206 | "their review.\n", 207 | "If the sentiment is negative, apologize and suggest that \\\n", 208 | "they can reach out to customer service. \n", 209 | "Make sure to use specific details from the review.\n", 210 | "Write in a concise and professional tone.\n", 211 | "Sign the email as `AI customer agent`.\n", 212 | "Customer review: ```{review}```\n", 213 | "Review sentiment: {sentiment}\n", 214 | "\"\"\"\n", 215 | "response = get_completion(prompt)\n", 216 | "print(response)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 6, 222 | "metadata": {}, 223 | "outputs": [ 224 | { 225 | "name": "stdout", 226 | "output_type": "stream", 227 | "text": [ 228 | "尊敬的客户,\n", 229 | "\n", 230 | "非常感谢您对我们产品的评价。我们非常抱歉您在购买过程中遇到了价格上涨的问题。我们一直致力于为客户提供最优惠的价格,但由于市场波动,价格可能会有所变化。我们深表歉意,如果您需要任何帮助,请随时联系我们的客户服务团队。\n", 231 | "\n", 232 | "我们非常感谢您对我们产品的详细评价和使用技巧。我们将会把您的反馈传达给我们的产品团队,以便改进我们的产品质量和性能。\n", 233 | "\n", 234 | "再次感谢您对我们的支持和反馈。如果您需要任何帮助或有任何疑问,请随时联系我们的客户服务团队。\n", 235 | "\n", 236 | "祝您一切顺利!\n", 237 | "\n", 238 | "AI客户代理\n" 239 | ] 240 | } 241 | ], 242 | "source": [ 243 | "prompt = f\"\"\"\n", 244 | "你是一位客户服务的AI助手。\n", 245 | "你的任务是给一位重要客户发送邮件回复。\n", 246 | "根据客户通过“```”分隔的评价,生成回复以感谢客户的评价。请确保使用评价中的具体细节。\n", 247 | "用简明而专业的语气写信。\n", 248 | 
"作为“AI客户代理”签署电子邮件。\n", 249 | "客户评论:\n", 250 | "```{review}```\n", 251 | "评论情感:{sentiment}\n", 252 | "\"\"\"\n", 253 | "response = get_completion(prompt)\n", 254 | "print(response)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "## 3. Using Temperature\n", 262 | "\n", 263 | "Next, we will use a language model parameter called temperature, which lets us change the diversity of the model's responses. You can think of temperature as the degree of exploration, or randomness, of the model.\n", 264 | "\n", 265 | "For example, for the phrase “my favorite food is”, the most likely next word is “pizza”, followed by “sushi” and “tacos”. At temperature zero the model always picks the most likely next word, here “pizza”. At a higher temperature it will sometimes pick one of the less likely words, and at an even higher temperature it might even pick “tacos”, which has only a 5% chance of being chosen. As the model goes on to generate more words, a response that begins “my favorite food is pizza” diverges more and more from one that begins “my favorite food is tacos”.\n", 266 | "\n", 267 | "In general, when building applications that need predictable responses, I recommend temperature zero. Throughout these lessons we have used temperature zero, and if you are trying to build a reliable, predictable system, I think that is what you should choose. If you want to use the model more creatively and need a wider variety of outputs, you may want a higher temperature." 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 7, 273 | "metadata": {}, 274 | "outputs": [], 275 | "source": [ 276 | "# given the sentiment from the lesson on \"inferring\",\n", 277 | "# and the original customer message, customize the email\n", 278 | "sentiment = \"negative\"\n", 279 | "\n", 280 | "# review for a blender\n", 281 | "review = f\"\"\"\n", 282 | "So, they still had the 17 piece system on seasonal \\\n", 283 | "sale for around $49 in the month of November, about \\\n", 284 | "half off, but for some reason (call it price gouging) \\\n", 285 | "around the second week of December the prices all went \\\n", 286 | "up to about anywhere from between $70-$89 for the same \\\n", 287 | "system. And the 11 piece system went up around $10 or \\\n", 288 | "so in price also from the earlier sale price of $29. \\\n", 289 | "So it looks okay, but if you look at the base, the part \\\n", 290 | "where the blade locks into place doesn’t look as good \\\n", 291 | "as in previous editions from a few years ago, but I \\\n", 292 | "plan to be very gentle with it (example, I crush \\\n", 293 | "very hard items like beans, ice, rice, etc. 
in the \\ \n", 294 | "blender first then pulverize them in the serving size \\\n", 295 | "I want in the blender then switch to the whipping \\\n", 296 | "blade for a finer flour, and use the cross cutting blade \\\n", 297 | "first when making smoothies, then use the flat blade \\\n", 298 | "if I need them finer/less pulpy). Special tip when making \\\n", 299 | "smoothies, finely cut and freeze the fruits and \\\n", 300 | "vegetables (if using spinach-lightly stew soften the \\ \n", 301 | "spinach then freeze until ready for use-and if making \\\n", 302 | "sorbet, use a small to medium sized food processor) \\ \n", 303 | "that you plan to use that way you can avoid adding so \\\n", 304 | "much ice if at all-when making your smoothie. \\\n", 305 | "After about a year, the motor was making a funny noise. \\\n", 306 | "I called customer service but the warranty expired \\\n", 307 | "already, so I had to buy another one. FYI: The overall \\\n", 308 | "quality has gone done in these types of products, so \\\n", 309 | "they are kind of counting on brand recognition and \\\n", 310 | "consumer loyalty to maintain sales. Got it in about \\\n", 311 | "two days.\n", 312 | "\"\"\"" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 10, 318 | "metadata": {}, 319 | "outputs": [ 320 | { 321 | "name": "stdout", 322 | "output_type": "stream", 323 | "text": [ 324 | "Dear valued customer,\n", 325 | "\n", 326 | "Thank you for taking the time to share your review with us. We are sorry to hear that you were disappointed with the prices of our products and the quality of our blender. We apologize for any inconvenience this may have caused you.\n", 327 | "\n", 328 | "We value your feedback and would like to make things right for you. Please feel free to contact our customer service team so we can assist you with any concerns or issues you may have. 
We are committed to providing you with the best possible service and products.\n", 329 | "\n", 330 | "Thank you again for your review and for being a loyal customer. We hope to have the opportunity to serve you better in the future.\n", 331 | "\n", 332 | "Sincerely,\n", 333 | "AI customer agent\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "prompt = f\"\"\"\n", 339 | "You are a customer service AI assistant.\n", 340 | "Your task is to send an email reply to a valued customer.\n", 341 | "Given the customer email delimited by ```, \\\n", 342 | "Generate a reply to thank the customer for their review.\n", 343 | "If the sentiment is positive or neutral, thank them for \\\n", 344 | "their review.\n", 345 | "If the sentiment is negative, apologize and suggest that \\\n", 346 | "they can reach out to customer service. \n", 347 | "Make sure to use specific details from the review.\n", 348 | "Write in a concise and professional tone.\n", 349 | "Sign the email as `AI customer agent`.\n", 350 | "Customer review: ```{review}```\n", 351 | "Review sentiment: {sentiment}\n", 352 | "\"\"\"\n", 353 | "response = get_completion(prompt, temperature=0.7)\n", 354 | "print(response)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "code", 359 | "execution_count": 12, 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "name": "stdout", 364 | "output_type": "stream", 365 | "text": [ 366 | "尊敬的客户,\n", 367 | "\n", 368 | "非常感谢您对我们产品的评价。我们由衷地为您在购买过程中遇到的问题表示抱歉。我们确实在12月份的第二周调整了价格,但这是由于市场因素所致,并非价格欺诈。我们深刻意识到您对产品质量的担忧,我们将尽一切努力改进产品,以提供更好的体验。\n", 369 | "\n", 370 | "我们非常感激您对我们产品的使用经验和制作技巧的分享。您的建议和反馈对我们非常重要,我们将以此为基础,进一步改进我们的产品。\n", 371 | "\n", 372 | "如果您有任何疑问或需要进一步帮助,请随时联系我们的客户服务部门。我们将尽快回复您并提供帮助。\n", 373 | "\n", 374 | "最后,请再次感谢您对我们产品的评价和选择。我们期待着未来与您的合作。\n", 375 | "\n", 376 | "此致\n", 377 | "\n", 378 | "敬礼\n", 379 | "\n", 380 | "AI客户代理\n" 381 | ] 382 | } 383 | ], 384 | "source": [ 385 | "prompt = f\"\"\"\n", 386 | "你是一名客户服务的AI助手。\n", 387 | "你的任务是给一位重要的客户发送邮件回复。\n", 388 | 
"根据通过“```”分隔的客户电子邮件生成回复,以感谢客户的评价。\n", 389 | "如果情感是积极的或中性的,感谢他们的评价。\n", 390 | "如果情感是消极的,道歉并建议他们联系客户服务。\n", 391 | "请确保使用评论中的具体细节。\n", 392 | "以简明和专业的语气写信。\n", 393 | "以“AI客户代理”的名义签署电子邮件。\n", 394 | "客户评价:```{review}```\n", 395 | "评论情感:{sentiment}\n", 396 | "\"\"\"\n", 397 | "response = get_completion(prompt, temperature=0.7)\n", 398 | "print(response)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | " " 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "At temperature zero, you should expect the same completion each time you run the same prompt, whereas at temperature 0.7 you will get a different output each time.\n", 413 | "\n", 414 | "So you can see that this email differs from the one we received before. Run it again and you will get yet another email.\n", 415 | "\n", 416 | "I recommend that you experiment with temperature yourself to see how the output changes. In short, at higher temperatures the model's output is more random. You can almost think of the assistant as being more easily distracted at higher temperatures, but perhaps also more creative." 417 | ] 418 | } 419 | ], 420 | "metadata": { 421 | "kernelspec": { 422 | "display_name": "Python 3", 423 | "language": "python", 424 | "name": "python3" 425 | }, 426 | "language_info": { 427 | "codemirror_mode": { 428 | "name": "ipython", 429 | "version": 3 430 | }, 431 | "file_extension": ".py", 432 | "mimetype": "text/x-python", 433 | "name": "python", 434 | "nbconvert_exporter": "python", 435 | "pygments_lexer": "ipython3", 436 | "version": "3.8.13" 437 | }, 438 | "latex_envs": { 439 | "LaTeX_envs_menu_present": true, 440 | "autoclose": false, 441 | "autocomplete": true, 442 | "bibliofile": "biblio.bib", 443 | "cite_by": "apalike", 444 | "current_citInitial": 1, 445 | "eqLabelWithNumbers": true, 446 | "eqNumInitial": 1, 447 | "hotkeys": { 448 | "equation": "Ctrl-E", 449 | "itemize": "Ctrl-I" 450 | }, 451 | "labels_anchors": false, 452 | "latex_user_defs": false, 453 | "report_style_numbering": false, 454 | "user_envs_cfg": false 455 | }, 456 | "toc": { 457 | "base_numbering": 1, 458 | "nav_menu": {}, 459 | "number_sections": true, 460 | "sideBar": true, 461 | "skip_h1_title": false, 462 | "title_cell": "Table of Contents", 463 | "title_sidebar": "Contents", 464 | "toc_cell": false, 465
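| "toc_position": {}, 466 | "toc_section_display": true, 467 | "toc_window_display": false 468 | } 469 | }, 470 | "nbformat": 4, 471 | "nbformat_minor": 4 472 | } 473 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《ChatGPT Prompt Engineering for Developers》/09. 总结.md: -------------------------------------------------------------------------------- 1 | Congratulations on completing this short course. 2 | 3 | In summary, in this course we learned two key principles of prompting: 4 | 5 | - Write clear and specific instructions; 6 | - Where appropriate, give the model time to think. 7 | 8 | You also learned about iterative prompt development, and that having a process for arriving at the prompt that suits your application is key. 9 | 10 | We also covered several capabilities of large language models: summarizing, inferring, transforming, and expanding. You also learned how to build a custom chatbot. That is a lot for one short course, and we hope you enjoyed the material. 11 | 12 | We hope you will come up with ideas for applications and try building them yourself. Please give it a try and let us know what you come up with. You can start with a very small project, which might be somewhat useful or might have no practical value at all and just be fun. Use what you learn from your first project to build a better second one, an even better third one, and so on. Or, if you already have a bigger project in mind, go for it. 13 | 14 | Large language models are very powerful, so as a reminder, we ask everyone to use them responsibly and to build only things that have a positive impact on others. In this era, people who build AI systems can have a huge impact on others, so these tools must be used responsibly. 15 | 16 | Building applications on large language models is a very exciting and fast-moving field right now. Having completed this course, you now have a wealth of knowledge that lets you build things that few other people know how to build today. So I hope you will help us spread the word and encourage others to take this course too. 17 | 18 | Finally, we hope you enjoyed this course, and thank you for completing it. We look forward to hearing about the amazing things you build. -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《LangChain for LLM Application Development》/1.开篇介绍.md: -------------------------------------------------------------------------------- 1 | ## Andrew Ng (吴恩达): LangChain for LLM Application Development, Opening 2 | 3 | ## LangChain for LLM Application Development 4 | 5 | Welcome to the short course on LangChain for LLM application development 👏🏻👏🏻 6 | 7 | This course was developed by Harrison Chase, the creator of LangChain, in collaboration with DeepLearning.AI, and teaches you how to use this remarkable tool. 8 | 9 | ### 🚀 The Origin and Growth of LangChain 10 | 11 | By prompting an LLM (large language model), you can now develop AI applications faster than ever before, but an application may need to prompt an LLM many times and parse its output. 12 | 13 | That involves writing a lot of glue code, so Harrison Chase created LangChain, which brings together common abstractions and makes development much smoother. 14 | 15 | The LangChain open-source community is growing quickly, with hundreds of contributors, and code and features are being updated at an astonishing pace. 16 | 17 | 18 | 19 | ### 📚 Course Overview 20 | 21 | LangChain is an open-source framework for building LLM applications, available as two separate packages, one for Python and one for JavaScript. It is built around modular composition: there are many individual components that can be used together or on their own. LangChain also comes with many use cases that show how to combine these modular components into chains to form more complete end-to-end applications. 22 | 23 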
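| In this course, we will introduce the common components of LangChain and discuss models, prompts (how you get a model to do something), and indexes (ways of ingesting data). We will then discuss chains (end-to-end use cases) and the exciting topic of agents (end-to-end applications that use the model as a reasoning engine). 24 | 25 | 26 | 27 | ### 🌹 Thanks to Key Course Contributors 28 | 29 | Finally, special thanks to Ankush Gola (co-author of LangChain), Geoff Ladwig, Eddy Shyu, and Diala Ezzedine, who also put a great deal of thought into the course content~ -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《LangChain for LLM Application Development》/8.课程总结.md: -------------------------------------------------------------------------------- 1 | ## Andrew Ng (吴恩达): LangChain for LLM Application Development, Summary 2 | 3 | ## LangChain for LLM Application Development 4 | 5 | This short course covered a range of applications of LangChain, including processing customer reviews, answering questions over documents, and using an LLM to decide when to call an external tool (such as a website) to answer complex questions. 6 | 7 | ### 👍🏻 LangChain Is That Powerful 8 | 9 | Building applications like these used to take weeks; now, with very little code, LangChain lets you build them efficiently. LangChain has become a powerful paradigm for developing LLM applications, and we hope you will embrace it and explore many more use cases. 10 | 11 | ### 🌈 Different Combinations -> More Possibilities 12 | 13 | What else can LangChain help with? Answering questions over CSV files, querying SQL databases, and interacting with APIs. Many such examples are implemented by combining chains with different prompts and output parsers. 14 | 15 | ### 💪🏻 Off You Go: Explore a New World~ 16 | 17 | Many thanks to everyone in the community who has contributed, whether by improving the documentation, making it easier for others to get started, or building new chains that open up a whole new world. 18 | 19 | If you have not done so yet, open your laptop, run pip install langchain, and go build something amazing with LangChain~ 20 | 21 | -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《LangChain for LLM Application Development》/Data.csv: -------------------------------------------------------------------------------- 1 | Product,Review 2 | Queen Size Sheet Set,"I ordered a king size set. My only criticism would be that I wish seller would offer the king size set with 4 pillowcases. I separately ordered a two pack of pillowcases so I could have a total of four. When I saw the two packages, it looked like the color did not exactly match. Customer service was excellent about sending me two more pillowcases so I would have four that matched. Excellent! For the cost of these sheets, I am satisfied with the characteristics and coolness of the sheets." 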
3 | Waterproof Phone Pouch,"I loved the waterproof sac, although the opening was made of a hard plastic. I don’t know if that would break easily. But I couldn’t turn my phone on, once it was in the pouch." 4 | Luxury Air Mattress,"This mattress had a small hole in the top of it (took forever to find where it was), and the patches that they provide did not work, maybe because it's the top of the mattress where it's kind of like fabric and a patch won't stick. Maybe I got unlucky with a defective mattress, but where's quality assurance for this company? That flat out should not happen. Emphasis on flat. Cause that's what the mattress was. Seriously horrible experience, ruined my friend's stay with me. Then they make you ship it back instead of just providing a refund, which is also super annoying to pack up an air mattress and take it to the UPS store. This company is the worst, and this mattress is the worst." 5 | Pillows Insert,"This is the best throw pillow fillers on Amazon. I’ve tried several others, and they’re all cheap and flat no matter how much fluffing you do. Once you toss these in the dryer after you remove them from the vacuum sealed shipping material, they fluff up great" 6 | "Milk Frother Handheld 7 | "," I loved this product. But they only seem to last a few months. The company was great replacing the first one (the frother falls out of the handle and can't be fixed). The after 4 months my second one did the same. I only use the frother for coffee once a day. It's not overuse or abuse. I'm very disappointed and will look for another. As I understand they will only replace once. Anyway, if you have one good luck." 8 | "L'Or Espresso Café  9 | ","Je trouve le goût médiocre. La mousse ne tient pas, c'est bizarre. J'achète les mêmes dans le commerce et le goût est bien meilleur... 10 | Vieux lot ou contrefaçon !?" 
11 | Hervidor de Agua Eléctrico,"Está lu bonita calienta muy rápido, es muy funcional, solo falta ver cuánto dura, solo llevo 3 días en funcionamiento." -------------------------------------------------------------------------------- /99~参考资料/2023~吴恩达~《LangChain for LLM Application Development》/readme.md: -------------------------------------------------------------------------------- 1 | # Building LLM-Based Applications with LangChain 2 | 3 | A new course on LLM development from Andrew Ng (吴恩达). It shows developers how to combine the LangChain framework with the ChatGPT API to build LLM-based applications, and teaches techniques for using LangChain, including: models, prompts, and parsers; the memory an application needs; building chains; question answering over documents; evaluation; and agents. 4 | 5 | ### Contents 6 | 1. Introduction @Sarai 7 | 2. Models, Prompts and Output Parsers @Joye 8 | 3. Memory @徐虎 9 | 4. Chains @徐虎 10 | 5. Question and Answer over Documents @苟晓攀 11 | 6. Evaluation @苟晓攀 12 | 7. Agents @Joye 13 | 8. Conclusion @Sarai 14 | -------------------------------------------------------------------------------- /99~参考资料/2023~陆奇~我的大模型世界观.md: -------------------------------------------------------------------------------- 1 | # Lu Qi (陆奇) 2 | 3 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_00.png) 4 | 5 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_01.png) 6 | 7 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_02.png) 8 | 9 | 
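![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_03.png) 10 | 11 | This diagram shows the "trinity structural evolution model". Its point is that any complex system, whether a person, a company, a society, or even the digital systems produced by digitalization itself, is a complex system. The "trinity" consists of: 12 | 13 | - an "information" system (subsystem of information), which obtains information from the environment; 14 | - a "model" system (subsystem of model), which represents that information and uses it to reason and plan; 15 | - an "action" system (subsystem of action), through which we ultimately interact with the environment to achieve what people want to achieve. 16 | 17 | We can draw a simple conclusion. Most digital products and companies today, including Google, Microsoft, Alibaba, and ByteDance, are at bottom information-moving companies. Remember that everything we do, absolutely everything, including at most of the companies represented here, is moving information. Nothing more than that, you just move bytes. But that has been good enough to change the world. 18 | 19 | Back in 1995-1996, the PC internet brought an inflection point. I had just graduated from CMU (Carnegie Mellon University). A huge number of companies sprang up, among them a great company called Google. Why was there an inflection point? Why was there explosive growth? If we can explain that clearly, we can explain today's inflection point. 20 | 21 | The reason is that the marginal cost of obtaining information started to turn into a fixed cost. Remember that what changes society and industry is always structural change, and that structural change is often a major class of cost moving from marginal to fixed. For example, when I was studying at CMU and drove out of Pittsburgh, a map cost 3 US dollars; obtaining information was expensive. Today a map still has a price, but it has become a fixed cost. Google pays about 1 billion US dollars a year on average to make a map, yet the cost for each user to get map information is essentially 0. In other words, when the cost of obtaining information goes to 0, it inevitably changes every industry. That is what happened over the past 20 years, and today there is essentially free information everywhere. 22 | 23 | The cost of models is now starting to move from marginal to fixed. Large models are the technical core and the foundation for industrialization. OpenAI has put this in place, and the pace of development will climb quickly. Models matter so much, and this inflection point matters so much, because models are intrinsically connected to people. Each of us is a combination of models. People have three kinds of models: 24 | 25 | - cognitive models: we can see, hear, think, and plan; 26 | - task models: we can climb stairs, move chairs, and peel eggs; 27 | - domain models: some of us are doctors, some are lawyers, and some are programmers. 28 | 29 | The four elements of general intelligence are: emergence + agency + affordance + embodiment. 30 | 31 | The only way to make natural language work is you have knowledge. The Transformer happens to compress that much knowledge together, and that is its biggest breakthrough. 32 | 33 | ![The gold-rush era of large models: a structural breakdown of the opportunities](https://assets.ng-tech.icu/item/20230424142818.png) 34 | 35 | This diagram covers all technology-driven entrepreneurship and innovation; every opportunity is somewhere on it. 36 | 37 | - First, the bottom layer is digital technology, because digitalization is an extension of people. The foundation of digitalization includes platforms and development infrastructure, including open-source code, open-source designs, and open-source data; platforms have front ends, back ends, and so on. There are many opportunities here. 38 | - Second, digital capabilities are used to meet human needs. We place digital applications in full on this chart. 39 | 40 | 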
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_04.png) 41 | 42 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_05.png) 43 | 44 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_06.png) 45 | 46 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_07.png) 47 | 48 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_08.png) 49 | 50 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_09.png) 51 | 52 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_10.png) 53 | 54 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_11.png) 55 | 56 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_12.png) 57 | 58 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_13.png) 59 | 60 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_14.png) 61 | 62 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_15.png) 63 | 64 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_16.png) 65 | 66 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_17.png) 67 | 68 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_18.png) 69 | 70 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_19.png) 71 | 72 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_20.png) 73 | 74 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_21.png) 75 | 76 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_22.png) 77 | 78 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_23.png) 79 | 80 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_24.png) 81 | 82 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_25.png) 83 | 84 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_26.png) 85 | 86 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_27.png) 87 | 88 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_28.png) 89 | 90 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_29.png) 91 | 92 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_30.png) 93 | 94 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_31.png) 95 | 96 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_32.png) 97 | 98 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_33.png) 99 | 100 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_34.png) 101 | 102 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_35.png) 103 | 104 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_36.png) 105 | 106 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_37.png) 107 | 108 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_38.png) 109 | 110 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_39.png) 111 | 112 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_40.png) 113 | 114 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_41.png) 115 | 116 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_42.png) 117 | 118 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_43.png) 119 | 120 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_44.png) 121 | 122 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_45.png) 123 | 124 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_46.png) 125 | 126 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_47.png) 127 | 128 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_48.png) 129 | 130 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_49.png) 131 | 132 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_50.png) 133 | 134 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_51.png) 135 | 136 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_52.png) 137 | 138 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_53.png) 139 | 140 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_54.png) 141 | 142 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_55.png) 143 | 144 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_56.png) 145 | 146 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_57.png) 147 | 148 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_58.png) 149 | 150 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_59.png) 151 | 152 | 
![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_60.png) 153 | 154 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_61.png) 155 | 156 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_62.png) 157 | 158 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_63.png) 159 | 160 | ![](https://assets.ng-tech.icu/pdf/2023-%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2/%E9%99%86%E5%A5%87%E6%B7%B1%E5%9C%B3%E6%BC%94%E8%AE%B2%EF%BC%882023%E5%B9%B44%E6%9C%8823%E6%97%A5%EF%BC%89-%E7%9C%9F%E6%AD%A3%E5%AE%8C%E6%95%B4%E7%89%88_64.png) 161 | -------------------------------------------------------------------------------- /INTRODUCTION.md: -------------------------------------------------------------------------------- 1 | # 本篇导读 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International 2 | Public License 3 | 4 | By exercising the Licensed Rights (defined below), You accept and agree 5 | to be bound by the terms and conditions of this Creative Commons 6 | Attribution-NonCommercial-ShareAlike 4.0 
International Public License 7 | ("Public License"). To the extent this Public License may be 8 | interpreted as a contract, You are granted the Licensed Rights in 9 | consideration of Your acceptance of these terms and conditions, and the 10 | Licensor grants You such rights in consideration of benefits the 11 | Licensor receives from making the Licensed Material available under 12 | these terms and conditions. 13 | 14 | 15 | Section 1 -- Definitions. 16 | 17 | a. Adapted Material means material subject to Copyright and Similar 18 | Rights that is derived from or based upon the Licensed Material 19 | and in which the Licensed Material is translated, altered, 20 | arranged, transformed, or otherwise modified in a manner requiring 21 | permission under the Copyright and Similar Rights held by the 22 | Licensor. For purposes of this Public License, where the Licensed 23 | Material is a musical work, performance, or sound recording, 24 | Adapted Material is always produced where the Licensed Material is 25 | synched in timed relation with a moving image. 26 | 27 | b. Adapter's License means the license You apply to Your Copyright 28 | and Similar Rights in Your contributions to Adapted Material in 29 | accordance with the terms and conditions of this Public License. 30 | 31 | c. BY-NC-SA Compatible License means a license listed at 32 | creativecommons.org/compatiblelicenses, approved by Creative 33 | Commons as essentially the equivalent of this Public License. 34 | 35 | d. Copyright and Similar Rights means copyright and/or similar rights 36 | closely related to copyright including, without limitation, 37 | performance, broadcast, sound recording, and Sui Generis Database 38 | Rights, without regard to how the rights are labeled or 39 | categorized. For purposes of this Public License, the rights 40 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 41 | Rights. 42 | 43 | e. 
Effective Technological Measures means those measures that, in the 44 | absence of proper authority, may not be circumvented under laws 45 | fulfilling obligations under Article 11 of the WIPO Copyright 46 | Treaty adopted on December 20, 1996, and/or similar international 47 | agreements. 48 | 49 | f. Exceptions and Limitations means fair use, fair dealing, and/or 50 | any other exception or limitation to Copyright and Similar Rights 51 | that applies to Your use of the Licensed Material. 52 | 53 | g. License Elements means the license attributes listed in the name 54 | of a Creative Commons Public License. The License Elements of this 55 | Public License are Attribution, NonCommercial, and ShareAlike. 56 | 57 | h. Licensed Material means the artistic or literary work, database, 58 | or other material to which the Licensor applied this Public 59 | License. 60 | 61 | i. Licensed Rights means the rights granted to You subject to the 62 | terms and conditions of this Public License, which are limited to 63 | all Copyright and Similar Rights that apply to Your use of the 64 | Licensed Material and that the Licensor has authority to license. 65 | 66 | j. Licensor means the individual(s) or entity(ies) granting rights 67 | under this Public License. 68 | 69 | k. NonCommercial means not primarily intended for or directed towards 70 | commercial advantage or monetary compensation. For purposes of 71 | this Public License, the exchange of the Licensed Material for 72 | other material subject to Copyright and Similar Rights by digital 73 | file-sharing or similar means is NonCommercial provided there is 74 | no payment of monetary compensation in connection with the 75 | exchange. 76 | 77 | l. 
Share means to provide material to the public by any means or 78 | process that requires permission under the Licensed Rights, such 79 | as reproduction, public display, public performance, distribution, 80 | dissemination, communication, or importation, and to make material 81 | available to the public including in ways that members of the 82 | public may access the material from a place and at a time 83 | individually chosen by them. 84 | 85 | m. Sui Generis Database Rights means rights other than copyright 86 | resulting from Directive 96/9/EC of the European Parliament and of 87 | the Council of 11 March 1996 on the legal protection of databases, 88 | as amended and/or succeeded, as well as other essentially 89 | equivalent rights anywhere in the world. 90 | 91 | n. You means the individual or entity exercising the Licensed Rights 92 | under this Public License. Your has a corresponding meaning. 93 | 94 | 95 | Section 2 -- Scope. 96 | 97 | a. License grant. 98 | 99 | 1. Subject to the terms and conditions of this Public License, 100 | the Licensor hereby grants You a worldwide, royalty-free, 101 | non-sublicensable, non-exclusive, irrevocable license to 102 | exercise the Licensed Rights in the Licensed Material to: 103 | 104 | a. reproduce and Share the Licensed Material, in whole or 105 | in part, for NonCommercial purposes only; and 106 | 107 | b. produce, reproduce, and Share Adapted Material for 108 | NonCommercial purposes only. 109 | 110 | 2. Exceptions and Limitations. For the avoidance of doubt, where 111 | Exceptions and Limitations apply to Your use, this Public 112 | License does not apply, and You do not need to comply with 113 | its terms and conditions. 114 | 115 | 3. Term. The term of this Public License is specified in Section 116 | 6(a). 117 | 118 | 4. Media and formats; technical modifications allowed. 
The 119 | Licensor authorizes You to exercise the Licensed Rights in 120 | all media and formats whether now known or hereafter created, 121 | and to make technical modifications necessary to do so. The 122 | Licensor waives and/or agrees not to assert any right or 123 | authority to forbid You from making technical modifications 124 | necessary to exercise the Licensed Rights, including 125 | technical modifications necessary to circumvent Effective 126 | Technological Measures. For purposes of this Public License, 127 | simply making modifications authorized by this Section 2(a) 128 | (4) never produces Adapted Material. 129 | 130 | 5. Downstream recipients. 131 | 132 | a. Offer from the Licensor -- Licensed Material. Every 133 | recipient of the Licensed Material automatically 134 | receives an offer from the Licensor to exercise the 135 | Licensed Rights under the terms and conditions of this 136 | Public License. 137 | 138 | b. Additional offer from the Licensor -- Adapted Material. 139 | Every recipient of Adapted Material from You 140 | automatically receives an offer from the Licensor to 141 | exercise the Licensed Rights in the Adapted Material 142 | under the conditions of the Adapter's License You apply. 143 | 144 | c. No downstream restrictions. You may not offer or impose 145 | any additional or different terms or conditions on, or 146 | apply any Effective Technological Measures to, the 147 | Licensed Material if doing so restricts exercise of the 148 | Licensed Rights by any recipient of the Licensed 149 | Material. 150 | 151 | 6. No endorsement. Nothing in this Public License constitutes or 152 | may be construed as permission to assert or imply that You 153 | are, or that Your use of the Licensed Material is, connected 154 | with, or sponsored, endorsed, or granted official status by, 155 | the Licensor or others designated to receive attribution as 156 | provided in Section 3(a)(1)(A)(i). 157 | 158 | b. Other rights. 159 | 160 | 1. 
Moral rights, such as the right of integrity, are not 161 | licensed under this Public License, nor are publicity, 162 | privacy, and/or other similar personality rights; however, to 163 | the extent possible, the Licensor waives and/or agrees not to 164 | assert any such rights held by the Licensor to the limited 165 | extent necessary to allow You to exercise the Licensed 166 | Rights, but not otherwise. 167 | 168 | 2. Patent and trademark rights are not licensed under this 169 | Public License. 170 | 171 | 3. To the extent possible, the Licensor waives any right to 172 | collect royalties from You for the exercise of the Licensed 173 | Rights, whether directly or through a collecting society 174 | under any voluntary or waivable statutory or compulsory 175 | licensing scheme. In all other cases the Licensor expressly 176 | reserves any right to collect such royalties, including when 177 | the Licensed Material is used other than for NonCommercial 178 | purposes. 179 | 180 | 181 | Section 3 -- License Conditions. 182 | 183 | Your exercise of the Licensed Rights is expressly made subject to the 184 | following conditions. 185 | 186 | a. Attribution. 187 | 188 | 1. If You Share the Licensed Material (including in modified 189 | form), You must: 190 | 191 | a. retain the following if it is supplied by the Licensor 192 | with the Licensed Material: 193 | 194 | i. identification of the creator(s) of the Licensed 195 | Material and any others designated to receive 196 | attribution, in any reasonable manner requested by 197 | the Licensor (including by pseudonym if 198 | designated); 199 | 200 | ii. a copyright notice; 201 | 202 | iii. a notice that refers to this Public License; 203 | 204 | iv. a notice that refers to the disclaimer of 205 | warranties; 206 | 207 | v. a URI or hyperlink to the Licensed Material to the 208 | extent reasonably practicable; 209 | 210 | b. 
indicate if You modified the Licensed Material and 211 | retain an indication of any previous modifications; and 212 | 213 | c. indicate the Licensed Material is licensed under this 214 | Public License, and include the text of, or the URI or 215 | hyperlink to, this Public License. 216 | 217 | 2. You may satisfy the conditions in Section 3(a)(1) in any 218 | reasonable manner based on the medium, means, and context in 219 | which You Share the Licensed Material. For example, it may be 220 | reasonable to satisfy the conditions by providing a URI or 221 | hyperlink to a resource that includes the required 222 | information. 223 | 3. If requested by the Licensor, You must remove any of the 224 | information required by Section 3(a)(1)(A) to the extent 225 | reasonably practicable. 226 | 227 | b. ShareAlike. 228 | 229 | In addition to the conditions in Section 3(a), if You Share 230 | Adapted Material You produce, the following conditions also apply. 231 | 232 | 1. The Adapter's License You apply must be a Creative Commons 233 | license with the same License Elements, this version or 234 | later, or a BY-NC-SA Compatible License. 235 | 236 | 2. You must include the text of, or the URI or hyperlink to, the 237 | Adapter's License You apply. You may satisfy this condition 238 | in any reasonable manner based on the medium, means, and 239 | context in which You Share Adapted Material. 240 | 241 | 3. You may not offer or impose any additional or different terms 242 | or conditions on, or apply any Effective Technological 243 | Measures to, Adapted Material that restrict exercise of the 244 | rights granted under the Adapter's License You apply. 245 | 246 | 247 | Section 4 -- Sui Generis Database Rights. 248 | 249 | Where the Licensed Rights include Sui Generis Database Rights that 250 | apply to Your use of the Licensed Material: 251 | 252 | a. 
for the avoidance of doubt, Section 2(a)(1) grants You the right 253 | to extract, reuse, reproduce, and Share all or a substantial 254 | portion of the contents of the database for NonCommercial purposes 255 | only; 256 | 257 | b. if You include all or a substantial portion of the database 258 | contents in a database in which You have Sui Generis Database 259 | Rights, then the database in which You have Sui Generis Database 260 | Rights (but not its individual contents) is Adapted Material, 261 | including for purposes of Section 3(b); and 262 | 263 | c. You must comply with the conditions in Section 3(a) if You Share 264 | all or a substantial portion of the contents of the database. 265 | 266 | For the avoidance of doubt, this Section 4 supplements and does not 267 | replace Your obligations under this Public License where the Licensed 268 | Rights include other Copyright and Similar Rights. 269 | 270 | 271 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 272 | 273 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 274 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 275 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 276 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 277 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 278 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 279 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 280 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 281 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 282 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 283 | 284 | b. 
TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 285 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 286 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 287 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 288 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 289 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 290 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 291 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 292 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 293 | 294 | c. The disclaimer of warranties and limitation of liability provided 295 | above shall be interpreted in a manner that, to the extent 296 | possible, most closely approximates an absolute disclaimer and 297 | waiver of all liability. 298 | 299 | 300 | Section 6 -- Term and Termination. 301 | 302 | a. This Public License applies for the term of the Copyright and 303 | Similar Rights licensed here. However, if You fail to comply with 304 | this Public License, then Your rights under this Public License 305 | terminate automatically. 306 | 307 | b. Where Your right to use the Licensed Material has terminated under 308 | Section 6(a), it reinstates: 309 | 310 | 1. automatically as of the date the violation is cured, provided 311 | it is cured within 30 days of Your discovery of the 312 | violation; or 313 | 314 | 2. upon express reinstatement by the Licensor. 315 | 316 | For the avoidance of doubt, this Section 6(b) does not affect any 317 | right the Licensor may have to seek remedies for Your violations 318 | of this Public License. 319 | 320 | c. For the avoidance of doubt, the Licensor may also offer the 321 | Licensed Material under separate terms or conditions or stop 322 | distributing the Licensed Material at any time; however, doing so 323 | will not terminate this Public License. 324 | 325 | d. 
Sections 1, 5, 6, 7, and 8 survive termination of this Public 326 | License. 327 | 328 | 329 | Section 7 -- Other Terms and Conditions. 330 | 331 | a. The Licensor shall not be bound by any additional or different 332 | terms or conditions communicated by You unless expressly agreed. 333 | 334 | b. Any arrangements, understandings, or agreements regarding the 335 | Licensed Material not stated herein are separate from and 336 | independent of the terms and conditions of this Public License. 337 | 338 | 339 | Section 8 -- Interpretation. 340 | 341 | a. For the avoidance of doubt, this Public License does not, and 342 | shall not be interpreted to, reduce, limit, restrict, or impose 343 | conditions on any use of the Licensed Material that could lawfully 344 | be made without permission under this Public License. 345 | 346 | b. To the extent possible, if any provision of this Public License is 347 | deemed unenforceable, it shall be automatically reformed to the 348 | minimum extent necessary to make it enforceable. If the provision 349 | cannot be reformed, it shall be severed from this Public License 350 | without affecting the enforceability of the remaining terms and 351 | conditions. 352 | 353 | c. No term or condition of this Public License will be waived and no 354 | failure to comply consented to unless expressly agreed to by the 355 | Licensor. 356 | 357 | d. Nothing in this Public License constitutes or may be interpreted 358 | as a limitation upon, or waiver of, any privileges and immunities 359 | that apply to the Licensor or You, including from the legal 360 | processes of any jurisdiction or authority. 
361 | -------------------------------------------------------------------------------- /LLM/README.link: -------------------------------------------------------------------------------- 1 | https://github.com/wx-chevalier/LLM-Notes -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Contributors][contributors-shield]][contributors-url] 2 | [![Forks][forks-shield]][forks-url] 3 | [![Stargazers][stars-shield]][stars-url] 4 | [![Issues][issues-shield]][issues-url] 5 | [![license: CC BY-NC-SA 4.0](https://img.shields.io/badge/license-CC%20BY--NC--SA%204.0-lightgrey.svg)][license-url] 6 | 7 | 8 |
9 |

10 | 11 | Logo 12 | 13 | 14 |

15 | Read online >> 16 |
17 |
18 | Code examples 19 | · 20 | References 21 | 22 |

23 |

24 | 25 | 26 | 27 | ![](http://nebula.wsimg.com/9231017c407c70957eb3f708365e7a49?AccessKeyId=05106B70AA8440180999&disposition=0&alloworigin=1) 28 | 29 | # An Accessible, In-Depth Guide to Python Machine Learning and Natural Language Processing 30 | 31 | Over the past several decades, NLP technology has moved from systems built on syntactic and semantic rules (1970s-1990s) to frameworks based on statistical machine learning (2000s-2014), and then to the current paradigm built on big data and deep learning (2014 to present). 32 | 33 | ![NLP landscape overview](https://assets.ng-tech.icu/item/20230224144750.png) 34 | 35 | ![Common NLP tasks](https://assets.ng-tech.icu/item/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%20%E4%BB%BB%E5%8A%A1%E5%88%86%E7%B1%BB.png) 36 | 37 | # Nav | Related Navigation 38 | 39 | # About 40 | 41 | 42 | 43 | ## Contributing 44 | 45 | Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**. 46 | 47 | 1. Fork the Project 48 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 49 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 50 | 4. Push to the Branch (`git push origin feature/AmazingFeature`) 51 | 5. Open a Pull Request 52 | 53 | 54 | 55 | ## Acknowledgements 56 | 57 | - [Awesome-Lists](https://github.com/wx-chevalier/Awesome-Lists): 📚 Guide to Galaxy, curated, worthy and up-to-date links/reading list for ITCS-Coding/Algorithm/SoftwareArchitecture/AI. 💫 A curated selection of links to articles, books, resources, and projects in ITCS fields such as programming, algorithms, software architecture, and AI. 58 | 59 | - [Awesome-CS-Books](https://github.com/wx-chevalier/Awesome-CS-Books): :books: Awesome CS Books/Series(.pdf by git lfs) Warehouse for Geeks, ProgrammingLanguage, SoftwareEngineering, Web, AI, ServerSideApplication, Infrastructure, FE etc.
:dizzy: An archive of excellent books on computer science and technology. 60 | 61 | ## Copyright & More | Further Reading 62 | 63 | All of the author's articles are released under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/deed.zh); you are welcome to repost them, provided you respect the copyright. You can also visit the [NGTE Books](https://ng-tech.icu/books-gallery/) home page to browse book lists across many categories, including knowledge systems, programming languages, software engineering, patterns and architecture, Web and front end, server-side development practice and engineering architecture, distributed infrastructure, artificial intelligence and deep learning, and product operations and entrepreneurship: 64 | 65 | [![NGTE Books](https://s2.ax1x.com/2020/01/18/19uXtI.png)](https://ng-tech.icu/books-gallery/) 66 | 67 | 68 | 69 | 70 | [contributors-shield]: https://img.shields.io/github/contributors/wx-chevalier/NLP-Notes.svg?style=flat-square 71 | [contributors-url]: https://github.com/wx-chevalier/NLP-Notes/graphs/contributors 72 | [forks-shield]: https://img.shields.io/github/forks/wx-chevalier/NLP-Notes.svg?style=flat-square 73 | [forks-url]: https://github.com/wx-chevalier/NLP-Notes/network/members 74 | [stars-shield]: https://img.shields.io/github/stars/wx-chevalier/NLP-Notes.svg?style=flat-square 75 | [stars-url]: https://github.com/wx-chevalier/NLP-Notes/stargazers 76 | [issues-shield]: https://img.shields.io/github/issues/wx-chevalier/NLP-Notes.svg?style=flat-square 77 | [issues-url]: https://github.com/wx-chevalier/NLP-Notes/issues 78 | [license-shield]: https://img.shields.io/github/license/wx-chevalier/NLP-Notes.svg?style=flat-square 79 | [license-url]: https://github.com/wx-chevalier/NLP-Notes/blob/master/LICENSE.txt 80 | -------------------------------------------------------------------------------- /_sidebar.md: -------------------------------------------------------------------------------- 1 | - [1 00~导论](/00~导论/README.md) 2 | 3 | - 2 99~参考资料 [5] 4 | - [2.1 Numbers every LLM Developer should know](/99~参考资料/2023-Numbers%20every%20LLM%20Developer%20should%20know.md) 5 | - 2.2 吴恩达 《Building Systems with the ChatGPT API》 [3] 6 | - [2.2.1 1.Introduction](/99~参考资料/2023-吴恩达-《Building%20Systems%20with%20the%20ChatGPT%20API》/1.Introduction.md) 7 | - [2.2.2 
11.conclusion](/99~参考资料/2023-吴恩达-《Building%20Systems%20with%20the%20ChatGPT%20API》/11.conclusion.md) 8 | - [2.2.3 readme](/99~参考资料/2023-吴恩达-《Building%20Systems%20with%20the%20ChatGPT%20API》/readme.md) 9 | - 2.3 吴恩达 《ChatGPT Prompt Engineering for Developers》 [2] 10 | - [2.3.1 00.README](/99~参考资料/2023-吴恩达-《ChatGPT%20Prompt%20Engineering%20for%20Developers》/00.README.md) 11 | - [2.3.2 01. 简介](/99~参考资料/2023-吴恩达-《ChatGPT%20Prompt%20Engineering%20for%20Developers》/01.%20简介.md) 12 | - [2.3.3 09. 总结](/99~参考资料/2023-吴恩达-《ChatGPT%20Prompt%20Engineering%20for%20Developers》/09.%20总结.md) 13 | - 2.4 吴恩达 《LangChain for LLM Application Development》 [3] 14 | - [2.4.1 1.开篇介绍](/99~参考资料/2023-吴恩达-《LangChain%20for%20LLM%20Application%20Development》/1.开篇介绍.md) 15 | - [2.4.2 8.课程总结](/99~参考资料/2023-吴恩达-《LangChain%20for%20LLM%20Application%20Development》/8.课程总结.md) 16 | - [2.4.3 readme](/99~参考资料/2023-吴恩达-《LangChain%20for%20LLM%20Application%20Development》/readme.md) 17 | - [2.5 陆奇 我的大模型世界观](/99~参考资料/2023-陆奇-我的大模型世界观.md) 18 | - [3 INTRODUCTION](/INTRODUCTION.md) 19 | - [4 LLM [7]](/LLM/README.md) 20 | - 4.1 99~参考资料 [3] 21 | - [4.1.1 2023~Ben Clarkson~Building an LLM from scratch](/LLM/99~参考资料/2023~Ben%20Clarkson~Building%20an%20LLM%20from%20scratch/README.md) 22 | 23 | - [4.1.2 2023~赵鑫~大语言模型综述 [2]](/LLM/99~参考资料/2023~赵鑫~大语言模型综述/README.md) 24 | - [4.1.2.1 01~引言](/LLM/99~参考资料/2023~赵鑫~大语言模型综述/01~引言.md) 25 | - [4.1.2.2 09~参考](/LLM/99~参考资料/2023~赵鑫~大语言模型综述/09~参考.md) 26 | - [4.1.3 cohere~LLM University [1]](/LLM/99~参考资料/cohere~LLM%20University/README.md) 27 | - [4.1.3.1 01~What are Large Language Models? 
[1]](/LLM/99~参考资料/cohere~LLM%20University/01~What%20are%20Large%20Language%20Models?/README.md) 28 | - [4.1.3.1.1 01.Text Embeddings](/LLM/99~参考资料/cohere~LLM%20University/01~What%20are%20Large%20Language%20Models?/01.Text%20Embeddings.md) 29 | - 4.2 Agent [1] 30 | - 4.2.1 99~参考资料 [1] 31 | - [4.2.1.1 2023~LLM Agent Survey](/LLM/Agent/99~参考资料/2023~LLM%20Agent%20Survey.md) 32 | - 4.3 GPT [1] 33 | - 4.3.1 ChatGPT [1] 34 | - 4.3.1.1 99~参考资料 [1] 35 | - [4.3.1.1.1 GPT 4 大模型硬核解读](/LLM/GPT/ChatGPT/99~参考资料/2023-GPT-4%20大模型硬核解读.md) 36 | - 4.4 LangChain [1] 37 | - 4.4.1 99~参考资料 [2] 38 | - [4.4.1.1 Hacking LangChain For Fun and Profit](/LLM/LangChain/99~参考资料/2023-Hacking%20LangChain%20For%20Fun%20and%20Profit.md) 39 | - [4.4.1.2 LangChain 中文入门教程](/LLM/LangChain/99~参考资料/2023-LangChain%20中文入门教程.md) 40 | - 4.5 代码生成 [1] 41 | - 4.5.1 99~参考资料 [2] 42 | - [4.5.1.1 An example of LLM prompting for programming](/LLM/代码生成/99~参考资料/2023-An%20example%20of%20LLM%20prompting%20for%20programming.md) 43 | - [4.5.1.2 花了大半个月,我终于逆向分析了 Github Copilot](/LLM/代码生成/99~参考资料/2023-花了大半个月,我终于逆向分析了%20Github%20Copilot.md) 44 | - 4.6 语言模型微调 [2] 45 | - 4.6.1 99~参考资料 [2] 46 | - [4.6.1.1 Finetuning Large Language Models](/LLM/语言模型微调/99~参考资料/2023-Finetuning%20Large%20Language%20Models.md) 47 | - [4.6.1.2 Prompt Tuning:深度解读一种新的微调范式](/LLM/语言模型微调/99~参考资料/2023-Prompt-Tuning:深度解读一种新的微调范式.md) 48 | - 4.6.2 LoRA [1] 49 | - 4.6.2.1 99~参考资料 [1] 50 | - [4.6.2.1.1 2023~LoRA From Scratch – Implement Low Rank Adaptation for LLMs in PyTorch](/LLM/语言模型微调/LoRA/99~参考资料/2023~LoRA%20From%20Scratch%20–%20Implement%20Low-Rank%20Adaptation%20for%20LLMs%20in%20PyTorch.md) 51 | - 4.7 预训练语言模型 [2] 52 | - [4.7.1 BERT [2]](/LLM/预训练语言模型/BERT/README.md) 53 | - [4.7.1.1 目标函数](/LLM/预训练语言模型/BERT/目标函数.md) 54 | - [4.7.1.2 输入表示](/LLM/预训练语言模型/BERT/输入表示.md) 55 | - [4.7.2 Transformer [1]](/LLM/预训练语言模型/Transformer/README.md) 56 | - 4.7.2.1 99~参考资料 [7] 57 | - [4.7.2.1.1 NLP 中的 RNN、Seq2Seq 与 Attention 
注意力机制](/LLM/预训练语言模型/Transformer/99~参考资料/2019-NLP%20中的%20RNN、Seq2Seq%20与%20Attention%20注意力机制.md) 58 | - [4.7.2.1.2 举个例子讲下 Transformer 的输入输出细节及其他](/LLM/预训练语言模型/Transformer/99~参考资料/2020-举个例子讲下%20Transformer%20的输入输出细节及其他.md) 59 | - [4.7.2.1.3 完全解析 RNN, Seq2Seq, Attention 注意力机制](/LLM/预训练语言模型/Transformer/99~参考资料/2020-完全解析%20RNN,%20Seq2Seq,%20Attention%20注意力机制.md) 60 | - [4.7.2.1.4 Transformer 模型详解(图解最完整版)](/LLM/预训练语言模型/Transformer/99~参考资料/2021-Transformer%20模型详解(图解最完整版).md) 61 | - [4.7.2.1.5 王嘉宁 【预训练语言模型】Attention Is All You Need(Transformer)](/LLM/预训练语言模型/Transformer/99~参考资料/2021-王嘉宁-【预训练语言模型】Attention%20Is%20All%20You%20Need(Transformer).md) 62 | - [4.7.2.1.6 超详细图解 Self Attention](/LLM/预训练语言模型/Transformer/99~参考资料/2021-超详细图解%20Self-Attention.md) 63 | - [4.7.2.1.7 Transformers from Scratch](/LLM/预训练语言模型/Transformer/99~参考资料/2023-Transformers%20from%20Scratch.md) 64 | - [5 循环神经网络](/循环神经网络/README.md) 65 | 66 | - 6 经典自然语言 [4] 67 | - 6.1 主题模型 [1] 68 | - [6.1.1 LDA](/经典自然语言/主题模型/LDA.md) 69 | - 6.2 统计语言模型 [4] 70 | - [6.2.1 Word2Vec](/经典自然语言/统计语言模型/Word2Vec.md) 71 | - [6.2.2 基础文本处理](/经典自然语言/统计语言模型/基础文本处理.md) 72 | - [6.2.3 统计语言模型](/经典自然语言/统计语言模型/统计语言模型.md) 73 | - [6.2.4 词表示](/经典自然语言/统计语言模型/词表示.md) 74 | - 6.3 词嵌入 [3] 75 | - 6.3.1 99~参考资料 [1] 76 | - [6.3.1.1 2023~Embeddings: What they are and why they matter](/经典自然语言/词嵌入/99~参考资料/2023~Embeddings:%20What%20they%20are%20and%20why%20they%20matter.md) 77 | - [6.3.2 概述](/经典自然语言/词嵌入/概述.md) 78 | - 6.3.3 词向量 [1] 79 | - [6.3.3.1 基于 Gensim 的 Word2Vec 实践](/经典自然语言/词嵌入/词向量/基于%20Gensim%20的%20Word2Vec%20实践.md) 80 | - 6.4 语法语义分析 [1] 81 | - [6.4.1 命名实体识别](/经典自然语言/语法语义分析/命名实体识别.md) 82 | - 7 行业应用 [2] 83 | - [7.1 机器人问答](/行业应用/机器人问答/README.md) 84 | 85 | - [7.2 聊天对话](/行业应用/聊天对话/README.md) 86 | -------------------------------------------------------------------------------- /header.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 |
[header.svg: SVG banner image. The only recoverable text is the title "AI Series by 王下邀月熊" and the subtitle "人工智能与深度学习实战" (Artificial Intelligence and Deep Learning in Practice).]
-------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | AIDL Series 7 | 8 | 9 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 34 | 38 | 40 | 45 | 46 |
47 | 64 | 97 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 143 | 144 | 145 | 146 | 155 | 156 | 157 | -------------------------------------------------------------------------------- /循环神经网络/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wx-chevalier/NLP-Notes/11ac6ed37b1c1e001b8d2139d218629a717eb625/循环神经网络/README.md -------------------------------------------------------------------------------- /经典自然语言/主题模型/LDA.md: -------------------------------------------------------------------------------- 1 | # Latent Dirichlet 2 | 3 | # Dirichlet Distribution & Dirichlet Process:狄利克雷分布于狄利克雷过程 4 | 5 | 作者:肉很多链接:https://www.zhihu.com/question/26751755/answer/80931791来源:知乎著作权归作者所有,转载请联系作者获得授权。要想易懂地理解dirichlet distribution,首先先得知道它的特殊版本 beta distribution 干了什么。而要理解 beta distribution 有什么用,还得了解 Bernoulli process。 6 | 7 | 首先先看**Bernoulli process**。要理解什么是 Bernoulli process,首先先看什么 Bernoulli trial。Bernoulli trial 简单地说就是一个只有两个结果的简单 trial,比如**\*抛硬币\***。 8 | 那我们就用**抛一个(不均匀)硬币**来说好了,X = 1 就是头,X = 0 就是字,我们设定 q 是抛出字的概率。 9 | 那什么是 bernoulli process?就是从 Bernoulli population 里随机抽样,或者说就是重复的独立 Bernoulli trials,再或者说就是狂抛这枚硬币 n 次记结果吧(汗=\_=)。好吧,我们就一直抛吧,我们记下 X=0 的次数 k. 10 | 11 | 现在问题来了。 12 | Q:**我们如何知道这枚硬币抛出字的概率?**我们知道,如果可以一直抛下去,最后 k/n 一定会趋近于 q;可是现实中有很多场合不允许我们总抛硬币,比如**我只允许你抛 4 次**。你该怎么回答这个问题?显然你在只抛 4 次的情况下,k/n 基本不靠谱;那你只能"**猜一下 q 大致分布在[0,1]中间的哪些值里会比较合理**",但绝不可能得到一个准确的结果比如 q 就是等于 k/n。 13 | 14 | 举个例子,比如:4 次抛掷出现“头头字字”,你肯定觉得 q 在 0.5 附近比较合理,q 在 0.2 和 0.8 附近的硬币抛出这个结果应该有点不太可能,q = 0.05 和 0.95 那是有点扯淡了。 15 | 你如果把这些值画出来,你会发现 q 在[0,1]区间内呈现的就是一个中间最高,两边低的情况。从感性上说,这样应当是比较符合常理的。 16 | 17 | 那我们如果有个什么工具能描述一下这个 q 可能的分布就好了,比如用一个概率密度函数来描述一下? 这当然可以,可是我们还需要注意另一个问题,那就是随着 n 增长观测变多,**你每次的概率密度函数该怎么计算**?该怎么利用以前的结果更新(这个在形式上和计算上都很重要)? 
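The intuition from the four-toss example above (two heads, then two tails) can be checked numerically. The snippet below is a minimal sketch that assumes the tosses are independent; the function name `likelihood` is introduced here for illustration and does not come from the original answer. It computes the probability of seeing that particular sequence for several candidate values of q.

```python
def likelihood(q, k, n):
    """Probability of one specific sequence with k tails in n independent
    tosses, when each toss lands tails with probability q."""
    return q ** k * (1 - q) ** (n - k)


k, n = 2, 4  # two tails observed in four tosses
for q in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(f"q = {q:4.2f}  likelihood = {likelihood(q, k, n):.6f}")
```

The values peak at q = 0.5 and fall off symmetrically towards 0 and 1, which matches the "high in the middle, low at both ends" shape described above.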
18 | 19 | 到这里,其实很自然地会想到把 bayes theorem 引进来,因为 Bayes 能随着不断的观测而更新概率;而且每次只需要前一次的 prior 等等…在这先不多说 bayes 有什么好,接下来用更形式化语言来讲其实说得更清楚。 20 | 21 | **我们现在用更正规的语言重新整理一下思路。**现在有个硬币得到 random sample X = (x1,x2,...xn),我们需要基于这 n 次观察的结果来估算一下**q 在[0,1]中取哪个值比较靠谱**,由于我们不能再用单一一个确定的值描述 q,所以我们用一个分布函数来描述:有关 q 的概率密度函数(说得再简单点,即是 q 在[0,1]“分布律”)。当然,这应当写成一个条件密度:f(q|X),因为我们总是观测到 X 的情况下,来猜的 q。 22 | 23 | 现在我们来看看 Bayes theorem,看看它能带来什么不同: 24 | ![P(q|x) P(x) = P(X=x|q)P(q)](//zhihu.com/equation?tex=P%28q%7Cx%29+P%28x%29+%3D+P%28X%3Dx%7Cq%29P%28q%29) 25 | 26 | 在这里 P(q)就是关于 q 的先验概率(所谓先验,就是在得到观察 X 之前,我们设定的关于 q 的概率密度函数)。P(q|x)是观测到 x 之后得到的关于 q 的后验概率。注意,到这里公式里出现的都是"概率",并没有在[0,1]上的概率密度函数出现。为了让贝叶斯定理和密度函数结合到一块。我们可以从方程两边由 P(q)得到 f(q),而由 P(q|x)得到 f(q|x)。 27 | 又注意到 P(x)可以认定为是个常量(Q:why?),可以在分析这类问题时不用管。**那么,这里就有个简单的结论——\*\***关于 q 的后验概率密度 f(q|x)就和“关于 q 的\***\*先验概率密度乘以一个条件概率"成比例,即:** 28 | ![f(q|x)\sim P(X=x|q)f(q)](//zhihu.com/equation?tex=f%28q%7Cx%29%5Csim+P%28X%3Dx%7Cq%29f%28q%29) 29 | 30 | 带着以上这个结论,我们再来看这个抛硬币问题: 31 | 连续抛 n 次,即为一个 bernoulli process,则在 q 确定时,n 次抛掷结果确定时,又观察得到 k 次字的概率可以描述为:![P(X=x|p) = q^{k}(1-q)^{n-k} ](//zhihu.com/equation?tex=P%28X%3Dx%7Cp%29+%3D+q%5E%7Bk%7D%281-q%29%5E%7Bn-k%7D+) 32 | 那么 f(q|x)就和先验概率密度乘以以上的条件概率是成比例的: 33 | ![f(q|x) \sim q^{k}(1-q)^{n-k}f(q) ](//zhihu.com/equation?tex=f%28q%7Cx%29+%5Csim+q%5E%7Bk%7D%281-q%29%5E%7Bn-k%7Df%28q%29+) 34 | 虽然我们不知道,也求不出那个 P(x),但我们知道它是固定的,我们这时其实已经得到了一个求 f(q|x)的公式(只要在 n 次观测下确定了,f(q)确定了,那么 f(q|x)也确定了)。 35 | 36 | 现在在来看 f(q)。显然,在我们对硬币一无所知的时候,我们应当认为硬币抛出字的概率 q 有可能在[0,1]上任意处取值。f(q)在这里取个均匀分布的密度函数是比较合适的,即 f(q) = 1 (for q in [0,1])。 37 | 有些同学可能发现了,这里面![f(q|x) \sim q^{k}(1-q)^{n-k}](//zhihu.com/equation?tex=f%28q%7Cx%29+%5Csim+q%5E%7Bk%7D%281-q%29%5E%7Bn-k%7D),**那个![q^{k}(1-q)^{n-k}](//zhihu.com/equation?tex=q%5E%7Bk%7D%281-q%29%5E%7Bn-k%7D)乘上[0,1]的均匀分布不就是一个 Beta distribution 么**? 
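One way to check this numerically, as a rough sketch: normalise q^k (1-q)^(n-k) on a grid over [0,1] and compare it with the density of Beta(k+1, n-k+1). The helper names `beta_pdf` and `grid_posterior` and the grid size are assumptions made for this illustration only.

```python
import math


def beta_pdf(q, a, b):
    """Density of Beta(a, b) at q, with B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return q ** (a - 1) * (1 - q) ** (b - 1) / norm


def grid_posterior(k, n, steps=10000):
    """Normalise q^k (1-q)^(n-k) (uniform prior) on a midpoint grid over [0, 1]."""
    width = 1.0 / steps
    grid = [(i + 0.5) * width for i in range(steps)]
    weights = [q ** k * (1 - q) ** (n - k) for q in grid]
    total = sum(weights) * width  # numerical stand-in for the constant P(x)
    return grid, [w / total for w in weights]


k, n = 2, 4
grid, density = grid_posterior(k, n)
max_gap = max(abs(d - beta_pdf(q, k + 1, n - k + 1)) for q, d in zip(grid, density))
print(max_gap)  # largest pointwise difference between the two curves
```

If the two curves agree up to numerical error, the normalised posterior under a uniform prior is indeed a Beta density.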
Yes, it is a Beta distribution. A Beta distribution is determined by two parameters, alpha and beta; here alpha equals k+1 and beta equals n+1-k. **The uniform prior density f(q) can itself be described by a Beta distribution**, with alpha = 1 and beta = 1.

Better still, each time we toss the coin once more we only need alpha = alpha + 1 if it lands tails and beta = beta + 1 if it lands heads. That directly gives the updated density f(q|x).

The computation is this simple because a prior described by a Beta distribution is still a Beta distribution after it passes through Bayes' formula. This property of staying inside the same family of functions is called **conjugacy (共轭)**.

By now it should be clear that repeated Bernoulli trials with two outcomes can be handled with a Beta prior. What if there are k outcomes instead, as when rolling a die?
In that case each trial has k possible outcomes, and the Bernoulli distribution becomes a multinomial distribution. The prior expressed by the Beta distribution must likewise be rewritten as a multi-outcome prior: the Dirichlet distribution.
The uniform prior Beta(1,1) becomes its k-outcome counterpart, Dir(alpha/K). The Dirichlet prior is also conjugate, so it is just as easy to compute with.
In short, we extrapolate from 2 outcomes to k, and the point of view does not change.
The two distributions have very similar forms.

**Conclusion 1: the Dirichlet distribution is the generalization to k outcomes of the Beta distribution derived from two-outcome Bernoulli trials.**

```py
from scipy.stats import dirichlet, poisson
from numpy.random import choice
from collections import defaultdict


num_documents = 5
num_topics = 2
topic_dirichlet_parameter = 1  # alpha: prior of the per-document topic distribution
term_dirichlet_parameter = 1  # beta: prior of the per-topic term distribution
vocabulary = ["see", "spot", "run"]
num_terms = len(vocabulary)
length_param = 10  # xi

term_distribution_by_topic = {}  # Phi
topic_distribution_by_document = {}  # Theta
document_length = {}
topic_index = defaultdict(list)
word_index = defaultdict(list)

# A list repeated num_terms / num_topics times gives a symmetric parameter vector
term_distribution = dirichlet(num_terms * [term_dirichlet_parameter])
topic_distribution = dirichlet(num_topics * [topic_dirichlet_parameter])

# Iterate over the topics
for topic in range(num_topics):
    # Sample the term distribution of this topic
    term_distribution_by_topic[topic] = term_distribution.rvs()[0]

# Iterate over the documents
for document in range(num_documents):
    # Sample the topic distribution of this document
    topic_distribution_by_document[document] = topic_distribution.rvs()[0]
    topic_distribution_param = topic_distribution_by_document[document]
    # Sample the document length from a Poisson distribution
    document_length[document] = poisson(length_param).rvs()

    # Iterate over every word position in the document
    for word in range(document_length[document]):
        topics = range(num_topics)
        # Sample the topic that generates this word
        topic = choice(topics, p=topic_distribution_param)
        topic_index[document].append(topic)
        # Sample the generated word from that topic's term distribution
        term_distribution_param = term_distribution_by_topic[topic]
        word_index[document].append(choice(vocabulary, p=term_distribution_param))
```

For readers who are still unsure, the following Python code may help:

```py
import numpy


def perplexity(self, docs=None):
    if docs is None: docs = self.docs
    # Matrix of word distributions over topics
    phi = self.worddist()
    log_per = 0
    N = 0
    Kalpha = self.K * self.alpha
    # Iterate over all documents in the corpus
    for m, doc in enumerate(docs):
        # n_m_z holds the per-topic word counts of each document;
        # theta is the resulting topic proportion of document m
        theta = self.n_m_z[m] / (len(self.docs[m]) + Kalpha)
        for w in doc:
            # numpy.inner(phi[:,w], theta) is the probability of word w
            # in this document, summed over topics
            log_per -= numpy.log(numpy.inner(phi[:,w], theta))
        N += len(doc)
    return numpy.exp(log_per / N)
```

# Introduction

> LDA has been widely used in textual analysis,

LDA is a standard bag-of-words model.

- [通俗理解 LDA 主题模型 (An Accessible Explanation of the LDA Topic Model)](http://blog.csdn.net/v_july_v/article/details/41209515)

The main ingredients of LDA are conjugate prior distributions, the Dirichlet distribution, and parameter learning with the Gibbs sampling algorithm. The inputs of LDA are the number of documents $M$, the number of distinct terms $V$, and the number of topics $K$.
![](http://7xlgth.com1.z0.glb.clouddn.com/5C724613-24AC-4782-B1DB-E890B87885FF.png)

# Mathematics

## Beta Distribution: the Basis of the Dirichlet Distribution

The probability density of the Beta distribution is:

$$
f(x) =
\left \{
\begin{aligned}
\frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}, & \quad x \in (0,1) \\
0, & \quad \text{otherwise}
\end{aligned}
\right.
$$

where $$B(\alpha,\beta) = \int_0^1 x^{\alpha - 1}(1-x)^{\beta-1}dx=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
and the Gamma function can be seen as the extension of the factorial to the real numbers:

$$
\Gamma(x) = \int_0^{\infty}t^{x-1}e^{-t}dt \Rightarrow \Gamma(n) = (n-1)! \Rightarrow B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}
$$

The expectation of the Beta distribution is:
$$E(X) = \frac{\alpha}{\alpha + \beta}$$

## Dirichlet Distribution: the Conjugate Prior of the Multinomial Distribution

The Dirichlet distribution is obtained by relabelling the Beta distribution as

$$
\alpha = \alpha_1 ,
\beta = \alpha_2 ,
x = x_1 ,
1 - x = x_2
$$

and extending from 2 components to $K$:

$$
f(\vec{p} | \vec{\alpha}) = \left \{
\begin{aligned}
\frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1} , & \quad p_k \in (0,1) \\
0, & \quad \text{otherwise}
\end{aligned}
\right.
$$

This can be abbreviated as:

$$
Dir(\vec{p} | \vec{\alpha}) = \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1}
$$

where

$$
\Delta(\vec{\alpha}) = \frac{ \prod_{k=1}^K \Gamma(\alpha_k)}{ \Gamma(\sum_{k=1}^{K}\alpha_k)}
$$

For a given $\vec{\alpha}$ this normalising term can be computed directly. Suppose a document has 50 topics; then $\vec{\alpha}$ is a 50-dimensional vector. Without any prior knowledge, the most convenient and safest initialisation is to set all 50 values to the same number.

### Symmetric Dirichlet Distribution

With a symmetric Dirichlet distribution all entries of the parameter vector are equal, so the formula becomes:

$$
Dir(\vec{p} | \alpha,K) = \frac{1}{\Delta_K(\alpha)} \prod_{k=1}^{K} p_k^{\alpha - 1} \\
\Delta_K(\alpha) = \frac{\Gamma^K(\alpha)}{ \Gamma(K \alpha)}
$$

The value of $\alpha$ controls the shape. When $\alpha=1$ the distribution degenerates to the uniform distribution. When $\alpha>1$, probability mass concentrates near $p_1 = p_2 = \dots = p_K$. When $\alpha<1$, mass concentrates near the corners, where some $p_i = 1$ and all other $p_j = 0$. In terms of document classification, a smaller $\alpha$ means a larger spread between topics, while a larger $\alpha$ means the topics in a document have more nearly equal probabilities.

In practice, $1/K$ is commonly chosen as the initial value of $\alpha$.

# Model Explanation

![](http://7xlgth.com1.z0.glb.clouddn.com/D73D69FE-BA28-4E66-871F-B594B4BEFC29.png)
The arrows in the figure above denote conditional dependence.

## Terminology

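- The dictionary contains $V$ distinct terms. When a term occurs in a concrete document it is a word; words within a document can of course repeat.
- The corpus contains $M$ documents, $d_1,d_2,\dots,d_M$. Document $m$ has length $N_m$, that is, it consists of $N_m$ words. Each document has its own topic distribution, which is a multinomial distribution whose parameter follows a Dirichlet distribution with parameter $\vec{ \alpha }$. Note that the Dirichlet distribution is the conjugate prior of the multinomial distribution.

> How should we read the statement that a document's topics follow a multinomial distribution? Each document counts as one more experiment, so $M$ documents amount to $M$ experiments. Each experiment has $K$ possible outcomes, and each outcome occurs with some probability.

- There are $K$ topics in total ($K$ is given), $T_1,T_2,\dots,T_K$. Each topic has its own word distribution, which is a multinomial distribution whose parameter follows a Dirichlet distribution with parameter $\vec{\beta}$. Note that one word may belong to several topics.

## Model Process

$\vec{\alpha}$ and $\vec{\beta}$ are the parameters of the prior distributions and are usually fixed in advance. Choosing a symmetric Dirichlet distribution with parameter 0.1, for example, expresses the expectation that after learning each document concentrates on only a few topics, consistent with the $\alpha<1$ case of the symmetric Dirichlet distribution.

(1) Choose the topic of the document

(2) Choose a word given the topic

## Parameter Learning

Given a collection of documents, $w_{mn}$ is the observed variable, $\vec{\alpha}$ and $\vec{\beta}$ are prior parameters chosen from experience, and the remaining variables $z_{mn}$, $\vec{\theta}$ and $\vec{\varphi}$ are unknown latent variables that must be estimated from the observations. From the figure above, the joint distribution of all variables for document $m$ can be written as:

$$
p(\vec{w_m},\vec{z_m},\vec{\theta_m},\Phi | \vec{\alpha},\vec{\beta}) = \prod_{n=1}^{N_m} p(w_{m,n} | \vec{\phi}_{z_{m,n}})p(z_{m,n} | \vec{\theta_m}) \cdot p(\vec{\theta_m} | \vec{\alpha}) \cdot p(\Phi | \vec{\beta})
$$

### Likelihood

The probability that a word $w_{mn}$ (a word, which may repeat) is instantiated as a term $t$ (a term/token, which is distinct) is:

$$
p(w_{m,n}=t | \vec{\theta_m},\Phi) = \sum_{k=1}^K p(w_{m,n}=t | \vec{\phi_k})p(z_{m,n}=k|\vec{\theta}_m)
$$

That is, the probability of topic $k$ in the document is multiplied by the probability of term $t$ under topic $k$, and the products are summed over all topics. The likelihood of the whole document collection is:

$$
p(W | \Theta,\Phi) = \prod_{m=1}^{M}p(\vec{w_m} | \vec{\theta_m},\Phi) = \prod_{m=1}^M \prod_{n=1}^{N_m}p(w_{m,n}|\vec{\theta_m},\Phi)
$$

# Gibbs Sampling

> First, an informal view: a document contains $N_m$ words, and for each word in turn we use the other words to estimate the probability that it comes from each topic, repeating until convergence. At the start the topic of each word is assigned at random. The core of Gibbs Sampling is to find out which topic each word belongs to.

Gibbs Sampling works by picking one dimension of the probability vector at a time and sampling the value of that dimension given the values of all other dimensions, iterating until convergence and then outputting the parameters to be estimated. Initially each word in the text is randomly assigned a topic $z^{(0)}$. We then count how many times term $t$ appears under each topic $z$ and how many times topic $z$ appears in each document $m$. Each round computes $p(z_i|z_{\neq i},d,w)$, the topic distribution with the current word excluded.
The joint distribution here is:

$$
p(\vec{w},\vec{z} | \vec{\alpha},\vec{\beta}) = p(\vec{w} | \vec{z},\vec{\beta})p(\vec{z} | \vec{\alpha})
$$

The first factor is the process of sampling words given the topics. In the computation of the factors below, $n_z^{(t)}$ is the number of times term $t$ is observed assigned to topic $z$, and $n_m^{(k)}$ is the number of times topic $k$ is assigned in document $m$.

$$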
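p(\vec{w} | \vec{z},\vec{\beta})
= \int p(\vec{w} | \vec{z},\Phi)p(\Phi | \vec{\beta})d \Phi \\
= \int \prod_{z=1}^{K} \frac{1}{\Delta(\vec{\beta})}\prod_{t=1}^V \phi_{z,t}^{n_z^{(t)} + \beta_t - 1}d\vec{\phi_z} \\
= \prod_{z=1}^{K}\frac{\Delta(\vec{n_z} + \vec{\beta})}{\Delta(\vec{ \beta })} ,
\vec{n_z} = \{ n_z^{(t)} \}_{t=1}^V
$$

$$
p(\vec{z} | \vec{\alpha}) \\
= \int p(\vec{z} | \Theta) p(\Theta|\vec{\alpha}) d\Theta \\
= \int \prod_{m=1}^{M} \frac{1}{\Delta(\vec\alpha)} \prod_{k=1}^K\theta_{m,k}^{ n_m^{(k)} + \alpha_k - 1 }d\vec{\theta_m} \\
= \prod_{m=1}^M \frac{ \Delta(\vec{n_m} + \vec\alpha) }{ \Delta(\vec\alpha) }, \vec{n_m}=\{ n_m^{(k)} \}_{k=1}^K
$$

## Gibbs Updating Rule

## Summary of Word and Topic Distributions

After the Gibbs sampling above, the topic assignments of the words have converged, so we can compute the probability of each topic in a document and the probability of each term under a topic.

$$
\phi_{k,t} = \frac{ n_k^{(t)} + \beta_t }{ \sum^V_{t=1} \left( n_k^{(t)} + \beta_t \right) } \\
\theta_{m,k} = \frac{ n_m^{(k)} + \alpha_k }{ \sum^K_{k=1} \left( n_m^{(k)} + \alpha_k \right) } \\
$$

$$
p(\vec{\theta_m} | \vec{z_m}, \vec{\alpha} )
= \frac{1}{Z_{\theta_m}} \prod_{n=1}^{N_m} p(z_{m,n} | \vec{\theta_m}) \cdot p(\vec{\theta_m} | \vec{\alpha} )
= Dir(\vec{\theta_m} | \vec{n_m} + \vec{\alpha})
\\
p(\vec{\phi_k} | \vec{z}, \vec{w}, \vec{\beta} ) =
\frac{1}{Z_{\phi_k}} \prod_{i:z_i=k} p(w_i | \vec{\phi_k}) \cdot p(\vec{\phi_k} | \vec{\beta})
= Dir(\vec{\phi_k} | \vec{n_k} + \vec{\beta})
$$

# Code Implementation

The inputs of the code are the number of documents $M$, the number of distinct terms $V$ and the number of topics $K$. In addition, $d$ indexes documents, $k$ denotes a topic, $w$ denotes a term, and $n$ denotes a word.

- $z[d][w]$: the topic that the $w$-th word of document $d$ comes from. $M$ rows and $X$ columns, where $X$ is the length of the corresponding document, that is, its number of words (with repeats).
- $nw[w][t]$: the number of times term $w$ is assigned to topic $t$. This is the word-topic matrix; the column vector $nw[][t]$ is the term-frequency distribution of topic $t$. $V$ rows and $K$ columns.
- $nd[d][t]$: the number of times topic $t$ occurs in document $d$. This is the doc-topic matrix; the row vector $nd[d]$ is the topic-frequency distribution of document $d$. $M$ rows and $K$ columns.

Auxiliary vectors:

- $ntSum[t]$: the number of times topic $t$ occurs in the whole corpus, $K$-dimensional.
- $ndSum[d]$: the number of words (with repeats) in document $d$, $M$-dimensional.
- $P[t]$: the probability that the word currently being processed belongs to topic $t$, $K$-dimensional.

# Choosing the Hyperparameters

- Cross-validation.
- $\alpha$ expresses how distinct the topics are across documents; $\beta$ measures how many near-synonyms can fall into the same category.
- Given the number of topics $K$, one can use:

$$
\alpha = 50 / K \\
\beta = 0.01
$$

--------------------------------------------------------------------------------
/经典自然语言/统计语言模型/Word2Vec.md:
--------------------------------------------------------------------------------

# Word2Vec

The most intuitive way to understand a word vector is that each word is represented as a vector of real numbers.

Deep learning has excelled in images, speech, video and many other applications. In essence, deep learning is a method of representation learning and can be viewed as a composition of different filters that classify things.

Word2Vec is an efficient tool, open-sourced by Google in mid-2013, for representing words as real-valued vectors. It uses two models: CBOW (Continuous Bag-Of-Words) and Skip-Gram. The word2vec code is at https://code.google.com/p/word2vec/ and is released under the Apache License 2.0, a license friendly to commercial use, although the original authors' copyright must of course be fully respected. Word2Vec represents words with so-called distributed representations. Distributed representation was first proposed by Hinton in the 1986 paper "Learning distributed representations of concepts". That paper did not propose distributed representations for words specifically, but the idea took root at the time and began to receive wider attention after 2000. A distributed representation used for words is usually called a "Word Representation" or "Word Embedding" (词向量, "word vector").

![](http://deeplearning4j.org/img/word2vec.png)

Word2vec is a neural network used to preprocess text before deep learning algorithms are applied. It does not implement deep learning itself, but it turns text into a vector form that deep learning models can work with.

Word2vec creates features without human intervention, including the contextual features of words. These contexts come from windows of several words. Given enough data, usage and context, Word2Vec can predict the meaning of a word with high accuracy from its occurrences (for deep learning, the meaning of a word is just a simple signal that helps classify larger entities, for example assigning a document to a category).

Word2vec takes a sequence of sentences as input. Each sentence, that is, an array of words, is converted into a vector in an n-dimensional vector space and can be compared with the vectors of other sentences. In this space, related words and phrases appear close together. Once they are vectors, we can compute their similarity to some degree and cluster them. These clusters can serve as a basis for search, sentiment analysis and recommendation.

The output of the Word2vec network is a vocabulary in which each word is represented by a vector; these vectors can be used as input to a deep neural network for classification.

# Quick Start

## Python

The author recommends Anaconda, a Python distribution for machine learning. The test data used here comes from [here](http://mattmahoney.net/dc/text8.zip).

- Installation

Install with `pip install word2vec`, then import it with `import word2vec`.

- Preprocessing the text file

```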
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True) 33 | ``` 34 | 35 | ``` 36 | [u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2'] 37 | Starting training using file /Users/drodriguez/Downloads/text8 38 | Words processed: 17000K Vocab size: 4399K 39 | Vocab size (unigrams + bigrams): 2419827 40 | Words in train file: 17005206 41 | ``` 42 | 43 | ### 中文实验 44 | 45 | - 语料 46 | 47 | 首先准备数据:采用网上博客上推荐的全网新闻数据(SogouCA),大小为 2.1G。 48 | 49 | 从ftp上下载数据包SogouCA.tar.gz: 50 | 51 | ``` 52 | 1 wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r 53 | ``` 54 | 55 | 解压数据包: 56 | 57 | ``` 58 | 1 gzip -d SogouCA.tar.gz 59 | 2 tar -xvf SogouCA.tar 60 | ``` 61 | 62 | 再将生成的txt文件归并到SogouCA.txt中,取出其中包含content的行并转码,得到语料corpus.txt,大小为2.7G。 63 | 64 | ``` 65 | 1 cat *.txt > SogouCA.txt 66 | 2 cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "" > corpus.txt 67 | ``` 68 | 69 | - 分词 70 | 71 | 用 ANSJ 对 corpus.txt 进行分词,得到分词结果 resultbig.txt,大小为 3.1G。在分词工具 seg_tool 目录下先编译再执行得到分词结果 resultbig.txt,内含 426221 个词,次数总计 572308385 个。 72 | 73 | - 词向量训练 74 | 75 | ```shell 76 | nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 & 77 | ``` 78 | 79 | - 分析 80 | 81 | (1)相似词计算 82 | 83 | ``` 84 | ./distance vectors.bin 85 | ``` 86 | 87 | ./distance可以看成计算词与词之间的距离,把词看成向量空间上的一个点,distance看成向量空间上点与点的距离。 88 | 89 | (2)潜在的语言学规律 90 | 91 | 在对demo-analogy.sh修改后得到下面几个例子: 92 | 93 | 法国的首都是巴黎,英国的首都是伦敦,vector("法国") - vector("巴黎) + vector("英国") --> vector("伦敦")" 94 | 95 | (3)聚类 96 | 97 | 将经过分词后的语料resultbig.txt中的词聚类并按照类别排序: 98 | 99 | ```shell 100 | 1 nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 
500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
```

(4) Phrase analysis

First derive the file sogouca_phrase.txt, which contains both words and phrases, from the segmented corpus resultbig.txt; then train vector representations for the words and phrases in that file.

```
./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold 500 -debug 2
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
```

## Wikipedia Experiment

# Algorithms

![](http://deeplearning4j.org/img/word2vec_diagrams.png)

## CBOW

CBOW is short for Continuous Bag-of-Words Model. It resembles the feed-forward NNLM, except that CBOW removes the most time-consuming non-linear hidden layer and all words share the hidden layer, as shown in the figure below. The CBOW model predicts $P(w_t|w_{t-k},w_{t-(k-1)},\dots,w_{t-1},w_{t+1},\dots,w_{t+k})$.

![](http://7xlgth.com1.z0.glb.clouddn.com/1424C789-5B58-43BA-952C-EACDF43E2AEB.png)

The operation from the input layer to the hidden layer is simply the sum of the context vectors; the corresponding code is shown below. Here sentence_position is the index of the current word in the sentence. Take the sentence A B C D as an example: the first time the code below is entered, the current word is A and sentence_position is 0. b is a randomly generated number between 0 and $window-1$, and the whole window has size $2 \cdot window + 1 - 2b$, which means looking at $window-b$ words on each side. As the window slides from left to right, its size therefore varies randomly between $3$ ($b=window-1$) and $2 \cdot window+1$ ($b=0$); the random value b determines the size of the current window. neu1 in the code is the hidden-layer vector, that is, the sum of the vectors of the context words (the words in the window other than the current one).

![](http://7xlgth.com1.z0.glb.clouddn.com/36F89DA8-F3A0-4C6C-84F8-C31BB19CEEC1.png)

## Skip-Gram

![](http://7xlgth.com1.z0.glb.clouddn.com/F0E76FE8-7B78-4E4C-BB6A-8FB47A67645C.png)

The Skip-Gram diagram runs in exactly the opposite direction to CBOW. From the figure, Skip-Gram predicts the probability $p(w_i|w_t)$, where $t - c \le i \le t + c$ and $i \ne t$. $c$ is a constant that determines the size of the context window; the larger $c$ is, the more pairs have to be considered, which generally gives more accurate results but also increases training time. Given a word sequence $w_1,w_2,w_3,\dots,w_T$, the Skip-gram objective is to maximise:

$$
\frac{1}{T}\sum^{T}_{t=1}\sum_{-c \le j \le c, j \ne 0}\log p(w_{t+j}|w_t)
$$

The basic Skip-Gram model defines $p(w_O|w_I)$ as follows, where $v_w$ and $v'_w$ are the input and output vectors of word $w$ and $W$ is the vocabulary size:

$$
p(w_O | w_I) = \frac{\exp\left({v'_{w_O}}^{T} v_{w_I}\right)}{\sum_{w=1}^{W}\exp\left({v'_{w}}^{T} v_{w_I}\right)}
$$

As the objective shows, Skip-Gram is a symmetric model: if $w_k$ lies in the window when $w_t$ is the centre word, then $w_t$ must also lie in the window of the same size centred on $w_k$, that is:

$$
\frac{1}{T}\sum^{T}_{t=1}\sum_{-c \le j \le c, j \ne 0}\log p(w_{t+j}|w_t) = \\ \frac{1}{T}\sum^{T}_{t=1}\sum_{-c \le j \le c, j \ne 0}\log p(w_{t}|w_{t+j})
$$

In addition, each word vector in Skip-Gram characterises the distribution of its contexts. The "skip" in Skip-Gram means that probabilities are computed for every pair of words within a window, even when other words lie between them. One benefit is that “白色汽车” ("white car") and “白色的汽车” (the same phrase with the particle 的 inserted) are easily recognised as the same phrase.

Like CBOW, Skip-Gram has two optional algorithms: hierarchical softmax and negative sampling. Hierarchical softmax also makes use of Huffman coding: every word $w$ can be reached from the root of the tree along a unique path. Let $n(w,j)$ be the $j$-th node on this path and $L(w)$ the length of the path, with $j$ starting from 1, so that $n(w,1)=root$ and $n(w,L(w))=w$. Hierarchical softmax defines the probability $p(w|w_I)$ as:

$$
p(w|w_I)=\prod_{j=1}^{L(w)-1}\sigma\left([n(w,j+1)=ch(n(w,j))] \cdot {v'_{n(w,j)}}^{T} v_{w_I}\right)
$$

Here $[x]$ is 1 if $x$ is true and -1 otherwise. $ch(n(w,j))$ can be either the left child or the right child of $n(w,j)$; the word2vec source code uses the left child (the label is $1-code[j]$), although using the right child instead would work just as well.

# Tricks

## Learning Phrases

Words that frequently occur together are treated as a phrase. How is this measured? With the following formula:

$score(w_i,w_j)=\frac{count(w_iw_j) - \delta}{count(w_i) \times count(w_j)}$

Given two words, if the computed score exceeds a threshold we consider them to "belong together". To capture longer phrases, 2 to 4 passes are made over the training data, lowering the threshold each time.

# Implementation

The efficiency of Word2Vec can be attributed to the following:

1. The time-consuming non-linear hidden layer is removed;

2. Huffman coding effectively performs a kind of clustering, so not every word pair needs to be counted;

3. Negative Sampling;

4. Stochastic gradient descent;

5. Only a single pass over the data is made, with no repeated iterations;

6. Implementation tricks such as precomputing the exponential function and subsampling frequent words.

word2vec has many tunable hyperparameters:

| Parameter | Description | Notes |
| ---------- | -------------------- | ------------------------------------------------------------------------------------------------ |
| -size | Vector dimension | Higher is usually better, but not always |
| -window | Context window size | Usually about 10 for Skip-gram and about 5 for CBOW |
| -sample | Subsampling of frequent words | On large datasets this can improve both accuracy and speed; values between 1e-3 and 1e-5 work best |
| -hs | Whether to use hierarchical softmax | Hierarchical softmax works better for infrequent words; negative sampling works better for frequent words and for low-dimensional vectors |
| -negative | Number of negative samples | |
| -min-count | Threshold below which infrequent words are discarded | |
| -alpha | Initial learning rate | |
| -cbow | Use CBOW | Skip-gram is slower but works better for infrequent words; CBOW is faster |

## Deeplearning4j

- [Word2vec](http://deeplearning4j.org/zh-word2vec.html)
- [DL4J-Word2Vec](http://deeplearning4j.org/word2vec.html#intro)

## Python

- [中英文维基百科语料上的 Word2Vec 实验 (Word2Vec experiments on the Chinese and English Wikipedia corpora)](http://www.52nlp.cn/%E4%B8%AD%E8%8B%B1%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E8%AF%AD%E6%96%99%E4%B8%8A%E7%9A%84word2vec%E5%AE%9E%E9%AA%8C)

```
%load_ext autoreload
%autoreload 2
```

# word2vec

This notebook is equivalent to `demo-word.sh`, `demo-analogy.sh`, `demo-phrases.sh` and `demo-classes.sh` from Google.

## Training

Download some data, for example: [http://mattmahoney.net/dc/text8.zip](http://mattmahoney.net/dc/text8.zip)

```
import word2vec
```

Run `word2phrase` to group words that form phrases, for example "Los Angeles" becomes "Los_Angeles".

```
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
```

```
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
```

This will create a `text8-phrases` file that we can use as a better input for `word2vec`. Note that you could easily skip this previous step and use the original data as input for `word2vec`.

Train the model using the `word2phrase` output.
244 | 245 | ``` 246 | word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True) 247 | ``` 248 | 249 | ``` 250 | Starting training using file /Users/drodriguez/Downloads/text8-phrases 251 | Vocab size: 98331 252 | Words in train file: 15857306 253 | Alpha: 0.000002 Progress: 100.03% Words/thread/sec: 286.52k 254 | ``` 255 | 256 | That generated a `text8.bin` file containing the word vectors in a binary format. 257 | 258 | Do the clustering of the vectors based on the trained model. 259 | 260 | ``` 261 | word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True) 262 | ``` 263 | 264 | ``` 265 | Starting training using file /Users/drodriguez/Downloads/text8 266 | Vocab size: 71291 267 | Words in train file: 16718843 268 | Alpha: 0.000002 Progress: 100.02% Words/thread/sec: 287.55k 269 | ``` 270 | 271 | That created a `text8-clusters.txt` with the cluster for every word in the vocabulary 272 | 273 | ## Predictions 274 | 275 | ``` 276 | import word2vec 277 | ``` 278 | 279 | Import the `word2vec` binary file created above 280 | 281 | ``` 282 | model = word2vec.load('/Users/drodriguez/Downloads/text8.bin') 283 | ``` 284 | 285 | We can take a look at the vocabulaty as a numpy array 286 | 287 | ``` 288 | model.vocab 289 | ``` 290 | 291 | ``` 292 | array([u'', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], 293 | dtype=' string -- 抓取的目标集合 train / test / all 17 | """ 18 | rand = np.random.mtrand.RandomState(8675309) 19 | data = fetch_20newsgroups(subset=subset, 20 | categories=categories, 21 | shuffle=True, 22 | random_state=rand) 23 | 24 | self.data[subset] = data 25 | ``` 26 | 27 | 然后在 Notebook 中交互查看数据格式: 28 | 29 | ```py 30 | # 实例化对象 31 | twp = TwentyNewsGroup() 32 | # 抓取数据 33 | twp.fetch_data() 34 | twenty_train = twp.data['train'] 35 | print("数据集结构", "->", twenty_train.keys()) 36 | print("文档数目", "->", len(twenty_train.data)) 
37 | print("目标分类", "->",[ twenty_train.target_names[t] for t in twenty_train.target[:10]]) 38 | 39 | 数据集结构 -> dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description']) 40 | 文档数目 -> 11314 41 | 目标分类 -> ['sci.space', 'comp.sys.mac.hardware', 'sci.electronics', 'comp.sys.mac.hardware', 'sci.space', 'rec.sport.hockey', 'talk.religion.misc', 'sci.med', 'talk.religion.misc', 'talk.politics.guns'] 42 | ``` 43 | 44 | 接下来我们可以对语料集中的特征进行提取: 45 | 46 | ```py 47 | # 进行特征提取 48 | 49 | # 构建文档-词矩阵(Document-Term Matrix) 50 | 51 | from sklearn.feature_extraction.text import CountVectorizer 52 | 53 | count_vect = CountVectorizer() 54 | 55 | X_train_counts = count_vect.fit_transform(twenty_train.data) 56 | 57 | print("DTM 结构","->",X_train_counts.shape) 58 | 59 | # 查看某个词在词表中的下标 60 | print("词对应下标","->", count_vect.vocabulary_.get(u'algorithm')) 61 | 62 | DTM 结构 -> (11314, 130107) 63 | 词对应下标 -> 27366 64 | ``` 65 | 66 | 为了将文档用于进行分类任务,还需要使用 TF-IDF 等常见方法将其转化为特征向量: 67 | 68 | ``` 69 | # 构建文档的 TF 特征向量 70 | from sklearn.feature_extraction.text import TfidfTransformer 71 | 72 | tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts) 73 | X_train_tf = tf_transformer.transform(X_train_counts) 74 | 75 | print("某文档 TF 特征向量","->",X_train_tf) 76 | 77 | # 构建文档的 TF-IDF 特征向量 78 | from sklearn.feature_extraction.text import TfidfTransformer 79 | 80 | tf_transformer = TfidfTransformer().fit(X_train_counts) 81 | X_train_tfidf = tf_transformer.transform(X_train_counts) 82 | 83 | print("某文档 TF-IDF 特征向量","->",X_train_tfidf) 84 | 85 | 某文档 TF 特征向量 -> (0, 6447) 0.0380693493813 86 | (0, 37842) 0.0380693493813 87 | ``` 88 | 89 | 我们可以将特征提取、分类器训练与预测封装为单独函数: 90 | 91 | ``` 92 | def extract_feature(self): 93 | """ 94 | 从语料集中抽取文档特征 95 | """ 96 | 97 | # 获取训练数据的文档-词矩阵 98 | self.train_dtm = self.count_vect.fit_transform(self.data['train'].data) 99 | 100 | # 获取文档的 TF 特征 101 | 102 | tf_transformer = TfidfTransformer(use_idf=False) 103 | 104 | self.train_tf = 
tf_transformer.transform(self.train_dtm) 105 | 106 | # 获取文档的 TF-IDF 特征 107 | 108 | tfidf_transformer = TfidfTransformer().fit(self.train_dtm) 109 | 110 | self.train_tfidf = tf_transformer.transform(self.train_dtm) 111 | 112 | def train_classifier(self): 113 | """ 114 | 从训练集中训练出分类器 115 | """ 116 | 117 | self.extract_feature(); 118 | 119 | self.clf = MultinomialNB().fit( 120 | self.train_tfidf, self.data['train'].target) 121 | 122 | def predict(self, docs): 123 | """ 124 | 从训练集中训练出分类器 125 | """ 126 | 127 | X_new_counts = self.count_vect.transform(docs) 128 | 129 | tfidf_transformer = TfidfTransformer().fit(X_new_counts) 130 | 131 | X_new_tfidf = tfidf_transformer.transform(X_new_counts) 132 | 133 | return self.clf.predict(X_new_tfidf) 134 | ``` 135 | 136 | 然后执行训练并且进行预测与评价: 137 | 138 | ``` 139 | # 训练分类器 140 | twp.train_classifier() 141 | 142 | # 执行预测 143 | docs_new = ['God is love', 'OpenGL on the GPU is fast'] 144 | predicted = twp.predict(docs_new) 145 | 146 | for doc, category in zip(docs_new, predicted): 147 | print('%r => %s' % (doc, twenty_train.target_names[category])) 148 | 149 | # 执行模型评测 150 | twp.fetch_data(subset='test') 151 | 152 | predicted = twp.predict(twp.data['test'].data) 153 | 154 | import numpy as np 155 | 156 | # 误差计算 157 | 158 | # 简单误差均值 159 | np.mean(predicted == twp.data['test'].target) 160 | 161 | # Metrics 162 | 163 | from sklearn import metrics 164 | 165 | print(metrics.classification_report( 166 | twp.data['test'].target, predicted, 167 | target_names=twp.data['test'].target_names)) 168 | 169 | # Confusion Matrix 170 | metrics.confusion_matrix(twp.data['test'].target, predicted) 171 | 172 | 'God is love' => soc.religion.christian 173 | 'OpenGL on the GPU is fast' => rec.autos 174 | precision recall f1-score support 175 | 176 | alt.atheism 0.79 0.50 0.61 319 177 | ... 
178 | talk.religion.misc 1.00 0.08 0.15 251 179 | 180 | avg / total 0.82 0.79 0.77 7532 181 | 182 | Out[16]: 183 | array([[158, 0, 1, 1, 0, 1, 0, 3, 7, 1, 2, 6, 1, 184 | 8, 3, 114, 6, 7, 0, 0], 185 | ... 186 | [ 35, 3, 1, 0, 0, 0, 1, 4, 1, 1, 6, 3, 0, 187 | 6, 5, 127, 30, 5, 2, 21]]) 188 | ``` 189 | 190 | We can also extract topics from the document collection: 191 | 192 | ``` 193 | # Extract topics 194 | 195 | twp.topics_by_lda() 196 | 197 | Topic 0 : stream s1 astronaut zoo laurentian maynard s2 gtoal pem fpu 198 | Topic 1 : 145 cx 0d bh sl 75u 6um m6 sy gld 199 | Topic 2 : apartment wpi mars nazis monash palestine ottoman sas winner gerard 200 | Topic 3 : livesey contest satellite tamu mathew orbital wpd marriage solntze pope 201 | Topic 4 : x11 contest lib font string contrib visual xterm ahl brake 202 | Topic 5 : ax g9v b8f a86 1d9 pl 0t wm 34u giz 203 | Topic 6 : printf null char manes behanna senate handgun civilians homicides magpie 204 | Topic 7 : buf jpeg chi tor bos det que uwo pit blah 205 | Topic 8 : oracle di t4 risc nist instruction msg postscript dma convex 206 | Topic 9 : candida cray yeast viking dog venus bloom symptoms observatory roby 207 | Topic 10 : cx ck hz lk mv cramer adl optilink k8 uw 208 | Topic 11 : ripem rsa sandvik w0 bosnia psuvm hudson utk defensive veal 209 | Topic 12 : db espn sabbath br widgets liar davidian urartu sdpa cooling 210 | Topic 13 : ripem dyer ucsu carleton adaptec tires chem alchemy lockheed rsa 211 | Topic 14 : ingr sv alomar jupiter borland het intergraph factory paradox captain 212 | Topic 15 : militia palestinian cpr pts handheld sharks igc apc jake lehigh 213 | Topic 16 : alaska duke col russia uoknor aurora princeton nsmca gene stereo 214 | Topic 17 : uuencode msg helmet eos satan dseg homosexual ics gear pyron 215 | Topic 18 : entries myers x11r4 radar remark cipher maine hamburg senior bontchev 216 | Topic 19 : cubs ufl vitamin temple gsfc mccall astro bellcore uranium wesleyan 217 | ``` 218 | 219 | # Wrapping Common NLP Tools 220 | 221 | The 20NewsGroup walkthrough above shows that common NLP tasks include data acquisition, preprocessing, feature extraction, classifier training, and higher-level feature extraction such as topic models or word vectors. I also like to use [python-fire](https://github.com/google/python-fire) to quickly wrap a class as a command-line tool that can still be imported by other modules. This part uses a Chinese corpus as the example: to analyze the Chinese Wikipedia dump, we can use the [Wikipedia corpus class](https://parg.co/b44) in gensim: 222 | 223 | ``` 224 | class Wiki(object): 225 | """ 226 | Wikipedia corpus processing 227 | """ 228 | 229 | def wiki2texts(self, wiki_data_path, wiki_texts_path='./wiki_texts.txt'): 230 | """ 231 | Convert a Wikipedia dump into plain text 232 | Arguments: 233 | wiki_data_path -- path of the compressed Wikipedia dump 234 | """ 235 | if not wiki_data_path: 236 | print("Please pass the path of the Wiki dump, or download one from https://dumps.wikimedia.org/zhwiki/") 237 | exit() 238 | 239 | # Build the Wikipedia corpus 240 | wiki_corpus = WikiCorpus(wiki_data_path, dictionary={}) 241 | texts_num = 0 242 | 243 | with open(wiki_texts_path, 'w', encoding='utf-8') as output: 244 | for text in wiki_corpus.get_texts(): 245 | output.write(b' '.join(text).decode('utf-8') + '\n') 246 | texts_num += 1 247 | if texts_num % 10000 == 0: 248 | logging.info("Processed %d articles" % texts_num) 249 | 250 | print("Done. Please convert the output to Simplified Chinese with OpenCC") 251 | ``` 252 | 253 | After extraction, convert the text to Simplified Chinese with OpenCC, then segment the resulting file with Jieba (code [here](https://parg.co/b4R)); running `python chinese_text_processor.py tokenize_file /output.txt` performs the segmentation and writes the output file. The segmented file can then be converted into a simple bag-of-words representation or a document-term matrix (full code [here](https://parg.co/b4f)): 254 | 255 | ```py 256 | class CorpusProcessor: 257 | """ 258 | Corpus processing 259 | """ 260 | 261 | def corpus2bow(self, tokenized_corpus=default_documents): 262 | """returns (vocab,corpus_in_bow) 263 | Convert the corpus into BOW form 264 | Arguments: 265 | tokenized_corpus -- list of tokenized documents 266 | Return: 267 | vocab -- {'human': 0, ... 'minors': 11} 268 | corpus_in_bow -- [[(0, 1), (1, 1), (2, 1)]...] 
269 | """ 270 | dictionary = corpora.Dictionary(tokenized_corpus) 271 | 272 | # Get the vocabulary 273 | vocab = dictionary.token2id 274 | 275 | # Get the bag-of-words representation of each document 276 | corpus_in_bow = [dictionary.doc2bow(text) for text in tokenized_corpus] 277 | 278 | return (vocab, corpus_in_bow) 279 | 280 | def corpus2dtm(self, tokenized_corpus=default_documents, min_df=10, max_df=100): 281 | """returns (vocab, DTM) 282 | Convert the corpus into a document-term matrix 283 | - dtm -> matrix: document-term matrix 284 | I like hate databases 285 | D1 1 1 0 1 286 | D2 1 0 1 1 287 | """ 288 | 289 | if isinstance(tokenized_corpus[0], list): 290 | documents = [" ".join(document) for document in tokenized_corpus] 291 | else: 292 | documents = tokenized_corpus 293 | 294 | if max_df == -1: 295 | max_df = round(len(documents) / 2) 296 | 297 | # Build the count vectorizer for the corpus 298 | vec = CountVectorizer(min_df=min_df, 299 | max_df=max_df, 300 | analyzer="word", 301 | token_pattern=r"[\S]+", 302 | tokenizer=None, 303 | preprocessor=None, 304 | stop_words=None 305 | ) 306 | 307 | # Fit the vectorizer and transform the documents 308 | DTM = vec.fit_transform(documents) 309 | 310 | # Get the vocabulary (scikit-learn >= 1.0; older releases use get_feature_names) 311 | vocab = vec.get_feature_names_out() 312 | 313 | return (vocab, DTM) 314 | ``` 315 | 316 | We can also train topic models or word vectors on the segmented documents; working from the segmented file lets us ignore the differences between Chinese and English: 317 | 318 | ```py 319 | def topics_by_lda(self, tokenized_corpus_path, num_topics=20, num_words=10, max_lines=10000, split=r"\s+", max_df=100): 320 | """ 321 | Read a tokenized file and train LDA on it 322 | Arguments: 323 | tokenized_corpus_path -> string -- path of the tokenized corpus 324 | num_topics -> integer -- number of topics 325 | num_words -> integer -- number of words shown per topic 326 | max_lines -> integer -- maximum number of lines to read 327 | split -> string -- separator between the words of a document 328 | max_df -> integer -- drop words whose document frequency exceeds this threshold, to avoid common words 329 | """ 330 | 331 | # Holds the whole corpus 332 | corpus = [] 333 | 334 | with open(tokenized_corpus_path, 'r', encoding='utf-8') as tokenized_corpus: 335 | 336 | flag = 0 337 | 338 | for document in tokenized_corpus: 339 | 340 | # Stop once enough lines have been read 341 | if flag >= max_lines: 342 | break 343 | 344 | # Append the line to the corpus 345 | corpus.append(re.split(split, document)) 346 | 347 | flag 
= flag + 1 348 | 349 | # Build the document-term matrix of the corpus 350 | (vocab, DTM) = self.corpus2dtm(corpus, max_df=max_df) 351 | 352 | # Train the LDA model 353 | 354 | lda = LdaMulticore( 355 | matutils.Sparse2Corpus(DTM, documents_columns=False), 356 | num_topics=num_topics, 357 | id2word=dict([(i, s) for i, s in enumerate(vocab)]), 358 | workers=4 359 | ) 360 | 361 | # Print the topics 362 | topics = lda.show_topics( 363 | num_topics=num_topics, 364 | num_words=num_words, 365 | formatted=False, 366 | log=False) 367 | 368 | for ti, topic in enumerate(topics): 369 | print("Topic", ti, ":", " ".join(word[0] for word in topic[1])) 370 | ``` 371 | 372 | This function can likewise be called directly from the command line with the segmented file as its argument. We can also train word vectors on the same corpus (code [here](https://parg.co/b4N)); readers who are not yet familiar with basic word-vector usage can refer to [基于 Gensim 的 Word2Vec 实践](https://zhuanlan.zhihu.com/p/24961011): 373 | 374 | ```py 375 | def wv_train(self, tokenized_text_path, output_model_path='./wv_model.bin'): 376 | """ 377 | Train word vectors on the text and save the resulting model 378 | """ 379 | 380 | sentences = word2vec.Text8Corpus(tokenized_text_path) 381 | 382 | # Train the model (gensim >= 4.0; older releases use size=250) 383 | model = word2vec.Word2Vec(sentences, vector_size=250) 384 | 385 | # Save the model 386 | model.save(output_model_path) 387 | 388 | def wv_visualize(self, model_path, word=["中国", "航空"]): 389 | """ 390 | Look up the nearest neighbours of the input words and plot them 391 | Arguments: 392 | model_path: path of the Word2Vec model 393 | """ 394 | 395 | # Load the model 396 | model = word2vec.Word2Vec.load(model_path) 397 | 398 | # Find the 20 most similar words 399 | words = [wp[0] for wp in model.wv.most_similar(word, topn=20)] 400 | 401 | # Fetch the vector of each word 402 | wordsInVector = [model.wv[w] for w in words] 403 | 404 | # Reduce to two dimensions with PCA 405 | pca = PCA(n_components=2) 406 | pca.fit(wordsInVector) 407 | X = pca.transform(wordsInVector) 408 | 409 | # Draw the scatter plot 410 | xs = X[:, 0] 411 | ys = X[:, 1] 412 | 413 | plt.figure(figsize=(12, 8)) 414 | plt.scatter(xs, ys, marker='o') 415 | 416 | # Annotate every point with its word 417 | for i, w in enumerate(words): 418 | plt.annotate( 419 | w, 420 | xy=(xs[i], ys[i]), xytext=(6, 6), 421 | textcoords='offset points', ha='left', va='top', 422 | **dict(fontsize=10) 423 | ) 
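424 | plt.show() 425 | ``` 426 | -------------------------------------------------------------------------------- /经典自然语言/统计语言模型/统计语言模型.md: -------------------------------------------------------------------------------- 1 | - [统计语言模型浅谈](https://zhuanlan.zhihu.com/p/28323093) is part of the author's [程序猿的数据科学与机器学习实战手册](https://github.com/wx-chevalier/DataScience-And-MachineLearning-Handbook-For-Coders). Related reading: [Python 语法速览与机器学习开发环境搭建](https://zhuanlan.zhihu.com/p/24536868), [Scikit-Learn 备忘录](https://zhuanlan.zhihu.com/p/24770526), [基于 Gensim 的 Word2Vec 实践](https://zhuanlan.zhihu.com/p/24961011). 2 | 3 | ## Statistical Language Models 4 | 5 | A statistical language model describes the probability distribution over linguistic units such as words, sentences, and whole documents, and can be used to measure whether a sentence or word sequence matches the way people normally write and speak in a given language. Statistical language models are very valuable for complex, large-scale NLP applications: they help capture the regularities of natural language and thereby improve speech recognition, machine translation, document classification, optical character recognition, and similar applications. A good statistical language model depends on a large amount of training data; in the 1970s and 1980s, model quality was largely determined by how much data was available in the domain. IBM once ran an information retrieval evaluation and found that a bi-gram model needs hundreds of millions of words to reach its best performance, while a tri-gram model needs billions of words to saturate. At the start of this century the most popular statistical language model was the N-gram, a typical model based on a sparse representation. In recent years, with the rise of deep learning, language models based on distributed representations, represented by word embeddings, have achieved better results and have deeply influenced other models and applications in NLP. In addition, Ronald Rosenfeld[7] also mentions decision tree models, maximum entropy models, and adaptive models. 6 | A statistical language model can express the statistical properties of a word sequence, for example by learning the joint probability function of the words in the sequence. If $w_1$ to $w_t$ denote the words of a sentence in order, the probability of the sentence can be written as: 7 | 8 | $$ 9 | \begin{equation} 10 | \begin{split} 11 | P(w_1,...,w_t) = \prod_{i=1}^{t}P(w_i|w_1,...,w_{i-1}) = \prod_{i=1}^{t}P(w_i|Context) \\ 12 | P(w_1, w_2, \dots, w_t) = P(w_1) \times P(w_2 | w_1) \times P(w_3 | w_1, w_2) \times \dots \times P(w_t | w_1, w_2, \dots, w_{t-1}) 13 | \end{split} 14 | \end{equation} 15 | $$ 16 | 17 | The training objective of a statistical language model can also be to maximize the log-likelihood by maximum likelihood estimation, $\frac{1}{T}\sum^T_{t=1}\sum_{-c \le j\le c,j \ne0}\log p(w_{t+j}|w_t)$, where $c$ is the size of the training context. With $c = 5$, for example, 5 consecutive words are used for training at a time. In general, a larger $c$ gives better results but takes more time. $p(w_{t+j}|w_t)$ is the probability of $w_{t+j}$ given $w_t$. A common measure of a language model is its perplexity; note that perplexity here does not have the same meaning as perplexity in information theory. The definition used here follows Stolcke[11]: $exp(-logP(w_t)/|\vec{w}|)$, that is, the geometric mean of $1/P(w_t|w_1^{t-1})$. Minimizing perplexity is the same as maximizing the probability of each word. However, perplexity depends heavily on the vocabulary and on the specific words used, so it is usually used to compare two systems that are otherwise identical rather than as a general, absolute measure. 18 | 19 | ### N-gram Language Models 20 | 21 | Following the description above, a statistical language model aims to compute the probability of a word sequence $E = w_1^T$, which can be formalized as: 22 | 23 | $$ 24 | \begin{equation} 25 | P(E) = P(|E| = T,w_1^T) 26 | \end{equation} 27 | $$ 28 | 29 | Here the target sequence $E$ has length $T$; its first word is $w_1$, its second word is $w_2$, and so on up to the last word $w_T$. The formula is intuitive but infeasible in practice, because the length $T$ is unknown and the number of possible word combinations is far too large to estimate directly. To find a practical simplification, we can rewrite the joint probability of the whole sequence as a product of per-word conditional probabilities, for example $P(w_1,w_2,w_3)=P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)$. Generalizing to any word sequence gives: 30 | 31 | $$ 32 | \begin{equation} 33 | P(E) = \prod_{t=1}^{T+1}P(w_t|w_1^{t-1}) 34 | \end{equation} 35 | $$ 36 | 37 | The joint probability of the sequence has now been decomposed into the conditional probabilities $P(w_t | w_1, w_2, \dots, w_{t-1})$. The N-gram model discussed here approximates each of them by $P(w_t | w_{t-n+1}, \dots, w_{t-1})$. Depending on the value of $N$, we get the uni-gram, bi-gram, tri-gram models, and so on. In Chinese text input this model is known as the Chinese Language Model (CLM): when digits standing for letters or strokes, or unsegmented pinyin or stroke sequences, must be converted into a string of Chinese characters (a sentence), the collocation information between neighbouring words in the context is used to compute the most probable sentence. The user does not need to choose manually, which avoids the ambiguity of many characters sharing the same pinyin (or stroke or digit sequence). 38 | The uni-gram model, also called the context-independent language model, is simple to implement but of limited practical value. It ignores the context of a word and considers only the probability of the word itself; it is the special case $N=1$ of the N-gram model. 39 | 40 | $$ 41 | \begin{equation} 42 | p(w_t|Context)=p(w_t)=\frac{N_{w_t}}{N} 43 | \end{equation} 44 | $$ 45 | 46 | The N-gram model also has problems. It cannot model the similarity between words: if two words are similar in some way and one of them often follows a given phrase, the other is likely to follow that phrase as well. For example, if "white car" (白色的汽车) occurs often, we can reasonably expect "white sedan" (白色的轿车) to occur often too, but an N-gram model cannot infer this. It also cannot model longer-range relations, because the lack of data makes it impossible to train higher-order models. Most research and applications use tri-grams; with higher-order models the estimated probabilities become much less reliable, and some smaller problems use bi-grams. Finally, some n-tuples never appear in the training corpus, so their conditional probability is 0 and the probability of the whole sentence becomes 0. The simplest way to estimate the probability of a word is to count the occurrences of fixed-length word sequences in the training set and divide by the count of their context. Taking the bi-gram as an example, suppose we have the following three training sentences: 47 | 48 | - i am from jiangsu. 49 | - i study at nanjing university. 50 | - my mother is from yancheng. 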
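The counting procedure for these three sentences can be sketched in a few lines of Python. This is a minimal illustration, not code from the original article: the function name `bigram_mle` is made up here, and tokens are obtained by plain whitespace splitting, so the trailing period stays attached to the last word.

```python
from collections import Counter

corpus = [
    "i am from jiangsu.",
    "i study at nanjing university.",
    "my mother is from yancheng.",
]


def bigram_mle(sentences):
    """Estimate P(word | prev) by relative frequency (hypothetical helper)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Every token that has a successor counts once as a context.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }


probs = bigram_mle(corpus)
print(probs[("i", "am")])
print(probs[("i", "study")])
```

Because `i` is followed once by `am` and once by `study`, both estimates come out as one half; any bi-gram that never occurs in the corpus is simply absent from the dictionary, which is the zero-probability problem that smoothing addresses.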
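51 | 52 | We can derive the conditional probabilities of the words am and study given i: 53 | 54 | $$ 55 | \begin{equation} 56 | \begin{split} 57 | p(w_2={am} | w_1 = i) = \frac{c(w_1=i,w_2=am)}{c(w_1 = i)} = \frac{1}{2} = 0.5 \\ 58 | p(w_2={study} | w_1 = i) = \frac{c(w_1=i,w_2=study)}{c(w_1 = i)} = \frac{1}{2} = 0.5 59 | \end{split} 60 | \end{equation} 61 | $$ 62 | 63 | This calculation generalizes to the following formula: 64 | 65 | $$ 66 | \begin{equation} 67 | P_{ML}(w_t|w_1^{t-1}) = \frac{c_{prefix} (w_1^t) }{c_{prefix} (w_1^{t-1}) } 68 | \end{equation} 69 | $$ 70 | 71 | Here $c_{prefix}(\cdot)$ is the number of times the given string occurs in the training set. This is maximum likelihood estimation; it is simple to use and makes good use of the statistics of the training set. The same approach gives the likelihood formula for the tri-gram model: 72 | 73 | $$ 74 | \begin{equation} 75 | P(w_t | w_{t-2}, w_{t-1}) = 76 | \frac 77 | {count(w_{t-2}w_{t-1}w_t)} 78 | {count(w_{t-2}w_{t-1})} 79 | \end{equation} 80 | $$ 81 | 82 | We denote the parameters of the N-gram model by $\theta$; they contain the probability of the $n$-th word given the previous $n-1$ words, formally: 83 | 84 | $$ 85 | \begin{equation} 86 | \begin{split} 87 | \theta_{w_{t-n+1}^t} = P_{ML}(w_t|w_{t-n+1}^{t-1})=\frac{c(w_{t-n+1}^t)}{c(w_{t-n+1}^{t-1})} 88 | \end{split} 89 | \end{equation} 90 | $$ 91 | 92 | A naive N-gram model assigns probability zero to any word sequence that never appears in the training set, and because the model multiplies the probabilities of the individual words, the whole sentence then gets probability zero. Smoothing solves this problem by combining the estimates for different values of $N$ into an averaged probability. For example, we can combine the uni-gram and bi-gram models: 93 | 94 | $$ 95 | \begin{equation} 96 | \begin{split} 97 | P(w_t|w_{t-1}) = (1-\alpha)P_{ML}(w_t|w_{t-1}) + \alpha P_{ML}(w_t) 98 | \end{split} 99 | \end{equation} 100 | $$ 101 | 102 | Here $\alpha$ is the weight given to the uni-gram probability. With $\alpha > 0$, every word in the vocabulary receives some probability. This method is called interpolation and is used in many models with sparse, low-frequency data to keep them robust. We can of course include more values of $N$; the combined probability is then defined recursively: 103 | 104 | $$ 105 | \begin{equation} 106 | \begin{split} 107 | P(w_t|w_{t-m+1}^{t-1}) = (1 - \alpha_m)P_{ML}(w_t|w_{t-m+1}^{t-1}) + \alpha_mP(w_t|w_{t-m+2}^{t-1}) 108 | \end{split} 109 | \end{equation} 110 | $$ 111 | 112 | [Chen and Goodman, 1996] describe many other, more sophisticated smoothing methods, such as context-dependent smoothing coefficients, where $\alpha$ is not fixed but set dynamically as $\alpha_{w_{t-m+1}^{t-1}}$. This lets the model give more weight to the higher-order N-gram when there are many training examples and more weight to the lower-order N-gram when there are few. The smoothing method generally regarded as the most widely used and most effective is Modified Kneser-Ney smoothing (MKN), also from [Chen and Goodman, 1996], which combines context-dependent smoothing coefficients, discounting, and a modified lower-order distribution to obtain accurate probability estimates. 113 | 114 | ### Neural Network Language Models 115 | 116 | As the name suggests, a Neural Network Language Model is a language model based on a neural network; it uses the non-linear fitting ability of neural networks to derive distributed representations of words or text. In such a model the distributed representation of a word is a vector of neuron activations, in contrast to a local representation, where only one neuron is active at a time. The standard architecture of a neural network language model is shown in the figure below: 117 | 118 | The best-known neural network language model is the Probabilistic Feedforward Neural Network Language Model proposed by Bengio[10], which has four layers: input, projection, hidden, and output. In the input layer, $N$ words are picked from the $V$ words of the vocabulary and encoded by their indices, where $V$ is the vocabulary size. The input layer is then projected onto the projection layer $P$ through a shared $N \times D$ projection matrix; since only $N$ inputs are active at any time, this step is cheap. The real computational cost of the NNLM lies between the projection layer and the hidden layer: with $N = 10$, for example, the projection layer $P$ has 500 to 2000 dimensions and the hidden layer $H$ has $500$ to $1000$. The hidden layer is also used to compute the probability distribution over all words in the vocabulary, so the output layer has dimension $V$. Overall, the training complexity of the model is: 119 | 120 | $$ 121 | Q = N \times D + N \times D \times H + H \times V 122 | $$ 123 | 124 | The training set is a sequence of words $w_1...w_t$ from a large but fixed vocabulary $V$. The goal is to learn a good model $f(w_t,w_{t-1},\dots,w_{t-n+2},w_{t-n+1})=p(w_t|w_1^{t-1})$ subject to $f(w_t,w_{t-1},\dots,w_{t-n+2},w_{t-n+1}) > 0$ and $\sum_{i=1}^{|V|} f(i,w_{t-1},\dots,w_{t-n+2},w_{t-n+1}) = 1$. Every input word is mapped to a vector by a mapping $C$, so $C(w_{t-1})$ is the word vector of $w_{t-1}$. Let $g$ be a feedforward or recurrent neural network whose output is a vector, the $i$-th element of which is the probability $p(w_t=i|w_1^{t-1})$. The training objective is still maximum likelihood plus a regularization term: 125 | 126 | $$ 127 | Max Likelihood = max \frac{1}{T}\sum_tlogf(w_t,w_{t-1},\dots,w_{t-n+2},w_{t-n+1};\theta) + R(\theta) 128 | $$ 129 | 130 | where $\theta$ denotes the parameters and $R(\theta)$ the regularization term. The output layer uses the softmax function: 131 | 132 | $$ 133 | p(w_t|w_{t-1},\dots,w_{t-n+2},w_{t-n+1})=\frac{e^{y_{w_t}}}{\sum_ie^{y_i}} 134 | $$ 135 | 136 | where $y_i$ is the unnormalized $log$ probability of output word $i$, computed as $y=b+Wx+Utanh(d+Hx)$. Here $b,W,U,d,H$ are parameters and $x$ is the input. Note that the input of an ordinary neural network is not optimized, whereas here $x=(C(w_{t-1}),C(w_{t-2}),\dots,C(w_{t-n+1}))$ is itself a parameter to be optimized. If the raw input $x$ is not connected directly to the output in the figure, we can set $b=0$ and $W=0$. With stochastic gradient ascent, the update rule is: 137 | 138 | $$ 139 | \theta + \epsilon \frac{\partial log p(w_t | 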
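w_{t-1},\dots,w_{t-n+2},w_{t-n+1})}{\partial \theta} \to \theta 140 | $$ 141 | 142 | where $\epsilon$ is the learning rate. Note again that the input layer of an ordinary neural network is just an input value, whereas here the input layer $x$ is also a parameter (stored in $C$) and must be optimized. When optimization finishes we have both the word vectors and the language model. The softmax keeps every probability strictly inside (0,1), so no probability is ever 0; the model is smoothed by construction and does not need the complicated smoothing algorithms of traditional N-gram models. Bengio's comparison on the APNews dataset also shows that his model performs 10% to 20% better than ordinary N-gram models with carefully designed smoothing. 143 | 144 | ![](http://7xlgth.com1.z0.glb.clouddn.com/2288BF90-FD22-493A-B703-C5AB32726FF2.png) 145 | 146 | ### Recurrent Neural Network Language Models 147 | 148 | A good language model should capture at least two properties of natural language: syntax and semantics. To keep the syntax correct we usually only need to look at the context immediately preceding the generated word, which means syntactic properties are mostly local. Semantic consistency is much harder: we need a large amount of context, possibly the whole document or corpus, to get the global semantics right. Neural network language models are more expressive and generalize better than classic N-gram models, but neither the traditional N-gram model nor the neural network language model of [Bengio et al., 2003] captures global semantic information effectively. To address this, the language model based on a Recurrent Neural Network (RNN) proposed in [Mikolov et al., 2010; 2011] uses a hidden state to record the history of the word sequence and can capture long-range dependencies in language. In natural language, two words far apart in a sentence are often related syntactically and semantically. For example, in `He doesn't have very much confidence in himself` and `She doesn't have very much confidence in herself`, the word pairs (He, himself) and (She, herself) are fixed even though the words between them may change. Such dependencies are not limited to English; Chinese and Russian also contain many word pairs of this kind. Another typical long-term dependency is the selectional preference, which is based on knowledge about who typically does what. In `I eat salad with a fork` and `I eat salad with my friend`, `a fork` is a tool while `my friend` is a companion. `I eat salad with a backpack` sounds strange, because `a backpack` is neither a tool nor a companion; violating selectional preferences produces many meaningless sentences. Finally, a sentence or document usually belongs to a topic: a sentence about sports in the middle of a technical document looks out of place, which is a violation of topic consistency. 149 | 150 | The application of recurrent neural networks to machine translation described in [Eriguchi et al., 2016] is a good reference, since it handles long-term dependencies effectively. The key idea is to include a previous hidden state $\vec{h}_{t-1}$ when computing the new hidden state $\vec{h}_t$, formally: 151 | 152 | $$ 153 | \begin{equation} 154 | \vec{h}_t = 155 | \begin{cases} 156 | tanh(W_{xh}\vec{x}_t + W_{hh}\vec{h}_{t-1} + \vec{b}_h), & t \geq 1, \\ 157 | 0, & \text{otherwise} 158 | \end{cases} 159 | \end{equation} 160 | $$ 161 | 162 | For $t \geq 1$, the only difference from the hidden-layer formula of a standard neural network is the extra connection $W_{hh}\vec{h}_{t-1}$, which comes from the hidden state at the previous time step. With this basic understanding of RNNs we can apply them directly to language modeling by adding a recurrent connection to the neural network language model discussed above: 163 | 164 | $$ 165 | \begin{equation} 166 | \begin{split} 167 | \vec{m}_t = M_{\cdot,w_{t-1}} 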
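\\ 168 | \vec{h}_t = 169 | \begin{cases} 170 | tanh(W_{xh}\vec{m}_t + W_{hh}\vec{h}_{t-1} + \vec{b}_h), & t \geq 1, \\ 171 | 0, & \text{otherwise} 172 | \end{cases} \\ 173 | \vec{p}_t = softmax(W_{hs}\vec{h}_t + b_s) 174 | \end{split} 175 | \end{equation} 176 | $$ 177 | 178 | Note that, unlike the feedforward neural network language model above, the recurrent model takes only the previous word as input rather than the previous two words. This is because we assume the information about $w_{t-2}$ is already contained in the hidden state $\vec{h}_{t-1}$ and does not need to be fed in again. 179 | 180 | #### Smoothing 181 | 182 | The first method is smoothing. The simplest approach adds 1 to the count of every n-tuple, so an n-tuple that occurred k times is counted k+1 times and one that occurred 0 times is counted once. This is called Laplace smoothing. There are many more sophisticated smoothing methods; in essence they all turn the model into a Bayesian model and introduce a prior distribution so that the likelihood no longer dominates. Different ways of introducing the prior give different smoothing methods. 183 | 184 | #### Back-off 185 | 186 | The second method is back-off. It resembles post-pruning in decision trees: if the probability of an n-gram is not available, back off one step and approximate it with the (n-1)-gram probability multiplied by a weight. 187 | 188 | ## N-Pos Model (Context = $c(w_{t-n+1}),c(w_{t-n+2}),\dots,c(w_{t-1})$) 189 | 190 | Strictly speaking, N-Pos is a derivative of N-Gram. The N-Gram model assumes that the probability of the t-th word depends on the previous N-1 words, but in practice the probability of many words depends on the grammatical function of the preceding words. The N-Pos model is built on this assumption: it groups words by grammatical function, and these word classes determine the probability of the next word. Such classes are called parts of speech (POS). In the N-Pos model the probability of a sentence is: 191 | 192 | $p(s)=p(w^T_1)=p(w_1,w_2,\dots,w_T)= \\ \prod^T_{t=1}p(w_t|c(w_{t-n+1}),c(w_{t-n+2}),\dots,c(w_{t-1}))$ 193 | 194 | $c$ is the class mapping function, which maps $T$ words to $K$ classes ($1 \le K \le T$). N-Pos in effect uses clustering: the $T^{n-1}$ possible contexts $w_{t-n+1},w_{t-n+2},\dots,w_{t-1}$ of the N-Gram model are reduced to the $K^{n-1}$ possible contexts $c(w_{t-n+1}),c(w_{t-n+2}),\dots,c(w_{t-1})$, and the reduction uses linguistically meaningful classes. 195 | 196 | ## Decision-Tree Language Models 197 | 198 | The context-independent model, the n-gram model, the n-pos model, and so on can all be expressed as statistical decision trees, in which the decision rule at each node is a question about the context, such as "Is the previous word w?" or "Does the previous word belong to class C?". Decision-tree language models can be more flexible than that and ask more complex syntactic or semantic questions such as "Is the previous word a verb?" or "Is there a preposition later on?". Their advantage is that the number of distributions is not fixed in advance but determined by the training corpus, which makes them more flexible. Their disadvantage is that building the statistical decision tree is hard and expensive in both time and space. 199 | 200 | ## Maximum Entropy Models 201 | 202 | The maximum entropy principle was proposed by E. T. 
Jaynes in the 1950s. Its basic idea is that, when predicting the probability distribution of a random event, we should satisfy all known constraints and make no subjective assumptions about what is unknown. In information-theoretic terms: when we have only partial knowledge of an unknown distribution, we should choose the distribution that is consistent with this knowledge and has maximum entropy. 203 | 204 | $p(w|Context)=\frac{e^{\sum_i \lambda_i f_i(context,w)}}{Z(Context)}$ 205 | 206 | ## Adaptive Language Models 207 | 208 | In the models above the probability distributions are estimated from the training corpus in advance; they are static language models. An adaptive language model resembles online learning: it adjusts the model dynamically from a small amount of new data and is therefore a dynamic model. In natural language it often happens that a word that is normally rare suddenly appears many times in a local stretch of text. A model that adjusts its probability distributions according to how words occur in the local text is called a dynamic, adaptive, or cache-based language model. The usual approach is to combine a static model and a dynamic model through a mixing parameter; the mixed model effectively mitigates data sparsity. There are also topic-dependent adaptive language models. An intuitive example: train one language model on sports text only and keep the overall language model trained on the whole corpus; when new data belongs to the sports category, the model to use is a mixture of the sports topic model and the overall model. 209 | 210 | ## Skip-Gram 211 | 212 | - [A Closer Look at Skip-gram Modelling](http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf) 213 | 214 | According to the definition in the paper, the `k-skip-n-grams` of a sentence $w_1 \dots w_m$ can be written as: 215 | 216 | $\{ w_{i_1},w_{i_2}, \dots w_{i_n} | \sum_{j=1}^{n}i_j - i_{j-1} < k \}$ 217 | 218 | The definition of skip-gram is simple: words are allowed to be skipped. Following the definition in the paper, take this sentence: 219 | 220 | > Insurgents killed in ongoing fighting. 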
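For the example sentence above, k-skip-n-grams can be enumerated with a short Python sketch. The helper name `skip_ngrams` is made up for illustration, and the sketch assumes the reading "at most k tokens skipped in total between the first and last word of the gram", under which a plain n-gram is the case k = 0.

```python
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """Return all n-grams that skip at most k tokens in total (hypothetical helper)."""
    grams = []
    for idx in combinations(range(len(tokens)), n):
        # Tokens skipped between the first and the last chosen position.
        skipped = idx[-1] - idx[0] + 1 - n
        if skipped <= k:
            grams.append(" ".join(tokens[i] for i in idx))
    return grams


tokens = "insurgents killed in ongoing fighting".split()
bigrams = skip_ngrams(tokens, n=2, k=0)
skip_bigrams = skip_ngrams(tokens, n=2, k=2)
skip_trigrams = skip_ngrams(tokens, n=3, k=2)
print(len(bigrams), len(skip_bigrams), len(skip_trigrams))
```

Under this assumption the five-word sentence has 4 bi-grams, 9 2-skip-bi-grams, and 10 2-skip-tri-grams. Enumerating index combinations is quadratic or worse in sentence length, which is fine for illustration but not for large corpora.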
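221 | 222 | As bi-grams it is split into: { `insurgents killed, killed in, in ongoing, ongoing fighting` }. 223 | 224 | As 2-skip-bi-grams: { `insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting` }. 225 | 226 | As tri-grams: { `insurgents killed in, killed in ongoing, in ongoing fighting` }. 227 | 228 | As 2-skip-tri-grams: { `insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting` }. 229 | 230 | For the language-model objective given earlier, the Skip-Gram model defines $p(w_{t+j} | w_t)$ with the softmax function: 231 | 232 | $p(w_o | w_I) = \frac{exp(v'^T_{w_o}v_{w_I})}{\sum^W_{w=1}exp(v'^T_wv_{w_I})}$ 233 | 234 | Here $p(w_o | w_I)$ is the probability of $w_o$ given the word $w_I$, $v_{w_o}$ is the word vector of $w_o$, $v_w$ ranges over the vectors of all words in the vocabulary, and $W$ is the vocabulary size. This formula is impractical, because $W$ is very large, typically $10^5$ to $10^7$. 235 | 236 | ### Hierarchical Softmax 237 | 238 | This is a variant of the original skip-gram model. Assume a binary tree in which every leaf corresponds to exactly one word of the vocabulary, so each word can be reached by a path through the tree. For example, we can build a Huffman tree over the vocabulary based on word frequency. Every word then has a Huffman code, which describes its path in the tree. Every inner node defines probabilities for its children; the probabilities of the left and right child differ and depend on the input. Suppose the training text contains the sentence "我爱中国" (I love China). 239 | 240 | Input: 爱 241 | 242 | Prediction: 我 243 | 244 | Suppose the Huffman code of "我" is 1101. We walk down the Huffman tree from the root, and at each step we use the current node and the vector of "爱" to compute (how exactly is not important yet) the probability of moving to the next node. This yields a sequence of probabilities, and our goal is to maximize their product (the joint probability). 245 | 246 | $p(w|w_I)=\Pi_{j=1}^{L(w)-1}\sigma([n(w,j+1)=ch(n(w,j))]*v'^T_{n(w,j)}v_{w_I})$ 247 | 248 | - $L(w)$ is the length of the path to word $w$ in the binary tree 249 | - $\sigma(*)$ is the sigmoid function 250 | - $n(w,j+1)$ is the $(j+1)$-th node on the path from the root to $w$ 251 | - $ch(n(w,j))$ is an arbitrary but fixed child of node $n(w,j)$, either the left or the right one; the bracket $[\cdot]$ is 1 if the condition holds and -1 otherwise. Taken together, the left and right children always get opposite signs: left negative and right positive, or the other way round. 252 | 253 | For a single choice between the left and right child, the probabilities are: 254 | 255 | $\sigma(x)=\frac{1}{1+e^{-x}} \\ \sigma(-x)=\frac{1}{1+e^{x}} \\ \sigma(x) + \sigma(-x) = 1 $ 256 | 257 | The cost of computing this joint probability depends on the length of the word's path in the Huffman tree, which is much smaller than $W$. Moreover, because the Huffman tree is built from word frequencies, frequent words have short Huffman codes and are fast to compute. Frequent words need their probabilities computed most often, and the Huffman tree 
makes those computations faster for frequent words than for rare ones, which is a clever design. 258 | 259 | ## NNLM 260 | 261 | NNLM stands for Neural Network Language Model. The most worthwhile paper on neural network language models is "A Neural Probabilistic Language Model" (JMLR 2003) by Bengio, one of the leading figures of deep learning. NNLM uses a Distributed Representation, in which every word is represented as a vector of floating-point numbers. The model is shown below: 262 | 263 | The goal is to learn a good model: 264 | 265 | $f(w_t,w_{t-1},\dots,w_{t-n+2},w_{t-n+1})=p(w_t|w_1^{t-1})$ 266 | 267 | subject to the following constraints: 268 | 269 | In the figure above, 270 | -------------------------------------------------------------------------------- /经典自然语言/统计语言模型/词表示.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing 2 | 3 | 4 | 5 | Natural language processing (NLP) is the study of how computers process human language. It includes: 6 | 7 | 1. Syntactic and semantic analysis: for a given sentence, perform word segmentation, part-of-speech tagging, named entity recognition and linking, syntactic parsing, semantic role labeling, and word sense disambiguation. 8 | 9 | 2. Information extraction: extract the important information from a given text, such as time, place, people, events, causes, results, numbers, dates, currency amounts, and proper nouns. Put simply, the aim is to find out who did what to whom, when, why, and with what result. Key techniques include entity recognition, time extraction, and causal relation extraction. 10 | 11 | 3. Text mining (or text data mining): text clustering, classification, information extraction, summarization, sentiment analysis, and visual or interactive presentation of the mined information and knowledge. The mainstream techniques are currently based on statistical machine learning. 12 | 13 | 4. Machine translation: automatically translate input text in a source language into text in another language. Depending on the input medium, it can be divided into text translation, speech translation, sign language translation, image translation, and so on. Machine translation has moved from the earliest rule-based methods to the statistical methods of twenty years ago and on to today's neural (encoder-decoder) methods, gradually forming a fairly rigorous body of methods. 14 | 15 | 5. Information retrieval: index large collections of documents. The index can be built simply by assigning different weights to the words in the documents, or more deeply by using the techniques of items 1, 2, and 3. At query time, the query expression, such as a keyword or a sentence, is analyzed, matching candidate documents are looked up in the index, the candidates are ranked by a ranking mechanism, and the highest-ranked documents are returned. 16 | 17 | 6. Question answering: given a question in natural language, the system returns a precise answer. This requires some degree of semantic analysis of the query, including entity linking and relation recognition, to form a logical expression; candidate answers are then looked up in a knowledge base and the best answer is selected by a ranking mechanism. 18 | 19 | 7. Dialogue systems: the system chats with the user, answers questions, and completes tasks through a series of dialogue turns. This involves user intent understanding, a general chat engine, a question answering engine, and dialogue management. To be context-aware, the system needs multi-turn dialogue capability; to be personalized, it needs user profiles and replies based on them. NLP has moved from rule-based methods to statistical methods. The mathematical models of statistical NLP are similar to, or even the same as, those of communication theory, but it took scientists decades to realize this. Statistical language models were originally developed for speech recognition, where the computer needs to know whether a sequence of words forms a meaningful sentence. 20 | 21 | **Easy** 22 | 23 | - Spell checking 24 | - Keyword search 25 | - Finding synonyms 26 | 27 | **Medium** 28 | 29 | - Extracting information from the web or from documents 30 | 31 | **Hard** 32 | 33 | - Machine translation (often called the holy grail of NLP) 34 | - Semantic analysis (what a sentence means) 35 | - Coreference (which entity a pronoun such as "he" or "this" refers to in a sentence) 36 | - Question answering (Siri, Google Now, Cortana, etc.) 37 | 38 | # Word Representation 39 | 40 | 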
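To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent these symbols mathematically. 41 | 42 | ## Dictionaries 43 | 44 | In everyday life we look up a word in a dictionary to learn its meaning, which amounts to expressing one word with other words or phrases. The same approach exists in computing: [WordNet](http://wordnet.princeton.edu/), for example, is essentially an electronic English dictionary. 45 | 46 | However, this approach has several problems: 47 | 48 | - There are many synonyms, which is inconvenient for computation 49 | - Updates are slow, and new words cannot be added automatically 50 | - Definitions are noticeably subjective 51 | - It has to be created and maintained by hand 52 | - Word similarity is hard to compute 53 | - Computation in general is hard, because a computer fundamentally only understands 0 and 1 54 | 55 | ## One-hot Representation: Count-based Word Vectors 56 | 57 | The most intuitive word representation in NLP, and long the most common one, is the One-hot Representation, which represents every word as a very long vector. The dimension of the vector is the vocabulary size; almost all elements are 0 and exactly one element is 1, and that element identifies the word. 58 | Imagine, for example, a land of elves whose language is very simple and consists of only three sentences: 59 | 60 | 1. I like NLP. 61 | 2. I like deep learning. 62 | 3. I enjoy flying. 63 | 64 | The dictionary of this land is [I, like, NLP, deep, learning, enjoy, flying, .]. Yes, punctuation counts as a word too. To represent a word as a vector, we create a vector with as many dimensions as there are dictionary entries, set one position to 1, and set all other positions to 0. This gives a one-hot word vector. 65 | 66 | For example, in our land of elves the vector for I is [1 0 0 0 0 0 0 0], and the vector for deep is [0 0 0 1 0 0 0 0]. 67 | 68 | This looks good: we have finally turned words into 0s and 1s that a computer can work with. But the representation has a problem: synonyms cannot be expressed, because they are different vectors. How can we solve this? We can represent a word through its context. Another advantage of this is that a word often has several meanings, and which one applies in a given sentence is usually determined by its context. 69 | 70 | Stored sparsely, the One-hot Representation is very compact: each word is simply assigned a numeric ID. For example, 话筒 (microphone) could be 3 and 麦克 (mike) could be 8, counting from 0. In code, a hash table that assigns a number to every word is enough. 71 | 72 | 73 | 74 | Combined with algorithms such as maximum entropy, SVM, and CRF, this compact representation has handled the mainstream tasks of NLP well. 75 | 76 | It does, however, have an important problem, the "lexical gap": any two words are isolated from each other. The two vectors alone do not tell us whether the words are related, not even for synonyms such as 话筒 and 麦克. 77 | 78 | ## Context-based Representation 79 | 80 | > You shall know a word by the company it keeps. --- (J. R. 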
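Firth 1957: 11) 81 | 82 | Word Vectors (Distributed Representation) 83 | 84 | Word vectors in deep learning are generally not the long one-hot vectors described above but low-dimensional real-valued vectors known as a **Distributed Representation** (not to be confused with "Distributional Representation", which is a different concept). Such a vector typically looks like this: [0.792, −0.177, −0.107, 0.109, −0.542, …]. 50 and 100 dimensions are common. The representation is not unique; the main methods for computing such vectors are discussed later. In my view, the greatest contribution of the distributed representation 85 | 86 | is that related or similar words become closer in distance. The distance between vectors can be measured with the classic Euclidean distance or with the cosine of the angle between them. With this representation, the distance between 麦克 (mike) and 话筒 (microphone) is much smaller than the distance between 麦克 and 天气 (weather). Ideally 麦克 and 话筒 would have exactly the same representation, but because some people also write the English name "Mike" (迈克) as 麦克, the word 麦克 picks up some of the semantics of a personal name and is therefore not identical to 话筒. 87 | 88 | # Document Representation 89 | 90 | ## Bag-of-Words 91 | 92 | The bag-of-words (BOW) model was first used in text classification to represent documents as feature vectors. Its basic idea is to ignore word order, grammar, and syntax and to treat a text simply as a collection of words, each independent of the others. In short, every document is seen as a bag (a bag of words, hence the name), and the document is classified by looking at which words the bag contains. If a document contains many words such as pig, horse, cow, sheep, valley, land, and tractor, and few words such as bank, building, car, and park, we tend to judge it to be about the countryside rather than the city. As an example, take the following two documents: 93 | 94 | Document 1: Bob likes to play basketball, Jim likes too. 95 | 96 | Document 2: Bob also likes to play football games. 97 | 98 | BOW is not just an intuition; mathematically it rests on the concept of the [multiset](https://zh.m.wikipedia.org/wiki/%E5%A4%9A%E9%87%8D%E9%9B%86) from discrete mathematics, which generalizes the concept of a set. In a set, an element can occur only once, so the set only records presence or absence. In a multiset, the same element can occur several times. The formal notion of a multiset appeared around the 1970s. The cardinality of a multiset is computed as for an ordinary set, except that an element occurring several times is counted as many times as it occurs. The number of times an element occurs in a multiset is called its multiplicity. For example, {1,2,3} is a set, whereas {1,1,1,2,2,3} is not a set but a multiset, in which the multiplicity of 1 is 3, that of 2 is 2, and that of 3 is 1; {1,1,1,2,2,3} has 6 elements. To distinguish multisets from ordinary sets, square brackets are sometimes used instead of braces, so {1,1,1,2,2,3} is written [1,1,1,2,2,3]. Unlike tuples or arrays, the elements of a multiset are unordered, so [1,1,1,2,2,3] and [1,1,2,1,2,3] are the same multiset. 99 | 100 | From these two documents we build a dictionary: 101 | 102 | Dictionary = {1. “Bob”, 2. “like”, 3. “to”, 4. “play”, 5. “basketball”, 6. “also”, 7. “football”, 8. “games”, 9. “Jim”, 10. 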
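“too”}. 103 | 104 | The dictionary contains 10 distinct words. Using the dictionary indices, each of the two documents can be represented as a 10-dimensional vector, in which an integer from 0 to n (n a positive integer) gives the number of times a word occurs in the document: 105 | 106 | 1:[1, 2, 1, 1, 1, 0, 0, 0, 1, 1] 107 | 108 | 2:[1, 1, 1, 1 ,0, 1, 1, 1, 0, 0] 109 | 110 | Each element of a vector is the number of times the corresponding dictionary entry occurs in the document (referred to below as the word histogram). Note that in building the document vectors we did not record the order in which the words appeared in the original sentences; this is one of the drawbacks of the bag-of-words model, although a minor one that does not matter here. 111 | 112 | # Distributed Representation 113 | 114 | 分布式表示的 115 | 116 | The meaning of a word can be inferred from the other words in its context, for example in `saying that Europe needs unified banking regulation to replace the hodgepodge` and `government debt problems turning into banking cries as has happened in` 117 | -------------------------------------------------------------------------------- /经典自然语言/词嵌入/99~参考资料/2023~Embeddings: What they are and why they matter.md: -------------------------------------------------------------------------------- 1 | > [Original article](https://simonwillison.net/2023/Oct/23/embeddings/) 2 | 3 | # Embeddings: What they are and why they matter 4 | -------------------------------------------------------------------------------- /经典自然语言/词嵌入/概述.md: -------------------------------------------------------------------------------- 1 | Pre-trained word vectors have led natural language processing for a long time. Word2vec[4] was proposed in 2013 as an approximation to language modeling. At that time hardware was much slower than it is now and deep learning models were not yet widely supported, so Word2vec was adopted for its efficiency and ease of use. Since then the standard way to run an NLP project has barely changed: pre-train word embeddings on a large amount of unlabeled data with algorithms such as Word2vec and GloVe[5], use the embeddings to initialize the first layer of a neural network, and train the rest of the network on the remaining data for the specific task. On most tasks with limited training data this improves accuracy by 2 to 3 percentage points [6]. Despite their great influence, these pre-trained word embeddings have one major limitation: they bring prior knowledge only into the first layer of the model, while the rest of the network still has to be trained from scratch. 2 | 3 | ![](https://ww1.sinaimg.cn/large/007rAy9hgy1fz3vrajfpbj30u00aldj6.jpg) 4 | 5 | Relations captured by word2vec (source: TensorFlow tutorial) 6 | Word2vec and related methods are shallow methods that trade expressiveness for efficiency. Using word embeddings is like initializing a computer vision model with pre-trained representations that encode only image edges: it helps on many tasks but cannot capture higher-level information that may be more useful. A model initialized with word embeddings has to learn from scratch not only how to disambiguate words but also how to understand the meaning of word sequences. This is the core of language understanding, and it requires modeling complex linguistic phenomena such as compositionality, polysemy, anaphora, long-term 
dependencies, agreement, and negation. As a result, NLP models initialized with these shallow representations still need a large number of training examples to perform well. 7 | -------------------------------------------------------------------------------- /经典自然语言/词嵌入/词向量/基于 Gensim 的 Word2Vec 实践.md: -------------------------------------------------------------------------------- 1 | # Word2Vec 2 | 3 | - [基于 Gensim 的 Word2Vec 实践](https://zhuanlan.zhihu.com/p/24961011) is part of the author's [程序猿的数据科学与机器学习实战手册](https://github.com/wx-chevalier/DataScience-And-MachineLearning-Handbook-For-Coders); the code is in [gensim.ipynb](https://github.com/wx-chevalier/DataScience-And-MachineLearning-Handbook-For-Coders/blob/master/code/python/nlp/genism/gensim.ipynb). Recommended prior reading: [Python 语法速览与机器学习开发环境搭建](https://zhuanlan.zhihu.com/p/24536868), [Scikit-Learn 备忘录](https://zhuanlan.zhihu.com/p/24770526). 4 | 5 | ![](https://i.ytimg.com/vi/xMwx2A_o5r4/maxresdefault.jpg) 6 | 7 | > - [Word2Vec Tutorial](https://rare-technologies.com/word2vec-tutorial/) 8 | > - [Getting Started with Word2Vec and GloVe in Python](http://textminingonline.com/getting-started-with-word2vec-and-glove-in-python) 9 | 10 | ## Creating a Model 11 | 12 | The Word2Vec model in [Gensim](http://radimrehurek.com/gensim/models/word2vec.html) expects a list of tokenized sentences as input, that is, a two-dimensional array. Here we use a built-in Python list for now, although it takes a lot of RAM when the input dataset is large. Gensim only requires an iterable, ordered sequence of sentences, so in practice we can use a custom generator that keeps only one sentence in memory at a time. 13 | 14 | ``` 15 | # Import word2vec 16 | from gensim.models import word2vec 17 | 18 | # Configure logging 19 | import logging 20 | 21 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 22 | 23 | # Define the dataset 24 | raw_sentences = ["the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"] 25 | 26 | # Split the sentences into words 27 | sentences = [s.split() for s in raw_sentences] 28 | 29 | # Build the model 30 | model = word2vec.Word2Vec(sentences, min_count=1) 31 | 32 | # Compare the similarity of two words 33 | model.wv.similarity('dogs','you') 34 | ``` 35 | 36 | Calling `Word2Vec` to create a model actually iterates over the data twice: the first pass counts word frequencies to build the internal dictionary tree, and the second pass trains the neural network. The two steps can be run separately, which allows manual control for streams that cannot be replayed (such as streaming data from Kafka): 37 | 38 | 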
``` 39 | model = gensim.models.Word2Vec(epochs=1) # an empty model, no training yet (gensim >= 4.0; older releases use iter=1) 40 | model.build_vocab(some_sentences) # can be a non-repeatable, 1-pass generator 41 | model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs) # can be a non-repeatable, 1-pass generator 42 | ``` 43 | 44 | ### Word2Vec Parameters 45 | 46 | - min_count 47 | 48 | ``` 49 | model = Word2Vec(sentences, min_count=10) # default value is 5 50 | ``` 51 | 52 | The minimum word frequency we want depends on the size of the corpus. In a large corpus, for example, we want to ignore words that occur only once or twice, which is what the `min_count` parameter controls. Reasonable values are generally between 0 and 100. 53 | 54 | - size 55 | 56 | The `size` parameter (renamed `vector_size` in gensim 4.0) sets the dimensionality of the word vectors, that is, the size of the hidden layer rather than the number of layers; the default in Word2Vec is 100. A larger value needs more training data but can improve overall accuracy; reasonable values range from tens to a few hundred. 57 | 58 | ``` 59 | model = Word2Vec(sentences, vector_size=200) # default value is 100 60 | ``` 61 | 62 | - workers 63 | 64 | The `workers` parameter sets the number of threads used for parallel training; it only takes effect when `Cython` is installed: 65 | 66 | ``` 67 | model = Word2Vec(sentences, workers=4) # number of worker threads 68 | ``` 69 | 70 | ## External Corpora 71 | 72 | In real training scenarios we usually train on a larger corpus. Taking the [text8](http://mattmahoney.net/dc/text8.zip) corpus used by the original Word2Vec as an example, we only need to change the source of the corpus passed to the model: 73 | 74 | ``` 75 | sentences = word2vec.Text8Corpus('text8') 76 | model = word2vec.Word2Vec(sentences, vector_size=200) 77 | ``` 78 | 79 | The sentences in this corpus are already tokenized and can be used directly. I ran into an error the first time I used this class, so the Gensim source code is reproduced here, which also makes it easier to write custom readers for other corpora: 80 | 81 | ``` 82 | class Text8Corpus(object): 83 | """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip .""" 84 | def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH): 85 | self.fname = fname 86 | self.max_sentence_length = max_sentence_length 87 | 88 | def __iter__(self): 89 | # the entire corpus is one gigantic line -- there are no sentence marks at all 90 | # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens 91 | sentence, rest = [], b'' 92 | with utils.smart_open(self.fname) as fin: 93 | while True: 94 | text = rest + fin.read(8192) # avoid loading the entire file (=1 line) into RAM 95 | if text == rest: # EOF 96 | words = 
utils.to_unicode(text).split() 97 | sentence.extend(words) # return the last chunk of words, too (may be shorter/longer) 98 | if sentence: 99 | yield sentence 100 | break 101 | last_token = text.rfind(b' ') # last token may have been split in two... keep for next iteration 102 | words, rest = (utils.to_unicode(text[:last_token]).split(), 103 | text[last_token:].strip()) if last_token >= 0 else ([], text) 104 | sentence.extend(words) 105 | while len(sentence) >= self.max_sentence_length: 106 | yield sentence[:self.max_sentence_length] 107 | sentence = sentence[self.max_sentence_length:] 108 | ``` 109 | 110 | 我们在上文中也提及,如果是对于大量的输入语料集或者需要整合磁盘上多个文件夹下的数据,我们可以以迭代器的方式而不是一次性将全部内容读取到内存中来节省 RAM 空间: 111 | 112 | ``` 113 | class MySentences(object): 114 | def __init__(self, dirname): 115 | self.dirname = dirname 116 | 117 | def __iter__(self): 118 | for fname in os.listdir(self.dirname): 119 | for line in open(os.path.join(self.dirname, fname)): 120 | yield line.split() 121 | 122 | sentences = MySentences('/some/directory') # a memory-friendly iterator 123 | model = gensim.models.Word2Vec(sentences) 124 | ``` 125 | 126 | ## 模型保存与读取 127 | 128 | ``` 129 | model.save('text8.model') 130 | 2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None 131 | 2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm 132 | 2015-02-24 11:19:26,060 : INFO : storing numpy array 'syn0' to text8.model.syn0.npy 133 | 2015-02-24 11:19:26,742 : INFO : storing numpy array 'syn1' to text8.model.syn1.npy 134 | 135 | model1 = Word2Vec.load('text8.model') 136 | 137 | model.save_word2vec_format('text.model.bin', binary=True) 138 | 2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin 139 | 140 | model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True) 141 | 2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin 142 | 2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from 
text.model.bin 143 | 2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors 144 | ``` 145 | 146 | ## 模型预测 147 | 148 | Word2Vec 最著名的效果即是以语义化的方式推断出相似词汇: 149 | 150 | ``` 151 | model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1) 152 | [('queen', 0.50882536)] 153 | model.doesnt_match("breakfast cereal dinner lunch";.split()) 154 | 'cereal' 155 | model.similarity('woman', 'man') 156 | 0.73723527 157 | model.most_similar(['man']) 158 | [(u'woman', 0.5686948895454407), 159 | (u'girl', 0.4957364797592163), 160 | (u'young', 0.4457539916038513), 161 | (u'luckiest', 0.4420626759529114), 162 | (u'serpent', 0.42716869711875916), 163 | (u'girls', 0.42680859565734863), 164 | (u'smokes', 0.4265017509460449), 165 | (u'creature', 0.4227582812309265), 166 | (u'robot', 0.417464017868042), 167 | (u'mortal', 0.41728296875953674)] 168 | ``` 169 | 170 | 如果我们希望直接获取某个单词的向量表示,直接以下标方式访问即可: 171 | 172 | ``` 173 | model['computer'] # raw NumPy vector of a word 174 | array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32) 175 | ``` 176 | 177 | ### 模型评估 178 | 179 | Word2Vec 的训练属于无监督模型,并没有太多的类似于监督学习里面的客观评判方式,更多的依赖于端应用。Google 之前公开了 20000 条左右的语法与语义化训练样本,每一条遵循`A is to B as C is to D`这个格式,地址在[这里](https://word2vec.googlecode.com/svn/trunk/questions-words.txt): 180 | 181 | ``` 182 | model.accuracy('/tmp/questions-words.txt') 183 | 2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342) 184 | 2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812) 185 | 2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380) 186 | 2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332) 187 | 2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702) 188 | 2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870) 189 | 2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482) 190 | 2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992) 191 | 2014-02-02 00:28:18,140 : INFO : 
gram9-plural-verbs: 68.7% (482/702) 192 | 2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614) 193 | ``` 194 | 195 | 还是需要强调下,训练集上表现的好也不意味着 Word2Vec 在真实应用中就会表现的很好,还是需要因地制宜。 196 | 197 | # 模型训练 198 | 199 | ## 简单语料集 200 | 201 | ## 外部语料集 202 | 203 | ## 中文语料集 204 | 205 | 约三十余万篇文章 206 | 207 | # 模型应用 208 | 209 | ## 可视化预览 210 | 211 | ``` 212 | from sklearn.decomposition import PCA 213 | import matplotlib.pyplot as plt 214 | 215 | 216 | def wv_visualizer(model, word = ["man"]): 217 | 218 | # 寻找出最相似的十个词 219 | words = [wp[0] for wp in model.most_similar(word,20)] 220 | 221 | # 提取出词对应的词向量 222 | wordsInVector = [model[word] for word in words] 223 | 224 | # 进行 PCA 降维 225 | pca = PCA(n_components=2) 226 | pca.fit(wordsInVector) 227 | X = pca.transform(wordsInVector) 228 | 229 | # 绘制图形 230 | xs = X[:, 0] 231 | ys = X[:, 1] 232 | 233 | # draw 234 | plt.figure(figsize=(12,8)) 235 | plt.scatter(xs, ys, marker = 'o') 236 | for i, w in enumerate(words): 237 | plt.annotate( 238 | w, 239 | xy = (xs[i], ys[i]), xytext = (6, 6), 240 | textcoords = 'offset points', ha = 'left', va = 'top', 241 | **dict(fontsize=10) 242 | ) 243 | 244 | plt.show() 245 | 246 | # 调用时传入目标词组即可 247 | wv_visualizer(model,["China","Airline"]) 248 | ``` 249 | -------------------------------------------------------------------------------- /经典自然语言/语法语义分析/命名实体识别.md: -------------------------------------------------------------------------------- 1 | # 命名实体识别 2 | 3 | 命名实体识别(Named Entity Recognition, NER),又称作“专名识别”,主要任务是识别出文本中的人名、地名等专有名称和有意义的时间、日期等数量短语并加以归类。对很多文本挖掘任务来说,命名实体识别系统是重要的组成部分:一方面,命名实体识别可以帮助识别未登录词,而根据 SIGHAN Bakeoff 的数据评测结果,未登录词造成的分词精度损失远大于歧义;另一方面,对关键词提取等任务来说,命名实体的类别是非常有用的文本特征。 4 | 5 | 命名实体是命名实体识别的研究主体,一般包括3大类(实体类、时间类和数字类)和7小类(人名、地名、机构名、时间、日期、货币和百分比)命名实体。当然对于某些特定的应用场景,也可以把产品名、电影电视剧名、编程类库名等作为命名实体的类别。时间、日期、货币等实体识别通常可以采用模式匹配的方式获得较好的识别效果,而人名、地名、机构名的识别方法则比较复杂。 6 | 7 | 命名实体识别的过程通常分两步:识别实体边界、确定实体类别。英语中的命名实体具有比较明显的形态标志,如人名、地名等实体中的每个词的第一个字母要大写等,所以实体边界识别相对来说比较容易。中文内在的特殊性决定了在文本处理时首先必须进行词法分析,中文命名实体识别的难度要比英文的难度大。 8 | 9 | 
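The text above notes that dates, currency amounts, and percentages can usually be recognized by pattern matching. A minimal sketch of that idea follows, using Python's standard `re` module. The `PATTERNS` table and the `find_numeric_entities` helper are hypothetical names introduced for illustration, and the patterns cover only a few common formats, so treat this as a starting point rather than a complete recognizer.

```python
import re

# Hypothetical, deliberately simple patterns; a real system needs many more formats.
PATTERNS = {
    "DATE": re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{4}年\d{1,2}月\d{1,2}日"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "MONEY": re.compile(r"[$¥€]\d+(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?元"),
}


def find_numeric_entities(text):
    """Return (start, end, label, surface) tuples sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)


# Each hit carries its character span, so it can later be merged with
# the output of a dictionary-based or statistical recognizer.
for hit in find_numeric_entities("2023年5月25日,销售额增长了12.5%,达到¥3,000。"):
    print(hit)
```

Returning character spans rather than bare strings is a design choice made here so that overlapping results from different recognizers can be reconciled afterwards.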
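A complete NER system should combine dictionaries, rules, and statistical learning.

1. Segment the raw text at a fine granularity; runs of consecutive single characters can serve as candidate named entities. Also detect paired punctuation such as “” and 《》; the text enclosed by them can serve as candidates too.
2. Mine proper-name dictionaries for each domain and run forward maximum matching over the candidates; whatever matches is very likely a named entity of the corresponding category.
3. Use statistical models such as hidden Markov models (HMM), maximum entropy (ME), and conditional random fields (CRF) for recognition; [命名实体识别调研](www.nilday.com/命名实体识别调研/) summarizes how each model performs.

--------------------------------------------------------------------------------
/行业应用/机器人问答/README.md:
--------------------------------------------------------------------------------
# Bot Question Answering

When a user asks the bot a question, the bot uses the conversation context to look up the most suitable solution in the knowledge base and returns the answer to the user. If the bot cannot resolve the question, a human agent takes over. Bot question answering is usually organized in one of two forms: FAQ (unstructured) and KB (structured). The question matching discussed here is for the FAQ form.

The knowledge base is made up of many FAQs. To build one for customer service, for example, you list every question the vertical covers and configure an answer for each, such as "How do I open a Taobao shop?", "How do I get a refund?", and "How do I reset my password?". Thousands of such FAQ pairs form the foundation of the knowledge base. When answering, the bot finds the standard question in the knowledge base (called a "knowledge item") that is closest to the user's question and replies with its answer.

![](https://assets.ng-tech.icu/item/20230525221648.png)

# Knowledge Base Organization

The knowledge base has a three-layer logical structure: standard questions, similar questions, and standard answers. Each standard question maps to one standard answer and has multiple similar questions under it. When matching, the bot searches both the standard questions and the similar questions.

![](https://assets.ng-tech.icu/item/20230616143530.png)

# Algorithm Architecture

A typical bot question-answering system produces results through several separate pipelines:

- Retrieval pipeline: BERT, HCNN, OpenSearch
- Generation pipeline: Transformer
- Rule pipeline: Trie tree, dependency-syntax-based generation
- Auxiliary algorithms: sensitive-word filtering, language model, keyword clustering

![](https://assets.ng-tech.icu/item/20230616143550.png)

## Retrieval Pipeline

### Fuzzy Search

The retrieval pipeline starts with a search in OpenSearch. Our retrieval index is refreshed periodically: the data is processed in ODPS and imported from ODPS into OpenSearch, with no real-time incremental updates, so OpenSearch's basic capabilities meet our needs. Public corpora such as Baidu Zhidao (百度知道) total 450 million entries. All corpora pass through ODPS UDFs, which first clean them and then deduplicate them.

At index-build time, the index is built from the output of the alinlp e-commerce tokenizer, with nouns and verbs handled separately. After segmenting with the alinlp e-commerce tokenizer, we build a search expression that gives nouns and verbs higher weights and run the search through OpenSearch.

### HCNN Re-ranking

Re-ranking compares the similarity of two sentences. There are generally two approaches: Sentence Interaction (SI) and Sentence Embedding (SE). SE maps the two sentences to independent vectors in the same space and then computes the cosine similarity between the two vectors; typical examples are DSSM and ABCNN. SI crosses the two sentences, for example by building a matrix from their vectors, and derives the sentence similarity from that matrix; a typical example is Pyramid.

HCNN stands for Hybrid CNN. It contains both SE and SI, built as two sub-networks side by side, one SI and one SE, and thereby combines the two approaches.

![](https://assets.ng-tech.icu/item/20230616143617.png)

### BERT Re-ranking
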
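BERT's feature extractor is the Transformer, which already showed very strong information-extraction ability on translation tasks. Combined with training on large amounts of data, we expect BERT to outperform HCNN.

## Generation Pipeline

Generation creates an answer from scratch, and the whole model generates end to end, so there are few places where humans can intervene. Control over the model is concentrated in model design, hyperparameter configuration, and the training process.

During development we tried Seq2Seq and ConSeq2Seq. These models often produced safe answers (generic replies) and disfluent sentences, modeled the scenario poorly, and generated many sentences that did not belong to the scenario. After we switched to a Transformer + Beam Search architecture, the Transformer's inherent parallelism, together with the richer semantics that Multi-head Attention captures within the network, greatly improved generation efficiency, the diversity of generated sentences, and sentence quality.

![](https://assets.ng-tech.icu/item/20230525221720.png)

## Rule Pipeline

### Trie Tree

A Trie is used to replace synonyms in the original sentence. The original Trie implementation includes fuzzy matching, semantic nodes, collective words, synonyms, and more, with the goal of broadening coverage and normalizing semantics. Our only requirement, however, is to replace synonyms in the original sentence, so many of those features go unused. Because modifying the original Trie would be costly, we wrote MiniTrie, which only does synonym replacement.

MiniTrie first reads the Trie's synonym file and builds a replacement graph in memory, then applies replacements to the input query and outputs the result. For synonym replacement, MiniTrie supports three matching modes (shortest match, longest match, and full match), selected through a parameter.

### Dependency-Syntax-Based Generation

![](https://assets.ng-tech.icu/item/20230616143634.png)

Subtrees that start from root are merged, then reordered or deleted, to construct new sentences. The process has two parts, training and prediction. Training relies on similar sentence pairs to build a library of allowed transformation rules:

1) Make sure the similar sentence pairs come from the same domain
2) Run dependency parsing on each sentence of the pair with Alinlp to obtain a chunk sequence containing the dependency relations and their labels
3) Merge the resulting chunk sequence with the chunk merger
4) Take the labels of the chunk sequence as the sentence representation, and add the pair of label sequences to the rule library as one rule

At prediction time, the rule library is used to filter the transformed sentences:

1) Run dependency parsing on the input sentence with Alinlp to obtain a chunk sequence containing the dependency relations and their labels
2) Merge the resulting chunk sequence with the chunk merger
3) Permute the positions of chunks, or delete chunks, to generate a candidate set
4) For each chunk sequence in the candidate set, take its labels as the representation and use the rule library to judge whether the transformation is valid; discard the invalid ones

![](https://assets.ng-tech.icu/item/20230616143652.png)

## Auxiliary Algorithms

### Language Model

A multi-layer bidirectional LSTM trained on Taobao-ecosystem (淘系) corpora

### Clustering

TextRank-based keyword clustering

### Sensitive-Word Filtering

Keyword filtering based on KFC's Aho-Corasick automaton and double-array trie

--------------------------------------------------------------------------------
/行业应用/聊天对话/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wx-chevalier/NLP-Notes/11ac6ed37b1c1e001b8d2139d218629a717eb625/行业应用/聊天对话/README.md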
--------------------------------------------------------------------------------