├── .gitignore
└── docs
    ├── .nojekyll
    ├── README.md
    ├── _sidebar.md
    ├── about.md
    ├── business.md
    ├── community.md
    ├── custom.css
    ├── images
        ├── blip2-1.png
        ├── blip2-2.png
        ├── blip2-3.png
        ├── blip2-4.png
        ├── blip2-5.png
        ├── blip2-6.png
        ├── blip2-7.png
        ├── blip2-8.png
        ├── blip2-9.png
        ├── blip2.png
        ├── cos_lr.png
        ├── deepnorm.png
        ├── dpp.jpg
        ├── eva.png
        ├── eva2.png
        ├── flashattention1.1.png
        ├── flashattention1.2.png
        ├── flashattention1.png
        ├── gqa.png
        ├── isoflops.png
        ├── llama3_post_approach.png
        ├── llama3_sft_mix.png
        ├── llama3_table3.png
        ├── llama3_vision.png
        ├── llama_fig3.png
        ├── logo.jpg
        ├── lr1.png
        ├── norm.png
        ├── pagedattn.png
        ├── pagedattn2.png
        ├── pagedattn3.png
        ├── perceiver.png
        ├── rope1.png
        ├── rope2.png
        ├── rope3.png
        ├── rope4.png
        ├── rope5.png
        ├── rope6.png
        ├── rope7.png
        ├── scale_law1.png
        ├── scale_law2.png
        ├── swish.png
        ├── video-chatgpt.png
        └── vit.png
    ├── index.html
    ├── llm_foundation.md
    ├── llm_foundation
        ├── FLOPs.md
        ├── cos_lr_schedule.md
        ├── llama3
        │   ├── llama3.md
        │   ├── llama3_128k.md
        │   ├── llama3_arch.md
        │   ├── llama3_params_count.md
        │   ├── llama3_post_training.md
        │   ├── llama3_pre_data.md
        │   ├── llama3_quote.md
        │   ├── llama3_recipe.md
        │   ├── llama3_scale.md
        │   ├── llama3_sft_data.md
        │   ├── llama3_speech.md
        │   └── llama3_vision.md
        ├── llm_architecture.md
        ├── llm_attention.md
        ├── memory_usage_training.md
        ├── norm.md
        └── swiglu.md
    ├── misellaneous
        └── ai_search_review.md
    ├── mllm.md
    ├── mllm
        ├── mllm_papers.md
        └── mllm_surveys.md
    ├── notes.md
    ├── notes
        ├── .ipynb_checkpoints
        │   ├── get_extended_attention_mask-checkpoint.ipynb
        │   └── pretrain_stage1-checkpoint.ipynb
        ├── blip2.md
        ├── bn.md
        ├── datasets_intro.md
        ├── dpo.md
        ├── dspy_intro.md
        ├── dspy_theory.md
        ├── eva.md
        ├── extend.html
        ├── get_extended_attention_mask.html
        ├── get_extended_attention_mask.ipynb
        ├── image_grounded_text_generation.html
        ├── image_grounded_text_generation.ipynb
        ├── image_text_contrastive_learning.html
        ├── image_text_contrastive_learning.ipynb
        ├── image_text_matching.html
        ├── image_text_matching.ipynb
        ├── itc_forward.html
        ├── itc_forward.ipynb
        ├── itg_forward.html
        ├── itg_forward.ipynb
        ├── itm_forward.html
        ├── itm_forward.ipynb
        ├── mim.md
        ├── minigpt4_vs_blip2.md
        ├── mulinfo_contrastive.md
        ├── o1_literatures.md
        ├── plan.md
        ├── pretrain_stage1.html
        ├── pretrain_stage1.ipynb
        ├── rope.md
        ├── rope_imp.html
        ├── rope_imp.ipynb
        ├── save_checkpoint.html
        ├── save_checkpoint.ipynb
        ├── scaliing_laws.md
        ├── video_chatgpt.md
        ├── vit.md
        └── weight_init.md
    ├── reading.md
    └── reading
        ├── 年度书单2022.md
        ├── 年度书单2023.md
        └── 年度书单2024.md


/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | docs/notes/.ipynb_checkpoints
3 | docs/*/.ipynb_checkpoints


--------------------------------------------------------------------------------
/docs/.nojekyll:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/.nojekyll


--------------------------------------------------------------------------------
/docs/README.md:
--------------------------------------------------------------------------------
 1 | # <img src="images/logo.jpg" alt="Logo" style="width: 40px; height: auto; vertical-align: middle; margin-left: 5px;">PromptExpert
 2 | 
 3 | ## 主页
 4 | https://promptexpert.github.io/blogs
 5 | 
 6 | 注：在GitHub上看可能出现文件路径问题，建议点击博客主页浏览。
 7 | 
 8 | ## 最新文章 
 9 | - 2025.03.08 [关于AI搜索的一些想法](misellaneous/ai_search_review.md)
10 | - 2025.03.xx [MLLM论文精选（持续更新）](mllm/mllm_papers.md)
11 | - 2024.08.09 [Llama 3精读系列](llm_foundation/llama3/llama3.md)  
12 | - 2024.07.30 [学习率和Cosine Learning Rate Schedule](llm_foundation/cos_lr_schedule.md)
13 | - 2024.07.27 [大模型中的Attention](llm_foundation/llm_attention.md)
14 | 
15 | 
16 | ## 联系
17 | - [社群](community.md)
18 | 
19 | ## 精选文章
20 | - [MLLM论文精选（持续更新）](mllm/mllm_papers.md)
21 | - [Llama 3精读系列](llm_foundation/llama3/llama3.md)  
22 | - [各种大模型架构，傻傻分不清楚？](llm_architecture.md)
23 | - [大模型中的Attention](llm_foundation/llm_attention.md)
24 | 
25 | 
26 | 
27 | <div style="position: absolute; top: 10px; right: 10px;">
28 |   <a class="github-button" href="https://github.com/PromptExpert/blogs" data-icon="octicon-star" data-show-count="true" aria-label="Star YOUR_GITHUB_USERNAME/YOUR_REPO_NAME on GitHub">点赞</a>
29 | </div>
30 | 
31 | <script async defer src="https://buttons.github.io/buttons.js"></script>
32 | 


--------------------------------------------------------------------------------
/docs/_sidebar.md:
--------------------------------------------------------------------------------
1 | - [首页](README.md)
2 | - [大语言模型基础](llm_foundation.md)
3 | - [多模态](mllm.md)
4 | - [笔记](notes.md)
5 | - [关于我](about.md)
6 | - [个人业务](business.md)
7 | - [年度书单](reading.md)
8 | 


--------------------------------------------------------------------------------
/docs/about.md:
--------------------------------------------------------------------------------
1 | 现就职于某知名互联网公司，曾经做过NLP，现在做CV、多模态和搜索。
2 | 
3 | 本硕毕业于中外合作大学，专业都是计算机科学。
4 | 
5 | 小红书prompt_expert，分享大模型最新动态、最新AI论文。
6 | 
7 | 邮箱: hai_hsu@outlook.com
8 | 


--------------------------------------------------------------------------------
/docs/business.md:
--------------------------------------------------------------------------------
 1 | # 个人业务
 2 | 
 3 | 联系方式: hai_hsu@outlook.com
 4 | 
 5 | 
 6 | ## 人工智能论文辅导
 7 | - 面相对象：人工智能领域在读硕博，想发论文，但是自己的导师太忙，想要找靠谱的辅导老师。
 8 | - 服务内容：专业老师一对一辅导，帮你寻找研究方向，设计论文框架，修改论文，投稿。
 9 | - 老师资质：这里有多名老师，均是人工智能领域名校博士，均有多篇顶会发表。学生可详细了解之后，选择中意的老师。
10 | - 计费方式：邮件联系详聊。
11 | 
12 | ## 多模态入门
13 | - 面相对象：在校本科生，研究生，有一定自学能力，想要入门多模态。
14 | - 服务内容：根据个人实际情况和学习目标，定制多模态学习路径，涵盖书籍、课程、论文、项目等，帮你少走弯路。
15 | - 服务方式：语音沟通或文字沟通，最后落实到文档。
16 | - 计费方式：初步咨询免费，之后150元/半小时。
17 | - 售后：（1）如果对定制完成的学习路径有疑问，可追加疑问半小时内不计费。超过半小时，按正常服务5折计费。（2）长期跟踪学习效果。（3）如果服务不满意，且理由充分，可最多退款一半。
18 | 
19 | ## 算法岗位模拟面试
20 | - 面相对象：相关专业在校学生和毕业2年以内的算法工程师，找实习、校招、社招岗位，领域包含CV、NLP、大模型、多模态、搜索。
21 | - 服务内容：（1）模拟面试，针对简历提出高质量问题，挖掘简历中的优势，排查简历中的劣势。（2）附赠简历优化建议。（3）附赠大厂中厂内推。
22 | - 服务方式：语音。
23 | - 计费方式：100元/半小时。
24 | 
25 | ## 论文宣传
26 | - 面相对象：已有论文投中会议，或发表在arxiv，想提高论文影响力。
27 | - 服务内容：发布小红书笔记，介绍宣传论文。
28 | - 计费方式：38元/篇。
29 | - 售后：如果点赞+收藏总量低于50，退款。
30 | 
31 | 
32 | 


--------------------------------------------------------------------------------
/docs/community.md:
--------------------------------------------------------------------------------
1 | # 社群
2 | 
3 | - 人工智能论文辅导QQ群: 202879592
4 | - 多模态学习和交流QQ群: 237976286
5 | - 算法校招2024QQ群: 202579945
6 | 


--------------------------------------------------------------------------------
/docs/custom.css:
--------------------------------------------------------------------------------
 1 | /* 自定义 sidebar 的宽度 */
 2 | .sidebar {
 3 |     width: 150px;
 4 | }
 5 | 
 6 | /* 将 200px 替换为你想要的宽度，相同于 sidebar 的宽度 */
 7 | .content {
 8 |     margin-left: 150px; 
 9 | }
10 | 


--------------------------------------------------------------------------------
/docs/images/blip2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-1.png


--------------------------------------------------------------------------------
/docs/images/blip2-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-2.png


--------------------------------------------------------------------------------
/docs/images/blip2-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-3.png


--------------------------------------------------------------------------------
/docs/images/blip2-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-4.png


--------------------------------------------------------------------------------
/docs/images/blip2-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-5.png


--------------------------------------------------------------------------------
/docs/images/blip2-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-6.png


--------------------------------------------------------------------------------
/docs/images/blip2-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-7.png


--------------------------------------------------------------------------------
/docs/images/blip2-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-8.png


--------------------------------------------------------------------------------
/docs/images/blip2-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2-9.png


--------------------------------------------------------------------------------
/docs/images/blip2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/blip2.png


--------------------------------------------------------------------------------
/docs/images/cos_lr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/cos_lr.png


--------------------------------------------------------------------------------
/docs/images/deepnorm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/deepnorm.png


--------------------------------------------------------------------------------
/docs/images/dpp.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/dpp.jpg


--------------------------------------------------------------------------------
/docs/images/eva.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/eva.png


--------------------------------------------------------------------------------
/docs/images/eva2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/eva2.png


--------------------------------------------------------------------------------
/docs/images/flashattention1.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/flashattention1.1.png


--------------------------------------------------------------------------------
/docs/images/flashattention1.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/flashattention1.2.png


--------------------------------------------------------------------------------
/docs/images/flashattention1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/flashattention1.png


--------------------------------------------------------------------------------
/docs/images/gqa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/gqa.png


--------------------------------------------------------------------------------
/docs/images/isoflops.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/isoflops.png


--------------------------------------------------------------------------------
/docs/images/llama3_post_approach.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/llama3_post_approach.png


--------------------------------------------------------------------------------
/docs/images/llama3_sft_mix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/llama3_sft_mix.png


--------------------------------------------------------------------------------
/docs/images/llama3_table3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/llama3_table3.png


--------------------------------------------------------------------------------
/docs/images/llama3_vision.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/llama3_vision.png


--------------------------------------------------------------------------------
/docs/images/llama_fig3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/llama_fig3.png


--------------------------------------------------------------------------------
/docs/images/logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/logo.jpg


--------------------------------------------------------------------------------
/docs/images/lr1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/lr1.png


--------------------------------------------------------------------------------
/docs/images/norm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/norm.png


--------------------------------------------------------------------------------
/docs/images/pagedattn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/pagedattn.png


--------------------------------------------------------------------------------
/docs/images/pagedattn2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/pagedattn2.png


--------------------------------------------------------------------------------
/docs/images/pagedattn3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/pagedattn3.png


--------------------------------------------------------------------------------
/docs/images/perceiver.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/perceiver.png


--------------------------------------------------------------------------------
/docs/images/rope1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope1.png


--------------------------------------------------------------------------------
/docs/images/rope2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope2.png


--------------------------------------------------------------------------------
/docs/images/rope3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope3.png


--------------------------------------------------------------------------------
/docs/images/rope4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope4.png


--------------------------------------------------------------------------------
/docs/images/rope5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope5.png


--------------------------------------------------------------------------------
/docs/images/rope6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope6.png


--------------------------------------------------------------------------------
/docs/images/rope7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/rope7.png


--------------------------------------------------------------------------------
/docs/images/scale_law1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/scale_law1.png


--------------------------------------------------------------------------------
/docs/images/scale_law2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/scale_law2.png


--------------------------------------------------------------------------------
/docs/images/swish.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/swish.png


--------------------------------------------------------------------------------
/docs/images/video-chatgpt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/video-chatgpt.png


--------------------------------------------------------------------------------
/docs/images/vit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/images/vit.png


--------------------------------------------------------------------------------
/docs/index.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE html>
 2 | <html lang="en">
 3 | <head>
 4 |   <meta charset="UTF-8">
 5 |   <title>Document</title>
 6 |   <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
 7 |   <meta name="description" content="Description">
 8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0">
 9 |   <link rel="stylesheet" href="//cdn.jsdelivr.net/npm/docsify@4/lib/themes/vue.css">
10 |   <link rel="stylesheet" href="//cdn.jsdelivr.net/npm/prismjs/themes/prism.css">
11 |   <script src="//cdn.jsdelivr.net/npm/prismjs/components/prism-python.min.js"></script>
12 |   <!-- <link rel="stylesheet" href="custom.css"> -->
13 | </head>
14 | <body>
15 |   <div id="app"></div>
16 |   <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
17 |   <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
18 |   <!-- Docsify v4 -->
19 |   <script src="//cdn.jsdelivr.net/npm/docsify@4"></script>
20 |   <script src="//cdn.jsdelivr.net/npm/prismjs/prism.js"></script>
21 |   <script src="//cdn.jsdelivr.net/npm/prismjs/components/prism-python.min.js"></script>
22 |   <script>
23 |     window.MathJax = {
24 |       tex: {
25 |         inlineMath: [['$', '$'], ['\\(', '\\)']],
26 |         displayMath: [['$$', '$$'], ['\\[', '\\]']],
27 |         processEscapes: true,
28 |         processEnvironments: true
29 |       },
30 |       options: {
31 |         skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre'],
32 |         ignoreHtmlClass: 'tex2jax_ignore',
33 |         processHtmlClass: 'tex2jax_process'
34 |       }
35 |     };
36 |   </script>
37 |   <script>
38 |     window.$docsify = {
39 |       subMaxLevel: 3,
40 |       name: '',
41 |       // repo: 'https://github.com/PromptExpert/',
42 |       loadSidebar: true,
43 |       loadNavbar: true,
44 |       locales: {
45 |         '/': {
46 |           language: 'English'
47 |         },
48 |         '/zh-cn/': {
49 |           language: '简体中文'
50 |         }
51 |       },
52 |       plugins: [
53 |         function(hook, vm) {
54 |           hook.doneEach(function() {
55 |             // 重新渲染 MathJax
56 |             if (window.MathJax) {
57 |               MathJax.typesetPromise();
58 |             }
59 |             // 重新渲染 Prism.js
60 |             if (window.Prism) {
61 |               Prism.highlightAll();
62 |             }
63 |           });
64 |         },
65 |         function (hook, vm) {
66 |           hook.beforeEach(function (content) {
67 |             var toc = '<div class="toc-container">' + window.Docsify.utils.genToc(content) + '</div>';
68 |             return toc + content;
69 |           });
70 |         }
71 |       ]
72 |     }
73 |   </script>
74 |   
75 |   
76 | </body>
77 | </html>
78 | 


--------------------------------------------------------------------------------
/docs/llm_foundation.md:
--------------------------------------------------------------------------------
 1 | # 大语言模型基础
 2 | 
 3 | - [Normalization概述](llm_foundation/norm.md)
 4 | - [各种大模型架构，傻傻分不清楚？](llm_foundation/llm_architecture.md)
 5 | - [SwiGLU激活函数](llm_foundation/swiglu.md)
 6 | - [大模型中的Attention](llm_foundation/llm_attention.md)
 7 | - [Llama 3精读系列](llm_foundation/llama3/llama3.md)
 8 | - [FLOPs](llm_foundation/FLOPs.md)
 9 | - [学习率和Cosine Learning Rate Schedule](llm_foundation/cos_lr_schedule.md)
10 | - [训练时的显存如何计算？](llm_foundation/memory_usage_training.md)


--------------------------------------------------------------------------------
/docs/llm_foundation/FLOPs.md:
--------------------------------------------------------------------------------
 1 | # FLOPs
 2 | 
 3 | ## 概念
 4 | FLOPs（Floating Point Operations）既可以指“每秒浮点运算次数”，也可以单纯指一个计算过程中的“浮点运算次数”。在神经网络中，通常讨论的是后者，即计算一个前向传播或训练过程所需的总浮点运算次数。
 5 | 
 6 | 计算FLOPs的主要目的是评估模型的计算复杂度和资源需求。这对于以下几个方面非常重要：
 7 | - 模型设计：在设计和选择模型时，通过比较FLOPs，可以选择在性能和效率之间达到最佳平衡的模型。
 8 | - 硬件适配：了解模型的FLOPs有助于选择合适的硬件平台。例如，高FLOPs的模型通常需要更强大的计算资源，如GPU或TPU。
 9 | - 优化和压缩：为了在资源受限的设备上运行模型（如移动设备），需要对模型进行优化和压缩。通过分析FLOPs，可以确定哪些部分可以优化以减少计算量。
10 | - 比较和基准：FLOPs作为一个标准化的指标，可以用于比较不同模型的计算需求，帮助研究者和工程师进行更有依据的选择和调整。
11 | 
12 | 每一层神经网络，如卷积层、全连接层等，都需要执行大量的浮点运算。通过计算每一层的FLOPs，可以得出整个网络的总FLOPs。
13 | 
14 | ## 例子
15 | 以多头注意力层为例，计算FLOPs。
16 | 
17 | 1. 记：
18 | - 输入序列长度：$L$
19 | - 输入特征维度：$d_{model}$
20 | - 注意力头数：$h$
21 | - 每个头的维度：$d_k = d_v = d_{model} / h$
22 | 
23 | 已知两个形状为 $(a, b)$ 和 $(b, c)$ 的矩阵相乘所需的FLOPs数为 $abc$ 。
24 | 
25 | 2. 线性变换
26 | 由于Q、K和V的计算是相同的，因此我们只需要计算一次，然后乘以3：
27 | - 每个头的Q/K/V: $L \times d_{model} \times d_{model}$
28 | - 总FLOPs: $ 3 \times L \times d_{model}^2$
29 | 
30 | 3. 计算注意力权重
31 | 
32 | 计算Q和K的点积，然后进行缩放和Softmax操作：
33 | 
34 | - 点积：$L \times d_{model} \times L$
35 | - Softmax：要计算$L$次，每次$L$个操作，所以总数 $L \times L$
36 | - 总FLOPs: $L^2 \times d_{model} + L^2$
37 | 
38 | 4. 加权和
39 | - 注意力权重矩阵乘以V：$L \times L \times d_{model}$
40 | 
41 | 5. 线性变换
42 | - $L \times d_{model} \times d_{model}$
43 | 
44 | 
45 | 6. 总FLOPs计算
46 | 
47 | 将上述各部分的FLOPs相加，得到多头注意力层的总FLOPs：
48 | 
49 | $$
50 | 4 \times L \times d_{model}^2 + 2 \times L^2 \times d_{model} + L^2 
51 | $$


--------------------------------------------------------------------------------
/docs/llm_foundation/cos_lr_schedule.md:
--------------------------------------------------------------------------------
 1 | # 学习率和Cosine Learning Rate Schedule
 2 | 
 3 | ## 学习率为什么重要
 4 | 
 5 | 学习率需要在收敛和过火之间权衡。学习率太小，则收敛得慢。学习率太大，则损失会震荡甚至变大。
 6 | 
 7 | 不同学习率的影响就像：
 8 | 
 9 | 
10 | ![](../images/lr1.png)
11 | 
12 | 学习率可能是最重要的参数了。如果你只有时间调整一个参数，那么就是学习率。
13 | 
14 | 学习率还会跟优化过程的其他方面相互作用，这个相互作用可能是非线性的。小的batch size最好搭配小的学习率，因为batch size越小也可能有噪音，这时候就需要小心翼翼地调整参数。
15 | 
16 | ## 训练过程中调整学习率
17 | 
18 | 训练过程中调整学习率的意思是，在训练过程中，根据训练的火候，适当增加或减小学习率。调整学习率的目的是尽快收敛到较优值（同时满足快和优）。一般来说，调整学习率实际上都是减少学习率，因此有的地方也叫学习率衰减(decay)或学习率退火(annealling)。调整学习率主要有两类方法：学习率schedule和自适应学习率。学习率schedule根据一定的公式调整学习率，公式中学习率是训练步数的函数；自适应学习率是算法自动根据训练中的实际情况，调整学习率。
19 | 
20 | ## Cosine Learning Rate Schedule
21 | 也叫Cosine Annealing。
22 | 
23 | Cosine Annealing是一种学习率schedule方法，它的效果是从一个较大的学习率开始，然后相对快速地减少到一个最小值，再快速增加到一个较高的值。这个学习率的重置类似于对学习过程的模拟重启。
24 | 
25 | 具体来说，Cosine Annealing的学习率调度方式通常遵循余弦函数的变化模式。例如，在一个周期内，学习率 $\eta(t)$ 会按照下列公式变化：
26 | 
27 | $$
28 | \eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{T_{cur}}{T_{max}} \pi\right)\right)
29 | $$
30 | 
31 | 其中：
32 | - $\eta_t$ 是在第 $t$ 次迭代时的学习率。
33 | - $\eta_{\max}$ 是初始学习率的最大值。
34 | - $\eta_{\min}$ 是学习率的最小值（通常设为0）。
35 | - $T_{cur}$ 是当前迭代次数。
36 | - $T_{max}$ 是总的迭代次数。
37 | 
38 | 举例一个Cosine Annealing曲线：
39 | 
40 | ![](../images/cos_lr.png)
41 | 
42 | *参考资料* 
43 | - [Cosine Annealing](https://paperswithcode.com/method/cosine-annealing)
44 | - [Wikipedia Learning rate](https://en.wikipedia.org/wiki/Learning_rate) 
45 | - [cs231n](https://cs231n.github.io/neural-networks-3/)
46 | - [Deep Learning, Chpater 11](https://www.deeplearningbook.org/contents/guidelines.html), 424页 
47 | - [How to Configure the Learning Rate When Training Deep Learning Neural Networks](https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/) 
48 | - [SGDR: Stochastic Gradient Descent with Warm Restarts](https://arxiv.org/abs/1608.03983) 
49 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3.md:
--------------------------------------------------------------------------------
 1 | # Llama3精读系列
 2 | 
 3 | 精读 [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783) ，涵盖除infra、evaluation和inference之外的大部分内容。
 4 | 
 5 | - [Llama3的模型结构](llm_foundation/llama3/llama3_arch.md)
 6 | - [Llams3的参数量计算](llm_foundation/llama3/llama3_params_count.md)
 7 | - [Llama3如何实现128K的上下文长度](llm_foundation/llama3/llama3_128k.md)
 8 | - [Llama3预训练数据](llm_foundation/llama3/llama3_pre_data.md)
 9 | - [Llama3 Scale Law](llm_foundation/llama3/llama3_scale.md)
10 | - [Llama3 Training Recipe](llm_foundation/llama3/llama3_recipe.md)
11 | - [Llama3 Post-Training](llm_foundation/llama3/llama3_post_training.md)
12 | - [Llama3 的SFT数据构建方法](llm_foundation/llama3/llama3_sft_data.md)
13 | - [Llama3多模态之视觉](llm_foundation/llama3/llama3_vision.md)
14 | - [Llama3金句摘抄](llm_foundation/llama3/llama3_quote.md)
15 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_128k.md:
--------------------------------------------------------------------------------
 1 | # Llama3如何实现128K的上下文长度
 2 | 
 3 | > We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens.
 4 | 
 5 | > In the final stages of pre-training, we train on long sequences to support context windows of up to 128K tokens. We increase the supported context length in increments, pre-training until the model has successfully adapted to the increased context length. We assess successful adaptation by measuring whether (1) model performance on short-context evaluations has recovered completely and (2) the model perfectly solves “needle in a haystack” tasks up to that length. In Llama 3 405B pre-training, we increased context length gradually in six stages, starting from the original 8K context window and ending in the final 128K context window. This long-context pre-training stage was performed using approximately 800B training tokens.
 6 | 
 7 | 总结：
 8 | 1. 在4M tokens的batch size上训练4K长度。
 9 | 2. 252M tokens之后，batch size增加到8M，长度增加到8K。
10 | 3. 逐渐提高上下文长度，从8K到128K，共经历6个提高阶段，整个过程持续约800B tokens。
11 | 4. 在这个扩展上下文的过程中，评测：
12 |     - 在短上下文评估中的表现是否完全恢复。
13 |     - 模型是否能够完美地解决长度范围内的“needle in a haystack”任务。“Needle in a haystack”任务通常涉及在大量无关信息中找到特定且稀有的目标信息，就像在一堆干草中找到一根针。
14 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_arch.md:
--------------------------------------------------------------------------------
 1 | # Llama3的模型结构
 2 | 
 3 | > Llama 3 uses a standard, dense Transformer architecture. It does not deviate significantly from Llama and Llama 2 in terms of model architecture; our performance gains are primarily driven by improvements in data quality and diversity as well as by increased training scale.
 4 | 
 5 | 和原始Transformer的几点不同：
 6 | - Pre-normalization. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function.
 7 | - SwiGLU activation function. We replace the ReLU non-linearity by the SwiGLU activation function to improve the performance. 
 8 | - RoPE。
 9 | - Grouped-Query Attention (GQA)。We use grouped query attention (GQA) with 8 key-value heads to improve inference speed and to reduce the size of key-value caches during decoding.
10 | 
11 | 和Llama 2的几点不同：
12 | - We use an attention mask that prevents self-attention between different documents within the same sequence. We find that this change had limited impact during in standard pre-training, but find it to be important in continued pre-training on very long sequences.
13 | - We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from the tiktoken3 tokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama 2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to 3.94 characters per token. This enables the model to “read” more text for the same amount of training compute. We also found that adding 28K tokens from select non-English languages improved both compression ratios and downstream performance, with no impact on English tokenization.
14 | - We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support longer contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768.
15 | 
16 | 
17 | Llama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128 attention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal according to scaling laws on our data for our training budget of $3.8 \times 10^{25}$ FLOPs.
18 | 
19 | ![](../images/llama3_table3.png)
20 | 
21 | 实现: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/api/model.py
22 | 
23 | *参考资料*
24 | - LLaMA- Open and Efficient Foundation Language Models
25 | - Llama 2- Open Foundation and Fine-Tuned Chat Models
26 | - The Llama 3 Herd of Models
27 | 
28 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_params_count.md:
--------------------------------------------------------------------------------
  1 | # Llama3的参数量计算
  2 | 
  3 | 以Llama-3.1-8B为例，结合[llama3的实现](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/api/model.py)，计算llama3的参数量。
  4 | 
  5 | llama3的参数量是以下部分之和：
  6 | - Embedding层
  7 | - Transformer Block * Block层数
  8 | - Output层
  9 | 
 10 | ## Config
 11 | ```json
 12 | {
 13 |     "dim": 4096, 
 14 |     "ffn_dim_multiplier": 1.3, 
 15 |     "multiple_of": 1024, 
 16 |     "n_heads": 32, 
 17 |     "n_kv_heads": 8, 
 18 |     "n_layers": 32, 
 19 |     "norm_eps": 1e-05, 
 20 |     "rope_theta": 500000.0, 
 21 |     "use_scaled_rope": true, 
 22 |     "vocab_size": 128256
 23 | }
 24 | ```
 25 | 
 26 | ## Embedding层
 27 | 
 28 | ```python
 29 | self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
 30 | ```
 31 | 
 32 | 计算公式：embed_params = vocab_size * dim
 33 | 
 34 | 参数量：0.53B
 35 | 
 36 | ## Transformer Block
 37 | ### RMSNorm
 38 | ```python
 39 | class RMSNorm(torch.nn.Module):
 40 |     def __init__(self, dim: int, eps: float = 1e-6):
 41 |         super().__init__()
 42 |         self.eps = eps
 43 |         self.weight = nn.Parameter(torch.ones(dim))
 44 | 
 45 |     def _norm(self, x):
 46 |         return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
 47 | 
 48 |     def forward(self, x):
 49 |         output = self._norm(x.float()).type_as(x)
 50 |         return output * self.weight
 51 | ```
 52 | 
 53 | 计算公式：rms_params = dim + dim # each Transformer block has two RMSNorm layers
 54 | 计算结果：0.00B
 55 | 
 56 | ### Attention
 57 | ```python
 58 | class Attention(nn.Module):
 59 |     def __init__(self, args: ModelArgs):
 60 |         super().__init__()
 61 |         self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
 62 |         self.n_local_heads = args.n_heads
 63 |         self.n_local_kv_heads = self.n_kv_heads
 64 |         self.n_rep = self.n_local_heads // self.n_local_kv_heads
 65 |         self.head_dim = args.dim // args.n_heads
 66 | 
 67 |         self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)
 68 |         self.wk = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
 69 |         self.wv = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
 70 |         self.wo = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
 71 | ```
 72 | 
 73 | 计算公式：
 74 | ```
 75 | head_dim = dim / n_heads
 76 | query_proj_params = dim*(n_heads*head_dim) 
 77 | key_proj_params = dim*(n_kv_heads*head_dim)
 78 | value_proj_params = dim*(n_kv_heads*head_dim)
 79 | out_proj_params = (n_heads*head_dim)*dim
 80 | attn_params = query_proj_params+key_proj_params+value_proj_params+out_proj_params
 81 | ``` 
 82 | 计算结果：0.04B
 83 | 
 84 | ### FFN
 85 | ```python
 86 | class FeedForward(nn.Module):
 87 |     def __init__(
 88 |         self,
 89 |         dim: int,
 90 |         hidden_dim: int,
 91 |         multiple_of: int,
 92 |         ffn_dim_multiplier: Optional[float],
 93 |     ):
 94 |         super().__init__()
 95 |         hidden_dim = int(2 * hidden_dim / 3)
 96 |         if ffn_dim_multiplier is not None:
 97 |             hidden_dim = int(ffn_dim_multiplier * hidden_dim)
 98 |         hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
 99 | 
100 |         self.w1 = nn.Linear(dim, hidden_dim, bias=False)
101 |         self.w2 = nn.Linear(hidden_dim, dim, bias=False)
102 |         self.w3 = nn.Linear(dim, hidden_dim, bias=False)
103 | ```
104 | 
105 | 计算公式：
106 | ```
107 | hidden_dim = 4 * args.dim
108 | hidden_dim = int(2 * dim*4 / 3)
109 | hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
110 | hidden_dim = int(ffn_dim_multiplier * hidden_dim)
111 | w1_params = dim * hidden_dim
112 | w2_params = dim * hidden_dim
113 | w3_params = dim * hidden_dim
114 | ffn_params = w1_params+w2_params+w3_params
115 | ```
116 | 
117 | 计算结果：0.18B
118 | 
119 | ### Single Transformer Block Sum
120 | 计算公式：transformer_block_params = rms_params+attn_params+ffn_params
121 | 
122 | 计算结果：0.22B
123 | 
124 | ## All Transformer Blocks Sum
125 | 计算公式：transformer_params = transformer_block_params*n_layers
126 | 
127 | 计算结果：7.10B
128 | 
129 | ## Output
130 | ```python
131 | self.output = nn.Linear(params.dim, params.vocab_size, bias=False)
132 | ```
133 | 计算公式：output_params = dim * vocab_size
134 | 
135 | 计算结果：0.53B
136 | 
137 | ## Total
138 | ```python
139 | class Transformer(nn.Module):
140 |     def __init__(self, params: ModelArgs):
141 |         super().__init__()
142 |         self.params = params
143 |         self.vocab_size = params.vocab_size
144 |         self.n_layers = params.n_layers
145 | 
146 |         self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
147 | 
148 |         self.layers = torch.nn.ModuleList()
149 |         for layer_id in range(params.n_layers):
150 |             self.layers.append(TransformerBlock(layer_id, params))
151 | 
152 |         self.norm = RMSNorm(params.dim, eps=params.norm_eps)
153 |         self.output = nn.Linear(params.dim, params.vocab_size, bias=False)
154 | ```
155 | 计算公式：total_params = embed_params + transformer_params + output_params
156 | 
157 | 计算结果：8.00B
158 | 
159 | ## 计算公式汇总
160 | ```python
161 | # config
162 | dim = 4096
163 | n_layers = 32
164 | n_heads = 32
165 | n_kv_heads = 8
166 | ffn_dim = 6144
167 | vocab_size = 128256
168 | multiple_of = 1024  # make SwiGLU hidden layer size multiple of large power of 2
169 | ffn_dim_multiplier = 1.3
170 | norm_eps = 1e-5
171 | rope_theta = 500000
172 | max_seq_len = 8192
173 | 
174 | # Emebdding
175 | embed_params = vocab_size * dim
176 | print (f'Parameters of Embebdding layer = {embed_params/10**9:.2f}B')
177 | 
178 | # RMSNorm
179 | rms_params = dim + dim # each Transformer layer has two RMSNorm layers
180 | print (f'Parameters of RMSNorm layer = {rms_params/10**9:.2f}B')
181 | 
182 | # Attention
183 | head_dim = dim / n_heads
184 | query_proj_params = dim*(n_heads*head_dim)
185 | key_proj_params = dim*(n_kv_heads*head_dim)
186 | value_proj_params = dim*(n_kv_heads*head_dim)
187 | out_proj_params = (n_heads*head_dim)*dim
188 | attn_params = query_proj_params+key_proj_params+value_proj_params+out_proj_params
189 | print (f'Parameters of Attention layer = {attn_params/10**9:.2f}B')
190 | 
191 | # FFN
192 | hidden_dim = int(2 * dim*4 / 3)
193 | hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
194 | hidden_dim = int(ffn_dim_multiplier * hidden_dim)
195 | print (f'FFN hidden_dim = {hidden_dim}')
196 | w1_params = dim * hidden_dim
197 | w2_params = dim * hidden_dim
198 | w3_params = dim * hidden_dim
199 | ffn_params = w1_params+w2_params+w3_params
200 | print (f'Parameters of FFN layer = {ffn_params/10**9:.2f}B')
201 | 
202 | # Transformer Block
203 | transformer_block_params = rms_params+attn_params+ffn_params
204 | print (f'Parameters of Transformer block = {transformer_block_params/10**9:.2f}B')
205 | 
206 | # Transformer
207 | transformer_params = transformer_block_params*n_layers
208 | print (f'Parameters of Transformer = {transformer_params/10**9:.2f}B')
209 | 
210 | # Output
211 | output_params = dim * vocab_size
212 | print (f'Parameters of output layer = {output_params/10**9:.2f}B')
213 | 
214 | # Total
215 | total_params = embed_params + transformer_params + output_params
216 | print (f'Parameters total = {total_params//10**9:.2f}B')
217 | 
218 | #Parameters of Embebdding layer = 0.53B
219 | #Parameters of RMSNorm layer = 0.00B
220 | #Parameters of Attention layer = 0.04B
221 | #FFN hidden_dim = 14643
222 | #Parameters of FFN layer = 0.18B
223 | #Parameters of Transformer block = 0.22B
224 | #Parameters of Transformer = 7.10B
225 | #Parameters of output layer = 0.53B
226 | #Parameters total = 8.00B
227 | ```


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_post_training.md:
--------------------------------------------------------------------------------
  1 | # Llama 3 Post-Training
  2 | 
  3 | Post-Training是Llama 3训练的重头戏，决定了模型的上限。
  4 | 
  5 | > We produce the aligned Llama 3 models by applying several rounds of post-training, or aligning the model with human feedback on top of a pre-trained checkpoint. Each round of post-training involves supervised finetuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected either via human annotations or generated synthetically.
  6 | 
  7 | > The backbone of our post-training strategy is a reward model and a language model. We first train a reward model on top of the pre-trained checkpoint using human-annotated preference data. We then finetune pre-trained checkpoints with SFT, and further align the checkpoints with DPO.
  8 | 
  9 | ![](../../images/llama3_post_approach.png)
 10 | 
 11 | ## Reward Modeling
 12 | RM在这里是用来挑选答案的，不是用来训练RLHF的。
 13 | 
 14 | > The training objective is the same as Llama 2 except that we remove the margin term in the loss, as we observe diminishing improvements after data scaling. 
 15 | 
 16 | > The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. 
 17 | 
 18 | 有用和安全需要权衡，于是训练了两个RM，一个是Helpfulness RM， 一个是Safety RM。
 19 | 
 20 | RM初始化为预训练后的模型。模型的结构和超参数都和预训练模型相同，except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.
 21 | 
 22 | To train the reward model, we convert our collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart. We used a binary ranking loss:
 23 | 
 24 | $$
 25 | L_{\text{ranking}} = -\log(\sigma(r_{\theta}(x, y_c) - r_{\theta}(x, y_r)))
 26 | $$
 27 | 
 28 | where $r_{\theta}(x, y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta$. $y_c$ is the preferred response that annotators choose and $y_r$ is the rejected counterpart.
 29 | 
 30 | In addition to standard preference pair of (chosen, rejected) response, annotations also create a third “edited response” for some prompts, where the chosen response from the pair is further edited for improvement. Hence, each preference ranking sample has two or three responses with clear ranking (edited > chosen > rejected). We concatenate the prompt and multiple responses into a single row during training with responses randomly shuffled. This is an approximation to the standard scenario of putting the responses in separate rows and computing the scores, but in our ablations, this approach improves training efficiency without a loss in accuracy.
 31 | 
 32 | 数据构造略。
 33 | 
 34 | 原文这里有地方没说清楚。前面说RM takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score， 后面又说 We concatenate the prompt and multiple responses into a single row during training with responses randomly shuffled，那么优化目标是什么呢？还是上面那个公式吗？还有，edited response是怎么利用的？
 35 | 
 36 | In each round of post-training, we use all the preference data that is available at the time for reward modeling, while only using the latest batches from various capabilities for DPO training. For both reward modeling and DPO, we use samples that are labeled as the chosen response being significantly better or better than the rejected counterpart for training and discard samples with similar responses.
 37 | 
 38 | ## Supervised Finetuning
 39 | 
 40 | SFT的训练部分没有什么特殊的，we finetune the pre-trained language model using a standard cross entropy loss on the target tokens (while masking loss on prompt tokens)。
 41 | 
 42 | SFT的关键是数据的构建，详见[Llama3 SFT的数据构建](llm_foundation/llama3/llama3_sft_data.md)。
 43 | 
 44 | ## Direct Preference Optimization
 45 | For training, we primarily use the most recent batches of preference data collected using the best performing models from the previous alignment rounds.
 46 | 
 47 | In addition, we apply the following algorithmic modifications to DPO:
 48 | - Masking out formatting tokens in DPO loss. We hypothesize that this is due to the contrastive nature of the DPO loss – the presence of common tokens in both chosen and rejected responses leads to a conflicting learning objective as the model needs to increase and reduce the likelihood of these tokens simultaneously.
 49 | - Regularization with NLL loss. This helps further stabilize DPO training. 
 50 | 
 51 | ## Model Averaging
 52 | Finally, we average models obtained from experiments using various versions of data or hyperparameters at each RM, SFT, or DPO stage. 
 53 | 
 54 | ## Iterative Rounds
 55 | Following Llama 2, we apply the above methods in six rounds. In each cycle, we collect new preference annotations and SFT data, sampling synthetic data from the latest models.
 56 | 
 57 | 回到开头的两段话，这里将所有东西串联起来。
 58 | 
 59 | 模拟一下6个round：
 60 | - Round 1
 61 |     - 步骤1：奖励模型训练
 62 |         - 数据准备：收集初始的人类偏好数据，标注一些提示及其对应的响应（包括选择的、拒绝的和编辑后的响应）。
 63 |         - 模型训练：使用这些数据训练奖励模型（Helpfulness RM和Safety RM），初始化为预训练后的模型。
 64 |     - 步骤2：SFT
 65 |         - 数据准备：收集初始的SFT数据集，包含高质量的人类标注文本。
 66 |         - 模型微调：使用标准的交叉熵损失对预训练语言模型进行微调。
 67 |     - 步骤3：DPO
 68 |         - 数据准备：从最新的SFT模型中采集一些偏好数据（详见附录1）。
 69 |         - 模型优化：使用最新的偏好数据进行DPO训练，应用屏蔽格式化标记和NLL损失正则化。
 70 | - Round 2
 71 |     - 步骤1：奖励模型训练
 72 |         - 数据准备：收集更多的偏好数据，包括新的提示及其响应。
 73 |         - 模型训练：使用新的数据再次训练奖励模型。此时，模型会利用第一轮中学到的权重进行初始化。
 74 |     - 步骤2：SFT
 75 |         - 数据准备：收集更多的SFT数据集，新的RM在这里起作用。
 76 |         - 模型微调。
 77 |     - 步骤3：DPO
 78 |         - 数据准备：从新的SFT模型中采集更多的偏好数据。
 79 |         - 模型优化。
 80 | - Round 3 ~ Round 6
 81 |     - 以此类推
 82 | - 在每一轮结束后，对不同数据版本和超参数设置下得到的模型进行平均，确保最终模型的稳定性和高性能。
 83 | 
 84 | 
 85 | ## 附录
 86 | ### 附录1 从最新的SFT模型中采集偏好数据
 87 | 在DPO中，从最新的SFT模型中采集偏好数据的意思是利用最新经过SFT微调后的模型生成新的响应，并使用这些响应来构建偏好数据集。具体步骤如下：
 88 | 1. **生成响应**：
 89 |     - 使用最新的SFT模型对一组新的或现有的提示（prompts）进行生成任务。SFT模型已经经过多轮微调，具有较好的生成能力。
 90 |     - 对每个提示，生成多个不同的响应（outputs），这些响应将用于偏好评估。
 91 | 2. **人工评估**：
 92 |     - 将生成的响应交给人类评估者进行评估。评估者会根据响应的质量、相关性、帮助性和安全性等标准对响应进行排序。
 93 |     - 标注出每个提示下的“选择的响应”（chosen response）和“拒绝的响应”（rejected response），有时也会有“编辑后的响应”（edited response）。
 94 | 3. **构建偏好数据集**：
 95 |     - 使用人工评估的结果构建新的偏好数据集。这些数据集包含提示和对应的多个响应（包括选择的、拒绝的和可能的编辑后的响应）。
 96 |     - 确保这些数据集有明确的偏好标签（例如，选择的响应优于拒绝的响应）。
 97 | 4. **DPO训练**：
 98 |     - 使用新构建的偏好数据集对模型进行DPO训练。DPO的目标是优化模型，使其生成的响应更符合人类的偏好。
 99 |     - 在DPO训练过程中，应用特定的算法修改（如屏蔽格式化标记和NLL损失正则化）以提高训练效果和稳定性。
100 | 
101 | 为什么从最新的SFT模型中采集偏好数据？
102 | - **提升数据质量**：最新的SFT模型生成的响应质量通常较高，这些响应更接近实际应用场景中的高质量回答。
103 | - **反映最新模型能力**：通过使用最新SFT模型生成的响应，可以更准确地反映出模型当前的能力和不足之处，这对于进一步优化非常重要。
104 | - **保持模型一致性**：每一轮训练后，模型都会有所改进。使用经过最新SFT训练的模型生成的数据可以确保训练数据与模型的当前状态更匹配，从而提高训练效果。


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_pre_data.md:
--------------------------------------------------------------------------------
 1 | # Llama3 预训练数据
 2 | 
 3 | > Compared to prior versions of Llama, we improved both the quantity and quality of the data we use for pre-training and post-training. These improvements include the development of more careful pre-processing and curation pipelines for pre-training data and the development of more rigorous quality assurance and filtering approaches for post-training data. We pre-train Llama 3 on a corpus of about 15T multilingual tokens, compared to 1.8T tokens for Llama 2.
 4 | 
 5 | > We create our dataset for language model pre-training from a variety of data sources containing knowledge until the end of 2023. We apply several de-duplication methods and data cleaning mechanisms on each data source to obtain high-quality tokens. We remove domains that contain large amounts of personally identifiable information (PII), and domains with known adult content.
 6 | 
 7 | ## Web Data Curation
 8 | 
 9 | > Much of the data we utilize is obtained from the web and we describe our cleaning process below.
10 | 
11 | ### 过滤掉有害信息和PII信息。
12 | 如题。
13 | 
14 | ### HTML抽取和清洗
15 | - 构建了一个parse，能够去除广告、导航栏、版权声明等信息，保留主体信息，比第三方的好。
16 | - 小心处理公式和代码，保留原来的结构。
17 | - markdown对模型表现有害，删除了。
18 | 
19 | ### 去重
20 | - URL级别去重。每个URL，只保留最新的版本。
21 | - 文档级别去重。MinHash去重。
22 | - 行级别去重。和ccNet类似。实践证明有较大提升。
23 | 
24 | ### Heuristic去重
25 | 用各种规则和pattern，删除各类重复。
26 | 
27 | ### 基于模型的质量过滤
28 | - fasttext分类，判断文档是否会被wikipedia引用。
29 | - 微调Llama 2，给定一些质量要求，让它判断是否满足质量要求。以微调后的Llama 2生成的数据为训练数据，训练DistilRoberta，为文档输出一个质量分。
30 | 
31 | ### 代码和推理数据
32 | 为代码和数学相关的文档特别处理。特别微调Llama 2，特别解析。
33 | 
34 | ### 多语言数据
35 | - 用fasttext分类语种。
36 | - 每个语言内部，进行文档级别和行级别去重。
37 | - 每个语言分别过滤低质量。
38 | - 实验选择多语言的token量，平衡英语和多语言评测。
39 | 
40 | ## Determining the Data Mix
41 | > To obtain a high-quality language model, it is essential to carefully determine the proportion of different data sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification and scaling law experiments.
42 | 
43 | ### Knowledge classification
44 | 训练一个知识领域分类器。
45 | 
46 | ### Scaling laws for data mix
47 | > To determine the best data mix, we perform scaling law experiments in which we train several small models on a data mix and use that to predict the performance of a large model on that mix. We repeat this process multiple times for different data mixes to select a new data mix candidate. Subsequently, we train a larger model on this candidate data mix and evaluate the performance of
48 | that model on several key benchmarks.
49 | 
50 | ### Data mix summary
51 | 最终的数据集，包含50%的通用领域，25%的数学和推理，17%的代码，8%的多语言。
52 | 
53 | ## Annealing Data
54 | Annealing阶段是学习率降低、训练趋于收敛的阶段。
55 | 
56 | 在退火阶段，用少量高质量代码和数学数据可提高模型表现。
57 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_quote.md:
--------------------------------------------------------------------------------
 1 | # Llama 3 金句摘抄
 2 | > The development of modern foundation models consists of two main stages: (1) a pre-training stage in which the model is trained at massive scale using straightforward tasks such as next-word prediction or captioning and (2) a post-training stage in which the model is tuned to follow instructions, align with human preferences, and improve specific capabilities (for example, coding and reasoning).
 3 | 
 4 | > We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity.
 5 | 
 6 | > Language model pre-training involves: (1) the curation and filtering of a large-scale training corpus, (2) the development of a model architecture and corresponding scaling laws for determining model size, (3) the development of techniques for efficient pre-training at large scale, and (4) the development of a pre-training recipe.
 7 | 
 8 | > To obtain a high-quality language model, it is essential to carefully determine the proportion of different data sources in the pre-training data mix. Our main tools in determining this data mix are knowledge classification and scaling law experiments.
 9 | 
10 | > Our performance gains are primarily driven by improvements in data quality and diversity as well as by increased training scale.
11 | 
12 | > In addition to determining the optimal model size, a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a couple of issues: (1) Existing scaling laws typically predict only next-token prediction loss rather than specific benchmark performance. (2) Scaling laws can be noisy and unreliable because they are developed based on pre-training runs conducted with small compute budgets.
13 | 
14 | > The recipe used to pre-train Llama 3 405B consists of three main stages: (1) initial pre-training, (2) long-context pre-training, and (3) annealing.
15 | 
16 | > We produce the aligned Llama 3 models by applying several rounds of post-training,6 or aligning the model with human feedback on top of a pre-trained checkpoint. Each round of post-training involves supervised finetuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected either via human annotations or generated synthetically.
17 | 
18 | > We define reasoning as the ability to perform multi-step computations and arrive at the correct final answer.
19 | 
20 | > Steerability is the ability to direct the model’s actions and outcomes to meet developer and user specifications.
21 | 
22 | > In many ways, the development of high-quality foundation models is still in its infancy. Our experience in developing Llama 3 suggests that substantial further improvements of these models are on the horizon.
23 | 
24 | > Throughout the development of the Llama 3 model family, we found that a strong focus on high-quality data, scale, and simplicity consistently yielded the best results. In preliminary experiments, we explored more complex model architectures and training recipes but did not find the benefits of such approaches to outweigh the additional complexity they introduce in model development.
25 | 
26 | > Developing a flagship foundation model such as Llama 3 involves overcoming a plethora of deep technical problems but also requires clever organizational decisions.
27 | 
28 | > We believe that the public release of foundation models plays a key role in the responsible development of such models, and we hope that the release of Llama 3 encourages the industry to embrace the open, responsible development of AGI.


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_recipe.md:
--------------------------------------------------------------------------------
 1 | # Llama3 Training Recipe
 2 | 
 3 | > The recipe used to pre-train Llama 3 405B consists of three main stages: (1) initial pre-training, (2) long-context pre-training, and (3) annealing. 
 4 | 
 5 | ## Initial Pre-Training
 6 | > We pre-train Llama 3 405B using a cosine learning rate schedule, with a peak learning rate of $8 \times 10^{-5} , a linear warm up of 8,000 steps, and a decay to $8 \times 10^{-7}$ over 1,200,000 training steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
 7 | 
 8 | 小的batch size在训练初期能够引入更多的梯度噪声，从而起到类似正则化的效果，帮助模型跳出局部最优解，增强训练稳定性和泛化能力，同时由于参数更新更频繁，有助于更快速地找到较好的优化路径。
 9 | 
10 | ## Long Context Pre-Training
11 | 详见 [Llama3如何实现128K的上下文长度](llm_foundation/llama3/llama3_128k.md)
12 | 
13 | ## Annealing
14 | > During pre-training on the final 40M tokens, we linearly annealed the learning rate to 0, maintaining a context length of 128K tokens. During this annealing phase, we also adjusted the data mix to upsample data sources of very high quality; Finally, we compute the average of model checkpoints during annealing to produce the final pre-trained model.


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_scale.md:
--------------------------------------------------------------------------------
 1 | # Llama3 Scale Law
 2 | > We train a model at far larger scale than previous Llama models: our flagship language model was pre-trained using 3.8 × 10^{25} FLOPs, almost 50× more than the largest version of Llama 2. Specifically, we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens. As expected per scaling laws for foundation models, our flagship model outperforms smaller models trained using the same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget. We use the flagship model to further improve the quality of those smaller models during post-training.
 3 | 
 4 | > We develop scaling laws to determine the optimal model size for our flagship model given our pre-training compute budget. In addition to determining the optimal model size, a major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a couple of issues: (1) Existing scaling laws typically predict only next-token prediction loss rather than specific benchmark performance. (2) Scaling laws can be noisy and unreliable because they are developed based on pre-training runs conducted with small compute budgets.
 5 | 
 6 | > To address these challenges, we implement a two-stage methodology to develop scaling laws that accurately predict downstream benchmark performance:
 7 | > 1. We first establish a correlation between the compute-optimal model’s negative log-likelihood on downstream tasks and the training FLOPs.
 8 | > 2. Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both the scaling law models and older models trained with higher compute FLOPs. In this step, we specifically leverage the Llama 2 family of models.
 9 | > 
10 | > This approach enables us to predict downstream task performance given a specific number of training FLOPs for compute-optimal models. 
11 | 
12 | 实验：在一个FLOPs范围内，训练各个大小的模型。然后，画出IsoFLOPs曲线。"Iso"来自希腊语，意思是“相同”或“等同”。IsoFLOP是指在给定的计算预算下，比较不同模型或实验结果的一种方法。
13 | 
14 | ![](../../images/isoflops.png)
15 | 
16 | 用抛物线拟合实验值，抛物线的最低点，视为给定计算预算下的compute-optimal模型。
17 | 
18 | 然后，再用compute-optimal models预测特定计算预算下的最优训练token数。做法是，假设计算预算 $C$ 和最佳token数 $N^{*}(C)$ 之间是幂率关系：
19 | $$
20 | N^{*}(C) = AC^{\alpha}
21 | $$
22 | 
23 | 用Fig 2中的数据拟合这个公式，得到 $(\alpha, A) = (0.53, 0.29)$，曲线如下：
24 | 
25 | ![](../../images/llama_fig3.png)
26 | 
27 | 将$3.8 × 10^{25}$代入，得到402B参数模型和16.55T token数。
28 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_sft_data.md:
--------------------------------------------------------------------------------
  1 | # Llama3 的SFT数据构建方法
  2 | 
  3 | > Our finetuning data is largely comprised of the following sources:
  4 | > - Prompts from our human annotation collection with rejection-sampled responses
  5 | > - Synthetic data targeting specific capabilities
  6 | > - Small amounts of human-curated data 
  7 | 
  8 | ## Rejection sampling
  9 | 
 10 | 这种方法的核心思想是不断生成候选样本并拒绝那些不符合目标分布要求的样本，直到得到一个符合要求的样本。由于这个方法在每次生成候选样本后都会通过某种标准来决定是否接受该样本，所以称之为“拒绝采样”。
 11 | 
 12 | 在拒绝采样过程中，每个通过人类标记收集到的prompt会生成K个输出（通常在10到30之间），这些输出来自最新的模型策略（通常是从前一次训练迭代中表现最好的checkpoint，或者是某个特定能力表现最好的checkpoint）。然后，使用奖励模型选择最好的候选输出。
 13 | 
 14 | 在后续的训练轮次中，引入系统提示来引导拒绝采样的响应，确保其符合期望的语气、风格或格式，不同能力可能有不同的要求。系统提示与实际提示一起输入模型。例如，如果用户的实际提示是“我如何重置密码？”，系统提示可以是“请使用友好且礼貌的语气回复客户的询问”。
 15 | 
 16 | 详细步骤：
 17 | 
 18 | 1. 输入Prompt：假设我们有一个提示“如何重置密码？”。
 19 | 2. 生成候选响应：模型生成多个响应，例如：
 20 |     - 响应1：“请点击设置，然后选择重置密码。”
 21 |     - 响应2：“你可以通过邮件找回你的密码。”
 22 |     - 响应3：“访问用户中心，选择‘忘记密码’，然后按照提示操作。”
 23 | 3. 评估候选响应：使用奖励模型对这些响应进行评分。
 24 |     - 响应1得分：0.8
 25 |     - 响应2得分：0.6
 26 |     - 响应3得分：0.9
 27 | 4. 选择最佳响应：选择得分最高的响应，即响应3：“访问用户中心，选择‘忘记密码’，然后按照提示操作。”
 28 | 5. 形成 (Prompt, Response) 对：最终输出的是 (prompt, response) 对：
 29 |     - 提示：“如何重置密码？”
 30 |     - 响应：“访问用户中心，选择‘忘记密码’，然后按照提示操作。”
 31 | 
 32 | ## 数据配比
 33 | 
 34 | ![](../../images/llama3_sft_mix.png)    
 35 | 
 36 | > In each round of post-training, we adjust our overall data mix carefully across these axes to tune performance across a wide range of benchmarks. Our final data mix epochs multiple times on some high quality sources and downsamples others.
 37 | 
 38 | ## 数据处理和质量控制
 39 | 基于模型，去掉低质量数据，包括：
 40 | - 话题分类。微调finetune Llama 3 8B作为话题分类器。
 41 | - 质量打分。
 42 |     - RM判高分的认为高质量。
 43 |     - 让Llama 3 checkpoint给数据打分，打分高的认为高质量。
 44 |     - 如果RM和Llama 3的判断有冲突，任意一个判为高质量，则认为是高质量。
 45 | - 难度打分。
 46 |     - 通过 Llama 3 70B 模型对 SFT prompt 进行意图标注。一个样本包含的意图越多，其复杂性就越高。
 47 |     - 直接让Llama 3判断数据的难度。
 48 | - 语义去重。基于RoBERTa进行聚类，在每个簇中，根据quality score × difficulty score排序。遍历每个样本，如果和已遍历的样本的相似度的最大值小于阈值，就留着。
 49 | 
 50 | ## 各项能力
 51 | 
 52 | > We highlight special efforts to improve performance for specific capabilities such as code, multilinguality, math and reasoning, long context, tool use, factuality, and steerability.
 53 | 
 54 | ### 代码
 55 | 
 56 | 首先，在代码数据上继续预训练，得到一个code expert模型。
 57 | 
 58 | 然后，用code expert生成2.7M条合成数据，这是关键。
 59 | - 首先，进行常规的问题生成和答案生成。给定模型代码片段，让它生成相关的问题。然后，让模型生成答案。We observe that adding general rules of good programming to the prompt improves the generated solution quality. Also, we find it is helpful to require the model to explain its thought process in comments.
 60 | - 用静态分析和单元测试，初步判定答案的正确性。如果有错误，把原问题、错误答案、错误提示都给模型，让它修改。只有没问题的数据才加入到最终的SFT数据。这就叫**execution feedback**。
 61 | 
 62 | 最后，是数据的过滤和筛选。有之前提到的rejection sampling，还有一个model-as-judge方法。在model-as-judge方法中，用Llama 3的早期版本对代码进行评估，并根据两个标准（代码正确性和代码风格）打分，分数为二进制（0/1）。只有那些在这两个标准上都获得满分（2分）的样本才会被保留。最初，这种严格的筛选导致了下游基准测试性能的下降，主要原因是它不成比例地移除了带有挑战性提示的示例。为了应对这一问题，有策略地修改了一些被分类为最具挑战性的编码数据的响应，直到它们符合Llama-based“model-as-judge”的标准。通过改进这些具有挑战性的问题，编码数据在质量和难度之间达到了平衡，从而实现了最佳的下游性能。
 63 | 
 64 | ### 多语言
 65 | 和代码类似，略。
 66 | 
 67 | ### 数学和推理
 68 | 列举了数学推理的难点，以及是怎么解决的，略。
 69 | 
 70 | ### 长上下文
 71 | SFT数据中加入长上下文数据是必要的。但是长上下文人工标记不现实，所以要寄希望于合成数据。原文详细介绍了如何生成长上下文的合成数据，略。
 72 | 
 73 | ### 工具使用
 74 | **数据收集** 主要使用人类标注和偏好数据。有两点和前面的方法不太一样：
 75 | - 对于工具来说，一个对话通常会包含多个助手消息。
 76 |     - 这些消息可能包括调用工具的请求和对工具输出的推理。标注是在消息级别进行的。也就是说，标注员需要对每条助手消息进行评价，而不是对整个对话进行整体评价。
 77 |     - 标注员会在相同上下文中，对比两条助手消息，并提供偏好。如果这两条消息都有重大问题，则标注员会编辑其中一条消息。
 78 |     - 被选择或编辑过的消息会被添加到对话上下文中，然后对话继续进行。这种方法可以确保每一步的改进都被累积到对话中，从而不断提高助手的表现。
 79 |     - 通过这种方法，标注员能够为助手在调用工具和对工具输出进行推理的能力提供人类反馈。这种反馈对于改进助手的性能至关重要。
 80 | - 没用rejection sampling，因为发现对工具没有作用。
 81 | 
 82 | **工具数据集** 步骤如下：
 83 | - 单步工具使用。
 84 |     - 生成用户Prompts的初始集合。通过Few-shot generation生成一个合成的用户提示集合。这些提示在设计时就要求调用某些核心工具。例如，涉及超出知识截止日期的问题。
 85 |     - 生成工具调用。使用Few-shot generation，为这些用户提示生成适当的工具调用。例如，如果用户提示是关于某个超出模型知识截止日期的问题，模型会生成调用最新知识库的请求。
 86 |     - 执行工具调用并添加输出。生成的工具调用被执行，工具返回相应的输出。例如，如果调用的是一个知识库查询工具，工具会返回最新的信息。这个工具输出被添加到模型的上下文中，以便后续步骤可以利用这些信息。
 87 |     - 生成最终答案。通过再一次提示模型，基于工具的输出生成最终的用户答案。
 88 |     - 数据集过滤。约移除30%的数据。这些被移除的数据可能是由于工具调用无法执行或其他格式问题导致的。
 89 |     - 最终得到的轨迹形式：System Prompt, User Prompt, Tool Call, Tool Output, Final Answer。
 90 | - 多步工具使用。和单步类似，只不过变成了多步。
 91 | - 文件上传。基于上传的文件， 回答相关问题，略。
 92 | 
 93 | **Zero-shot tool use data** Zero-shot指模型能够在未见过的工具上进行函数调用。略。
 94 | 
 95 | ### 真实性(Factuality)
 96 | 真实性问题不仅仅限于幻觉，然而幻觉是较为直接和突出的错误类型，所以这里从幻觉入手(hallucination-first approach)。
 97 | 
 98 | 这里用的方法是调整模型的回答，使其知道自己知道什么，而不是添加新的知识。
 99 | 
100 | 方法是知识探测(knowledge probing technique)，具体步骤如下：
101 | 1. 从预训练数据中提取数据片段
102 | 2. 通过提示Llama 3生成关于这些片段的问题。
103 | 3. 对生成的问题，让Llama 3回答，然后采样。
104 | 4. 将生成的回答与原始数据片段进行比对，并由Llama 3来判断这些回答的正确性。
105 | 5. 由Llama 3评估回答的内容是否具有足够的信息量。
106 | 6. 对于多次生成中都信息丰富但错误的回答，通过Llama 3生成一个refusal，refusal指的是模型拒绝回答某个问题的行为或机制。通过拒绝回答，模型可以避免在不确定的情况下生成可能误导用户的不准确信息。
107 | 
108 | 通过这个知识探测技术生成的数据，模型被鼓励在其确实有知识的问题上进行回答，而对于不确定的问题则拒绝回答。
109 | 
110 | 另外，预训练数据也不总是事实正确的，特别是在处理敏感话题时。因此，还需要收集一小部分标注过的事实性数据，专门用于处理那些容易出现事实矛盾或错误陈述的敏感问题。
111 | 
112 | ### 可控性(Steerability)
113 | 可控性是指将模型的行为和结果导向符合开发者和用户需求的能力。对于像Llama 3这样的基础模型来说，应该具备极强的可控性，以便轻松适应不同的下游应用场景。
114 | 
115 | 在Llama 3的开发中，我们通过系统提示和自然语言指令来增强其可控性，特别是针对响应的长度、格式、语气和角色/个性等方面进行调整。这意味着开发者和用户可以通过指定的指令来精确控制模型的输出，使其更符合特定的需求和期望。
116 | 
117 | 具体做法略。
118 | 
119 | 
120 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_speech.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PromptExpert/blogs/b315a1d329f01677a650f3955a1e76263d09f82b/docs/llm_foundation/llama3/llama3_speech.md


--------------------------------------------------------------------------------
/docs/llm_foundation/llama3/llama3_vision.md:
--------------------------------------------------------------------------------
 1 | # Llama3多模态之视觉
 2 | 
 3 | 这里只分析模型结构。
 4 | 
 5 | ![](../../images/llama3_vision.png)
 6 | 
 7 | ## Image encoder
 8 | Image encoder采用ViT-H/14的变种，$16 \times 16$ 个patch，每个patch 7680维。Image encoder的参数在后面的训练中参与调整。
 9 | 
10 | ## Image adapter
11 | > We introduce cross-attention layers between the visual token representations produced by the image encoder and the token representations produced by the language model. The cross-attention layers are applied after every fourth self-attention layer in the core language model. The cross-attention layers introduce substantial numbers of additional trainable parameters into the model: for Llama 3 405B, the cross-attention layers have ≈100B parameters.
12 | 
13 | Cross-attention的本质特征是q和kv来自不同的来源。在这里，每4层施加一个cross-attention，意思是query来自输入文本特征，kv来自视觉特征。
14 | 
15 | ## Video adapter
16 | 每个视频均匀采样64帧，每帧先经过image encoder计算，那么每个视频就得到 $64 \times (16 \times 16=256) \times 7680$ 的多维特征。
17 | 
18 | Video adapter的作用是将 $64 \times 256 \times 7680$ 的时序特征转化成和图片一样的 $256 \times 7680$ 形状。
19 | 
20 | 将时序特征合并为一个特征的组件叫temporal aggregator，具体来说是perceiver resampler。
21 | 
22 | Perceiver resampler来自flamingo模型，大致结构如下：
23 | ![](../../images/perceiver.png)
24 | 
25 | 图中的latent array，对应最上面Figure 28中的random initialized latents，latent array在cross attention中充当query，它的权重随机初始化，参与训练。
26 | 
27 | 原文中还提到，temporal aggregator将32个连续的帧合并为1个，那64个帧就合并为2个帧，然后呢？拼接还是求平均？原文没有说清楚。
28 | 
29 | 
30 | 
31 | 
32 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/llm_architecture.md:
--------------------------------------------------------------------------------
  1 | # 各种大模型架构，傻傻分不清楚？
  2 | 
  3 | 各种（基于Transformer的）大模型架构，什么encoder-decoder, encoder-only, decoder-only, prefix-decoder之类的，到底是什么意思？有什么区别？
  4 | 
  5 | 其实没那么多概念，基本的思想是很简单的，本文解释几个最基本的概念，其他所有术语都是这些概念的变形和组合。
  6 | 
  7 | ## 基础概念
  8 | 
  9 | **Auto-regressive** 训练语言模型时，当前的预测（或当前的编码），依赖且仅依赖于过去的信息。
 10 | 
 11 | **Causal** 几乎和auto-regressive同义。
 12 | 
 13 | **Encoder** 将文本（或其他模态的raw data）转换为向量的模块。
 14 | 
 15 | **Decoder** 将向量转换为文本的模块。
 16 | 
 17 | ## 模型结构
 18 | 
 19 | **Encoder-Decoder** 原始Transforer结构。模型一部分参数只做编码，另一部分参数只做解码。比如，T5。
 20 | 
 21 | **Causal Decoder-only** 模型只有原始Transformer的decoder部分。语言建模时，仅依赖过去的序列，预测下一个token。比如，GPT2。
 22 | 
 23 | **Non-causal decoder-only** 模型只有原始Transformer的decoder部分。语言建模时，不是仅依赖过去的序列。也叫prefix language model，简称prefixLM。也叫prefix decoder。比如，GLM。
 24 | 
 25 | **Encoder-only** 模型只有原始Transformer的encoder部分，也叫autoencoding。比如，BERT。
 26 | 
 27 | ## 预训练目标函数
 28 | 
 29 | 模型结构不是孤立存在，要和目标函数共同完成模型建模的任务。
 30 | 
 31 | **Full language modeling** 从第一个token开始，一个一个地进行auto-regressive语言建模。
 32 | 
 33 | **Prefix language modeling** 允许先将文本的一部分定义为prefix，attention可以关注到全部prefix，也就是说在prefix内部是non-causal的。在prefix之外，基于之前的token，一个一个地预测之后的token。Encoder-decoder和Non-causal decoder-only的训练目标都是prefix language modeling。
 34 | 
 35 | **Masked language modeling** tokens或spans of tokens被[mask]替换，然后预测这替换的token，典型的如BERT。
 36 | 
 37 | ## 大佬发言
 38 | 这里列举一些大佬的观点，非常精辟，一针见血，有助于加深理解。
 39 | 
 40 | > What's called "encoder only" actually has an encoder and a decoder (just not an auto-regressive decoder). 
 41 | -- Yann LeCun
 42 | 
 43 | 解释：以BERT为例，在将窗口转换为特征的时候，模型是encoder。在基于窗口向量，预测mask的时候，模型是decoder。因为预测的时候同时用了过去和将来的信息，所以不是auto-regressive 。
 44 | 
 45 | ---
 46 | 
 47 | > What's called "encoder-decoder" really means "encoder with auto-regressive decoder"
 48 | -- Yann LeCun
 49 | 
 50 | 解释：以原始Transformer为例，decoder解码的时候只依赖于原文和已经翻译了的译文，所以是auto-regressive decoder。 
 51 | 
 52 | ---
 53 | 
 54 | > What's called "decoder only" really means "auto-regressive encoder-decoder"
 55 | -- Yann LeCun
 56 | 
 57 | 解释：以GPT2为例，模型既是编码器，也是解码器，不论编码还是解码，都是auto-regressive。这句话非常精辟，要仔细品味。
 58 | 
 59 | ---
 60 | 
 61 | > EncDec, PrefixLMs, Causal Dec-onlys are all autoregressive. 
 62 | -- Yi Tay 
 63 | 
 64 | 解释：在生成阶段，都是autoregressive。
 65 | 
 66 | ---
 67 | 
 68 | > All 3 archs are not that different. People somehow imagine that EncDec to be "not good at generation or chat". Not true. It's the objective function that matters.
 69 | -- Yi Tay 
 70 | 
 71 | 解释：模型结构是表象，是外形，目标函数才决定了模型的本质。
 72 | 
 73 | ---
 74 | 
 75 | > PrefixLM are causal decoders with non-causal input (but causal targets).
 76 | -- Yi Tay 
 77 | 
 78 | 解释：refixLM在non-causel input(即prefix)的基础上，进行causal语言建模。
 79 | 
 80 | ---
 81 | 
 82 | > Encoder-Decoders are prefixLMs with non-shared weights that connects two (enc/dec) stacks with cross attention. 
 83 | -- Yi Tay 
 84 | 
 85 | 解释：在encoder-decoder中，prefix就是encoder所编码的文本，encoder和decode是两套参数，decoder解码的时候用了cross attension。
 86 | 
 87 | ---
 88 | 
 89 | > Everything is technically seq2seq. Just whether it has a mask and whether the 'inputs' is empty.
 90 | -- Yi Tay 
 91 | 
 92 | 解释：所有架构都联系到一起了，最高层面的一般化。
 93 | 
 94 | 
 95 | 
 96 | *参考资料*
 97 | - [https://deepgenerativemodels.github.io/notes/autoregressive/](https://deepgenerativemodels.github.io/notes/autoregressive/)
 98 | - [https://huggingface.co/docs/transformers/tasks/language_modeling](https://huggingface.co/docs/transformers/tasks/language_modeling)
 99 | - [UL2: Unifying Language Learning Paradigms](https://arxiv.org/pdf/2205.05131.pdf)
100 | - [What Language Model Architecture and Pretraining
101 |     Objective Work Best for Zero-Shot Generalization?](https://arxiv.org/pdf/2204.05832.pdf)
102 | - A Survey of Large Language Models
103 | - [https://twitter.com/ylecun/status/1651762787373428736](https://twitter.com/ylecun/status/1651762787373428736)
104 | - [https://twitter.com/YiTayML/status/1651927473884655616](https://twitter.com/YiTayML/status/1651927473884655616)


--------------------------------------------------------------------------------
/docs/llm_foundation/memory_usage_training.md:
--------------------------------------------------------------------------------
 1 | # 训练时的显存如何计算？
 2 | 
 3 | ## 概述
 4 | 
 5 | 对于一个1.5B参数量的GPT-2模型，在16比特精度下需要3GB的内存来存储它的权重，但它却没法在具有32GB显存的一块GPU上进行训练。在训练阶段，绝大部分显存都是模型状态(model states)消耗的，即，由优化器状态、梯度、参数构成的张量。除此之外，剩下的显存由activations、临时缓冲区和fragmented memory消耗，统称为残留状态(residual states)。
 6 | 
 7 | 注：下文的内存指的是GPU内存，即显存。
 8 | 
 9 | ## 模型状态
10 | 
11 | ### 模型参数
12 | 
13 | 模型的内存 = 参数量 $\times$ 每个参数的内存
14 | 
15 | 每个参数的内存，取决于精度。一个float32精度的参数，内存是4字节；int8则是1字节。
16 | 
17 | 一个6B参数量的模型，如果是float32精度，需要24GB内存；如果是int8精度，需要6GB。
18 | 
19 | ### 梯度内存
20 | 
21 | 因为模型中梯度的数量通常等于中间变量的数量，所以memory_activations= memory_gradients。memory_activations的计算见下文。梯度内存的计算有时候也包含在优化器中。
22 | 
23 | ### 优化器内存
24 | 
25 | 优化器的状态通常会占到很大一部分的内存，这在混合精度训练中尤为明显。
26 | 
27 | 不同优化器所存储的参数量不同。以Adam为例，它需要存储两部分的优化器状态：time averaged momentum和variance of the gradients。因此，在使用Adam进行模型训练的时候需要有足够的的内存空间来存储动量估计和梯度方差的复制值，除此之外还需要有足够的空间存储模型本身的梯度和权重。1.5B的GPT2，模型本身只需要3GB，但加上优化器之后，则需要24GB。
28 | 
29 | ## 残留状态
30 | 
31 | ### Activations
32 | 
33 | 训练样例和模型参数进行矩阵运算，产生activations。activations是计算并存储在传播中的中间变量，在计算梯度时需要使用这些变量。
34 | 
35 | 以llama为例。llama架构为hidden_size=4096, intermediate_size=1008, num_hidden_layers=32, content_length=2048。每个instance所需内存为(4096+11008) x 32 x 2048 = 990MB。
36 | 
37 | 再比如，使用序列长度为1K和batch size为32训练的1.5B参数GPT-2模型，需要大约60GB的内存。
38 | 
39 | Activation checkpointing是一种常见的方法，通过牺牲33％的重新计算开销，从而将激活内存减少约总激活数量的平方根。这将该模型的激活内存消耗降低到约8GB。尽管有了activation checkpointing，对于更大的模型，激活内存仍可能会变得非常大。例如，类似GPT的模型具有1000亿个参数，即使使用激活检查点，对于批处理大小为32，也需要大约60GB的内存。
40 | 
41 | ### 临时缓冲区
42 | 
43 | 用于存储中间结果的临时缓冲区会消耗大量内存。像梯度all-reduce、梯度norm计算这样的操作，会把所有梯度融合到单个扁平的缓冲区(Operations such as gradient all-reduce, or gradient norm computation tend to fuse all the gradients into a single flattened buffer)。例如，对于一个包含1.5B个参数的模型，一个扁平的的fp32缓冲区将需要6GB的内存。
44 | 
45 | ### 内存碎片
46 | 
47 | 到目前为止，我们已经讨论了训练期间实际的内存消耗。此外，即使有大量可用内存，也有可能耗尽可用内存。这可能是由于内存碎片造成的。如果没有足够的连续内存来满足请求，申请内存将失败，即使总可用内存大于请求的内存量。我们观察到在训练非常大的模型时存在显着的内存碎片，导致在某些极端情况下即使仍有超过30％的内存可用，也会发生内存不足的问题。
48 | 
49 | *参考资料*
50 | 
51 | - [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
52 | - [有哪些省内存的大语言模型训练/微调/推理方法？](https://news.sohu.com/a/664956986_121119001)
53 | 


--------------------------------------------------------------------------------
/docs/llm_foundation/norm.md:
--------------------------------------------------------------------------------
 1 | # Normalization概述
 2 | 
 3 | 对于预训练LLMs来说，训练不稳定是一个具有挑战性的问题。为了缓解这个问题，Normalization是稳定神经网络训练的一种广泛采用的策略。
 4 | 
 5 | ## LayerNorm
 6 | 
 7 | 神经网络层数变多时，会遇到Internal Covariate Shift的问题，指的是每一层的梯度高度依赖于前一层的输出，每一层的参数更新会导致上层的输入数据分布发生变化，通过层层叠加，高层的输入分布变化会非常剧烈，这就使得高层需要不断去重新适应底层的参数更新。
 8 | 
 9 | LayerNorm可以缓解Internal Covariate Shift问题，做法是对于每一层的输入，先计算它们的均值和标准差：
10 | 
11 | $$\mu^{l}=\frac{1}{H} \sum_{i=1}^{H} a_{i}^{l} \quad \sigma^{l}=\sqrt{\frac{1}{H} \sum_{i=1}^{H}\left(a_{i}^{l}-\mu^{l}\right)^{2}}$$
12 | 
13 | 然后每个输入减去均值，除以标准差，再输入到神经网络中。
14 | 
15 | ## 原始Transformer中的LayerNorm
16 | 
17 | 原始Transformer中，MHA和LLN后面加了LayerNorm。
18 | 
19 | ## RMSNorm
20 | 
21 | RMSNorm仅将输入除以root mean square，可加快神经网络的训练速度。Llama 2使用了RMSNorm。
22 | 
23 | $$
24 | \bar{a_i}=\frac{a_i}{\operatorname{RMS}(\mathbf{a})} g_i, \quad \text { where } \operatorname{RMS}(\mathbf{a})=\sqrt{\frac{1}{n} \sum_{i=1}^n a_i^2} .
25 | $$
26 | 
27 | 其中，$\mathbf{g} \in \mathbb{R}^n$ 是缩放系数，初始化为1。
28 | 
29 | ## Norm的位置
30 | 
31 | **Post-LN** 原始Transformer用的就是post-LN。Post-LN不太稳定，所以现在已经很少使用了。
32 | **Pre-LN** pre-LN 是在每个sub-layer之前应用的，最后一个预测层之前也加了个LN。和post-LN相比，pre-LN更稳定，但是效果有所下降。
33 | 
34 | **Sandwich-LN**  层之前和之后都加LN。
35 | 
36 | ![截屏2023-08-30 13.20.03.png](../images/norm.png)
37 | 
38 | ## DeepNorm
39 | 
40 | ![截屏2023-08-30 13.24.06.png](../images/deepnorm.png)
41 | 
42 | DeepNorm形式上很简单，同时具有稳定和表现好的优点，可以将Transformer扩展到1000层。GLM-130B用了DeepNorm。
43 | 
44 | 参考资料
45 | 
46 | - [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223)
47 | - [Layer Normalization](https://arxiv.org/pdf/1607.06450v1.pdf)
48 | - [详解深度学习中的Normalization，BN/LN/WN](https://zhuanlan.zhihu.com/p/33173246)
49 | - [Root Mean Square Layer Normalization](https://arxiv.org/pdf/1910.07467.pdf)
50 | - [CogView: Mastering Text-to-Image Generation via Transformers](https://arxiv.org/pdf/2105.13290.pdf)
51 | - [DeepNet: Scaling Transformers to 1,000 Layers](https://arxiv.org/pdf/2203.00555.pdf)


--------------------------------------------------------------------------------
/docs/llm_foundation/swiglu.md:
--------------------------------------------------------------------------------
  1 | # SwiGLU激活函数
  2 | 
  3 | Gated Linear Units由两个线性映射的元素乘组成，其中一个过一道sigmoid函数。用其他激活函数替代sigmoid，就形成了GLU的变种，SwiGLU就是其中一个。SwiGLU替换Transformer FFN中的ReLU，可提升Transformer的训练效果。
  4 | 
  5 | ## GLU
  6 | Gated Linear Units（GLUs）是一种在神经网络中使用的激活机制，旨在提高模型的表达能力和计算效率。GLUs是由Yann Dauphin等人于2016年在论文《Language Modeling with Gated Convolutional Networks》中提出的，最初用于提升语言模型中的卷积神经网络（CNN）的性能。
  7 | 
  8 | GLUs利用了门控机制（gating mechanism），类似于LSTM中的门控单元，通过引入额外的门控结构来控制信息的传递。具体的公式如下：
  9 | 
 10 | $$
 11 | y=(X W+b) \otimes \sigma(X V+c)
 12 | $$
 13 | 
 14 | 其中：
 15 | - $X$ 是输入张量。
 16 | - $W$ 和 $V$ 是权重矩阵。
 17 | - $b$ 和 $c$ 是偏置向量。
 18 | - $\sigma$ 是sigmoid激活函数。
 19 | - $\otimes$ 表示元素乘。
 20 | 
 21 | 公式的前半部分是一个线性变换，后半部分是一个门控信号，利用sigmoid函数将其范围压缩到0,1之间。最终的输出是这两个部分的逐元素乘积。
 22 | 
 23 | 优点：
 24 | 1. 提高模型表达能力：GLUs通过引入门控机制，允许网络选择性地传递信息，增强了模型的非线性表达能力。
 25 | 2. 缓解梯度消失问题：由于使用了sigmoid函数，GLUs可以有效缓解梯度消失问题，使得训练更加稳定。
 26 | 3. 计算效率高：与一些复杂的激活函数相比，GLUs的计算开销较低，适合大规模神经网络的训练。
 27 | 
 28 | 实现：
 29 | ```python
 30 | import torch
 31 | import torch.nn as nn
 32 | 
 33 | class GLU(nn.Module):
 34 |     def __init__(self, input_dim, output_dim):
 35 |         super(GLU, self).__init__()
 36 |         self.linear = nn.Linear(input_dim, output_dim)
 37 |         self.gate = nn.Linear(input_dim, output_dim)
 38 |         
 39 |     def forward(self, x):
 40 |         return self.linear(x) * torch.sigmoid(self.gate(x))
 41 | 
 42 | # 示例使用
 43 | input_dim = 10
 44 | output_dim = 20
 45 | glu = GLU(input_dim, output_dim)
 46 | x = torch.randn(5, input_dim)
 47 | output = glu(x)
 48 | print(output)
 49 | ```
 50 | 
 51 | ## Swish
 52 | Swish激活函数是一种由谷歌研究团队提出的激活函数，具有以下形式：
 53 | 
 54 | $$
 55 | \operatorname{Swish}_{\beta}(x)=x \cdot \operatorname{sigmoid}(\beta x)
 56 | $$
 57 | 
 58 | Swish 激活函数的特点是它的输出不仅仅依赖于输入 $x$ 本身，还受$\sigma(x)$ 的影响，使其在一定程度上保留了输入的信息，同时引入了非线性变换。这种特性使得 Swish 函数在深度神经网络中表现出色，尤其是在处理梯度传播和优化问题时。
 59 | 
 60 | Swish 激活函数的优点：
 61 | - 平滑性：Swish 是一个平滑的激活函数，没有 ReLU 激活函数的硬拐点，这有助于梯度的稳定传播。
 62 | - 自门控机制：Swish 函数通过输入本身进行自门控，允许小负值通过，这对于梯度的流动有利，尤其是深层网络。
 63 | - 实验表现：在一些实验中，Swish 激活函数在图像分类和机器翻译等任务上显示出优于 ReLU 和其他传统激活函数的性能。
 64 | 
 65 | ![](../images/swish.png)
 66 | 
 67 | ## Transformer中的SwiGLU
 68 | 在Transformer中, attention层之后，还要过一层position-wise feed-forward networks。
 69 | $$
 70 | \operatorname{FFN}(x)=\max \left(0, x W_1+b_1\right) W_2+b_2
 71 | $$
 72 | 其中，$x$是每个位置的向量。这个公式的含义是$x$经过一个线性映射，再经过ReLU，最后再经过另一个线性映射。
 73 | 
 74 | SwiGLU替换ReLU之后，FFN则变成：
 75 | $$
 76 | \operatorname{FFN}_{\text {SwiGLU }}\left(x\right)=\left(\operatorname{Swish}_1(x W) \otimes x V\right) W_2
 77 | $$
 78 | 
 79 | 其中， $\operatorname{Swish}_1$ 意思是 $\beta=1$。公式中省略偏置项。
 80 | 
 81 | ```python
 82 | # 原始FFN
 83 | import torch
 84 | import torch.nn as nn
 85 | 
 86 | class PositionWiseFeedForward(nn.Module):
 87 |     def __init__(self, d_model, d_ffn):
 88 |         super(PositionWiseFeedForward, self).__init__()
 89 |         self.linear1 = nn.Linear(d_model, d_ffn)
 90 |         self.relu = nn.ReLU()
 91 |         self.linear2 = nn.Linear(d_ffn, d_model)
 92 |     
 93 |     def forward(self, x):
 94 |         return self.linear2(self.relu(self.linear1(x)))
 95 | 
 96 | #SwiGLU FFN
 97 | class SwiGLUFeedForward(nn.Module):
 98 |     def __init__(self, d_model, d_ffn):
 99 |         super(SwiGLUFeedForward, self).__init__()
100 |         self.linear1 = nn.Linear(d_model, d_ffn)
101 |         self.linear2 = nn.Linear(d_model, d_ffn)
102 |         self.linear3 = nn.Linear(d_ffn, d_model)
103 |     
104 |     def forward(self, x):
105 |         swish = x * torch.sigmoid(self.linear1(x))
106 |         v = self.linear2(x)
107 |         x = swish * v
108 |         return self.linear3(x)
109 | ```
110 | 
111 | 
112 | 
113 | 
114 | *参考资料*
115 | - [Language Modeling with Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083)
116 | - [SEARCHING FOR ACTIVATION FUNCTIONS](https://arxiv.org/pdf/1710.05941)
117 | - [GLU Variants Improve Transformer](https://arxiv.org/pdf/2002.05202) 


--------------------------------------------------------------------------------
/docs/misellaneous/ai_search_review.md:
--------------------------------------------------------------------------------
 1 | # 关于AI搜索的一些想法
 2 | 
 3 | - 什么是AI搜索？Grok 3 给出了一个很准确的定义：AI Search refers to the application of AI to improve how information is retrieved and presented. 
 4 | - AI搜索 $\approx$ 联网大模型 $\approx$ 智能问答机器人 $\approx$ RAG
 5 | - 搜索不会死，只会进化。永远有新的数据产生，永远有私域数据，永远有精准获取信息的需求，那么搜索就会一直存在。大模型会吃掉一部分原来的搜索的份额，但不会完全取代搜索。
 6 | - 有些人希望把AI搜索当Deep Search用，这是期望管理没做好。话说回来，确实有的AI搜索强到可以当Deep Search用。
 7 | - AI搜索优化到一定程度，得用Agent，甚至直接变身Agent。
 8 | - AI搜索系统设计没有固定范式。面对具体需求和问题，从第一性原理出发，就会自然找到解决办法。
 9 | - 前期要想好定位，什么都想做，很可能什么都做不好。
10 | - AI搜索不总是会提升用户体验，要保持克制，这一点微信搜索就做得很好。
11 | - RAG中的Retrieval非常重要。
12 | - 对于一个大型AI搜索应用，搜索该有的东西一样都不能少，向量检索是远远不够的。意图理解、query改写、全文检索、相关性、时效性、权威性、多路召回、粗排、精排，优化到后来，都得有。
13 | - LangChain是垃圾，珍爱生命，原理LangChain。
14 | - 向量模型随便选一个下载量高的，就够了。
15 | - 99%的RAG论文都是水货。
16 | - 多模态没什么可说的，早就有了。
17 | - 个性化没什么可说的，早就有了。
18 | - 有些query的意图就是模糊的，可以引导，但不要强猜。
19 | 
20 | 
21 | 


--------------------------------------------------------------------------------
/docs/mllm.md:
--------------------------------------------------------------------------------
1 | # 目录
2 | 
3 | - [MLLM论文精选（持续更新）](mllm/mllm_papers.md)


--------------------------------------------------------------------------------
/docs/mllm/mllm_papers.md:
--------------------------------------------------------------------------------
 1 | # MLLM论文精选（持续更新）
 2 | 
 3 | 多模态交流QQ群: 237976286
 4 | 
 5 | ## 最新动态
 6 | - 2025.03 [Gemma 3 Technical Report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf)
 7 | - 2025.03 [AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied System](https://agibot-world.com/blog/agibot_go1.pdf)
 8 | - 2025.02 [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://arxiv.org/abs/2502.14786)
 9 | - 2025.02 [Qwen2.5-VL Technical Report](https://arxiv.org/abs/2502.13923)
10 | - 2025.02 [mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data](https://arxiv.org/pdf/2502.08468)
11 | - 2025.02 [Ovis2](https://github.com/AIDC-AI/Ovis)
12 | - 2025.02 [ALIGNVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding](https://arxiv.org/pdf/2502.01341)
13 | - 2025.01 [Multimodal Open R1](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal)
14 | - 2025.01 [Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models](https://arxiv.org/abs/2501.14818)
15 | - 2025.01 [Qwen2.5 VL](https://qwenlm.github.io/blog/qwen2.5-vl/)
16 | - 2025.01 [VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding](https://arxiv.org/abs/2501.13106)
17 | - 2025.01 [vikhyatk/moondream2](https://hf-mirror.com/vikhyatk/moondream2) 
18 | - 2025.01 [VideoRAG: Retrieval-Augmented Generation over Video Corpus](https://arxiv.org/pdf/2501.05874)
19 | - 2025.01 [MiniCPM-o](https://github.com/OpenBMB/MiniCPM-o) MiniCPM-V升级版。
20 | - 2025.01 [Imagine while Reasoning in Space: Multimodal Visualization-of-Thought](https://arxiv.org/abs/2501.07542)
21 | - 2025.01 [Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos](https://github.com/magic-research/Sa2VA)
22 | - 2025.01 [2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining](https://arxiv.org/abs/2501.00958) 达摩院开源的多模态数据集，由22,000小时的上课视频而来。
23 | - 2024.12 [MetaMorph: Multimodal Understanding and Generation via Instruction Tuning](https://arxiv.org/abs/2412.14164v1) 
24 | - 2024.12 [Apollo: An Exploration of Video Understanding in Large Multimodal Models](https://apollo-lmms.github.io/) Meta出品的Video-LLM
25 | - 2024.12 [DeepSeek-VL2](https://github.com/deepseek-ai/DeepSeek-VL2/tree/main)
26 | - 2024.12 [FastVLM: Efficient Vision Encoding for Vision Language Models](https://arxiv.org/html/2412.13303v1)
27 | - 2024.12 [POINTS1.5 Buiding a Vision-Language Model towards Real World Applications](https://github.com/WePOINTS/WePOINTS/blob/main/POINTS1_5_preview.pdf) 微信出品。
28 | - 2024.12 [InternVL 2.5](https://hf-mirror.com/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c)  1B 到 78B 都有。
29 | - 2024.12 [Qwen2-VL-72B Model](https://hf-mirror.com/Qwen/Qwen2-VL-72B) 
30 | - 2024.12 [Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion](https://arxiv.org/abs/2412.04424) 
31 | - 2024.12 [NVILA: Efficient Frontier Visual Language Models](https://arxiv.org/abs/2412.04468) NVIDIA出品，同时优化效率和准确率的VLM。
32 | - 2024.12 [PaliGemma 2:A Family of Versatile VLMs for Transfer](https://arxiv.org/pdf/2412.03555) 
33 | - 2024.11 [Multimodal Autoregressive Pre-training of Large Vision Encoders](https://arxiv.org/pdf/2411.14402) 苹果提出全新的视觉编码器训练方式，支持多模态。
34 | - 2024.11 [Pixtral Large](https://mistral.ai/news/pixtral-large/) Mistral发布124B的多模态大模型。
35 | - 2024.11 [OmniVision-968M: World's Smallest Vision Language Model](https://nexa.ai/blogs/omni-vision)
36 | - 2024.11 [LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation](https://microsoft.github.io/LLM2CLIP/) 微软出品，将CLIP中的text encoder替换成LLM，支持更长的上下文和更复杂的文本，有更好的topk检索效果。
37 | - 2024.11 [HourVideo: 1-Hour Video-Language Understanding](https://arxiv.org/abs/2411.04998) 李飞飞团队提出长视频理解评测集
38 | - 2024.11 [Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models](https://arxiv.org/abs/2411.04996)
39 | - 2024.11 [MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS](https://arxiv.org/pdf/2411.02571) 英伟达提出基于MLLM的通用多模态检索。
40 | - 2024.11 [Attacking Vision-Language Computer Agents via Pop-ups](https://arxiv.org/abs/2411.02391) 
41 | - 2024.11 [Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework](https://arxiv.org/abs/2411.01639) 提高多模态基础模型在处理不确定性时的能力，从而增强机器人在规划任务中的可靠性。
42 | - 2024.10 [Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities](https://arxiv.org/abs/2410.11190)
43 | - 2024.10 [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/pdf/2410.17434) Meta提出长视频理解方法。
44 | - 2024.10 [Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data](https://arxiv.org/pdf/2410.18558) 智源开源4千万多模态指令数据。
45 | - 2024.10 [Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance](https://arxiv.org/abs/2410.16261) 
46 | - 2024.10 [VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation](https://arxiv.org/pdf/2409.04429) VILA团队的统一理解和生成模型。
47 | - 2024.10 [Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation](https://arxiv.org/pdf/2410.13848) DeepSeek首个多模态模型。
48 | - 2024.10 [ARIA : An Open Multimodal Native Mixture-of-Experts Model](https://arxiv.org/pdf/2410.05993) 3.9B模型，号称超过 Pixtral-12B 和 Llama3.2-11。
49 | - 2024.10 [BAICHUAN-OMNI TECHNICAL REPORT](https://arxiv.org/pdf/2410.08565) 百川首个7B多模态模型。
50 | - 2024.10 [Pixtral 12B](https://arxiv.org/abs/2410.07073) Mistral出品。
51 | - 2024.10 [Movie Gen: A Cast of Media Foundation Models](https://ai.meta.com/static-resource/movie-gen-research-paper) Meta出品
52 | - 2024.10 [LEOPARD : A Vision Language Model for Text-Rich Multi-Image Tasks](https://arxiv.org/pdf/2410.01744) 
53 | - 2024.10 [Video Instruction Tuning with Synthetic Data](https://llava-vl.github.io/blog/2024-09-30-llava-video/) LLaVA和字节合作开源视频指令数据
54 | - 2024.09 [MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning](https://arxiv.org/abs/2409.20566) 苹果MM1的升级版。
55 | - 2024.09 [Emu3: Next-Token Prediction is All You Need](https://emu.baai.ac.cn/about) BAAI出品。
56 | - 2024.09 [Molmo and PixMo:Open Weights and Open Data for State-of-the-Art Multimodal Models](https://www.arxiv.org/abs/2409.17146) Allen出品，同时开源模型和数据。
57 | - 2024.09 [MIO: A Foundation Model on Multimodal Tokens](https://arxiv.org/abs/2409.17692)
58 | - 2024.09 [Phantom of Latent for Large Language and Vision Models](https://arxiv.org/abs/2409.14713)
59 | - 2024.09 [Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution](https://arxiv.org/pdf/2409.12191)
60 | - 2024.09 [Llama 3.2: Revolutionizing edge AI and vision with open, customizable models](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) 
61 | - 2024.09 [NVLM: Open Frontier-Class Multimodal LLMs](https://arxiv.org/pdf/2409.11402) 英伟达出品。
62 | - 2024.09 [Viper: Open Mamba-based Vision-Language Models](https://github.com/EvanZhuang/viper/tree/main) 首个基于Mamba的VLM系列
63 | - 2024.09 [MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts](https://arxiv.org/abs/2407.21770)
64 | - 2024.09 [General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model](https://arxiv.org/pdf/2409.01704v1) 
65 | - 2024.09 [VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation](https://www.arxiv.org/pdf/2409.04429) 
66 | - 2024.08 [Law of Vision Representation in MLLMs](https://arxiv.org/abs/2408.16357) 提出了AC score指标，AC score越高，视觉表示越好。
67 | - 2024.08 [CogVLM2: Visual Language Models for Image and Video Understanding](https://arxiv.org/abs/2408.16500)
68 | - 2024.08 [EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders](https://arxiv.org/pdf/2408.15998)
69 | - 2024.08 [A Practitioner's Guide to Continual Multimodal Pretraining](https://www.arxiv.org/abs/2408.14471)
70 | - 2024.08 [Building and better understanding vision-language models: insights and future directions](https://www.arxiv.org/pdf/2408.12637)
71 | - 2024.08 [LongVILA: Scaling Long-Context Visual Language Models for Long Videos](https://arxiv.org/abs/2408.10188)
72 | - 2024.08 [UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling](https://arxiv.org/pdf/2408.04810)
73 | - 2024.08 [xGen-MM (BLIP-3): A Family of Open Large Multimodal Models](https://www.arxiv.org/abs/2408.08872) 
74 | - 2024.08 [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326) LLaVA-NeXT系列的集大成。
75 | - 2024.08 [MiniCPM-V: A GPT-4V Level MLLM on Your Phone](https://arxiv.org/abs/2408.01800) 超强的小钢炮MLLM。
76 | - 2024.08 [SAM 2: Segment Anything in Images and Videos](https://arxiv.org/abs/2408.00714) 
77 | 
78 | ## 经典论文
79 | - 2021.02 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020) CLIP
80 | - 2022.04 [Flamingo: a Visual Language Model for Few-Shot Learning](https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf) DeepMind出品，MLLM先驱。
81 | - 2023.01 [BLIP-2](https://arxiv.org/abs/2301.12597) 提出Q-Former。
82 | - 2023.03 [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) CLIP的变种替代品，Sigmoid损失。
83 | - 2023.04 [MiniGPT-4](https://arxiv.org/abs/2304.10592) 热度很高。
84 | - 2023.04 [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) LLaVA系列的第一篇文章。
85 | - 2023.05 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 
86 | - 2023.05 [Segment Anything](https://arxiv.org/abs/2304.02643) SAM
87 | - 2023.12 [Gemini: A Family of Highly Capable Multimodal Models](https://arxiv.org/pdf/2312.11805)
88 | - 2024.01 [AGENT AI:SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION](https://arxiv.org/pdf/2401.03568) 李飞飞团队出品。
89 | - 2024.04 [MM1- Methods, Analysis & Insights from Multimodal LLM Pre-training](https://arxiv.org/abs/2403.09611) 苹果出品。
90 | - 2024.05 [An Introduction to Vision-Language Modeling](https://arxiv.org/pdf/2405.17247) Meta出品，短小精悍。
91 | - 2024.05 [DeepSeek-VL: Towards Real-World Vision-Language Understanding](https://arxiv.org/pdf/2403.05525)
92 | - 2024.06 [Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs](https://arxiv.org/abs/2406.16860) 提出以视觉为中心的benchmark CV-Bench，实验探究各个方面对VLM表现的影响，训练Cambrian-1模型。
93 | - 2024.09 [Llama 3.2: Revolutionizing edge AI and vision with open, customizable models](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)
94 | - 2024.09 [Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution](https://arxiv.org/pdf/2409.12191)
95 | - 2024.09 [Molmo and PixMo:Open Weights and Open Data for State-of-the-Art Multimodal Models](https://www.arxiv.org/abs/2409.17146) Allen出品，同时开源模型和数据。
96 | 


--------------------------------------------------------------------------------
/docs/mllm/mllm_surveys.md:
--------------------------------------------------------------------------------
1 | # 多模态大模型综述Paper List
2 | 
3 | * [An Introduction to Vision-Language Modeling](https://arxiv.org/abs/2405.17247)
4 | * [The Evolution of Multimodal Model Architectures](https://arxiv.org/abs/2405.17927)
5 | * [Efficient Multimodal Large Language Models: A Survey](https://arxiv.org/abs/2405.10739)


--------------------------------------------------------------------------------
/docs/notes.md:
--------------------------------------------------------------------------------
 1 | # Notes
 2 | 
 3 | 笔记中的内容比较随意，没有精心编排。
 4 | 
 5 | - [Video-ChatGPT结构](notes/video_chatgpt.md) 
 6 | - [DSPy理论详解](notes/dspy_theory.md)
 7 | - [参数初始化](notes/weight_init.md)
 8 | - [Batch Normalization](notes/bn.md)
 9 | - [ViT](notes/vit.md)
10 | - [EVA](notes/eva.md)
11 | - [数据集介绍](notes/datasets_intro.md)
12 | - [BLIP-2 Explained](notes/blip2.md)
13 | - [RoPE Explained](notes/rope.md)
14 | - [DPO](notes/dpo.md)
15 | - [MiniGPT4和BLIP2的区别](notes/minigpt4_vs_blip2.md)
16 | - [用最短的篇幅说清楚，DSPy是干吗的](notes/dspy_intro.md)
17 | - [Scaling Laws](notes/scaliing_laws.md)
18 | - [了解o1可以参考的一些资料](notes/o1_literatures.md)


--------------------------------------------------------------------------------
/docs/notes/.ipynb_checkpoints/get_extended_attention_mask-checkpoint.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "ada2a079-2749-448f-88ce-d05744a9af23",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     " def get_extended_attention_mask(\n",
 11 |     "        self,\n",
 12 |     "        attention_mask: Tensor,\n",
 13 |     "        input_shape: Tuple[int],\n",
 14 |     "        device: device,\n",
 15 |     "        is_decoder: bool,\n",
 16 |     "        has_query: bool = False,\n",
 17 |     "    ) -> Tensor:\n",
 18 |     "        \"\"\"\n",
 19 |     "        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.\n",
 20 |     "\n",
 21 |     "        Arguments:\n",
 22 |     "            attention_mask (:obj:`torch.Tensor`):\n",
 23 |     "                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.\n",
 24 |     "            input_shape (:obj:`Tuple[int]`):\n",
 25 |     "                The shape of the input to the model.\n",
 26 |     "            device: (:obj:`torch.device`):\n",
 27 |     "                The device of the input to the model.\n",
 28 |     "\n",
 29 |     "        Returns:\n",
 30 |     "            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.\n",
 31 |     "        \"\"\""
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "id": "3529e2b9",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "get_extended_attention_mask用于生成扩展的attention mask。该函数特别适用于处理self-attention和causal mask情况，尤其是在处理解码器decoder时。"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "raw",
 44 |    "id": "5b25e072-e580-40bb-bf2c-94e9f8365a88",
 45 |    "metadata": {},
 46 |    "source": []
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": null,
 51 |    "id": "31a9e32a-d15b-4bbe-a8cd-c14463838466",
 52 |    "metadata": {},
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n",
 56 |     "# ourselves in which case we just need to make it broadcastable to all heads.\n",
 57 |     "if attention_mask.dim() == 3:\n",
 58 |     "    extended_attention_mask = attention_mask[:, None, :, :]"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "id": "00ae49bd-0b1f-444e-8b00-64f535aa877e",
 64 |    "metadata": {},
 65 |    "source": [
 66 |     "Transformer模型中的多头自注意力机制需要一个形状为 [batch_size, num_heads, from_seq_length, to_seq_length] 的掩码，其中 num_heads 是注意力头的数量。通过在第二个维度插入一个新的维度（None），我们可以确保 attention_mask 变得可广播到所有注意力头上。"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "raw",
 71 |    "id": "e4ecdb8a-4572-4468-ad0f-adab0a256ede",
 72 |    "metadata": {},
 73 |    "source": []
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": null,
 78 |    "id": "c939167a-afef-4afc-9dc1-c5808ff7d238",
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "elif attention_mask.dim() == 2:\n",
 83 |     "    # Provided a padding mask of dimensions [batch_size, seq_length]\n",
 84 |     "    # - if the model is a decoder, apply a causal mask in addition to the padding mask\n",
 85 |     "    # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "id": "7cddf2a4-0086-4c4f-a2f7-14d96cc0ff41",
 91 |    "metadata": {},
 92 |    "source": [
 93 |     "Padding Mask：用于标记输入序列中的填充位置。填充位置通常是为了使序列长度一致，但这些位置不应该在计算注意力得分时被考虑。\n",
 94 |     "Causal Mask：用于确保解码器只能关注当前和之前的标记，以防止信息泄露。这对于自回归生成模型尤为重要，因为模型在生成下一个标记时不应该访问未来的标记。"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "raw",
 99 |    "id": "371eb5d9-cdc4-403f-a467-038c44a8bf2b",
100 |    "metadata": {},
101 |    "source": []
102 |   },
103 |   {
104 |    "cell_type": "code",
105 |    "execution_count": null,
106 |    "id": "febf7d45-7efb-4bbd-bb03-fbb961affd39",
107 |    "metadata": {},
108 |    "outputs": [],
109 |    "source": [
110 |     "if is_decoder:\n",
111 |     "    batch_size, seq_length = input_shape\n",
112 |     "\n",
113 |     "    seq_ids = torch.arange(seq_length, device=device)\n",
114 |     "    causal_mask = (\n",
115 |     "        seq_ids[None, None, :].repeat(batch_size, seq_length, 1)\n",
116 |     "        <= seq_ids[None, :, None]\n",
117 |     "    )\n",
118 |     "\n",
119 |     "    # add a prefix ones mask to the causal mask\n",
120 |     "    # causal and attention masks must have same type with pytorch version < 1.3\n",
121 |     "    causal_mask = causal_mask.to(attention_mask.dtype)"
122 |    ]
123 |   },
124 |   {
125 |    "cell_type": "markdown",
126 |    "id": "763725d2-820e-47f2-aa6c-57f627fd9ac4",
127 |    "metadata": {},
128 |    "source": [
129 |     "这段代码的目的是生成一个causal mask，以确保每个位置只能关注到它之前的位置。\n",
130 |     "seq_ids[None, None, :]通过添加两个维度变成形状为(1, 1, seq_length)。\n",
131 |     "seq_ids[None, None, :].repeat(batch_size, seq_length, 1)将其重复成形状为(batch_size, seq_length, seq_length)。\n",
132 |     "seq_ids[None, :, None]通过添加两个维度变成形状为(1, seq_length, 1)。\n",
133 |     "比较操作<=会生成一个布尔张量causal_mask，形状为(batch_size, seq_length, seq_length)。对于每个位置(i, j)，如果i <= j，则设为True，否则设为False。"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "raw",
138 |    "id": "7032bbd6-a518-4089-8a32-bd8802b17272",
139 |    "metadata": {},
140 |    "source": []
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": null,
145 |    "id": "b1721911-f6c6-4ea4-8108-d2153c3bd2b8",
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "if causal_mask.shape[1] < attention_mask.shape[1]:\n",
150 |     "    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]\n",
151 |     "    if has_query:  # UniLM style attention mask\n",
152 |     "        causal_mask = torch.cat(\n",
153 |     "            [\n",
154 |     "                torch.zeros(\n",
155 |     "                    (batch_size, prefix_seq_len, seq_length),\n",
156 |     "                    device=device,\n",
157 |     "                    dtype=causal_mask.dtype,\n",
158 |     "                ),\n",
159 |     "                causal_mask,\n",
160 |     "            ],\n",
161 |     "            axis=1,\n",
162 |     "        )"
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "markdown",
167 |    "id": "8091cc7a-aeaf-44cc-9998-9861bd8c1fc3",
168 |    "metadata": {},
169 |    "source": [
170 |     "attention_mask所有位置的mask，causal_mask只是causal部分的mask，causal_mask的长度可能小于attention_mask的长度。\n",
171 |     "如果这种情况是因为存在query(即论文中的learned query)，需要把query这部分的mask设为0。"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "raw",
176 |    "id": "00abbe76-df20-4cbc-97b5-89f1daa771fd",
177 |    "metadata": {},
178 |    "source": []
179 |   },
180 |   {
181 |    "cell_type": "code",
182 |    "execution_count": null,
183 |    "id": "e16525a0-deeb-4dcb-b5b5-a3018a1a5a32",
184 |    "metadata": {},
185 |    "outputs": [],
186 |    "source": [
187 |     "causal_mask = torch.cat(\n",
188 |     "        [\n",
189 |     "            torch.ones(\n",
190 |     "                (batch_size, causal_mask.shape[1], prefix_seq_len),\n",
191 |     "                device=device,\n",
192 |     "                dtype=causal_mask.dtype,\n",
193 |     "            ),\n",
194 |     "            causal_mask,\n",
195 |     "        ],\n",
196 |     "        axis=-1,\n",
197 |     "    )"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "markdown",
202 |    "id": "f492c606-fc56-4413-9cce-22a0f368a758",
203 |    "metadata": {},
204 |    "source": [
205 |     "否则，prefix的mask设为1."
206 |    ]
207 |   },
208 |   {
209 |    "cell_type": "raw",
210 |    "id": "2d569517-3bb7-4305-8ba3-1e5670a33303",
211 |    "metadata": {},
212 |    "source": []
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": null,
217 |    "id": "8b277a4e-1ea2-44e2-b530-7bb40d88ebea",
218 |    "metadata": {},
219 |    "outputs": [],
220 |    "source": [
221 |     "extended_attention_mask = (\n",
222 |     "    causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n",
223 |     ")"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "markdown",
228 |    "id": "b3c0bcca-9cb6-4287-b7dd-58da71d9ffc0",
229 |    "metadata": {},
230 |    "source": [
231 |     "将两个mask广播后相乘，就是前面说的apply a causal mask in addition to the padding mask。"
232 |    ]
233 |   }
234 |  ],
235 |  "metadata": {
236 |   "kernelspec": {
237 |    "display_name": "Python 3 (ipykernel)",
238 |    "language": "python",
239 |    "name": "python3"
240 |   },
241 |   "language_info": {
242 |    "codemirror_mode": {
243 |     "name": "ipython",
244 |     "version": 3
245 |    },
246 |    "file_extension": ".py",
247 |    "mimetype": "text/x-python",
248 |    "name": "python",
249 |    "nbconvert_exporter": "python",
250 |    "pygments_lexer": "ipython3",
251 |    "version": "3.10.8"
252 |   }
253 |  },
254 |  "nbformat": 4,
255 |  "nbformat_minor": 5
256 | }
257 | 


--------------------------------------------------------------------------------
/docs/notes/.ipynb_checkpoints/pretrain_stage1-checkpoint.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "cba2ee99-f930-48c0-a8b7-79dc8876afc3",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "model:\n",
 11 |     "  arch: blip2"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "id": "c5d6f2c4-019a-46c7-85c5-1237e3fd05d2",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "模型结构是blip2。\n",
 20 |     "在 https://github.com/salesforce/LAVIS/blob/main/train.py#L94 的build_model方法，会根据arch得到模型类。\n",
 21 |     "\n",
 22 |     "build_model调用registry.get_model_class(model_config.arch), https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/tasks/base_task.py#L32\n",
 23 |     "\n",
 24 |     "blip2模型实际是Blip2Qformer， 定义于 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L25 "
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "raw",
 29 |    "id": "9b905999-ebaa-48cf-b990-6321b09eaf8a",
 30 |    "metadata": {},
 31 |    "source": []
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": null,
 36 |    "id": "796a203f-998a-4dbd-b6f1-8ae44f76d989",
 37 |    "metadata": {},
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "model_type: pretrain"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "id": "36d05ff7-4772-43a6-98f4-38de9af81dd1",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "Each architecture has variants trained on different datasets and at different scale. 好像没什么实际作用。"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "raw",
 53 |    "id": "9b79a370-3262-4755-910f-c2ad589bc800",
 54 |    "metadata": {},
 55 |    "source": []
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "id": "11281184-e10f-4108-bc84-7af56ea873e8",
 61 |    "metadata": {},
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "datasets:\n",
 65 |     "  coco_caption:\n",
 66 |     "    vis_processor:\n",
 67 |     "        train:\n",
 68 |     "          name: \"blip2_image_train\"\n",
 69 |     "          image_size: 224\n",
 70 |     "    text_processor:\n",
 71 |     "        train:\n",
 72 |     "          name: \"blip_caption\"        \n",
 73 |     "  vg_caption: # name of the dataset builder\n",
 74 |     "    vis_processor:\n",
 75 |     "        train:\n",
 76 |     "          name: \"blip_image_train\"\n",
 77 |     "          image_size: 224\n",
 78 |     "    text_processor:\n",
 79 |     "        train:\n",
 80 |     "          name: \"blip_caption\""
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "markdown",
 85 |    "id": "664de2a2-af46-4a7e-905a-aa168c2ca1af",
 86 |    "metadata": {},
 87 |    "source": [
 88 |     "训练所用的数据集。每个数据集，都会被管理到独有的处理类，加载成同一的形式。"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "raw",
 93 |    "id": "54e7dba6-7cf0-4f94-a00b-903adb2a4fb4",
 94 |    "metadata": {},
 95 |    "source": []
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "id": "6b207b02-52aa-485d-a1aa-297110ef2262",
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "run:\n",
105 |     "  task: image_text_pretrain"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "markdown",
110 |    "id": "238db4cf-0533-46aa-9a5d-ce737baa6789",
111 |    "metadata": {},
112 |    "source": [
113 |     "配置任务。每个任务都会关联到具体的任务类。\n",
114 |     "\n",
115 |     "image_text_pretrain任务定义于 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/tasks/image_text_pretrain.py"
116 |    ]
117 |   }
118 |  ],
119 |  "metadata": {
120 |   "kernelspec": {
121 |    "display_name": "Python 3 (ipykernel)",
122 |    "language": "python",
123 |    "name": "python3"
124 |   },
125 |   "language_info": {
126 |    "codemirror_mode": {
127 |     "name": "ipython",
128 |     "version": 3
129 |    },
130 |    "file_extension": ".py",
131 |    "mimetype": "text/x-python",
132 |    "name": "python",
133 |    "nbconvert_exporter": "python",
134 |    "pygments_lexer": "ipython3",
135 |    "version": "3.10.8"
136 |   }
137 |  },
138 |  "nbformat": 4,
139 |  "nbformat_minor": 5
140 | }
141 | 


--------------------------------------------------------------------------------
/docs/notes/blip2.md:
--------------------------------------------------------------------------------
  1 | # BLIP-2 Explained：原理剖析与代码解读
  2 | 
  3 | ## BLIP-2概述
  4 | 
  5 | BLIP代表Bootstrapping Language-Image Pre-training。在这里，“bootstrap”一词可以理解为“自举”或“引导”的意思。它表示通过某种方法逐步建立或提升系统的能力，而无需从零开始完全重新训练。具体来说，在BLIP-2的框架中，bootstrap的作用是利用已有的预训练模型（如预训练的图像编码器和大型语言模型）来逐步学习和提升视觉-语言表示和生成能力，而不需要从头开始训练整个模型。原文标题很恰如其分，BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models。
  6 | 
  7 | BLIP-2训练分为两个阶段：
  8 | - 第一阶段：使用冻结的图像编码器来引导视觉-语言表示学习。在这个阶段中，图像编码器的参数是固定的，不会更新，模型通过学习如何利用这些固定的图像表示来进行视觉-语言任务。
  9 | - 第二阶段：使用冻结的语言模型来引导视觉到语言的生成学习。在这个阶段，语言模型的参数也是固定的，模型通过学习如何将视觉信息转换为语言表示，并生成符合自然语言指令的文本。
 10 | 
 11 | <img src="../images/blip2.png" style="zoom:70%;" />
 12 | 
 13 | BLIP-2的核心组件是Q-Former，它是一个transformer，用于从图片编码器中提取视觉特征，是图片编码器和LLM之间的桥梁。在第一阶段，Q-Former会学习到和文本最相关的视觉特征。在第二阶段，Q-Former的输出和固定参数的LLM连接起来，使得Q-Former的输出能被LLM理解。
 14 | 
 15 | ## Stage 1解释
 16 | 
 17 | Stage 1: Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder
 18 | 
 19 | Stage 1是多任务学习，对应三个损失函数，总损失是三个任务的损失之和。。原文的示意图把三个任务混在一起了，所以不是很好理解，把他们拆开更容易理解。
 20 | 
 21 | ![](../images/blip2-1.png)
 22 | 
 23 | ### 图文对比学习
 24 | 
 25 | 图文对比学习，让同一个图文对的视觉特征和文本特征尽可能相似。
 26 | 
 27 | <img src="../images/blip2-2.png" style="zoom: 33%;" />
 28 | 
 29 | <p style="text-align:center;">插图来源：https://www.youtube.com/watch?v=k0DAtZCCl1w</p>
 30 | 
 31 | Learned queries经过self-attention得到query隐状态，图片经过image encoder得到图片向量，图片向量和query隐状态经过cross attention，得到query output，再经过FFN，得到最终的图片特征。图片特征是一个列表的向量，记为$Z$。
 32 | 
 33 | 文本经过self attention和FFN，取`[CLS]`的向量，记为$t$。
 34 | 
 35 | 图文相似度的计算方式：$Z$中的每个向量和$t$计算相似度，取最大的那个。
 36 | 
 37 | 所谓unimodal self-attention mask，就是图片和文本分别计算self-attention。
 38 | 
 39 | <img src="../images/blip2-3.png" style="zoom:50%;" />
 40 | 
 41 | 这个示意图按行看，阴影是mask，Q是图片的表示，T是文本的表示，图片只能注意图片，文本只能注意文本。
 42 | 
 43 | 代码解释：
 44 | 
 45 | <iframe src="notes/image_text_contrastive_learning.html" width="100%" height="600px"></iframe>
 46 | 
 47 | 
 48 | ### 以图生文
 49 | 
 50 | <img src="../images/blip2-8.png" style="zoom: 25%;" />
 51 | 
 52 | 这个任务是为了让Q-Former具备给图片生成配文的能力，self attention既是编码器，也是解码器。
 53 | 
 54 | self attention的视野范围是全部query output(cross attention的输出)，以及已经解码过的token，这就是所谓的multimodal causal self-attention mask。第一个解码token是`[DEC]`。
 55 | 
 56 | <img src="../images/blip2-4.png" style="zoom:50%;" />
 57 | 
 58 | 这个示意图按行看，阴影是mask，Q是图片的表示，T是文本的表示，图片可以注意所有图片，文本可以注意所有图片和已经解码过的文本。
 59 | 
 60 | 代码解释：
 61 | 
 62 | <iframe src="notes/image_grounded_text_generation.html" width="120%" height="600px"></iframe>
 63 | 
 64 | 
 65 | ### 图文匹配
 66 | 
 67 | <img src="../images/blip2-9.png" style="zoom: 25%;" />
 68 | 
 69 | 明白前两个任务之后，理解第三个任务就比较简单了。
 70 | 
 71 | 图文匹配是一个二分类任务，对于一个图文对，判定是否匹配。
 72 | 
 73 | 在这个任务中，所有的图片特征和文本特征都可以互相注意，这就是所谓的 bi-directional self-attention mask。
 74 | 
 75 | <img src="../images/blip2-5.png" style="zoom: 25%;" />
 76 | 
 77 | 代码解释：
 78 | 
 79 | <iframe src="notes/image_text_matching.html" width="100%" height="600px"></iframe>
 80 | 
 81 | ### 回顾
 82 | 
 83 | 现在在看原论文中的示意图，就好理解了。
 84 | ![](../images/blip2-6.png)
 85 | 
 86 | 再理解几个原文中的句子：
 87 | 
 88 | > 原文：Q-Former consists of two transformer submodules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction, (2) a text transformer that can function as both a text encoder and a text decoder. 
 89 | 
 90 | 解释：这个说法有点误导。在实现上，其实只有一个transformer(更具体的，初始化为bert-base-uncased)，这个transformer有时候输入是文本，有时候输入是图片向量+learned queries。有时候只用self attention，有时候也用cross attention。
 91 | 
 92 | 
 93 | 
 94 | > 原文：We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). 
 95 | 
 96 | 解释：Q-Former有N个block层，每个block层有一个self attention层和一个cross attention层。提取图片特征时，在每个block层，先用self attention计算queies的隐状态，再结合cross attention计算视觉特征。这就是所谓的inserted every other transformer block。这种安排可以在保留queries自我交互的同时，定期引入图像特征信息，平衡两者之间的信息流。
 97 | 
 98 | 
 99 | 
100 | > 原文：The queries can additionally interact with the text through the same self-attention layers.
101 | 
102 | 解释：因为他们共用一个self attention，所以就间接交互了。
103 | 
104 | 
105 | 
106 | > 原文： Depending on the pre-training task, we apply different self-attention masks to control query-text interaction. We initialize Q-Former with the pre-trained weights of BERTbase (Devlin et al., 2019), whereas the cross-attention layers are randomly initialized. In total, Q-Former contains 188M parameters. Note that the queries are considered as model parameters.
107 | 
108 | 解释：we apply different self-attention masks to control query-text interaction这句话是重中之重，是理解Q-Former实现的关键。
109 | 
110 | 
111 | 
112 | 
113 | 
114 | ## Stage 2解释
115 | 
116 | Stage 2: Bootstrap Vision-to-Language Generative Learning from a Frozen LLM
117 | 
118 | 这个阶段就比较简单了，没什么可说的。
119 | ![](../images/blip2-7.png)
120 | 
121 | 以OPT为例，代码在 https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/blip2_opt.py#L107。
122 | 
123 | 
124 | ## 代码详细解释
125 | ### Qformer代码
126 | #### BertSelfAttention
127 | 代码位置: https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L111
128 | 
129 | init参数:
130 | ```python
131 | config：包含了注意力机制的配置参数，例如隐藏层大小、注意力头的数量等。
132 | is_cross_attention：一个布尔值，表示是否是跨注意力（cross-attention）机制，即查询（query）和键值（key-value）来自不同的输入。
133 | ```
134 | 
135 | config在这里设置：https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2.py#L48
136 | 
137 | 具体config文件是：https://huggingface.co/google-bert/bert-base-uncased/blob/main/config.json
138 | 
139 | 根据是否is_cross_attention，K和V的输入维度有所不同。如果is_cross_attention==True，输入维度是Image Encoder的维度，在 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2.py#L47 配置。vision_width就是图片编码器的维度，在 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L71 配置，源头在 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/eva_vit.py#L257 。
140 | 
141 | forward时，根据是否有encoder_hidden_states来赋值is_cross_attention。如果cross，k和v从encoder_hidden_states计算；否则，从hidden_states计算。
142 | 
143 | past_key_value只在解码的时候才有，实现KVcache。经过 transpose_for_scores 处理后，key_layer和value_layer的形状通常是 (batch_size, num_attention_heads, seq_length, attention_head_size)，所以dim=2就是在时间维度上的拼接。
144 | 
145 | `attention_scores = attention_scores + attention_mask`是pointwise操作，有mask的位置会变成一个非常小的数，softmax之后权重就接近0。
146 | 
147 | context_layer是注意力的最终输出，形状是(batch_size, seq_length, all_head_size)。
148 | 
149 | forward输出元组，(context_layer, past_key_value)。
150 | 
151 | #### BertLayer
152 | 代码位置： https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L378
153 | 
154 | `chunk_size_feed_forward`定义在 https://huggingface.co/docs/transformers/main_classes/configuration , bert-base-uncased没有配置，则默认为0. The chunk size of all feed forward layers in the residual attention blocks. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time. 处理长序列输入时，内存占用可能非常高。通过分块处理，可以将输入序列分成若干个小块，每次只处理一个小块，从而减少单次计算所需的内存量。
155 | 
156 | self.seq_len_dim = 1 表示序列长度在shape中的索引是1.
157 | 
158 | cross_attention_freq 设置每隔多少层会插入一个交叉注意力层，在 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2.py#L47 中设置为2。
159 | 
160 | BertIntermediate 是 BERT 模型中的一个中间层组件，被用于实现前馈网络（Feed-Forward Network, FFN）的第一部分。前馈网络是 Transformer 架构中的一个关键组件，它通常包含两个线性变换（全连接层）和一个激活函数。
161 | 在 Transformer 架构中，一个典型的前馈网络由两个线性层和一个激活函数组成：
162 | 1. **第一部分（BertIntermediate）**：
163 |     - 输入维度为 `hidden_size`。
164 |     - 通过线性层变换到 `intermediate_size`。
165 |     - 应用激活函数。
166 | 2. **第二部分（BertOutput）**：
167 |     - 输入维度为 `intermediate_size`。
168 |     - 通过线性层变换回 `hidden_size`。
169 |     - 进行残差连接和层归一化（Layer Normalization）。
170 | 
171 | 
172 | query_length是learned queries的数量，在 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L71 中默认设置为num_query_token=32. 根据num_query_token数量，声明query tokens权重矩阵， https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2.py#L57 
173 | 
174 | #### BertEncoder
175 | 代码在 https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L487 
176 | 
177 | 这个类没什么需要特别解释的，输入一层一层经过BertLayer。
178 | 
179 | 虽然在示意图中好像有两个encoder，其实在实现上其实只有一个，对应所谓的"share the same self-attention layers"。
180 | 
181 | #### BertEmbeddings
182 | https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L51
183 | 
184 | 这是attention前面的层，将输入转换成attention接收的向量。
185 | 
186 | 注意这个计算逻辑:
187 | ```python
188 |  def forward(
189 |         self,
190 |         input_ids=None,
191 |         position_ids=None,
192 |         query_embeds=None,
193 |         past_key_values_length=0,
194 |     ):
195 |         # ...
196 |         if input_ids is not None:
197 |             embeddings = self.word_embeddings(input_ids)
198 |             # ...
199 |             if query_embeds is not None:
200 |                 embeddings = torch.cat((query_embeds, embeddings), dim=1)
201 |         else:
202 |             embeddings = query_embeds
203 | 
204 | ```
205 | 
206 | 
207 | #### BertModel
208 | https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L677 
209 | 
210 | 比BertEncoder多了个输入embedding层。
211 | 
212 | 重点是get_extended_attention_mask这个方法。
213 | 
214 | <iframe src="notes/get_extended_attention_mask.html" width="100%" height="600px"></iframe>
215 | 
216 | #### BertLMHeadModel
217 | https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/Qformer.py#L968
218 | 
219 | 这是qFormer的核心组件。blip.Qformer变量就是BertLMHeadModel对象。
220 | 
221 | 这个类执行语言建模。LM代表语言模型，Head表示这个类实现了一个特定的头部。头部通常指代模型的最后一层或几层，它们负责将模型的编码器输出转换为特定的任务输出。对于语言模型，头部通常包括一个线性层和一个softmax层，用于预测词汇表中的下一个词。`self.cls = BertOnlyMLMHead(config)`做的就是预测下一个词。
222 | 
223 | 继承BertPreTrainedModel的作用：使用基类中定义的初始化方法，初始化参数为预训练BERT权重。
224 | 
225 | ### Stage 1训练代码
226 | 运行命令： https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip2/train/pretrain_stage1.sh
227 | 
228 | #### 配置文件pretrain_stage1.yaml
229 | https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/projects/blip2/train/pretrain_stage1.yaml
230 | 
231 | 全部训练配置都在这里设置。
232 | 
233 | <iframe src="notes/pretrain_stage1.html" width="100%" height="600px"></iframe>
234 | 
235 | #### Blip2Qformer
236 | https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L25
237 | 
238 | 这是训练stage1的核心，数据进入Blip2Qformer类，输出损失。
239 | 
240 | 三个损失函数前面已经分析过。
241 | 
242 | #### ImageTextPretrainTask 和 BaseTask
243 | 训练代码在这里定义。
244 | 
245 | https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/tasks/base_task.py
246 | 
247 | #### RunnerBase
248 | 前面是定义模型、数据、任务，在这里开始训练。
249 | 
250 | https://github.com/salesforce/LAVIS/blob/main/lavis/runners/runner_base.py
251 | 
252 | train方法执行训练循环。train_epoch方法训练一个epoch，它再调用task类的train_epoch方法，进行实际的训练一轮。
253 | 
254 | _save_checkpoint解释：
255 | <iframe src="notes/save_checkpoint.html" width="100%" height="600px"></iframe>
256 | 
257 | ### Stage 2训练代码
258 | 运行命令： https://github.com/salesforce/LAVIS/blob/main/run_scripts/blip2/train/pretrain_stage2.sh
259 | 
260 | #### 配置文件pretrain_stage2.yaml
261 | 
262 | https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/projects/blip2/train/pretrain_stage2.yaml
263 | 
264 | ```yml
265 | model:
266 |   arch: blip2_opt
267 |   model_type: pretrain_opt2.7b 
268 |   load_pretrained: True
269 |   # intialize stage 2 pretraining from stage 1 pretrained model
270 |   pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained.pth"
271 |   freeze_vit: True
272 | ```
273 | 
274 | 模型是blip2_opt
275 | 
276 | #### Blip2OPT
277 | 
278 | https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_opt.py#L21 
279 | 
280 | 这个类包含qFormer和OPT，forward时计算语言模型损失。
281 | 
282 | ### Stage 1 三个任务的mask和前向计算
283 | 
284 | 正文简单大致介绍过各种attention mask, 这里从前向计算角度，再作一次细致的分析。
285 | 
286 | #### 图文对比学习
287 | 有三个对象：图片、queries和文本。
288 | 
289 | queries的self attention是全注意力。
290 | 图片和queries计算cross attention，没有mask。
291 | 文本的self attention是全注意力。
292 | 
293 | 代码解释：
294 | <iframe src="notes/itc_forward.html" width="100%" height="600px"></iframe>
295 | 
296 | #### 以图生文
297 | 文本隐状态是q，kv是queries的kv+文本隐状态。mask是causal mask。
298 | 计算self attention，预测词表，计算交叉熵。
299 | 
300 | 代码解释：
301 | <iframe src="notes/itg_forward.html" width="100%" height="600px"></iframe>
302 | 
303 | #### 图文匹配
304 | 
305 | 代码解释：
306 | <iframe src="notes/itm_forward.html" width="100%" height="600px"></iframe>
307 | 
308 | 
309 | 
310 | 
311 | 
312 | 


--------------------------------------------------------------------------------
/docs/notes/bn.md:
--------------------------------------------------------------------------------
  1 | # Batch Normalization
  2 | 
  3 | ## 介绍Batch Normalization
  4 | Batch Normalization（批归一化）是一种用于加速深度神经网络训练并提高其稳定性的方法。它通过在每个小批量（batch）的数据上进行归一化操作，使得每层神经元的输入分布更加稳定，从而帮助缓解梯度消失和梯度爆炸问题。这种方法由Sergey Ioffe和Christian Szegedy在2015年提出。
  5 | 
  6 | ### Batch Normalization的主要步骤：
  7 | 
  8 | 1. **计算当前批次的均值和方差**：
  9 |     对于一个批次的输入 $\mathbf{X}$，其中每个输入为 $\mathbf{x}_i$​，首先计算该批次中的均值 $\mu_B$​ 和方差 $\sigma_B^2$​：
 10 | 
 11 |     $$\mu_B = \frac{1}{m} \sum_{i=1}^m \mathbf{x}_i$$
 12 | 
 13 |     $$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (\mathbf{x}_i - \mu_B)^2$$
 14 | 
 15 |     其中，$m$ 是批次中的样本数量。
 16 | 
 17 | 2. **归一化**：
 18 |     使用计算出的均值和方差对每个输入进行归一化处理：
 19 | 
 20 |     $$\hat{\mathbf{x}}_i = \frac{\mathbf{x}_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
 21 | 
 22 |     这里 $\epsilon$ 是一个很小的常数，用于防止分母为零。
 23 | 
 24 | 3. **缩放和平移**：
 25 |     引入两个可学习的参数 $\gamma$ 和 $\beta$，分别用于缩放和平移归一化后的值：
 26 | 
 27 |     $$\mathbf{y}_i = \gamma \hat{\mathbf{x}}_i + \beta$$
 28 | 
 29 |     这两个参数允许模型在需要时恢复原始数据的分布。
 30 | 
 31 | ### Batch Normalization的优点：
 32 | 
 33 | 1. **加速训练**：
 34 |     通过稳定每一层的输入分布，Batch Normalization可以使得训练过程更加平滑，从而允许使用更高的学习率。
 35 | 2. **减少对初始化的依赖**：
 36 |     由于每层的输入被归一化，网络对权重初始化的敏感性减弱，使得模型更容易训练。
 37 | 3. **正则化效果**：
 38 |     在某种程度上，Batch Normalization具有正则化效果，可以减少对其他正则化方法（如Dropout）的依赖。
 39 | 4. **缓解梯度消失和梯度爆炸问题**：
 40 |     通过保持输入数据的稳定性，Batch Normalization帮助缓解了深层网络中常见的梯度消失和梯度爆炸问题。
 41 | 
 42 | ### 应用场景：
 43 | 
 44 | Batch Normalization几乎可以应用于任何类型的神经网络，包括卷积神经网络（CNN）、循环神经网络（RNN）和全连接网络。通常，它被插入到每个层的非线性激活函数之前。
 45 | 
 46 | ### 实际应用中的注意事项：
 47 | 
 48 | 1. **批量大小的选择**：
 49 |     较小的批量大小可能会导致均值和方差估计的不稳定，从而影响归一化效果。
 50 | 2. **推理阶段的处理**：
 51 |     在训练阶段，均值和方差是基于小批量数据计算的；在推理阶段（测试阶段），通常使用整个训练数据集的均值和方差的移动平均值。
 52 | 
 53 | Batch Normalization是深度学习领域中的一个重要技术，显著提升了训练效率和模型性能。通过适当的使用，能够帮助开发者构建更深、更复杂的神经网络。
 54 | 
 55 | ## Internal Covariate Shift是什么
 56 | 在Batch Normalization的原始论文《Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift》中，提到的“shift”概念是指“Internal Covariate Shift”（内部协变量偏移）。
 57 | 
 58 | ### Internal Covariate Shift
 59 | 
 60 | **Internal Covariate Shift**是指在神经网络训练过程中，由于网络层的参数在不断更新，每层的输入分布也在不断变化。这种现象会导致以下几个问题：
 61 | 
 62 | 1. **训练不稳定**：每层输入分布的变化使得梯度更新更加不稳定，训练过程变得不那么平滑。
 63 | 2. **训练速度变慢**：为了应对输入分布的变化，模型需要更多的迭代次数才能收敛。
 64 | 3. **对初始化敏感**：由于输入分布的变化，模型对初始权重的选择更加敏感，不同的初始化可能导致显著不同的训练效果。
 65 | 
 66 | ### Batch Normalization如何缓解Internal Covariate Shift
 67 | 
 68 | Batch Normalization通过对每个小批量数据进行归一化处理，使得每层的输入分布在训练过程中更加稳定，从而减轻了Internal Covariate Shift的问题。具体来说，Batch Normalization在每一层对小批量数据的均值和方差进行归一化，然后再应用缩放和平移操作，使得每层的输入分布更接近一定的范围，使得每层的输入分布更加稳定，减轻了内部协变量偏移，进而加速了训练过程，提高了训练的稳定性。
 69 | 
 70 | ### 总结
 71 | 
 72 | Internal Covariate Shift是指神经网络各层输入分布在训练过程中不断变化的问题，导致训练不稳定和速度变慢。Batch Normalization通过对小批量数据进行归一化，降低了这种偏移，使得训练过程更加平滑和高效。这一技术已经成为深度学习模型训练中不可或缺的一部分，显著提升了复杂神经网络的训练性能。
 73 | 
 74 | ## 能用几何解释一下吗
 75 | 当然，可以用几何的角度来解释Batch Normalization以及它如何缓解Internal Covariate Shift的问题。
 76 | 
 77 | ### 几何解释
 78 | 
 79 | 假设我们在训练一个神经网络，输入数据 $\mathbf{X}$是一个高维空间的点，经过每一层的变换后，我们得到新的点 $\mathbf{Y}$。在训练过程中，由于每层的参数（权重和偏置）在不断更新，导致每层输出的分布发生变化。这种变化会影响后续层的输入，进而影响整个网络的训练过程。
 80 | 
 81 | #### Internal Covariate Shift的几何视角
 82 | 
 83 | 1. **输入分布变化**：在高维空间中，每一层的输入数据点 $\mathbf{Y}$ 的分布在不断变化。例如，假设原始输入数据点在一个球形区域内分布，经过一层变换后，这些点可能被拉伸、旋转或平移到一个新的形状和位置。这种变化在训练过程中不断发生。
 84 | 2. **影响训练**：由于输入分布的变化，网络需要不断调整参数以适应新的输入分布，这使得训练过程更加复杂和不稳定。
 85 | 
 86 | #### Batch Normalization的几何视角
 87 | 
 88 | Batch Normalization通过归一化操作，使得每层的输入数据在一个更稳定的分布中，这样可以使得训练更加高效和平稳。具体几何上看：
 89 | 
 90 | 1. **归一化操作**：
 91 |     - **中心化**：将小批量数据的均值移动到原点。这意味着在高维空间中，数据点的中心被移动到了坐标原点。
 92 |     - **标准化**：将数据点的分布缩放到单位球面，即每个数据点的方差变为1。
 93 | 2. **缩放和平移**：
 94 |     - **缩放**：引入参数 $\gamma*γ$ 对归一化后的数据进行缩放。这相当于在高维空间中对数据点进行放大或缩小。
 95 |     - **平移**：引入参数 $\beta$ 对数据点进行平移。这相当于在高维空间中将数据点移动到新的位置。
 96 | 
 97 | ### 几何变换的效果
 98 | 
 99 | 通过上述几何变换，Batch Normalization将每一层的输入数据点分布变得更加稳定和规范化：
100 | 
101 | 1. **保持数据分布稳定**：每层输入数据点的均值被固定在原点，方差被固定在1，这使得后续层的输入分布更加稳定，不会因为前一层的参数更新而发生剧烈变化。
102 | 2. **提高训练效率**：由于每层输入分布的稳定性，网络的参数更新变得更加平滑，可以使用更高的学习率，从而加速训练过程。
103 | 3. **减少梯度问题**：通过保持输入数据点的规范化分布，Batch Normalization帮助缓解了梯度消失和梯度爆炸问题，使得深层网络的训练更加稳定。
104 | 
105 | ### 总结
106 | 
107 | 从几何的角度来看，Batch Normalization通过将每层输入数据点的分布规范化，使得训练过程中的输入分布更加稳定。这种稳定性减少了Internal Covariate Shift的问题，使得网络训练更加高效和平稳。几何上的中心化、标准化、缩放和平移操作共同作用，使得每层输入的数据点分布在高维空间中保持在一个可控的范围内，从而提升了训练性能。
108 | 
109 | ## 缩放和平移的作用是什么
110 | 在Batch Normalization中，缩放（scale）和平移（shift）操作通过引入可学习的参数 $\gamma$ 和 $\beta$，不仅仅是为了恢复数据的原始分布，更重要的是赋予网络更高的灵活性和表达能力。具体来说，它们有以下几个主要作用：
111 | 
112 | ### 1. 恢复表达能力
113 | 
114 | 在Batch Normalization中，归一化步骤将数据的均值调整为0、方差调整为1。这种标准化虽然有助于训练的稳定性和速度，但也有可能限制网络的表达能力，因为标准化后的数据在某些情况下可能不适合某些激活函数（如ReLU）。通过引入 $\gamma$ 和 $\beta$，网络能够在标准化的基础上恢复必要的分布，以适应不同的激活函数和任务需求。
115 | 
116 | 
117 | ### 2. 保持非线性变换
118 | 
119 | 标准化后的数据虽然被规范化了，但可能会破坏原有的特征分布。通过缩放和平移，网络可以在标准化后重新调整数据的尺度和位置，从而保留原始输入特征的非线性变换能力。这样，网络可以更好地学习复杂的非线性关系。
120 | 
121 | ### 3. 提供额外的学习参数
122 | 
123 | 缩放参数 $\gamma$ 和平移参数 $\beta$ 是可学习的，这意味着它们可以通过反向传播进行优化。这些参数提供了额外的自由度，使得每一层的输出不仅依赖于输入数据的分布，还能够通过训练进行微调，从而更好地适应特定的任务。
124 | 
125 | ### 4. 增强稳定性
126 | 
127 | 虽然标准化操作已经在一定程度上提高了训练的稳定性，但完全标准化可能并不总是最优的。缩放和平移操作允许网络在标准化的基础上进行细微调整，从而在保持训练稳定性的同时，增强模型的泛化能力。
128 | 
129 | ### 5. 适应不同层的需求
130 | 
131 | 不同的网络层可能对输入数据的尺度和位置有不同的要求。通过引入 $\gamma$ 和 $\beta$，每一层可以根据自己的需求对输入数据进行调整，从而更好地适应整个网络的训练过程。这样，每一层都可以根据其特定需要对标准化后的数据进行调整，从而达到最佳的训练效果。
132 | 
133 | ### 例子：ReLU激活函数
134 | 
135 | ReLU激活函数的输出为非负值。如果没有缩放和平移，标准化后的输出均值为0，可能会导致大量的负数输入到ReLU中，这会导致大量的神经元输出为0，进而影响模型的表达能力。通过缩放和平移，网络可以调整数据分布，使得更多的输入值为正，从而充分利用ReLU的非线性变换能力。
136 | 
137 | ### 总结
138 | 
139 | 缩放和平移在Batch Normalization中起到以下关键作用：
140 | 
141 | 1. 恢复原始数据的分布，保持网络的表达能力。
142 | 2. 提供额外的学习参数，增强模型的灵活性。
143 | 3. 保持非线性变换，提高模型的复杂性。
144 | 4. 增强训练的稳定性和泛化能力。
145 | 5. 适应不同层的需求，优化整体训练效果。
146 | 
147 | 通过这些作用，Batch Normalization不仅提高了训练的速度和稳定性，还增强了模型的表达能力和泛化性能。
148 | 
149 | ## 如何在PyTorch中使用batch normalization
150 | 在PyTorch中使用Batch Normalization非常方便，PyTorch提供了内置的模块来实现这一功能。以下是如何在不同类型的神经网络中使用Batch Normalization的示例。
151 | 
152 | ### 1. 对于全连接层（Fully Connected Layers）
153 | 
154 | 对于全连接层，可以使用`torch.nn.BatchNorm1d`。
155 | 
156 | ```python
157 | import torch
158 | import torch.nn as nn
159 | 
160 | class FullyConnectedModel(nn.Module):
161 |     def __init__(self, input_size, hidden_size, num_classes):
162 |         super(FullyConnectedModel, self).__init__()
163 |         self.fc1 = nn.Linear(input_size, hidden_size)
164 |         self.bn1 = nn.BatchNorm1d(hidden_size)
165 |         self.fc2 = nn.Linear(hidden_size, num_classes)
166 | 
167 |     def forward(self, x):
168 |         out = self.fc1(x)
169 |         out = self.bn1(out)
170 |         out = torch.relu(out)
171 |         out = self.fc2(out)
172 |         return out
173 | 
174 | # Example usage:
175 | model = FullyConnectedModel(input_size=784, hidden_size=500, num_classes=10)
176 | ```
177 | 
178 | ### 2. 对于卷积层（Convolutional Layers）
179 | 
180 | 对于卷积层，可以使用`torch.nn.BatchNorm2d`。
181 | 
182 | ```python
183 | import torch
184 | import torch.nn as nn
185 | 
186 | class ConvNet(nn.Module):
187 |     def __init__(self, num_classes=10):
188 |         super(ConvNet, self).__init__()
189 |         self.layer1 = nn.Sequential(
190 |             nn.Conv2d(3, 16, kernel_size=3, padding=1),
191 |             nn.BatchNorm2d(16),
192 |             nn.ReLU(),
193 |             nn.MaxPool2d(kernel_size=2, stride=2)
194 |         )
195 |         self.layer2 = nn.Sequential(
196 |             nn.Conv2d(16, 32, kernel_size=3, padding=1),
197 |             nn.BatchNorm2d(32),
198 |             nn.ReLU(),
199 |             nn.MaxPool2d(kernel_size=2, stride=2)
200 |         )
201 |         self.fc = nn.Linear(32*8*8, num_classes)  # assuming input size is 32x32
202 | 
203 |     def forward(self, x):
204 |         out = self.layer1(x)
205 |         out = self.layer2(out)
206 |         out = out.view(out.size(0), -1)
207 |         out = self.fc(out)
208 |         return out
209 | 
210 | # Example usage:
211 | model = ConvNet(num_classes=10)
212 | ```
213 | 
214 | ### 3. 对于循环神经网络（Recurrent Neural Networks）
215 | 
216 | 对于RNN层，可以使用`torch.nn.BatchNorm1d`，不过这种情况较为少见，因为RNN本身的输入输出维度较为复杂，通常不直接对RNN层进行Batch Normalization，而是对其前后层进行。
217 | 
218 | ```python
219 | import torch
220 | import torch.nn as nn
221 | 
222 | class RNNModel(nn.Module):
223 |     def __init__(self, input_size, hidden_size, num_layers, num_classes):
224 |         super(RNNModel, self).__init__()
225 |         self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
226 |         self.bn = nn.BatchNorm1d(hidden_size)
227 |         self.fc = nn.Linear(hidden_size, num_classes)
228 | 
229 |     def forward(self, x):
230 |         out, _ = self.rnn(x)
231 |         out = self.bn(out[:, -1, :])  # apply BN to the last time step's output
232 |         out = self.fc(out)
233 |         return out
234 | 
235 | # Example usage:
236 | model = RNNModel(input_size=28, hidden_size=128, num_layers=2, num_classes=10)
237 | ```
238 | 
239 | ### 4. 训练和推理中的处理
240 | 
241 | 在训练和推理阶段，Batch Normalization的行为是不同的：
242 | 
243 | - **训练阶段**：Batch Normalization使用当前批次的均值和方差进行归一化。
244 | - **推理阶段**：Batch Normalization使用训练过程中累积的均值和方差。
245 | 
246 | 在PyTorch中，这种行为由`model.train()`和`model.eval()`自动处理：
247 | 
248 | ```python
249 | model.train()  # set the model to training mode
250 | # training code here
251 | 
252 | model.eval()  # set the model to evaluation mode
253 | # inference code here
254 | ```
255 | 
256 | ### 总结
257 | 
258 | 在PyTorch中，可以通过`torch.nn.BatchNorm1d`、`torch.nn.BatchNorm2d`等模块非常方便地使用Batch Normalization。将这些模块插入到网络的适当位置，并在训练和推理阶段正确设置模型模式，可以显著提高网络的训练速度和稳定性。
259 | 
260 | ## 用numpy实现batch normalization
261 | 使用NumPy实现Batch Normalization涉及以下几个步骤：计算均值和方差、归一化、缩放和平移。这些步骤在上面的解释中已经详细描述了。下面是一个完整的例子，展示如何在NumPy中实现Batch Normalization。
262 | 
263 | ### 实现步骤
264 | 
265 | 1. **计算当前批次的均值和方差**
266 | 2. **归一化**
267 | 3. **缩放和平移**
268 | 
269 | ### 代码实现
270 | 
271 | ```python
272 | import numpy as np
273 | 
274 | def batch_normalization(X, gamma, beta, epsilon=1e-5):
275 |     """
276 |     Arguments:
277 |     X -- numpy array of shape (m, n), where m is the number of examples and n is the number of features
278 |     gamma -- numpy array of shape (n,), scaling parameter
279 |     beta -- numpy array of shape (n,), shifting parameter
280 |     epsilon -- small float added to variance to avoid dividing by zero
281 |     
282 |     Returns:
283 |     X_normalized -- batch-normalized output
284 |     """
285 |     # Step 1: Calculate mean and variance for the batch
286 |     mu = np.mean(X, axis=0)
287 |     var = np.var(X, axis=0)
288 |     
289 |     # Step 2: Normalize the batch
290 |     X_normalized = (X - mu) / np.sqrt(var + epsilon)
291 |     
292 |     # Step 3: Scale and shift
293 |     X_scaled_shifted = gamma * X_normalized + beta
294 |     
295 |     return X_scaled_shifted
296 | 
297 | 
298 | # Example usage:
299 | # Generate a batch of data with shape (m, n)
300 | np.random.seed(0)
301 | X = np.random.randn(5, 3)  # Example batch with 5 examples and 3 features
302 | 
303 | # Initialize gamma and beta
304 | gamma = np.ones((3,))
305 | beta = np.zeros((3,))
306 | 
307 | # Apply batch normalization
308 | X_bn = batch_normalization(X, gamma, beta)
309 | 
310 | print("Input X:")
311 | print(X)
312 | print("\nBatch-normalized X:")
313 | print(X_bn)
314 | ```
315 | 


--------------------------------------------------------------------------------
/docs/notes/datasets_intro.md:
--------------------------------------------------------------------------------
 1 | # 数据集介绍
 2 | 
 3 | ## [COCO](https://cocodataset.org/#home)
 4 | COCO（Common Objects in Context）数据集是计算机视觉领域中一个非常重要且广泛使用的数据集。它由微软公司发布，用于多种计算机视觉任务，如物体检测、分割、关键点检测和图像描述生成等。COCO数据集的全称是“Common Objects in Context”，强调了其数据中的物体是处于自然场景中的，而不是孤立的。
 5 | 
 6 | 以下是关于COCO数据集的一些关键点：
 7 | 
 8 | ### 数据集规模
 9 | 
10 | - **图像数量**：COCO数据集包含超过20万张高质量的图像。
11 | - **类别数量**：数据集标注了80种常见物体类别，如人、车、椅子、动物等。
12 | - **实例数量**：数据集中总计包含超过150万个物体实例。
13 | 
14 | ### 标注类型
15 | 
16 | COCO数据集的标注非常丰富，包含以下几种主要类型：
17 | 
18 | 1. **物体检测（Object Detection）**：每个物体实例用一个边界框（Bounding Box）标注。边界框标注了物体在图像中的位置和大小。
19 | 2. **语义分割（Semantic Segmentation）**：每个像素都被标注为某一类物体或背景。语义分割提供了更精细的图像理解。
20 | 3. **实例分割（Instance Segmentation）**：不仅标注了每个像素所属的类别，还区分了同一类别中的不同实例。
21 | 4. **关键点检测（Keypoint Detection）**：标注了人体的关键点（如眼睛、耳朵、肩膀、膝盖等），用于姿态估计等任务。
22 | 5. **图像标题生成（Image Captioning）**：每张图像都有多个自然语言描述，便于图像标题生成任务的研究。
23 | 
24 | ## [Visual Genome](https://homes.cs.washington.edu/~ranjay/visualgenome/index.html)
25 | Visual Genome数据集是一个大规模的视觉数据集，旨在将计算机视觉与自然语言处理结合起来。它由斯坦福大学的研究团队创建，目的是促进机器在视觉理解和语义描述方面的研究。
26 | 
27 | Visual Genome包含超过10万张图像，每张图像都经过详细注释。这些注释不仅包括图像中的物体，还包括物体之间的关系以及场景中的属性。
28 | 每张图像的注释通常包括区域描述（Region Descriptions）、对象（Objects）、属性（Attributes）、关系（Relationships）和问答对（QA pairs）。
29 | 
30 | ## CC3M
31 | CC3M（Conceptual Captions 3M）数据集是一个用于图像字幕生成（Image Captioning）任务的大规模数据集。由Google AI在2018年发布，这个数据集包含了约330万对图像和相应的自然语言描述（即字幕）。
32 | 
33 | ## SBU
34 | SBU数据集通常指的是SBU Captioned Photo Dataset，是由纽约州立大学石溪分校（Stony Brook University，简称SBU）发布的一个用于自然语言处理和计算机视觉领域的图像描述生成任务的数据集。SBU数据集包含约100万张图片，每张图片都配有一段描述性的文字（caption）。
35 | 
36 | ## [LAION](https://laion.ai/blog/laion-400-open-dataset/)
37 | LAION数据集（Large-scale Artificial Intelligence Open Network）是一个开放的、多样化的大规模图像-文本数据集，旨在支持和推动计算机视觉和自然语言处理领域的研究。这个数据集由LAION (Large-scale AI Open Network) 社区创建，主要用于训练和评估大规模的视觉-语言模型，如CLIP（Contrastive Language-Image Pretraining）等。
38 | 
39 | 以下是关于LAION数据集的一些关键点：
40 | 
41 | 规模庞大：LAION数据集包含数亿个图像-文本对，是目前公开可用的最大规模的此类数据集之一。这使得它非常适合用于训练需要大量数据的大规模模型。
42 | 
43 | 多样性：数据集涵盖了广泛的图像和文本内容，来源包括网络上的各种公开资源。这种多样性有助于训练出更具泛化能力的模型。
44 | 
45 | 开放获取：LAION数据集是开放的，任何研究人员或开发者都可以自由下载和使用。这种开放性促进了学术研究和工业应用的快速发展和创新。
46 | 
47 | 质量控制：尽管数据集规模庞大，LAION社区也投入了大量精力来保证数据质量，通过自动化和手动审查相结合的方法，尽量减少噪声和不良数据。
48 | 
49 | 应用广泛：LAION数据集可以用于多种任务，包括但不限于图像分类、图像生成、图像描述生成、跨模态检索、图像和文本的对比学习等。
50 | 
51 | LAION数据集的出现为研究人员和工程师提供了一个宝贵的资源，能够有效地推动视觉-语言模型的发展和应用。它不仅在学术界产生了深远影响，也在工业界的实际应用中展示了巨大的潜力。


--------------------------------------------------------------------------------
/docs/notes/dpo.md:
--------------------------------------------------------------------------------
 1 | # DPO
 2 | 
 3 | The core DPO equation is shown again below:
 4 | 
 5 | ![](../images/dpp.jpg)
 6 | 
 7 | 其中：
 8 | - "expected value" is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets)
 9 | - The variable $\pi_{\theta}$ is the so-called policy and represents the LLM we want to optimize; $\pi_{\text{ref}}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\pi$ and $\pi_{\text{ref}}$ are typically the same).
10 | -  $\beta$ is a hyperparameter to control the divergence between the $\pi_{\theta}$ and the reference model; increasing $\beta$ increases the impact of the difference between $\pi_{\theta}$ and $\pi_{\text{ref}}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models.
11 | 
12 | 
13 | [参考](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb) 


--------------------------------------------------------------------------------
/docs/notes/dspy_intro.md:
--------------------------------------------------------------------------------
 1 | # 用最短的篇幅说清楚，DSPy是干吗的
 2 | 
 3 | 你手写过决策树吗？我写过。那是一个由无数个if else组成的程序，用来检测一张图片是否含有某种模式。
 4 | 
 5 | 然而，决策树不必手写的。你只要设定一个优化目标，给定一些数据，训练模型就好了。
 6 | 
 7 | 我们现在做的很多prompt工程，和手写决策树有相似之处：想一个prompt模板或pipeline，调prompt，分析case，再调prompt。这个工作，可不可以也通过“训练”完成呢？
 8 | 
 9 | 答案是可以，dspy就是做这件事的，它是一个自动调试prompt pipeline的框架。dspy类比于xgboost, pytorch，他们做的都是同样的事：给定一个任务，一个指标，一些训练集，然后训练出想要的模型。
10 | 
11 | 和pytorch不同的是，dspy优化的“参数”，不是神经网络，而是prompt和prompt pipeline。
12 | 
13 | 
14 | 


--------------------------------------------------------------------------------
/docs/notes/dspy_theory.md:
--------------------------------------------------------------------------------
  1 | # DSPy理论详解
  2 | 
  3 | 这两篇论文有很多计算机科学术语，而且同一个概念可能在不同的地方用不同的术语表达，容易引起困惑。
  4 | 
  5 | [官方github](https://github.com/stanfordnlp/dspy)给出的定义是：
  6 | > DSPy: The framework for programming—not prompting—foundation models
  7 | 
  8 | 随着下面的分析，我们会逐渐理解这个定义。
  9 | 
 10 | > DSPy pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs) where a compiler automatically generates optimized LM invocation strategies and prompts from a program. 
 11 | 
 12 | 这里programming的含义不是大家常规里的那样（一连串的指令，执行条件分支、循环和表达式计算），而是计算机科学里更一般化的定义（其实两者是等价的，都是图灵完备）。这里的定义和lambda演算类似，lambda演算是一种抽象的计算模型，它使用函数来表达计算，lambda演算中有三个基本概念：变量、抽象（表示函数的定义）、应用（表示函数的调用）。这里modular operators就是函数，text是变量，transformation graphs是一连串的函数。
 13 | 
 14 | ---
 15 | 
 16 | 
 17 | > We first translate string-based prompting techniques, including complex and task-dependent ones like Chain of Thoughtand ReAct, into declarative modules that carry natural-language typed signatures. 
 18 | 
 19 | declarative modules可以理解为所定义的函数、所声明的函数，就好像Python中的`def foo(x,y)`。
 20 | typed signatures是函数的输入类型和返回类型，例如，在Haskell中，`add :: Int -> Int -> Int`就是一个签名，它的意思是，add函数输入两个Int，返回一个Int。
 21 | 
 22 | ---
 23 | 
 24 | > DSPy modules are task-adaptive components—akin to neural network layers—that abstract any particular text transformation, like answering a question or summarizing a paper. 
 25 | 
 26 | modules, transformation, modular operators, 说的都是一回事，都是函数，输入若干个变量，返回若干个变量。
 27 | 
 28 | 
 29 | > Inspired directly by PyTorch abstractions, DSPy modules are used via expressive define-by-run computational graphs. 
 30 | 
 31 | “Define-by-run” 是指在运行时动态定义计算图的编程范式，也就是动态图。在computational graph中，节点是module，边是有向的，表示文本传递的方向。
 32 | 
 33 | 
 34 | > Pipelines are expressed by (1) declaring the modules needed and (2) using these modules in any logical control flow (e.g., if statements, for loops, exceptions, etc.) to logically connect the modules.
 35 | 
 36 | 对Pipelines做出来定义，跟上面的computational graphs是一个意思，
 37 | 
 38 | 
 39 | > We then develop the DSPy compiler, which optimizes any DSPy program to improve quality or cost. The compiler inputs are the program, a few training inputs with optional labels, and a validation metric. The compiler simulates versions of the program on the inputs and bootstraps example traces of each module for self-improvement, using them to construct effective few-shot prompts or finetuning small LMs for steps of the pipeline. 
 40 | 
 41 | > Optimization in DSPy is highly modular: it is conducted by teleprompters, which are general-purpose optimization strategies that determine how the modules should learn from data. In this way, the compilerautomatically maps the declarative modules to high-quality compositions of prompting, finetuning, reasoning, and augmentation.
 42 | 
 43 | 
 44 | > DSPy programs are expressed in Python: each program takes the task input (e.g., a question to answer or a paper to summarize) and returns the output (e.g., an answer or a summary) after a series of steps. DSPy contributes three abstractions toward automatic optimization: signatures, modules, and teleprompters. Signatures abstract the input/output behavior of a module; modules replace existing hand-prompting techniques and can be composed in arbitrary pipelines; and teleprompters optimize all modules in the pipeline to maximize a metric.
 45 | 
 46 | ---
 47 | 
 48 | > Instead of free-form string prompts, DSPy programs use natural language signatures to assign work to the LM. A DSPy signature is natural-language typed declaration of a function: a short declarative spec that tells DSPy what a text transformation needs to do (e.g., “consume questions and return answers”), rather than how a specific LM should be prompted to implement that behavior. 
 49 | 
 50 | 这里的signature借用了编程语言中的术语，含义是类似的。在编程语言中，signature是函数的输入类型和返回类型，例如，在Haskell中，`add :: Int -> Int -> Int`是一个签名，它的意思是，add函数输入两个Int，返回一个Int。DSPy签名告诉LM需要做什么（而不是怎么做）。
 51 | 
 52 | ---
 53 | 
 54 | > Akin to type signatures in programming languages, DSPy signatures simply define an interface and provide type-like hints on the expected behavior. To use a signature, we must declare a module with that signature, like we instantiated a Predict module above. A module declaration like this returns a function having that signature.
 55 | 
 56 | 用编程语言的术语类比，DSPy signatures就像接口，只定义方法，不提供实现细节。module declaration则是实现。
 57 | 
 58 | ---
 59 | 
 60 | >  Like layers in PyTorch, the instantiated module behaves as a callable function: it takes in keyword arguments corresponding to the signature input fields (e.g., question), formats a prompt to implement the signature and includes the appropriate demonstrations, calls the LM, and parses the output fields.
 61 | 
 62 | Field就是signature中的一个字段，类比于函数定义中的类型。
 63 | 
 64 | ---
 65 | 
 66 | > Parameterization Uniquely, DSPy parameterizes these prompting techniques. To understand this parameterization, observe that any LM call seeking to implement a particular signature needs to specify parameters that include: (1) the specific LM to call, (2) the prompt instructions and the string prefix of each signature field and, most importantly, (3) the demonstrations used as few-shot prompts (for frozen LMs) or as training data (for finetuning). We focus primarily on automatically generating and selecting useful demonstrations. In our case studies, we find that bootstrapping good demonstrations gives us a powerful way to teach sophisticated pipelines of LMs new behaviors systematically.
 67 | 
 68 | 在DSPy中，参数有（1）所调用的语言模型。（2）prompt instructions and string prefix of each signature field。（3）示例，用作few shot或训练数据。
 69 | 
 70 | ---
 71 | 
 72 | > DSPy programs may use tools, which are modules that execute computation. 
 73 | 
 74 | 和Agent中的tools是一个概念。
 75 | 
 76 | ---
 77 | 
 78 | > Programs: DSPy modules can be composed in arbitrary pipelines in a define-by-run interface. Inspired directly by PyTorch and Chainer, one first declares the modules needed at initialization, allowing DSPy to keep track of them for optimization, and then one expresses the pipeline with arbitrary code that calls the modules in a forward method. 
 79 | 
 80 | “Define-by-run” 是指在运行时动态定义计算图的编程范式，也就是动态图。Programs也是modules，就像torch中，一些小的module组成一个最终的大的module。
 81 | 
 82 | ---
 83 | 
 84 | > When compiling a DSPy program, we generally invoke a teleprompter, which is an optimizer that takes the program, a training set, and a metric—and returns a new optimized program. Different teleprompters apply different strategies for optimization.
 85 | 
 86 | teleprompter类比于torch中的优化器。
 87 | 
 88 | ---
 89 | 
 90 | > A key source of DSPy’s expressive power is its ability to compile—or automatically optimize—any program in this programming model. Compiling relies on a teleprompter, which is an optimizer for DSPy programs that improves the quality (or cost) of modules via prompting or finetuning, which are unified in DSPy. While DSPy does not enforce this when creating new teleprompters, typical teleprompters go through three stages.
 91 | 
 92 | compile类比于torch中的train。
 93 | 
 94 | ---
 95 | 
 96 | 
 97 | 
 98 | 
 99 | 
100 | 
101 | 
102 | 
103 | 
104 | 
105 | 
106 | 
107 | 
108 | 
109 | 
110 | 


--------------------------------------------------------------------------------
/docs/notes/eva.md:
--------------------------------------------------------------------------------
 1 | # EVA
 2 | 
 3 | In this work, we launch EVA, a one billion parameters vanilla ViT encoder to explore the limits of masked visual representation learning.
 4 | 
 5 | ![](../images/eva.png)
 6 | 
 7 | ## EVA预训练
 8 | EVA是一个预训练模型，只通过[MIM](notes/mim.md)训练。训练配置如下：
 9 | ![](../images/eva2.png)
10 | 
11 | 预训练目标：随机mask 40%的patch。在mask上，CLIP-L/14的预测作为标记，和eva的预测做余弦相似度。目标是最大化余弦相似度。因为CLIP-L/14本身自带图文对齐，所以EVA预训练完之后，也间接学习了图文对齐。
12 | 
13 | 
14 | 
15 | 
16 | 
17 | 
18 | 
19 | 


--------------------------------------------------------------------------------
/docs/notes/get_extended_attention_mask.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "ada2a079-2749-448f-88ce-d05744a9af23",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     " def get_extended_attention_mask(\n",
 11 |     "        self,\n",
 12 |     "        attention_mask: Tensor,\n",
 13 |     "        input_shape: Tuple[int],\n",
 14 |     "        device: device,\n",
 15 |     "        is_decoder: bool,\n",
 16 |     "        has_query: bool = False,\n",
 17 |     "    ) -> Tensor:\n",
 18 |     "        \"\"\"\n",
 19 |     "        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.\n",
 20 |     "\n",
 21 |     "        Arguments:\n",
 22 |     "            attention_mask (:obj:`torch.Tensor`):\n",
 23 |     "                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.\n",
 24 |     "            input_shape (:obj:`Tuple[int]`):\n",
 25 |     "                The shape of the input to the model.\n",
 26 |     "            device: (:obj:`torch.device`):\n",
 27 |     "                The device of the input to the model.\n",
 28 |     "\n",
 29 |     "        Returns:\n",
 30 |     "            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.\n",
 31 |     "        \"\"\""
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "id": "3529e2b9",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "get_extended_attention_mask用于生成扩展的attention mask。该函数特别适用于处理self-attention和causal mask情况，尤其是在处理解码器decoder时。"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "raw",
 44 |    "id": "5b25e072-e580-40bb-bf2c-94e9f8365a88",
 45 |    "metadata": {},
 46 |    "source": []
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": null,
 51 |    "id": "31a9e32a-d15b-4bbe-a8cd-c14463838466",
 52 |    "metadata": {},
 53 |    "outputs": [],
 54 |    "source": [
 55 |     "# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]\n",
 56 |     "# ourselves in which case we just need to make it broadcastable to all heads.\n",
 57 |     "if attention_mask.dim() == 3:\n",
 58 |     "    extended_attention_mask = attention_mask[:, None, :, :]"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "markdown",
 63 |    "id": "00ae49bd-0b1f-444e-8b00-64f535aa877e",
 64 |    "metadata": {},
 65 |    "source": [
 66 |     "Transformer模型中的多头自注意力机制需要一个形状为 [batch_size, num_heads, from_seq_length, to_seq_length] 的掩码，其中 num_heads 是注意力头的数量。通过在第二个维度插入一个新的维度（None），我们可以确保 attention_mask 变得可广播到所有注意力头上。"
 67 |    ]
 68 |   },
 69 |   {
 70 |    "cell_type": "raw",
 71 |    "id": "e4ecdb8a-4572-4468-ad0f-adab0a256ede",
 72 |    "metadata": {},
 73 |    "source": []
 74 |   },
 75 |   {
 76 |    "cell_type": "code",
 77 |    "execution_count": null,
 78 |    "id": "c939167a-afef-4afc-9dc1-c5808ff7d238",
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "elif attention_mask.dim() == 2:\n",
 83 |     "    # Provided a padding mask of dimensions [batch_size, seq_length]\n",
 84 |     "    # - if the model is a decoder, apply a causal mask in addition to the padding mask\n",
 85 |     "    # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "id": "7cddf2a4-0086-4c4f-a2f7-14d96cc0ff41",
 91 |    "metadata": {},
 92 |    "source": [
 93 |     "Padding Mask：用于标记输入序列中的填充位置。填充位置通常是为了使序列长度一致，但这些位置不应该在计算注意力得分时被考虑。\n",
 94 |     "Causal Mask：用于确保解码器只能关注当前和之前的标记，以防止信息泄露。这对于自回归生成模型尤为重要，因为模型在生成下一个标记时不应该访问未来的标记。"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "raw",
 99 |    "id": "371eb5d9-cdc4-403f-a467-038c44a8bf2b",
100 |    "metadata": {},
101 |    "source": []
102 |   },
103 |   {
104 |    "cell_type": "code",
105 |    "execution_count": null,
106 |    "id": "febf7d45-7efb-4bbd-bb03-fbb961affd39",
107 |    "metadata": {},
108 |    "outputs": [],
109 |    "source": [
110 |     "if is_decoder:\n",
111 |     "    batch_size, seq_length = input_shape\n",
112 |     "\n",
113 |     "    seq_ids = torch.arange(seq_length, device=device)\n",
114 |     "    causal_mask = (\n",
115 |     "        seq_ids[None, None, :].repeat(batch_size, seq_length, 1)\n",
116 |     "        <= seq_ids[None, :, None]\n",
117 |     "    )\n",
118 |     "\n",
119 |     "    # add a prefix ones mask to the causal mask\n",
120 |     "    # causal and attention masks must have same type with pytorch version < 1.3\n",
121 |     "    causal_mask = causal_mask.to(attention_mask.dtype)"
122 |    ]
123 |   },
124 |   {
125 |    "cell_type": "markdown",
126 |    "id": "763725d2-820e-47f2-aa6c-57f627fd9ac4",
127 |    "metadata": {},
128 |    "source": [
129 |     "这段代码的目的是生成一个causal mask，以确保每个位置只能关注到它之前的位置。\n",
130 |     "seq_ids[None, None, :]通过添加两个维度变成形状为(1, 1, seq_length)。\n",
131 |     "seq_ids[None, None, :].repeat(batch_size, seq_length, 1)将其重复成形状为(batch_size, seq_length, seq_length)。\n",
132 |     "seq_ids[None, :, None]通过添加两个维度变成形状为(1, seq_length, 1)。\n",
133 |     "比较操作<=会生成一个布尔张量causal_mask，形状为(batch_size, seq_length, seq_length)。对于每个位置(i, j)，如果i <= j，则设为True，否则设为False。"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "raw",
138 |    "id": "7032bbd6-a518-4089-8a32-bd8802b17272",
139 |    "metadata": {},
140 |    "source": []
141 |   },
142 |   {
143 |    "cell_type": "code",
144 |    "execution_count": null,
145 |    "id": "b1721911-f6c6-4ea4-8108-d2153c3bd2b8",
146 |    "metadata": {},
147 |    "outputs": [],
148 |    "source": [
149 |     "if causal_mask.shape[1] < attention_mask.shape[1]:\n",
150 |     "    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]\n",
151 |     "    if has_query:  # UniLM style attention mask\n",
152 |     "        causal_mask = torch.cat(\n",
153 |     "            [\n",
154 |     "                torch.zeros(\n",
155 |     "                    (batch_size, prefix_seq_len, seq_length),\n",
156 |     "                    device=device,\n",
157 |     "                    dtype=causal_mask.dtype,\n",
158 |     "                ),\n",
159 |     "                causal_mask,\n",
160 |     "            ],\n",
161 |     "            axis=1,\n",
162 |     "        )"
163 |    ]
164 |   },
165 |   {
166 |    "cell_type": "markdown",
167 |    "id": "8091cc7a-aeaf-44cc-9998-9861bd8c1fc3",
168 |    "metadata": {},
169 |    "source": [
170 |     "attention_mask所有位置的mask，causal_mask只是causal部分的mask，causal_mask的长度可能小于attention_mask的长度。\n",
171 |     "如果这种情况是因为存在query(即论文中的learned query)，需要把query这部分的mask设为0。"
172 |    ]
173 |   },
174 |   {
175 |    "cell_type": "raw",
176 |    "id": "00abbe76-df20-4cbc-97b5-89f1daa771fd",
177 |    "metadata": {},
178 |    "source": []
179 |   },
180 |   {
181 |    "cell_type": "code",
182 |    "execution_count": null,
183 |    "id": "e16525a0-deeb-4dcb-b5b5-a3018a1a5a32",
184 |    "metadata": {},
185 |    "outputs": [],
186 |    "source": [
187 |     "causal_mask = torch.cat(\n",
188 |     "        [\n",
189 |     "            torch.ones(\n",
190 |     "                (batch_size, causal_mask.shape[1], prefix_seq_len),\n",
191 |     "                device=device,\n",
192 |     "                dtype=causal_mask.dtype,\n",
193 |     "            ),\n",
194 |     "            causal_mask,\n",
195 |     "        ],\n",
196 |     "        axis=-1,\n",
197 |     "    )"
198 |    ]
199 |   },
200 |   {
201 |    "cell_type": "markdown",
202 |    "id": "f492c606-fc56-4413-9cce-22a0f368a758",
203 |    "metadata": {},
204 |    "source": [
205 |     "否则，prefix的mask设为1."
206 |    ]
207 |   },
208 |   {
209 |    "cell_type": "raw",
210 |    "id": "2d569517-3bb7-4305-8ba3-1e5670a33303",
211 |    "metadata": {},
212 |    "source": []
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": null,
217 |    "id": "8b277a4e-1ea2-44e2-b530-7bb40d88ebea",
218 |    "metadata": {},
219 |    "outputs": [],
220 |    "source": [
221 |     "extended_attention_mask = (\n",
222 |     "    causal_mask[:, None, :, :] * attention_mask[:, None, None, :]\n",
223 |     ")"
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "markdown",
228 |    "id": "b3c0bcca-9cb6-4287-b7dd-58da71d9ffc0",
229 |    "metadata": {},
230 |    "source": [
231 |     "将两个mask广播后相乘，就是前面说的apply a causal mask in addition to the padding mask。"
232 |    ]
233 |   }
234 |  ],
235 |  "metadata": {
236 |   "kernelspec": {
237 |    "display_name": "Python 3 (ipykernel)",
238 |    "language": "python",
239 |    "name": "python3"
240 |   },
241 |   "language_info": {
242 |    "codemirror_mode": {
243 |     "name": "ipython",
244 |     "version": 3
245 |    },
246 |    "file_extension": ".py",
247 |    "mimetype": "text/x-python",
248 |    "name": "python",
249 |    "nbconvert_exporter": "python",
250 |    "pygments_lexer": "ipython3",
251 |    "version": "3.10.8"
252 |   }
253 |  },
254 |  "nbformat": 4,
255 |  "nbformat_minor": 5
256 | }
257 | 


--------------------------------------------------------------------------------
/docs/notes/image_grounded_text_generation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "3051da94-0270-40e2-b197-cde09695b771",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "decoder_input_ids = text_tokens.input_ids.clone()\n",
 11 |     "decoder_input_ids[:, 0] = self.tokenizer.bos_token_id"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "id": "6a1e5457-1db9-4f76-a2ed-8127f0f5b255",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "所有句子（即每一行）的第一个token设置为bos_token_id，即[DEC]。"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "raw",
 24 |    "id": "921d0e0b-2e1a-4401-8f4a-df978ca84cd8",
 25 |    "metadata": {},
 26 |    "source": []
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": null,
 31 |    "id": "a1b09179-6db8-4e3e-bea4-e114e8b6ce84",
 32 |    "metadata": {},
 33 |    "outputs": [],
 34 |    "source": [
 35 |     "labels = decoder_input_ids.masked_fill(\n",
 36 |     "    decoder_input_ids == self.tokenizer.pad_token_id, -100\n",
 37 |     ")"
 38 |    ]
 39 |   },
 40 |   {
 41 |    "cell_type": "markdown",
 42 |    "id": "b1ddfa30-5376-490e-95b7-6718ff12e71b",
 43 |    "metadata": {},
 44 |    "source": [
 45 |     "decoder_input_ids == self.tokenizer.pad_token_id，生成一个布尔张量，其形状与 decoder_input_ids 相同。\n",
 46 |     "在这个布尔张量中，所有等于 pad_token_id 的位置的值为 True，其他位置的值为 False。\n",
 47 |     "\n",
 48 |     "masked_fill 根据给定的条件将张量中的某些值替换为指定的值，在这里，pad的位置设为-100。这样做是为了在计算损失时忽略padding部分，-100是CrossEntropyLoss中的默认ignore_index。"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "raw",
 53 |    "id": "9528544d-e160-4a8f-b15b-21cdc72d9a99",
 54 |    "metadata": {},
 55 |    "source": []
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "id": "fdd8c005-fb5b-4fe2-8ffd-87bf03537a45",
 61 |    "metadata": {},
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "query_atts = torch.ones(query_tokens.size()[:-1], dtype=torch.long).to(image.device)"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "markdown",
 69 |    "id": "df559a37-da1a-4938-9b44-cb4677303dc8",
 70 |    "metadata": {},
 71 |    "source": [
 72 |     "query_token可以全部注意到。"
 73 |    ]
 74 |   },
 75 |   {
 76 |    "cell_type": "raw",
 77 |    "id": "b9812459-fec4-4a85-a788-51004263ec13",
 78 |    "metadata": {},
 79 |    "source": []
 80 |   },
 81 |   {
 82 |    "cell_type": "code",
 83 |    "execution_count": null,
 84 |    "id": "333feba3-6d35-4407-b6f1-28e0d1bb4c79",
 85 |    "metadata": {},
 86 |    "outputs": [],
 87 |    "source": [
 88 |     "attention_mask = torch.cat([query_atts, text_tokens.attention_mask], dim=1)\n"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "id": "fe6bc312-52d2-47cd-9b07-7c720a6061ff",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "合并mask。"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "raw",
101 |    "id": "f9e168de-3e12-4497-a3e1-7863b4a799b9",
102 |    "metadata": {},
103 |    "source": []
104 |   },
105 |   {
106 |    "cell_type": "code",
107 |    "execution_count": null,
108 |    "id": "78f7905e-6bc0-4029-bfa5-6557aa0f774d",
109 |    "metadata": {},
110 |    "outputs": [],
111 |    "source": [
112 |     "lm_output = self.Qformer(\n",
113 |     "    decoder_input_ids,\n",
114 |     "    attention_mask=attention_mask,\n",
115 |     "    past_key_values=query_output.past_key_values,\n",
116 |     "    return_dict=True,\n",
117 |     "    labels=labels,\n",
118 |     ")\n",
119 |     "\n",
120 |     "loss_lm = lm_output.loss"
121 |    ]
122 |   },
123 |   {
124 |    "cell_type": "markdown",
125 |    "id": "7ab5d53a-6ca7-4b11-a28f-18be6740c153",
126 |    "metadata": {},
127 |    "source": [
128 |     "这里进行语言模型预测，计算语言模型损失。\n",
129 |     "\n",
130 |     "在其内部，进行了causal mask计算以及labels移位。causal mask使得解码时只能注意到已解码的词，labels移位使得每个位置的label都是它的下一个token。详细解释见附录。"
131 |    ]
132 |   }
133 |  ],
134 |  "metadata": {
135 |   "kernelspec": {
136 |    "display_name": "Python 3 (ipykernel)",
137 |    "language": "python",
138 |    "name": "python3"
139 |   },
140 |   "language_info": {
141 |    "codemirror_mode": {
142 |     "name": "ipython",
143 |     "version": 3
144 |    },
145 |    "file_extension": ".py",
146 |    "mimetype": "text/x-python",
147 |    "name": "python",
148 |    "nbconvert_exporter": "python",
149 |    "pygments_lexer": "ipython3",
150 |    "version": "3.10.8"
151 |   }
152 |  },
153 |  "nbformat": 4,
154 |  "nbformat_minor": 5
155 | }
156 | 


--------------------------------------------------------------------------------
/docs/notes/image_text_contrastive_learning.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "03d91395-c8d2-4527-93e3-2215dabeb772",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "# https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/blip2_qformer.py#L90\n",
 11 |     "image = samples[\"image\"]\n",
 12 |     "text = samples[\"text_input\"]"
 13 |    ]
 14 |   },
 15 |   {
 16 |    "cell_type": "markdown",
 17 |    "id": "d726f1c9-6bce-4004-9984-7df434a3a149",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "image是图像的原始数据，shape是[batch_size, num_channels, height, width]。\n",
 21 |     "text是原始文本。"
 22 |    ]
 23 |   },
 24 |   {
 25 |    "cell_type": "raw",
 26 |    "id": "db9f7015-901c-4222-aa95-2ec1ab04497f",
 27 |    "metadata": {},
 28 |    "source": []
 29 |   },
 30 |   {
 31 |    "cell_type": "code",
 32 |    "execution_count": null,
 33 |    "id": "65dcb59a-cf13-4e65-9386-05afebfee0ad",
 34 |    "metadata": {},
 35 |    "outputs": [],
 36 |    "source": [
 37 |     "image_embeds = self.ln_vision(self.visual_encoder(image))\n",
 38 |     "image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)"
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "markdown",
 43 |    "id": "e9edeae1-b5eb-4643-9149-5ace08245f42",
 44 |    "metadata": {},
 45 |    "source": [
 46 |     "visual_encoder就是Figure 2中的Image Encoder，比如EVA。image_embeds的shape是[batch_size, num_patches, embed_dim]\n",
 47 |     "image_atts将图片的attention全部设置为1，在之后的三个损失函数的计算中，图片特征都是可以全部互相注意的。"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "raw",
 52 |    "id": "e91fdef3-709b-4ffc-80f2-6994f7d49cdf",
 53 |    "metadata": {},
 54 |    "source": []
 55 |   },
 56 |   {
 57 |    "cell_type": "code",
 58 |    "execution_count": null,
 59 |    "id": "d07bb915-93ce-4cfb-90e7-e616880fc2d0",
 60 |    "metadata": {},
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "markdown",
 68 |    "id": "852bb6e7-33b8-432f-9695-de5f18072de9",
 69 |    "metadata": {},
 70 |    "source": [
 71 |     "query_tokens是示意图中的learned queries，扩张成image_embeds一样的batch size。"
 72 |    ]
 73 |   },
 74 |   {
 75 |    "cell_type": "raw",
 76 |    "id": "773d8fb2-4b7d-4943-9eb6-a3f73232fa12",
 77 |    "metadata": {},
 78 |    "source": []
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "id": "ad3e7ac0-7d28-4a68-b118-5562912bed0a",
 84 |    "metadata": {},
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "query_output = self.Qformer.bert(\n",
 88 |     "    query_embeds=query_tokens,\n",
 89 |     "    encoder_hidden_states=image_embeds,\n",
 90 |     "    encoder_attention_mask=image_atts,\n",
 91 |     "    use_cache=True,\n",
 92 |     "    return_dict=True,\n",
 93 |     ")"
 94 |    ]
 95 |   },
 96 |   {
 97 |    "cell_type": "markdown",
 98 |    "id": "f8134df3-b697-4506-b8eb-9ec34d8056d9",
 99 |    "metadata": {},
100 |    "source": [
101 |     "这部分的代码细节就开始多起来了。这里的bert是Q-Former的核心参数，其实Q-Former就是基于bert实现的。下文提到的bert，都是指self.Qformer.bert。\n",
102 |     "\n",
103 |     "bert的block层包含self attention层和cross attention层。它既可以接收query_embeds，也可以接收input_ids。如果只有query_embeds，就不需要过embedding层了。\n",
104 |     "\n",
105 |     "这里的encoder_hidden_states和encoder_attention_mask，和原始Transformer中的encoder不是一个东西。在Q-Former中，看到encoder_hidden_states和encoder_attention_mask，就说明有cross attention发生。\n",
106 |     "\n",
107 |     "详细的代码解释，见附录。"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "raw",
112 |    "id": "d66e7721-f353-4e2f-8193-d1ed897a09f7",
113 |    "metadata": {},
114 |    "source": []
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": null,
119 |    "id": "ab815174-bc99-43a4-b1ca-196594bae665",
120 |    "metadata": {},
121 |    "outputs": [],
122 |    "source": [
123 |     "image_feats = F.normalize(\n",
124 |     "    self.vision_proj(query_output.last_hidden_state), dim=-1\n",
125 |     ")"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "id": "e8e7212c-bba4-4519-8c51-119cd7f6669c",
131 |    "metadata": {},
132 |    "source": [
133 |     "query_output.last_hidden_state是图片和queries计算完cross attention的隐状态。\n",
134 |     "vision_proj是维度映射。\n",
135 |     "image_feats是图片的最终特征。"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "raw",
140 |    "id": "313d24bc-8033-4b6e-87a5-b2d9f2be397f",
141 |    "metadata": {},
142 |    "source": []
143 |   },
144 |   {
145 |    "cell_type": "code",
146 |    "execution_count": null,
147 |    "id": "49d10480-7a39-4c3c-a816-f64bc90400b2",
148 |    "metadata": {},
149 |    "outputs": [],
150 |    "source": [
151 |     "text_tokens = self.tokenizer(\n",
152 |     "    text,\n",
153 |     "    padding=\"max_length\",\n",
154 |     "    truncation=True,\n",
155 |     "    max_length=self.max_txt_len,\n",
156 |     "    return_tensors=\"pt\",\n",
157 |     ").to(image.device)"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "markdown",
162 |    "id": "24b35523-3f88-45fb-934c-e41e3879e11b",
163 |    "metadata": {},
164 |    "source": [
165 |     "tokenizer将原始文字转化为inpud_ids."
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "raw",
170 |    "id": "65bdcf0c-f113-4990-9cd2-8aaa434506e4",
171 |    "metadata": {},
172 |    "source": []
173 |   },
174 |   {
175 |    "cell_type": "code",
176 |    "execution_count": null,
177 |    "id": "fa5ff5b8-ebd3-4ea1-a4cf-e7648077bd78",
178 |    "metadata": {},
179 |    "outputs": [],
180 |    "source": [
181 |     "text_output = self.Qformer.bert(\n",
182 |     "    text_tokens.input_ids,\n",
183 |     "    attention_mask=text_tokens.attention_mask,\n",
184 |     "    return_dict=True,\n",
185 |     ")"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "markdown",
190 |    "id": "ad0a8998-03b2-462b-82a7-81130948d0bc",
191 |    "metadata": {},
192 |    "source": [
193 |     "bert是多功能的，在这里又可以编码文本。这就是原文所说的share the same self-attention layers。\n",
194 |     "attention_mask和input_ids同形状，pad部分设为0，其他是1。也就是说，文本之前也是全部可以互相注意。"
195 |    ]
196 |   },
197 |   {
198 |    "cell_type": "raw",
199 |    "id": "ec6e730f-05b7-4424-a873-9e78807ac378",
200 |    "metadata": {},
201 |    "source": []
202 |   },
203 |   {
204 |    "cell_type": "code",
205 |    "execution_count": null,
206 |    "id": "fbf19429-f5c9-4932-9c61-4e256027cf21",
207 |    "metadata": {},
208 |    "outputs": [],
209 |    "source": [
210 |     "text_feat = F.normalize(\n",
211 |     "    self.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1\n",
212 |     ")"
213 |    ]
214 |   },
215 |   {
216 |    "cell_type": "markdown",
217 |    "id": "d0ba3aca-32e0-4b49-9fd9-2d25425e4d02",
218 |    "metadata": {},
219 |    "source": [
220 |     "同理，text_feat是最终的文本特征。"
221 |    ]
222 |   },
223 |   {
224 |    "cell_type": "raw",
225 |    "id": "160cb8bd-cf31-4b1f-8680-f75f2dc8c6c5",
226 |    "metadata": {},
227 |    "source": []
228 |   },
229 |   {
230 |    "cell_type": "code",
231 |    "execution_count": null,
232 |    "id": "a7f9a855-9014-4b1d-90d4-bd0f3153eaeb",
233 |    "metadata": {},
234 |    "outputs": [],
235 |    "source": [
236 |     "image_feats_all = concat_all_gather(image_feats)  # [batch_size*num_gpu, num_query_tokens, embed_dim]\n",
237 |     "text_feat_all = concat_all_gather(text_feat)  # [batch_size*num_gpu, embed_dim]"
238 |    ]
239 |   },
240 |   {
241 |    "cell_type": "markdown",
242 |    "id": "a16ca2a6-212c-4f35-a7f7-00c5c83ef696",
243 |    "metadata": {},
244 |    "source": [
245 |     "将多个gpu的数据合并。"
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "raw",
250 |    "id": "4a3889bd-5854-4d6a-b3f4-7fcb6faf5ab2",
251 |    "metadata": {},
252 |    "source": []
253 |   },
254 |   {
255 |    "cell_type": "code",
256 |    "execution_count": null,
257 |    "id": "623aea73-4dae-4054-8dc8-131af150dd89",
258 |    "metadata": {},
259 |    "outputs": [],
260 |    "source": [
261 |     "sim_q2t = torch.matmul(\n",
262 |     "    image_feats.unsqueeze(1), text_feat_all.unsqueeze(-1)\n",
263 |     ").squeeze()\n",
264 |     "# [batch_size, batch_size*num_gpu, num_query_tokens]"
265 |    ]
266 |   },
267 |   {
268 |    "cell_type": "markdown",
269 |    "id": "ac130d1d-0319-44ef-846a-b35d973d691e",
270 |    "metadata": {},
271 |    "source": [
272 |     "image_feats 的原始形状是 [batch_size, num_query_tokens, embed_dim]。unsqueeze(1) 在第1维度上增加一个维度，结果形状变为 [batch_size, 1, num_query_tokens, embed_dim]。\n",
273 |     "\n",
274 |     "text_feat_all 的原始形状是 [batch_size*num_gpu, embed_dim]。unsqueeze(-1) 在最后一个维度上增加一个维度，结果形状变为 [batch_size*num_gpu, embed_dim, 1]。\n",
275 |     "\n",
276 |     "torch.matmul将两个张量相乘。假设 A 的形状为 [a, b, c, d]，B 的形状为 [e, f, g]，如果 d 与 e 匹配，结果张量的形状为 [a, b, c, g]。\n",
277 |     "在此上下文中：\n",
278 |     "image_feats.unsqueeze(1) 的形状是 [batch_size, 1, num_query_tokens, embed_dim]\n",
279 |     "text_feat_all.unsqueeze(-1) 的形状是 [batch_size*num_gpu, embed_dim, 1]\n",
280 |     "通过矩阵乘法 torch.matmul，embed_dim 是匹配的维度，结果形状为 [batch_size, batch_size*num_gpu, num_query_tokens, 1]\n",
281 |     "\n",
282 |     "最终结果 sim_q2t 表示每个图像查询 token 与所有文本特征的相似度矩阵，其中：\n",
283 |     "- 第一维度 batch_size 对应当前批次中的图像。\n",
284 |     "- 第二维度 batch_size*num_gpu 对应所有收集到的文本特征（跨多个 GPU）。\n",
285 |     "- 第三维度 num_query_tokens 对应每个图像的查询 token。"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "raw",
290 |    "id": "c4935d00-bb60-43d9-95e0-df1bf8552c17",
291 |    "metadata": {},
292 |    "source": []
293 |   },
294 |   {
295 |    "cell_type": "code",
296 |    "execution_count": null,
297 |    "id": "571d8b7f-d302-42af-afb9-569df403af9d",
298 |    "metadata": {},
299 |    "outputs": [],
300 |    "source": [
301 |     "# image-text similarity: aggregate across all query tokens\n",
302 |     "sim_i2t, _ = sim_q2t.max(-1)\n",
303 |     "sim_i2t = sim_i2t / self.temp"
304 |    ]
305 |   },
306 |   {
307 |    "cell_type": "markdown",
308 |    "id": "b3755736-4ff2-475b-84ba-1edc05ad01fb",
309 |    "metadata": {},
310 |    "source": [
311 |     "计算最大值，得到[batch_size, batch_size*num_gpu]"
312 |    ]
313 |   },
314 |   {
315 |    "cell_type": "raw",
316 |    "id": "0cb8239c-3bee-4edf-8198-23fc87a71cc8",
317 |    "metadata": {},
318 |    "source": []
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": null,
323 |    "id": "485ae165-0176-468c-bbd2-e3ed099e0278",
324 |    "metadata": {},
325 |    "outputs": [],
326 |    "source": [
327 |     "# text-query similarity: [batch_size, batch_size*num_gpu, num_query_tokens]\n",
328 |     "sim_t2q = torch.matmul(\n",
329 |     "    text_feat.unsqueeze(1).unsqueeze(1), image_feats_all.permute(0, 2, 1)\n",
330 |     ").squeeze()\n",
331 |     "\n",
332 |     "# text-image similarity: aggregate across all query tokens\n",
333 |     "sim_t2i, _ = sim_t2q.max(-1)\n",
334 |     "sim_t2i = sim_t2i / self.temp  # [batch_size, batch_size*num_gpu]"
335 |    ]
336 |   },
337 |   {
338 |    "cell_type": "markdown",
339 |    "id": "06226198-5732-4318-80c1-4cfb7ce93515",
340 |    "metadata": {},
341 |    "source": [
342 |     "同理。"
343 |    ]
344 |   },
345 |   {
346 |    "cell_type": "raw",
347 |    "id": "9165570c-2f0f-4bbe-bc11-7a01f2886ed8",
348 |    "metadata": {},
349 |    "source": []
350 |   },
351 |   {
352 |    "cell_type": "code",
353 |    "execution_count": null,
354 |    "id": "faa4f10b-3b3f-46e7-8899-6eaf4ecaff26",
355 |    "metadata": {},
356 |    "outputs": [],
357 |    "source": [
358 |     "rank = dist.get_rank()\n",
359 |     "bs = image.size(0)\n",
360 |     "targets = torch.linspace(rank * bs, rank * bs + bs - 1, bs, dtype=int).to(image.device)"
361 |    ]
362 |   },
363 |   {
364 |    "cell_type": "markdown",
365 |    "id": "f26042ab-b3ed-4988-921f-b1d58ecf9c35",
366 |    "metadata": {},
367 |    "source": [
368 |     "torch.linspace(start, end, steps, dtype) 生成一个从 start 到 end 的等差数列，共有 steps 个元素。\n",
369 |     "这里生成的是目标标签，具体解释如下：\n",
370 |     "- start 是 rank * bs，即当前进程（或 GPU）的排名乘以批次大小。这是当前批次的起始索引。\n",
371 |     "- end 是 rank * bs + bs - 1，即当前批次的结束索引。\n",
372 |     "- steps 是 bs，即生成 bs 个目标标签。\n",
373 |     "\n",
374 |     "假设有4个 GPU，每个 GPU 的批次大小为 32。\n",
375 |     "- 对于 rank=0 的 GPU，生成的目标标签为 [0, 1, 2, ..., 31]。\n",
376 |     "- 对于 rank=1 的 GPU，生成的目标标签为 [32, 33, 34, ..., 63]。\n",
377 |     "- 对于 rank=2 的 GPU，生成的目标标签为 [64, 65, 66, ..., 95]。\n",
378 |     "- 对于 rank=3 的 GPU，生成的目标标签为 [96, 97, 98, ..., 127]。"
379 |    ]
380 |   },
381 |   {
382 |    "cell_type": "raw",
383 |    "id": "cfbb1f7e-8f23-443f-aafb-1bd05f5e9e52",
384 |    "metadata": {},
385 |    "source": []
386 |   },
387 |   {
388 |    "cell_type": "code",
389 |    "execution_count": null,
390 |    "id": "7582b217-6a79-4263-9104-cfb1a891e47f",
391 |    "metadata": {},
392 |    "outputs": [],
393 |    "source": [
394 |     "if \"image_id\" in samples.keys(): #coco retrieval finetuning\n",
395 |     "    image_ids = samples[\"image_id\"].view(-1,1)\n",
396 |     "    image_ids_all = concat_all_gather(image_ids)\n",
397 |     "    pos_idx = torch.eq(image_ids, image_ids_all.t()).float()       \n",
398 |     "    sim_targets = pos_idx / pos_idx.sum(1,keepdim=True)   \n",
399 |     "    sim_targets = 0.9 * sim_targets + 0.1 * torch.ones_like(sim_targets) / sim_targets.size(1)\n",
400 |     "\n",
401 |     "    loss_t2i = -torch.sum(F.log_softmax(sim_t2i, dim=1)*sim_targets,dim=1).mean()\n",
402 |     "    loss_i2t = -torch.sum(F.log_softmax(sim_i2t, dim=1)*sim_targets,dim=1).mean()     \n",
403 |     "    loss_itc = (loss_t2i+loss_i2t)/2  \n",
404 |     "else:                     \n",
405 |     "    loss_itc = (\n",
406 |     "        F.cross_entropy(sim_i2t, targets, label_smoothing=0.1)\n",
407 |     "        + F.cross_entropy(sim_t2i, targets, label_smoothing=0.1)\n",
408 |     "    ) / 2"
409 |    ]
410 |   },
411 |   {
412 |    "cell_type": "markdown",
413 |    "id": "4f42a9a1-7dbf-490a-a076-f8c9f123564a",
414 |    "metadata": {},
415 |    "source": [
416 |     "解释else部分。\n",
417 |     "sim_i2t 是图像到文本的相似度矩阵，形状为 [batch_size, batch_size*num_gpu]，表示当前批次中的每个图像与所有文本特征的相似度。\n",
418 |     "sim_t2i 是文本到图像的相似度矩阵，形状为 [batch_size, batch_size*num_gpu]，表示当前批次中的每个文本与所有图像特征的相似度。\n",
419 |     "交叉熵计算的是负对数，targets告诉损失函数每个图像（或文本）对应的正确文本（或图像）的索引。最小化交叉熵，就相当于最大化图片和其对应文本的相似度。"
420 |    ]
421 |   }
422 |  ],
423 |  "metadata": {
424 |   "kernelspec": {
425 |    "display_name": "Python 3 (ipykernel)",
426 |    "language": "python",
427 |    "name": "python3"
428 |   },
429 |   "language_info": {
430 |    "codemirror_mode": {
431 |     "name": "ipython",
432 |     "version": 3
433 |    },
434 |    "file_extension": ".py",
435 |    "mimetype": "text/x-python",
436 |    "name": "python",
437 |    "nbconvert_exporter": "python",
438 |    "pygments_lexer": "ipython3",
439 |    "version": "3.10.8"
440 |   }
441 |  },
442 |  "nbformat": 4,
443 |  "nbformat_minor": 5
444 | }
445 | 


--------------------------------------------------------------------------------
/docs/notes/image_text_matching.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "27f2b847-be80-486f-9e1b-aa1a1bbee0aa",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "# https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/blip2_qformer.py#L175 。\n",
 11 |     "\n",
 12 |     "text_input_ids_world = concat_all_gather(text_tokens.input_ids)\n",
 13 |     "text_attention_mask_world = concat_all_gather(text_tokens.attention_mask)\n",
 14 |     "image_embeds_world = all_gather_with_grad(image_embeds)"
 15 |    ]
 16 |   },
 17 |   {
 18 |    "cell_type": "markdown",
 19 |    "id": "bb6ff69f-9081-4e21-9452-5d91f3ae2373",
 20 |    "metadata": {},
 21 |    "source": [
 22 |     "收集所有GPU的数据"
 23 |    ]
 24 |   },
 25 |   {
 26 |    "cell_type": "raw",
 27 |    "id": "55adb44f-d2fa-4cda-9394-ee51416f860d",
 28 |    "metadata": {},
 29 |    "source": []
 30 |   },
 31 |   {
 32 |    "cell_type": "code",
 33 |    "execution_count": null,
 34 |    "id": "137f6663-6af2-4b39-b6ad-e59cb673ad3a",
 35 |    "metadata": {},
 36 |    "outputs": [],
 37 |    "source": [
 38 |     "with torch.no_grad():\n",
 39 |     "    if \"image_id\" in samples.keys():\n",
 40 |     "        mask = torch.eq(image_ids, image_ids_all.t())\n",
 41 |     "        sim_t2i.masked_fill_(mask, -10000)\n",
 42 |     "        sim_i2t.masked_fill_(mask, -10000)\n",
 43 |     "    else:\n",
 44 |     "        sim_t2i[:, rank * bs : rank * bs + bs].fill_diagonal_(-10000)\n",
 45 |     "        sim_i2t[:, rank * bs : rank * bs + bs].fill_diagonal_(-10000)\n",
 46 |     "\n",
 47 |     "    weights_t2i = F.softmax(sim_t2i, dim=1)\n",
 48 |     "    weights_i2t = F.softmax(sim_i2t, dim=1)"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "markdown",
 53 |    "id": "10d1fb69-b2ac-4be4-a545-b01d85922b01",
 54 |    "metadata": {},
 55 |    "source": [
 56 |     "计算文本到图像和图像到文本的相似度矩阵，并对相同ID的对进行掩码处理，以防止模型在负采样时选择相同的对。\n",
 57 |     "\n",
 58 |     "随后对相似度矩阵应用softmax，以获得每个文本对应不同图像的权重分布`weights_t2i`，以及每个图像对应不同文本的权重分布 `weights_i2t`。"
 59 |    ]
 60 |   },
 61 |   {
 62 |    "cell_type": "raw",
 63 |    "id": "670f0477-578f-433f-806f-815d2a96e751",
 64 |    "metadata": {},
 65 |    "source": []
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "id": "b11bacae-4965-4623-b757-ae10d7e4918a",
 71 |    "metadata": {},
 72 |    "outputs": [],
 73 |    "source": [
 74 |     "# select a negative image for each text\n",
 75 |     "image_embeds_neg = []\n",
 76 |     "for b in range(bs):\n",
 77 |     "    neg_idx = torch.multinomial(weights_t2i[b], 1).item()\n",
 78 |     "    image_embeds_neg.append(image_embeds_world[neg_idx])\n",
 79 |     "image_embeds_neg = torch.stack(image_embeds_neg, dim=0)\n",
 80 |     "\n",
 81 |     "# select a negative text for each image\n",
 82 |     "text_ids_neg = []\n",
 83 |     "text_atts_neg = []\n",
 84 |     "for b in range(bs):\n",
 85 |     "    neg_idx = torch.multinomial(weights_i2t[b], 1).item()\n",
 86 |     "    text_ids_neg.append(text_input_ids_world[neg_idx])\n",
 87 |     "    text_atts_neg.append(text_attention_mask_world[neg_idx])\n",
 88 |     "\n",
 89 |     "text_ids_neg = torch.stack(text_ids_neg, dim=0)\n",
 90 |     "text_atts_neg = torch.stack(text_atts_neg, dim=0)"
 91 |    ]
 92 |   },
 93 |   {
 94 |    "cell_type": "markdown",
 95 |    "id": "2ae11dc2-2f70-479c-8966-577e77d82c07",
 96 |    "metadata": {},
 97 |    "source": [
 98 |     "根据权重分布 `weights_t2i` 和 `weights_i2t` 为每个正样本选择一个负样本。具体做法是通过 `torch.multinomial` 从权重分布中采样负样本的索引，然后将这些负样本的嵌入（图像和文本）收集起来。\n",
 99 |     "\n",
100 |     "通过 torch.multinomial 函数从这个分布中采样，可以有效地选择那些与当前文本（或图像）具有高相似度但实际并不匹配的负样本。这样就实现了hard negative mining，即选择那些难以区分的负样本来训练模型。"
101 |    ]
102 |   },
103 |   {
104 |    "cell_type": "raw",
105 |    "id": "da15fe7c-26a9-41e6-8169-3fcd655bdd4b",
106 |    "metadata": {},
107 |    "source": []
108 |   },
109 |   {
110 |    "cell_type": "code",
111 |    "execution_count": null,
112 |    "id": "6385100f-4f2b-4e30-a89d-8bf9d0ec1e5d",
113 |    "metadata": {},
114 |    "outputs": [],
115 |    "source": [
116 |     "text_ids_all = torch.cat(\n",
117 |     "    [text_tokens.input_ids, text_tokens.input_ids, text_ids_neg], dim=0\n",
118 |     ")  # pos, pos, neg\n",
119 |     "text_atts_all = torch.cat(\n",
120 |     "    [text_tokens.attention_mask, text_tokens.attention_mask, text_atts_neg],\n",
121 |     "    dim=0,\n",
122 |     ")\n",
123 |     "\n",
124 |     "query_tokens_itm = self.query_tokens.expand(text_ids_all.shape[0], -1, -1)\n",
125 |     "query_atts_itm = torch.ones(query_tokens_itm.size()[:-1], dtype=torch.long).to(\n",
126 |     "    image.device\n",
127 |     ")\n",
128 |     "attention_mask_all = torch.cat([query_atts_itm, text_atts_all], dim=1)\n",
129 |     "\n",
130 |     "image_embeds_all = torch.cat(\n",
131 |     "    [image_embeds, image_embeds_neg, image_embeds], dim=0\n",
132 |     ")  # pos, neg, pos\n",
133 |     "image_atts_all = torch.ones(image_embeds_all.size()[:-1], dtype=torch.long).to(\n",
134 |     "    image.device\n",
135 |     ")"
136 |    ]
137 |   },
138 |   {
139 |    "cell_type": "markdown",
140 |    "id": "dabac777-f508-4cb5-903c-d19be9c6e871",
141 |    "metadata": {},
142 |    "source": [
143 |     " 将正样本和负样本的文本和图像拼接在一起，形成模型的输入。`text_ids_all` 和 `text_atts_all` 分别包含了正样本文本、正样本文本和负样本文本的 `input_ids` 和 `attention_mask`。\n",
144 |     "\n",
145 |     "`query_tokens_itm` 是用于ITM任务的查询标记，它们与文本和图像的嵌入一起输入模型。`attention_mask_all` 是拼接后的注意力掩码。\n",
146 |     "\n",
147 |     "`image_embeds_all` 是包含正样本图像、负样本图像和正样本图像的嵌入。`image_atts_all` 是对应的注意力掩码。"
148 |    ]
149 |   },
150 |   {
151 |    "cell_type": "raw",
152 |    "id": "2211ef0a-091e-4954-9147-06f08e04cb00",
153 |    "metadata": {},
154 |    "source": []
155 |   },
156 |   {
157 |    "cell_type": "code",
158 |    "execution_count": null,
159 |    "id": "6d69c6e9-76ac-4823-beb4-3e2fc3eefa30",
160 |    "metadata": {},
161 |    "outputs": [],
162 |    "source": [
163 |     "output_itm = self.Qformer.bert(\n",
164 |     "    text_ids_all,\n",
165 |     "    query_embeds=query_tokens_itm,\n",
166 |     "    attention_mask=attention_mask_all,\n",
167 |     "    encoder_hidden_states=image_embeds_all,\n",
168 |     "    encoder_attention_mask=image_atts_all,\n",
169 |     "    return_dict=True,\n",
170 |     ")"
171 |    ]
172 |   },
173 |   {
174 |    "cell_type": "markdown",
175 |    "id": "22fb43c1-e4a1-4d9e-9fcc-9128a1d98931",
176 |    "metadata": {},
177 |    "source": [
178 |     "前向传播。"
179 |    ]
180 |   },
181 |   {
182 |    "cell_type": "raw",
183 |    "id": "e332930d-411e-4c39-8584-be7818d10133",
184 |    "metadata": {},
185 |    "source": []
186 |   },
187 |   {
188 |    "cell_type": "code",
189 |    "execution_count": null,
190 |    "id": "1baac244-5252-4df8-9919-c403d5050258",
191 |    "metadata": {},
192 |    "outputs": [],
193 |    "source": [
194 |     "vl_embeddings = output_itm.last_hidden_state[:, : query_tokens_itm.size(1), :]\n",
195 |     "vl_output = self.itm_head(vl_embeddings)\n",
196 |     "logits = vl_output.mean(dim=1)\n",
197 |     "\n",
198 |     "itm_labels = torch.cat(\n",
199 |     "    [torch.ones(bs, dtype=torch.long), torch.zeros(2 * bs, dtype=torch.long)],\n",
200 |     "    dim=0,\n",
201 |     ").to(image.device)\n",
202 |     "loss_itm = F.cross_entropy(logits, itm_labels)"
203 |    ]
204 |   },
205 |   {
206 |    "cell_type": "markdown",
207 |    "id": "fcfd040f-9c4f-4251-8960-1defa87db71f",
208 |    "metadata": {},
209 |    "source": [
210 |     "提取出最后一层隐藏状态的查询嵌入，并通过线性分类器 `itm_head` 计算每个查询嵌入的logits。\n",
211 |     "\n",
212 |     "对所有查询嵌入的logits求平均，作为输出的匹配分数。\n",
213 |     "\n",
214 |     "制作标签 `itm_labels`，正样本为1，负样本为0。计算交叉熵损失 `loss_itm`。"
215 |    ]
216 |   }
217 |  ],
218 |  "metadata": {
219 |   "kernelspec": {
220 |    "display_name": "Python 3 (ipykernel)",
221 |    "language": "python",
222 |    "name": "python3"
223 |   },
224 |   "language_info": {
225 |    "codemirror_mode": {
226 |     "name": "ipython",
227 |     "version": 3
228 |    },
229 |    "file_extension": ".py",
230 |    "mimetype": "text/x-python",
231 |    "name": "python",
232 |    "nbconvert_exporter": "python",
233 |    "pygments_lexer": "ipython3",
234 |    "version": "3.10.8"
235 |   }
236 |  },
237 |  "nbformat": 4,
238 |  "nbformat_minor": 5
239 | }
240 | 


--------------------------------------------------------------------------------
/docs/notes/itm_forward.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "82ce6682-5940-4a04-83f0-7d765e022126",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "# https://github.com/salesforce/LAVIS/blob/main/lavis/models/blip2_models/blip2_qformer.py#L175\n",
  9 |     "\n",
 10 |     "主要变量含义：\n",
 11 |     "- image_embeds_neg：\n",
 12 |     "    - 含义：这是为每个文本选择的负样本图像的embedding。\n",
 13 |     "    - 形状：[bs, D]，其中 bs 是批次大小（batch size），D是图像嵌入的维度。每个元素都是一个负样本图像的嵌入向量。\n",
 14 |     "- text_ids_neg：\n",
 15 |     "    - 含义：这是为每个图像选择的负样本文本的输入ID。\n",
 16 |     "    - 形状：[bs, L]，其中 L 是文本序列的长度。每个元素是一个负样本文本的输入ID序列。\n",
 17 |     "- text_atts_neg：\n",
 18 |     "    - 含义：这是为每个图像选择的负样本文本的mask。所有非pad未知都是1.\n",
 19 |     "    - 形状：[bs, L]，其中 L 是文本序列的长度。每个元素是一个负样本文本的注意力掩码序列。\n",
 20 |     "- text_ids_all：\n",
 21 |     "    - 含义：这是组合后的所有文本输入ID，包括正样本的文本ID（两次）和负样本的文本ID。\n",
 22 |     "    - 形状：[3 * bs, L]，其中 bs 是批次大小（batch size），L 是文本序列的长度。\n",
 23 |     "        - 第一个 bs：正样本的文本ID，来源于 text_tokens.input_ids。\n",
 24 |     "        - 第二个 bs：再次包含正样本的文本ID，仍然来源于 text_tokens.input_ids。\n",
 25 |     "        - 第三个 bs：负样本的文本ID，来源于 text_ids_neg。\n",
 26 |     "- text_atts_all：\n",
 27 |     "    - 含义：这是组合后的所有文本注意力掩码，包括正样本的注意力掩码（两次）和负样本的注意力掩码。\n",
 28 |     "    - 形状：[3 * bs, L]，其中 bs 是批次大小（batch size），L 是文本序列的长度。\n",
 29 |     "        - 第一个 bs：正样本的注意力掩码，来源于 text_tokens.attention_mask。\n",
 30 |     "        - 第二个 bs：再次包含正样本的注意力掩码，仍然来源于 text_tokens.attention_mask。\n",
 31 |     "        - 第三个 bs：负样本的注意力掩码，来源于 text_atts_neg。\n",
 32 |     "- query_tokens_itm:\n",
 33 |     "    - 含义：扩展后的query tokens\n",
 34 |     "    - 形状：和text_ids_all同形状，即[3 * bs, L]\n",
 35 |     "- query_atts_itm:\n",
 36 |     "    - 含义：都是1"
 37 |    ]
 38 |   },
 39 |   {
 40 |    "cell_type": "raw",
 41 |    "id": "ae943645-9935-4908-bbcf-202223d483a9",
 42 |    "metadata": {},
 43 |    "source": []
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": null,
 48 |    "id": "3672a1d9-7ccf-40c4-b648-21d0df4af9d8",
 49 |    "metadata": {},
 50 |    "outputs": [],
 51 |    "source": [
 52 |     "attention_mask_all = torch.cat([query_atts_itm, text_atts_all], dim=1)"
 53 |    ]
 54 |   },
 55 |   {
 56 |    "cell_type": "markdown",
 57 |    "id": "f620e911-b6f9-485e-a82d-118322836b02",
 58 |    "metadata": {},
 59 |    "source": [
 60 |     "queries和文本的mask拼接为一个整体。"
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "raw",
 65 |    "id": "074627b6-8041-461b-a03c-dcfa9b2bc9a9",
 66 |    "metadata": {},
 67 |    "source": []
 68 |   },
 69 |   {
 70 |    "cell_type": "code",
 71 |    "execution_count": null,
 72 |    "id": "a3e6442d-13ce-4479-bc7e-441faa1f80b9",
 73 |    "metadata": {},
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "image_embeds_all = torch.cat(\n",
 77 |     "            [image_embeds, image_embeds_neg, image_embeds], dim=0\n",
 78 |     ")  # pos, neg, pos\n",
 79 |     "image_atts_all = torch.ones(image_embeds_all.size()[:-1], dtype=torch.long).to(\n",
 80 |     "    image.device\n",
 81 |     ")"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "markdown",
 86 |    "id": "aa8927da-1c1d-4a85-a523-0907d9ec63bf",
 87 |    "metadata": {},
 88 |    "source": [
 89 |     "image_embeds_all也是三个batch的拼接。"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "raw",
 94 |    "id": "1493a6ee-6042-418a-b651-2d0354ad7f43",
 95 |    "metadata": {},
 96 |    "source": []
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": null,
101 |    "id": "1ddcac0b-4fa7-4c5b-bb43-a8df0cc3511f",
102 |    "metadata": {},
103 |    "outputs": [],
104 |    "source": [
105 |     "output_itm = self.Qformer.bert(\n",
106 |     "    text_ids_all,\n",
107 |     "    query_embeds=query_tokens_itm,\n",
108 |     "    attention_mask=attention_mask_all,\n",
109 |     "    encoder_hidden_states=image_embeds_all,\n",
110 |     "    encoder_attention_mask=image_atts_all,\n",
111 |     "    return_dict=True,\n",
112 |     ")\n",
113 |     "\n",
114 |     "# BertModel forward\n",
115 |     "def forward(\n",
116 |     "    self,\n",
117 |     "    input_ids=None,\n",
118 |     "    attention_mask=None,\n",
119 |     "    position_ids=None,\n",
120 |     "    head_mask=None,\n",
121 |     "    query_embeds=None,\n",
122 |     "    encoder_hidden_states=None,\n",
123 |     "    encoder_attention_mask=None,\n",
124 |     "    past_key_values=None,\n",
125 |     "    use_cache=None,\n",
126 |     "    output_attentions=None,\n",
127 |     "    output_hidden_states=None,\n",
128 |     "    return_dict=None,\n",
129 |     "    is_decoder=False,\n",
130 |     "):"
131 |    ]
132 |   },
133 |   {
134 |    "cell_type": "code",
135 |    "execution_count": null,
136 |    "id": "6f05e4a7-7b4b-4b50-818e-a2c44b7b1f86",
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "input_ids=text_ids_all\n",
141 |     "attention_mask=attention_mask_all\n",
142 |     "query_embeds=query_tokens_itm\n",
143 |     "encoder_hidden_states=image_embeds_all\n",
144 |     "encoder_attention_mask=image_atts_all"
145 |    ]
146 |   },
147 |   {
148 |    "cell_type": "raw",
149 |    "id": "0ce0ca21-3d42-41c5-9a79-1f4b54011b29",
150 |    "metadata": {},
151 |    "source": []
152 |   },
153 |   {
154 |    "cell_type": "code",
155 |    "execution_count": null,
156 |    "id": "b19b299a-7238-490e-96a1-a72075b0365d",
157 |    "metadata": {},
158 |    "outputs": [],
159 |    "source": [
160 |     "# BertModel forward\n",
161 |     "query_length = query_embeds.shape[1] if query_embeds is not None else 0\n",
162 |     "\n",
163 |     "embedding_output = self.embeddings(\n",
164 |     "    input_ids=input_ids,\n",
165 |     "    position_ids=position_ids,\n",
166 |     "    query_embeds=query_embeds,\n",
167 |     "    past_key_values_length=past_key_values_length,\n",
168 |     ")\n",
169 |     "\n",
170 |     "input_shape = embedding_output.size()[:-1]\n",
171 |     "\n",
172 |     "if is_decoder:\n",
173 |     "    # ...\n",
174 |     "else:\n",
175 |     "    extended_attention_mask = self.get_extended_attention_mask(\n",
176 |     "        attention_mask, input_shape, device, is_decoder\n",
177 |     "    )"
178 |    ]
179 |   },
180 |   {
181 |    "cell_type": "markdown",
182 |    "id": "23539ed0-064c-4a8d-95df-5f5182bf3cc9",
183 |    "metadata": {},
184 |    "source": [
185 |     "query_length=query数量\n",
186 |     "\n",
187 |     "embedding_output = torch.cat((query_embeds, input_embeddings), dim=1)\n",
188 |     "\n",
189 |     "get_extended_attention_mask和图文对比学习的类似。"
190 |    ]
191 |   },
192 |   {
193 |    "cell_type": "raw",
194 |    "id": "83938781-7324-4e5d-a4cf-e4db4cbe281e",
195 |    "metadata": {},
196 |    "source": []
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": null,
201 |    "id": "7c705b92-32dd-4d11-acb5-8cf3f567298a",
202 |    "metadata": {},
203 |    "outputs": [],
204 |    "source": [
205 |     "# BertModel forward\n",
206 |     "if encoder_hidden_states is not None:\n",
207 |     "    if type(encoder_hidden_states) == list:\n",
208 |     "        # ...\n",
209 |     "    else:\n",
210 |     "        (\n",
211 |     "            encoder_batch_size,\n",
212 |     "            encoder_sequence_length,\n",
213 |     "            _,\n",
214 |     "        ) = encoder_hidden_states.size()\n",
215 |     "    encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)\n",
216 |     "\n",
217 |     "    if type(encoder_attention_mask) == list:\n",
218 |     "        # ...\n",
219 |     "    elif encoder_attention_mask is None:\n",
220 |     "        # ...\n",
221 |     "    else:\n",
222 |     "        encoder_extended_attention_mask = self.invert_attention_mask(\n",
223 |     "            encoder_attention_mask\n",
224 |     "        )\n",
225 |     "else:\n",
226 |     "    # ..."
227 |    ]
228 |   },
229 |   {
230 |    "cell_type": "markdown",
231 |    "id": "34826924-abf8-4d1a-a4f4-569f4a42f9f5",
232 |    "metadata": {},
233 |    "source": [
234 |     "假设 encoder_attention_mask 是一个形如 [0, 0, 1, 1, 1] 的序列，在调用 invert_attention_mask 之后，这个序列可能会被变换成 [1, 1, 0, 0, 0]，或者在一些实现中，可能被变换成一个很大的负数（如 -10000）以确保在 softmax 计算时，这些位置的权重几乎为零。"
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "raw",
239 |    "id": "99452631-01c6-482f-8ab7-cadefadc8894",
240 |    "metadata": {},
241 |    "source": []
242 |   },
243 |   {
244 |    "cell_type": "code",
245 |    "execution_count": null,
246 |    "id": "edfd77a0-ae63-47c4-a48c-3a78bbed2326",
247 |    "metadata": {},
248 |    "outputs": [],
249 |    "source": [
250 |     "# BertModel forward\n",
251 |     "encoder_outputs = self.encoder(\n",
252 |     "    embedding_output,\n",
253 |     "    attention_mask=extended_attention_mask,\n",
254 |     "    head_mask=head_mask,\n",
255 |     "    encoder_hidden_states=encoder_hidden_states,\n",
256 |     "    encoder_attention_mask=encoder_extended_attention_mask,\n",
257 |     "    past_key_values=past_key_values,\n",
258 |     "    use_cache=use_cache,\n",
259 |     "    output_attentions=output_attentions,\n",
260 |     "    output_hidden_states=output_hidden_states,\n",
261 |     "    return_dict=return_dict,\n",
262 |     "    query_length=query_length,\n",
263 |     ")\n",
264 |     "\n",
265 |     "# BertEncoder forward\n",
266 |     "def forward(\n",
267 |     "    self,\n",
268 |     "    hidden_states,\n",
269 |     "    attention_mask=None,\n",
270 |     "    head_mask=None,\n",
271 |     "    encoder_hidden_states=None,\n",
272 |     "    encoder_attention_mask=None,\n",
273 |     "    past_key_values=None,\n",
274 |     "    use_cache=None,\n",
275 |     "    output_attentions=False,\n",
276 |     "    output_hidden_states=False,\n",
277 |     "    return_dict=True,\n",
278 |     "    query_length=0,\n",
279 |     "):"
280 |    ]
281 |   },
282 |   {
283 |    "cell_type": "code",
284 |    "execution_count": null,
285 |    "id": "d1efb71b-d047-4ad1-8258-2ab1a9ec3e29",
286 |    "metadata": {},
287 |    "outputs": [],
288 |    "source": [
289 |     "# BertEncoder forward\n",
290 |     "hidden_states=embedding_output\n",
291 |     "attention_mask=extended_attention_mask\n",
292 |     "encoder_hidden_states=image_embeds_all\n",
293 |     "encoder_attention_mask=image_atts_all\n",
294 |     "query_length=query数量"
295 |    ]
296 |   },
297 |   {
298 |    "cell_type": "raw",
299 |    "id": "90a29432-785b-45cc-82a8-6750c0feb532",
300 |    "metadata": {},
301 |    "source": []
302 |   },
303 |   {
304 |    "cell_type": "code",
305 |    "execution_count": null,
306 |    "id": "62643171-646a-488c-9dc2-5d6fac073943",
307 |    "metadata": {},
308 |    "outputs": [],
309 |    "source": [
310 |     "# BertEncoder forward\n",
311 |     "for i in range(self.config.num_hidden_layers):\n",
312 |     "    layer_module = self.layer[i]\n",
313 |     "    if output_hidden_states:\n",
314 |     "        all_hidden_states = all_hidden_states + (hidden_states,)\n",
315 |     "\n",
316 |     "    layer_head_mask = head_mask[i] if head_mask is not None else None\n",
317 |     "    past_key_value = past_key_values[i] if past_key_values is not None else None\n",
318 |     "\n",
319 |     "    if getattr(self.config, \"gradient_checkpointing\", False) and self.training:\n",
320 |     "        # ...\n",
321 |     "    else:\n",
322 |     "        layer_outputs = layer_module(\n",
323 |     "            hidden_states,\n",
324 |     "            attention_mask,\n",
325 |     "            layer_head_mask,\n",
326 |     "            encoder_hidden_states,\n",
327 |     "            encoder_attention_mask,\n",
328 |     "            past_key_value,\n",
329 |     "            output_attentions,\n",
330 |     "            query_length,\n",
331 |     "        )\n",
332 |     "\n",
333 |     "# BertLayer的forward方法\n",
334 |     "def forward(\n",
335 |     "    self,\n",
336 |     "    hidden_states,\n",
337 |     "    attention_mask=None,\n",
338 |     "    head_mask=None,\n",
339 |     "    encoder_hidden_states=None,\n",
340 |     "    encoder_attention_mask=None,\n",
341 |     "    past_key_value=None,\n",
342 |     "    output_attentions=False,\n",
343 |     "    query_length=0,\n",
344 |     "):"
345 |    ]
346 |   },
347 |   {
348 |    "cell_type": "code",
349 |    "execution_count": null,
350 |    "id": "c1d9599b-39ff-4647-97d1-28ef53ff606d",
351 |    "metadata": {},
352 |    "outputs": [],
353 |    "source": [
354 |     "# BertLayer的forward方法\n",
355 |     "hidden_states=embedding_output\n",
356 |     "attention_mask=extended_attention_mask\n",
357 |     "encoder_hidden_states=image_embeds_all\n",
358 |     "encoder_attention_mask=image_atts_all\n",
359 |     "query_length=query数量"
360 |    ]
361 |   },
362 |   {
363 |    "cell_type": "raw",
364 |    "id": "10bc3117-f2ce-4ba8-ad64-788ff1cfaec9",
365 |    "metadata": {},
366 |    "source": []
367 |   },
368 |   {
369 |    "cell_type": "code",
370 |    "execution_count": null,
371 |    "id": "53c988cc-e190-4318-be70-a9f9a91fdaf0",
372 |    "metadata": {},
373 |    "outputs": [],
374 |    "source": [
375 |     "# BertLayer的forward方法\n",
376 |     "self_attention_outputs = self.attention(\n",
377 |     "    hidden_states,\n",
378 |     "    attention_mask,\n",
379 |     "    head_mask,\n",
380 |     "    output_attentions=output_attentions,\n",
381 |     "    past_key_value=self_attn_past_key_value,\n",
382 |     ")"
383 |    ]
384 |   },
385 |   {
386 |    "cell_type": "code",
387 |    "execution_count": null,
388 |    "id": "fa4f5577-a7b8-4e60-a887-9239ce85ea76",
389 |    "metadata": {},
390 |    "outputs": [],
391 |    "source": []
392 |   }
393 |  ],
394 |  "metadata": {
395 |   "kernelspec": {
396 |    "display_name": "Python 3 (ipykernel)",
397 |    "language": "python",
398 |    "name": "python3"
399 |   },
400 |   "language_info": {
401 |    "codemirror_mode": {
402 |     "name": "ipython",
403 |     "version": 3
404 |    },
405 |    "file_extension": ".py",
406 |    "mimetype": "text/x-python",
407 |    "name": "python",
408 |    "nbconvert_exporter": "python",
409 |    "pygments_lexer": "ipython3",
410 |    "version": "3.10.8"
411 |   }
412 |  },
413 |  "nbformat": 4,
414 |  "nbformat_minor": 5
415 | }
416 | 


--------------------------------------------------------------------------------
/docs/notes/mim.md:
--------------------------------------------------------------------------------
1 | # MIM
2 | 
3 | Masked Image Modeling（MIM，掩码图像建模）是一种自监督学习方法，主要用于训练图像识别和生成模型。其灵感来源于自然语言处理领域的Masked Language Modeling（例如BERT）。
4 | 
5 | 在MIM中，输入图像的一部分像素或区域被随机掩盖（通常用掩码标记），然后模型需要根据未被掩盖的部分来预测或重建被掩盖的部分。这种方法鼓励模型理解图像的全局和局部信息，从而提高其对图像内容的理解和生成能力。


--------------------------------------------------------------------------------
/docs/notes/minigpt4_vs_blip2.md:
--------------------------------------------------------------------------------
1 | # MiniGPT4和BLIP2的区别
2 | 
3 | > The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We notice that BLIP-2's training strategy is not enough to align the vision module with powerful LLMs like Vicuna well and will impact the text generation ability of Vicuna seriously. Therefore, we propose a novel way to collect a small yet high-quality image-description pair dataset created by the model itself and polished by ChatGPT. After the traditional image-text training stage like BLIP-2 did, we further fineturn MiniGPT-4 on this dataset with conversation prompts together so MiniGPT-4 can generate coherent text to answer user's questions and improve its usability. This fineturn stage is very efficient and can be finished in 7 mins with 1 A100. However, its effectiveness is significant.
4 | > 
5 | > Another important finding is that we don't fine-tune the Q-Former like BLIP-2, but directly use the Q-Former aligned with FlanT5 before and only train a single projecting layer. We show that such a simple linear layer is enough to let Vicuna see the image. This makes our training very efficient.
6 | 
7 | 引自 https://github.com/Vision-CAIR/MiniGPT-4/issues/7 


--------------------------------------------------------------------------------
/docs/notes/mulinfo_contrastive.md:
--------------------------------------------------------------------------------
 1 | # 互信息和对比学习的关系
 2 | 
 3 | 互信息和对比学习虽然是两个不同的概念，但是在很多深度学习任务中，它们之间存在紧密的联系。以下是对它们关系的详细解释：
 4 | 
 5 | ### 互信息
 6 | 
 7 | 互信息（Mutual Information, MI）是一个度量两个随机变量之间相互依赖程度的量。它表示给定一个变量的信息，我们能获得关于另一个变量的信息量。具体来说，互信息可以用来衡量图像表示和文本表示之间的关联度或相似度。
 8 | 
 9 | 在多模态学习中，例如图像-文本对齐任务，最大化图像和文本表示之间的互信息可以提高模型的表现，因为这意味着模型能够更好地捕捉图像和文本之间的关联。
10 | 
11 | ### 对比学习
12 | 
13 | 对比学习（Contrastive Learning）是一种自监督学习方法，其目标是学习一个表示空间，使得相似的样本对在表示空间中靠近，不相似的样本对在表示空间中远离。对比学习的核心思想是通过构造正样本对（相似的样本）和负样本对（不相似的样本），来训练模型识别和区分这些样本对。
14 | 
15 | 常见的对比学习损失函数包括对比损失（contrastive loss）、InfoNCE 损失（基于噪声对比估计）和三元组损失（triplet loss）。这些损失函数在优化过程中都会鼓励模型将正样本对的表示拉近，将负样本对的表示拉远。
16 | 
17 | ### 互信息和对比学习的关系
18 | 
19 | 1. **目标一致性**：在对比学习中，通过最大化正样本对的相似度和最小化负样本对的相似度，实际上是在隐式地最大化样本对（例如图像-文本对）的互信息。因为正样本对应该具有更高的互信息，而负样本对应该具有更低的互信息。
20 | 2. **InfoNCE 损失和互信息**：InfoNCE（Noise Contrastive Estimation）损失是一种特殊的对比学习损失函数，它直接与互信息最大化有关。InfoNCE损失在优化过程中，显式地近似最大化了样本对之间的互信息，因此被广泛应用于对比学习任务中。
21 | 3. **相辅相成**：互信息提供了理论上的指导，说明了为什么对比学习可以有效地学习到有用的表示。而对比学习提供了具体的实现方法，通过构造和优化正负样本对，使得模型能够在实践中有效地最大化互信息。
22 | 
23 | 总之，互信息和对比学习在本质上都是为了提高模型对数据内在结构和关联的捕捉能力。互信息提供了理论基础，而对比学习则提供了实际的优化方法，两者相互补充，共同促进了多模态学习和表示学习的进步。


--------------------------------------------------------------------------------
/docs/notes/o1_literatures.md:
--------------------------------------------------------------------------------
 1 | Comments:
 2 | 
 3 | - The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.[^1]
 4 | - o1 is a model that thinks before giving the final answer. In my own words, here are the biggest updates to the field of AI.[^2]
 5 | - Don’t do chain of thought purely via prompting, train models to do better chain of thought using RL.[^2]
 6 | - In the history of deep learning we have always tried to scale training compute, but chain of thought is a form of adaptive compute that can also be scaled at inference time. [^2]
 7 | - Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. [^3]
 8 | - Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.[^3]
 9 | - We're no longer limited by pretraining paradigm; now, we can scale through inference compute, opening up new possibilities for capabilities and alignment.[^4]
10 | - The interesting update from Strawberry is that OpenAI has found a way to add a new dimension on which to improve performance: compute during inference. The company has found that when Strawberry takes longer to respond to a prompt—in other words, when it’s given more time to think—it generally responds more accurately.[^5]
11 | - OpenAI released their new reasoning system, o1, building on the early successes of Q* and more recently the rumors of Strawberry, to ship a new mode of interacting with AI on challenging tasks. o1 is a system designed by training new models on long reasoning chains, with lots of reinforcement learning 🍒, and deploying them at scale. Unlike traditional autoregressive language models, it is doing an online search for the user. It is spending more on inference, which confirms the existence of new scaling laws — inference scaling laws.[^6]
12 | - When referring to o1, it is best to refer to it as a system. There’s a chance all the operations are funneled through one advanced language model, but the funneling and recycling of those computations in a way that creates coherent outputs for the user is very complex. [^6]
13 | - OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there're only 2 techniques that scale indefinitely with compute: learning & search. It's time to shift focus to the latter. [^7]
14 | - You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts, in order to perform well in benchmarks like trivia QA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like browser and code verifier. Pre-training compute may be decreased.[^7]
15 | - A huge amount of compute is shifted to serving inference instead of pre/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem like AlphaGo's monte carlo tree search (MCTS).[^7]
16 | 
17 | 
18 | 
19 | Reading List:
20 | 
21 | - [Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](https://arxiv.org/abs/2408.03314)
22 | - https://github.com/hijkzzz/Awesome-LLM-Strawberry
23 | - [A Survey on Self-play Methods in Reinforcement Learning](https://arxiv.org/abs/2408.01072) 
24 | - [Reverse Engineering o1 Architecture](https://www.reddit.com/r/LocalLLaMA/comments/1fgr244/reverse_engineering_o1_architecture_with_a_little/)
25 | - [Reverse engineering OpenAI’s o1](https://www.interconnects.ai/p/reverse-engineering-openai-o1)
26 | - [Improving LLM Reasoning using SElf-generated data:RL and Verifiers](https://drive.google.com/file/d/1komQ7s9kPPvDx_8AxTh9A6tlfJA0j6dR/view)
27 | - [o1 发布后，信息量最大的圆桌对话：杨植麟、姜大昕、朱军探讨大模型技术路径](https://mp.weixin.qq.com/s/FSiCYyc1W6CFCT_eCwVSsw)
28 | - [Building OpenAI o1 (Extended Cut)](https://www.youtube.com/watch?v=tEzs3VHyBDM)
29 | 
30 | References:
31 | 
32 | [^1]: OpenAI o1 System Card </br>
33 | [^2]: Jason Wei </br> 
34 | [^3]: https://openai.com/index/learning-to-reason-with-llms/ </br>
35 | [^4]: Mira Murati </br>
36 | [^5]: https://every.to/chain-of-thought/openai-s-o1-model-explained </br>
37 | [^6]: https://www.interconnects.ai/p/reverse-engineering-openai-o1 </br>
38 | [^7]: Jim Fan </br>
39 | 


--------------------------------------------------------------------------------
/docs/notes/plan.md:
--------------------------------------------------------------------------------
1 | - [ ] Blip2 attention mask梳理
2 | - [ ] lavis解读
3 | 
4 | 


--------------------------------------------------------------------------------
/docs/notes/pretrain_stage1.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "cba2ee99-f930-48c0-a8b7-79dc8876afc3",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "model:\n",
 11 |     "  arch: blip2"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "id": "c5d6f2c4-019a-46c7-85c5-1237e3fd05d2",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "模型结构是blip2。\n",
 20 |     "在 https://github.com/salesforce/LAVIS/blob/main/train.py#L94 的build_model方法，会根据arch得到模型类。\n",
 21 |     "\n",
 22 |     "build_model调用registry.get_model_class(model_config.arch), https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/tasks/base_task.py#L32\n",
 23 |     "\n",
 24 |     "blip2模型实际是Blip2Qformer， 定义于 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L25 "
 25 |    ]
 26 |   },
 27 |   {
 28 |    "cell_type": "raw",
 29 |    "id": "9b905999-ebaa-48cf-b990-6321b09eaf8a",
 30 |    "metadata": {},
 31 |    "source": []
 32 |   },
 33 |   {
 34 |    "cell_type": "code",
 35 |    "execution_count": null,
 36 |    "id": "796a203f-998a-4dbd-b6f1-8ae44f76d989",
 37 |    "metadata": {},
 38 |    "outputs": [],
 39 |    "source": [
 40 |     "model_type: pretrain"
 41 |    ]
 42 |   },
 43 |   {
 44 |    "cell_type": "markdown",
 45 |    "id": "36d05ff7-4772-43a6-98f4-38de9af81dd1",
 46 |    "metadata": {},
 47 |    "source": [
 48 |     "Each architecture has variants trained on different datasets and at different scale. 好像没什么实际作用。"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "raw",
 53 |    "id": "9b79a370-3262-4755-910f-c2ad589bc800",
 54 |    "metadata": {},
 55 |    "source": []
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "id": "11281184-e10f-4108-bc84-7af56ea873e8",
 61 |    "metadata": {},
 62 |    "outputs": [],
 63 |    "source": [
 64 |     "datasets:\n",
 65 |     "  coco_caption:\n",
 66 |     "    vis_processor:\n",
 67 |     "        train:\n",
 68 |     "          name: \"blip2_image_train\"\n",
 69 |     "          image_size: 224\n",
 70 |     "    text_processor:\n",
 71 |     "        train:\n",
 72 |     "          name: \"blip_caption\"        \n",
 73 |     "  vg_caption: # name of the dataset builder\n",
 74 |     "    vis_processor:\n",
 75 |     "        train:\n",
 76 |     "          name: \"blip_image_train\"\n",
 77 |     "          image_size: 224\n",
 78 |     "    text_processor:\n",
 79 |     "        train:\n",
 80 |     "          name: \"blip_caption\""
 81 |    ]
 82 |   },
 83 |   {
 84 |    "cell_type": "markdown",
 85 |    "id": "664de2a2-af46-4a7e-905a-aa168c2ca1af",
 86 |    "metadata": {},
 87 |    "source": [
 88 |     "训练所用的数据集。每个数据集，都会被管理到独有的处理类，加载成同一的形式。"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "raw",
 93 |    "id": "54e7dba6-7cf0-4f94-a00b-903adb2a4fb4",
 94 |    "metadata": {},
 95 |    "source": []
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "id": "6b207b02-52aa-485d-a1aa-297110ef2262",
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "run:\n",
105 |     "  task: image_text_pretrain"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "markdown",
110 |    "id": "238db4cf-0533-46aa-9a5d-ce737baa6789",
111 |    "metadata": {},
112 |    "source": [
113 |     "配置任务。每个任务都会关联到具体的任务类。\n",
114 |     "\n",
115 |     "image_text_pretrain任务定义于 https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/tasks/image_text_pretrain.py"
116 |    ]
117 |   }
118 |  ],
119 |  "metadata": {
120 |   "kernelspec": {
121 |    "display_name": "Python 3 (ipykernel)",
122 |    "language": "python",
123 |    "name": "python3"
124 |   },
125 |   "language_info": {
126 |    "codemirror_mode": {
127 |     "name": "ipython",
128 |     "version": 3
129 |    },
130 |    "file_extension": ".py",
131 |    "mimetype": "text/x-python",
132 |    "name": "python",
133 |    "nbconvert_exporter": "python",
134 |    "pygments_lexer": "ipython3",
135 |    "version": "3.10.8"
136 |   }
137 |  },
138 |  "nbformat": 4,
139 |  "nbformat_minor": 5
140 | }
141 | 


--------------------------------------------------------------------------------
/docs/notes/rope.md:
--------------------------------------------------------------------------------
  1 | # RoPE Explained
  2 | 
  3 | ## 原理
  4 | RoPE的出发点是“通过绝对位置编码的方式实现相对位置编码”。
  5 | 
  6 | 原论文有很多公式，目的是找到一组满足这样关系的函数：
  7 | 
  8 | $$
  9 | \left\langle f_q\left(\boldsymbol{x}_m, m\right), f_k\left(\boldsymbol{x}_n, n\right)\right\rangle=g\left(\boldsymbol{x}_m, \boldsymbol{x}_n, m-n\right) . 
 10 | $$
 11 | 
 12 | 意思是，位于m位置的q向量和位于n位置k向量，施加了“位置作用”之后得到的新向量的点积，只和这两个向量以及它们的相对位置有关。
 13 | 
 14 | 原论文证明，在二维情况下，这个位置作用可以表示为旋转矩阵：
 15 | 
 16 | <img src="../images/rope1.png" style="zoom:20%;" />
 17 | 
 18 | 这个矩阵将二维向量逆时针旋转 $\theta$ 度：
 19 | 
 20 | <img src="../images/rope2.png" style="zoom:35%;" />
 21 | 
 22 | 进一步，多维情况的矩阵是：
 23 | 
 24 | <img src="../images/rope3.png" style="zoom:35%;" />
 25 | 
 26 | 其中
 27 | 
 28 | <img src="../images/rope4.png" style="zoom:25%;" /> 
 29 | 
 30 | 这个矩阵对输入向量依次两两维度进行旋转操作（所以最多 $d/2$ 个操作）：
 31 | 
 32 | <img src="../images/rope5.png" style="zoom:40%;" /> 
 33 | 
 34 | 
 35 | 
 36 | 
 37 | ## 实现
 38 | 实现不需要声明巨大的矩阵，也不需要作矩阵乘法，因为原矩阵乘法相当于向量的乘机和求和：
 39 | 
 40 | <img src="../images/rope6.png" style="zoom:35%;" /> 
 41 | 
 42 | 下面是[llama3](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/api/model.py)中rope的实现：
 43 | 
 44 | <iframe src="notes/rope_imp.html" width="100%" height="600px"></iframe>
 45 | 
 46 | 
 47 | ## 优点
 48 | RoPE的优良性质有：
 49 | - 序列长度灵活。能有效处理不同长度的输入。
 50 | - 远程衰减。两个qk的相对位置越远，计算attention时的权重越小。
 51 | - 可以拓展到线性attention（用得少）。
 52 | 
 53 | ## 附录：复数和旋转相关概念
 54 | **复数**
 55 | 
 56 | 复数是形如 $a+b i$ 的数, 其中 $a$ 和 $b$ 是实数, $i$ 是虚数单位, 满足 $i^2=-1$ 。复数可以表示为一个二维平面上的点或向量, 这个平面称为复平面, 其中 $x$ 轴表示实数部分, $y$ 轴表示虚数部分。
 57 | 
 58 | **极坐标**
 59 | 
 60 | 复数也可以用极坐标表示。极坐标形式表示一个点的距离（模）和角度（幅角）：
 61 | 
 62 | - 模（或幅度）$r$ ：复数从原点到该点的距离，计算公式为 $r = \sqrt{a^2 + b^2}$。
 63 | - 幅角（或相位）$\theta$：从正实轴到该点的角度，计算公式为 $\theta = \tan^{-1}(b / a)$ ，需要注意象限。
 64 | 
 65 | 因此，一个复数 $z$ 在极坐标下可以表示为 $z = r (\cos \theta + i \sin \theta)$，或者更常用的形式 $z = re^{i\theta}$ 。欧拉公式 $e^{i\theta} = \cos \theta + i \sin \theta $ 连接了指数函数和三角函数，是复数极坐标表示的基础。
 66 | 
 67 | **复数乘法和旋转的关系**
 68 | 
 69 | 复数乘法可以视为在复平面上进行旋转和缩放的操作。
 70 | 
 71 | 假设我们有两个复数 $z_1 = r_1 e^{i\theta_1}$ 和 $z_2 = r_2 e^{i\theta_2}$。
 72 | 
 73 | 复数的乘法定义为：
 74 | $$
 75 | z_1 \cdot z_2 = (r_1 e^{i\theta_1}) \cdot (r_2 e^{i\theta_2})
 76 | $$ 
 77 | 
 78 | 根据指数的性质，这相当于：
 79 | $$
 80 | z_1 \cdot z_2 = r_1 r_2 e^{i(\theta_1 + \theta_2)}
 81 | $$
 82 | 
 83 | 从几何角度来看，这意味着：
 84 | - 模的相乘：结果复数的模是两个复数模的乘积。
 85 | - 相位角的相加：结果复数的相位角是两个复数相位角的和。
 86 | 
 87 | 这正是旋转和缩放的几何意义：
 88 | - 缩放：模的乘积 $r_1 r_2$ 表示结果复数在复平面上距离原点的距离。通过乘法，复数的模会被缩放。
 89 | - 旋转：相位角的和 $\theta_1 + \theta_2$ 表示结果复数在复平面上的旋转角度。通过乘法，复数的相位角会被旋转。
 90 | 
 91 | **矩阵乘法表示的复数旋转**
 92 | 
 93 | <img src="../images/rope7.png" style="zoom:100%;" />
 94 | 
 95 | 
 96 | *参考资料*
 97 | - [ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING](https://arxiv.org/pdf/2104.09864)
 98 | - [Rotary Positional Embeddings: Combining Absolute and Relative](https://www.youtube.com/watch?v=o29P0Kpobz0&t=530s)
 99 | 
100 | 


--------------------------------------------------------------------------------
/docs/notes/rope_imp.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "code",
  5 |    "execution_count": null,
  6 |    "id": "0327518c-717e-45f5-92fe-2ed72bf0a0b4",
  7 |    "metadata": {},
  8 |    "outputs": [],
  9 |    "source": [
 10 |     "import math\n",
 11 |     "from typing import Tuple\n",
 12 |     "\n",
 13 |     "import torch\n",
 14 |     "import torch.nn.functional as F\n",
 15 |     "from torch import nn"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": null,
 21 |    "id": "4d64c744-2e24-41b5-bf1f-048e5925bdfb",
 22 |    "metadata": {},
 23 |    "outputs": [],
 24 |    "source": [
 25 |     "def apply_scaling(freqs: torch.Tensor):\n",
 26 |     "    # Values obtained from grid search\n",
 27 |     "    scale_factor = 8\n",
 28 |     "    low_freq_factor = 1\n",
 29 |     "    high_freq_factor = 4\n",
 30 |     "    old_context_len = 8192  # original llama3 length\n",
 31 |     "\n",
 32 |     "    low_freq_wavelen = old_context_len / low_freq_factor\n",
 33 |     "    high_freq_wavelen = old_context_len / high_freq_factor\n",
 34 |     "    new_freqs = []\n",
 35 |     "    for freq in freqs:\n",
 36 |     "        wavelen = 2 * math.pi / freq\n",
 37 |     "        if wavelen < high_freq_wavelen:\n",
 38 |     "            new_freqs.append(freq)\n",
 39 |     "        elif wavelen > low_freq_wavelen:\n",
 40 |     "            new_freqs.append(freq / scale_factor)\n",
 41 |     "        else:\n",
 42 |     "            assert low_freq_wavelen != high_freq_wavelen\n",
 43 |     "            smooth = (old_context_len / wavelen - low_freq_factor) / (\n",
 44 |     "                high_freq_factor - low_freq_factor\n",
 45 |     "            )\n",
 46 |     "            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)\n",
 47 |     "    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "markdown",
 52 |    "id": "0d9a6cac-baa6-4ece-baad-e02b0bcab66f",
 53 |    "metadata": {},
 54 |    "source": [
 55 |     "这个函数对频率进行缩放，目的是对频率进行调整，使得它们在不同的范围内平滑过渡。这对于处理长序列时的频率调整非常有用。\n",
 56 |     "\n",
 57 |     "```\n",
 58 |     "scale_factor：缩放因子。\n",
 59 |     "low_freq_factor 和 high_freq_factor：用于计算频率的上下界。\n",
 60 |     "old_context_len：原始上下文长度。\n",
 61 |     "low_freq_wavelen 和 high_freq_wavelen：低频和高频的波长。\n",
 62 |     "函数通过一个平滑函数在上下界之间进行过渡，确保频率在不同范围内的平滑性。\n",
 63 |     "```"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "raw",
 68 |    "id": "e8416321-afc8-413a-82bc-60639116524a",
 69 |    "metadata": {},
 70 |    "source": []
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": null,
 75 |    "id": "69cf4c77-4231-4aa8-b1cd-77d3df135473",
 76 |    "metadata": {},
 77 |    "outputs": [],
 78 |    "source": [
 79 |     "def precompute_freqs_cis(\n",
 80 |     "    dim: int, end: int, theta: float = 500000.0, use_scaled: bool = False\n",
 81 |     "):\n",
 82 |     "    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))\n",
 83 |     "    t = torch.arange(end, device=freqs.device, dtype=torch.float32)\n",
 84 |     "    if use_scaled:\n",
 85 |     "        freqs = apply_scaling(freqs)\n",
 86 |     "    freqs = torch.outer(t, freqs)\n",
 87 |     "    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64\n",
 88 |     "    return freqs_cis"
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "markdown",
 93 |    "id": "23c5068d-84ac-476c-a79a-0ce2a318c96d",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "这个函数预计算频率的复数表示。\n",
 97 |     "\n",
 98 |     "```\n",
 99 |     "dim：频率维度。是原论文中的theta\n",
100 |     "end：时间步数长度，即序列的长度，最大token数。\n",
101 |     "theta：控制频率的基数。是原论文中的10000。\n",
102 |     "use_scaled：是否使用缩放。\n",
103 |     "频率的计算方式是基于指数函数的，生成的频率通过torch.polar函数转成复数表示。\n",
104 |     "```\n",
105 |     "\n",
106 |     "在这里，freqs对应原论文中的theta，而theta则对应原论文中的10000。这是需要注意的地方。\n",
107 |     "\n",
108 |     "cis是cosine和sine的缩写。具体来说，cis θ表示 cosθ + i*sinθ，其中 i 是虚数单位，θ是角度。这种表示方法在复数的极坐标形式中非常常见。\n",
109 |     "\n",
110 |     "t对应了原论文中的m。\n",
111 |     "\n",
112 |     "freqs = torch.outer(t, freqs)，将freqs通过外积计算，变成。new_freqs[i][j] = t[i]*freqs[j]。新的freqs矩阵在原论文中没有对应。\n",
113 |     "\n",
114 |     "torch.polar将极坐标形式的数值转换为复数。它接受两个参数：幅度和相位角，并返回一个复数张量。所以freqs_cis是幅度为1，相位角为freqs对应位置的复数张量。"
115 |    ]
116 |   },
117 |   {
118 |    "cell_type": "markdown",
119 |    "id": "d664d8ac-3215-452c-831e-1d372de74790",
120 |    "metadata": {},
121 |    "source": [
122 |     "过一个具体的例子来展示如何计算 `freqs_cis`。\n",
123 |     "\n",
124 |     "假设我们有以下参数：\n",
125 |     "- `dim = 4`\n",
126 |     "- `end = 3`\n",
127 |     "- `theta = 10000.0`\n",
128 |     "- `use_scaled = False`\n",
129 |     "\n",
130 |     "### 1. 计算频率 `freqs`\n",
131 |     "\n",
132 |     "首先，生成频率向量 `freqs`：\n",
133 |     "\n",
134 |     "dim = 4\n",
135 |     "theta = 10000.0\n",
136 |     "freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))\n",
137 |     "\n",
138 |     "\n",
139 |     "具体步骤如下：\n",
140 |     "- `torch.arange(0, dim, 2)[: (dim // 2)]` 生成序列 `[0, 2]`。\n",
141 |     "- `theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)` 计算：\n",
142 |     "  - `theta ** (0 / 4) = 10000 ** 0 = 1.0`\n",
143 |     "  - `theta ** (2 / 4) = 10000 ** 0.5 = 100.0`\n",
144 |     "- 取倒数：\n",
145 |     "  - `1.0 / 1.0 = 1.0`\n",
146 |     "  - `1.0 / 100.0 = 0.01`\n",
147 |     "  \n",
148 |     "所以 `freqs` 为 `[1.0, 0.01]`。\n",
149 |     "\n",
150 |     "### 2. 生成时间步 `t`\n",
151 |     "\n",
152 |     "接下来生成时间步 `t`：\n",
153 |     "\n",
154 |     "end = 3\n",
155 |     "t = torch.arange(end, device=freqs.device, dtype=torch.float32)\n",
156 |     "\n",
157 |     "\n",
158 |     "- `torch.arange(end)` 生成序列 `[0, 1, 2]`。\n",
159 |     "  \n",
160 |     "所以 `t` 为 `[0, 1, 2]`。\n",
161 |     "\n",
162 |     "### 3. 计算外积 `freqs`\n",
163 |     "\n",
164 |     "计算时间步 `t` 和频率 `freqs` 的外积：\n",
165 |     "\n",
166 |     "freqs = torch.outer(t, freqs)\n",
167 |     "\n",
168 |     "\n",
169 |     "具体计算如下：\n",
170 |     "- `t` 为 `[0, 1, 2]`\n",
171 |     "- `freqs` 为 `[1.0, 0.01]`\n",
172 |     "- 外积结果为：\n",
173 |     "  \n",
174 |     "  [[0 * 1.0, 0 * 0.01],\n",
175 |     "   [1 * 1.0, 1 * 0.01],\n",
176 |     "   [2 * 1.0, 2 * 0.01]]\n",
177 |     "  \n",
178 |     "  即：\n",
179 |     "  \n",
180 |     "  [[0.0, 0.0],\n",
181 |     "   [1.0, 0.01],\n",
182 |     "   [2.0, 0.02]]\n",
183 |     "  \n",
184 |     "\n",
185 |     "### 4. 生成复数表示 `freqs_cis`\n",
186 |     "\n",
187 |     "最后生成复数表示 `freqs_cis`：\n",
188 |     "\n",
189 |     "freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64\n",
190 |     "\n",
191 |     "\n",
192 |     "- `torch.ones_like(freqs)` 生成一个与 `freqs` 形状相同的全1张量：\n",
193 |     "  \n",
194 |     "  [[1.0, 1.0],\n",
195 |     "   [1.0, 1.0],\n",
196 |     "   [1.0, 1.0]]\n",
197 |     "  \n",
198 |     "- `torch.polar(torch.ones_like(freqs), freqs)` 将幅度为1，相位为 `freqs` 的复数表示出来：\n",
199 |     "  - 对于 `freqs` 的每个元素，计算对应的复数：\n",
200 |     "    - `cis(0.0) = cos(0.0) + i * sin(0.0) = 1.0 + 0.0i`\n",
201 |     "    - `cis(0.0) = cos(0.0) + i * sin(0.0) = 1.0 + 0.0i`\n",
202 |     "    - `cis(1.0) = cos(1.0) + i * sin(1.0) ≈ 0.5403 + 0.8415i`\n",
203 |     "    - `cis(0.01) = cos(0.01) + i * sin(0.01) ≈ 0.99995 + 0.0099998i`\n",
204 |     "    - `cis(2.0) = cos(2.0) + i * sin(2.0) ≈ -0.4161 + 0.9093i`\n",
205 |     "    - `cis(0.02) = cos(0.02) + i * sin(0.02) ≈ 0.9998 + 0.0199987i`\n",
206 |     "\n",
207 |     "  所以 `freqs_cis` 为：\n",
208 |     "  \n",
209 |     "  [[1.0 + 0.0i,       1.0 + 0.0i],\n",
210 |     "   [0.5403 + 0.8415i, 0.99995 + 0.0099998i],\n",
211 |     "   [-0.4161 + 0.9093i, 0.9998 + 0.0199987i]]"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": null,
217 |    "id": "af6212eb-3a7a-4286-8d58-9997c369d0e4",
218 |    "metadata": {},
219 |    "outputs": [],
220 |    "source": []
221 |   },
222 |   {
223 |    "cell_type": "code",
224 |    "execution_count": null,
225 |    "id": "7221efb4-aa06-42bb-9eb7-c7f7f1686b6d",
226 |    "metadata": {},
227 |    "outputs": [],
228 |    "source": [
229 |     "def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):\n",
230 |     "    ndim = x.ndim\n",
231 |     "    assert 0 <= 1 < ndim\n",
232 |     "    assert freqs_cis.shape == (x.shape[1], x.shape[-1])\n",
233 |     "    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]\n",
234 |     "    return freqs_cis.view(*shape)"
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "raw",
239 |    "id": "4c2b709a-1638-4eee-bdfa-e1b8cddfd7a3",
240 |    "metadata": {},
241 |    "source": [
242 |     "这个函数将频率复数张量freqs_cis的形状调整为可以与输入张量x进行广播操作的形状。"
243 |    ]
244 |   },
245 |   {
246 |    "cell_type": "raw",
247 |    "id": "acdf9920-cfef-4229-905a-f0c91529a16e",
248 |    "metadata": {},
249 |    "source": []
250 |   },
251 |   {
252 |    "cell_type": "code",
253 |    "execution_count": null,
254 |    "id": "b4ab231a-c567-4b6b-a5dd-7a076302f133",
255 |    "metadata": {},
256 |    "outputs": [],
257 |    "source": [
258 |     "def apply_rotary_emb(\n",
259 |     "    xq: torch.Tensor,\n",
260 |     "    xk: torch.Tensor,\n",
261 |     "    freqs_cis: torch.Tensor,\n",
262 |     ") -> Tuple[torch.Tensor, torch.Tensor]:\n",
263 |     "    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))\n",
264 |     "    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))\n",
265 |     "    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)\n",
266 |     "    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)\n",
267 |     "    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)\n",
268 |     "    return xq_out.type_as(xq), xk_out.type_as(xk)"
269 |    ]
270 |   },
271 |   {
272 |    "cell_type": "markdown",
273 |    "id": "59958a8e-ba81-4725-b136-3f1cdf5a5f6e",
274 |    "metadata": {},
275 |    "source": [
276 |     "这个函数将旋转位置编码应用于query和key张量。\n",
277 |     "\n",
278 |     "```\n",
279 |     "xq和xk：q和k张量。\n",
280 |     "freqs_cis：预计算的复数频率张量。\n",
281 |     "```\n",
282 |     "\n",
283 |     "torch.view_as_complex 将一个实数张量视为复数张量，其形状最后一维的大小必须是2，这意味着该张量的最后一维包含了复数的实部和虚部。函数返回一个新的复数张量，其形状与输入张量的形状相同，除了最后一维的大小从2变为1。\n",
284 |     "\n",
285 |     "torch.view_as_real(xq_ * freqs_cis).flatten(3)则对应了截图中橙色叉下面的公式。"
286 |    ]
287 |   },
288 |   {
289 |    "cell_type": "raw",
290 |    "id": "612fb92d-fb4a-4826-a109-a21f49729aff",
291 |    "metadata": {},
292 |    "source": []
293 |   },
294 |   {
295 |    "cell_type": "code",
296 |    "execution_count": null,
297 |    "id": "85b8d8e5-0487-4582-bff0-d4d0425f9349",
298 |    "metadata": {},
299 |    "outputs": [],
300 |    "source": [
301 |     "xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)"
302 |    ]
303 |   },
304 |   {
305 |    "cell_type": "raw",
306 |    "id": "4e93e193-f8a5-459a-a44d-6565ffc048b1",
307 |    "metadata": {},
308 |    "source": [
309 |     "应用rope。"
310 |    ]
311 |   }
312 |  ],
313 |  "metadata": {
314 |   "kernelspec": {
315 |    "display_name": "Python 3 (ipykernel)",
316 |    "language": "python",
317 |    "name": "python3"
318 |   },
319 |   "language_info": {
320 |    "codemirror_mode": {
321 |     "name": "ipython",
322 |     "version": 3
323 |    },
324 |    "file_extension": ".py",
325 |    "mimetype": "text/x-python",
326 |    "name": "python",
327 |    "nbconvert_exporter": "python",
328 |    "pygments_lexer": "ipython3",
329 |    "version": "3.10.8"
330 |   }
331 |  },
332 |  "nbformat": 4,
333 |  "nbformat_minor": 5
334 | }
335 | 


--------------------------------------------------------------------------------
/docs/notes/save_checkpoint.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": null,
 6 |    "id": "fe44ce34-3069-45ef-9f48-ead5d5cdf0e6",
 7 |    "metadata": {},
 8 |    "outputs": [],
 9 |    "source": [
10 |     "param_grad_dic = {\n",
11 |     "    k: v.requires_grad for (k, v) in model_no_ddp.named_parameters()\n",
12 |     "}"
13 |    ]
14 |   },
15 |   {
16 |    "cell_type": "markdown",
17 |    "id": "3a00cf85-54a9-4fc3-94b2-9062e9f470f3",
18 |    "metadata": {},
19 |    "source": [
20 |     "创建一个字典 param_grad_dic，其中键是参数名，值是该参数是否需要计算梯度（requires_grad）。"
21 |    ]
22 |   },
23 |   {
24 |    "cell_type": "raw",
25 |    "id": "5165e19e-c695-4157-a509-1e16b94bf902",
26 |    "metadata": {},
27 |    "source": []
28 |   },
29 |   {
30 |    "cell_type": "code",
31 |    "execution_count": null,
32 |    "id": "7cae1ef5-3130-4d67-8cf8-21b71c4f63ef",
33 |    "metadata": {},
34 |    "outputs": [],
35 |    "source": [
36 |     "state_dict = model_no_ddp.state_dict()"
37 |    ]
38 |   },
39 |   {
40 |    "cell_type": "markdown",
41 |    "id": "49dd42aa-ac7a-4a23-a7db-f538aad82227",
42 |    "metadata": {},
43 |    "source": [
44 |     "state_dict状态字典包含所有模型参数和值。"
45 |    ]
46 |   },
47 |   {
48 |    "cell_type": "raw",
49 |    "id": "2e1a5939-9dab-41ab-8c85-7f5f6ed412af",
50 |    "metadata": {},
51 |    "source": []
52 |   },
53 |   {
54 |    "cell_type": "code",
55 |    "execution_count": null,
56 |    "id": "fbd88369-9057-4b67-a361-6417f28aaeda",
57 |    "metadata": {},
58 |    "outputs": [],
59 |    "source": [
60 |     "for k in list(state_dict.keys()):\n",
61 |     "    if k in param_grad_dic.keys() and not param_grad_dic[k]:\n",
62 |     "        del state_dict[k]"
63 |    ]
64 |   },
65 |   {
66 |    "cell_type": "markdown",
67 |    "id": "c194c6d1-4832-4a8d-b802-4c49b54b183d",
68 |    "metadata": {},
69 |    "source": [
70 |     "删除不需要梯度的参数。"
71 |    ]
72 |   }
73 |  ],
74 |  "metadata": {
75 |   "kernelspec": {
76 |    "display_name": "Python 3 (ipykernel)",
77 |    "language": "python",
78 |    "name": "python3"
79 |   },
80 |   "language_info": {
81 |    "codemirror_mode": {
82 |     "name": "ipython",
83 |     "version": 3
84 |    },
85 |    "file_extension": ".py",
86 |    "mimetype": "text/x-python",
87 |    "name": "python",
88 |    "nbconvert_exporter": "python",
89 |    "pygments_lexer": "ipython3",
90 |    "version": "3.10.8"
91 |   }
92 |  },
93 |  "nbformat": 4,
94 |  "nbformat_minor": 5
95 | }
96 | 


--------------------------------------------------------------------------------
/docs/notes/scaliing_laws.md:
--------------------------------------------------------------------------------
 1 | # Scaliing Laws
 2 | 
 3 | 在Scaling Laws for Neural Language Models论文中，scaling law指的是损失随着模型大小、数据集大小和训练使用的计算量呈幂律变化（图1）。
 4 | 
 5 | ![](../images/scale_law1.png)
 6 | 
 7 | 到了现在，scaling law泛指大模型规模变化的规律，规模可以是模型参数量、数据量、训练时间等等，规律可以是模型表现随模型参数量变化的规律、最佳数据量随模型参数量变化的规律等等。
 8 | 
 9 | 比如，在Palm2技术报告中，第二节专门记录了scaling law的实验结果。实验表明，当计算量增加时，为了达到最佳效果，模型参数量和和数据量应该等比例增加（图2）。
10 | 
11 | ![](../images/scale_law2.png)


--------------------------------------------------------------------------------
/docs/notes/video_chatgpt.md:
--------------------------------------------------------------------------------
 1 | # Video-ChatGPT结构
 2 | 
 3 | ![arch](../images/video-chatgpt.png)
 4 | 
 5 | Visual encoder用的是CLIP ViT-L/14。
 6 | 给定一个视频 $V_i \in \mathbb{R}^{T \times H \times W \times C}$ ，其中，$T$ 是帧数，$H$ 是高度， $W$是宽度，$C$ 是channel数量。
 7 | 用visual encoder对这些帧进行批计算，得到 $x_i \in \mathbb{R}^{T \times h \times w \times D}$ ，其中$h=H / p, w=W / p$，$p$是patch size。token的数量记为 $N=h \times w$。
 8 | 
 9 | Frame-level embeddings are average-pooled along the temporal dimension to obtain a video-level temporal representation $t_i \in \mathbb{R}^{N \times D}$. This operation, referred to as temporal pooling, implicitly incorporates temporal learning through the aggregation of multiple frames. 这个操作具体是这样的：首先将$x_i$的 $h \times w$ 进行拼接，得到 $T \times N \times D$；然后，在$T$维度上取平均，得到$N \times D$。
10 | 
11 | Similarly, the frame-level embeddings are average-pooled along the spatial dimension to yield the video-level spatial representation $z_i \in \mathbb{R}^{T \times D}$. The temporal and spatial features are concatenated to obtain the video-level features $v_i$,
12 | $$
13 | v_i=\left[\begin{array}{ll}t_i & z_i\end{array}\right] \in \mathbb{R}^{(T+N) \times D}
14 | $$
15 | 注意，如果视频很短或帧率很小，N可能远大于T。
16 | 
17 | A simple trainable linear layer $g$, projects these video-level features into the language decoder's embedding space, transforming them into corresponding language embedding tokens $Q_v$,
18 | $$
19 | Q_v=g\left(v_i\right) \in \mathbb{R}^{(T+N) \times K} .
20 | $$
21 | 其中，$K$ 是llm的token向量的维度。
22 | 
23 | 
24 | 


--------------------------------------------------------------------------------
/docs/notes/vit.md:
--------------------------------------------------------------------------------
 1 | # ViT
 2 | 
 3 | ![](../images/vit.png)
 4 | 
 5 | To perform classification, a standard approach is to prepend an extra learnable embedding “classification token” [CLASS] to the sequence of embedded patche
 6 | $$
 7 | \mathbf{Z} \leftarrow \operatorname{concat}([\mathrm{CLASS}], \mathbf{X W})
 8 | $$
 9 | 
10 | ## patch size 14 x 14是什么意思？
11 | 
12 | 在Vision Transformer（ViT）模型中，"patch size 14 x 14"指的是将输入图像划分为14像素乘14像素的图像块（patch）。这个概念是ViT模型的核心之一，用于将图像数据转化为适合Transformer处理的形式。
13 | 
14 | ### 具体解释：
15 | 
16 | 1. **输入图像**：假设我们有一个输入图像，其尺寸为 $H \times W \times C$，其中 $H$ 是高度，$W$ 是宽度，$C$是通道数（通常是3，表示RGB三通道）。
17 | 2. **划分图像块**：将这个图像划分为固定大小的块。在"patch size 14 x 14"的情况下，每个图像块的尺寸为14像素乘14像素。这意味着：
18 |     - 如果图像的高度和宽度是可被14整除的（例如224 x 224），那么图像会被划分为$ (224 / 14) \times (224 / 14) = 16 \times 16$ 个图像块，总共256个图像块。
19 |     - 如果图像尺寸不是14的倍数，通常会在输入图像预处理阶段进行适当的裁剪或填充，使其尺寸适合划分成完整的图像块。
20 | 3. **展平和线性映射**：每个14 x 14的图像块展平成一个长度为 $14 \times 14 \times C$ 的向量，然后通过一个线性变换映射到一个固定的维度（比如768维），形成图像块的嵌入表示。
21 | 4. **位置编码**：为了保留图像块的位置信息，给每个块添加位置编码，这样模型可以识别每个块在原始图像中的位置。
22 | 5. **输入Transformer**：将所有图像块的嵌入表示（包含位置编码）作为序列输入到Transformer编码器中进行处理。
23 | 
24 | ### 优点：
25 | 
26 | - **减少计算量**：相比于直接处理整个图像，将图像划分为较小的块可以显著减少计算量。
27 | - **适应Transformer架构**：这种方法使得Transformer可以处理视觉数据，类似于处理自然语言中的单词序列。
28 | 
29 | ### 举例说明：
30 | 
31 | 假设输入图像是224 x 224 x 3，使用14 x 14的图像块：
32 | 
33 | - 每个图像块的尺寸是14 x 14 x 3。
34 | - 每个图像块展平成一个长度为588（即14 x 14 x 3）的向量。
35 | - 每个向量通过线性变换映射到一个固定维度（例如768维）。
36 | - 最终得到16 x 16个嵌入表示，每个表示是一个768维的向量。
37 | - 这些向量序列输入到Transformer模型中进行进一步处理。
38 | 
39 | 原论文：https://arxiv.org/abs/2010.11929


--------------------------------------------------------------------------------
/docs/notes/weight_init.md:
--------------------------------------------------------------------------------
  1 | # 参数初始化
  2 | 
  3 | ## 神经网络的参数初始化为什么重要
  4 | 神经网络的参数初始化是非常重要的，原因如下：
  5 | 1. **收敛速度和稳定性**：
  6 |     - **良好的初始化**：可以加快训练过程，使神经网络能够更快地收敛到一个较优的解。
  7 |     - **不良的初始化**：可能导致梯度消失或爆炸问题，从而使训练过程变得非常缓慢甚至无法收敛。
  8 | 2. **避免对称性**：
  9 |     - 如果所有的权重都初始化为相同的值（例如0），那么所有的神经元在每一层的更新都会是相同的。这种对称性会导致神经网络无法有效地学习到不同的特征。
 10 |     - 随机初始化可以打破这种对称性，使不同的神经元能够学习到不同的特征。
 11 | 3. **梯度消失和梯度爆炸**：
 12 |     - **梯度消失**：如果权重初始化得太小，反向传播过程中梯度可能会变得越来越小，导致前层的权重更新很慢（或者根本不更新）。
 13 |     - **梯度爆炸**：如果权重初始化得太大，反向传播过程中梯度可能会变得越来越大，导致前层的权重更新过大，导致网络不稳定。
 14 | 4. **网络深度**：
 15 |     - 对于深层神经网络，参数初始化变得更加关键，因为不良的初始化在多层传播过程中会被放大，进而显著影响网络的训练。
 16 | 5. **不同类型的初始化方法**：
 17 |     - **均匀分布初始化**：例如 Xavier 初始化，适用于激活函数为 sigmoid 或 tanh 的网络。
 18 |     - **正态分布初始化**：例如 He 初始化，适用于激活函数为 ReLU 的网络。
 19 |     - 这些方法根据网络结构和激活函数的不同，选择合适的初始化方法可以显著提高训练效果。
 20 | 6. **预训练和转移学习**：
 21 |     - 在某些情况下，使用预训练的网络权重（例如在迁移学习中使用预训练模型）也是一种有效的初始化方法，可以显著提高新的任务的训练效果和收敛速度。
 22 | 
 23 | 总的来说，正确的参数初始化能够显著提升神经网络的训练效率和效果，提高模型的性能，避免训练过程中的常见问题。
 24 | 
 25 | ## 有哪些方法可以减少参数初始化带来的影响
 26 | 
 27 | 减少参数初始化带来的负面影响可以通过以下几种方法：
 28 | 
 29 | 1. **规范化初始化方法**：
 30 |     - **Xavier/Glorot 初始化**：适用于 sigmoid 和 tanh 激活函数。权重从一个均匀分布中抽取。
 31 |     - **He 初始化**：适用于 ReLU 和其变体（如 Leaky ReLU）。权重从一个正态分布中抽取，均值为0，标准差为 $\sqrt{\frac{2}{n_{in}}}$。
 32 | 2. **Batch Normalization**：
 33 |     - 在每一层之后进行批归一化，使得每一层的激活值在训练过程中保持稳定，从而减小参数初始化对训练的影响。
 34 |     - 通过归一化激活值，Batch Normalization 能够使网络对初始化不那么敏感，并且允许使用较高的学习率，使训练更加高效。
 35 | 3. **Layer Normalization**：
 36 |     - 类似于 Batch Normalization，但是在每一层内对每一个样本的激活值进行归一化，而不是在批次内进行归一化。这在某些情况下可以提供更好的性能和稳定性。
 37 | 4. **权重正则化**：
 38 |     - 使用 L2 正则化（权重衰减）或 L1 正则化可以在训练过程中约束权重的大小，从而减少参数初始化带来的影响。
 39 |     - 这可以防止权重过大或过小，进而稳定训练过程。
 40 | 5. **调整学习率**：
 41 |     - 使用自适应学习率优化器（如 Adam、RMSprop、Adagrad）可以根据梯度的变化自动调整学习率，从而减小参数初始化的不良影响。
 42 |     - 这些优化器可以在训练过程中动态调整每个参数的学习率，使得即使在初始参数不理想的情况下，训练也能逐步收敛。
 43 | 6. **学习率预热（Learning Rate Warmup）**：
 44 |     - 在训练开始时使用较低的学习率，然后逐步增加到预设的学习率。这可以避免在训练初期由于不良初始化导致的梯度爆炸问题。
 45 | 7. **使用预训练模型**：
 46 |     - 在迁移学习中，使用在大规模数据集上预训练的模型作为初始化，可以显著减少参数初始化的影响。
 47 |     - 这些预训练模型通常已经有了较好的一组参数，能够更快地适应新的任务。
 48 | 8. **Dropout**：
 49 |     - 在训练过程中随机丢弃一部分神经元，使得模型不会过于依赖某一组特定的权重，从而减小参数初始化的不良影响。
 50 | 
 51 | 通过以上方法，可以有效地减少参数初始化带来的负面影响，提高神经网络的训练效率和稳定性。
 52 | 
 53 | ## 还有什么要补充的吗
 54 | 当然，还有一些额外的方法和注意事项可以进一步减少参数初始化带来的负面影响，提高神经网络的训练效果：
 55 | 
 56 | 1. **Residual Connections（残差连接）**：
 57 |     - 在深层神经网络中使用残差连接（ResNet）可以帮助缓解梯度消失和梯度爆炸问题。这种方法通过引入短路或跳跃连接，使梯度能够更容易地传播回前面的层。
 58 | 2. **Gradient Clipping（梯度裁剪）**：
 59 |     - 在反向传播过程中，对梯度进行裁剪以防止梯度爆炸。通常是在梯度更新前将梯度值限制在一个预定义的范围内。
 60 | 3. **Learning Rate Scheduling（学习率调度）**：
 61 |     - 动态调整学习率，比如使用学习率衰减策略（如指数衰减、余弦退火、学习率递减等），使得在训练过程中学习率逐步减小，从而保持训练的稳定性。
 62 | 4. **Weight Tying（权重绑定）**：
 63 |     - 在某些模型（如循环神经网络、Transformer等）中，可以共享某些层的权重，从而减少模型参数的数量，减小初始化的不良影响。
 64 | 5. **Adaptive Moment Estimation（Adam优化器）**：
 65 |     - Adam优化器结合了动量和自适应学习率调整的优点，能够在一定程度上缓解参数初始化带来的问题。
 66 | 6. **Warm Restart（热重启）**：
 67 |     - 使用像SGDR（Stochastic Gradient Descent with Warm Restarts）这样的策略，在训练过程中周期性地重置学习率，可以帮助模型跳出局部最优。
 68 | 7. **Weight Decay（权重衰减）**：
 69 |     - 类似于L2正则化，通过在损失函数中引入权重衰减项，防止权重过大，从而稳定训练过程。
 70 | 8. **Ensemble Methods（集成方法）**：
 71 |     - 训练多个模型并对其进行集成，从而减少单个模型初始化带来的不确定性。这种方法可以通过对多个模型的预测进行平均或投票来提高整体性能。
 72 | 9. **Early Stopping（早停）**：
 73 |     - 监控验证集的表现并在性能不再提升时提前停止训练，可以避免过拟合和不必要的计算，同时减少由于初始化不良导致的长时间训练。
 74 | 10. **Data Augmentation（数据增强）**：
 75 |     - 增加训练数据的多样性，使模型在训练过程中看到更多不同的输入，从而减小参数初始化的不良影响。
 76 | 
 77 | 通过结合上述方法，可以进一步减小参数初始化带来的负面影响，提升神经网络的训练效率和性能。同时，在实际应用中，往往需要根据具体问题和数据特点进行调整和优化，以达到最佳效果。
 78 | 
 79 | ## 还有吗
 80 | 
 81 | 当然，还有一些更细致的策略和进阶技巧可以帮助减小参数初始化带来的负面影响，提高神经网络的训练效果和稳定性：
 82 | 
 83 | 1. **Advanced Initialization Techniques（高级初始化技术）**：
 84 |     - **Lecun Initialization**：适用于激活函数为sigmoid或tanh的网络，权重从正态分布中抽取，均值为0，标准差为 $\sqrt{\frac{1}{n_{in}}}$。
 85 |     - **Orthogonal Initialization**：初始化时使用正交矩阵，特别适用于深层神经网络和循环神经网络，可以更好地保持梯度的数值稳定性。
 86 | 2. **Regularization Techniques（正则化技巧）**：
 87 |     - **DropConnect**：类似于Dropout，但应用于权重而不是激活值。通过随机将一部分权重置为零，可以减少过拟合，增强模型的鲁棒性。
 88 | 3. **Advanced Optimization Algorithms（高级优化算法）**：
 89 |     - **LAMB（Layer-wise Adaptive Moments optimizer for Batch training）**：一种适用于大批量训练的优化算法，可以对每一层的参数进行更细粒度的调整，提升训练效果。
 90 |     - **Ranger**：结合了RAdam（Rectified Adam）和Lookahead优化器的优点，能够更稳定地训练深度模型。
 91 | 4. **Gradient Centralization**：
 92 |     - 在梯度更新前，对梯度进行中心化处理（将梯度的均值归零），可以使训练过程更加稳定，提升收敛速度。
 93 | 5. **Cosine Annealing**：
 94 |     - 使用余弦退火调度学习率，可以在训练过程中动态调整学习率，防止陷入局部最优，同时提升模型性能。
 95 | 6. **Cyclic Learning Rates**：
 96 |     - 使用循环学习率策略，周期性地调整学习率，可以帮助模型更好地探索参数空间，提升训练效果。
 97 | 7. **Weight Averaging**：
 98 |     - 在训练过程中，定期对模型参数进行平均（如Stochastic Weight Averaging, SWA），可以提高模型的泛化性能和稳定性。
 99 | 8. **Knowledge Distillation**：
100 |     - 使用一个预训练好的大模型（教师模型）来指导较小模型（学生模型）的训练。通过蒸馏知识，可以帮助学生模型更快地收敛到好的解。
101 | 9. **Curriculum Learning（课程学习）**：
102 |     - 从简单到复杂逐步训练模型，即先用简单的样本训练，然后逐步增加样本的难度。这可以帮助模型更加平滑地找到最优解，减少初始化的影响。
103 | 10. **Mixed Precision Training**：
104 |     - 使用混合精度训练（如16位和32位浮点数混合），不仅可以加快训练速度，还可以在一定程度上稳定训练过程，减少参数初始化带来的影响。
105 | 11. **Hyperparameter Optimization（超参数优化）**：
106 |     - 通过自动化的超参数搜索（如贝叶斯优化、随机搜索、网格搜索），找到最佳的初始化参数和优化策略，提升模型性能。
107 | 12. **Data Rebalancing**：
108 |     - 如果训练数据不平衡，可以使用重采样技术或数据生成技术（如SMOTE）来平衡数据分布，从而提高模型的稳定性和性能。
109 | 
110 | 通过结合这些高级技巧和策略，可以进一步优化神经网络的训练过程，减小参数初始化带来的负面影响，提高模型的性能和稳定性。在实际应用中，根据具体任务和数据的特点，选择和调整合适的方法，将有助于获得最佳的训练效果。
111 | 
112 | ## 正态分布在这里的意义是什么
113 | 在神经网络的参数初始化中，正态分布（也称高斯分布）是一种常用的概率分布，用于生成初始权重。使用正态分布进行参数初始化的意义和优点包括以下几个方面：
114 | 
115 | 1. **中心极限定理**：
116 |     - 中心极限定理表明，当许多独立随机变量的总和趋向于正态分布时，即使这些变量本身不是正态分布的，也会接近正态分布。这使得正态分布成为一种自然的选择，因为它能够很好地近似许多实际情况。
117 | 2. **对称性**：
118 |     - 正态分布是对称的，均值为0的正态分布意味着权重初始化既有正值也有负值，这有助于打破对称性，使得不同神经元能够学习到不同的特征。
119 | 3. **控制方差**：
120 |     - 通过选择合适的标准差，可以很好地控制权重的方差。对于不同的激活函数和网络结构，可以选择不同的标准差以适应具体需求。例如：
121 |         - **Xavier 初始化**：对于 sigmoid 或 tanh 激活函数，权重从均值为0、标准差为 $\sqrt{\frac{2}{n_{in} + n_{out}}}$ 的正态分布中抽取。
122 |         - **He 初始化**：对于 ReLU 激活函数，权重从均值为0、标准差为 $\sqrt{\frac{2}{n_{in}}}$的正态分布中抽取。
123 | 4. **减少梯度消失和爆炸**：
124 |     - 通过选择适当的均值和标准差，可以避免梯度消失和梯度爆炸问题。例如，He 初始化针对 ReLU 激活函数优化了标准差，从而使得在前向和反向传播过程中梯度的数值能够保持稳定。
125 | 5. **数学便利性**：
126 |     - 正态分布具有许多良好的数学性质，如均值和方差可以独立控制，累积分布函数有封闭形式等，这使得它在理论分析和实际计算中都非常便利。
127 | 6. **健壮性**：
128 |     - 在很多情况下，使用正态分布初始化的模型在训练初期表现出较好的鲁棒性和稳定性，能够适应各种不同的网络结构和数据分布。
129 | 
130 | 总之，正态分布在神经网络参数初始化中的意义在于其良好的数学性质、对称性、和对方差的有效控制。这些特性不仅有助于打破对称性，保证权重的多样性，还可以通过合适的参数选择，避免梯度消失和梯度爆炸问题，从而提升神经网络的训练效果和稳定性。
131 | 
132 | *回答：gpt-4o*


--------------------------------------------------------------------------------
/docs/reading.md:
--------------------------------------------------------------------------------
1 | - [年度书单2024](reading/年度书单2024.md)
2 | - [年度书单2023](reading/年度书单2023.md)
3 | - [年度书单2022](reading/年度书单2022.md)


--------------------------------------------------------------------------------
/docs/reading/年度书单2022.md:
--------------------------------------------------------------------------------
 1 | 1. 经济学原理，曼昆
 2 | 2. 房噬，张婕
 3 | 3. 政治学通识，包刚升
 4 | 4. 纳瓦尔宝典
 5 | 5. 分析与思考，黄奇帆
 6 | 6. 控制论与科学方法论，金观涛
 7 | 7. 暗时间，刘未鹏
 8 | 8. 乔布斯传，艾萨克森
 9 | 9. 三国演义
10 | 10. 置身事内，兰小欢
11 | 11. Skin in the game, 塔勒布
12 | 12. 高效能人士的七个习惯，柯维
13 | 13. 江泽民同志在武汉热工所，丁瑞芝
14 | 14. 心若菩提，曹德旺
15 | 15. 白鹿原，陈忠实


--------------------------------------------------------------------------------
/docs/reading/年度书单2023.md:
--------------------------------------------------------------------------------
 1 | 1. Bitcoin Clarity The Complete Beginners Guide to Understanding
 2 | 2. 投资中最简单的事
 3 | 3. 四世同堂
 4 | 4. 小而美：持续盈利的经营法则
 5 | 5. 微信背后的产品观
 6 | 6. The Intelligent Investor
 7 | 7. 埃隆·马斯克传
 8 | 8. 天堂蒜薹之歌
 9 | 9. 中央帝国的财政密码
10 | 10. 经济学的思维方式
11 | 11. 逃不开的经济周期
12 | 12. 国富论
13 | 13. 野草
14 | 14. 呐喊
15 | 15. 货币的教训
16 | 16. 美国四百年
17 | 17. 土地制度与中国发展
18 | 18. 菜场经济学
19 | 19. 毛泽东选集 第一卷
20 | 20. 你所不知道的冰冷经济真相
21 | 21. 坟


--------------------------------------------------------------------------------
/docs/reading/年度书单2024.md:
--------------------------------------------------------------------------------
 1 | 1. 《巴拉吉预言》 埃里克·乔根森
 2 | 2. 《复杂》 梅拉妮·米歇尔
 3 | 3. 《Good Energy》 Casey Means
 4 | 4. 《无穷的开始》 戴维·多伊奇
 5 | 5. 《文明、现代化、价值投资与中国》 李录
 6 | 6. 《邓小平时代》 傅高义，港版
 7 | 7. 《思考，快与慢》 丹尼尔·卡尼曼
 8 | 8. 《穷查理宝典》 彼得·考夫曼
 9 | 9. 《丰乳肥臀》 莫言
10 | 10. 《魏晋南北朝》 川胜义雄
11 | 11. 《咖啡图解小百科》 安妮·卡隆
12 | 12. 《百年中国经济史笔记》 杨小凯
13 | 13. 《我看见的世界》 李飞飞
14 | 14. 《我的几何人生》 丘成桐
15 | 15. 《深奥的简洁》 约翰·格里宾
16 | 16. 《彼得林奇的成功投资》 彼得·林奇 
17 | 17. 《芒格之道》 查理·芒格
18 | 18. 《控糖革命》 杰西·安佐斯佩
19 | 19. 《世界上最简单的会计书》 达雷尔·穆利斯
20 | 20. 《Rich AF》 Vivian Tu
21 | 21. 《堂吉诃德》 塞万提斯
22 | 22. 《晶片战争》 克里斯·弥勒
23 | 23. 《The Bed of Procrustes》 Nassim Nicholas Taleb
24 | 24. 《反脆弱》 纳西姆·尼古拉斯·塔勒布 
25 | 25. 《Excellent Advice for Living》 Kevin Kelly 
26 | 26. 《The Science of the Blockchain》 Roger Wattenhofer
27 | 27. 《Bitcoin: Hard Money You Can't F*ck With》 Jason A. Williams


--------------------------------------------------------------------------------