# AI Engineering Transition Path

Research papers for software engineers transitioning to AI Engineering.

## Tokenization

- [Byte-Pair Encoding](https://arxiv.org/pdf/1508.07909)
- [Byte Latent Transformer: Patches Scale Better Than Tokens](https://arxiv.org/pdf/2412.09871)

## Vectorization

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
- [IMAGEBIND: One Embedding Space To Bind Them All](https://arxiv.org/pdf/2305.05665)
- [SONAR: Sentence-Level Multimodal and Language-Agnostic Representations](https://arxiv.org/pdf/2308.11466)
- [The FAISS Library](https://arxiv.org/pdf/2401.08281)
- [Facebook Large Concept Models](https://arxiv.org/pdf/2412.08821v2)

## Infrastructure

- [TensorFlow](https://arxiv.org/pdf/1605.08695)
- [DeepSeek 3FS Filesystem](https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md)
- [Milvus DB](https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf)
- [Billion-Scale Similarity Search: FAISS](https://arxiv.org/pdf/1702.08734)
- [Ray](https://arxiv.org/abs/1712.05889)

## Core Architecture

- [Attention Is All You Need](https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf) (see the sketch after this list)
- [FlashAttention](https://arxiv.org/pdf/2205.14135)
- [Multi-Query Attention](https://arxiv.org/pdf/1911.02150)
- [Grouped-Query Attention](https://arxiv.org/pdf/2305.13245)
- [Google Titans Outperform Transformers](https://arxiv.org/pdf/2501.00663)
- [VideoRoPE: Rotary Position Embedding](https://arxiv.org/pdf/2502.05173)
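The Core Architecture list above centers on the attention operation from "Attention Is All You Need". Below is a minimal sketch of single-head scaled dot-product attention in plain NumPy, meant only to make the equation softmax(QKᵀ / sqrt(d_k)) · V concrete; it omits masking, multiple heads, and the memory-aware tiling that the FlashAttention papers optimize, and the toy shapes are illustrative assumptions.

```python
# Minimal single-head scaled dot-product attention (illustrative sketch only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (num_queries, num_keys) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                # weighted sum of value vectors

# Toy usage: 4 query positions attending over 6 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # -> (4, 8)
```

Multi-Query and Grouped-Query Attention in the same list change how many key/value projections are shared across heads; the per-head computation stays the one sketched here.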
## Mixture of Experts

- [Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/pdf/1701.06538)
- [GShard](https://arxiv.org/abs/2006.16668)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)

## RLHF

- [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/pdf/1706.03741)
- [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/pdf/1909.08593)
- [Training Language Models to Follow Instructions with Human Feedback (InstructGPT)](https://arxiv.org/pdf/2203.02155)

## Chain of Thought

- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903)
- [Chain of Thought](https://arxiv.org/pdf/2411.14405v1)
- [Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)

## Reasoning

- [Transformer Reasoning Capabilities](https://arxiv.org/pdf/2405.18512)
- [Large Language Monkeys: Scaling Inference Compute with Repeated Sampling](https://arxiv.org/pdf/2407.21787)
- [Scaling Test-Time Compute Can Be More Effective than Scaling Model Parameters](https://arxiv.org/pdf/2408.03314)
- [Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/pdf/2412.06769)
- [DeepSeek R1](https://arxiv.org/pdf/2501.12948v1)
- [A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods](https://arxiv.org/pdf/2502.01618)
- [Latent Reasoning: A Recurrent Depth Approach](https://arxiv.org/pdf/2502.05171)
- [Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo](https://arxiv.org/pdf/2504.13139)

## Optimizations

- [The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits](https://arxiv.org/pdf/2402.17764)
- [FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision](https://arxiv.org/pdf/2407.08608)
- [ByteDance 1.58-bit](https://arxiv.org/pdf/2412.18653v1)
- [Transformer Square](https://arxiv.org/pdf/2501.06252)
- [Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps](https://arxiv.org/pdf/2501.09732)
- [A 1B Model Outperforms a 405B Model with Test-Time Scaling](https://arxiv.org/pdf/2502.06703)
- [Speculative Decoding](https://arxiv.org/pdf/2211.17192)

## Distillation

- [Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531)
- [BYOL - Distilled Architecture](https://arxiv.org/pdf/2006.07733)
- [DINO](https://arxiv.org/pdf/2104.14294)

## SSMs

- [RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/pdf/2305.13048)
- [Mamba](https://arxiv.org/pdf/2312.00752)
- [Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://arxiv.org/pdf/2405.21060)
- [Distilling Transformers to SSMs](https://arxiv.org/pdf/2408.10189)
- [LoLCATs: On Low-Rank Linearizing of Large Language Models](https://arxiv.org/pdf/2410.10254)
- [Think Slow, Fast](https://arxiv.org/pdf/2502.20339)

## Competition Models

- [Google Math Olympiad 2](https://arxiv.org/pdf/2502.03544)
- [Competitive Programming with Large Reasoning Models](https://arxiv.org/pdf/2502.06807)
- [Google Math Olympiad 1](https://www.nature.com/articles/s41586-023-06747-5)

## Hype Makers

- [Can AI Be Made to Think Critically?](https://arxiv.org/pdf/2501.04682)
- [Evolving Deeper LLM Thinking](https://arxiv.org/pdf/2501.09891)
- [LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters](https://arxiv.org/pdf/2502.07374)

## Hype Breakers

- [Separating Communication from Intelligence](https://arxiv.org/pdf/2301.06627)
- [Language Is Not Intelligence](https://gwern.net/doc/psychology/linguistics/2024-fedorenko.pdf)

## Image Transformers

- [An Image Is Worth 16x16 Words](https://arxiv.org/pdf/2010.11929)
- [CLIP](https://arxiv.org/pdf/2103.00020)
- [DeepSeek Image Generation](https://arxiv.org/pdf/2501.17811)

## Video Transformers

- [ViViT: A Video Vision Transformer](https://arxiv.org/pdf/2103.15691)
- [Joint Embedding Abstractions with Self-Supervised Video Masks](https://arxiv.org/pdf/2404.08471)
- [Facebook VideoJAM AI Video Generation](https://arxiv.org/pdf/2502.02492)

## Case Studies

- [Automated Unit Test Improvement Using Large Language Models at Meta](https://arxiv.org/pdf/2402.09171)
- [Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering](https://arxiv.org/pdf/2404.17723v1) (see the retrieval sketch after this list)
- [OpenAI o1 System Card](https://arxiv.org/pdf/2412.16720)
- [LLM-Powered Bug Catchers](https://arxiv.org/pdf/2501.12862)
- [Chain-of-Retrieval Augmented Generation](https://arxiv.org/pdf/2501.14342)
- [Swiggy Search](https://bytes.swiggy.com/improving-search-relevance-in-hyperlocal-food-delivery-using-small-language-models-ecda2acc24e6)
- [Swarm by OpenAI](https://github.com/openai/swarm)
- [Netflix Foundation Models](https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39)
- [Model Context Protocol](https://www.anthropic.com/news/model-context-protocol)
- [Uber QueryGPT](https://www.uber.com/en-IN/blog/query-gpt/)
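Several of the case studies above (the knowledge-graph RAG and Chain-of-Retrieval papers in particular) rest on the embed-and-retrieve step that the Vectorization and FAISS papers formalize. The sketch below shows that step with plain NumPy cosine similarity; `embed()` is a hypothetical stand-in for a real sentence encoder, and the document list stands in for an index such as FAISS or Milvus.

```python
# Toy retrieval step of a RAG pipeline (illustrative sketch; not a real encoder or index).
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedding: a deterministic pseudo-random unit vector per text.
    A real system would call a trained encoder (for example, a BERT-style model)."""
    seed = abs(hash(text)) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding and return the top k."""
    q = embed(query)
    scores = np.array([q @ embed(doc) for doc in documents])  # dot product of unit vectors = cosine similarity
    best = np.argsort(scores)[::-1][:k]                       # indices of the k highest-scoring documents
    return [documents[i] for i in best]

documents = [
    "Refunds are processed within five business days.",
    "Delivery partners are assigned based on proximity to the restaurant.",
    "Search relevance is improved with a small language model for query understanding.",
]
print(retrieve_top_k("How long do refunds take?", documents))
```

The retrieved passages are then placed into the generator's prompt, which is the half of the pipeline that the RAG papers above evaluate and extend.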
## Video Course

AI Engineering: https://interviewready.io/course-page/ai-engineering