# AI Engineering Transition Path

Research papers for software engineers transitioning to AI Engineering.

## Tokenization

- [Byte-Pair Encoding](https://arxiv.org/pdf/1508.07909)
- [Byte Latent Transformer: Patches Scale Better Than Tokens](https://arxiv.org/pdf/2412.09871)

## Vectorization

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805)
- [IMAGEBIND: One Embedding Space To Bind Them All](https://arxiv.org/pdf/2305.05665)
- [SONAR: Sentence-Level Multimodal and Language-Agnostic Representations](https://arxiv.org/pdf/2308.11466)
- [The FAISS Library](https://arxiv.org/pdf/2401.08281)
- [Facebook Large Concept Models](https://arxiv.org/pdf/2412.08821v2)

## Infrastructure

- [TensorFlow](https://arxiv.org/pdf/1605.08695)
- [DeepSeek 3FS Filesystem](https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md)
- [Milvus DB](https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf)
- [Billion-Scale Similarity Search: FAISS](https://arxiv.org/pdf/1702.08734)
- [Ray](https://arxiv.org/abs/1712.05889)

## Core Architecture

- [Attention Is All You Need](https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf) (see the sketch after this list)
- [FlashAttention](https://arxiv.org/pdf/2205.14135)
- [Multi-Query Attention](https://arxiv.org/pdf/1911.02150)
- [Grouped-Query Attention](https://arxiv.org/pdf/2305.13245)
- [Google Titans Outperform Transformers](https://arxiv.org/pdf/2501.00663)
- [VideoRoPE: Rotary Position Embedding](https://arxiv.org/pdf/2502.05173)
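The Core Architecture list above centers on the attention operation from "Attention Is All You Need". Below is a minimal sketch of single-head scaled dot-product attention in plain NumPy, meant only to make the equation softmax(QKᵀ / sqrt(d_k)) · V concrete; it omits masking, multiple heads, and the memory-aware tiling that the FlashAttention papers optimize, and the toy shapes are illustrative assumptions.

```python
# Minimal single-head scaled dot-product attention (illustrative sketch only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (num_queries, num_keys) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax -> attention weights
    return weights @ V                                # weighted sum of value vectors

# Toy usage: 4 query positions attending over 6 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # -> (4, 8)
```

Multi-Query and Grouped-Query Attention in the same list change how many key/value projections are shared across heads; the per-head computation stays the one sketched here.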
## Mixture of Experts

- [Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/pdf/1701.06538)
- [GShard](https://arxiv.org/abs/2006.16668)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)

## RLHF

- [Deep Reinforcement Learning from Human Preferences](https://arxiv.org/pdf/1706.03741)
- [Fine-Tuning Language Models from Human Preferences](https://arxiv.org/pdf/1909.08593)
- [Training Language Models to Follow Instructions with Human Feedback (InstructGPT)](https://arxiv.org/pdf/2203.02155)

## Chain of Thought

- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903)
- [Chain of Thought](https://arxiv.org/pdf/2411.14405v1)
- [Demystifying Long Chain-of-Thought Reasoning in LLMs](https://arxiv.org/pdf/2502.03373)

## Reasoning

- [Transformer Reasoning Capabilities](https://arxiv.org/pdf/2405.18512)
- [Large Language Monkeys: Scaling Inference Compute with Repeated Sampling](https://arxiv.org/pdf/2407.21787)
- [Scaling Test-Time Compute Can Be More Effective than Scaling Model Parameters](https://arxiv.org/pdf/2408.03314)
- [Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/pdf/2412.06769)
- [DeepSeek R1](https://arxiv.org/pdf/2501.12948v1)
- [A Probabilistic Inference Approach to Inference-Time Scaling of LLMs Using Particle-Based Monte Carlo Methods](https://arxiv.org/pdf/2502.01618)
- [Latent Reasoning: A Recurrent Depth Approach](https://arxiv.org/pdf/2502.05171)
- [Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo](https://arxiv.org/pdf/2504.13139)

## Optimizations

- [The Era of 1-bit LLMs: All Large Language Models Are in 1.58 Bits](https://arxiv.org/pdf/2402.17764)
- [FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision](https://arxiv.org/pdf/2407.08608)
- [ByteDance 1.58-bit](https://arxiv.org/pdf/2412.18653v1)
- [Transformer Square](https://arxiv.org/pdf/2501.06252)
- [Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps](https://arxiv.org/pdf/2501.09732)
- [A 1B Model Outperforms a 405B Model with Test-Time Scaling](https://arxiv.org/pdf/2502.06703)
- [Speculative Decoding](https://arxiv.org/pdf/2211.17192)

## Distillation

- [Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531)
- [BYOL - Distilled Architecture](https://arxiv.org/pdf/2006.07733)
- [DINO](https://arxiv.org/pdf/2104.14294)

## SSMs

- [RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/pdf/2305.13048)
- [Mamba](https://arxiv.org/pdf/2312.00752)
- [Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://arxiv.org/pdf/2405.21060)
- [Distilling Transformers to SSMs](https://arxiv.org/pdf/2408.10189)
- [LoLCATs: On Low-Rank Linearizing of Large Language Models](https://arxiv.org/pdf/2410.10254)
- [Think Slow, Fast](https://arxiv.org/pdf/2502.20339)

## Competition Models

- [Google Math Olympiad 2](https://arxiv.org/pdf/2502.03544)
- [Competitive Programming with Large Reasoning Models](https://arxiv.org/pdf/2502.06807)
- [Google Math Olympiad 1](https://www.nature.com/articles/s41586-023-06747-5)

## Hype Makers

- [Can AI Be Made to Think Critically?](https://arxiv.org/pdf/2501.04682)
- [Evolving Deeper LLM Thinking](https://arxiv.org/pdf/2501.09891)
- [LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters](https://arxiv.org/pdf/2502.07374)

## Hype Breakers

- [Separating Communication from Intelligence](https://arxiv.org/pdf/2301.06627)
- [Language Is Not Intelligence](https://gwern.net/doc/psychology/linguistics/2024-fedorenko.pdf)

## Image Transformers

- [An Image Is Worth 16x16 Words](https://arxiv.org/pdf/2010.11929)
- [CLIP](https://arxiv.org/pdf/2103.00020)
- [DeepSeek Image Generation](https://arxiv.org/pdf/2501.17811)

## Video Transformers

- [ViViT: A Video Vision Transformer](https://arxiv.org/pdf/2103.15691)
- [Joint Embedding Abstractions with Self-Supervised Video Masks](https://arxiv.org/pdf/2404.08471)
- [Facebook VideoJAM AI Video Generation](https://arxiv.org/pdf/2502.02492)

## Case Studies

- [Automated Unit Test Improvement Using Large Language Models at Meta](https://arxiv.org/pdf/2402.09171)
- [Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering](https://arxiv.org/pdf/2404.17723v1) (see the retrieval sketch after this list)
- [OpenAI o1 System Card](https://arxiv.org/pdf/2412.16720)
- [LLM-Powered Bug Catchers](https://arxiv.org/pdf/2501.12862)
- [Chain-of-Retrieval Augmented Generation](https://arxiv.org/pdf/2501.14342)
- [Swiggy Search](https://bytes.swiggy.com/improving-search-relevance-in-hyperlocal-food-delivery-using-small-language-models-ecda2acc24e6)
- [Swarm by OpenAI](https://github.com/openai/swarm)
- [Netflix Foundation Models](https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39)
- [Model Context Protocol](https://www.anthropic.com/news/model-context-protocol)
- [Uber QueryGPT](https://www.uber.com/en-IN/blog/query-gpt/)
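Several of the case studies above (the knowledge-graph RAG and Chain-of-Retrieval papers in particular) rest on the embed-and-retrieve step that the Vectorization and FAISS papers formalize. The sketch below shows that step with plain NumPy cosine similarity; `embed()` is a hypothetical stand-in for a real sentence encoder, and the document list stands in for an index such as FAISS or Milvus.

```python
# Toy retrieval step of a RAG pipeline (illustrative sketch; not a real encoder or index).
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedding: a deterministic pseudo-random unit vector per text.
    A real system would call a trained encoder (for example, a BERT-style model)."""
    seed = abs(hash(text)) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query embedding and return the top k."""
    q = embed(query)
    scores = np.array([q @ embed(doc) for doc in documents])  # dot product of unit vectors = cosine similarity
    best = np.argsort(scores)[::-1][:k]                       # indices of the k highest-scoring documents
    return [documents[i] for i in best]

documents = [
    "Refunds are processed within five business days.",
    "Delivery partners are assigned based on proximity to the restaurant.",
    "Search relevance is improved with a small language model for query understanding.",
]
print(retrieve_top_k("How long do refunds take?", documents))
```

The retrieved passages are then placed into the generator's prompt, which is the half of the pipeline that the RAG papers above evaluate and extend.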
## Video Course

AI Engineering: https://interviewready.io/course-page/ai-engineering