# Reference Materials for Generative AI System Design Interview

## Chapter 1: Introduction and Overview

[1] Machine Learning System Design Interview. https://www.aliaminian.com/books.
[2] Support Vector Machines. https://scikit-learn.org/stable/modules/svm.html.
[3] Bayes’ theorem. https://en.wikipedia.org/wiki/Bayes%27_theorem.
[4] Gaussian mixture models. https://scikit-learn.org/1.5/modules/mixture.html.
[5] Hidden Markov model. https://en.wikipedia.org/wiki/Hidden_Markov_model.
[6] Boltzmann machine. https://en.wikipedia.org/wiki/Boltzmann_machine.
[7] OpenAI’s ChatGPT. https://openai.com/index/chatgpt/.
[8] Economic Potential of Generative AI. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier.
[9] The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783.
[10] Flamingo: a Visual Language Model for Few-Shot Learning. https://arxiv.org/abs/2204.14198.
[11] PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311.
[12] Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165.
[13] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://arxiv.org/abs/2205.11487.
[14] PaLM 2 Technical Report. https://arxiv.org/abs/2305.10403.
[15] H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/.
[16] GPT-4 training cost. https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/.
[17] Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361.
[18] Training Compute-Optimal Large Language Models. https://arxiv.org/abs/2203.15556.
[19] Introducing OpenAI o1. https://openai.com/index/introducing-openai-o1-preview/.
[20] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. https://arxiv.org/abs/2407.21787.
[21] ETL. https://aws.amazon.com/what-is/etl/.
[22] Tecton. https://www.tecton.ai/feature-store/.
[23] Amazon SageMaker. https://aws.amazon.com/sagemaker/.
[24] ML System Design Interview. https://www.amazon.com/gp/product/1736049127/.
[25] Comprehensive Exploration of Synthetic Data Generation: A Survey. https://arxiv.org/abs/2401.02524.
[26] HDFS Architecture Guide. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
[27] Amazon S3. https://aws.amazon.com/s3/.
[28] Apache Parquet. https://parquet.apache.org/.
[29] Apache ORC. https://orc.apache.org/docs/.
[30] Apache Lucene. https://lucene.apache.org/.
[31] Elasticsearch. https://www.elastic.co/elasticsearch.
[32] Attention Is All You Need. https://arxiv.org/abs/1706.03762.
[33] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805.
[34] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929.
[35] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
[36] Zero-Shot Text-to-Image Generation. https://arxiv.org/abs/2102.12092.
[37] Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473.
[38] Common Crawl. https://commoncrawl.org/.
[39] Data Parallelism. https://en.wikipedia.org/wiki/Data_parallelism.
[40] Model Parallelism. https://huggingface.co/docs/transformers/v4.15.0/en/parallelism.
[41] Pipeline Parallelism. https://pytorch.org/docs/stable/distributed.pipelining.html.
[42] Mixed Precision Training. https://arxiv.org/abs/1710.03740.
[43] High-Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752.
[44] Training Deep Nets with Sublinear Memory Cost. https://arxiv.org/abs/1604.06174.
[45] Automatic Mixed Precision. https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html.
[46] Model Parallelism. https://huggingface.co/docs/transformers/v4.17.0/en/parallelism.
[47] Paradigms of Parallelism. https://colossalai.org/docs/concepts/paradigms_of_parallelism/.
[48] Tensor Parallelism Tutorial. https://pytorch.org/tutorials/intermediate/TP_tutorial.html.
[49] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. https://arxiv.org/abs/1910.02054.
[50] Introducing PyTorch Fully Sharded Data Parallel (FSDP) API. https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/.
[51] Beam Search. https://en.wikipedia.org/wiki/Beam_search.
[52] Top-k Sampling. https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p.
[53] Model Monitoring for ML in Production. https://www.evidentlyai.com/ml-in-production/model-monitoring.

---

## Chapter 2: Gmail Smart Compose

[1] Gmail’s Smart Compose Feature. https://research.google/pubs/gmail-smart-compose-real-time-assisted-writing/.
[2] Fundamentals of Recurrent Neural Network. https://arxiv.org/abs/1808.03314.
[3] Attention Is All You Need. https://arxiv.org/abs/1706.03762.
[4] Gated Recurrent Unit. https://en.wikipedia.org/wiki/Gated_recurrent_unit.
[5] Long Short-Term Memory. https://deeplearning.cs.cmu.edu/F23/document/readings/LSTM.pdf.
[6] RITA: Group Attention is All You Need for Timeseries Analytics. https://arxiv.org/abs/2306.01926.
[7] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135.
[8] Language Identification. https://en.wikipedia.org/wiki/Language_identification.
[9] FastText Model for Language Identification. https://huggingface.co/facebook/fasttext-language-identification.
[10] Transformer-XL. https://arxiv.org/abs/1901.02860.
[11] Byte-Pair Encoding Tokenization. https://huggingface.co/learn/nlp-course/en/chapter6/5.
[12] SentencePiece Tokenization. https://arxiv.org/abs/1808.06226.
[13] Tiktoken Library. https://github.com/openai/tiktoken.
[14] Google’s Gemini. https://gemini.google.com/.
[15] SentencePiece Library. https://github.com/google/sentencepiece.
[16] Summary of Tokenizers. https://huggingface.co/docs/transformers/en/tokenizer_summary.
[17] OpenAI’s Tokenizers. https://tiktokenizer.vercel.app/?model=gpt-4-1106-preview.
[18] BERT. https://arxiv.org/abs/1810.04805.
[19] OpenAI’s Models. https://platform.openai.com/docs/models.
[20] Meta’s Llama. https://llama.meta.com/.
[21] Introduction to Transformers by Andrej Karpathy. https://www.youtube.com/watch?v=XfpMkf4rD6E.
[22] Transformer Visualized. https://jalammar.github.io/illustrated-transformer/.
[23] Common Crawl. https://commoncrawl.org/.
[24] Cross-Entropy. https://en.wikipedia.org/wiki/Cross-entropy.
[25] Prompt Engineering. https://platform.openai.com/docs/guides/prompt-engineering.
[26] Beam Search. https://en.wikipedia.org/wiki/Beam_search.
[27] Perplexity. https://en.wikipedia.org/wiki/Perplexity.
[28] Gmail Smart Compose: Real-Time Assisted Writing. https://arxiv.org/abs/1906.00080.
[29] WordPiece Tokenization. https://huggingface.co/learn/nlp-course/en/chapter6/6.
[30] Better & Faster Large Language Models via Multi-token Prediction. https://arxiv.org/abs/2404.19737.

---

## Chapter 3: Google Translate

[1] Google Translate Service. https://blog.google/products/translate/google-translate-new-languages-2024/.
[2] Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473.
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805.
[4] GPT Models. https://platform.openai.com/docs/models.
[5] Claude Models. https://www.anthropic.com/claude.
[6] Bidirectional Long Short-Term Memory (BLSTM) Neural Networks for Reconstruction of Top-Quark Pair Decay Kinematics. https://arxiv.org/abs/1909.01144.
[7] BPE Tokenization. https://huggingface.co/learn/nlp-course/en/chapter6/5.
[8] C4 Dataset. https://www.tensorflow.org/datasets/catalog/c4.
[9] Wikipedia Dataset. https://www.tensorflow.org/datasets/catalog/wikipedia.
[10] Stack Exchange Dataset. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
[11] How Transformers Work. https://huggingface.co/learn/nlp-course/en/chapter1/4.
[12] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://arxiv.org/pdf/1910.10683.pdf.
[13] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. https://arxiv.org/abs/1910.13461.
[14] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. https://arxiv.org/abs/2010.11934.
[15] Multilingual Denoising Pre-training for Neural Machine Translation. https://arxiv.org/abs/2001.08210.
[16] BLEU Metric. https://en.wikipedia.org/wiki/BLEU.
[17] ROUGE Metric. https://en.wikipedia.org/wiki/ROUGE_(metric).
[18] METEOR Metric. https://www.cs.cmu.edu/~alavie/METEOR/pdf/Banerjee-Lavie-2005-METEOR.pdf.
[19] WordNet. https://wordnet.princeton.edu/.
[20] No Language Left Behind: Scaling Human-Centered Machine Translation. https://research.facebook.com/publications/no-language-left-behind/.
[21] Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder. https://arxiv.org/abs/2304.04052.
[22] Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution. https://arxiv.org/abs/2103.06799.
[23] Efficient Inference for Neural Machine Translation. https://arxiv.org/abs/2010.02416.
[24] Meta’s Multilingual Model. https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/.
[25] Machine Translation Evaluation. https://en.wikipedia.org/wiki/Evaluation_of_machine_translation.
[26] Word Error Rate (WER) Metric. https://en.wikipedia.org/wiki/Word_error_rate.
[27] Automatic Language Identification Using Deep Neural Networks. https://research.google.com/pubs/archive/42538.pdf.

---

## Chapter 4: ChatGPT: Personal Assistant Chatbot

[1] OpenAI’s ChatGPT. https://openai.com/index/chatgpt/.
[2] ChatGPT Wiki. https://en.wikipedia.org/wiki/ChatGPT.
[3] OpenAI’s Models. https://platform.openai.com/docs/models.
[4] Google’s Gemini. https://gemini.google.com/.
[5] Meta’s Llama. https://llama.meta.com/.
[6] Beautiful Soup. https://beautiful-soup-4.readthedocs.io/en/latest/.
[7] lxml. https://lxml.de/.
[8] Document Object Model. https://en.wikipedia.org/wiki/Document_Object_Model.
[9] Boilerplate Removal Tool. https://github.com/miso-belica/jusText.
[10] fastText. https://fasttext.cc/.
[11] langid. https://github.com/saffsd/langid.py.
[12] RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864.
[13] Llama 3 Human Evaluation. https://github.com/meta-llama/llama3/blob/main/eval_details.md.
[14] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://arxiv.org/abs/1910.10683.
[15] DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://arxiv.org/abs/2006.03654.
[16] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. https://arxiv.org/abs/2006.16236.
[17] Common Crawl. https://commoncrawl.org/.
[18] C4 Dataset. https://www.tensorflow.org/datasets/catalog/c4.
[19] Stack Exchange Dataset. https://github.com/EleutherAI/stackexchange-dataset.
[20] Training Language Models to Follow Instructions with Human Feedback. https://arxiv.org/abs/2203.02155.
[21] Alpaca. https://crfm.stanford.edu/2023/03/13/alpaca.html.
[22] Dolly-15K. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
[23] Introducing FLAN: More Generalizable Language Models with Instruction Fine-Tuning. https://research.google/blog/introducing-flan-more-generalizable-language-models-with-instruction-fine-tuning/.
[24] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. https://arxiv.org/abs/2204.05862.
[25] Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347.
[26] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. https://arxiv.org/abs/2305.18290.
[27] Illustrating RLHF. https://huggingface.co/blog/rlhf.
[28] RLHF Progress and Challenges. https://www.youtube.com/watch?v=hhiLw5Q_UFg.
[29] State of GPT. https://www.youtube.com/watch?v=bZQun8Y4L2A.
[30] Different Sampling Methods. https://huggingface.co/blog/how-to-generate.
[31] The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751.
[32] OpenAI’s API Reference. https://platform.openai.com/docs/api-reference/chat/create.
[33] Cheat Sheet: Mastering Temperature and Top_p in ChatGPT API. https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683.
[34] PIQA: Reasoning about Physical Commonsense in Natural Language. https://arxiv.org/abs/1911.11641.
[35] SocialIQA: Commonsense Reasoning about Social Interactions. https://arxiv.org/abs/1904.09728.
[36] HellaSwag: Can a Machine Really Finish Your Sentence? https://arxiv.org/abs/1905.07830.
[37] WinoGrande: An Adversarial Winograd Schema Challenge at Scale. https://arxiv.org/abs/1907.10641.
[38] Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. https://arxiv.org/abs/1809.02789.
[39] CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. https://arxiv.org/abs/1811.00937.
[40] TriviaQA: A Large Scale Dataset for Reading Comprehension and Question Answering. https://nlp.cs.washington.edu/triviaqa/.
[41] The Natural Questions Dataset. https://ai.google.com/research/NaturalQuestions.
[42] SQuAD: 100,000+ Questions for Machine Comprehension of Text. https://arxiv.org/abs/1606.05250.
[43] QuAC Dataset. https://quac.ai/.
[44] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. https://arxiv.org/abs/1905.10044.
[45] GSM8K Dataset. https://github.com/openai/grade-school-math.
[46] MATH Dataset. https://github.com/hendrycks/math/.
[47] HumanEval Dataset. https://github.com/openai/human-eval.
[48] MBPP Dataset. https://github.com/google-research/google-research/tree/master/mbpp.
[49] Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300.
[50] Measuring Massive Multitask Language Understanding. https://arxiv.org/abs/2009.03300.
[51] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. https://arxiv.org/abs/2304.06364.
[52] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://arxiv.org/abs/2009.11462.
[53] Perspective API. https://perspectiveapi.com/.
[54] ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. https://arxiv.org/abs/2203.09509.
[55] HateCheck: Functional Tests for Hate Speech Detection Models. https://arxiv.org/abs/2012.15606.
[56] CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. https://arxiv.org/abs/2010.00133.
[57] BBQ: A Hand-Built Bias Benchmark for Question Answering. https://arxiv.org/abs/2110.08193.
[58] BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. https://arxiv.org/abs/2101.11718.
[59] TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://arxiv.org/abs/2109.07958.
[60] Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. https://arxiv.org/abs/1911.00841.
[61] AdvGLUE Benchmark. https://adversarialglue.github.io/.
[62] Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. https://arxiv.org/abs/1907.11932.
[63] AdvBench. https://github.com/llm-attacks/llm-attacks.
[64] Chatbot Arena Leaderboard. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard.
[65] A Survey on Recent Advances in LLM-Based Multi-Turn Dialogue Systems. https://arxiv.org/abs/2402.18013.
[66] Better & Faster Large Language Models via Multi-Token Prediction. https://arxiv.org/abs/2404.19737.
[67] Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. https://arxiv.org/abs/2403.05530.
[68] HyperAttention: Long-Context Attention in Near-Linear Time. https://arxiv.org/abs/2310.05869.
[69] MM-LLMs: Recent Advances in Multimodal Large Language Models. https://arxiv.org/abs/2401.13601.
[70] Multimodality and Large Multimodal Models. https://huyenchip.com/2023/10/10/multimodal.html.
[71] What is Retrieval-Augmented Generation? https://cloud.google.com/use-cases/retrieval-augmented-generation.
[72] How to Customize an LLM: A Deep Dive to Tailoring an LLM for Your Business. https://techcommunity.microsoft.com/blog/machinelearningblog/how-to-customize-an-llm-a-deep-dive-to-tailoring-an-llm-for-your-business/4110204.
[73] Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/abs/2307.09288.
[74] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. https://arxiv.org/abs/2209.07858.
[75] Introducing Superalignment. https://openai.com/index/introducing-superalignment/.
[76] Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165.
[77] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245.
[78] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903.
[79] Efficiently Scaling Transformer Inference. https://arxiv.org/abs/2211.05102.
[80] Prover-Verifier Games Improve Legibility of Language Model Outputs. https://openai.com/index/prover-verifier-games-improve-legibility/.

---

## Chapter 5: Image Captioning

[1] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. https://arxiv.org/abs/2301.12597.
[2] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. https://arxiv.org/abs/2408.08872.
[3] InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. https://arxiv.org/abs/2312.14238.
[4] Meta’s Llama. https://llama.meta.com/.
[5] Byte-Pair Encoding Tokenization. https://huggingface.co/learn/nlp-course/en/chapter6/5.
[6] LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. https://arxiv.org/abs/2210.08402.
[7] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929.
[8] Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[9] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
[10] Cross-Entropy. https://en.wikipedia.org/wiki/Cross-entropy.
[11] CIDEr: Consensus-Based Image Description Evaluation. https://arxiv.org/abs/1411.5726.
[12] TF-IDF Introduction. https://web.stanford.edu/class/cs276/19handouts/lecture6-tfidf-1per.pdf.
[13] TF-IDF. https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
[14] Visual Question Answering Introduction. https://huggingface.co/tasks/visual-question-answering.
[15] Cross-Domain Image Captioning with Discriminative Finetuning. https://arxiv.org/abs/2304.01662.
[16] Crossmodal-3600: Multilingual Reference Captions for Geographically Diverse Images. https://research.google/blog/crossmodal-3600-multilingual-reference-captions-for-geographically-diverse-images/.
[17] Efficient Image Captioning for Edge Devices. https://arxiv.org/abs/2212.08985.
[18] Ensemble Model Using an Image Captioning and Ranking Example. https://cloud.google.com/dataflow/docs/notebooks/run_inference_multi_model.

---

## Chapter 6: Retrieval-Augmented Generation

[1] Perplexity. https://www.perplexity.ai/.
[2] ChatPDF. https://www.chatpdf.com/.
[3] LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685.
[4] Optical Character Recognition. https://en.wikipedia.org/wiki/Optical_character_recognition.
[5] Dedoc GitHub Repository. https://github.com/ispras/dedoc.
[6] LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. https://arxiv.org/abs/2103.15348.
[7] Google Cloud Document Parser API. https://cloud.google.com/document-ai/docs/layout-parse-chunk.
[8] PDF.co Document Parser API. https://developer.pdf.co/api/document-parser/index.html.
[9] Character Text Splitter in LangChain. https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/character_text_splitter/.
[10] Elasticsearch. https://www.elastic.co/elasticsearch.
[11] A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. https://ieeexplore.ieee.org/document/9416312.
[12] Christopher D. Manning. Introduction to Information Retrieval. 2008.
[13] Modern Information Retrieval: A Brief Overview. http://singhal.info/ieee2001.pdf.
[14] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
[15] OpenAI Fine-Tuning Documentation. https://platform.openai.com/docs/guides/fine-tuning.
[16] Anthropic Fine-Tuning. https://www.anthropic.com/news/fine-tune-claude-3-haiku.
[17] RAFT: Adapting Language Model to Domain Specific RAG. https://arxiv.org/abs/2403.10131.
[18] Euclidean Distance. https://en.wikipedia.org/wiki/Euclidean_distance.
[19] Cosine Similarity. https://en.wikipedia.org/wiki/Cosine_similarity.
[20] Multidimensional Binary Search Trees Used for Associative Searching. https://dl.acm.org/doi/10.1145/361002.361007.
[21] R-trees: A Dynamic Index Structure for Spatial Searching. https://dl.acm.org/doi/10.1145/971697.602266.
[22] Annoy Library. https://github.com/spotify/annoy.
[23] Similarity Search in High Dimensions via Hashing. https://www.cs.princeton.edu/courses/archive/spring13/cos598C/Gionis.pdf.
[24] Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. https://arxiv.org/abs/1603.09320.
[25] Faiss Documentation. https://faiss.ai/.
[26] ScaNN. https://research.google/blog/announcing-scann-efficient-vector-similarity-search/.
[27] Developer Playground. https://docs.cohere.com/v2/docs/playground-overview.
[28] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903.
[29] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. https://arxiv.org/abs/2305.10601.
[30] OpenAI o1. https://openai.com/index/learning-to-reason-with-llms/.
[31] Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters. https://arxiv.org/abs/2408.03314.
[32] Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165.
[33] Machine Learning System Design Interview. https://www.aliaminian.com/books.
[34] Evaluation Measures for Information Retrieval. https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval).
[35] Ragas. https://docs.ragas.io/en/stable/.
[36] ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. https://arxiv.org/abs/2311.09476.
[37] Query2doc: Query Expansion with Large Language Models. https://arxiv.org/abs/2303.07678.
[38] TableNet: Deep Learning Model for End-to-End Table Detection and Tabular Data Extraction from Scanned Document Images. https://arxiv.org/abs/2001.01469.
[39] CascadeTabNet: An Approach for End to End Table Detection and Structure Recognition from Image-Based Documents. https://arxiv.org/abs/2004.12629.
[40] DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. https://ieeexplore.ieee.org/document/8270123.
[41] Active Retrieval Augmented Generation. https://arxiv.org/abs/2305.06983.
[42] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. https://arxiv.org/abs/2310.11511.
[43] Precise Zero-Shot Dense Retrieval without Relevance Labels. https://arxiv.org/abs/2212.10496.

---

## Chapter 7: Realistic Face Generation

[1] StyleGAN2. https://arxiv.org/abs/1912.04958.
[2] Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114.
[3] Generative Adversarial Networks. https://arxiv.org/abs/1406.2661.
[4] Combating Mode Collapse in GAN Training: An Empirical Analysis Using Hessian Eigenvalues. https://arxiv.org/abs/2012.09673.
[5] Google’s GAN Course. https://developers.google.com/machine-learning/gan/training.
[6] StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. https://arxiv.org/abs/1612.03242.
[7] Zero-Shot Text-to-Image Generation. https://arxiv.org/abs/2102.12092.
[8] Muse: Text-To-Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
[9] DALL·E 3. https://openai.com/index/dall-e-3/.
[10] Attribute-Specific Control Units in StyleGAN for Fine-Grained Image Manipulation. https://arxiv.org/abs/2111.13010.
[11] A Guide to Convolution Arithmetic for Deep Learning. https://arxiv.org/abs/1603.07285.
[12] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. https://arxiv.org/abs/1502.03167.
[13] Layer Normalization. https://arxiv.org/abs/1607.06450.
[14] Instance Normalization: The Missing Ingredient for Fast Stylization. https://arxiv.org/abs/1607.08022.
[15] Group Normalization. https://arxiv.org/abs/1803.08494.
[16] Deep Learning using Rectified Linear Units (ReLU). https://arxiv.org/abs/1803.08375.
[17] PyTorch’s Tanh Layer. https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html.
[18] A Style-Based Generator Architecture for Generative Adversarial Networks. https://arxiv.org/abs/1812.04948.
[19] Minimax. https://en.wikipedia.org/wiki/Minimax.
[20] Loss Functions in GANs. https://developers.google.com/machine-learning/gan/loss.
[21] Towards Principled Methods for Training Generative Adversarial Networks. https://arxiv.org/abs/1701.04862.
[22] Unrolled Generative Adversarial Networks. https://arxiv.org/abs/1611.02163.
[23] Stabilizing Training of Generative Adversarial Networks through Regularization. https://arxiv.org/abs/1705.09367.
[24] Megapixel Size Image Creation using Generative Adversarial Networks. https://arxiv.org/abs/1706.00082v1.
[25] Inception Score. https://en.wikipedia.org/wiki/Inception_score.
[26] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. https://arxiv.org/abs/1706.08500.
[27] Demystifying MMD GANs. https://arxiv.org/abs/1801.01401.
[28] Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567.
[29] FID Calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance.
[30] The Role of ImageNet Classes in Fréchet Inception Distance. https://arxiv.org/abs/2203.06026.
[31] Hierarchical Text-Conditional Image Generation with CLIP Latents. https://arxiv.org/abs/2204.06125.
[32] Alias-Free Generative Adversarial Networks. https://arxiv.org/abs/2106.12423.
[33] StyleGAN3. https://nvlabs.github.io/stylegan3/.
[34] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. https://arxiv.org/abs/1511.06434.
[35] Wasserstein GAN. https://arxiv.org/abs/1701.07875.
[36] Stabilizing Generative Adversarial Networks: A Survey. https://arxiv.org/abs/1910.00927.
[37] Conditional Generative Adversarial Nets. https://arxiv.org/abs/1411.1784.
[38] CLIPScore: A Reference-free Evaluation Metric for Image Captioning. https://arxiv.org/abs/2104.08718.
[39] DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. https://arxiv.org/abs/2208.12242.

---

## Chapter 8: High-Resolution Image Synthesis

[1] Taming Transformers for High-Resolution Image Synthesis. https://arxiv.org/abs/2012.09841.
[2] Neural Discrete Representation Learning. https://arxiv.org/abs/1711.00937.
[3] Deep Learning using Rectified Linear Units (ReLU). https://arxiv.org/abs/1803.08375.
[4] Euclidean Distance. https://en.wikipedia.org/wiki/Euclidean_distance.
[5] A Guide to Convolution Arithmetic for Deep Learning. https://arxiv.org/abs/1603.07285.
[6] Very Deep Convolutional Networks for Large-Scale Image Recognition. https://arxiv.org/abs/1409.1556.
[7] Generative Adversarial Networks. https://arxiv.org/abs/1406.2661.
[8] Inception Score. https://en.wikipedia.org/wiki/Inception_score.
[9] FID Calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance.
[10] Image Super-Resolution Using Very Deep Residual Channel Attention Networks. https://arxiv.org/abs/1807.02758.
[11] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. https://arxiv.org/abs/1809.00219.
[12] NTIRE 2024 Challenge on Image Super-Resolution (×4): Methods and Results. https://arxiv.org/abs/2404.09790.
[13] Muse: Text-To-Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
[14] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. https://arxiv.org/abs/2204.08583.
[15] LAR-SR: A Local Autoregressive Model for Image Super-Resolution. https://openaccess.thecvf.com/content/CVPR2022/papers/Guo_LAR-SR_A_Local_Autoregressive_Model_for_Image_Super-Resolution_CVPR_2022_paper.pdf.
[16] Long Horizon Temperature Scaling. https://arxiv.org/abs/2302.03686.
[17] Learning Rate Scheduling. https://d2l.ai/chapter_optimization/lr-scheduler.html.
[18] Adversarial Training. https://adversarial-ml-tutorial.org/adversarial_training/.
[19] Progressive Growing of GANs for Improved Quality, Stability, and Variation. https://arxiv.org/abs/1710.10196.
[20] CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. https://arxiv.org/abs/2204.14217.

---

## Chapter 9: Text-to-Image Generation

[1] OpenAI’s DALL·E 3. https://openai.com/index/dall-e-3/.
[2] Imagen 3. https://arxiv.org/abs/2408.07009.
[3] Adobe’s Firefly. https://www.adobe.com/products/firefly.html.
[4] Introducing ChatGPT. https://openai.com/index/chatgpt/.
[5] Zero-Shot Text-to-Image Generation. https://arxiv.org/abs/2102.12092.
[6] Muse: Text-To-Image Generation via Masked Generative Transformers. https://arxiv.org/abs/2301.00704.
[7] Generative Modeling by Estimating Gradients of the Data Distribution. https://arxiv.org/abs/1907.05600.
[8] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020.
[9] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://arxiv.org/abs/1910.10683.
[10] Hierarchical Text-Conditional Image Generation with CLIP Latents. https://arxiv.org/abs/2204.06125.
[11] High-Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752.
[12] On the De-duplication of LAION-2B. https://arxiv.org/abs/2303.12733.
[13] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. https://arxiv.org/abs/2408.08872.
[14] U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/abs/1505.04597.
[15] Scalable Diffusion Models with Transformers. https://arxiv.org/abs/2212.09748.
[16] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. https://arxiv.org/abs/2010.11929.
[17] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. https://arxiv.org/abs/2205.11487.
[18] Denoising Diffusion Probabilistic Models. https://arxiv.org/abs/2006.11239.
[19] Classifier-Free Diffusion Guidance. https://arxiv.org/abs/2207.12598.
[20] Denoising Diffusion Implicit Models. https://arxiv.org/abs/2010.02502.
[21] Introduction to Diffusion Models. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.
[22] Mixed Precision Training. https://arxiv.org/abs/1710.03740.
[23] FSDP Tutorial. https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html.
[24] DeepSpeed. https://github.com/microsoft/DeepSpeed.
[25] Parallel Sampling of Diffusion Models. https://arxiv.org/abs/2305.16317.
[26] Consistency Models. https://arxiv.org/abs/2303.01469.
[27] Inception Score. https://en.wikipedia.org/wiki/Inception_score.
[28] FID Calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance.
[29] CLIPScore: A Reference-free Evaluation Metric for Image Captioning. https://arxiv.org/abs/2104.08718.
[30] Sora Overview. https://openai.com/index/video-generation-models-as-world-simulators/.
[31] Imagen Video: High Definition Video Generation with Diffusion Models. https://arxiv.org/abs/2210.02303.
[32] Finetune Stable Diffusion Models with DDPO via TRL. https://huggingface.co/blog/trl-ddpo.
[33] Kandinsky: An Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion. https://arxiv.org/abs/2310.03502.
[34] On the Importance of Noise Scheduling for Diffusion Models. https://arxiv.org/abs/2301.10972.
[35] Patch n’ Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution. https://arxiv.org/abs/2307.06304.
[36] InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. https://arxiv.org/abs/2312.14238.
[37] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. https://arxiv.org/abs/2301.12597.
[38] Adding Conditional Control to Text-to-Image Diffusion Models. https://arxiv.org/abs/2302.05543.
[39] StyleDrop: Text-to-Image Generation in Any Style. https://research.google/blog/styledrop-text-to-image-generation-in-any-style/.

---

## Chapter 10: Personal Headshot Generation

[1] Imagine Yourself: Tuning-Free Personalized Image Generation. https://ai.meta.com/research/publications/imagine-yourself-tuning-free-personalized-image-generation/.
398 | [2] MoA: Mixture‐of‐Attention for Subject‐Context Disentanglement in Personalized Image Generation. https://arxiv.org/abs/2404.11565. 399 | [3] InstantID: Zero‐shot Identity‐Preserving Generation in Seconds. https://arxiv.org/abs/2401.07519. 400 | [4] An Image is Worth One Word: Personalizing Text‐to‐Image Generation using Textual Inversion. https://textual-inversion.github.io/. 401 | [5] DreamBooth: Fine Tuning Text‐to‐Image Diffusion Models for Subject‐Driven Generation. https://arxiv.org/abs/2208.12242. 402 | [6] LoRA: Low‐Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685. 403 | [7] Language Models are Few‐Shot Learners. https://arxiv.org/abs/2005.14165. 404 | [8] Classifier‐Free Diffusion Guidance. https://arxiv.org/abs/2207.12598. 405 | [9] CLIPScore: A Reference‐free Evaluation Metric for Image Captioning. https://arxiv.org/abs/2104.08718. 406 | [10] FID calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance. 407 | [11] Inception score. https://en.wikipedia.org/wiki/Inception_score. 408 | [12] Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020. 409 | [13] Emerging Properties in Self‐Supervised Vision Transformers. https://arxiv.org/abs/2104.14294. 410 | [14] Contrastive Representation Learning. https://lilianweng.github.io/posts/2021-05-31-contrastive/. 411 | [15] DINOv2: Learning Robust Visual Features without Supervision. https://arxiv.org/abs/2304.07193. 412 | [16] An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine‐tuning. https://arxiv.org/abs/2308.08747. 413 | [17] SDXL: Improving Latent Diffusion Models for High‐Resolution Image Synthesis. https://arxiv.org/abs/2307.01952. 414 | [18] Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models. https://arxiv.org/abs/2311.17394. 
415 | [19] Privacy‐Preserving Personal Identifiable Information (PII) Label Detection Using Machine Learning. https://ieeexplore.ieee.org/document/10307924. 416 | [20] Does fine‐tuning GPT‐3 with the OpenAI API leak personally‐identifiable information? https://arxiv.org/abs/2307.16382. 417 | 418 | --- 419 | 420 | ## Chapter 11: Text-to-Video Generation 421 | 422 | [1] Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/. 423 | [2] H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/. 424 | [3] High‐Resolution Image Synthesis with Latent Diffusion Models. https://arxiv.org/abs/2112.10752. 425 | [4] Meta Movie Gen. https://ai.meta.com/research/movie-gen/. 426 | [5] Auto‐Encoding Variational Bayes. https://arxiv.org/abs/1312.6114. 427 | [6] The Illustrated Stable Diffusion. https://jalammar.github.io/illustrated-stable-diffusion/. 428 | [7] On the De‐duplication of LAION‐2B. https://arxiv.org/abs/2303.12733. 429 | [8] The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783. 430 | [9] LLaVA‐NeXT: A Strong Zero‐shot Video Understanding Model. https://llava-vl.github.io/blog/2024-04-30-llava-next-video/. 431 | [10] Lumiere: A Space‐Time Diffusion Model for Video Generation. https://arxiv.org/abs/2401.12945. 432 | [11] OpenSora Technical Report. https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_02.md. 433 | [12] RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864. 434 | [13] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. https://arxiv.org/abs/2311.15127. 435 | [14] Emu Video: Factorizing Text‐to‐Video Generation by Explicit Image Conditioning. https://arxiv.org/abs/2311.10709. 436 | [15] Imagen Video: High Definition Video Generation with Diffusion Models. https://arxiv.org/abs/2210.02303. 437 | [16] HyperAttention: Long‐context Attention in Near‐Linear Time. https://arxiv.org/abs/2310.05869. 
438 | [17] Mixture of Experts Explained. https://huggingface.co/blog/moe. 439 | [18] VBench: Comprehensive Benchmark Suite for Video Generative Models. https://vchitect.github.io/VBench-project/. 440 | [19] Movie Gen Bench. https://github.com/facebookresearch/MovieGenBench. 441 | [20] FID calculation. https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance. 442 | [21] Inception score. https://en.wikipedia.org/wiki/Inception_score. 443 | [22] The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. https://arxiv.org/abs/1801.03924. 444 | [23] Demystifying MMD GANs. https://arxiv.org/abs/1801.01401. 445 | [24] Towards Accurate Generative Models of Video: A New Metric & Challenges. https://arxiv.org/abs/1812.01717. 446 | [25] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. https://arxiv.org/abs/1705.07750. 447 | [26] Rethinking the Inception Architecture for Computer Vision. https://arxiv.org/abs/1512.00567. 448 | [27] Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions. https://arxiv.org/abs/2401.01827. 449 | [28] Progressive Distillation for Fast Sampling of Diffusion Models. https://arxiv.org/abs/2202.00512. 450 | [29] Schedulers. https://huggingface.co/docs/diffusers/v0.9.0/en/api/schedulers. 451 | [30] Photorealistic Text‐to‐Image Diffusion Models with Deep Language Understanding. https://arxiv.org/abs/2205.11487. 452 | [31] CustomVideo: Customizing Text‐to‐Video Generation with Multiple Subjects. https://arxiv.org/abs/2401.09962. 453 | [32] Control‐A‐Video: Controllable Text‐to‐Video Generation with Diffusion Models. https://controlavideo.github.io/. 454 | [33] Introducing Stable Cascade. https://stability.ai/news/introducing-stable-cascade. 455 | 456 | --- 457 | 458 | --------------------------------------------------------------------------------