├── Interview_QA ├── ReadMe.md ├── images │ ├── ReadMe.md │ ├── 1-.jpg │ ├── 2-.jpg │ ├── llm-engineer-toolkit.jpg │ ├── prompt-eng-techniques-hub.jpg │ └── llm-survey-papers-collection.jpg ├── QA_115-117.md ├── QA_76-78.md ├── QA_82-84.md ├── QA_16-18.md ├── QA_109-111.md ├── QA_103-105.md ├── QA_91-93.md ├── QA_55-57.md ├── QA_79-81.md ├── QA_40-42.md ├── QA_97-99.md ├── QA_13-15.md ├── QA_85-87.md ├── QA_49-51.md ├── QA_34-36.md ├── QA_73-75.md ├── QA_22-24.md ├── QA_52-54.md ├── QA_28-30.md ├── QA_58-60.md ├── QA_31-33.md ├── QA_67-69.md ├── QA_7-9.md ├── QA_19-21.md ├── QA_25-27.md ├── QA_94-96.md ├── QA_46-48.md ├── QA_112-114.md ├── QA_1-3.md ├── QA_37-39.md ├── QA_10-12.md ├── QA_106-108.md ├── QA_61-63.md ├── QA_88-90.md ├── QA_4-6.md ├── QA_64-66.md ├── QA_100-102.md ├── QA_43-45.md └── QA_70-72.md ├── LICENSE └── README.md /Interview_QA/ReadMe.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Interview_QA/images/ReadMe.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Interview_QA/images/1-.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/HEAD/Interview_QA/images/1-.jpg -------------------------------------------------------------------------------- /Interview_QA/images/2-.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/HEAD/Interview_QA/images/2-.jpg -------------------------------------------------------------------------------- /Interview_QA/images/llm-engineer-toolkit.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/HEAD/Interview_QA/images/llm-engineer-toolkit.jpg -------------------------------------------------------------------------------- /Interview_QA/images/prompt-eng-techniques-hub.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/HEAD/Interview_QA/images/prompt-eng-techniques-hub.jpg -------------------------------------------------------------------------------- /Interview_QA/images/llm-survey-papers-collection.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/HEAD/Interview_QA/images/llm-survey-papers-collection.jpg -------------------------------------------------------------------------------- /Interview_QA/QA_115-117.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q115: What is the significance of self-supervised learning in LLM pretraining? 4 | 5 | ### ✅ Answer 6 | 7 | Self-supervised learning (SSL) is crucial for LLM pretraining because it allows models to learn rich, general-purpose language representations from massive amounts of unlabeled text data. 
It works by creating a surrogate task, like predicting the next word, where the "labels" are automatically derived from the input itself. 8 | 9 | This eliminates the need for expensive human annotation. By predicting the next token in a sequence, the model learns syntax, semantics, and world knowledge from massive unlabeled corpora. 10 | 11 | ## **👨🏻‍💻 LLM Engineer Toolkit** 12 | 13 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 14 | 15 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 16 | 17 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 18 | 19 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 20 | 21 | 22 | --------------------------------------------------------------------------------------------- 23 | 24 | 25 | 26 | 27 | -------------------------------------------------------------------------------- /Interview_QA/QA_76-78.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q76: Explain the trade-offs in using CoT prompting. 4 | 5 | ### ✅ Answer 6 | 7 | Chain-of-Thought (CoT) prompting improves accuracy and interpretability by encouraging models to generate intermediate steps before producing the final answer. However, it introduces trade-offs such as increased latency and token usage, which raise computational and cost overheads. 8 | 9 | Moreover, CoT can sometimes amplify hallucinations if the reasoning path is incorrect. It also requires a careful prompt design to balance reasoning detail with brevity for optimal performance. 10 | 11 | ## 📌 Q77: What is prompt engineering, and why is it important for LLMs? 12 | 13 | ### ✅ Answer 14 | 15 | Prompt engineering is the process of designing and refining input prompts to effectively guide LLMs toward generating accurate, relevant, and coherent responses. Since LLMs interpret user intent based on textual cues, well-crafted prompts reduce ambiguity and improve performance. 16 | 17 | It is especially important because LLMs are sensitive to phrasing, context, and examples provided in the prompt. Effective prompt engineering enhances output quality without retraining the model, making it a key skill in leveraging LLMs efficiently for diverse applications. 18 | 19 | ## 📌 Q78: What is the difference between zero-shot and few-shot prompting? 20 | 21 | ### ✅ Answer 22 | 23 | The fundamental difference lies in the number of examples within the prompt. In zero-shot prompting, the model is given only the instruction and input data without examples. Here the model completely depends on its pre-trained knowledge to generate the correct output. 24 | 25 | In contrast, few-shot prompting includes a small number of input-output examples (the "shots") in the prompt. The examples guide the LLM to understand the desired format, style, or task better, often leading to significantly improved performance. 
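To make the zero-shot vs. few-shot contrast concrete, here is a minimal sketch of the two prompt styles in Python. The sentiment-classification task, the labels, and the example reviews are illustrative placeholders, not material from this repository.

```python
# Minimal sketch: zero-shot vs. few-shot prompts for a toy sentiment task.
# The task, labels, and reviews below are illustrative placeholders.

zero_shot_prompt = """Classify the sentiment of the review as Positive or Negative.

Review: The battery died after two days.
Sentiment:"""

few_shot_prompt = """Classify the sentiment of the review as Positive or Negative.

Review: I love how light this laptop is.
Sentiment: Positive

Review: The screen cracked within a week.
Sentiment: Negative

Review: The battery died after two days.
Sentiment:"""

# Both prompts go to the same LLM; the few-shot version simply adds labeled
# "shots" that demonstrate the expected output format and decision style.
```

In practice, the example "shots" are often selected dynamically (for example, by embedding similarity to the incoming query) rather than hard-coded in the prompt.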
26 | 27 | ## LLM Survey Papers Collection 28 | 29 | 👉 [Repo Link](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) 30 | 31 | ![LLM Survey Papers Collection](images/llm-survey-papers-collection.jpg) 32 | 33 | --------------------------------------------------------------------------------------------- 34 | 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /Interview_QA/QA_82-84.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q82: What is In-Context Learning (ICL), and how is few-shot prompting related? 4 | 5 | ### ✅ Answer 6 | 7 | In-Context Learning (ICL) is a powerful ability of large language models that lets them perform tasks simply by understanding the input prompt. Few-shot prompting is a specific type of ICL where a few labeled examples are added to the prompt. 8 | 9 | The added examples show the model how the task should be done, helping it generalize to similar inputs. ICL allows LLMs to quickly adapt to new tasks with minimal data and no retraining. 10 | 11 | ## 📌 Q83: What is self-consistency prompting, and how does it improve reasoning? 12 | 13 | ### ✅ Answer 14 | 15 | Self-consistency prompting is a technique in LLMs where the model generates multiple reasoning paths or answers for the same input and then aggregates them to select the most consistent solution. Instead of relying on a single output, it considers the agreement among several reasoning attempts, which reduces the likelihood of errors caused by spurious or biased reasoning steps. 16 | 17 | This approach improves reasoning by amplifying correct patterns and filtering out inconsistent or unlikely answers, which results in more accurate outputs. 18 | 19 | ## 📌 Q84: Why is context important in designing prompts? 20 | 21 | ### ✅ Answer 22 | 23 | Context is crucial in prompt design because it provides the LLMs with the necessary background, constraints, and specific role it should adopt. This helps LLMs to generate a relevant and high-quality output. Without sufficient context, the LLM may misinterpret the user’s intent and rely on general knowledge. 24 | 25 | This may result in the generation of ambiguous, generic, or incorrect responses that don't meet the user's specific needs. 26 | 27 | ## **🚀 AIxFunda Newsletter (free)** 28 | 29 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 30 | 31 | - ✨ Weekly GenAI updates. 32 | - 📄 Weekly LLM, Agents and RAG paper updates. 33 | - 📝 1 fresh blog post on an interesting topic every week. 34 | 35 | 👉 [Subcribe Now](https://aixfunda.substack.com/) 36 | 37 | --------------------------------------------------------------------------------------------- 38 | 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /Interview_QA/QA_16-18.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q16: What is the purpose of the encoder in a transformer model? 
4 | 5 | ### ✅ Answer 6 | 7 | The purpose of the encoder in a transformer model is to process the entire input sequence and generate context-aware numerical representations. The encoder uses multi-head self-attention and feed-forward neural networks to capture the relationships and dependencies between all the tokens to generate context-rich token representations. These context-rich token representations are then passed to the decoder, which generates the output sequence. 8 | 9 | 10 | ## 📌 Q17: What is the purpose of the decoder in a transformer model? 11 | 12 | ### ✅ Answer 13 | 14 | The purpose of the decoder in a transformer model is to generate the output sequence based on the contextual representations provided by the encoder and the previously generated tokens. 15 | 16 | The decoder layers use (i) masked self-attention to ensure that predictions for the current step rely solely on previously generated outputs and (ii) encoder-decoder attention to focus on the most relevant parts of the encoder's final output when generating each output token. 17 | 18 | In simple words, the decoder transforms the encoder’s output (contextual representations) into the desired output token by token. 19 | 20 | ## 📌 Q18: How does the encoder-decoder structure work at a high level in the Transformer model? 21 | 22 | ### ✅ Answer 23 | 24 | The encoder-decoder structure in a Transformer is used for sequence-to-sequence tasks like machine translation. At a high level, the encoder processes the entire input sequence simultaneously, using self-attention to create rich, context-aware vector representations (the encoded state). 25 | 26 | The decoder then uses a separate self-attention mechanism, masked to prevent looking ahead, to generate the output sequence one token at a time, autoregressively. The decoder also incorporates an encoder-decoder attention layer, allowing it to focus on the most relevant parts of the encoder's output for each token it generates. 27 | 28 | ## LLM Survey Papers Collection 29 | 30 | 👉 [Repo Link](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) 31 | 32 | ![LLM Survey Papers Collection](images/llm-survey-papers-collection.jpg) 33 | 34 | --------------------------------------------------------------------------------------------- 35 | 36 | 37 | 38 | 39 | -------------------------------------------------------------------------------- /Interview_QA/QA_109-111.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q109: Explain the pretraining objective used in LLM pretraining. 4 | 5 | ### ✅ Answer 6 | 7 | The pre-training objective of large language models is next token prediction, also known as causal language modeling. In this setup, the model learns to predict the next token in a sequence given all the previous ones. By minimizing the difference between its predictions and the actual next tokens across billions of examples, the model gradually learns grammar, semantics, and contextual relationships. 8 | 9 | This objective enables LLMs to generate coherent, contextually relevant text and perform a wide range of downstream tasks through prompting or fine-tuning. 10 | 11 | ## 📌 Q110: What is the difference between casual language modeling and masked language modeling? 
12 | 13 | ### ✅ Answer 14 | 15 | Causal Language Modeling (CLM) is an autoregressive approach where the model predicts the next token in a sequence based only on the preceding tokens. 16 | 17 | In contrast, Masked Language Modeling (MLM) is an autoencoding approach where the model predicts intentionally masked (missing) tokens by leveraging bidirectional context, i.e., it considers both the past and future tokens in the sequence. 18 | 19 | ## 📌 Q111: How do LLMs handle out-of-vocabulary (OOV) words? 20 | 21 | ### ✅ Answer 22 | 23 | LLMs handle out-of-vocabulary (OOV) words using subword tokenization methods such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece. These techniques split rare or unseen words into smaller, known subword units or characters. 24 | 25 | This allows the model to represent and understand new words from existing tokens in the vocabulary. For example, the word “unhappiness” might be split into “un”, “happy”, and “ness”. This approach reduces the OOV problem and improves generalization to unseen vocabulary. 26 | 27 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 28 | 29 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 32 | 33 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 34 | 35 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 36 | 37 | 38 | ------------------------------------------------------------------------------------------ 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_103-105.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q103: What is QLoRA, and how does it differ from LoRA? 4 | 5 | ### ✅ Answer 6 | 7 | QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA that enables fine-tuning large language models more efficiently by combining low-rank adaptation with quantization. While LoRA freezes most model weights and trains small low-rank matrices to reduce memory and compute costs, QLoRA goes a step further by quantizing the model weights to 4-bit precision during fine-tuning. 8 | 9 | This drastically lowers GPU memory requirements without significant performance loss, allowing fine-tuning of very large models on consumer-grade hardware. 10 | 11 | ## 📌 Q104: When would you use QLoRA instead of standard LoRA? 12 | 13 | ### ✅ Answer 14 | 15 | QLoRA is preferred over standard LoRA when fine-tuning very large language models on limited hardware resources, such as a single GPU with constrained memory. QLoRA’s 4-bit quantization significantly reduces memory usage while maintaining model performance, making it ideal for resource-efficient fine-tuning. 16 | 17 | It’s particularly useful when working with models like LLaMA that would otherwise exceed GPU limits. In contrast, standard LoRA is sufficient when hardware capacity is not a major constraint. 18 | 19 | ## 📌 Q105: How would you handle LLM fine-tuning on consumer hardware with limited GPU memory? 
20 | 21 | ### ✅ Answer 22 | 23 | When fine-tuning an LLM on consumer hardware with limited GPU memory, techniques like LoRA (Low-Rank Adaptation) or QLoRA can be used to reduce memory usage by training only a small subset of parameters. Additionally, using gradient accumulation, mixed precision training, smaller batch sizes, and smaller sequence lengths helps manage GPU constraints. 24 | 25 | These approaches minimize memory overhead and computational cost, making fine-tuning LLMs feasible on resource-constrained devices. 26 | 27 | ## **👨🏻‍💻 LLM Engineer Toolkit** 28 | 29 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 32 | 33 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 34 | 35 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 36 | 37 | 38 | --------------------------------------------------------------------------------------------- 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /Interview_QA/QA_91-93.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q91: What role does alignment tuning play in improving an LLM's usability? 4 | 5 | ### ✅ Answer 6 | 7 | Alignment tuning enhances an LLM’s usability by ensuring its responses align with human values and ethical principles, making interactions safer and more relatable. This process ensures the model is not only helpful (answering questions effectively) but also harmless (refusing to generate toxic or unethical content). 8 | 9 | By tuning the model to a preference reward signal, alignment makes the LLM's output feel more natural, safe, and trustworthy, fundamentally enhancing the user experience. 10 | 11 | ## 📌 Q92: How do you prevent overfitting during fine-tuning? 12 | 13 | ### ✅ Answer 14 | 15 | To prevent overfitting during LLM fine-tuning, the primary methods involve using a validation set to monitor performance and employing early stopping when the validation loss starts to increase, indicating the model is memorizing the training data. Additionally, techniques like regularization (e.g., L2 or dropout) can be applied to penalize complex models. 16 | 17 | Lastly, ensuring the fine-tuning dataset is diverse and sufficiently large relative to the model size avoids overfitting and helps the model to generalize better to unseen data. 18 | 19 | ## 📌 Q93: What is catastrophic forgetting, and why is it a concern in fine-tuning? 20 | 21 | ### ✅ Answer 22 | 23 | Catastrophic forgetting is the loss of previously learned capabilities or knowledge when an LLM is fine-tuned on new, distinct data. It occurs because fine-tuning often involves updating all or most of the model's parameters, causing the new training signal to drastically alter weights important for the old tasks. 24 | 25 | This is a significant concern because it compromises the general utility of the base LLM, meaning the model might excel at the new, fine-tuned task but become incapable of performing the original, broader set of tasks it was initially trained for. 
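As a concrete illustration of the early-stopping idea from Q92, here is a minimal sketch of a fine-tuning loop that monitors validation loss. It assumes a PyTorch-style model, and `train_one_epoch()` and `evaluate()` are hypothetical placeholder helpers, not functions from any specific library.

```python
import copy

# Minimal early-stopping sketch for fine-tuning (see Q92).
# `model` is assumed to be a PyTorch-style module; `train_one_epoch()` and
# `evaluate()` are hypothetical helpers: one pass over the training data and
# a validation pass that returns the validation loss.

def fine_tune_with_early_stopping(model, train_data, val_data,
                                  max_epochs=10, patience=2):
    best_val_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)     # hypothetical training pass
        val_loss = evaluate(model, val_data)   # hypothetical validation pass

        if val_loss < best_val_loss:           # still generalizing: keep this checkpoint
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:                                  # validation loss rising: likely memorizing
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)      # restore the best-performing weights
    return model
```

The same validation signal can also serve as an early warning for catastrophic forgetting: periodically evaluating on a held-out sample of the original, general tasks shows whether the base capabilities are degrading while the model specializes.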
26 | 27 | ## **👨🏻‍💻 LLM Engineer Toolkit** 28 | 29 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 32 | 33 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 34 | 35 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 36 | 37 | 38 | --------------------------------------------------------------------------------------------- 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_55-57.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q55: What is the purpose of temperature in LLM inference, and how does it affect the output? 4 | 5 | ### ✅ Answer 6 | 7 | The temperature parameter in LLM inference controls the randomness of the token selection process by rescaling the logits before the softmax function. Its purpose is to manage the trade-off between creativity and determinism in the generated output. 8 | 9 | A high temperature (e.g., closer to 1.0) makes the probability distribution flatter, increasing the chance of selecting less likely tokens, resulting in more diverse, creative, and sometimes less coherent text. Conversely, a low temperature (e.g., closer to 0.0) makes the model more deterministic and focused on the most probable tokens, generating more coherent text. 10 | 11 | ## 📌 Q56: What is autoregressive generation in the context of LLMs? 12 | 13 | ### ✅ Answer 14 | 15 | Autoregressive generation is the sequential process by which LLMs generate text, one token (word or sub-word unit) at a time, using the previously generated tokens and the original input as context. 16 | 17 | Essentially, the model predicts the next most probable token based on all the preceding tokens, creating a dependency chain that results in coherent and contextually relevant text output. This process continues until an end-of-sequence token is predicted or a pre-defined length limit is reached. 18 | 19 | 20 | ## 📌 Q57: Explain the strengths and limitations of autoregressive text generation in LLMs. 21 | 22 | ### ✅ Answer 23 | 24 | The main strength of autoregressive text generation in LLMs is its ability to produce coherent, contextually relevant, and high-quality sequential text. This happens because of its method of predicting the next token based on all preceding ones. 25 | 26 | However, its primary limitation is that it's inherently slow because each new token must be generated serially, precluding true parallelization. Additionally, in the sequential generation process, errors can accumulate over time since each prediction depends on the previous tokens. 27 | 28 | 29 | ## **🚀 AIxFunda Newsletter (free)** 30 | 31 | 32 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 33 | 34 | - ✨ Weekly GenAI updates. 35 | - 📄 Weekly LLM, Agents and RAG paper updates. 36 | - 📝 1 fresh blog post on an interesting topic every week. 
37 | 38 | 👉 [Subcribe Now](https://aixfunda.substack.com/) 39 | 40 | --------------------------------------------------------------------------------------------- 41 | 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /Interview_QA/QA_79-81.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q79: What are the different approaches for choosing examples for few-shot prompting? 4 | 5 | ### ✅ Answer 6 | 7 | Proper choice of examples in few-shot prompting aims to maximize the informativeness and relevance of the limited context provided to the LLM. One common approach is random sampling, where examples are chosen arbitrarily from the dataset to provide a general overview. 8 | 9 | Diversity-based selection ensures the examples cover a wide range of input scenarios and edge cases relevant to the task, preventing overfitting. Conversely, similarity-based selection (often using vector embeddings) retrieves examples semantically most similar to the input query, providing the most direct in-context guidance. 10 | 11 | ## 📌 Q80: Why is context length important when designing prompts for LLMs? 12 | 13 | ### ✅ Answer 14 | 15 | Context length is crucial because it dictates the maximum number of tokens an LLM can consider when generating a response. If the prompt having instructions, examples, and input data exceeds this limit, the model will truncate the input. 16 | 17 | This leads to a loss of vital information, which results in incomplete, irrelevant, or inaccurate outputs. Therefore, understanding and managing the context length is essential for designing concise yet comprehensive prompts that fully leverage the model's capabilities. 18 | 19 | ## 📌 Q81: What is a system prompt, and how does it differ from a user prompt? 20 | 21 | ### ✅ Answer 22 | 23 | A system prompt is an instruction given to an LLM to define its overall behavior, role, or response style throughout a conversation, such as “You are an expert data scientist.” It sets the foundation for how the model interprets and responds to inputs. 24 | 25 | In contrast, a user prompt is a direct query or task provided by the user during interaction, like “Explain the concept of embeddings.” In simple words, the system prompt defines the model’s persona and boundaries, while the user prompt provides the query to be answered or the task to be performed. 26 | 27 | 28 | ## **👨🏻‍💻 LLM Engineer Toolkit** 29 | 30 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 31 | 32 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 33 | 34 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 35 | 36 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 37 | 38 | 39 | --------------------------------------------------------------------------------------------- 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_40-42.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. 
You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q40: Explain the trade-offs between batching and latency in LLM serving. 4 | 5 | ### ✅ Answer 6 | 7 | Batching in LLM serving increases throughput by processing multiple requests concurrently, maximizing GPU utilization and efficiency, especially under heavy workloads. However, larger batches inevitably increase latency since faster or shorter requests must wait for the longest sequence to complete before the batch finishes. 8 | 9 | Continuous or dynamic batching strategies attempt to balance this by adjusting batches in real time, replacing completed requests with new ones to maintain efficiency while minimizing waiting time. Ultimately, the trade-off depends on application needs—interactive systems prioritize low latency, whereas industrial or offline jobs aim for higher throughput at some latency cost. 10 | 11 | ## 📌 Q41: How can techniques like mixture-of-experts (MoE) optimize inference efficiency? 12 | 13 | ### ✅ Answer 14 | 15 | Mixture-of-Experts (MoE) optimizes inference efficiency by introducing conditional computation or sparsity. Instead of activating all model parameters for every input token, a lightweight gating network dynamically selects only a small subset of specialized "expert" sub-networks to process the token. 16 | 17 | This selective activation means the model performs significantly fewer floating-point operations (FLOPs) per token compared to a dense model of a comparable total size. This results in faster token generation (reduced latency) and higher throughput. 18 | 19 | ## 📌 Q42:Explain the role of decoding strategy in LLM text generation. 20 | 21 | ### ✅ Answer 22 | 23 | Decoding strategies in LLMs are the decision rules a model uses to choose the next token from its vocabulary when generating text. An LLM doesn't just "know" the next word; it calculates a probability for every single possible word it knows. The decoding strategy is simply the method used to select the final word from that list of probabilities to form the output text, one word at a time. 24 | 25 | The choice of strategy heavily influences whether the output is predictable and factual, or creative and diverse. Common decoding strategies include greedy search, beam search, and sampling methods like temperature scaling and top-k/top-p sampling. 26 | 27 | 28 | ## 👨🏻‍💻 LLM Survey Papers Collection 29 | 30 | 👉 [Repo Link](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) 31 | 32 | ![LLM Survey Papers Collection](images/llm-survey-papers-collection.jpg) 33 | 34 | --------------------------------------------------------------------------------------------- 35 | 36 | 37 | 38 | 39 | 40 | -------------------------------------------------------------------------------- /Interview_QA/QA_97-99.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q97: When should you use fine-tuning vs. RAG? 4 | 5 | ### ✅ Answer 6 | 7 | Fine-tuning is best when you want the model to deeply learn domain-specific knowledge or handle specialized tasks where the knowledge is relatively static and not frequently changing. 
On the other hand, RAG (Retrieval-Augmented Generation) is ideal when you need the model to access the latest, proprietary, or frequently changing information without retraining it. 8 | 9 | It allows the model to provide fact-grounded answers and source traceability from a secure knowledge base. 10 | 11 | ## 📌 Q98: Explain the limitations of using RAG over LLM fine-tuning. 12 | 13 | ### ✅ Answer 14 | 15 | The main drawbacks of using RAG compared to fine-tuning lie in its performance and lack of deep specialization. Because RAG adds an extra retrieval step, it introduces more inference latency, making it less ideal for low-latency use cases. While RAG is effective at bringing in external knowledge, it doesn’t actually change the model’s core behavior, style, or ability to handle complex, domain-specific reasoning. 16 | 17 | Fine-tuning achieves all these by updating the model’s weights. Additionally, RAG’s output quality heavily depends on the retriever’s accuracy, meaning it can produce poor results if irrelevant information is fetched. 18 | 19 | ## 📌 Q99: Explain the limitations of using LLM fine-tuning over RAG. 20 | 21 | ### ✅ Answer 22 | 23 | The main drawback of fine-tuning compared to RAG is the model’s static knowledge. Once the model is trained, it can’t access new or real-time information without undergoing an expensive and time-consuming retraining process. Fine-tuning also requires significant computational resources and specialized expertise for preparing data and training the model. 24 | 25 | Moreover, fine-tuned models risk “catastrophic forgetting,” where learning new information causes them to lose some of their original general knowledge. RAG avoids the catastrophic forgetting problem, as it keeps the base model unchanged. 26 | 27 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 28 | 29 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 32 | 33 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 34 | 35 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 36 | 37 | 38 | ------------------------------------------------------------------------------------------ 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_13-15.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q13: Explain the role of token embeddings in the Transformer model. 4 | 5 | ### ✅ Answer 6 | 7 | Token embeddings in the Transformer model encode syntactic and semantic information. These embeddings are obtained from an embedding matrix, where each token’s index maps to a fixed-size vector. These token embeddings are combined with positional embeddings to help the model retain word order. These enriched representations are then passed into the subsequent Transformer model layers, enabling context-aware understanding and text generation. 8 | 9 | ## 📌 Q14: Explain the working of the embedding layer in the transformer model. 
10 | 11 | ### ✅ Answer 12 | 13 | The main purpose of the embedding layer in the transformer model is to convert input token IDs into dense, continuous vectors called embeddings. Each token is first represented as a one-hot vector and then multiplied with an embedding matrix to produce the token embedding. 14 | 15 | These token embeddings are then combined with positional embeddings to retain information about token order within the sequence. To summarize, the embedding layer transform the sequence of tokens represented as integer IDs into the sequence of embeddings. These embeddings are then processed by the subsequent layers to inject contextual information. 16 | 17 | ## 📌 Q15: What is the role of self-attention in the Transformer model, and why is it called “self-attention”? 18 | 19 | ### ✅ Answer 20 | 21 | The self-attention mechanism enables the model to compute context-rich representation for each token by allowing the tokens to attend to all tokens in the input sequence. These context-rich representations capture long-range dependencies and relationships. 22 | 23 | The mechanism is called “self-attention” because the attention is computed within the same sequence - each token attends to itself and others without external input. This process helps build rich contextual embeddings crucial for understanding meaning and structure in text. 24 | 25 | 26 | **👨🏻‍💻 LLM Engineer Toolkit** 27 | -------------------------------------------------------------------------------------------- 28 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 29 | 30 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 31 | 32 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLM, RAG and Agents. 33 | 34 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 35 | 36 | 37 | --------------------------------------------------------------------------------------------- 38 | 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_85-87.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q85: Describe a strategy for reducing hallucinations via prompt design. 4 | 5 | ### ✅ Answer 6 | 7 | A key strategy for reducing hallucinations in LLMs via prompt design is grounding the response by instructing the model to rely exclusively on provided context. This is often achieved by including a clear directive such as: "Answer the following question only using the provided document. If the document does not contain the answer, state 'The information is not available in the document.' Do not use any external knowledge." 8 | 9 | This constrains the model's search space to the provided input, significantly decreasing the likelihood of generating fabricated or incorrect details. 10 | 11 | ## 📌 Q86: How would you structure a prompt to ensure the LLM output is in a specific format, like JSON? 12 | 13 | ### ✅ Answer 14 | 15 | To ensure an LLM output is in a specific format like JSON, you should include an explicit instruction in the prompt, clearly stating the desired output structure. For example, "Please respond with a valid JSON object." 
16 | 17 | It is also helpful to provide a schema or example of the required JSON structure, including the keys and expected data types for each value. Additionally, including an explicit instruction to only output the JSON and nothing else helps prevent explanation text from being included, resulting in a clean, parsable output. 18 | 19 | ## 📌 Q87: Explain the purpose of ReAct prompting in AI Agents. 20 | 21 | ### ✅ Answer 22 | 23 | ReAct (Reasoning and Acting) prompting allows the agent to solve complex, multi-step tasks by allowing it to dynamically plan, execute external actions (e.g., using a search engine or tool), and refine its approach based on observations. 24 | 25 | This loop of thinking, acting, and observing keeps the model’s reasoning grounded in real-world feedback, which reduces hallucinations and makes its decisions more accurate, interpretable, and reliable. By combining logical reasoning with real-world interaction, ReAct enables more flexible, reliable, and human-like problem-solving. 26 | 27 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 28 | 29 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 32 | 33 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 34 | 35 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 36 | 37 | 38 | ------------------------------------------------------------------------------------------ 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_49-51.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q49: When you set the temperature to 0.0, which decoding strategy are you using? 4 | 5 | ### ✅ Answer 6 | 7 | Setting the temperature to 0.0 means you're employing greedy decoding. This strategy forces the model to always select the token with the absolute highest probability at each step, making the output deterministic. 8 | 9 | It essentially eliminates all randomness, favoring the most probable sequence according to the model's learned distribution. 10 | 11 | ## 📌 Q50: How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)? 12 | 13 | ### ✅ Answer 14 | 15 | Beam Search differs from Breadth-First Search (BFS) and Depth-First Search (DFS) in that it’s not an exhaustive search algorithm but a heuristic-based, approximate search. Unlike BFS, which explores all nodes level-by-level, or DFS, which explores one path as deeply as possible, Beam Search keeps only the top-k most promising candidates at each step. 16 | 17 | This greedy, constrained exploration makes it significantly more memory and time efficient than exhaustive BFS or DFS, which is crucial for large state spaces. However, it sacrifices completeness and optimality, as the best solution may be pruned. 18 | 19 | ## 📌 Q51: Explain the criteria for choosing different decoding strategies. 
20 | 21 | ### ✅ Answer 22 | 23 | The criteria for choosing different decoding strategies, such as greedy search, beam search, or sampling methods (like nucleus or top-k sampling), primarily depend on the desired balance between output quality, diversity, and computational cost. Greedy search is fast but often produces suboptimal, repetitive, or bland text, making it suitable when speed is critical and high quality is not paramount. 24 | 25 | Beam search improves quality by exploring multiple paths and is preferred for tasks requiring deterministic outputs, like translation or summarization, though it's slower. Sampling methods are chosen when the goal is to generate more creative and diverse, less deterministic outputs, as in dialogue systems or story generation, by introducing controlled randomness. 26 | 27 | 28 | ## **👨🏻‍💻 LLM Engineer Toolkit** 29 | 30 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 31 | 32 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 33 | 34 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLM, RAG and Agents. 35 | 36 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 37 | 38 | 39 | --------------------------------------------------------------------------------------------- 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_34-36.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q34: How does Transformer model address the vanishing gradient problem? 4 | 5 | ### ✅ Answer 6 | 7 | Transformers address the vanishing gradient problem primarily through the use of residual (skip) connections and layer normalization. Residual connections allow gradients to flow directly through the network by adding the input of a layer to its output, preventing gradients from shrinking during backpropagation. 8 | 9 | Layer normalization helps stabilize training by keeping activations within a consistent range, which controls the scale of gradients. Additionally, the self-attention mechanism enables direct information flow between tokens, reducing the depth of dependency and further mitigating gradient vanishing. 10 | 11 | ## 📌 Q35: What is the purpose of the position-wise feed-forward sublayer in the Transformer model? 12 | 13 | ### ✅ Answer 14 | 15 | The position-wise feed-forward sublayer in the Transformer model serves to independently process each position's representation in the sequence after the attention mechanism. It consists of two linear transformations with a ReLU activation in between, applied identically and separately to each token vector. 16 | 17 | This sublayer adds non-linearity and increases the model's capacity to learn complex features by transforming the attention output into richer representations for the next layers. 18 | 19 | ## 📌 Q36: Can you briefly explain the difference between LLM training and LLM inference? 20 | 21 | ### ✅ Answer 22 | 23 | LLM training is the process where a large language model learns patterns and relationships from massive datasets, adjusting its internal parameters (weights) to minimize prediction error. 
LLM inference is the deployment phase where the trained, fixed-parameter model uses its learned knowledge to generate a response for a new, unseen input prompt. Training is computationally intensive and done once, while inference is fast and happens every time the model is used. 24 | 25 | **☕ Support the Author** 26 | ------------------------------------------------------------------------------------------- 27 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 28 | 29 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 30 | 31 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 32 | 33 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 34 | 35 | — Kalyan KS 36 | 37 | --------------------------------------------------------------------------------------------- 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /Interview_QA/QA_73-75.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q73: What are the possible options for accelerating LLM inference? 4 | 5 | ### ✅ Answer 6 | 7 | Possible options to accelerate LLM inference include: 8 | 9 | - Quantization: Reduces numerical precision to lower computation and memory use, speeding up inference. 10 | 11 | - Pruning: Eliminates less important neurons or weights to make the model smaller and faster. 12 | 13 | - Knowledge Distillation: Transfers knowledge to a smaller, more efficient model that runs faster. 14 | 15 | - KV Caching: Stores and reuses previous attention computations to speed up token generation. 16 | 17 | - Speculative Decoding: Uses a smaller draft model to generate candidate tokens verified by the full model. 18 | 19 | - Hardware Acceleration: Utilizes GPUs, TPUs, or custom accelerators designed for parallel matrix operations. 20 | 21 | ## 📌 Q74: What is Chain-of-Thought (CoT) prompting, and when is it most useful? 22 | 23 | ### ✅ Answer 24 | 25 | Chain-of-Thought (CoT) prompting is a technique where you instruct an LLM to explicitly show the step-by-step reasoning process before arriving at the final answer. This process mimics human-like thinking, breaking down complex problems into manageable intermediate steps. 26 | 27 | CoT is most useful for complex reasoning tasks that require multiple steps, such as multi-step arithmetic, symbolic reasoning, and common-sense question answering. The CoT prompting technique significantly improves the LLM's accuracy and ability to handle complexity compared to standard prompting. 28 | 29 | ## 📌 Q75: Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting. 30 | 31 | ### ✅ Answer 32 | 33 | The power of Chain-of-Thought (CoT) prompting lies in how it helps large language models think more like humans - step by step. Instead of jumping straight to an answer, CoT encourages the model to reason through the problem by breaking it down into smaller, logical steps. 
34 | 35 | This structured thinking makes it easier for the model to handle tasks involving logic, math, or common sense, resulting in more accurate and trustworthy answers. In simple terms, CoT gives the model a bit of extra “thinking time” before it decides on the final response. 36 | 37 | 38 | ## **👨🏻‍💻 LLM Engineer Toolkit** 39 | 40 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 41 | 42 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 43 | 44 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 45 | 46 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 47 | 48 | 49 | --------------------------------------------------------------------------------------------- 50 | 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /Interview_QA/QA_22-24.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q22: How does masked self-attention differ from regular self-attention, and where is it used in a Transformer? 4 | 5 | ### ✅ Answer 6 | 7 | Masked self-attention differs from regular self-attention by restricting the attention mechanism so that each position in the sequence can only attend to earlier positions and itself, preventing access to future tokens. This is essential for autoregressive tasks like text generation, where tokens are generated sequentially one by one. 8 | 9 | Regular self-attention allows every token to attend to all tokens in the sequence, which is useful in encoder layers for understanding context bidirectionally. Masked self-attention is specifically used in the decoder layers of the Transformer model to ensure that the model predicts tokens one at a time without peeking ahead. 10 | 11 | ## 📌 Q23: Discuss the pros and cons of the self-attention mechanism in the Transformer model. 12 | 13 | ### ✅ Answer 14 | 15 | Self-attention enables the Transformer model to capture long-range dependencies (contextual relationships) between tokens in the input sequence efficiently. It allows for parallel computation, improving training speed compared to recurrent models. 16 | 17 | However, its quadratic time complexity with respect to sequence length and large memory footprint make it computationally expensive for long inputs. 18 | 19 | ## 📌 Q24: What is the purpose of masked self-attention in the Transformer decoder? 20 | 21 | ### ✅ Answer 22 | 23 | During output generation, masked self-attention in the decoder prevents the model from looking ahead by blocking access to future tokens during generation. This means that when predicting a token, the model is restricted to attending only to the tokens that come before it in the sequence. 24 | 25 | This ensures (i) auto-regressive behavior, allowing the model to generate text one token at a time, and (ii) the model learns to make predictions based solely on past context, maintaining the integrity and correctness of language modeling. 26 | 27 | **☕ Support the Author** 28 | ------------------------------------------------------------------------------------------- 29 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 
30 | 31 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 32 | 33 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 34 | 35 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 36 | 37 | — Kalyan KS 38 | 39 | --------------------------------------------------------------------------------------------- 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_52-54.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q52: Compare deterministic and stochastic decoding methods in LLMs. 4 | 5 | ### ✅ Answer 6 | 7 | Deterministic decoding methods, like Greedy Search and Beam Search, consistently yield the same output for the same input because they select the next token based purely on the highest probability or a fixed set of high-probability paths, respectively, without any element of randomness. 8 | 9 | In contrast, stochastic decoding methods, such as Temperature Sampling, Top-K Sampling, and Top-p (Nucleus) Sampling, introduce randomness to the token selection process, resulting in varied and diverse outputs even for identical inputs. 10 | 11 | Deterministic methods are generally preferred for tasks requiring predictability and factual accuracy (e.g., translation, summarization), while stochastic methods excel in open-ended or creative tasks (e.g., storytelling, brainstorming) where novelty is desirable. 12 | 13 | ## 📌 Q53: What is the role of the context window during LLM inference? 14 | 15 | ### ✅ Answer 16 | 17 | The context window refers to the maximum total number of tokens (both input and output) that the model can process at one time. The context window during LLM inference serves as the model's working memory. This window enables the model to consider both the current input and preceding tokens to generate contextually relevant and coherent responses. 18 | 19 | A larger context window allows the LLM to handle longer texts or complex tasks by preserving more information. However, if the input surpasses the window size, the model "forgets" earlier tokens, potentially reducing response accuracy. ​Essentially, it determines how much past information the model can use to make the next prediction. 20 | 21 | ## 📌 Q54: Explain the pros and cons of large and small context windows in LLM inference. 22 | 23 | ### ✅ Answer 24 | 25 | The context window in LLMs defines the maximum number of tokens it can process at once, and it is like the model’s working memory. A large context window allows the model to process more input, leading to more coherent and contextually aware outputs. However, processing more input tokens significantly increases memory usage, computational cost, and latency. 26 | 27 | Conversely, a small context window is faster and more memory-efficient, making it suitable for lower-resource environments or real-time applications. However, it limits the model's 'memory,' potentially causing it to lose track of earlier information and generate less consistent or relevant outputs. 
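To make the "working memory" idea from Q53 and Q54 concrete, here is a minimal sketch that counts tokens and trims a chat history to a fixed budget. It assumes the `tiktoken` tokenizer is installed; the `cl100k_base` encoding, the toy messages, and the 60-token budget are illustrative choices, not values from the answers above.

```python
# Minimal sketch: counting tokens and trimming history to a context budget.
# Assumes `pip install tiktoken`; encoding name and budget are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_history(messages, max_tokens):
    """Keep the most recent messages that fit within max_tokens."""
    kept, used = [], 0
    for message in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break                            # older messages fall out of the window
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "System: You are a concise assistant.",
    "User: We discussed context windows earlier. Summarize the trade-offs.",
    "Assistant: Large windows keep more context but cost more compute.",
]
print(trim_history(history, max_tokens=60))
```

Production systems typically reserve part of the budget for the model's reply and keep the system message pinned rather than letting it be trimmed; this sketch omits those details for brevity.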
28 | 29 | 30 | ## 👨🏻‍💻 LLM Survey Papers Collection 31 | 32 | 👉 [Repo Link](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) 33 | 34 | ![LLM Survey Papers Collection](images/llm-survey-papers-collection.jpg) 35 | 36 | --------------------------------------------------------------------------------------------- 37 | 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /Interview_QA/QA_28-30.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q28: What is the purpose of residual (skip) connections in Transformer model layers? 4 | 5 | ### ✅ Answer 6 | 7 | The main purpose of residual (skip) connections in Transformer layers is to allow the gradients to flow directly through the network during backpropagation. This helps to mitigate the vanishing gradient problem in deep networks, ensuring effective training. 8 | 9 | Specifically, they enable the output of a layer to be Layer Input + Layer Output rather than just the Layer Output. This means the layer learns an additive change rather than a new representation entirely. This facilitates training very deep architectures, leading to better overall performance. 10 | 11 | ## 📌 Q29: Why is layer normalization used, and where is it applied in Transformer? 12 | 13 | ### ✅ Answer 14 | 15 | Layer normalization is used in Transformer to stabilize and accelerate training by normalizing the inputs across features within each layer, ensuring that activations have a mean of zero and variance of one. This normalization helps control the scale of gradients during backpropagation, preventing issues like exploding or vanishing gradients. 16 | 17 | In Transformer architectures, layer normalization is applied before the self-attention and feedforward sublayers (Pre-LN) or after these sublayers within the residual connections (Post-LN). The Pre-LN variant improves gradient stability and allows faster training without requiring learning rate warm-up, while Post-LN was used in the original Transformer but can lead to unstable gradients if not carefully managed. 18 | 19 | ## 📌 Q30: What is cross-entropy loss, and how is it applied during transformer training? 20 | 21 | ### ✅ Answer 22 | 23 | Cross-entropy loss measures the difference between the predicted probability distribution of a model and the true distribution of target labels. In Transformer training, it quantifies how well the model predicts the next token by comparing its output logits (after softmax) with the actual token index. 24 | 25 | The loss penalizes incorrect predictions with higher values, encouraging the model to assign greater probability to the correct token. Minimizing the cross-entropy loss through backpropagation helps optimize the model’s parameters for accurate sequence generation. 26 | 27 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 28 | 29 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 32 | 33 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 
34 | 35 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 36 | 37 | 38 | ------------------------------------------------------------------------------------------ 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_58-60.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q58: Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs). 4 | 5 | ### ✅ Answer 6 | 7 | Diffusion Language Models (DLMs) differ from traditional Large Language Models (LLMs) in how they generate text. LLMs use autoregressive generation, predicting one token at a time based on previous ones. 8 | 9 | DLMs use a denoising process, starting from random noise and iteratively refining it into coherent text—similar to how diffusion models generate images. This approach allows DLMs to capture global context more effectively and potentially produce more diverse outputs. 10 | 11 | ## 📌 Q59: Do you prefer DLMs or LLMs for latency-sensitive applications? 12 | 13 | ### ✅ Answer 14 | 15 | LLMs are autoregressive, generating text sequentially token-by-token, which creates a sequential bottleneck resulting in slower overall latency for long outputs. In contrast, DLMs are non-autoregressive and generate text by iteratively refining the entire sequence in parallel, which often offers them a significant advantage in inference speed and throughput for bulk processing or longer generations. 16 | 17 | While a single denoising step in a DLM can be computationally heavier than an LLM's single token prediction, the ability to generate multiple tokens simultaneously over a few steps means that DLMs can achieve a faster Time Per Output Token (TPOT), making them an emerging alternative for latency-sensitive applications. 18 | 19 | ## 📌 Q60: Explain the concept of token streaming during inference. 20 | 21 | ### ✅ Answer 22 | 23 | Token streaming during LLM inference is an optimization technique where the model's output, typically the next predicted token, is sent to the user immediately as soon as it's generated, rather than waiting for the entire response to be completed. 24 | 25 | Token streaming significantly reduces the perceived latency because the user can begin reading the output almost instantly, making the interaction feel much faster and more responsive. The full response is thus "streamed" out token by token until an end-of-sequence token is reached. 26 | 27 | **☕ Support the Author** 28 | ------------------------------------------------------------------------------------------- 29 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 30 | 31 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 32 | 33 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 34 | 35 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 
🙏 36 | 37 | — Kalyan KS 38 | 39 | --------------------------------------------------------------------------------------------- 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_31-33.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q31: Compare transformers and RNNs in terms of handling long-range dependencies. 4 | 5 | ### ✅ Answer 6 | 7 | Transformers are fundamentally better than Recurrent Neural Networks (RNNs) at capturing long-range dependencies due to their core architectural difference. RNNs process sequences sequentially and suffer from the vanishing gradient problem, which causes information from earlier steps to diminish over long sequences, making it difficult to relate distant tokens. 8 | 9 | In contrast, Transformers utilize a self-attention mechanism that computes a direct relationship score between every pair of tokens in the sequence. This allows it to capture dependencies regardless of the distance between the tokens in O(1) sequential steps, unlike the O(n) steps required by RNNs. This non-sequential, parallelized attention allows Transformers to effectively model context across much longer spans. 10 | 11 | ## 📌 Q32: What are the fundamental limitations of the Transformer model? 12 | 13 | ### ✅ Answer 14 | 15 | The fundamental limitations of the Transformer model include its high computational and memory demands. This is because of the quadratic complexity of the self-attention mechanism, which makes processing very long sequences challenging. 16 | 17 | Additionally, they require vast amounts of diverse training data and can be sensitive to biased or low-quality data, impacting generalization and robustness. Despite these challenges, they remain powerful and better than traditional deep learning models like CNNs and RNNs. 18 | 19 | ## 📌 Q33: How do transformers address the limitations of traditional deep learning models like CNNs and RNNs? 20 | 21 | ### ✅ Answer 22 | 23 | Transformers address the limitations of traditional deep learning models like CNNs and RNNs primarily through their self-attention mechanism and parallel processing capability. RNNs process data sequentially and struggle with long-range dependencies due to vanishing gradients. CNNs excel at local feature extraction but lack the ability to model sequential or long-range relationships across data. 24 | 25 | The self-attention mechanism allows transformers to process all parts of the input in parallel, assessing the importance of every word/token to every other word/token regardless of their distance. Compared to traditional models, transformers can scale better and train faster on large datasets. 26 | 27 | ## **🚀 AIxFunda Newsletter (free)** 28 | 29 | 30 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 31 | 32 | - ✨ Weekly GenAI updates. 33 | - 📄 Weekly LLM, Agents and RAG paper updates. 34 | - 📝 1 fresh blog post on an interesting topic every week. 
35 | 36 | 👉 [Subcribe Now](https://aixfunda.substack.com/) 37 | 38 | --------------------------------------------------------------------------------------------- 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_67-69.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q67: Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements. 4 | 5 | ### ✅ Answer 6 | 7 | Online LLM inference involves real-time, user-facing requests typically hosted on a cloud server, requiring low latency and high throughput to handle unpredictable traffic and network communication efficiently. 8 | 9 | Conversely, offline LLM inference deals with precollected data in batches, usually on on-premise or local hardware, where the primary requirements are high throughput and processing large data volumes at scale, with less stringent latency demands. The online scenario prioritizes rapid individual responses, while the offline scenario focuses on massive-scale, non-real-time data processing. 10 | 11 | ## 📌 Q68: Explain the throughput vs. latency tradeoff in LLM inference. 12 | 13 | ### ✅ Answer 14 | 15 | Latency refers to how long it takes to process a single request - the faster it responds, the lower the latency. This is usually achieved by handling smaller batches of data at a time. On the other hand, throughput measures how many requests a system can handle per second, which improves when larger batches are processed together to fully utilize the GPU. 16 | 17 | However, doing so makes individual requests wait longer, increasing latency. Hence, systems must balance these metrics based on their application - interactive applications like chatbots focus on low latency for quick responses, while batch-processing systems aim for high throughput to maximize efficiency. 18 | 19 | ## 📌 Q69: What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU? 20 | 21 | ### ✅ Answer 22 | 23 | When running large language models (LLMs) on modern GPUs, several key bottlenecks limit performance. One major issue is memory bandwidth saturation, where the model frequently accesses large key-value (KV) caches, slowing data movement. 24 | 25 | As the KV cache grows during text generation, it consumes more GPU memory, forcing smaller batch sizes and creating memory pressure. Compute bottlenecks also occur in heavy operations like matrix multiplications, though these are often less critical than memory-related delays. 26 | 27 | In hybrid CPU-GPU systems, inefficient task scheduling can leave GPU cores underutilized, while multi-GPU setups face extra communication delays that reduce scalability. Overcoming these challenges involves techniques like caching, model quantization, smarter scheduling, and balanced workload distribution to fully harness GPU power and minimize latency. 
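As a rough, illustrative companion to the memory-pressure point in Q69, the sketch below estimates how large a KV cache can get for a hypothetical decoder-only configuration; the layer count, head sizes, and precision are made-up values, not the specs of any real model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, per head, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical 7B-style configuration with FP16 (2-byte) cache entries.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)
print(f"KV cache ≈ {size / 1e9:.1f} GB")      # ≈ 17.2 GB with these invented numbers
```

Because the size grows linearly with both sequence length and batch size, long contexts quickly force smaller batches, which is the memory pressure described above.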
28 | 29 | 30 | ## LLM Survey Papers Collection 31 | 32 | 👉 [Repo Link](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) 33 | 34 | ![LLM Survey Papers Collection](images/llm-survey-papers-collection.jpg) 35 | 36 | --------------------------------------------------------------------------------------------- 37 | 38 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /Interview_QA/QA_7-9.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q7: Explain why subword tokenization is preferred over word-level tokenization in the Transformer model. 4 | 5 | ### ✅ Answer 6 | 7 | Subword tokenization, such as Byte-Pair Encoding (BPE) in the Transformer model, offers several advantages over word-level tokenization by handling out-of-vocabulary words and balancing vocabulary size. It effectively handles out-of-vocabulary (OOV) words (unseen words, i.e., words not in the model vocabulary) by breaking them into known subwords, significantly improving coverage of rare or unseen words. 8 | 9 | This method also results in a smaller, manageable vocabulary size compared to the vast lexicon of word-level tokenization, while still providing the semantic benefits of word-level units. Finally, it helps the model learn about morphology (prefixes, suffixes) by segmenting words into meaningful parts. 10 | 11 | ## 📌 Q8: Explain the trade-offs in using a large vocabulary in LLMs. 12 | 13 | ### ✅ Answer 14 | 15 | Using a large vocabulary in LLMs allows the model to represent words more precisely and reduce token fragmentation, improving understanding of rare or domain-specific terms. However, it increases the size of the embedding and output layers, leading to higher memory usage and slower training. 16 | 17 | Larger vocabularies also make softmax computations more expensive and may lead to a sparser representation of less frequent tokens, potentially hindering effective learning for those words. Therefore, models often moderate vocabulary sizes to ensure a balance between linguistic coverage and computational efficiency. 18 | 19 | ## 📌 Q9: Explain how self-attention is computed in the Transformer model step by step. 20 | 21 | ### ✅ Answer 22 | 23 | Self-attention in the Transformer model is computed by first projecting each input token into three vectors: queries, keys, and values using learned weight matrices. Then, attention scores are calculated by taking the dot product of a query with all keys, scaled by the square root of the key dimension to stabilize gradients. 24 | 25 | These scores are normalized using the softmax function to obtain attention weights, representing the importance of each token relative to others. Finally, each token's output is computed as the weighted sum of the value vectors, allowing the model to capture contextual relationships within the input sequence. 26 | 27 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 28 | 29 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 30 | 31 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 32 | 33 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 
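Tying back to Q9, here is a minimal NumPy sketch of single-head self-attention computed step by step: projection into queries, keys, and values, scaled dot-product scores, a row-wise softmax, and the weighted sum of value vectors. The sequence length, dimensions, and random weights are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                          # 4 tokens, toy dimensions

X = rng.normal(size=(n, d_model))                  # token embeddings (random placeholders)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                # 1. project tokens to queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # 2. scaled dot-product attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # 3. softmax over each row
output = weights @ V                               # 4. weighted sum of the value vectors

print(weights.shape, output.shape)                 # (4, 4) attention weights, (4, 8) contextual outputs
```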
34 | 35 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 36 | 37 | 38 | ------------------------------------------------------------------------------------------ 39 | 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_19-21.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q19: What is the purpose of scaling in the self-attention mechanism in the Transformer model? 4 | 5 | ### ✅ Answer 6 | 7 | The purpose of scaling in the self-attention mechanism of a Transformer model is to stabilize training by controlling the magnitude of the dot-product values between query and key vectors. Without scaling, the dot products can grow large in high-dimensional spaces, resulting in extremely small gradients that slow down or destabilize learning. 8 | 9 | To prevent this, the dot products are divided by the square root of the key dimension, which keeps the attention scores in a range that leads to more effective gradient flow and stable optimization. Thus, the scaling step ensures that the self-attention mechanism can effectively focus on relevant tokens of the input sequence without numerical issues during training. 10 | 11 | ## 📌 Q20: Why does the Transformer model use multiple self-attention heads instead of a single self-attention head? 12 | 13 | ### ✅ Answer 14 | 15 | Transformer models use multiple self-attention heads instead of a single one because they enable the model to capture diverse relationships and patterns in the input. This is similar to using multiple flashlights to illuminate different aspects of a scene. 16 | 17 | Using a single head would restrict the model to one form of relationship, limiting its capacity to fully understand the input data. Multi-head attention, by allowing the model to capture diverse relationships, results in richer contextual representations and subsequently better performance across NLP tasks. 18 | 19 | ## 📌 Q21: How are the outputs of multiple heads combined and projected back in the multi-head attention in the Transformer model? 20 | 21 | ### ✅ Answer 22 | 23 | In the multi-head attention model of the Transformer, the outputs from multiple self-attention heads are first computed independently, each producing a contextually weighted representation of the input in different subspaces. These outputs are then concatenated along the feature dimension to form a single combined vector. 24 | 25 | This concatenated result is passed through a final linear projection layer, which projects it back into the model’s expected dimension for further processing. This mechanism allows the model to integrate diverse information captured by each head into a unified representation. 26 | 27 | 28 | ## **👨🏻‍💻 Prompt Engineering Techniques Hub** 29 | 30 | This GitHub repo includes implementations of must know 25+ prompt engineering techniques. 31 | 32 | 👉 [Repo link](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) 33 | 34 | Knowledge of prompt engineering techniques is essential for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 
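To see the scaling argument from Q19 numerically, the short sketch below compares softmax over raw dot products with softmax over scores divided by the square root of the key dimension. With unscaled scores the distribution is typically close to one-hot, which is where the tiny-gradient problem comes from; the dimensions and random vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = 512
q = rng.normal(size=d_k)
keys = rng.normal(size=(8, d_k))                   # 8 candidate key vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw = keys @ q                                     # unscaled dot products, std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)                        # variance brought back to roughly 1

print("unscaled max weight:", softmax(raw).max())     # typically near 1.0 -> almost one-hot, tiny gradients
print("scaled   max weight:", softmax(scaled).max())  # noticeably smaller -> softer, trainable distribution
```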
35 | 36 | ![Prompt Engineering Techniques Hub](images/prompt-eng-techniques-hub.jpg) 37 | 38 | 39 | ------------------------------------------------------------------------------------------ 40 | 41 | 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /Interview_QA/QA_25-27.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q25: Explain how masking works in masked self-attention in Transformer. 4 | 5 | ### ✅ Answer 6 | 7 | Masking in masked self-attention (used in the Transformer's decoder) prevents the model from attending to future tokens in the sequence. It's achieved by setting the attention scores for those future positions to a very large negative number (like −∞), so that the softmax function turns them into zero. This ensures that the prediction for the current token only relies on the known preceding tokens, preserving the auto-regressive property. 8 | 9 | ## 📌 Q26: Explain why self-attention in the decoder is referred to as cross-attention. How does it differ from self-attention in the encoder? 10 | 11 | ### ✅ Answer 12 | 13 | Self-attention in the decoder is referred to as cross-attention because it involves attending to the encoder's output while generating each token in the decoder. Specifically, in cross-attention, the queries come from the decoder's previous layer, while the keys and values come from the encoder's output, allowing the decoder to focus on relevant parts of the input sequence. 14 | 15 | This differs from self-attention in the encoder, where the attention mechanism operates solely within the input sequence itself, with queries, keys, and values all derived from the same data. Thus, the encoder’s self-attention captures internal dependencies, while the decoder’s cross-attention links the decoder to the encoder representations for contextual generation. 16 | 17 | ## 📌 Q27: What is the softmax function, and where is it applied in the Transformer model? 18 | 19 | ### ✅ Answer 20 | 21 | The softmax function is a mathematical operation that converts a vector of raw scores (logits) into a probability distribution, where each value lies between 0 and 1, and all values sum to 1. In the Transformer model, softmax is applied in the output layer during token prediction to transform the model's raw score outputs into probabilities over the vocabulary, enabling the selection of the most likely next token. 22 | 23 | Additionally, softmax is used within the self-attention mechanism to normalize the attention weights. This normalization allows the model to weigh the importance of different input tokens effectively. Thus, softmax plays a crucial role in handling probabilities both in self-attention and final predictions.​ 24 | 25 | 26 | **👨🏻‍💻 LLM Engineer Toolkit** 27 | -------------------------------------------------------------------------------------------- 28 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 29 | 30 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 31 | 32 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLM, RAG and Agents. 
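As a small illustration of Q25 and Q27, the sketch below builds a causal mask that sets the scores for future positions to negative infinity and then applies softmax, so each row of the resulting weight matrix assigns zero probability to tokens that appear later in the sequence. The scores are random toy values.

```python
import numpy as np

n = 4                                                    # toy sequence length
scores = np.random.default_rng(2).normal(size=(n, n))    # raw attention scores (illustrative)

mask = np.triu(np.ones((n, n), dtype=bool), k=1)         # True above the diagonal = future positions
scores = np.where(mask, -np.inf, scores)                 # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # softmax turns -inf scores into exactly 0

print(np.round(weights, 2))                              # row i has zeros for every position j > i
```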
33 | 34 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 35 | 36 | 37 | --------------------------------------------------------------------------------------------- 38 | 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /Interview_QA/QA_94-96.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q94: What are the strengths and limitations of full fine-tuning? 4 | 5 | ### ✅ Answer 6 | 7 | Full fine-tuning involves updating all parameters of a pre-trained LLM. Its primary strength is achieving the highest possible performance because the model is fully adapted, potentially leading to state-of-the-art results. However, its major limitations include 8 | 9 | - being computationally expensive and time-consuming, demanding significant GPU resources, large storage for the full model checkpoints, and 10 | 11 | - a high risk of catastrophic forgetting, where the model loses its general knowledge acquired during pre-training. 12 | 13 | ## 📌 Q95: Explain how parameter efficient fine-tuning addresses the limitations of full fine-tuning. 14 | 15 | ### ✅ Answer 16 | 17 | Parameter-efficient fine-tuning (PEFT) addresses the limitations of full fine-tuning by updating only a small subset of model parameters or adding lightweight modules (introducing a minimal number of new, trainable parameters) while keeping the majority of the pre-trained model weights frozen. This approach significantly reduces computational costs, memory usage, and storage requirements compared to full fine-tuning, which updates all parameters and demands high resources. 18 | 19 | PEFT also mitigates the risk of catastrophic forgetting and overfitting that full fine-tuning faces by preserving the original pre-trained knowledge. Additionally, PEFT enables faster training and easier deployment on edge devices or resource-constrained environments, making it more scalable and cost-effective without sacrificing performance. This balance of efficiency and effectiveness makes PEFT a practical solution for adapting large language models across multiple domains and tasks. 20 | 21 | 22 | ## 📌 Q96: When might prompt engineering be preferred over task-specific fine-tuning? 23 | 24 | ### ✅ Answer 25 | 26 | Prompt engineering is generally preferred over task-specific fine-tuning when the task is complex or open-ended, requiring the LLM’s broad general knowledge and in-context learning abilities to solve it. 27 | 28 | It is also the better choice when data for fine-tuning is scarce or if rapid iteration and experimentation are needed, as modifying a prompt is significantly faster and more resource-efficient than fine-tuning the model. Furthermore, if a single LLM must handle a variety of diverse tasks , prompt engineering is preferred, as it avoids creating and deploying a separate fine-tuned model for each. 29 | 30 | ## **🚀 AIxFunda Newsletter (free)** 31 | 32 | 33 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 34 | 35 | - ✨ Weekly GenAI updates. 36 | - 📄 Weekly LLM, Agents and RAG paper updates. 37 | - 📝 1 fresh blog post on an interesting topic every week. 
38 | 39 | 👉 [Subcribe Now](https://aixfunda.substack.com/) 40 | 41 | --------------------------------------------------------------------------------------------- 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /Interview_QA/QA_46-48.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q46: How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter? 4 | 5 | ### ✅ Answer 6 | 7 | Beam Search improves upon Greedy Search by exploring a wider set of possibilities. It keeps track of the k most probable partial sequences (where k is the beam width) at each decoding step, instead of just the single most probable one like Greedy Search. This significantly increases the chances of finding a globally better sequence by mitigating the risk of getting stuck with locally optimal, yet ultimately suboptimal, early choices. 8 | 9 | The beam width (k) parameter determines the number of hypotheses maintained; a larger beam width explores more options and generally yields better results at the cost of higher computational complexity and slower decoding speed. 10 | 11 | ## 📌 Q47: When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case. 12 | 13 | ### ✅ Answer 14 | 15 | A deterministic strategy like Beam Search is preferable when consistency, reliability, and reproducibility of outputs are crucial. It systematically explores the most probable token sequences, making it ideal for tasks such as machine translation, summarization, or code generation where accuracy and coherence outweigh diversity. 16 | 17 | For example, in legal or medical summarization, Beam Search ensures factual consistency and avoids random variations that sampling-based methods might introduce. 18 | 19 | ## 📌 Q48: Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search. 20 | 21 | ### ✅ Answer 22 | 23 | Greedy Search is computationally cheap because it only considers the single best token at each step, resulting in a fast decoding time with minimal memory use. However, it often leads to a sub-optimal output sequence because it cannot recover from locally good, but globally poor, choices. 24 | 25 | Conversely, Beam Search maintains k (the beam width) of the most promising partial sequences at each step, significantly increasing the computational cost and decoding time and memory usage. However, this wider exploration of the search space dramatically improves the probability of finding a higher-quality, more globally optimal sequence. 26 | 27 | 28 | ## **☕ Support the Author** 29 | 30 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 31 | 32 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 33 | 34 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 35 | 36 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 
🙏 37 | 38 | — Kalyan KS 39 | 40 | --------------------------------------------------------------------------------------------- 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /Interview_QA/QA_112-114.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q112: In the context of LLM pretraining, what is scaling law? 4 | 5 | ### ✅ Answer 6 | 7 | In the context of LLM pretraining, scaling laws describe the predictable relationship between a model’s performance and its key factors—such as model size (number of parameters), dataset size, and compute resources. Empirical studies show that as these factors increase, model performance improves following a power-law trend until diminishing returns appear. 8 | 9 | This law provides a crucial guide for efficiently designing and allocating resources for large-scale model training by predicting the optimal balance of data, parameters, and compute needed to achieve a target performance level. 10 | 11 | ## 📌 Q113: Explain the concept of Mixture-of-Experts (MoE) architecture and its role in LLM pretraining. 12 | 13 | ### ✅ Answer 14 | 15 | The Mixture-of-Experts (MoE) architecture significantly improves Large Language Model (LLM) pretraining efficiency and capacity by replacing the standard feed-forward layers with a set of specialized "expert" networks. A "router" or "gating network" learns to selectively activate a small subset of these experts for each input token. 16 | 17 | This allows the model to (i) dramatically increase its total parameter count and thus its capacity for knowledge and (ii) maintain a low computational cost during inference, as only a fraction of the parameters are used for any given input. This sparsity enables models with billions of parameters to be trained and run more efficiently, facilitating the scaling of LLMs to unprecedented sizes. 18 | 19 | ## 📌 Q114: What is model parallelism, and how is it used in LLM pre-training? 20 | 21 | 22 | ### ✅ Answer 23 | 24 | Model parallelism is a technique used to train large language models that are too big to fit on a single GPU by splitting the model’s parameters across multiple devices. Instead of each GPU holding a full model copy, different GPUs handle different layers or parts of the same layer. 25 | 26 | During forward and backward passes, activations and gradients are communicated between GPUs to complete computation. This allows efficient utilization of hardware for massive models. However, it requires careful coordination to minimize communication overhead and latency. 27 | 28 | **☕ Support the Author** 29 | ------------------------------------------------------------------------------------------- 30 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 31 | 32 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 33 | 34 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 35 | 36 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 
🙏 37 | 38 | — Kalyan KS 39 | 40 | --------------------------------------------------------------------------------------------- 41 | 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /Interview_QA/QA_1-3.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q1: CNNs and RNNs don’t use positional embeddings. Why do transformers use positional embeddings? 4 | 5 | ### ✅ Answer 6 | 7 | CNNs and RNNs don’t use positional embeddings because these models inherently capture positional information through convolutional filters or recurrence over time steps. However, the Transformer model uses the self-attention mechanism, which processes all tokens in parallel without any notion of sequence or order. 8 | 9 | Positional embeddings inject explicit information about each token’s position in the sequence, allowing transformers to understand word order and relative distances between tokens. This helps them model sequential dependencies and sentence structure effectively, compensating for the parallel and order-agnostic nature of self-attention mechanisms. 10 | 11 | ## 📌 Q2: Tell me the basic steps involved in running an inference query on an LLM. 12 | 13 | ### ✅ Answer 14 | 15 | Running an inference query on an LLM involves the following steps: 16 | 17 | - Tokenization: The input text is first broken down into tokens, and then the tokens are mapped to their IDs in the model’s vocabulary. 18 | 19 | - Prefill Phase: The embedding layer converts these token IDs into embeddings, and then the decoder layers process these embeddings simultaneously to compute intermediate states (keys and values). This parallel processing establishes the context for generating new tokens. 20 | 21 | - Decoding Phase: The model generates output tokens one at a time autoregressively, each based on previously generated tokens and cached states, continuing until a stopping condition is met. 22 | 23 | - Detokenization: Finally, the generated tokens are converted back into human-readable text. 24 | 25 | ## 📌 Q3: Explain how KV Cache accelerates LLM inference. 26 | 27 | ### ✅ Answer 28 | 29 | KV Cache speeds up LLM inference by storing the attention key (K) and value (V) representations that were computed for previous tokens. This allows the model to reuse them instead of recalculating them at each decoding step. During text generation, only the query for the latest token is computed and then combined with the cached K and V vectors to produce the next token, significantly reducing redundant computation. 30 | 31 | This dramatically speeds up inference—often delivering several-fold faster token generation. However, the cache consumes substantial GPU memory, which is why techniques like KV cache offloading and compression are used to balance speed and memory efficiency in large-scale LLM serving. 32 | 33 | ## **👨🏻‍💻 LLM Engineer Toolkit** 34 | 35 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 36 | 37 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 38 | 39 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLMs, RAG and Agents. 
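To make the KV cache idea from Q3 concrete, here is a toy single-head decoding loop in NumPy: each step computes keys and values only for the newest token, appends them to the cache, and attends from a single query over everything cached so far. The weights and embeddings are random placeholders, not a real model.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                          # one cached K and V row per past token

def decode_step(x_new):
    """Attend from the newest token only, reusing cached K/V of all previous tokens."""
    k_cache.append(x_new @ W_k)                    # compute K and V just for the new token
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    q = x_new @ W_q                                # single query for the newest position
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V                                   # context vector used to predict the next token

for step in range(5):                              # 5 toy decoding steps
    ctx = decode_step(rng.normal(size=d))
print(len(k_cache), ctx.shape)                     # cache holds 5 entries; (8,) context vector
```

Without the cache, every step would have to recompute K and V for the entire prefix, which is exactly the redundant work the KV cache removes.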
40 | 41 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 42 | 43 | 44 | --------------------------------------------------------------------------------------------- 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /Interview_QA/QA_37-39.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q37: What is latency in LLM inference, and why is it important? 4 | 5 | ### ✅ Answer 6 | 7 | Latency in LLM inference refers to the time delay between submitting an input prompt to the model and receiving the final, complete output. It measures how quickly the model processes input and generates output, typically in milliseconds or seconds. It is a critical performance metric because it directly impacts the user experience and the throughput of applications. 8 | 9 | Low latency is crucial for delivering smooth, real-time experiences in chatbots, coding assistants, or search systems. High latency can lead to slow, frustrating interactions, especially in real-time or conversational systems. 10 | 11 | ## 📌 Q38: What is batch inference, and how does it differ from single-query inference? 12 | 13 | ### ✅ Answer 14 | 15 | In batch inference, LLMs process and generate outputs for a large set of accumulated input data (the "batch") all at once. Batch inference is highly efficient for offline or asynchronous tasks like bulk document classification or generating recommendations overnight. This contrasts with single-query inference (or real-time inference), where the model processes individual data points as they arrive, providing responses with very low latency. 16 | 17 | Single-query inference is typically required for interactive, user-facing applications like chatbots. The key difference lies in latency requirements and throughput optimization: batch inference prioritizes high throughput (processing a lot of data quickly) while sacrificing low latency, whereas single-query inference prioritizes minimal latency. 18 | 19 | ## 📌 Q39: How does batching generally help with LLM inference efficiency? 20 | 21 | ### ✅ Answer 22 | 23 | Batching significantly improves LLM inference efficiency by enabling parallel processing of multiple requests, which maximizes GPU utilization. Instead of processing each query sequentially, batching groups requests together to leverage the full compute capacity of hardware, leading to higher throughput (tokens-per-second). 24 | 25 | Continuous batching further enhances efficiency by using iteration-level scheduling, where new requests can replace completed ones within a batch without waiting for all sequences to finish, achieving much higher throughput improvements. This approach transforms GPU utilization from underutilized sequential processing to optimized parallel execution, making it essential for production LLM serving. 26 | 27 | 28 | **👨🏻‍💻 LLM Engineer Toolkit** 29 | -------------------------------------------------------------------------------------------- 30 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 31 | 32 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 33 | 34 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLM, RAG and Agents. 
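A back-of-the-envelope illustration of the batching argument in Q38 and Q39, using an entirely made-up cost model (a fixed per-step cost for reading weights and launching kernels plus a small per-request term): throughput climbs steeply with batch size while per-request latency grows much more slowly.

```python
def step_time_ms(batch_size, base_ms=30.0, per_request_ms=2.0):
    """Hypothetical cost model: fixed per-step overhead plus a small per-request cost."""
    return base_ms + per_request_ms * batch_size

for bs in (1, 4, 16, 64):
    t = step_time_ms(bs)
    print(f"batch={bs:3d}  latency={t:6.1f} ms  throughput={1000 * bs / t:7.1f} req/s")
```

The actual numbers for a real deployment depend on the model, hardware, and serving stack; the point of the sketch is only the shape of the trade-off.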
35 | 36 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 37 | 38 | --------------------------------------------------------------------------------------------- 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Interview_QA/QA_10-12.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q10: What is the computational complexity of self-attention in the Transformer model? 4 | 5 | ### ✅ Answer 6 | 7 | The computational complexity of the self-attention mechanism in the Transformer model is primarily quadratic with respect to the sequence length $n$. Formally, it is $O(n^2 \cdot d)$. The $O(n^2 \cdot d)$ term comes from computing the attention scores between all pairs of tokens. Since there are $n \times n = n^2$ pairwise interactions, and each involves a dot product of $d$-dimensional vectors, this results in $n^2 \cdot d$ operations. 8 | 9 | The quadratic complexity with respect to sequence length $n$ makes self-attention computationally expensive for long sequences. However, this design allows the model to capture relationships between all tokens simultaneously, which is crucial for its performance. Despite its quadratic complexity, it remains faster than traditional RNNs in practice due to parallelization benefits. 10 | 11 | 12 | ## 📌 Q11: How does the Transformer model address the vanishing gradient problem? 13 | 14 | ### ✅ Answer 15 | 16 | Transformers address the vanishing gradient problem primarily through the use of residual (skip) connections and layer normalization. Residual connections allow gradients to flow directly through the network by adding the input of a layer to its output, preventing gradients from shrinking during backpropagation. 17 | 18 | Layer normalization helps stabilize training by keeping activations within a consistent range, which controls the scale of gradients. Additionally, the self-attention mechanism enables direct information flow between tokens, reducing the depth of dependency and further mitigating gradient vanishing. 19 | 20 | ## 📌 Q12: What is tokenization, and why is it necessary in LLMs? 21 | 22 | ### ✅ Answer 23 | 24 | Tokenization is the process of converting raw text into smaller, discrete units called tokens, which can be words, sub-words, or characters. This step is necessary in LLMs because models cannot directly process raw strings of text; they require numerical input. 25 | 26 | Tokenization allows mapping the input text to a vocabulary of known tokens. These tokens are then converted into numerical representations called embeddings. These embeddings are then used by LLMs for training and inference, fundamentally enabling input text processing. 27 | 28 | 29 | **☕ Support the Author** 30 | ------------------------------------------------------------------------------------------- 31 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 32 | 33 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this.
34 | 35 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 36 | 37 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 38 | 39 | — Kalyan KS 40 | 41 | --------------------------------------------------------------------------------------------- 42 | 43 | 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /Interview_QA/QA_106-108.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q106: Explain different preference alignment methods and their trade-offs. 4 | 5 | ### ✅ Answer 6 | 7 | Different preference alignment methods in LLMs include Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), Odds Ratio Preference Optimization (ORPO), and Kahneman-Tversky Optimization (KTO). 8 | 9 | 1. RLHF uses a reward model trained on human-labeled examples to guide model behavior but is computationally expensive and complex. 10 | 11 | 2. DPO simplifies this by directly optimizing the model on preference data without a separate reward model, offering efficiency and stability but depending heavily on data quality. 12 | 13 | 3. ORPO combines task objectives and preference alignment in one loss, improving training efficiency but increasing implementation complexity. 14 | 15 | 4. KTO uses binary good/bad labels, is robust to noisy data, and is simple to label, yet it may lack granularity for nuanced alignment. 16 | 17 | Trade-offs revolve around complexity, computational resources, data requirements, and alignment precision, with the optimal method chosen based on specific use cases and resource constraints.​ 18 | 19 | ## 📌 Q107: What is gradient accumulation, and how does it help with fine-tuning large models? 20 | 21 | ### ✅ Answer 22 | 23 | Gradient accumulation is a technique used in fine-tuning large language models (LLMs) that helps manage GPU memory limitations by simulating larger batch sizes. Instead of updating model weights after processing each mini-batch, gradients are accumulated over several smaller batches, and the model parameters are updated once after the accumulation. 24 | 25 | This approach enables training with effective large batch sizes even on memory-constrained hardware, improving the stability and performance of the fine-tuning process. It allows fine-tuning of LLMs on less powerful GPUs and reduces hardware cost while maintaining training quality. 26 | 27 | ## 📌 Q108: What are the possible options to speed up LLM fine-tuning? 28 | 29 | ### ✅ Answer 30 | 31 | To speed up LLM fine-tuning, several optimization strategies can be used. Techniques like LoRA (Low-Rank Adaptation) or QLoRA (Quantized LoRA) reduce the number of trainable parameters, significantly lowering GPU memory usage and training time. 32 | 33 | Additionally, mixed precision training (FP16/BF16) for faster computation, gradient accumulation for simulating larger batch sizes, and distributed training across multiple GPUs achieve quicker convergence and shorter training times. These approaches drastically cut down both computational cost and the fine-tuning time. 34 | 35 | ## **🚀 AIxFunda Newsletter (free)** 36 | 37 | 38 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 
39 | 40 | - ✨ Weekly GenAI updates. 41 | - 📄 Weekly LLM, Agents and RAG paper updates. 42 | - 📝 1 fresh blog post on an interesting topic every week. 43 | 44 | 👉 [Subcribe Now](https://aixfunda.substack.com/) 45 | 46 | --------------------------------------------------------------------------------------------- 47 | 48 | 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /Interview_QA/QA_61-63.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q61: What is speculative decoding, and when would you use it? 4 | 5 | ### ✅ Answer 6 | 7 | Speculative decoding accelerates LLM inference by pairing a smaller, faster "draft" model with a larger "target" model. The draft model speculatively generates multiple future tokens ahead of time, which the target model then verifies in parallel, accepting those matching its own predictions and correcting others. 8 | 9 | This draft-then-verify approach reduces the sequential bottleneck of generating tokens one-by-one, improving GPU utilization and decreasing latency without sacrificing output quality. It is particularly useful in latency-sensitive applications like chatbots and code completion, where both speed and accuracy are critical. 10 | 11 | ## 📌 Q62: What are the challenges in performing distributed inference across multiple GPUs? 12 | 13 | ### ✅ Answer 14 | 15 | The main challenges in performing distributed inference across multiple GPUs include memory management, communication overhead, workload balancing, and phase-specific resource needs. 16 | 17 | 1. Memory management is crucial because large models often do not fit into a single GPU, requiring model partitioning or sharding across GPUs. 18 | 19 | 2. Communication overhead arises from the synchronization of parameters and intermediate data between GPUs, which can add significant latency. 20 | 21 | 3. Workload balancing is needed to ensure that no single GPU becomes a bottleneck while others are underutilized, requiring effective parallelism strategies. 22 | 23 | 4. Lastly, different phases of inference, such as prefill (compute-bound) and decode (memory-bound), demand distinct GPU resources, complicating efficient resource allocation and orchestration. 24 | 25 | These challenges demand careful orchestration and optimization to maximize throughput and minimize latency in multi-GPU distributed inference systems. 26 | 27 | ## 📌 Q63: How would you design a scalable LLM inference system for real-time applications? 28 | 29 | ### ✅ Answer 30 | 31 | A scalable LLM inference system for real-time applications should use model sharding and distributed serving frameworks like vLLM to parallelize inference across multiple GPUs or nodes. The system should implement request batching, dynamic load balancing, and asynchronous processing to optimize GPU utilization and reduce latency. 32 | 33 | Caching frequent prompts or embeddings further speeds responses, while autoscaling policies ensure resource efficiency during traffic spikes. Incorporating quantization and distillation can reduce model size and improve real-time performance without major accuracy loss. 34 | 35 | 36 | ## **👨🏻‍💻 LLM Engineer Toolkit** 37 | 38 | 🤖 This repository contains a curated list of 120+ LLM, RAG and Agent related libraries category wise. 
39 | 40 | 👉 [Repo link](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) 41 | 42 | This repository is highly useful for Data Scientists, AI/ML Engineers working with LLM, RAG and Agents. 43 | 44 | ![LLM Engineer Toolkit](images/llm-engineer-toolkit.jpg) 45 | 46 | 47 | --------------------------------------------------------------------------------------------- 48 | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /Interview_QA/QA_88-90.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q88: What are the different phases in LLM development? 4 | 5 | ### ✅ Answer 6 | 7 | The typical development process for a Large Language Model (LLM) involves three primary phases: 8 | - pre-training, where the model learns foundational language understanding from massive, general datasets through self-supervised tasks like predicting the next word; 9 | - followed by instruction-tuning, where the pre-trained model is fine-tuned on diverse examples of instructions and corresponding desired outputs to make it better at following instructions and performing specific tasks; 10 | - and finally, alignment tuning, which further refines the model's behavior to align with human values, preferences, and safety standards, resulting in more helpful and harmless responses. 11 | 12 | ## 📌 Q89: What are the different types of LLM fine-tuning? 13 | 14 | ### ✅ Answer 15 | 16 | Fine-tuning, in the context of LLMs, is a broad term that can refer to instruction fine-tuning, task-specific fine-tuning, or alignment tuning. 17 | 18 | 1. Instruction fine-tuning involves fine-tuning the model on a dataset of high-quality (instruction, response) pairs to improve its ability to follow instructions and generalize across various tasks. 19 | 20 | 2. Task-specific fine-tuning involves fine-tuning the model on a dataset tailored to a single, specific downstream application (e.g., sentiment analysis, text summarization) to maximize performance on that particular task. 21 | 22 | 3. Alignment Tuning involves fine-tuning the model using Reinforcement Learning (RL) to adjust the model's behavior to be safe, helpful, and align with human values and preferences. 23 | 24 | ## 📌 Q90: What role does instruction tuning play in improving an LLM’s usability? 25 | 26 | ### ✅ Answer 27 | 28 | Instruction tuning greatly improves how well a large language model (LLM) understands and follows user directions in the input prompt. While raw, pretrained LLMs trained only to predict the next word are good at continuing text. But they often struggle to follow explicit instructions. 29 | 30 | Instruction fine-tuning fixes this by training the model on high-quality pairs of prompts and responses. Through exposure to diverse examples, the model learns to correctly interpret the user instructions and perform the given tasks. To summarize, instruction tuning makes LLMs better at understanding the user prompts and producing desired outputs. 31 | 32 | **☕ Support the Author** 33 | ------------------------------------------------------------------------------------------- 34 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 35 | 36 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. 
If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 37 | 38 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 39 | 40 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 41 | 42 | — Kalyan KS 43 | 44 | --------------------------------------------------------------------------------------------- 45 | 46 | 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /Interview_QA/QA_4-6.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q4: How does quantization affect inference speed and memory requirements? 4 | 5 | ### ✅ Answer 6 | 7 | Quantization reduces the numerical precision of LLM's parameters, such as converting 32-bit floating-point weights to lower bit widths like 8-bit integers or 4-bit values. This compression dramatically decreases the model's memory footprint, often by 50-75% or more, enabling deployment on smaller devices and allowing more data to fit into cache memory. 8 | 9 | Simultaneously, quantization speeds up inference because lower-precision operations require fewer computational resources and can be processed faster by modern hardware optimized for such formats. 10 | 11 | However, this efficiency gain may come at a small cost to model accuracy, necessitating calibration or quantization-aware training to minimize performance degradation. Overall, quantization is a vital technique for reducing memory usage and improving inference speed in practical LLM applications. 12 | 13 | ## 📌 Q5: How do you handle the large memory requirements of KV cache in LLM inference? 14 | 15 | ### ✅ Answer 16 | 17 | Handling large KV Cache memory requirements in LLM inference is crucial for high-throughput serving and long context windows, as the cache size grows linearly with sequence length and batch size. We can handle large memory requirements of KV Cache in LLM inference using techniques like PagedAttention, Grouped-Query, Multi-Query Attention, Quantization, or Cache Offloading. 18 | 19 | PagedAttention manages the cache using fixed-size blocks like an operating system's virtual memory, reduces memory fragmentation, and enables higher batch sizes. Grouped-Query Attention (GQA) or Multi-Query Attention (MQA) decreases the size of the KV cache by reducing the number of Key/Value heads. 20 | 21 | Finally, strategies such as quantizing the KV cache (e.g., to INT8) or offloading inactive cache data to cheaper storage (like CPU RAM) also help manage the memory footprint effectively. 22 | 23 | ## 📌 Q6: After tokenization, how are tokens converted into embeddings in the Transformer model? 24 | 25 | ### ✅ Answer 26 | 27 | Tokens, represented as integer IDs after tokenization, are converted into embeddings using a lookup table called an embedding matrix. For every token in the model's vocabulary, this matrix has a dense, fixed-size vector (the embedding). 28 | 29 | Specifically, the token ID is used as an index to retrieve its corresponding high-dimensional vector from the embedding matrix. This retrieved vector is the numerical representation that captures the token's semantic and syntactic meaning. 
The embeddings of input tokens are then processed by the subsequent transformer layers. 30 | 31 | ## **🚀 AIxFunda Newsletter (free)** 32 | 33 | 34 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 35 | 36 | - ✨ Weekly GenAI updates. 37 | - 📄 Weekly LLM, Agents and RAG paper updates. 38 | - 📝 1 fresh blog post on an interesting topic every week. 39 | 40 | 👉 [Subscribe Now](https://aixfunda.substack.com/) 41 | 42 | --------------------------------------------------------------------------------------------- 43 | 44 | 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /Interview_QA/QA_64-66.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q64: Explain the role of flash attention in reducing memory bottlenecks during inference. 4 | 5 | ### ✅ Answer 6 | 7 | Flash Attention plays a crucial role in reducing memory bottlenecks during inference by optimizing the way the models handle attention computations. Traditional attention mechanisms incur high memory overhead due to frequent data transfers between slower high bandwidth memory (HBM) and faster but smaller on-chip SRAM, repeatedly loading and writing keys, queries, and values for each step. 8 | 9 | Flash Attention significantly reduces memory bottlenecks during LLM inference by moving the computation of the large, intermediate attention scores matrix ($QK^T$) from the slow High Bandwidth Memory (HBM) to the faster, on-chip SRAM. It achieves this by using a technique called tiling, where the attention calculation is broken into smaller blocks and computed incrementally, meaning the full, massive attention matrix is never explicitly materialized in the slower HBM. 10 | 11 | This approach greatly decreases memory access latency and reduces computational overhead, enabling faster and more memory-efficient inference, especially beneficial for long input sequences. 12 | 13 | ## 📌 Q65: What is continuous batching, and how does it differ from static batching? 14 | 15 | ### ✅ Answer 16 | 17 | Static batching involves grouping a fixed number of requests together and processing them simultaneously. Its main drawback is poor efficiency, as all requests must wait for the single longest sequence to finish, resulting in idle GPU time and increased latency. Continuous batching is a superior, dynamic technique that operates at the token generation level, immediately replacing a completed request with a new one. 18 | 19 | This key difference ensures the GPU is constantly utilized, dramatically boosting overall throughput and reducing latency. Static batching is often preferred in offline scenarios where latency is less important, while continuous batching shines in online, interactive applications. 20 | 21 | ## 📌 Q66: What is mixed precision (e.g., FP16) and why is it used during inference? 22 | 23 | ### ✅ Answer 24 | 25 | Mixed precision is a technique that uses a combination of different numerical formats, typically using the 16-bit floating-point format (FP16) for most computations, alongside higher-precision formats like FP32 where necessary for numerical stability. It is used during inference to significantly reduce both memory consumption and computational time.
26 | 27 | Halving the bit-width cuts the model's memory consumption by roughly half, which allows either larger models to fit onto the GPU or a larger batch size. Crucially, modern hardware like NVIDIA Tensor Cores can execute FP16 operations significantly faster, thus boosting overall throughput with minimal loss in model accuracy. 28 | 29 | 30 | ## **🚀 AIxFunda Newsletter (free)** 31 | 32 | 33 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 34 | 35 | - ✨ Weekly GenAI updates. 36 | - 📄 Weekly LLM, Agents and RAG paper updates. 37 | - 📝 1 fresh blog post on an interesting topic every week. 38 | 39 | 👉 [Subscribe Now](https://aixfunda.substack.com/) 40 | 41 | --------------------------------------------------------------------------------------------- 42 | 43 | 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /Interview_QA/QA_100-102.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q100: When should you prefer task-specific fine-tuning over prompt engineering? 4 | 5 | ### ✅ Answer 6 | 7 | You should opt for task-specific fine-tuning when prompt engineering alone doesn’t deliver the desired level of performance. This often happens in the case of complex, specialized tasks in fields like law or medicine, where the base model may lack the needed deep domain knowledge. 8 | 9 | Fine-tuning is also ideal when you need low-latency, cost-efficient inference in large-scale production systems, as it produces a smaller and more optimized model than one relying on lengthy prompts. Lastly, fine-tuning offers greater control over the model’s behavior, making it easier to ensure consistency, which prompting alone can’t always guarantee. 10 | 11 | ## 📌 Q101: What is LoRA, and how does it work at a high level? 12 | 13 | ### ✅ Answer 14 | 15 | LoRA (Low-Rank Adaptation) is a Parameter-Efficient Fine-Tuning (PEFT) technique that allows you to fine-tune LLMs by updating only a small number of additional parameters instead of modifying all the model’s weights. 16 | 17 | At a high level, LoRA works by injecting small trainable low-rank matrices into specific layers of the model (typically the attention or feedforward layers). Instead of updating the large weight matrix $W$, LoRA keeps $W$ frozen and adds a low-rank decomposition $\Delta W = AB$, where $A$ and $B$ are much smaller matrices of rank $r \ll d$. During training, only $A$ and $B$ are learned, significantly reducing memory and compute costs. 18 | 19 | At inference, the adapted weights are effectively merged with the original model weights, so the model behaves as if it were fully fine-tuned but with far fewer parameters trained. This makes LoRA both efficient (in storage and computation) and modular. 20 | 21 | 22 | ## 📌 Q102: Explain the key ingredient behind the effectiveness of the LoRA technique. 23 | 24 | ### ✅ Answer 25 | 26 | The key ingredient behind the effectiveness of the LoRA technique lies in its low-rank decomposition of weight updates, which captures task-specific information within a much smaller subspace of the model’s full parameter space.
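
The mechanics can be sketched with a small, illustrative PyTorch layer (a toy version for intuition, not the implementation used by PEFT libraries; the rank r = 8 and layer size 512 are arbitrary choices):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: frozen base weight W plus a trainable low-rank update AB."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W (and its bias) stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))        # zero init, so the update is 0 at the start
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank path x (AB)^T, scaled by alpha / r.
        return self.base(x) + (x @ self.B.T @ self.A.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 8192 trainable parameters vs 270848 in total
```

Only A and B receive gradients, so gradient and optimizer-state memory shrink accordingly, which is where most of the savings come from.
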
By expressing the weight change as the product of two small matrices, LoRA enables efficient fine-tuning with minimal memory and computational overhead. 27 | 28 | This approach leverages the observation that large language models have redundant parameters, and most adaptations can be represented in a low-dimensional form. As a result, LoRA achieves performance comparable to full fine-tuning while training only a tiny fraction of the model’s parameters. 29 | 30 | **☕ Support the Author** 31 | ------------------------------------------------------------------------------------------- 32 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 33 | 34 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 35 | 36 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 37 | 38 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 39 | 40 | — Kalyan KS 41 | 42 | --------------------------------------------------------------------------------------------- 43 | 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /Interview_QA/QA_43-45.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q43: What are the different decoding strategies in LLMs? 4 | 5 | ### ✅ Answer 6 | 7 | Decoding strategies in LLMs determine how tokens are selected during text generation. The main strategies are: 8 | 9 | - Greedy Decoding - Always picks the token with the highest probability at each step. It’s fast but can produce repetitive or suboptimal text. 10 | 11 | - Beam Search - Keeps multiple best candidate sequences (“beams”) at each step and chooses the most likely overall sequence. It improves quality but increases computation. 12 | 13 | - Top-k Sampling - Randomly samples the next token from the top k most probable tokens, adding diversity and reducing repetition. 14 | 15 | - Top-p (Nucleus) Sampling - Randomly samples from the smallest set of tokens whose cumulative probability ≥ p (e.g., 0.9). 16 | 17 | - Temperature Sampling - Adjusts randomness by scaling logits before softmax — higher temperature (>1) makes outputs more random; lower (<1) makes them more deterministic. 18 | 19 | - Speculative Decoding - Uses a smaller draft model to predict multiple tokens ahead and quickly verifies them with the main model, greatly speeding up generation without losing quality. 20 | 21 | ## 📌 Q44: Explain the impact of the decoding strategy on LLM-generated output quality and latency. 22 | 23 | ### ✅ Answer 24 | 25 | Decoding strategies critically impact LLM output by balancing quality (coherence, diversity, and relevance) and latency (generation speed). Deterministic methods like Greedy Search are fast due to selecting the highest probability token at each step but often yield repetitive, lower-quality text. Beam Search improves quality by exploring multiple token sequences but increases latency due to managing several beams. 
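
As a rough illustration (the model id is a placeholder; the keyword arguments follow the Hugging Face `generate` API), the same prompt can be decoded with deterministic or stochastic settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-llm"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The trade-off between quality and latency is", return_tensors="pt")

# Greedy decoding: always take the argmax token (fast, deterministic).
greedy_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)

# Beam search: keep the 4 most likely partial sequences (better quality, slower).
beam_ids = model.generate(**inputs, max_new_tokens=40, num_beams=4, do_sample=False)

# Nucleus (top-p) sampling with temperature: stochastic, more diverse output.
sampled_ids = model.generate(
    **inputs, max_new_tokens=40, do_sample=True, top_p=0.9, temperature=0.8
)

print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```
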
26 | 27 | Stochastic decoding methods, such as Top-k or Top-p (nucleus) sampling, generate text token by token, choosing each next token probabilistically rather than deterministically. While this increases creativity and diversity, it also introduces extra computational overhead (sorting or cumulative sum) at every decoding step, which increases latency. 28 | 29 | Newer, faster techniques like Speculative Decoding reduce latency by using a smaller model to draft tokens, which the larger model verifies in parallel, offering significant speedups while aiming to preserve the high output quality of the original model. 30 | 31 | ## 📌 Q45: Explain the greedy search decoding strategy and its main drawback. 32 | 33 | ### ✅ Answer 34 | 35 | The greedy search decoding strategy in LLMs is a straightforward method where the model selects the token with the highest probability as the next token at each step of the generation process. This approach is highly efficient and deterministic, always producing the same output for a given input. Its primary drawback, however, is its shortsightedness. 36 | 37 | By focusing only on the locally optimal choice, it often fails to find the sequence with the globally highest overall probability, potentially leading to repetitive, less coherent, or suboptimal text. 38 | 39 | ## **🚀 AIxFunda Newsletter (free)** 40 | 41 | 42 | Join 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials related to Generative AI, LLMs, Agents and RAG. 43 | 44 | - ✨ Weekly GenAI updates. 45 | - 📄 Weekly LLM, Agents and RAG paper updates. 46 | - 📝 1 fresh blog post on an interesting topic every week. 47 | 48 | 👉 [Subscribe Now](https://aixfunda.substack.com/) 49 | 50 | --------------------------------------------------------------------------------------------- 51 | 52 | 53 | 54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /Interview_QA/QA_70-72.md: -------------------------------------------------------------------------------- 1 | Authored by **Kalyan KS**. You can follow him on [Twitter](https://x.com/kalyan_kpl) and [LinkedIn](https://www.linkedin.com/in/kalyanksnlp/) for the latest LLM, RAG and Agent updates. 2 | 3 | ## 📌 Q70: How do you measure LLM inference performance? 4 | 5 | ### ✅ Answer 6 | 7 | LLM inference performance is primarily measured using latency and throughput metrics to gauge the model's speed and efficiency under load. Key metrics include Time To First Token (TTFT), which measures how long it takes for the model to produce the first token after receiving a prompt, impacting user experience in real-time applications. 8 | 9 | Time Per Output Token (TPOT) assesses the average time to generate each subsequent token, influencing the smoothness of the output stream. Overall latency is the total time from input submission to the complete response. 10 | 11 | Throughput, another crucial metric, measures how many tokens or requests the system can handle per unit time, indicating its scalability. These metrics together help assess how fast, responsive, and scalable an LLM is in practical deployment scenarios. 12 | 13 | ## 📌 Q71: What are the different LLM inference engines available? Which one do you prefer? 14 | 15 | ### ✅ Answer 16 | 17 | The most prominent LLM inference engines today are: 18 | 19 | - vLLM, which excels in high-throughput and memory efficiency via PagedAttention and continuous batching. 

20 | 21 | - NVIDIA TensorRT-LLM, which offers peak performance (lowest latency) by optimizing specifically for NVIDIA GPUs with custom CUDA kernels. 22 | 23 | - Hugging Face Text Generation Inference (TGI), a robust, production-ready solution well-integrated with the Hugging Face ecosystem. 24 | 25 | - Other engines include LMDeploy and llama.cpp (for CPU/edge devices). 26 | 27 | My preference leans towards vLLM due to its excellent balance of high throughput, ease of use (Hugging Face compatibility), and good hardware flexibility. These features make vLLM ideal for most scalable cloud-based serving environments. 28 | 29 | ## 📌 Q72: What are the challenges in LLM inference? 30 | 31 | ### ✅ Answer 32 | 33 | The main challenges in LLM inference are high latency, computational intensity, memory constraints, token limits, accuracy issues including hallucinations, and scalability concerns. 34 | 35 | 1. High latency occurs because LLMs generate output token-by-token, creating delays in real-time applications. 36 | 37 | 2. Computational intensity means that running LLMs requires powerful and expensive hardware, leading to high operational costs. 38 | 39 | 3. Memory constraints limit the deployment of LLMs on devices with restricted memory capacity. 40 | 41 | 4. Token limits restrict input size, often necessitating truncation that can reduce context understanding. 42 | 43 | 5. Accuracy issues such as hallucinations can compromise output reliability. 44 | 45 | 6. Scalability remains a challenge in handling many concurrent requests without performance degradation. 46 | 47 | **☕ Support the Author** 48 | ------------------------------------------------------------------------------------------- 49 | I hope you found this “LLM Interview Questions and Answers Hub” highly useful. 50 | 51 | I’ve made this freely available to help the AI and NLP community grow and to support learners like you. If you found it helpful and would like to show your appreciation, you can buy me a coffee to keep me motivated in creating more free resources like this. 52 | 53 | 👉 [Buy Me a Coffee](https://ko-fi.com/kalyanksnlp) 54 | 55 | Your small gesture goes a long way in supporting my work—thank you for being part of this journey! 🙏 56 | 57 | — Kalyan KS 58 | 59 | --------------------------------------------------------------------------------------------- 60 | 61 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 
22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 LLM Interview Questions and Answers Hub 2 | This repository includes 100+ LLM interview questions with answers. 3 | ![AIxFunda Newsletter](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/images/2-.jpg) 4 | 5 | 6 | ## Related Repositories 7 | - 🚀[Prompt Engineering Techniques Hub](https://github.com/KalyanKS-NLP/Prompt-Engineering-Techniques-Hub) - 25+ prompt engineering techniques with LangChain implementations. 8 | - 👨🏻‍💻 [LLM Engineer Toolkit](https://github.com/KalyanKS-NLP/llm-engineer-toolkit) - Categories wise collection of 120+ LLM, RAG and Agent related libraries. 9 | - 🩸[LLM, RAG and Agents Survey Papers Collection](https://github.com/KalyanKS-NLP/LLM-Survey-Papers-Collection) - Category wise collection of 200+ survey papers. 10 | 11 | | # | Question | Answer | 12 | |---|---------|--------| 13 | | Q1 | CNNs and RNNs don’t use positional embeddings. Why do transformers use positional embeddings? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_1-3.md) | 14 | | Q2 | Tell me the basic steps involved in running an inference query on an LLM. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_1-3.md) | 15 | | Q3 | Explain how KV Cache accelerates LLM inference. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_1-3.md) | 16 | | Q4 | How does quantization affect inference speed and memory requirements? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_4-6.md) | 17 | | Q5 | How do you handle the large memory requirements of KV cache in LLM inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_4-6.md) | 18 | | Q6 | After tokenization, how are tokens converted into embeddings in the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_4-6.md) | 19 | | Q7 | Explain why subword tokenization is preferred over word-level tokenization in the Transformer model. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_7-9.md) | 20 | | Q8 | Explain the trade-offs in using a large vocabulary in LLMs. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_7-9.md) | 21 | | Q9 | Explain how self-attention is computed in the Transformer model step by step. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_7-9.md) | 22 | | Q10 | What is the computational complexity of self-attention in the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_10-12.md) | 23 | | Q11 | How do Transformer models address the vanishing gradient problem? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_10-12.md) | 24 | | Q12 | What is tokenization, and why is it necessary in LLMs? 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_10-12.md) | 25 | | Q13 | Explain the role of token embeddings in the Transformer model. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_13-15.md) | 26 | | Q14 | Explain the working of the embedding layer in the Transformer model. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_13-15.md) | 27 | | Q15 | What is the role of self-attention in the Transformer model, and why is it called “self-attention”? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_13-15.md) | 28 | | Q16 | What is the purpose of the encoder in a Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_16-18.md) | 29 | | Q17 | What is the purpose of the decoder in a Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_16-18.md) | 30 | | Q18 | How does the encoder-decoder structure work at a high level in the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_16-18.md) | 31 | | Q19 | What is the purpose of scaling in the self-attention mechanism in the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_19-21.md) | 32 | | Q20 | Why does the Transformer model use multiple self-attention heads instead of a single self-attention head? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_19-21.md) | 33 | | Q21 | How are the outputs of multiple heads combined and projected back in the multi-head attention in the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_19-21.md) | 34 | | Q22 | How does masked self-attention differ from regular self-attention, and where is it used in a Transformer? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_22-24.md) | 35 | | Q23 | Discuss the pros and cons of the self-attention mechanism in the Transformer model. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_22-24.md) | 36 | | Q24 | What is the purpose of masked self-attention in the Transformer decoder? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_22-24.md) | 37 | | Q25 | Explain how masking works in masked self-attention in Transformer. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_25-27.md) | 38 | | Q26 | Explain why self-attention in the decoder is referred to as cross-attention. How does it differ from self-attention in the encoder? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_25-27.md) | 39 | | Q27 | What is the softmax function, and where is it applied in Transformers? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_25-27.md) | 40 | | Q28 | What is the purpose of residual (skip) connections in Transformer layers? 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_28-30.md) | 41 | | Q29 | Why is layer normalization used, and where is it applied in Transformers? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_28-30.md) | 42 | | Q30 | What is cross-entropy loss, and how is it applied during Transformer training? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_28-30.md) | 43 | | Q31 | Compare Transformers and RNNs in terms of handling long-range dependencies. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_31-33.md) | 44 | | Q32 | What are the fundamental limitations of the Transformer model? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_31-33.md) | 45 | | Q33 | How do Transformers address the limitations of CNNs and RNNs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_31-33.md) | 46 | | Q34 | How do Transformer models address the vanishing gradient problem? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_34-36.md) | 47 | | Q35 | What is the purpose of the position-wise feed-forward sublayer? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_34-36.md) | 48 | | Q36 | Can you briefly explain the difference between LLM training and inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_34-36.md) | 49 | | Q37 | What is latency in LLM inference, and why is it important? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_37-39.md) | 50 | | Q38 | What is batch inference, and how does it differ from single-query inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_37-39.md) | 51 | | Q39 | How does batching generally help with LLM inference efficiency? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_37-39.md) | 52 | | Q40 | Explain the trade-offs between batching and latency in LLM serving. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_40-42.md) | 53 | | Q41 | How can techniques like mixture-of-experts (MoE) optimize inference efficiency? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_40-42.md) | 54 | | Q42 | Explain the role of decoding strategy in LLM text generation. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_40-42.md) | 55 | | Q43 | What are the different decoding strategies in LLMs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_43-45.md) | 56 | | Q44 | Explain the impact of the decoding strategy on LLM-generated output quality and latency. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_43-45.md) | 57 | | Q45 | Explain the greedy search decoding strategy and its main drawback. 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_43-45.md) | 58 | | Q46 | How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_46-48.md) | 59 | | Q47 | When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_46-48.md) | 60 | | Q48 | Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_46-48.md) | 61 | | Q49 | When you set the temperature to 0.0, which decoding strategy are you using? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_49-51.md) | 62 | | Q50 | How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_49-51.md) | 63 | | Q51 | Explain the criteria for choosing different decoding strategies. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_49-51.md) | 64 | | Q52 | Compare deterministic and stochastic decoding methods in LLMs. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_52-54.md) | 65 | | Q53 | What is the role of the context window during LLM inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_52-54.md) | 66 | | Q54 | Explain the pros and cons of large and small context windows in LLM inference. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_52-54.md) | 67 | | Q55 | What is the purpose of temperature in LLM inference, and how does it affect the output? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_55-57.md) | 68 | | Q56 | What is autoregressive generation in the context of LLMs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_55-57.md) | 69 | | Q57 | Explain the strengths and limitations of autoregressive text generation in LLMs. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_55-57.md) | 70 | | Q58 | Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs). | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_58-60.md) | 71 | | Q59 | Do you prefer DLMs or LLMs for latency-sensitive applications? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_58-60.md) | 72 | | Q60 | Explain the concept of token streaming during inference. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_58-60.md) | 73 | | Q61 | What is speculative decoding, and when would you use it? 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_61-63.md) | 74 | | Q62 | What are the challenges in performing distributed inference across multiple GPUs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_61-63.md) | 75 | | Q63 | How would you design a scalable LLM inference system for real-time applications? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_61-63.md) | 76 | | Q64 | Explain the role of Flash Attention in reducing memory bottlenecks. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_64-66.md) | 77 | | Q65 | What is continuous batching, and how does it differ from static batching? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_64-66.md) | 78 | | Q66 | What is mixed precision, and why is it used during inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_64-66.md) | 79 | | Q67 | Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_67-69.md) | 80 | | Q68 | Explain the throughput vs latency trade-off in LLM inference. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_67-69.md) | 81 | | Q69 | What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_67-69.md) | 82 | | Q70 | How do you measure LLM inference performance? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_70-72.md) | 83 | | Q71 | What are the different LLM inference engines available? Which one do you prefer? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_70-72.md) | 84 | | Q72 | What are the challenges in LLM inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_70-72.md) | 85 | | Q73 | What are the possible options for accelerating LLM inference? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_73-75.md) | 86 | | Q74 | What is Chain-of-Thought prompting, and when is it useful? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_73-75.md) | 87 | | Q75 | Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_73-75.md) | 88 | | Q76 | Explain the trade-offs in using CoT prompting. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_76-78.md) | 89 | | Q77 | What is prompt engineering, and why is it important for LLMs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_76-78.md) | 90 | | Q78 | What is the difference between zero-shot and few-shot prompting? 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_76-78.md) | 91 | | Q79 | What are the different approaches for choosing examples for few-shot prompting? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_79-81.md) | 92 | | Q80 | Why is context length important when designing prompts for LLMs? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_79-81.md) | 93 | | Q81 | What is a system prompt, and how does it differ from a user prompt? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_79-81.md) | 94 | | Q82 | What is In-Context Learning (ICL), and how is few-shot prompting related? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_82-84.md) | 95 | | Q83 | What is self-consistency prompting, and how does it improve reasoning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_82-84.md) | 96 | | Q84 | Why is context important in prompt design? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_82-84.md) | 97 | | Q85 | Describe a strategy for reducing hallucinations via prompt design. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_85-87.md) | 98 | | Q86 | How would you structure a prompt to ensure the LLM output is in a specific format, like JSON? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_85-87.md) | 99 | | Q87 | Explain the purpose of ReAct prompting in AI agents. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_85-87.md) | 100 | | Q88 | What are the different phases in LLM development? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_88-90.md) | 101 | | Q89 | What are the different types of LLM fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_88-90.md) | 102 | | Q90 | What role does instruction tuning play in improving an LLM’s usability? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_88-90.md) | 103 | | Q91 | What role does alignment tuning play in improving an LLM's usability? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_91-93.md) | 104 | | Q92 | How do you prevent overfitting during fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_91-93.md) | 105 | | Q93 | What is catastrophic forgetting, and why is it a concern in fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_91-93.md) | 106 | | Q94 | What are the strengths and limitations of full fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_94-96.md) | 107 | | Q95 | Explain how parameter efficient fine-tuning addresses the limitations of full fine-tuning. 
| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_94-96.md) | 108 | | Q96 | When might prompt engineering be preferred over task-specific fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_94-96.md) | 109 | | Q97 | When should you use fine-tuning vs RAG? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_97-99.md) | 110 | | Q98 | What are the limitations of using RAG over fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_97-99.md) | 111 | | Q99 | What are the limitations of fine-tuning compared to RAG? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_97-99.md) | 112 | | Q100 | When should you prefer task-specific fine-tuning over prompt engineering? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_100-102.md) | 113 | | Q101 | What is LoRA, and how does it work? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_100-102.md) | 114 | | Q102 | Explain the key ingredient behind the effectiveness of the LoRA technique. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_100-102.md) | 115 | | Q103 | What is QLoRA, and how does it differ from LoRA? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_103-105.md) | 116 | | Q104 | When would you use QLoRA instead of standard LoRA? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_103-105.md) | 117 | | Q105 | How would you handle LLM fine-tuning on consumer hardware with limited GPU memory? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_103-105.md) | 118 | | Q106 | Explain different preference alignment methods and their trade-offs. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_106-108.md) | 119 | | Q107 | What is gradient accumulation, and how does it help with fine-tuning large models? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_106-108.md) | 120 | | Q108 | What are the possible options to speed up LLM fine-tuning? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_106-108.md) | 121 | | Q109 | Explain the pretraining objective used in LLM pretraining. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_109-111.md) | 122 | | Q110 | What is the difference between causal language modeling and masked language modeling? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_109-111.md) | 123 | | Q111 | How do LLMs handle out-of-vocabulary (OOV) words? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_109-111.md) | 124 | | Q112 | In the context of LLM pretraining, what is scaling law? 

| [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_112-114.md) | 125 | | Q113 | Explain the concept of Mixture-of-Experts (MoE) architecture and its role in LLM pretraining. | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_112-114.md) | 126 | | Q114 | What is model parallelism, and how is it used in LLM pre-training? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_112-114.md) | 127 | | Q115 | What is the significance of self-supervised learning in LLM pretraining? | [Answer](https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_115-117.md) | 128 | 129 | 130 | ## ⭐️ Star History 131 | 132 | [![Star History Chart](https://api.star-history.com/svg?repos=KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub&type=Date)](https://star-history.com/#) 133 | 134 | Please consider giving a star, if you find this repository useful. 135 | 136 | --------------------------------------------------------------------------------