├── .github └── FUNDING.yml ├── .gitignore ├── LICENSE ├── README.md ├── images ├── andrej_on_learning.png └── cover.png ├── resources ├── blogs.md ├── books.md ├── courses.md ├── datasets.md ├── papers.md ├── projects.md └── videos-and-podcasts.md └── weekly-notes └── week1.md /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2] 4 | patreon: researchrookie/membership 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | llm_env/ 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | share/python-wheels/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | MANIFEST 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .nox/ 45 | .coverage 46 | .coverage.* 47 | .cache 48 | nosetests.xml 49 | coverage.xml 50 | *.cover 51 | *.py,cover 52 | .hypothesis/ 53 | .pytest_cache/ 54 | cover/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | db.sqlite3 64 | db.sqlite3-journal 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | .pybuilder/ 78 | target/ 79 | 80 | # Jupyter Notebook 81 | .ipynb_checkpoints 82 | 83 | # IPython 84 | profile_default/ 85 | ipython_config.py 86 | 87 | # pyenv 88 | # For a library or package, you might want to ignore these files since the code is 89 | # intended to run in multiple environments; otherwise, check them in: 90 | # .python-version 91 | 92 | # pipenv 93 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 94 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 95 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 96 | # install all needed dependencies. 97 | #Pipfile.lock 98 | 99 | # poetry 100 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 101 | # This is especially recommended for binary packages to ensure reproducibility, and is more 102 | # commonly ignored for libraries. 103 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 104 | #poetry.lock 105 | 106 | # pdm 107 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 
108 | #pdm.lock 109 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 110 | # in version control. 111 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 112 | .pdm.toml 113 | .pdm-python 114 | .pdm-build/ 115 | 116 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 117 | __pypackages__/ 118 | 119 | # Celery stuff 120 | celerybeat-schedule 121 | celerybeat.pid 122 | 123 | # SageMath parsed files 124 | *.sage.py 125 | 126 | # Environments 127 | .env 128 | .venv 129 | env/ 130 | venv/ 131 | ENV/ 132 | env.bak/ 133 | venv.bak/ 134 | 135 | # Spyder project settings 136 | .spyderproject 137 | .spyproject 138 | 139 | # Rope project settings 140 | .ropeproject 141 | 142 | # mkdocs documentation 143 | /site 144 | 145 | # mypy 146 | .mypy_cache/ 147 | .dmypy.json 148 | dmypy.json 149 | 150 | # Pyre type checker 151 | .pyre/ 152 | 153 | # pytype static type analyzer 154 | .pytype/ 155 | 156 | # Cython debug symbols 157 | cython_debug/ 158 | 159 | # PyCharm 160 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 161 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 162 | # and can be added to the global gitignore or merged into this file. For a more nuclear 163 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
164 | #.idea/ 165 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Data Science with Harshit 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AI Research Program 2 | 3 | ![AI Research Program](./images/cover.png) 4 | 5 | An independent AI research program created by [Harshit](https://researchrookie.substack.com/). 6 | 7 | I am an educator at ♥️ and I love to deconstruct topics to better understand and teach them. I am now embarking on a journey to re-learn the concepts of AI via this AI Research Program. 
8 | 9 | I have designed this program to be thorough, exciting, gamified and hands-on. The program also draws inspiration from various Research Engineer roles at companies like [OpenAI](https://openai.com/careers/research-engineer/), [Anthropic](https://boards.greenhouse.io/anthropic/jobs/4019739008), [Meta](https://www.metacareers.com/v2/jobs/2780884748740452/), and the like. 10 | 11 | The following curriculum is designed to help me get a deeper, more thorough understanding of all important concepts in AI. 12 | 13 | ## Table of contents 14 | - [AI Research Program](#ai-research-program) 15 | - [Table of contents](#table-of-contents) 16 | - [My learning approach](#my-learning-approach) 17 | - [How to consume this?](#how-to-consume-this) 18 | - [Pillar 1: Foundational Knowledge](#pillar-1-foundational-knowledge) 19 | - [Pillar 2: Programming and Tools](#pillar-2-programming-and-tools) 20 | - [Pillar 3: Deep Learning Fundamentals](#pillar-3-deep-learning-fundamentals) 21 | - [Pillar 4: Reinforcement Learning](#pillar-4-reinforcement-learning) 22 | - [Pillar 5: NLP / LLM Research](#pillar-5-nlp--llm-research) 23 | - [Pillar 6: Research Paper Analysis and Replication](#pillar-6-research-paper-analysis-and-replication) 24 | - [Pillar 7: High-Performance AI Systems and Applications](#pillar-7-high-performance-ai-systems-and-applications) 25 | - [Pillar 8: Large-scale ETL and Data Engineering](#pillar-8-large-scale-etl-and-data-engineering) 26 | - [Pillar 9: Ethical AI and Responsible Development](#pillar-9-ethical-ai-and-responsible-development) 27 | - [Pillar 10: Community Engagement and Networking](#pillar-10-community-engagement-and-networking) 28 | - [Pillar 11: Research and Publication](#pillar-11-research-and-publication) 29 | - [Resources](#resources) 30 | 31 | ## My learning approach 32 | I have a background in Data Science and ML, and I've decided to take a recursive approach (going deeper as needed) based on my level of understanding. 
33 | 34 | My primary purpose in starting this program is to dig deeper into **LLM research and engineering problems**. 35 | 36 | So, I'll start from Pillar 5 (NLP / LLM Research): how LLMs work, how they are pre-trained, and how to fine-tune them. In the process, I will hop onto other Pillars to learn and re-learn other fundamental concepts. 37 | 38 | Keep an eye on [weekly-notes](./weekly-notes) for my learnings and thoughts. 39 | 40 | I'll dedicate at least 1 hour daily to my learning. 41 | 42 | > ⚠️ This program is still under development and will be updated as I progress in my journey. Feel free to contribute to it by opening issues or suggesting improvements. 43 | 44 | ## How to consume this? 45 | - Beginners should start with the [Foundational Knowledge](#pillar-1-foundational-knowledge) section. 46 | - For all others who have some exposure to ML/DL, tweak the curriculum as per your level of understanding. 47 | - This curriculum is not supposed to be consumed in a linear fashion. Treat it more like a checklist. 48 | - You should start with the basics and gradually move towards more complex topics as per your goals and interests. 49 | 50 | To follow me in this journey, consider subscribing to my: 51 | - [weekly newsletter](https://researchrookie.substack.com/). 
52 | - Learning vlogs at [youtube](https://www.youtube.com/@DataSciencewithHarshit) 53 | - [Discord for discussions and sharing your learnings](https://discord.gg/ux6K7wEu) 54 | 55 | 56 | ## Pillar 1: Foundational Knowledge 57 | 58 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 59 | |-------|-----------|---------------------|----------------------|---------------------| 60 | | Linear Algebra | Vectors, matrices, eigenvalues | MIT OpenCourseWare, 3Blue1Brown videos | Implement basic operations, solve systems of equations | Quiz, problem set, subjective evaluation | 61 | | Calculus | Derivatives, integrals, gradients | Khan Academy, Coursera | Optimization problems, gradient descent implementation | Quiz, problem set, subjective evaluation | 62 | | Probability | Distributions, Bayes theorem | Stanford CS109 | Probabilistic modeling project | Quiz, project review, subjective evaluation | 63 | | Statistics | Hypothesis testing, regression | StatQuest videos, EdX courses | Data analysis project | Project review, subjective evaluation | 64 | | Information Theory | Entropy, mutual information | Elements of Information Theory (book) | Implement compression algorithm | Project review, subjective evaluation | 65 | 66 | ## Pillar 2: Programming and Tools 67 | 68 | | Skill | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 69 | |-------|-----------|---------------------|----------------------|---------------------| 70 | | Python | Data structures, OOP, functional programming | Python.org tutorials, Real Python | Build a data processing pipeline | Code review, subjective evaluation | 71 | | PyTorch | Tensors, autograd, distributed training | PyTorch documentation, Fast.ai course | Implement and distribute a neural network | Project review, subjective evaluation | 72 | | Git | Version control, branching, merging | Git documentation, GitHub Learning Lab | Contribute to an open-source project | Contribution 
review, subjective evaluation | 73 | | Linux | Command line, shell scripting, OS internals | Linux Journey, "Operating Systems: Three Easy Pieces" book | Implement a simple process scheduler | Practical test, code review, subjective evaluation | 74 | | Cloud Platforms | AWS, GCP, or Azure basics | Cloud provider documentation | Deploy a scalable web application | Project review, subjective evaluation | 75 | | High Performance Computing | GPU programming, CUDA | NVIDIA CUDA tutorials | Optimize a neural network for GPU computation | Performance benchmarks, subjective evaluation | 76 | | Kubernetes | Container orchestration, scaling | Kubernetes documentation, KodeKloud courses | Deploy and scale an ML model using Kubernetes | Project review, subjective evaluation | 77 | 78 | ## Pillar 3: Deep Learning Fundamentals 79 | 80 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 81 | |-------|-----------|---------------------|----------------------|---------------------| 82 | | Neural Networks | Perceptrons, activation functions, backpropagation | Deep Learning book by Goodfellow et al. 
| Implement a multi-layer perceptron | Code review, subjective evaluation | 83 | | CNNs | Convolution, pooling, modern architectures | CS231n course materials | Build an image classification model | Project review, subjective evaluation | 84 | | RNNs | LSTM, GRU, sequence modeling | D2L.ai tutorials | Implement a language model | Code review, subjective evaluation | 85 | | Attention Mechanisms | Self-attention, multi-head attention | "Attention Is All You Need" paper | Implement and optimize a new attention mechanism | Performance analysis, subjective evaluation | 86 | | Transformer Variants | Transformer-XL, Reformer, Performer | Recent papers on arXiv | Compare compute efficiency of two Transformer variants | Comparative analysis, subjective evaluation | 87 | | Optimization Techniques | SGD, Adam, learning rate schedules | Sebastian Ruder's blog | Compare optimizers on a benchmark task | Report review, subjective evaluation | 88 | | Regularization | Dropout, batch normalization, data augmentation | Papers with Code | Implement and compare regularization techniques | Project review, subjective evaluation | 89 | 90 | ## Pillar 4: Reinforcement Learning 91 | 92 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 93 | |-------|-----------|---------------------|----------------------|---------------------| 94 | | RL Fundamentals | MDP, value functions, policy gradients | Sutton & Barto book, David Silver's course | Implement and solve classic RL problems (e.g., CartPole, Mountain Car) | Algorithm performance, subjective evaluation | 95 | | Deep RL | DQN, A3C, PPO | OpenAI Spinning Up | Implement a deep RL algorithm for a complex environment (e.g., Atari games) | Agent performance, subjective evaluation | 96 | | RL for NLP | RLHF, sequence generation | Recent papers (e.g., InstructGPT) | Fine-tune a language model using RL | Model improvement, subjective evaluation | 97 | 98 | ## Pillar 5: NLP / LLM Research 99 | | Topic | 
Subtopics | Learning Resources | Projects / Assignments | Evaluation Criteria | 100 | |-------|-----------|---------------------|------------------------|----------------------| 101 | | 1. Foundations | - Word Vectors/Embeddings
- Tokenization
- Preprocessing
- Data Sampling | - Book: "Speech and Language Processing" by Jurafsky & Martin
- Course: Stanford CS224N: NLP with Deep Learning
- Paper: "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. | - Implement word2vec from scratch
- Build a custom tokenizer for a specific language or domain
- Create a data preprocessing pipeline for a large text corpus | - Accuracy of word embeddings on analogy tasks
- Efficiency and coverage of tokenization
- Quality and cleanliness of preprocessed data | 102 | | 2. Classical NLP Techniques | - Hidden Markov Models
- Naive Bayes
- Maximum Entropy Markov Models
- Conditional Random Fields | - Book: "Foundations of Statistical Natural Language Processing" by Manning & Schütze
- Tutorial: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by Rabiner | - Implement a part-of-speech tagger using HMMs
- Build a spam classifier using Naive Bayes
- Develop a named entity recognition system using CRFs | - Accuracy on standard POS tagging datasets
- Precision, recall, and F1 score for spam classification
- F1 score on CoNLL 2003 NER dataset | 103 | | 3. Neural Architectures | - Feed-forward Neural Networks
- Recurrent Neural Networks
- Convolutional Neural Networks
- Attention Mechanisms
- Transformers | - Book: "Deep Learning" by Goodfellow, Bengio, & Courville
- Paper: "Attention Is All You Need" by Vaswani et al.
- Course: fast.ai Practical Deep Learning for Coders | - Implement a sentiment analysis model using CNNs
- Build a language model using LSTMs
- Create a machine translation system using Transformers | - Accuracy on sentiment analysis benchmarks (e.g., IMDb)
- Perplexity of language model on test set
- BLEU score for machine translation | 104 | | 4. Language Models | - N-gram Models
- Neural Language Models
- Autoregressive vs. Autoencoder Models
- Large Language Models (LLMs)
- Vision-Language Models (VLMs) | - Paper: "Language Models are Few-Shot Learners" (GPT-3)
- Blog: "The Illustrated Transformer" by Jay Alammar
- Course: Hugging Face NLP Course | - Fine-tune GPT-2 for text generation
- Implement a BERT-based question answering system
- Create a multimodal model for image captioning | - Perplexity and cross-entropy loss
- F1 and Exact Match scores for QA
- BLEU, METEOR, and CIDEr scores for image captioning | 105 | | 5. Advanced LLM Concepts | - LLM Alignment
- Token Sampling Methods
- Context Length Extension
- Personalization | - Paper: "Constitutional AI: Harmlessness from AI Feedback" by Anthropic
- Blog: "How to sample from language models" by Hugging Face
- Paper: "LoRA: Low-Rank Adaptation of Large Language Models" | - Implement different decoding strategies (greedy, beam search, top-k, top-p)
- Develop a method to extend context length of a pre-trained LLM
- Create a personalized language model using adapters | - Human evaluation of model alignment
- Quality and diversity of generated text
- Perplexity on long-context tasks
- Personalization accuracy on user-specific tasks | 106 | | 6. NLP Applications | - Machine Translation
- Named Entity Recognition
- Textual Entailment
- Retrieval Augmented Generation (RAG)
- Document Intelligence | - Paper: "Neural Machine Translation by Jointly Learning to Align and Translate"
- Tutorial: spaCy's Named Entity Recognition
- Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" | - Build an end-to-end neural machine translation system
- Develop a RAG system for question answering
- Create a document understanding system for invoice processing | - BLEU, METEOR scores for MT
- F1 score for NER
- Accuracy on textual entailment datasets (e.g., SNLI)
- Relevance and accuracy of RAG responses | 107 | | 7. Knowledge Representation | - Knowledge Graphs
- Semantic Networks | - Book: "Knowledge Representation and Reasoning" by Brachman & Levesque
- Tutorial: Neo4j Graph Database | - Construct a knowledge graph from a text corpus
- Develop a question answering system using a knowledge graph | - Coverage and accuracy of extracted knowledge
- Precision and recall of graph-based QA system | 108 | | 8. Challenges and Mitigation | - Hallucination Mitigation
- AI Text Detection
- Bias Detection and Mitigation | - Paper: "TruthfulQA: Measuring How Models Mimic Human Falsehoods"
- Blog: "Detecting Machine-Generated Text" by OpenAI
- Paper: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" | - Develop a fact-checking system for LLM outputs
- Create an AI text detector
- Implement bias mitigation techniques in word embeddings | - Reduction in hallucination rate
- Accuracy of AI text detection
- Reduction in bias measures (e.g., WEAT score) | 109 | | 9. Evaluation and Benchmarking | - LLM/VLM Benchmarks
- Task-specific Metrics | - Paper: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding"
- Tutorial: Hugging Face's Evaluate library | - Evaluate an LLM on multiple NLP tasks using GLUE benchmark
- Implement and compare different evaluation metrics for a specific NLP task | - Performance across multiple benchmarks
- Inter-annotator agreement for human evaluation | 110 | | 10. Practical Aspects | - Large Language Model Ops (LLMOps)
- Ethical Considerations | - Book: "Building Machine Learning Pipelines" by Hapke & Nelson
- Paper: "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" | - Design and implement an LLMOps pipeline
- Conduct an ethical audit of an NLP system | - Efficiency and reliability of deployment pipeline
- Compliance with ethical AI principles | 111 | 112 | ## Pillar 6: Research Paper Analysis and Replication 113 | 114 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria | 115 | |----------|----------|-----------|--------------|---------------------| 116 | | Paper Reading | Weekly reading of latest papers | arXiv, Papers With Code | Paper summaries, critiques | Quality of analysis, subjective evaluation | 117 | | Research Replication | Reproduce results of significant papers | Original papers, open-source implementations | Replicated experiments, results comparison | Accuracy of replication, subjective evaluation | 118 | | Trend Analysis | Identify and analyze research trends | NeurIPS, ICML, ACL proceedings | Trend reports, blog posts | Insight quality, subjective evaluation | 119 | | Paper Presentation | Present papers to peers or online | Conference recordings | Video explanations, slide decks | Presentation quality, subjective evaluation | 120 | | Extension Projects | Extend or combine ideas from papers | Related work sections of papers | Novel experiments, blog posts | Originality, subjective evaluation | 121 | | Visualization Projects | Develop interactive visualizations for ML concepts | D3.js, Plotly | Create an interactive visualization of attention between tokens in a language model | Visualization quality, insight provided, subjective evaluation | 122 | 123 | 124 | ## Pillar 7: High-Performance AI Systems and Applications 125 | 126 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 127 | |-------|-----------|---------------------|----------------------|---------------------| 128 | | Model Optimization | Quantization, pruning, distillation | TensorFlow Model Optimization Toolkit | Optimize a large model for edge devices | Performance benchmarks, subjective evaluation | 129 | | Efficient Architectures | MobileNet, EfficientNet, BERT-tiny | Papers, GitHub implementations | Implement and benchmark 
efficient models | Comparative analysis, subjective evaluation | 130 | | Hardware Acceleration | GPU programming, TPU utilization | NVIDIA Deep Learning Institute | Optimize a model for specific hardware | Performance improvement, subjective evaluation | 131 | | Scalable AI Systems | Microservices, containerization, orchestration | Kubernetes documentation, Docker tutorials | Design a scalable AI service architecture | Architecture review, subjective evaluation | 132 | | ML Ops | CI/CD for ML, model monitoring | Google MLOps guides | Set up an MLOps pipeline | Pipeline effectiveness, subjective evaluation | 133 | 134 | 135 | ## Pillar 8: Large-scale ETL and Data Engineering 136 | 137 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 138 | |-------|-----------|---------------------|----------------------|---------------------| 139 | | ETL Fundamentals | Data extraction, transformation, loading | Udacity Data Engineering course | Design and implement an ETL pipeline | Pipeline efficiency, subjective evaluation | 140 | | Distributed Data Processing | Apache Spark, Hadoop | Spark documentation, Coursera courses | Process and analyze a large dataset using Spark | Processing speed, scalability, subjective evaluation | 141 | | Data Warehousing | Star schema, OLAP | Kimball Group resources | Design a data warehouse for ML experiments | Design review, subjective evaluation | 142 | | Streaming Data | Kafka, Flink | Confluent Kafka tutorials | Implement a real-time data processing pipeline | System performance, subjective evaluation | 143 | 144 | ## Pillar 9: Ethical AI and Responsible Development 145 | 146 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria | 147 | |-------|-----------|---------------------|----------------------|---------------------| 148 | | AI Ethics | Fairness, accountability, transparency | MIT Moral Machine, ethicalML.org | Develop an AI ethics framework | Framework review, 
subjective evaluation | 149 | | Bias in AI | Dataset bias, algorithmic bias | "Gender Shades" paper, AI Fairness 360 | Audit a model for bias | Report review, subjective evaluation | 150 | | Privacy in ML | Differential privacy, federated learning | OpenMined tutorials | Implement privacy-preserving ML | Project review, subjective evaluation | 151 | | Explainable AI | LIME, SHAP, counterfactual explanations | Interpretable Machine Learning book | Develop model explanations | Project review, subjective evaluation | 152 | | AI Governance | Regulations, guidelines, best practices | EU AI Act, IEEE Ethically Aligned Design | Propose AI governance structure | Proposal review, subjective evaluation | 153 | 154 | ## Pillar 10: Community Engagement and Networking 155 | 156 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria | 157 | |----------|----------|-----------|--------------|---------------------| 158 | | Conference Participation | Attend or present at AI conferences | Conference websites, Call for Papers | Presentation slides, trip reports | Engagement quality, subjective evaluation | 159 | | Open Source Contribution | Contribute to AI/ML open source projects | GitHub, PyPI | Code contributions, pull requests | Contribution impact, subjective evaluation | 160 | | AI Community Building | Organize meetups, study groups | Meetup.com, local tech communities | Event reports, community growth metrics | Community impact, subjective evaluation | 161 | | Online Presence | Blog writing, social media engagement | Medium, Twitter, LinkedIn | Blog posts, thread discussions | Reach and engagement, subjective evaluation | 162 | | Collaborative Projects | Partner with other researchers/engineers | Academic collaborations, hackathons | Joint projects, co-authored papers | Collaboration quality, subjective evaluation | 163 | 164 | ## Pillar 11: Research and Publication 165 | 166 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria | 167 | 
|----------|----------|-----------|--------------|---------------------| 168 | | Research Ideation | Brainstorming, literature review | arXiv, Google Scholar | Research proposals | Originality, feasibility, subjective evaluation | 169 | | Experiment Design | Hypothesis formulation, methodology planning | Research design books, mentorship | Experimental protocols | Rigor, subjective evaluation | 170 | | Data Collection and Analysis | Data gathering, statistical analysis | Kaggle datasets, statistical tools | Datasets, analysis reports | Data quality, analysis depth, subjective evaluation | 171 | | Paper Writing | Scientific writing, LaTeX | Overleaf, writing workshops | Draft papers | Writing quality, subjective evaluation | 172 | | Peer Review | Understand and participate in the peer review process | Publons, Elsevier Researcher Academy | Review reports | Review quality, subjective evaluation | 173 | | Publication and Presentation | Submit to journals/conferences, create posters | Journal guidelines, conference deadlines | Published papers, conference posters | Impact factor, presentation quality, subjective evaluation | 174 | 175 | ## Resources 176 | - [Books](/resources/books.md) 177 | - [Courses](/resources/courses.md) 178 | - [Blogs](/resources/blogs.md) 179 | - [Papers](/resources/papers.md) 180 | - [Projects](/resources/projects.md) 181 | - [Videos and Podcasts](/resources/videos-and-podcasts.md) -------------------------------------------------------------------------------- /images/andrej_on_learning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/andrej_on_learning.png -------------------------------------------------------------------------------- /images/cover.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/cover.png -------------------------------------------------------------------------------- /resources/blogs.md: -------------------------------------------------------------------------------- 1 | ## Blogs that I want to read / re-read 2 | 3 | ### Main 4 | --- 5 | - [The AI Research Engineer](https://researchrookie.substack.com/) 6 | - [A High-Level Overview of Large Language Models](https://www.borealisai.com/research-blogs/a-high-level-overview-of-large-language-models/) 7 | - [Applied LLMs - what we've learned from a year of building with LLMs](https://applied-llms.org/) 8 | - 9 | 10 | 11 | ### Tokenization 12 | - [Evolution of tokenization](https://towardsdatascience.com/the-evolution-of-tokenization-in-nlp-byte-pair-encoding-in-nlp-d7621b9c1186) 13 | - [Byte Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/) 14 | - 15 | 16 | 17 | ### Tools / best practices for reading and writing research papers 18 | - [Best AI tools for writing research papers](https://www.reddit.com/r/ArtificialInteligence/comments/1eexqk2/best_ai_tools_for_writing_research_papers_in/) 19 | 20 | 21 | ### Attention 22 | - [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) -------------------------------------------------------------------------------- /resources/books.md: -------------------------------------------------------------------------------- 1 | # Books that I want to read / re-read 2 | 3 | - [Understanding Deep Learning](https://udlbook.github.io/udlbook/) 4 | - [Deep Learning for Coders with fastai and PyTorch](https://github.com/fastai/fastbook) 5 | - [Build a Large Language Model from Scratch](https://livebook.manning.com/book/build-a-large-language-model-from-scratch/) 6 | - -------------------------------------------------------------------------------- /resources/courses.md: 
 -------------------------------------------------------------------------------- 1 | ## Courses that I am going to take 2 | 3 | - -------------------------------------------------------------------------------- /resources/datasets.md: -------------------------------------------------------------------------------- 1 | ## Datasets one can use for training / finetuning / learning 2 | 3 | Sample pretraining dataset: [Paul Graham Essay](https://www.paulgraham.com/greatwork.html) 4 | 5 | - [AI Books4 Dataset for Training LLMs Further](https://web.archive.org/web/20240519104217/https://old.reddit.com/r/datasets/comments/1cvi151/ai_books4_dataset_for_training_llms_further/) 6 | -------------------------------------------------------------------------------- /resources/papers.md: -------------------------------------------------------------------------------- 1 | ## Recommended papers for understanding LLMs 2 | 3 | ### Tokenization 4 | - [Neural Machine Translation of Rare Words with Subword Units (BPE)](https://arxiv.org/pdf/1508.07909) 5 | 6 | 7 | - [30 papers Ilya recommended John Carmack to read](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE) 8 | - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 9 | - [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/) -------------------------------------------------------------------------------- /resources/projects.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/projects.md -------------------------------------------------------------------------------- /resources/videos-and-podcasts.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/videos-and-podcasts.md 
-------------------------------------------------------------------------------- /weekly-notes/week1.md: -------------------------------------------------------------------------------- 1 | # Week 1 - Building an LLM 2 | 3 | ## Dataset preparation 4 | - Tokenization 5 | - Data Sampling 6 | - Data loading 7 | - Embedding --------------------------------------------------------------------------------
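The Week 1 notes list tokenization and data sampling as dataset-preparation steps without showing what they look like. Below is a minimal sketch of how the two might fit together for next-token prediction. It is an illustration, not code from this repository: the whitespace split stands in for a real subword tokenizer such as BPE, and the names `build_vocab`, `sliding_windows`, `context_length`, and `stride` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct token to an integer id (sorted for determinism)."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


def sliding_windows(
    token_ids: List[int], context_length: int, stride: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (input, target) pairs; each target is its input moved one token ahead."""
    pairs = []
    # Stop early enough that the target window still fits inside token_ids.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()  # assumption: whitespace split instead of a subword tokenizer
vocab = build_vocab(tokens)
token_ids = [vocab[tok] for tok in tokens]

pairs = sliding_windows(token_ids, context_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target window is the input window moved one position ahead, so position `i` of the input is trained to predict position `i` of the target. A `stride` smaller than `context_length` makes consecutive windows overlap, which yields more training pairs from the same text. In a real pipeline these pairs would typically be wrapped in a dataset and batched by a data loader before the embedding layer.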