├── .github
    └── FUNDING.yml
├── .gitignore
├── LICENSE
├── README.md
├── images
    ├── andrej_on_learning.png
    └── cover.png
├── resources
    ├── blogs.md
    ├── books.md
    ├── courses.md
    ├── datasets.md
    ├── papers.md
    ├── projects.md
    └── videos-and-podcasts.md
└── weekly-notes
    └── week1.md


/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 | 
3 | github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
4 | patreon: researchrookie/membership
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | llm_env/
  2 | 
  3 | # Byte-compiled / optimized / DLL files
  4 | __pycache__/
  5 | *.py[cod]
  6 | *$py.class
  7 | 
  8 | # C extensions
  9 | *.so
 10 | 
 11 | # Distribution / packaging
 12 | .Python
 13 | build/
 14 | develop-eggs/
 15 | dist/
 16 | downloads/
 17 | eggs/
 18 | .eggs/
 19 | lib/
 20 | lib64/
 21 | parts/
 22 | sdist/
 23 | var/
 24 | wheels/
 25 | share/python-wheels/
 26 | *.egg-info/
 27 | .installed.cfg
 28 | *.egg
 29 | MANIFEST
 30 | 
 31 | # PyInstaller
 32 | #  Usually these files are written by a python script from a template
 33 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 34 | *.manifest
 35 | *.spec
 36 | 
 37 | # Installer logs
 38 | pip-log.txt
 39 | pip-delete-this-directory.txt
 40 | 
 41 | # Unit test / coverage reports
 42 | htmlcov/
 43 | .tox/
 44 | .nox/
 45 | .coverage
 46 | .coverage.*
 47 | .cache
 48 | nosetests.xml
 49 | coverage.xml
 50 | *.cover
 51 | *.py,cover
 52 | .hypothesis/
 53 | .pytest_cache/
 54 | cover/
 55 | 
 56 | # Translations
 57 | *.mo
 58 | *.pot
 59 | 
 60 | # Django stuff:
 61 | *.log
 62 | local_settings.py
 63 | db.sqlite3
 64 | db.sqlite3-journal
 65 | 
 66 | # Flask stuff:
 67 | instance/
 68 | .webassets-cache
 69 | 
 70 | # Scrapy stuff:
 71 | .scrapy
 72 | 
 73 | # Sphinx documentation
 74 | docs/_build/
 75 | 
 76 | # PyBuilder
 77 | .pybuilder/
 78 | target/
 79 | 
 80 | # Jupyter Notebook
 81 | .ipynb_checkpoints
 82 | 
 83 | # IPython
 84 | profile_default/
 85 | ipython_config.py
 86 | 
 87 | # pyenv
 88 | #   For a library or package, you might want to ignore these files since the code is
 89 | #   intended to run in multiple environments; otherwise, check them in:
 90 | # .python-version
 91 | 
 92 | # pipenv
 93 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 94 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 95 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 96 | #   install all needed dependencies.
 97 | #Pipfile.lock
 98 | 
 99 | # poetry
100 | #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
101 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
102 | #   commonly ignored for libraries.
103 | #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
104 | #poetry.lock
105 | 
106 | # pdm
107 | #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
108 | #pdm.lock
109 | #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
110 | #   in version control.
111 | #   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
112 | .pdm.toml
113 | .pdm-python
114 | .pdm-build/
115 | 
116 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
117 | __pypackages__/
118 | 
119 | # Celery stuff
120 | celerybeat-schedule
121 | celerybeat.pid
122 | 
123 | # SageMath parsed files
124 | *.sage.py
125 | 
126 | # Environments
127 | .env
128 | .venv
129 | env/
130 | venv/
131 | ENV/
132 | env.bak/
133 | venv.bak/
134 | 
135 | # Spyder project settings
136 | .spyderproject
137 | .spyproject
138 | 
139 | # Rope project settings
140 | .ropeproject
141 | 
142 | # mkdocs documentation
143 | /site
144 | 
145 | # mypy
146 | .mypy_cache/
147 | .dmypy.json
148 | dmypy.json
149 | 
150 | # Pyre type checker
151 | .pyre/
152 | 
153 | # pytype static type analyzer
154 | .pytype/
155 | 
156 | # Cython debug symbols
157 | cython_debug/
158 | 
159 | # PyCharm
160 | #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
161 | #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
162 | #  and can be added to the global gitignore or merged into this file.  For a more nuclear
163 | #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
164 | #.idea/
165 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Data Science with Harshit
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # AI Research Program
  2 | 
  3 | ![AI Research Program](./images/cover.png)
  4 | 
  5 | An independent AI research program created by [Harshit](https://researchrookie.substack.com/).
  6 | 
  7 | I am an educator at ♥️ and I love to deconstruct topics to better understand and teach them. I am now embarking on a journey to re-learn the concepts of AI via this AI Research Program.
  8 | 
  9 | I have designed this program to be thorough, exciting, gamified and hands-on. The program also draws inspiration from various Research Engineer roles at companies like [OpenAI](https://openai.com/careers/research-engineer/), [Anthropic](https://boards.greenhouse.io/anthropic/jobs/4019739008), [Meta](https://www.metacareers.com/v2/jobs/2780884748740452/), and the like.
 10 | 
 11 | The following curriculum is designed to help me get a deeper, more thorough understanding of the important concepts in AI.
 12 | 
 13 | ## Table of contents
 14 | - [AI Research Program](#ai-research-program)
 15 |   - [Table of contents](#table-of-contents)
 16 |   - [My learning approach](#my-learning-approach)
 17 |   - [How to consume this?](#how-to-consume-this)
 18 |   - [Pillar 1: Foundational Knowledge](#pillar-1-foundational-knowledge)
 19 |   - [Pillar 2: Programming and Tools](#pillar-2-programming-and-tools)
 20 |   - [Pillar 3: Deep Learning Fundamentals](#pillar-3-deep-learning-fundamentals)
 21 |   - [Pillar 4: Reinforcement Learning](#pillar-4-reinforcement-learning)
 22 |   - [Pillar 5: NLP / LLM Research](#pillar-5-nlp--llm-research)
 23 |   - [Pillar 6: Research Paper Analysis and Replication](#pillar-6-research-paper-analysis-and-replication)
 24 |   - [Pillar 7: High-Performance AI Systems and Applications](#pillar-7-high-performance-ai-systems-and-applications)
 25 |   - [Pillar 8: Large-scale ETL and Data Engineering](#pillar-8-large-scale-etl-and-data-engineering)
 26 |   - [Pillar 9: Ethical AI and Responsible Development](#pillar-9-ethical-ai-and-responsible-development)
 27 |   - [Pillar 10: Community Engagement and Networking](#pillar-10-community-engagement-and-networking)
 28 |   - [Pillar 11: Research and Publication](#pillar-11-research-and-publication)
 29 |   - [Resources](#resources)
 30 | 
 31 | ## My learning approach
 32 | I have a background in Data Science and ML, and I've decided to take a recursive approach (going deeper as needed) based on my level of understanding.
 33 | 
 34 | My primary purpose in starting this program is to dig deeper into **LLM research and engineering problems**.
 35 | 
 36 | So, I'll start from Pillar 5 (NLP / LLM Research): how LLMs work, how they are pre-trained, and how to fine-tune them. In the process, I will hop onto other Pillars to learn and re-learn other fundamental concepts.
 37 | 
 38 | Keep an eye on [weekly-notes](./weekly-notes) for my learnings and thoughts.
 39 | 
 40 | I'll dedicate at least 1 hour daily to my learning.
 41 | 
 42 | > ⚠️ This program is still under development and will be updated as I progress in my journey. Feel free to contribute to it by opening issues or suggesting improvements.
 43 | 
 44 | ## How to consume this?
 45 | - Beginners should start with the [Foundational Knowledge](#pillar-1-foundational-knowledge) section.
 46 | - For all others who have some exposure to ML/DL, tweak the curriculum as per your level of understanding.
 47 | - This curriculum is not supposed to be consumed in a linear fashion. Treat it more like a checklist. 
 48 | - You should start with the basics and gradually move towards more complex topics as per your goals and interests.
 49 | 
 50 | To follow me in this journey, consider subscribing to my:
 51 | - [Weekly newsletter](https://researchrookie.substack.com/)
 52 | - Learning vlogs on [YouTube](https://www.youtube.com/@DataSciencewithHarshit)
 53 | - [Discord for discussions and sharing your learnings](https://discord.gg/ux6K7wEu)
 54 | 
 55 | 
 56 | ## Pillar 1: Foundational Knowledge
 57 | 
 58 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
 59 | |-------|-----------|---------------------|----------------------|---------------------|
 60 | | Linear Algebra | Vectors, matrices, eigenvalues | MIT OpenCourseWare, 3Blue1Brown videos | Implement basic operations, solve systems of equations | Quiz, problem set, subjective evaluation |
 61 | | Calculus | Derivatives, integrals, gradients | Khan Academy, Coursera | Optimization problems, gradient descent implementation | Quiz, problem set, subjective evaluation |
 62 | | Probability | Distributions, Bayes' theorem | Stanford CS109 | Probabilistic modeling project | Quiz, project review, subjective evaluation |
 63 | | Statistics | Hypothesis testing, regression | StatQuest videos, edX courses | Data analysis project | Project review, subjective evaluation |
 64 | | Information Theory | Entropy, mutual information | Elements of Information Theory (book) | Implement compression algorithm | Project review, subjective evaluation |
 65 | 
 66 | ## Pillar 2: Programming and Tools
 67 | 
 68 | | Skill | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
 69 | |-------|-----------|---------------------|----------------------|---------------------|
 70 | | Python | Data structures, OOP, functional programming | Python.org tutorials, Real Python | Build a data processing pipeline | Code review, subjective evaluation |
 71 | | PyTorch | Tensors, autograd, distributed training | PyTorch documentation, Fast.ai course | Implement and distribute a neural network | Project review, subjective evaluation |
 72 | | Git | Version control, branching, merging | Git documentation, GitHub Learning Lab | Contribute to an open-source project | Contribution review, subjective evaluation |
 73 | | Linux | Command line, shell scripting, OS internals | Linux Journey, "Operating Systems: Three Easy Pieces" book | Implement a simple process scheduler | Practical test, code review, subjective evaluation |
 74 | | Cloud Platforms | AWS, GCP, or Azure basics | Cloud provider documentation | Deploy a scalable web application | Project review, subjective evaluation |
 75 | | High Performance Computing | GPU programming, CUDA | NVIDIA CUDA tutorials | Optimize a neural network for GPU computation | Performance benchmarks, subjective evaluation |
 76 | | Kubernetes | Container orchestration, scaling | Kubernetes documentation, KodeKloud courses | Deploy and scale an ML model using Kubernetes | Project review, subjective evaluation |
 77 | 
 78 | ## Pillar 3: Deep Learning Fundamentals
 79 | 
 80 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
 81 | |-------|-----------|---------------------|----------------------|---------------------|
 82 | | Neural Networks | Perceptrons, activation functions, backpropagation | Deep Learning book by Goodfellow et al. | Implement a multi-layer perceptron | Code review, subjective evaluation |
 83 | | CNNs | Convolution, pooling, modern architectures | CS231n course materials | Build an image classification model | Project review, subjective evaluation |
 84 | | RNNs | LSTM, GRU, sequence modeling | D2L.ai tutorials | Implement a language model | Code review, subjective evaluation |
 85 | | Attention Mechanisms | Self-attention, multi-head attention | "Attention Is All You Need" paper | Implement and optimize a new attention mechanism | Performance analysis, subjective evaluation |
 86 | | Transformer Variants | Transformer-XL, Reformer, Performer | Recent papers on arXiv | Compare compute efficiency of two Transformer variants | Comparative analysis, subjective evaluation |
 87 | | Optimization Techniques | SGD, Adam, learning rate schedules | Sebastian Ruder's blog | Compare optimizers on a benchmark task | Report review, subjective evaluation |
 88 | | Regularization | Dropout, batch normalization, data augmentation | Papers with Code | Implement and compare regularization techniques | Project review, subjective evaluation |
 89 | 
 90 | ## Pillar 4: Reinforcement Learning
 91 | 
 92 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
 93 | |-------|-----------|---------------------|----------------------|---------------------|
 94 | | RL Fundamentals | MDP, value functions, policy gradients | Sutton & Barto book, David Silver's course | Implement and solve classic RL problems (e.g., CartPole, Mountain Car) | Algorithm performance, subjective evaluation |
 95 | | Deep RL | DQN, A3C, PPO | OpenAI Spinning Up | Implement a deep RL algorithm for a complex environment (e.g., Atari games) | Agent performance, subjective evaluation |
 96 | | RL for NLP | RLHF, sequence generation | Recent papers (e.g., InstructGPT) | Fine-tune a language model using RL | Model improvement, subjective evaluation |
 97 | 
 98 | ## Pillar 5: NLP / LLM Research
 99 | | Topic | Subtopics | Learning Resources | Projects / Assignments | Evaluation Criteria |
100 | |-------|-----------|---------------------|------------------------|----------------------|
101 | | 1. Foundations | - Word Vectors/Embeddings<br>- Tokenization<br>- Preprocessing<br>- Data Sampling | - Book: "Speech and Language Processing" by Jurafsky & Martin<br>- Course: Stanford CS224N: NLP with Deep Learning<br>- Paper: "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. | - Implement word2vec from scratch<br>- Build a custom tokenizer for a specific language or domain<br>- Create a data preprocessing pipeline for a large text corpus | - Accuracy of word embeddings on analogy tasks<br>- Efficiency and coverage of tokenization<br>- Quality and cleanliness of preprocessed data |
102 | | 2. Classical NLP Techniques | - Hidden Markov Models<br>- Naive Bayes<br>- Maximum Entropy Markov Models<br>- Conditional Random Fields | - Book: "Foundations of Statistical Natural Language Processing" by Manning & Schütze<br>- Tutorial: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by Rabiner | - Implement a part-of-speech tagger using HMMs<br>- Build a spam classifier using Naive Bayes<br>- Develop a named entity recognition system using CRFs | - Accuracy on standard POS tagging datasets<br>- Precision, recall, and F1 score for spam classification<br>- F1 score on CoNLL 2003 NER dataset |
103 | | 3. Neural Architectures | - Feed-forward Neural Networks<br>- Recurrent Neural Networks<br>- Convolutional Neural Networks<br>- Attention Mechanisms<br>- Transformers | - Book: "Deep Learning" by Goodfellow, Bengio, & Courville<br>- Paper: "Attention Is All You Need" by Vaswani et al.<br>- Course: fast.ai Practical Deep Learning for Coders | - Implement a sentiment analysis model using CNNs<br>- Build a language model using LSTMs<br>- Create a machine translation system using Transformers | - Accuracy on sentiment analysis benchmarks (e.g., IMDb)<br>- Perplexity of language model on test set<br>- BLEU score for machine translation |
104 | | 4. Language Models | - N-gram Models<br>- Neural Language Models<br>- Autoregressive vs. Autoencoder Models<br>- Large Language Models (LLMs)<br>- Vision-Language Models (VLMs) | - Paper: "Language Models are Few-Shot Learners" (GPT-3)<br>- Blog: "The Illustrated Transformer" by Jay Alammar<br>- Course: Hugging Face NLP Course | - Fine-tune GPT-2 for text generation<br>- Implement a BERT-based question answering system<br>- Create a multimodal model for image captioning | - Perplexity and cross-entropy loss<br>- F1 and Exact Match scores for QA<br>- BLEU, METEOR, and CIDEr scores for image captioning |
105 | | 5. Advanced LLM Concepts | - LLM Alignment<br>- Token Sampling Methods<br>- Context Length Extension<br>- Personalization | - Paper: "Constitutional AI: Harmlessness from AI Feedback" by Anthropic<br>- Blog: "How to sample from language models" by Hugging Face<br>- Paper: "LoRA: Low-Rank Adaptation of Large Language Models" | - Implement different decoding strategies (greedy, beam search, top-k, top-p)<br>- Develop a method to extend context length of a pre-trained LLM<br>- Create a personalized language model using adapters | - Human evaluation of model alignment<br>- Quality and diversity of generated text<br>- Perplexity on long-context tasks<br>- Personalization accuracy on user-specific tasks |
106 | | 6. NLP Applications | - Machine Translation<br>- Named Entity Recognition<br>- Textual Entailment<br>- Retrieval Augmented Generation (RAG)<br>- Document Intelligence | - Paper: "Neural Machine Translation by Jointly Learning to Align and Translate"<br>- Tutorial: spaCy's Named Entity Recognition<br>- Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" | - Build an end-to-end neural machine translation system<br>- Develop a RAG system for question answering<br>- Create a document understanding system for invoice processing | - BLEU, METEOR scores for MT<br>- F1 score for NER<br>- Accuracy on textual entailment datasets (e.g., SNLI)<br>- Relevance and accuracy of RAG responses |
107 | | 7. Knowledge Representation | - Knowledge Graphs<br>- Semantic Networks | - Book: "Knowledge Representation and Reasoning" by Brachman & Levesque<br>- Tutorial: Neo4j Graph Database | - Construct a knowledge graph from a text corpus<br>- Develop a question answering system using a knowledge graph | - Coverage and accuracy of extracted knowledge<br>- Precision and recall of graph-based QA system |
108 | | 8. Challenges and Mitigation | - Hallucination Mitigation<br>- AI Text Detection<br>- Bias Detection and Mitigation | - Paper: "TruthfulQA: Measuring How Models Mimic Human Falsehoods"<br>- Blog: "Detecting Machine-Generated Text" by OpenAI<br>- Paper: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" | - Develop a fact-checking system for LLM outputs<br>- Create an AI text detector<br>- Implement bias mitigation techniques in word embeddings | - Reduction in hallucination rate<br>- Accuracy of AI text detection<br>- Reduction in bias measures (e.g., WEAT score) |
109 | | 9. Evaluation and Benchmarking | - LLM/VLM Benchmarks<br>- Task-specific Metrics | - Paper: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding"<br>- Tutorial: Hugging Face's Evaluate library | - Evaluate an LLM on multiple NLP tasks using GLUE benchmark<br>- Implement and compare different evaluation metrics for a specific NLP task | - Performance across multiple benchmarks<br>- Inter-annotator agreement for human evaluation |
110 | | 10. Practical Aspects | - Large Language Model Ops (LLMOps)<br>- Ethical Considerations | - Book: "Building Machine Learning Pipelines" by Hapke & Nelson<br>- Paper: "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" | - Design and implement an LLMOps pipeline<br>- Conduct an ethical audit of an NLP system | - Efficiency and reliability of deployment pipeline<br>- Compliance with ethical AI principles |
111 | 
112 | ## Pillar 6: Research Paper Analysis and Replication
113 | 
114 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
115 | |----------|----------|-----------|--------------|---------------------|
116 | | Paper Reading | Weekly reading of latest papers | arXiv, Papers With Code | Paper summaries, critiques | Quality of analysis, subjective evaluation |
117 | | Research Replication | Reproduce results of significant papers | Original papers, open-source implementations | Replicated experiments, results comparison | Accuracy of replication, subjective evaluation |
118 | | Trend Analysis | Identify and analyze research trends | NeurIPS, ICML, ACL proceedings | Trend reports, blog posts | Insight quality, subjective evaluation |
119 | | Paper Presentation | Present papers to peers or online | Conference recordings | Video explanations, slide decks | Presentation quality, subjective evaluation |
120 | | Extension Projects | Extend or combine ideas from papers | Related work sections of papers | Novel experiments, blog posts | Originality, subjective evaluation |
121 | | Visualization Projects | Develop interactive visualizations for ML concepts | D3.js, Plotly | Create an interactive visualization of attention between tokens in a language model | Visualization quality, insight provided, subjective evaluation |
122 | 
123 | 
124 | ## Pillar 7: High-Performance AI Systems and Applications
125 | 
126 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
127 | |-------|-----------|---------------------|----------------------|---------------------|
128 | | Model Optimization | Quantization, pruning, distillation | TensorFlow Model Optimization Toolkit | Optimize a large model for edge devices | Performance benchmarks, subjective evaluation |
129 | | Efficient Architectures | MobileNet, EfficientNet, BERT-tiny | Papers, GitHub implementations | Implement and benchmark efficient models | Comparative analysis, subjective evaluation |
130 | | Hardware Acceleration | GPU programming, TPU utilization | NVIDIA Deep Learning Institute | Optimize a model for specific hardware | Performance improvement, subjective evaluation |
131 | | Scalable AI Systems | Microservices, containerization, orchestration | Kubernetes documentation, Docker tutorials | Design a scalable AI service architecture | Architecture review, subjective evaluation |
132 | | MLOps | CI/CD for ML, model monitoring | Google MLOps guides | Set up an MLOps pipeline | Pipeline effectiveness, subjective evaluation |
133 | 
134 | 
135 | ## Pillar 8: Large-scale ETL and Data Engineering
136 | 
137 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
138 | |-------|-----------|---------------------|----------------------|---------------------|
139 | | ETL Fundamentals | Data extraction, transformation, loading | Udacity Data Engineering course | Design and implement an ETL pipeline | Pipeline efficiency, subjective evaluation |
140 | | Distributed Data Processing | Apache Spark, Hadoop | Spark documentation, Coursera courses | Process and analyze a large dataset using Spark | Processing speed, scalability, subjective evaluation |
141 | | Data Warehousing | Star schema, OLAP | Kimball Group resources | Design a data warehouse for ML experiments | Design review, subjective evaluation |
142 | | Streaming Data | Kafka, Flink | Confluent Kafka tutorials | Implement a real-time data processing pipeline | System performance, subjective evaluation |
143 | 
144 | ## Pillar 9: Ethical AI and Responsible Development
145 | 
146 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
147 | |-------|-----------|---------------------|----------------------|---------------------|
148 | | AI Ethics | Fairness, accountability, transparency | MIT Moral Machine, ethicalML.org | Develop an AI ethics framework | Framework review, subjective evaluation |
149 | | Bias in AI | Dataset bias, algorithmic bias | "Gender Shades" paper, AI Fairness 360 | Audit a model for bias | Report review, subjective evaluation |
150 | | Privacy in ML | Differential privacy, federated learning | OpenMined tutorials | Implement privacy-preserving ML | Project review, subjective evaluation |
151 | | Explainable AI | LIME, SHAP, counterfactual explanations | Interpretable Machine Learning book | Develop model explanations | Project review, subjective evaluation |
152 | | AI Governance | Regulations, guidelines, best practices | EU AI Act, IEEE Ethically Aligned Design | Propose AI governance structure | Proposal review, subjective evaluation |
153 | 
154 | ## Pillar 10: Community Engagement and Networking
155 | 
156 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
157 | |----------|----------|-----------|--------------|---------------------|
158 | | Conference Participation | Attend or present at AI conferences | Conference websites, Call for Papers | Presentation slides, trip reports | Engagement quality, subjective evaluation |
159 | | Open Source Contribution | Contribute to AI/ML open source projects | GitHub, PyPI | Code contributions, pull requests | Contribution impact, subjective evaluation |
160 | | AI Community Building | Organize meetups, study groups | Meetup.com, local tech communities | Event reports, community growth metrics | Community impact, subjective evaluation |
161 | | Online Presence | Blog writing, social media engagement | Medium, Twitter, LinkedIn | Blog posts, thread discussions | Reach and engagement, subjective evaluation |
162 | | Collaborative Projects | Partner with other researchers/engineers | Academic collaborations, hackathons | Joint projects, co-authored papers | Collaboration quality, subjective evaluation |
163 | 
164 | ## Pillar 11: Research and Publication
165 | 
166 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
167 | |----------|----------|-----------|--------------|---------------------|
168 | | Research Ideation | Brainstorming, literature review | arXiv, Google Scholar | Research proposals | Originality, feasibility, subjective evaluation |
169 | | Experiment Design | Hypothesis formulation, methodology planning | Research design books, mentorship | Experimental protocols | Rigor, subjective evaluation |
170 | | Data Collection and Analysis | Data gathering, statistical analysis | Kaggle datasets, statistical tools | Datasets, analysis reports | Data quality, analysis depth, subjective evaluation |
171 | | Paper Writing | Scientific writing, LaTeX | Overleaf, writing workshops | Draft papers | Writing quality, subjective evaluation |
172 | | Peer Review | Understand and participate in the peer review process | Publons, Elsevier Researcher Academy | Review reports | Review quality, subjective evaluation |
173 | | Publication and Presentation | Submit to journals/conferences, create posters | Journal guidelines, conference deadlines | Published papers, conference posters | Impact factor, presentation quality, subjective evaluation |
174 | 
175 | ## Resources
176 | - [Books](/resources/books.md)
177 | - [Courses](/resources/courses.md)
178 | - [Blogs](/resources/blogs.md)
179 | - [Papers](/resources/papers.md)
180 | - [Projects](/resources/projects.md)
181 | - [Videos and Podcasts](/resources/videos-and-podcasts.md)


--------------------------------------------------------------------------------
/images/andrej_on_learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/andrej_on_learning.png


--------------------------------------------------------------------------------
/images/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/cover.png


--------------------------------------------------------------------------------
/resources/blogs.md:
--------------------------------------------------------------------------------
 1 | ## Blogs that I want to read / re-read
 2 | 
 3 | ### Main
 4 | ---
 5 | - [The AI Research Engineer](https://researchrookie.substack.com/)
 6 | - [A High-Level Overview of Large Language Models](https://www.borealisai.com/research-blogs/a-high-level-overview-of-large-language-models/)
 7 | - [Applied LLMs - what we've learned from a year of building with LLMs](https://applied-llms.org/)
 8 | - 
 9 | 
10 | 
11 | ### Tokenization
12 | - [Evolution of tokenization](https://towardsdatascience.com/the-evolution-of-tokenization-in-nlp-byte-pair-encoding-in-nlp-d7621b9c1186)
13 | - [Byte Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/)
14 | - 
15 | 
16 | 
17 | ### Tools / best practices for reading and writing research papers
18 | - [Best AI tools for writing research papers](https://www.reddit.com/r/ArtificialInteligence/comments/1eexqk2/best_ai_tools_for_writing_research_papers_in/)
19 | 
20 | 
21 | ### Attention
22 | - [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)


--------------------------------------------------------------------------------
/resources/books.md:
--------------------------------------------------------------------------------
1 | # Books that I want to read / re-read
2 | 
3 | - [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
4 | - [Deep Learning for Coders with fastai and PyTorch](https://github.com/fastai/fastbook)
5 | - [Build a Large Language Model from Scratch](https://livebook.manning.com/book/build-a-large-language-model-from-scratch/)
6 | - 


--------------------------------------------------------------------------------
/resources/courses.md:
--------------------------------------------------------------------------------
1 | ## Courses that I am going to take
2 | 
3 | - 


--------------------------------------------------------------------------------
/resources/datasets.md:
--------------------------------------------------------------------------------
1 | ## Datasets one can use for training / finetuning / learning
2 | 
3 | Sample pretraining dataset: [Paul Graham Essay](https://www.paulgraham.com/greatwork.html)
4 | 
5 | - [AI Books4 Dataset for Training LLMs Further](https://web.archive.org/web/20240519104217/https://old.reddit.com/r/datasets/comments/1cvi151/ai_books4_dataset_for_training_llms_further/)
6 | 


--------------------------------------------------------------------------------
/resources/papers.md:
--------------------------------------------------------------------------------
1 | ## Recommended papers for understanding LLMs
2 | 
3 | ### Tokenization 
4 | - [Neural Machine Translation of Rare Words with Subword Units (BPE)](https://arxiv.org/pdf/1508.07909)
5 | 
6 | 
7 | - [30 papers Ilya recommended John Carmack to read](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE)
8 | - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
9 | - [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)


--------------------------------------------------------------------------------
/resources/projects.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/projects.md


--------------------------------------------------------------------------------
/resources/videos-and-podcasts.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/videos-and-podcasts.md


--------------------------------------------------------------------------------
/weekly-notes/week1.md:
--------------------------------------------------------------------------------
1 | # Week 1 - Building an LLM
2 | 
3 | ## Dataset preparation
4 | - Tokenization
5 | - Data Sampling
6 | - Data loading
7 | - Embedding
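
A minimal sketch of how the four steps above could fit together, using only the Python standard library. Everything here is an assumption made for illustration: the regex word-level tokenizer (a real LLM pipeline would more likely use a subword scheme such as BPE), the helper names (`tokenize`, `build_vocab`, `sliding_windows`, `batches`, `embed`), the toy sentence, and the randomly initialised embedding table standing in for a trainable layer.

```python
import random
import re


def tokenize(text):
    # Tokenization: split into words and punctuation marks; whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens):
    # Map each unique token to an integer id (sorted so the ids are deterministic).
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


def sliding_windows(ids, context_len, stride):
    # Data sampling: each sample is (input, target), where the target is the
    # input shifted one token to the right (next-token prediction).
    samples = []
    for start in range(0, len(ids) - context_len, stride):
        x = ids[start:start + context_len]
        y = ids[start + 1:start + context_len + 1]
        samples.append((x, y))
    return samples


def batches(samples, batch_size, shuffle=True, seed=0):
    # Data loading: yield (inputs, targets) batches, optionally shuffled.
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        chunk = [samples[j] for j in order[i:i + batch_size]]
        yield [x for x, _ in chunk], [y for _, y in chunk]


def make_embedding_table(vocab_size, dim, seed=0):
    # Embedding: one random vector per token id (hypothetical stand-in for a
    # trainable embedding layer).
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(batch_ids, table):
    # Look up the vector for every token id in every sequence of the batch.
    return [[table[i] for i in seq] for seq in batch_ids]


# Toy corpus, assumed for illustration only.
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
samples = sliding_windows(ids, context_len=4, stride=1)
table = make_embedding_table(len(vocab), dim=8)
inputs, targets = next(batches(samples, batch_size=2))
embedded = embed(inputs, table)
```

Here `embedded` has shape batch x context length x embedding dimension (2 x 4 x 8), which is the form a model's first layer would consume. A smaller `stride` gives more overlapping samples from the same text; a `stride` equal to `context_len` gives non-overlapping ones.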


--------------------------------------------------------------------------------