├── .github
│   └── FUNDING.yml
├── .gitignore
├── LICENSE
├── README.md
├── images
│   ├── andrej_on_learning.png
│   └── cover.png
├── resources
│   ├── blogs.md
│   ├── books.md
│   ├── courses.md
│   ├── datasets.md
│   ├── papers.md
│   ├── projects.md
│   └── videos-and-podcasts.md
└── weekly-notes
    └── week1.md
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
4 | patreon: researchrookie/membership
5 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | llm_env/
2 |
3 | # Byte-compiled / optimized / DLL files
4 | __pycache__/
5 | *.py[cod]
6 | *$py.class
7 |
8 | # C extensions
9 | *.so
10 |
11 | # Distribution / packaging
12 | .Python
13 | build/
14 | develop-eggs/
15 | dist/
16 | downloads/
17 | eggs/
18 | .eggs/
19 | lib/
20 | lib64/
21 | parts/
22 | sdist/
23 | var/
24 | wheels/
25 | share/python-wheels/
26 | *.egg-info/
27 | .installed.cfg
28 | *.egg
29 | MANIFEST
30 |
31 | # PyInstaller
32 | # Usually these files are written by a python script from a template
33 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
34 | *.manifest
35 | *.spec
36 |
37 | # Installer logs
38 | pip-log.txt
39 | pip-delete-this-directory.txt
40 |
41 | # Unit test / coverage reports
42 | htmlcov/
43 | .tox/
44 | .nox/
45 | .coverage
46 | .coverage.*
47 | .cache
48 | nosetests.xml
49 | coverage.xml
50 | *.cover
51 | *.py,cover
52 | .hypothesis/
53 | .pytest_cache/
54 | cover/
55 |
56 | # Translations
57 | *.mo
58 | *.pot
59 |
60 | # Django stuff:
61 | *.log
62 | local_settings.py
63 | db.sqlite3
64 | db.sqlite3-journal
65 |
66 | # Flask stuff:
67 | instance/
68 | .webassets-cache
69 |
70 | # Scrapy stuff:
71 | .scrapy
72 |
73 | # Sphinx documentation
74 | docs/_build/
75 |
76 | # PyBuilder
77 | .pybuilder/
78 | target/
79 |
80 | # Jupyter Notebook
81 | .ipynb_checkpoints
82 |
83 | # IPython
84 | profile_default/
85 | ipython_config.py
86 |
87 | # pyenv
88 | # For a library or package, you might want to ignore these files since the code is
89 | # intended to run in multiple environments; otherwise, check them in:
90 | # .python-version
91 |
92 | # pipenv
93 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
94 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
95 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
96 | # install all needed dependencies.
97 | #Pipfile.lock
98 |
99 | # poetry
100 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
101 | # This is especially recommended for binary packages to ensure reproducibility, and is more
102 | # commonly ignored for libraries.
103 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
104 | #poetry.lock
105 |
106 | # pdm
107 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
108 | #pdm.lock
109 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
110 | # in version control.
111 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
112 | .pdm.toml
113 | .pdm-python
114 | .pdm-build/
115 |
116 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
117 | __pypackages__/
118 |
119 | # Celery stuff
120 | celerybeat-schedule
121 | celerybeat.pid
122 |
123 | # SageMath parsed files
124 | *.sage.py
125 |
126 | # Environments
127 | .env
128 | .venv
129 | env/
130 | venv/
131 | ENV/
132 | env.bak/
133 | venv.bak/
134 |
135 | # Spyder project settings
136 | .spyderproject
137 | .spyproject
138 |
139 | # Rope project settings
140 | .ropeproject
141 |
142 | # mkdocs documentation
143 | /site
144 |
145 | # mypy
146 | .mypy_cache/
147 | .dmypy.json
148 | dmypy.json
149 |
150 | # Pyre type checker
151 | .pyre/
152 |
153 | # pytype static type analyzer
154 | .pytype/
155 |
156 | # Cython debug symbols
157 | cython_debug/
158 |
159 | # PyCharm
160 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
161 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
162 | # and can be added to the global gitignore or merged into this file. For a more nuclear
163 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
164 | #.idea/
165 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2024 Data Science with Harshit
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AI Research Program
2 |
3 | 
4 |
5 | An independent AI research program created by [Harshit](https://researchrookie.substack.com/).
6 |
7 | I am an educator at heart ♥️ and I love to deconstruct topics to better understand and teach them. I am now embarking on a journey to re-learn the concepts of AI via this AI Research Program.
8 |
9 | I have designed this program to be thorough, exciting, gamified, and hands-on. The program also draws inspiration from various Research Engineer roles at companies like [OpenAI](https://openai.com/careers/research-engineer/), [Anthropic](https://boards.greenhouse.io/anthropic/jobs/4019739008), [Meta](https://www.metacareers.com/v2/jobs/2780884748740452/), and the like.
10 |
11 | The following curriculum is designed to help me build a deep and thorough understanding of all the important concepts in AI.
12 |
13 | ## Table of contents
14 | - [AI Research Program](#ai-research-program)
15 | - [Table of contents](#table-of-contents)
16 | - [My learning approach](#my-learning-approach)
17 | - [How to consume this?](#how-to-consume-this)
18 | - [Pillar 1: Foundational Knowledge](#pillar-1-foundational-knowledge)
19 | - [Pillar 2: Programming and Tools](#pillar-2-programming-and-tools)
20 | - [Pillar 3: Deep Learning Fundamentals](#pillar-3-deep-learning-fundamentals)
21 | - [Pillar 4: Reinforcement Learning](#pillar-4-reinforcement-learning)
22 | - [Pillar 5: NLP / LLM Research](#pillar-5-nlp--llm-research)
23 | - [Pillar 6: Research Paper Analysis and Replication](#pillar-6-research-paper-analysis-and-replication)
24 | - [Pillar 7: High-Performance AI Systems and Applications](#pillar-7-high-performance-ai-systems-and-applications)
25 | - [Pillar 8: Large-scale ETL and Data Engineering](#pillar-8-large-scale-etl-and-data-engineering)
26 | - [Pillar 9: Ethical AI and Responsible Development](#pillar-9-ethical-ai-and-responsible-development)
27 | - [Pillar 10: Community Engagement and Networking](#pillar-10-community-engagement-and-networking)
28 | - [Pillar 11: Research and Publication](#pillar-11-research-and-publication)
29 | - [Resources](#resources)
30 |
31 | ## My learning approach
32 | I have a background in Data Science and ML, and I've decided to take a recursive approach (going deeper as needed) based on my level of understanding.
33 |
34 | My primary purpose of starting this program is to dig deeper into **LLM research and engineering problems**.
35 |
36 | So, I'll start from Pillar 5 (NLP/LLM research): how LLMs work, how they are pre-trained, and how to fine-tune them. In the process, I will hop over to other Pillars to learn and re-learn other fundamental concepts.
37 |
38 | Keep an eye on [weekly-notes](./weekly-notes) for my learnings and thoughts.
39 |
40 | I'll dedicate at least 1 hour daily to my learning.
41 |
42 | > ⚠️ This program is still under development and will be updated as I progress in my journey. Feel free to contribute to it by opening issues or suggesting improvements.
43 |
44 | ## How to consume this?
45 | - Beginners should start with the [Foundational Knowledge](#pillar-1-foundational-knowledge) section.
46 | - For all others who have some exposure to ML/DL, tweak the curriculum as per your level of understanding.
47 | - This curriculum is not supposed to be consumed in a linear fashion. Treat it more like a checklist.
48 | - You should start with the basics and gradually move towards more complex topics as per your goals and interests.
49 |
50 | To follow me on this journey, consider subscribing to my:
51 | - [Weekly newsletter](https://researchrookie.substack.com/)
52 | - [Learning vlogs on YouTube](https://www.youtube.com/@DataSciencewithHarshit)
53 | - [Discord for discussions and sharing your learnings](https://discord.gg/ux6K7wEu)
54 |
55 |
56 | ## Pillar 1: Foundational Knowledge
57 |
58 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
59 | |-------|-----------|---------------------|----------------------|---------------------|
60 | | Linear Algebra | Vectors, matrices, eigenvalues | MIT OpenCourseWare, 3Blue1Brown videos | Implement basic operations, solve systems of equations | Quiz, problem set, subjective evaluation |
61 | | Calculus | Derivatives, integrals, gradients | Khan Academy, Coursera | Optimization problems, gradient descent implementation | Quiz, problem set, subjective evaluation |
62 | | Probability | Distributions, Bayes theorem | Stanford CS109 | Probabilistic modeling project | Quiz, project review, subjective evaluation |
63 | | Statistics | Hypothesis testing, regression | StatQuest videos, EdX courses | Data analysis project | Project review, subjective evaluation |
64 | | Information Theory | Entropy, mutual information | Elements of Information Theory (book) | Implement compression algorithm | Project review, subjective evaluation |
65 |
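
To make the Calculus row concrete, here is a minimal gradient descent sketch. The quadratic objective is a made-up example, not a prescribed assignment; any differentiable function works the same way.

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeat w <- w - lr * grad(w) and return the trajectory."""
    w, history = w0, [w0]
    for _ in range(steps):
        w = w - lr * grad(w)
        history.append(w)
    return np.array(history)

# Hypothetical objective f(w) = (w - 3)^2, so f'(w) = 2 * (w - 3).
trajectory = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(trajectory[-1])  # approaches the minimum at w = 3
```
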
66 | ## Pillar 2: Programming and Tools
67 |
68 | | Skill | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
69 | |-------|-----------|---------------------|----------------------|---------------------|
70 | | Python | Data structures, OOP, functional programming | Python.org tutorials, Real Python | Build a data processing pipeline | Code review, subjective evaluation |
71 | | PyTorch | Tensors, autograd, distributed training | PyTorch documentation, Fast.ai course | Implement and distribute a neural network | Project review, subjective evaluation |
72 | | Git | Version control, branching, merging | Git documentation, GitHub Learning Lab | Contribute to an open-source project | Contribution review, subjective evaluation |
73 | | Linux | Command line, shell scripting, OS internals | Linux Journey, "Operating Systems: Three Easy Pieces" book | Implement a simple process scheduler | Practical test, code review, subjective evaluation |
74 | | Cloud Platforms | AWS, GCP, or Azure basics | Cloud provider documentation | Deploy a scalable web application | Project review, subjective evaluation |
75 | | High Performance Computing | GPU programming, CUDA | NVIDIA CUDA tutorials | Optimize a neural network for GPU computation | Performance benchmarks, subjective evaluation |
76 | | Kubernetes | Container orchestration, scaling | Kubernetes documentation, KodeKloud courses | Deploy and scale an ML model using Kubernetes | Project review, subjective evaluation |
77 |
78 | ## Pillar 3: Deep Learning Fundamentals
79 |
80 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
81 | |-------|-----------|---------------------|----------------------|---------------------|
82 | | Neural Networks | Perceptrons, activation functions, backpropagation | Deep Learning book by Goodfellow et al. | Implement a multi-layer perceptron | Code review, subjective evaluation |
83 | | CNNs | Convolution, pooling, modern architectures | CS231n course materials | Build an image classification model | Project review, subjective evaluation |
84 | | RNNs | LSTM, GRU, sequence modeling | D2L.ai tutorials | Implement a language model | Code review, subjective evaluation |
85 | | Attention Mechanisms | Self-attention, multi-head attention | "Attention Is All You Need" paper | Implement and optimize a new attention mechanism | Performance analysis, subjective evaluation |
86 | | Transformer Variants | Transformer-XL, Reformer, Performer | Recent papers on arXiv | Compare compute efficiency of two Transformer variants | Comparative analysis, subjective evaluation |
87 | | Optimization Techniques | SGD, Adam, learning rate schedules | Sebastian Ruder's blog | Compare optimizers on a benchmark task | Report review, subjective evaluation |
88 | | Regularization | Dropout, batch normalization, data augmentation | Papers with Code | Implement and compare regularization techniques | Project review, subjective evaluation |
89 |
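
For the Attention Mechanisms row above, here is a minimal sketch of scaled dot-product attention as defined in "Attention Is All You Need" (single head, no masking; a toy for intuition, not an optimized implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k) tensors; returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # similarity of every query/key pair
    weights = F.softmax(scores, dim=-1)          # rows sum to 1: attention distribution
    return weights @ v                           # weighted mix of value vectors

x = torch.randn(2, 5, 64)                        # toy batch: 2 sequences of 5 tokens
out = scaled_dot_product_attention(x, x, x)      # self-attention: q = k = v
print(out.shape)                                 # torch.Size([2, 5, 64])
```
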
90 | ## Pillar 4: Reinforcement Learning
91 |
92 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
93 | |-------|-----------|---------------------|----------------------|---------------------|
94 | | RL Fundamentals | MDP, value functions, policy gradients | Sutton & Barto book, David Silver's course | Implement and solve classic RL problems (e.g., CartPole, Mountain Car) | Algorithm performance, subjective evaluation |
95 | | Deep RL | DQN, A3C, PPO | OpenAI Spinning Up | Implement a deep RL algorithm for a complex environment (e.g., Atari games) | Agent performance, subjective evaluation |
96 | | RL for NLP | RLHF, sequence generation | Recent papers (e.g., InstructGPT) | Fine-tune a language model using RL | Model improvement, subjective evaluation |
97 |
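
As a warm-up for the RL Fundamentals row, here is a tabular Q-learning sketch on a made-up 5-state chain (not one of the listed environments); it only demonstrates the temporal-difference update:

```python
import random

N, GOAL = 5, 4                                # states 0..4; reward 1 only at state 4
Q = [[0.0, 0.0] for _ in range(N)]            # Q[state][action]; actions: 0=left, 1=right
alpha, gamma, eps = 0.1, 0.99, 0.2

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, float(s2 == GOAL), s2 == GOAL  # next state, reward, done

def act(s):
    if random.random() < eps or Q[s][0] == Q[s][1]:
        return random.randrange(2)            # explore (or break ties randomly)
    return 0 if Q[s][0] > Q[s][1] else 1      # exploit

for _ in range(300):                          # episodes
    s, done, t = 0, False, 0
    while not done and t < 100:
        a = act(s)
        s2, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # TD update
        s, t = s2, t + 1

print([round(max(q), 2) for q in Q])          # state values rise toward the goal
```
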
98 | ## Pillar 5: NLP / LLM Research
99 | | Topic | Subtopics | Learning Resources | Projects / Assignments | Evaluation Criteria |
100 | |-------|-----------|---------------------|------------------------|----------------------|
101 | | 1. Foundations | - Word Vectors/Embeddings<br>- Tokenization<br>- Preprocessing<br>- Data Sampling | - Book: "Speech and Language Processing" by Jurafsky & Martin<br>- Course: Stanford CS224N: NLP with Deep Learning<br>- Paper: "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. | - Implement word2vec from scratch<br>- Build a custom tokenizer for a specific language or domain<br>- Create a data preprocessing pipeline for a large text corpus | - Accuracy of word embeddings on analogy tasks<br>- Efficiency and coverage of tokenization<br>- Quality and cleanliness of preprocessed data |
102 | | 2. Classical NLP Techniques | - Hidden Markov Models<br>- Naive Bayes<br>- Maximum Entropy Markov Models<br>- Conditional Random Fields | - Book: "Foundations of Statistical Natural Language Processing" by Manning & Schütze<br>- Tutorial: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by Rabiner | - Implement a part-of-speech tagger using HMMs<br>- Build a spam classifier using Naive Bayes<br>- Develop a named entity recognition system using CRFs | - Accuracy on standard POS tagging datasets<br>- Precision, recall, and F1 score for spam classification<br>- F1 score on CoNLL 2003 NER dataset |
103 | | 3. Neural Architectures | - Feed-forward Neural Networks<br>- Recurrent Neural Networks<br>- Convolutional Neural Networks<br>- Attention Mechanisms<br>- Transformers | - Book: "Deep Learning" by Goodfellow, Bengio, & Courville<br>- Paper: "Attention Is All You Need" by Vaswani et al.<br>- Course: fast.ai Practical Deep Learning for Coders | - Implement a sentiment analysis model using CNNs<br>- Build a language model using LSTMs<br>- Create a machine translation system using Transformers | - Accuracy on sentiment analysis benchmarks (e.g., IMDb)<br>- Perplexity of language model on test set<br>- BLEU score for machine translation |
104 | | 4. Language Models | - N-gram Models<br>- Neural Language Models<br>- Autoregressive vs. Autoencoder Models<br>- Large Language Models (LLMs)<br>- Vision-Language Models (VLMs) | - Paper: "Language Models are Few-Shot Learners" (GPT-3)<br>- Blog: "The Illustrated Transformer" by Jay Alammar<br>- Course: Hugging Face NLP Course | - Fine-tune GPT-2 for text generation<br>- Implement a BERT-based question answering system<br>- Create a multimodal model for image captioning | - Perplexity and cross-entropy loss<br>- F1 and Exact Match scores for QA<br>- BLEU, METEOR, and CIDEr scores for image captioning |
105 | | 5. Advanced LLM Concepts | - LLM Alignment<br>- Token Sampling Methods<br>- Context Length Extension<br>- Personalization | - Paper: "Constitutional AI: Harmlessness from AI Feedback" by Anthropic<br>- Blog: "How to sample from language models" by Hugging Face<br>- Paper: "LoRA: Low-Rank Adaptation of Large Language Models" | - Implement different decoding strategies (greedy, beam search, top-k, top-p; see the sketch after this table)<br>- Develop a method to extend context length of a pre-trained LLM<br>- Create a personalized language model using adapters | - Human evaluation of model alignment<br>- Quality and diversity of generated text<br>- Perplexity on long-context tasks<br>- Personalization accuracy on user-specific tasks |
106 | | 6. NLP Applications | - Machine Translation<br>- Named Entity Recognition<br>- Textual Entailment<br>- Retrieval Augmented Generation (RAG)<br>- Document Intelligence | - Paper: "Neural Machine Translation by Jointly Learning to Align and Translate"<br>- Tutorial: spaCy's Named Entity Recognition<br>- Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" | - Build an end-to-end neural machine translation system<br>- Develop a RAG system for question answering<br>- Create a document understanding system for invoice processing | - BLEU, METEOR scores for MT<br>- F1 score for NER<br>- Accuracy on textual entailment datasets (e.g., SNLI)<br>- Relevance and accuracy of RAG responses |
107 | | 7. Knowledge Representation | - Knowledge Graphs<br>- Semantic Networks | - Book: "Knowledge Representation and Reasoning" by Brachman & Levesque<br>- Tutorial: Neo4j Graph Database | - Construct a knowledge graph from a text corpus<br>- Develop a question answering system using a knowledge graph | - Coverage and accuracy of extracted knowledge<br>- Precision and recall of graph-based QA system |
108 | | 8. Challenges and Mitigation | - Hallucination Mitigation<br>- AI Text Detection<br>- Bias Detection and Mitigation | - Paper: "TruthfulQA: Measuring How Models Mimic Human Falsehoods"<br>- Blog: "Detecting Machine-Generated Text" by OpenAI<br>- Paper: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings" | - Develop a fact-checking system for LLM outputs<br>- Create an AI text detector<br>- Implement bias mitigation techniques in word embeddings | - Reduction in hallucination rate<br>- Accuracy of AI text detection<br>- Reduction in bias measures (e.g., WEAT score) |
109 | | 9. Evaluation and Benchmarking | - LLM/VLM Benchmarks<br>- Task-specific Metrics | - Paper: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding"<br>- Tutorial: Hugging Face's Evaluate library | - Evaluate an LLM on multiple NLP tasks using the GLUE benchmark<br>- Implement and compare different evaluation metrics for a specific NLP task | - Performance across multiple benchmarks<br>- Inter-annotator agreement for human evaluation |
110 | | 10. Practical Aspects | - Large Language Model Ops (LLMOps)<br>- Ethical Considerations | - Book: "Building Machine Learning Pipelines" by Hapke & Nelson<br>- Paper: "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" | - Design and implement an LLMOps pipeline<br>- Conduct an ethical audit of an NLP system | - Efficiency and reliability of deployment pipeline<br>- Compliance with ethical AI principles |
111 |
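A sketch of the decoding-strategies assignment from row 5 above: greedy decoding plus top-k and top-p (nucleus) filtering over a single logits vector. The random logits are a stand-in for a real model's output, and beam search is omitted for brevity.

```python
import torch

def greedy(logits):
    """Greedy decoding: always take the highest-logit token."""
    return logits.argmax().item()

def sample(logits, k=None, p=None, temperature=1.0):
    """Sample one token id from (vocab,) logits with optional top-k / top-p filtering."""
    logits = logits / temperature
    if k is not None:
        kth_best = torch.topk(logits, k).values[-1]      # k-th largest logit
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    if p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        drop_sorted = probs.cumsum(dim=-1) - probs > p   # preceding mass already >= p
        drop = torch.zeros_like(drop_sorted)
        drop[sorted_idx] = drop_sorted                   # map back to vocab order
        logits = logits.masked_fill(drop, float("-inf"))
    return torch.multinomial(torch.softmax(logits, dim=-1), 1).item()

logits = torch.randn(50)  # stand-in for a model's next-token logits
print(greedy(logits), sample(logits, k=10), sample(logits, p=0.9))
```
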
112 | ## Pillar 6: Research Paper Analysis and Replication
113 |
114 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
115 | |----------|----------|-----------|--------------|---------------------|
116 | | Paper Reading | Weekly reading of latest papers | arXiv, Papers With Code | Paper summaries, critiques | Quality of analysis, subjective evaluation |
117 | | Research Replication | Reproduce results of significant papers | Original papers, open-source implementations | Replicated experiments, results comparison | Accuracy of replication, subjective evaluation |
118 | | Trend Analysis | Identify and analyze research trends | NeurIPS, ICML, ACL proceedings | Trend reports, blog posts | Insight quality, subjective evaluation |
119 | | Paper Presentation | Present papers to peers or online | Conference recordings | Video explanations, slide decks | Presentation quality, subjective evaluation |
120 | | Extension Projects | Extend or combine ideas from papers | Related work sections of papers | Novel experiments, blog posts | Originality, subjective evaluation |
121 | | Visualization Projects | Develop interactive visualizations for ML concepts | D3.js, Plotly | Create an interactive visualization of attention between tokens in a language model | Visualization quality, insight provided, subjective evaluation |
122 |
123 |
124 | ## Pillar 7: High-Performance AI Systems and Applications
125 |
126 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
127 | |-------|-----------|---------------------|----------------------|---------------------|
128 | | Model Optimization | Quantization, pruning, distillation | TensorFlow Model Optimization Toolkit | Optimize a large model for edge devices | Performance benchmarks, subjective evaluation |
129 | | Efficient Architectures | MobileNet, EfficientNet, BERT-tiny | Papers, GitHub implementations | Implement and benchmark efficient models | Comparative analysis, subjective evaluation |
130 | | Hardware Acceleration | GPU programming, TPU utilization | NVIDIA Deep Learning Institute | Optimize a model for specific hardware | Performance improvement, subjective evaluation |
131 | | Scalable AI Systems | Microservices, containerization, orchestration | Kubernetes documentation, Docker tutorials | Design a scalable AI service architecture | Architecture review, subjective evaluation |
132 | | ML Ops | CI/CD for ML, model monitoring | Google MLOps guides | Set up an MLOps pipeline | Pipeline effectiveness, subjective evaluation |
133 |
134 |
135 | ## Pillar 8: Large-scale ETL and Data Engineering
136 |
137 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
138 | |-------|-----------|---------------------|----------------------|---------------------|
139 | | ETL Fundamentals | Data extraction, transformation, loading | Udacity Data Engineering course | Design and implement an ETL pipeline | Pipeline efficiency, subjective evaluation |
140 | | Distributed Data Processing | Apache Spark, Hadoop | Spark documentation, Coursera courses | Process and analyze a large dataset using Spark | Processing speed, scalability, subjective evaluation |
141 | | Data Warehousing | Star schema, OLAP | Kimball Group resources | Design a data warehouse for ML experiments | Design review, subjective evaluation |
142 | | Streaming Data | Kafka, Flink | Confluent Kafka tutorials | Implement a real-time data processing pipeline | System performance, subjective evaluation |
143 |
144 | ## Pillar 9: Ethical AI and Responsible Development
145 |
146 | | Topic | Subtopics | Learning Resources | Projects/Assignments | Evaluation Criteria |
147 | |-------|-----------|---------------------|----------------------|---------------------|
148 | | AI Ethics | Fairness, accountability, transparency | MIT Moral Machine, ethicalML.org | Develop an AI ethics framework | Framework review, subjective evaluation |
149 | | Bias in AI | Dataset bias, algorithmic bias | "Gender Shades" paper, AI Fairness 360 | Audit a model for bias | Report review, subjective evaluation |
150 | | Privacy in ML | Differential privacy, federated learning | OpenMined tutorials | Implement privacy-preserving ML | Project review, subjective evaluation |
151 | | Explainable AI | LIME, SHAP, counterfactual explanations | Interpretable Machine Learning book | Develop model explanations | Project review, subjective evaluation |
152 | | AI Governance | Regulations, guidelines, best practices | EU AI Act, IEEE Ethically Aligned Design | Propose AI governance structure | Proposal review, subjective evaluation |
153 |
154 | ## Pillar 10: Community Engagement and Networking
155 |
156 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
157 | |----------|----------|-----------|--------------|---------------------|
158 | | Conference Participation | Attend or present at AI conferences | Conference websites, Call for Papers | Presentation slides, trip reports | Engagement quality, subjective evaluation |
159 | | Open Source Contribution | Contribute to AI/ML open source projects | GitHub, PyPi | Code contributions, pull requests | Contribution impact, subjective evaluation |
160 | | AI Community Building | Organize meetups, study groups | Meetup.com, local tech communities | Event reports, community growth metrics | Community impact, subjective evaluation |
161 | | Online Presence | Blog writing, social media engagement | Medium, Twitter, LinkedIn | Blog posts, thread discussions | Reach and engagement, subjective evaluation |
162 | | Collaborative Projects | Partner with other researchers/engineers | Academic collaborations, hackathons | Joint projects, co-authored papers | Collaboration quality, subjective evaluation |
163 |
164 | ## Pillar 11: Research and Publication
165 |
166 | | Activity | Approach | Resources | Deliverables | Evaluation Criteria |
167 | |----------|----------|-----------|--------------|---------------------|
168 | | Research Ideation | Brainstorming, literature review | arXiv, Google Scholar | Research proposals | Originality, feasibility, subjective evaluation |
169 | | Experiment Design | Hypothesis formulation, methodology planning | Research design books, mentorship | Experimental protocols | Rigor, subjective evaluation |
170 | | Data Collection and Analysis | Data gathering, statistical analysis | Kaggle datasets, statistical tools | Datasets, analysis reports | Data quality, analysis depth, subjective evaluation |
171 | | Paper Writing | Scientific writing, LaTeX | Overleaf, writing workshops | Draft papers | Writing quality, subjective evaluation |
172 | | Peer Review | Understand and participate in the peer review process | Publons, Elsevier Researcher Academy | Review reports | Review quality, subjective evaluation |
173 | | Publication and Presentation | Submit to journals/conferences, create posters | Journal guidelines, conference deadlines | Published papers, conference posters | Impact factor, presentation quality, subjective evaluation |
174 |
175 | ## Resources
176 | - [Books](/resources/books.md)
177 | - [Courses](/resources/courses.md)
178 | - [Blogs](/resources/blogs.md)
179 | - [Papers](/resources/papers.md)
180 | - [Projects](/resources/projects.md)
181 | - [Videos and Podcasts](/resources/videos-and-podcasts.md)
--------------------------------------------------------------------------------
/images/andrej_on_learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/andrej_on_learning.png
--------------------------------------------------------------------------------
/images/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/images/cover.png
--------------------------------------------------------------------------------
/resources/blogs.md:
--------------------------------------------------------------------------------
1 | ## Blogs that I want to read / re-read
2 |
3 | ### Main
4 | ---
5 | - [The AI Research Engineer](https://researchrookie.substack.com/)
6 | - [A High-Level Overview of Large Language Models](https://www.borealisai.com/research-blogs/a-high-level-overview-of-large-language-models/)
7 | - [Applied LLMs - what we've learned from a year of building with LLMs](https://applied-llms.org/)
8 | -
9 |
10 |
11 | ### Tokenization
12 | - [Evolution of tokenization](https://towardsdatascience.com/the-evolution-of-tokenization-in-nlp-byte-pair-encoding-in-nlp-d7621b9c1186)
13 | - [Byte Pair Encoding](https://leimao.github.io/blog/Byte-Pair-Encoding/)
14 | -
15 |
16 |
17 | ### Tools / best practices for reading and writing research papers
18 | - [Best AI tools for writing research papers](https://www.reddit.com/r/ArtificialInteligence/comments/1eexqk2/best_ai_tools_for_writing_research_papers_in/)
19 |
20 |
21 | ### Attention
22 | - [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
--------------------------------------------------------------------------------
/resources/books.md:
--------------------------------------------------------------------------------
1 | # Books that I want to read / re-read
2 |
3 | - [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
4 | - [Deep Learning for Coders with fastai and PyTorch](https://github.com/fastai/fastbook)
5 | - [Build a Large Language Model from Scratch](https://livebook.manning.com/book/build-a-large-language-model-from-scratch/)
6 | -
--------------------------------------------------------------------------------
/resources/courses.md:
--------------------------------------------------------------------------------
1 | ## Courses that I am going to take
2 |
3 | -
--------------------------------------------------------------------------------
/resources/datasets.md:
--------------------------------------------------------------------------------
1 | ## Datasets one can use for training / finetuning / learning
2 |
3 | Sample pretraining dataset: [Paul Graham Essay](https://www.paulgraham.com/greatwork.html)
4 |
5 | - [AI Books4 Dataset for Training LLMs Further](https://web.archive.org/web/20240519104217/https://old.reddit.com/r/datasets/comments/1cvi151/ai_books4_dataset_for_training_llms_further/)
6 |
--------------------------------------------------------------------------------
/resources/papers.md:
--------------------------------------------------------------------------------
1 | ## Recommended papers for understanding LLMs
2 |
3 | ### Tokenization
4 | - [Neural Machine Translation of Rare Words with Subword Units (BPE)](https://arxiv.org/pdf/1508.07909)
5 |
6 |
7 | - [30 papers Ilya Sutskever recommended John Carmack read](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE)
8 | - [Attention is All You Need](https://arxiv.org/abs/1706.03762)
9 | - [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
--------------------------------------------------------------------------------
/resources/projects.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/projects.md
--------------------------------------------------------------------------------
/resources/videos-and-podcasts.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dswh/ai-research-program/08c6e926382029bfa0e2a01fa3e5f9762ce3aa62/resources/videos-and-podcasts.md
--------------------------------------------------------------------------------
/weekly-notes/week1.md:
--------------------------------------------------------------------------------
1 | # Week 1 - Building an LLM
2 |
3 | ## Dataset preparation
4 | - Tokenization
5 | - Data Sampling
6 | - Data loading
7 | - Embedding
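
To make these four steps concrete, here is a minimal sketch on a toy corpus. The whitespace "tokenizer" and the tiny sentence are stand-ins, not the actual week-1 code; a real run would use a trained tokenizer (e.g., BPE) and wrap the (input, target) pairs in a DataLoader for batched loading.

```python
import torch

text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()                                   # toy whitespace tokenizer
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = torch.tensor([vocab[t] for t in tokens])          # token ids

context = 4                                             # data sampling: window size
inputs = torch.stack([ids[i:i + context] for i in range(len(ids) - context)])
targets = torch.stack([ids[i + 1:i + context + 1] for i in range(len(ids) - context)])

embed = torch.nn.Embedding(len(vocab), 16)              # embedding layer, dim 16
print(inputs.shape, embed(inputs).shape)                # (5, 4) and (5, 4, 16)
```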
--------------------------------------------------------------------------------