├── AI ML DL.png
├── Data Sci.png
├── Hackathon.jpg
├── Picture1.png
└── README.md

/AI ML DL.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/didiooi/beginnersguideML/50fd4b82d00cabae17634d0f509a1f15899e2b53/AI ML DL.png
--------------------------------------------------------------------------------
/Data Sci.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/didiooi/beginnersguideML/50fd4b82d00cabae17634d0f509a1f15899e2b53/Data Sci.png
--------------------------------------------------------------------------------
/Hackathon.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/didiooi/beginnersguideML/50fd4b82d00cabae17634d0f509a1f15899e2b53/Hackathon.jpg
--------------------------------------------------------------------------------
/Picture1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/didiooi/beginnersguideML/50fd4b82d00cabae17634d0f509a1f15899e2b53/Picture1.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Getting Started with Data Science and AI
by Didi Ooi S. *v1.0 October-22-2017*
Repo Content: Resources (all hyperlinked for your convenience!), background, and a (very) basic introduction.

## 1. My Long-Short Summary
Hello, a lot of you are probably **NOT** coming from a computer science or applied statistics background *(like myself)*. One of the most effective ways to pick up Data Science (especially Machine Learning and Deep Learning) is to:
1. PRACTICE, PRACTICE, PRACTICE *and* LEARN **at the same time**
2. Post your repositories (projects) on [GitHub](https://www.github.com) & [Kaggle](https://www.kaggle.com)
3. Start participating in local [Hackathons](https://en.wikipedia.org/wiki/Hackathon) and/or in [Kaggle Competitions](https://www.kaggle.com/competitions) to try solving problems with real data.

The list above is not necessarily in order. Half of the people I talked to are more comfortable with 1, 2 and then 3, which is akin to a traditional education (the bottom-up approach). Some people are a bit more adventurous, starting with 3, then 2, then 1 (like myself)! I decided to visit my first hackathon, the [Houston Hackathon](http://houstonhackathon.com/), this year to find out what it is all about - only to get pulled into one project, and then another [one](https://www.hackathon.com/event/geophysics-hackathon-2017-36373291494)! If you like learning on the go, love solving problems and don't mind being thrown into an unknown pit, then this is the way to go.

### PRACTICE AND LEARN (AT THE SAME TIME)

The reason why I said **practice and learn** 'at the same time' is that:
- PRACTICING allows you to apply the **TOOLS** you need to solve your data science problem using existing repositories, without wasting time starting from the ground up. I believe practicing this way, from the top down, is much faster than the traditional learning we have all been exposed to (especially me!)
- LEARNING is crucial to understand **WHY** you applied the tools that you did, i.e. why you chose the algorithms and the background math that shaped them. You're not a Data Scientist if you're just the end-user of tools!

### PARTICIPATE IN INTEREST GROUPS

NEXT STEP, if you live in a medium/big city - go out of your comfort zone and meet other people in the AI/DS field in your area. Chances are your city already has plenty of [MeetUps](http://www.meetup.com/) or [Eventbrite events](https://www.eventbrite.com) that you can sit in on and participate in. Nearly all of these events/meetups are FREE and cater to like-minded people with **various** levels of experience!
Houston's Machine Learning and Energy Data Science Meetups have between 3 and 10 NEW people at each meeting, and up to half of them have no AI/DS background or degree BUT have the subject matter expertise and want to use this technology to solve problems in their field.

*If there isn't one, then be the one to organize the first in your area!*

Since finding out about Meetup in the summer of 2017 *(yes, I am a late bloomer)*, I now regularly go to the [Houston Data Science](https://www.meetup.com/Houston-Data-Science/), [Houston Machine Learning](https://www.meetup.com/Houston-Machine-Learning/), [Houston Energy Data Science](https://www.meetup.com/Houston-Energy-Data-Science-Meetup/) and [Houston Data Visualization](https://www.meetup.com/Houston-Data-Visualization-Meetup/) groups, and I have definitely learned a lot by talking to people, asking questions, or sitting in on lectures.

(**Be warned** that it can be intimidating at first, but understand and expect that you will leave your first few Meetups with 0.1 to 5% comprehension of the whole learning experience, and that is COMPLETELY NORMAL. Read Carol Dweck's *Mindset* to get a sense of what I mean. It is really important to keep an **open mind** and be eager to learn, and you will soon see that percentage of comprehension grow over time. It is easier and faster than learning the theoretical background from books and courses alone but, most importantly, it will reinforce your theoretical learning.)

![](https://github.com/didiooi/beginnersguideML/blob/master/Hackathon.jpg)
*(At the more recent [Agile\* Geophysics Hackathon](https://agilescientific.com/blog/2017/9/29/hacking-in-houston) in Houston!)*

### UNDERSTAND HOW IT IS BEING APPLIED
Last but not least, learn how data science and its architectures are being applied in the **REAL WORLD**.
Start by figuring out which **industries you are interested in**, then research how they are (or are not) integrating advanced analytics and emerging technologies with traditional methods in their existing or future products, workflow pipelines, supply chains etc.
Be the big picture problem solver. Then VOILA! Your journey begins!

For **geologists** and **geophysicists**, I recommend keeping up to date with these two journals: [Computers & Geosciences](https://www.journals.elsevier.com/computers-and-geosciences) and [The Leading Edge](https://library.seg.org/toc/leedff/current). For open-source tools, please visit our open-source collaborative effort [Open Geoscience](https://github.com/softwareunderground/awesome-open-geoscience)!

## 2. Background for the Newbies
People often mix up the terms Machine Learning (ML) and Deep Learning (DL), and how they relate to Artificial Intelligence (AI).

Here is one diagram I made *(inspired by Nvidia)*:
![](https://github.com/didiooi/beginnersguideML/blob/master/AI%20ML%20DL.png)
*(In short, Deep Learning is a subset of Machine Learning, which is a subset of AI)*

There is also another way to view this: **Supervised and Unsupervised Learning**. These describe *how* an algorithm learns. Supervised Learning learns from labeled examples, Unsupervised Learning finds structure in unlabeled data, and Reinforcement Learning learns from reward signals. Deep Learning is not a separate category here; neural networks can be used in all three.

### BEHIND THE HYPE: TERMINOLOGIES EXPLAINED

- **Data Science**: Extraction of knowledge and information from data, using integrated ideas from Mathematics, Statistics, Machine Learning, Computer Science, and Subject Matter Expertise (SME).
- **Big Data**: Data, often unstructured, arriving from multiple sources at an alarming **Velocity, Volume and Variety**, and in a format from which meaningful value and information have not been extracted (yet).
- **Machine Learning** (ML): A field of computer science in which algorithms learn from data without being explicitly programmed.
- **Statistical Learning**: A branch of applied statistics that emerged in response to ML, emphasizing statistical models and the assessment of uncertainty.
- **Deep Learning** (DL): A computational method for implementing machine learning using artificial neural networks, which build multiple layers of abstraction to solve complex semantic problems.
- **Reinforcement Learning** (RL): An extremely promising area using the trial-and-error paradigm, where the (computing) Agent learns and corrects its Actions based on Reward signals and State.

![](https://github.com/didiooi/beginnersguideML/blob/master/Data%20Sci.png)

*(The goal is to be a **UNICORN**, or at least to be strong in one third and halfway there in the other two thirds, if that makes sense?)*

All of these terms are very similar but have **DIFFERENT EMPHASES**.

Notice the importance of subject matter expertise in the equation. Don't ditch your science degree/masters and jump straight onto the Data Science and AI bandwagon - the skillsets from your courses are still valuable. They make you the subject matter expert, so think about how you can use AI in your industry/field of research.


## 3. Practice
- [Kaggle](https://www.kaggle.com/): Run through the [tutorials](https://www.kaggle.com/wiki/Tutorials) and start by solving the [Titanic problem](https://www.kaggle.com/c/titanic) with Machine Learning!
- Fork existing, popular [GitHub](https://www.github.com) repositories and play with them!
- It will also be extremely USEFUL to have these 30 essential data science, ML and DL [**CHEAT SHEETS**](https://www.kdnuggets.com/2017/09/essential-data-science-machine-learning-deep-learning-cheat-sheets.html) next to you at all times, posted on your corkboard at work, at home and by your bedside.

## 4. Resources


### MACHINE LEARNING
- Andrew Ng's Stanford [**Machine Learning** course](https://www.coursera.org/learn/machine-learning) on Coursera is a great place to start if you already have a decent science and math background.
- For the theoretical background behind Statistical Learning, the branch of statistics that developed alongside Machine Learning, your best bet is [**An Introduction to Statistical Learning**](http://www-bcf.usc.edu/~gareth/ISL/). The book is free to access online!
- If you want the *classic* guide to ML and need a refresher on the math, definitely go to Chris Bishop's [**Pattern Recognition and Machine Learning**](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf).
- If you're feeling extra adventurous and would love to go deeper into the theoretical and mathematical background, try Hastie, Tibshirani and Friedman's [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/).

### DEEP LEARNING
- For **Deep Learning**, there is no better place to start than Ian Goodfellow et al.'s [Deep Learning](http://www.deeplearningbook.org/) book. It is FREE online too.
- Andrew Ng's Deep Learning courses at [deeplearning.ai](https://www.deeplearning.ai/)
- Jeremy Howard's [Fast AI](http://course.fast.ai/)
- [Neural Network Zoo](http://www.asimovinstitute.org/neural-network-zoo/) - a cheat sheet covering most NN architectures
- [Neural Network playground with TensorFlow](http://playground.tensorflow.org) - a web app that lets you play with a real neural network (NN) running in your browser, where you can click buttons and tweak parameters to see how it works. If you are unsure what all those buttons do, read [Google's blog](https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground)

![](https://github.com/didiooi/beginnersguideML/blob/master/Picture1.png)

### REINFORCEMENT LEARNING
- [NEWS ALERT: AlphaGo Zero](https://deepmind.com/blog/alphago-zero-learning-scratch/) - see how this AI mastered Go purely through self-play, with nothing to learn from its humans.
- [Brief Introduction](http://karpathy.github.io/2016/05/31/rl/) by Andrej Karpathy
- [Sutton and Barto's classic RL book](http://incompleteideas.net/sutton/book/ebook/the-book.html) - which is free online


### MATH & STATS (OFTEN ENCOUNTERED)
I am not a big fan of this, but since it is a frequently asked question, I will just put it here. My advice is to learn the math on demand, as you need it, because starting with the math will quickly diminish your interest in pursuing ML/DS. Tbh, Andrew Ng's ML course is very forgiving and refreshes the math and stats for you!
1. Probability and Statistics
2. Linear Algebra
3. Multivariate Calculus, especially derivatives and integrals
4. Optimization

## 5. Programming Language
With so many languages out there and people preaching that you should use theirs, it is easy to get overwhelmed. My advice is to remember that the goal is not mastery of one language; it is understanding the logic and syntax of programming, which carries over from one language to the next.
For complete newbies, I definitely recommend Python (as of 2017). The other reason is that Python has very strong community support and makes it easy to call Machine Learning libraries and frameworks.

[**Python**](https://www.python.org/) is one of the fastest growing languages because of how dynamic and readable it is, so I'd suggest getting started with the basics. If you're a complete beginner, like me, start with Al Sweigart's no-fuss examples from [Automate The Boring Stuff with Python](https://automatetheboringstuff.com/).

Still not convinced that Python is beating R, Matlab etc? Read '[Python overtakes R, becomes the leader in Data Science, Machine Learning platforms](https://www.kdnuggets.com/2017/08/python-overtakes-r-leader-analytics-data-science.html)'

**Important Python Tools, Libraries**
[*I will elaborate more on this in the future!*]
- [Anaconda](https://anaconda.org/)
- [Jupyter Notebook](http://jupyter.org/)
- [NumPy](http://www.numpy.org): array and matrix support
- [SciPy](https://www.scipy.org): stats, maths, engineering
- [Pandas](http://www.pandas.pydata.org): data structures
- [Matplotlib](https://matplotlib.org): data visualization (2D plotting)
- [Seaborn](https://seaborn.pydata.org): higher level data viz

## 6. Open Source Machine Learning libraries
1. [Scikit-Learn](http://scikit-learn.org): for the Pythonista
2. [TensorFlow](https://www.tensorflow.org/): Google Brain's open source software library for Machine Learning
3. [Theano](http://deeplearning.net/software/theano/): another Python library, for defining and evaluating mathematical expressions on multi-dimensional arrays, with a NumPy-like syntax
4. [Keras](https://keras.io/): a high-level neural network library capable of running on top of TensorFlow, Microsoft Cognitive Toolkit (CNTK) or Theano
5. ...*and more, but get to know the first two first, and maybe experiment with the examples from [Aurélien Géron's book](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=sr_1_9?ie=UTF8&qid=1508900732&sr=8-9&keywords=reinforcement+learning)!*

## 7. News and Forums for Data Science and AI
1. [KDNuggets](https://www.kdnuggets.com/)
2. Following the right people on [Twitter](https://twitter.com/didiooi/following) (most of the people I follow on Twitter are at the forefront of the Machine Learning and Deep Learning realm)
3. [Quora on Machine Learning](https://www.quora.com/topic/Machine-Learning): for pretty intelligent discussion, you can simply follow the [top/most viewed writers](https://www.quora.com/topic/Machine-Learning/writers), like [Andrew Ng](https://www.quora.com/profile/Andrew-Ng)
4. [Medium](https://medium.com): short reads on all sorts of topics, including ML, DL and robotics (make sure to personalize your feed first)
5. Reddit for hype-and-updates on [r/MachineLearning](https://www.reddit.com/r/MachineLearning/)
6. [StackExchange](https://stackexchange.com/): to ask for help with any data science or programming problem

## 8. Data Visualization
Now that you have the tools and resources, it is important to remember that [data visualization](https://en.wikipedia.org/wiki/Data_visualization) is also an important front-end component of Data Science. This is because EFFECTIVE COMMUNICATION of data is crucial to all the work you have spent your blood, sweat and tears on, especially when you are sharing the results with your boss, stakeholders and/or clients.
The lack of it is what gave rise to the other buzzword - **Business Intelligence**, which includes tools like Microsoft Power BI, TIBCO Spotfire and Tableau (which are basically Excel on steroids).
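
To make the point about communicating data a little more concrete, here is a minimal sketch of a chart that is sorted and directly labeled so the ranking can be read at a glance. It uses only the Python standard library so that it runs without installing anything. The function name `text_bar_chart`, the `width` parameter and the numbers are all made up for illustration, and the sort-and-label habit is just one common default, not something taken from the keynote or the tools above.

```python
def text_bar_chart(values, width=30):
    """Return a text bar chart, largest value first, one labeled bar per line.

    Assumes `values` maps labels to non-negative numbers.
    """
    if not values:
        return ""
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    # Sort descending so the ranking is visible without reading every number.
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    for label, value in ranked:
        # Scale each bar against the largest value; guard against all-zero data.
        bar_length = round(width * value / largest) if largest > 0 else 0
        # Put the raw value next to the bar so nobody has to guess it.
        lines.append(f"{label:<{label_width}} | {'#' * bar_length} {value}")
    return "\n".join(lines)


# Hypothetical, made-up counts used only to demonstrate the function.
counts = {"north": 40, "south": 80, "east": 20}
print(text_bar_chart(counts, width=20))
```

The same dictionary could be handed to a plotting library such as Matplotlib once you have it installed; the idea of ordering the bars and labeling them carries over.
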
Inspired by the Microsoft Data Summit 2017 keynote by [Alberto Cairo](http://www.thefunctionalart.com) (a modern-day data viz guru in the mold of [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte)) - here is a short read, [6 Fundamentals of Data Visualization](https://www.linkedin.com/pulse/6-fundamental-principles-data-visualization-didi-sher-ooi/), summarising it.


## Key Takeaway
The key takeaways that I have learned:
1. You do not need to know advanced coding to get into Data Science, Machine Learning etc. Do it from the top down.
2. Data Science and its tools are **NOT magic**! You should remain skeptical and vigilant. Good data and proper internal validation are required.

## Questions?
Yes, this is what I decided to do on a Sunday morning, after receiving requests from friends over the last few weeks on how they can get started in the AI field, so forgive any grammatical errors. Do let me know if you have any questions, at didi.ooi@bristol.ac.uk or message me on [LinkedIn](https://www.linkedin.com/in/didiooi).


--------------------------------------------------------------------------------