├── LICENSE
├── ML-101 Guide.md
├── ML-101 Modules
│   ├── Intro
│   │   └── Intro.md
│   ├── Module 01
│   │   ├── Lesson 01
│   │   │   └── README.md
│   │   ├── Lesson 02
│   │   │   ├── Practice-Homework.md
│   │   │   └── README.md
│   │   ├── Lesson 03
│   │   │   ├── Plan.ipynb
│   │   │   └── README.md
│   │   ├── Lesson 04
│   │   │   └── README.md
│   │   └── ML-101 Module 01.md
│   ├── Module 02
│   │   ├── Lesson 01
│   │   │   └── README.md
│   │   ├── Lesson 02
│   │   │   ├── Practice
│   │   │   │   ├── Practice-Homework.md
│   │   │   │   ├── startup-profit-prediction - Practice Code Part 1&2.ipynb
│   │   │   │   ├── startup-profit-prediction - Practice Code Part 3&4.ipynb
│   │   │   │   ├── startup-profit-prediction - Practice.ipynb
│   │   │   │   ├── test.csv
│   │   │   │   └── train.csv
│   │   │   └── README.md
│   │   └── ML-101 Module 02.md
│   └── Module 03
│       ├── Lesson 01
│       │   └── README.md
│       ├── Lesson 02
│       │   ├── Practice 1
│       │   │   ├── Gender - Practice Code Part 1&2.ipynb
│       │   │   ├── Gender - Practice Code Part 3&4.ipynb
│       │   │   ├── Gender - Practice.ipynb
│       │   │   ├── Practice-Homework.md
│       │   │   └── gender.csv
│       │   ├── Practice 2
│       │   │   ├── Practice-Homework.md
│       │   │   ├── Winequality - Practice Code Part 1&2.ipynb
│       │   │   ├── Winequality - Practice Code Part 3&4.ipynb
│       │   │   ├── Winequality - Practice.ipynb
│       │   │   └── winequality.csv
│       │   └── README.md
│       ├── Lesson 03
│       │   ├── Practice 1
│       │   │   ├── Practice-Homework.md
│       │   │   ├── vehicles - Practice Code Part 1.ipynb
│       │   │   ├── vehicles - Practice Code Part 2.ipynb
│       │   │   ├── vehicles.csv
│       │   │   └── vehicles_Practice.ipynb
│       │   ├── Practice 2
│       │   │   ├── Practice-Homework.md
│       │   │   ├── vehicles - short version - Practice Code Part 1.ipynb
│       │   │   ├── vehicles - short version - Practice Code Part 2.ipynb
│       │   │   ├── vehicles - short version - Practice.ipynb
│       │   │   ├── vehicles.csv
│       │   │   └── vehicles_helper.py
│       │   └── README.md
│       └── ML-101 Module 03.md
├── Modules
│   ├── Module01
│   │   ├── Lesson01
│   │   │   └── README.md
│   │   └── README.md
│   ├── Module02
│   │   └── README.md
│   └── Module03
│       └── README.md
├── README.md
├── how_to
│   ├── Github
│   │   └── How_to_Github.md
│   ├── Jupyter Notebook
│   │   └── how_to_Jupyter_Notebook.md
│   └── README.md
└── img
    └── readme.md
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Data-Learn
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/ML-101 Guide.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 | # Getting Started with Machine Learning and Data Science (ML-101)
4 | ## Entering the Profession
5 |
6 | [Introduction](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%B2%D0%B2%D0%B5%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5)
7 |
8 | [Who can apply](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%B4%D0%BB%D1%8F-%D0%BA%D0%BE%D0%B3%D0%BE-%D1%8D%D1%82%D0%BE%D1%82-%D0%BA%D1%83%D1%80%D1%81)
9 |
10 | [Course outline](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D1%81%D1%85%D0%B5%D0%BC%D0%B0-%D0%BA%D1%83%D1%80%D1%81%D0%B0)
11 |
12 | [Module 01](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2001/ML-101%20Module%2001.md)
13 |
14 | [Module 02](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2002/ML-101%20Module%2002.md)
15 |
16 | [Module 03](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2003/ML-101%20Module%2003.md)
17 |
18 | [Basic knowledge](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%B1%D0%B0%D0%B7%D0%BE%D0%B2%D1%8B%D0%B5-%D0%B7%D0%BD%D0%B0%D0%BD%D0%B8%D1%8F)
19 |
20 | [Tools](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%B8%D0%BD%D1%81%D1%82%D1%80%D1%83%D0%BC%D0%B5%D0%BD%D1%82%D1%8B)
21 |
22 | [Course principles](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%BF%D1%80%D0%B8%D0%BD%D1%86%D0%B8%D0%BF%D1%8B-%D0%BA%D1%83%D1%80%D1%81%D0%B0)
23 |
24 | [Certificates](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D1%81%D0%B5%D1%80%D1%82%D0%B8%D1%84%D0%B8%D0%BA%D0%B0%D1%82%D1%8B)
25 |
26 | [Repository structure](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D1%81%D1%82%D1%80%D1%83%D0%BA%D1%82%D1%83%D1%80%D0%B0-%D1%80%D0%B5%D0%BF%D0%BE%D0%B7%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D1%8F)
27 |
28 | [Registration](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md#%D0%BA%D0%B0%D0%BA-%D0%B7%D0%B0%D1%80%D0%B5%D0%B3%D0%B8%D1%81%D1%82%D1%80%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D1%82%D1%8C%D1%81%D1%8F-%D0%BD%D0%B0-%D0%BA%D1%83%D1%80%D1%81)
29 |
30 |
31 | ## Introduction
32 |
33 | Hello everyone!
34 |
35 | My name is [Anastasia Rizzo](https://www.linkedin.com/in/anastasia-r-7b8a0376); I am a data scientist, and the author and instructor of this course.
36 |
37 | Many of you already know me from the data science [webinar](https://www.youtube.com/watch?v=0bejTH1BKWI) on the **DataLearn** [channel](https://www.youtube.com/channel/UCWki7GBUE5lDMJCbn4e1XMg), as well as from the **Data Engineering** [course](https://github.com/Data-Learn/data-engineering) **by Dmitry Anoshin**.
38 |
39 | Together with **Data Learn**, we are launching **ML-101**, an **introductory course** on the theory of **Machine Learning and Data Science**, with clear theory and practical real-life cases.
40 |
41 | [**Intro Video**](https://youtu.be/g2azOLGzeNo)
42 |
43 | ## Who can apply
44 |
45 | This course is designed for people who:
46 |
47 | - are very interested in the subject but, for various reasons, have never managed to start studying it;
48 |
49 | - want to change careers and move into Data Science, but do not quite understand what awaits them and whether the game is worth the candle;
50 |
51 | - want to enter the world of Data Science not just in theory, but by completing several practical cases on their own;
52 |
53 | - already work as a Data Engineer or Business/BI/Data Analyst and want to speak the same language as Data Scientists.
54 |
55 | ## Course outline
56 |
57 | Let me sketch the **course structure**; following the links, you will see a detailed description of each module:
58 |
59 | The course consists of **3 modules**.
60 |
61 | The [first module](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2001/ML-101%20Module%2001.md) is **theory**; the [second](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2002/ML-101%20Module%2002.md) and [third](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Modules/Module%2003/ML-101%20Module%2003.md) modules are **theory combined with practice**.
62 |
63 |
64 | ## Basic knowledge
65 |
66 | > **_Note:_** If you do not know something, we will figure it out together as the course goes along.
67 |
68 | Let's look at the basic knowledge you will need:
69 |
70 | ### 1. Database theory
71 | Only the very basics are needed here: have an idea of what a database is, know that it is where the data lives, and know the minimal terminology. I will explain everything else that looks "scary and complicated". If you are already experienced here, or are taking **Dmitry Anoshin's** [DE-101 course](https://github.com/Data-Learn/data-engineering), this will not be a problem for you.
72 |
73 | ### 2. Statistics
74 | It is certainly needed here, though not all of it; reading an introduction to it is worthwhile, as I will use many of its terms. On the whole, we will figure it out as we learn.
75 |
76 | ### 3. Algorithms
77 | Here we are talking about machine learning algorithms, and this is very individual. I have been in love with them since my second year at university. You just have to want to understand them and break them down into a series of small steps. In short, we will manage this together too 😊
78 |
79 | ### 4. Python
80 | **Data Science** uses **2** programming languages: **Python** and **R**. We will use **Python**.
81 | Those of you familiar with programming in any other language will be able to switch to **Python** easily. It is a very easy language to understand.
82 | Those who have never even heard the word: don't worry, but you do need to start learning.
83 | While I am presenting the theory, you can simply look at the code screenshots. But when the practice begins, it will get hard for you. I will try to comment the code, but there will be a lot of it and it will feel scary. And perhaps a large share of people will drop off the radar for good, having decided that:
84 |
85 | - life is pain, and data science is not for them;
86 |
87 | - the course is bad, nothing makes sense;
88 |
89 | - "I'm too smart for all this, I'm off to eat!"
90 |
91 | **The way out** is this:
92 |
93 | - take any quick introductory **Python** course, or watch the **Data Learn** **Python** webinars by **Dmitry Belyaev**: [Webinar 1](https://www.youtube.com/watch?v=TpnJWYgNMWE) and [Webinar 2](https://www.youtube.com/watch?v=9h6vDs1M5M8)
94 |
95 | - when the practical cases begin, at least start to understand and memorize which piece of code does what and what it is responsible for (like Lego bricks); try to retype the code with your own fingers (this is called hands-on practice): your brain will definitely retain something; later on, try to pick the code apart yourself.
96 | Yes, this is copying. But all learning starts with it. You will get there!
97 |
98 | ### 5. A little Mathematics
99 | You need just a tiny bit here, 2+2; it is almost embarrassing to mention.
100 |
101 | ### 6. English
102 | You need it, period. If you do not know it, start learning. Your level should let you read and understand text, and in our case technical terminology as well.
103 |
104 | Even if:
105 |
106 | - you work in Russia, Ukraine, Belarus (hang in there, folks!), Kazakhstan or other post-Soviet countries and communicate in your native languages...
107 |
108 | - you can study databases, statistics, algorithms and mathematics in your native language...
109 |
110 | ...then what do you do about programming? How will you write code (even with comments in your own language)?
111 |
112 | And one more point. I received a Western education, and it is easier for me to explain all the technical literature and terminology in English. At times I do not even know how to say this or that in Russian. Still, I am a native Russian speaker and I am making this course for you. And you need to start learning English. So, in places, I will switch from one language to the other, and in some spots use only English terminology. This will only do you good, especially those of you who plan to run projects with Western teams in the future.
113 | So I hope we will avoid comments along the lines of "I can't stand the jumping between languages", "bad pronunciation" or "stop showing off and speak Russian". Thank you for understanding!
114 |
115 |
116 | ## Tools
117 |
118 | The tools we will be using:
119 |
120 | ### 1. Youtube
121 | The **Data Learn channel**: [there](https://www.youtube.com/channel/UCWki7GBUE5lDMJCbn4e1XMg) we will watch all the course videos. By the way, **it has many other useful videos and webinars** on topics such as: **Data Engineering, Analytics, Python, Data Science, SQL, Careers, Project management and working in a Data team, Learning English, Emigration for technical specialists**.
122 |
123 | ### 2. Github
124 | All the **navigation** for this **course** will live there, **plus** it will hold our **practical cases**. And yes, you will definitely need to **create an account** there. Those who already have one: well done! Those who don't are fine too; I have made an [instruction](https://github.com/Data-Learn/data-science/blob/main/how_to/Github/How_to_Github.md) for you
125 | with pictures of “what to click, where and why” (so that you are sure to get from point “A” to point “B”).
126 |
127 |
128 | If you want to study **Github** in more detail, **Data Learn** has **how_to** [instructions](https://github.com/Data-Learn/data-engineering/blob/master/how-to/How%20to%20get%20git.md). You will also find links [here](https://github.com/Data-Learn/data-engineering/blob/master/Learning%20resources.md).
129 |
130 | For IT people it is a familiar resource: it holds a lot of code and useful information, and we also use it as our **technical resume**. If you do not know what a technical resume is, let me explain.
131 |
132 | It is not a resume in the usual sense, but rather work samples, a portfolio if you like. In this case it is your **Github account**. There we show what stands behind the beautiful words written in your actual resume: which projects were really completed, and exactly how.
133 |
134 | Each **project** (or whatever you decide to publish: a good term paper, a homework, or simply some code) gets its own **repository** (in simple terms, the main folder where it is stored). A **repository** should have a clear **name** (not a jumble of letters like “wmvhf-05” that could mean anything), a good **description** of the project (what it is about, what was done), and clean **code** (ideally with comments). The more high-quality, properly presented repositories, the better.
135 |
136 | Any serious IT company, team or HR manager will want to see your **Github** before talking to you. Both of my data science internships (Mozilla and Amazon) relied on **Github**. When applying for the Mozilla internship, a **Github** account was listed as a must-have.
137 |
138 | Imagine you are an employer who needs to hire 1 candidate out of 100, and in front of you are 100 perfect resumes. How do you choose? If some candidates have a Github, they are already a step above the rest, if only because they made the employer's choice easier by providing a technical resume and saved their working time.
139 |
140 | And you need all this if:
141 |
142 | - you have ambitions;
143 |
144 | - you want more from life;
145 |
146 | - you see yourself in the future at a FAANG company or somewhere in the same orbit;
147 |
148 | - you are reaching the level of building things not only for money, but also to make the World a better place.
149 |
150 | It also matters (especially to Western employers) what **contribution** a person makes to the **Open Source community** (you can google this). You can join the repositories of various courses (for example, this one) and complete assignments there (this will show up in your account). You can create your own repository with content on a topic you are a pro in and publish it for people. You can find the account of any IT company and contribute to their code or project (if they accept, or merge, your contribution, you are doing really well!).
151 | ### 3. Slack
152 | This is a messenger; you can download it [here](https://slack.com/intl/en-ca/downloads/) for your computer or find the mobile version. In it we will communicate, ask questions and discuss things.
153 |
154 | ### 4. A computer or laptop
155 | For the practical work you will need any **computer or laptop** running **Windows, Mac or Linux**. On it we will install **Jupyter Notebook**, the environment where we will write code and run our computations.
156 |
157 | Now a few words about why we will use this particular development environment. I know many of you use other environments and will certainly leave lots of comments about it. It is simple. The **ML-101 course is introductory** and designed for very basic skills and knowledge, so we will learn on simple, understandable tools. **Jupyter Notebook** is easy to install, has a simple interface design, and is easy to handle. Most importantly, it makes it easy to understand how a given piece of code works. Again, from my own experience: my internships required us to use **Jupyter Notebook**. With experience, you can move on to more professional tools.
158 |
159 | Here is a [link](https://github.com/Data-Learn/data-science/blob/main/how_to/Jupyter%20Notebook/how_to_Jupyter_Notebook.md) to the installation instructions.
160 |
161 | Internet access, naturally, is required.
162 |
163 | ## Course principles
164 |
165 | Like all **Data Learn** courses, this course has its **principles**:
166 |
167 | 1. We start simple and move towards the complex.
168 |
169 | 2. I explain everything in plain language, "on my fingers". I am a devotee of simplification. I could explain the dense theory to you in academic language, but what would be the point if the theory remained misunderstood by most? I believe the hardest thing of all is to explain complex things in simple language.
170 |
171 | 3. All the practical cases we will work through come from real life.
172 |
173 | 4. Research, i.e. finding the information you are missing on your own. I present a topic to you and show you the direction, and we walk through the topic together. If the information is not enough for you, just google it.
174 |
175 | 5. Only constructive criticism. This means that if you criticize something, propose a better solution in return. No solution, no criticism accepted.
176 |
177 | ## Certificates
178 |
179 | Since the course is still being built, we have not yet decided what the **final certificate** will look like, but we are constantly thinking about it. Besides the main certificate, we have added the concept of **badges**, which you will earn for completing the homework of each module.
180 |
181 | The **ML-101 course** consists of **3 modules**, and you will receive a **badge** for each module. To earn it, you need to show us your **Github** containing an **ML-101 folder** with the following **subfolders** inside:
182 | - Module01
183 | - Module02
184 | - Module03
185 |
186 | If you have completed the **homework**, you will be able to **add a new document**, based on our template, to the **ML-101** folder with information about your achievements.
187 |
188 | Although **Data Learn** is still a relatively young project, it has already earned the trust of many data professionals, which means that **Data Learn** students receive the most up-to-date knowledge, in demand on both the local and the Western market. Serious motivation and determination are required to finish the course, and if you get through all the modules of **ML-101**, you will easily handle the basic level of tasks in **Data Science Intern, Junior Data Scientist or Applied Scientist** positions.
189 |
190 | ## Repository structure
191 |
192 | Here is how to use this repository.
193 |
194 | - The **README.md** [file](https://github.com/Data-Learn/data-science/blob/main/README.md) introduces the **Data Learn** platform, describes the courses, and gives short registration instructions.
195 |
196 | - The **ML-101 Guide.md** [file](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md) is our guide to the **ML-101 (Getting Started with Machine Learning and Data Science)** course; it contains all the information and links to the resources needed to complete the course successfully.
197 |
198 | - The **ML-101 Modules** [folder](https://github.com/Data-Learn/data-science/tree/main/ML-101%20Modules) contains 3 folders corresponding to the 3 modules of this course: **Module 01, Module 02, Module 03**. There you will find lesson descriptions, links to the video lessons, supplementary materials (optional), and the practical case assignments (completing them is required to pass the course and receive the badges and the certificate).
199 |
200 | - The **how_to** [folder](https://github.com/Data-Learn/data-science/tree/main/how_to) contains installation instructions for the tools we need.
201 |
202 | ## How to register for the course
203 | 1. Register on the [**ML-101**](https://datalearn.ru) course page.
204 | 2. The site will show a page with a link to a short survey about your experience and your interest in the resource. You need to fill in the survey.
205 | 3. Once you complete the survey, the survey completion page will show an invitation link to our **Slack** community.
206 |
207 | **Slack** is a messenger; you can download it [here](https://slack.com/intl/en-ca/downloads/) for your computer or find the mobile version. In it we will communicate, ask questions and discuss things.
208 |
209 | Our **Slack** channels:
210 |
211 | The **ML-101** course has a general course channel and separate channels for each module, where announcements are posted, practical assignments are discussed, and you can ask colleagues for help.
212 |
213 | - **ml-101** (the general course chat), **ml_module01**, **ml_module02**, **ml_module03**. 3 modules == 3 channels.
214 |
215 | - **data_learn_announce** - the main channel, where we publish news and announce new videos; you can comment on the messages.
216 | - **data_learn_chat** - small talk for everyone, about everything.
217 | - **ask-help-with-data-stuff** - ask a question on any topic or ask for help with your work.
218 | - **boltalka** - a channel about everything.
219 | - **what_i_learnt** - a channel where you can share what you have learned and which course you have completed.
220 | - **python-chat** - a channel dedicated to **Python** questions.
221 |
222 | You can add the **Slack** channel you need and see the full list of available channels by clicking **+**.
223 |
224 | 
225 |
226 | Thank you all, and see you in the **ML-101** course and in our **Data Learn** community in **Slack**.
227 |
228 | Anastasia Rizzo.
229 |
--------------------------------------------------------------------------------
/ML-101 Modules/Intro/Intro.md:
--------------------------------------------------------------------------------
1 | **Getting Started with Machine Learning and Data Science (ML-101)** - an introductory [course](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md) on Machine Learning and Data Science, with theory and practical real-life cases, by [Anastasia Rizzo](https://www.linkedin.com/in/anastasia-r-7b8a0376).
2 |
3 | The course includes 3 modules:
4 |
5 | Module 01: The theory of Machine Learning and Data Science;
6 |
7 | Module 02: Regression (theory and practice);
8 |
9 | Module 03: Classification (theory and 2 practical cases).
10 |
11 | The course allows you to experience the Data Scientist profession yourself and is especially suitable for those who may be unsure, but are very interested in starting to explore this topic.
17 |
18 | [**Intro Video**](https://youtu.be/g2azOLGzeNo)
19 |
20 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 01/README.md:
--------------------------------------------------------------------------------
1 | ## Lesson 01. AI: subsets; ML: types + tasks + lifecycle.
2 |
3 |
4 | In this tutorial you will learn:
5 |
6 | 📌 Some information about the course: How to take this course? How will the learning process take place?
7 |
8 | 📌 Some background information about Artificial Intelligence (AI), Machine Learning (ML) and Data Science (DS);
9 |
10 | 📌 AI and its subfields;
11 |
12 | 📌 Types of ML (Supervised, Unsupervised, Semi-supervised and Reinforcement Learning);
13 |
14 | 📌 Data with/without Labels or Labelled and Unlabelled data;
15 |
16 | 📌 What tasks can be solved using ML (Recommendation, Ranking, Regression, Classification, Clustering, Anomaly Detection);
17 |
18 | 📌 What is the ML Lifecycle and how does it work.
19 |
20 | [**Video Tutorial 01**](https://youtu.be/Cf_Yys2VHS4)
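21 |
22 | To make the difference between Supervised and Unsupervised Learning concrete, here is a small, hedged sketch added alongside the lesson (not part of its materials); the iris dataset and both models are arbitrary illustrative choices:
23 |
24 | ```python
25 | # Minimal sketch: supervised (labelled) vs unsupervised (unlabelled) learning.
26 | # The iris dataset and both models are illustrative assumptions.
27 | from sklearn.datasets import load_iris
28 | from sklearn.model_selection import train_test_split
29 | from sklearn.neighbors import KNeighborsClassifier
30 | from sklearn.cluster import KMeans
31 |
32 | X, y = load_iris(return_X_y=True)  # X = features, y = labels
33 |
34 | # Supervised: the model learns from labelled examples (X, y)
35 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
36 | clf = KNeighborsClassifier().fit(X_train, y_train)
37 | print("classification accuracy:", clf.score(X_test, y_test))
38 |
39 | # Unsupervised: the model sees only X and finds structure on its own
40 | km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
41 | print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
42 | ```
43 |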
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 02/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | # Practice.
2 |
3 | 1. **Install Jupyter Notebook** on your computer/laptop.
4 |
5 | 2. **Register on Github**:
6 |
7 | - create an account;
8 |
9 | - fork the course repository;
10 |
11 |
12 | ## how_to links (detailed step-by-step instructions for completing the assignment):
13 |
14 | - **Jupyter Notebook**
15 |
16 | https://github.com/Data-Learn/data-science/blob/main/how_to/Jupyter%20Notebook/how_to_Jupyter_Notebook.md
17 |
18 | - **Github**
19 |
20 | https://github.com/Data-Learn/data-science/blob/main/how_to/Github/How_to_Github.md
21 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 02/README.md:
--------------------------------------------------------------------------------
1 | ## Lesson 02. Datasets, Libraries, Data Load, Train-Validation-Test datasets.
2 |
3 |
4 |
5 | 
6 | In this tutorial we will:
7 |
8 | 📌 Learn how to plan work with datasets;
9 |
10 | 📌 Analyse the libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn;
11 |
12 | 📌 See how to load data from a '.csv' file;
13 |
14 | 📌 Understand the difference between Train, Validation and Test datasets.
15 |
16 | [**Video Tutorial 02**](https://youtu.be/KYeSwj6V150)
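17 |
18 | As a small, hedged illustration of the points above (not the lesson's own code): loading data from a '.csv' file with Pandas and splitting it into Train, Validation and Test sets could look like this. The file name 'train.csv', the 'target' column and the 60/20/20 ratios are assumptions for the example:
19 |
20 | ```python
21 | # Sketch: load a .csv file and split it into Train / Validation / Test sets.
22 | # 'train.csv', the 'target' column and the ratios are illustrative assumptions.
23 | import pandas as pd
24 | from sklearn.model_selection import train_test_split
25 |
26 | df = pd.read_csv('train.csv')     # load the dataset
27 | X = df.drop(columns=['target'])   # features
28 | y = df['target']                  # label column (hypothetical name)
29 |
30 | # First split off the Test set (20%), then split the rest into Train/Validation;
31 | # 0.25 of the remaining 80% gives a 60/20/20 split overall.
32 | X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
33 | X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
34 |
35 | print(len(X_train), len(X_val), len(X_test))
36 | ```
37 |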
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 03/Plan.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Overall project title"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### _\"Detailed project title.\"_"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents\n",
22 | "\n",
23 | "\n",
24 | "## Part 0: Introduction\n",
25 | "\n",
26 | "### Overview\n",
27 | " What this dataset is about\n",
28 | " +\n",
29 | " Metadata:\n",
30 | " \n",
31 | "* Rank - Ranking of overall sales\n",
32 | "\n",
33 | "* Name - The games name\n",
34 | "\n",
35 | "* Year - Year of the game's release\n",
36 | " \n",
37 | "### Assumptions\n",
38 | " Explanations/clarifications\n",
39 | "\n",
40 | "### Questions:\n",
41 | " Questions that need to be answered\n",
42 | "\n",
43 | "* #### Question 1: \n",
44 | "* #### Question 2: \n",
45 | "* #### Question 3: \n",
46 | "\n",
47 | "\n",
48 | "## [Part 1: Import, Settings, Load Data](#Part-1:-Import,-Settings,-Load-Data.)\n",
49 | "* ### Import libraries, Create settings, Read data from ‘.csv’ file\n",
50 | "\n",
51 | "## [Part 2: Exploratory Data Analysis](#Part-2:-Exploratory-Data-Analysis.)\n",
52 | "* ### Info, Head, Describe \n",
53 | "* ### Observation of target variable \"...\"\n",
54 | "* ### Missing Data\n",
55 | " * #### List of data features with missing values (visualisation: which chart, graph or plot do we use?) \n",
56 | " * #### Filling missing values\n",
57 | "* ### Numerical and Categorical features\n",
58 | " * #### List of Numerical and Categorical features\n",
59 | " * #### Numerical features:\n",
60 | " * Head\n",
61 | " * Visualisation of Numerical features (which chart, graph or plot do we use?)\n",
62 | " * Outliers (visualisation: which chart, graph or plot do we use?)\n",
63 | " * Correlation Numerical features to the target\n",
64 | "\n",
65 | " * #### Categorical Features:\n",
66 | " * Head\n",
67 | " * Visualisation of Categorical features (which chart, graph or plot do we use?)\n",
68 | " * Convert Categorical into Numerical features\n",
69 | " * Drop all old Categorical features \n",
70 | " \n",
71 | " * #### Correlation new features to the target. Drop all features with weak correlation to the target.\n",
72 | " * #### Visualisation of all data features with strong correlation to target (visualisation: heatmap)\n",
73 | "\n",
74 | "## [Part 3: Data Wrangling and Transformation](#Part-3:-Data-Wrangling-and-Transformation.)\n",
75 | "* ### Multicollinearity \n",
76 | "* ### Standard Scaler\n",
77 | "* ### Creating datasets for ML part\n",
78 | "* ### 'Train\\Test' splitting method\n",
79 | "\n",
80 | "## [Part 4: Machine Learning](#Part-4:-Machine-Learning.)\n",
81 | "* ### ML Models\n",
82 | "* ### Build and train models\n",
83 | "* ### Evaluate models\n",
84 | " * #### If regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R Squared\n",
85 | " * #### If Classification: Classification Report and Confusion Matrix \n",
86 | "* ### Hyperparameter tuning (if needed)\n",
87 | "* ### Creating final predictions with Test set\n",
88 | "* ### If Classification: AUC–ROC curve (if needed)\n",
89 | "\n",
90 | "## [Conclusion](#Conclusion.)\n",
91 | "* ### Submission of ‘.csv’ file with predictions\n",
92 | "\n",
93 | "\n"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "## Part 1: Import, Settings, Load Data."
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "* ### Import "
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": []
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "* ### Settings"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": []
130 | },
131 | {
132 | "cell_type": "markdown",
133 | "metadata": {},
134 | "source": [
135 | "* ### Load Data"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "metadata": {},
142 | "outputs": [],
143 | "source": []
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "## Part 2: Exploratory Data Analysis."
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "metadata": {},
155 | "source": [
156 | "## Part 3: Data Wrangling and Transformation."
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "## Part 4: Machine Learning."
164 | ]
165 | },
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "## Conclusion."
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": []
179 | }
180 | ],
181 | "metadata": {
182 | "kernelspec": {
183 | "display_name": "Python 3",
184 | "language": "python",
185 | "name": "python3"
186 | },
187 | "language_info": {
188 | "codemirror_mode": {
189 | "name": "ipython",
190 | "version": 3
191 | },
192 | "file_extension": ".py",
193 | "mimetype": "text/x-python",
194 | "name": "python",
195 | "nbconvert_exporter": "python",
196 | "pygments_lexer": "ipython3",
197 | "version": "3.7.3"
198 | }
199 | },
200 | "nbformat": 4,
201 | "nbformat_minor": 2
202 | }
203 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 03/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## Tutorial 03. Exploratory Data Analysis, Data Wrangling and Transformation.
3 |
4 |
5 |
6 | In this tutorial we will go through the entire Exploratory Data Analysis, which includes:
7 |
8 | 📌 Descriptive Statistics
9 |
10 | 📌 Observation of the target variable
11 |
12 | 📌 Missing Data
13 |
14 | 📌 Numerical and Categorical features
15 |
16 | Then we will consider Data Wrangling and Transformation:
17 |
18 | 📌 Multicollinearity
19 |
20 | 📌 Standard Scaler
21 |
22 | 📌 Creating datasets for the ML part
23 |
24 | 📌 'Train\Test' splitting method
25 |
26 | [**Video Tutorial 03**](https://youtu.be/S-ZBb4yvxAQ)
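27 |
28 | A compact, hedged sketch of these steps (the 'train.csv' file name, the 'target' column and the column handling are assumptions; the lesson itself works with the course dataset):
29 |
30 | ```python
31 | # Sketch of the EDA and transformation steps listed above.
32 | # 'train.csv' and the 'target' column are illustrative assumptions.
33 | import pandas as pd
34 | from sklearn.preprocessing import StandardScaler
35 | from sklearn.model_selection import train_test_split
36 |
37 | df = pd.read_csv('train.csv')
38 |
39 | # Descriptive statistics and missing data
40 | df.info()
41 | print(df.describe())
42 | print(df.isnull().sum())  # list of features with missing values
43 |
44 | # Fill missing values: 'mean' for numerical, 'None' for categorical columns
45 | num_cols = df.select_dtypes(include='number').columns
46 | cat_cols = df.select_dtypes(include='object').columns
47 | df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
48 | df[cat_cols] = df[cat_cols].fillna('None')
49 |
50 | # Encode categorical features, scale, and split into Train/Test sets
51 | X = pd.get_dummies(df.drop(columns=['target']))
52 | y = df['target']
53 | X_scaled = StandardScaler().fit_transform(X)
54 | X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
55 | ```
56 |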
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/Lesson 04/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## Lesson 04. Machine Learning.
3 |
4 | In this tutorial we will go through the entire Machine Learning part:
5 |
6 | 📌 Build and Train ML model
7 |
8 | 📌 Overfitting and Underfitting + Cross-Validation
9 |
10 | 📌 Model Evaluation
11 |
12 | 📌 Tuning hyperparameters
13 |
14 | 📌 Submission of ‘.csv’ file
15 |
16 | [**Video Tutorial 04**](https://youtu.be/Ypiv_2luYTU)
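17 |
18 | A hedged sketch of this workflow on synthetic data (the model choice, the parameter grid and the output file name are illustrative assumptions; the lesson itself works with the course dataset):
19 |
20 | ```python
21 | # Sketch: build, cross-validate, tune and evaluate a model, then save predictions.
22 | # Synthetic data; RandomForestRegressor and the grid are illustrative choices.
23 | import pandas as pd
24 | from sklearn.datasets import make_regression
25 | from sklearn.ensemble import RandomForestRegressor
26 | from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
27 |
28 | X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
29 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
30 |
31 | model = RandomForestRegressor(random_state=42)
32 |
33 | # Cross-validation: a sturdier estimate than a single split, and a first
34 | # check for overfitting/underfitting
35 | scores = cross_val_score(model, X_train, y_train, cv=5)
36 | print("CV R2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
37 |
38 | # Hyperparameter tuning over a small illustrative grid
39 | grid = GridSearchCV(model, {'n_estimators': [100, 300], 'max_depth': [None, 10]}, cv=5)
40 | grid.fit(X_train, y_train)
41 | print("best params:", grid.best_params_, "| test R2: %.3f" % grid.score(X_test, y_test))
42 |
43 | # Submission: save final predictions to a '.csv' file
44 | pd.DataFrame({'prediction': grid.predict(X_test)}).to_csv('predictions.csv', index=False)
45 | ```
46 |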
--------------------------------------------------------------------------------
/ML-101 Modules/Module 01/ML-101 Module 01.md:
--------------------------------------------------------------------------------
1 | # Module 01: Theory of Machine Learning and Data Science / Теория Машинного Обучения и Data Science.
2 |
3 |  The **first module** consists of **4 tutorials**. During this module we will:
4 |
5 | 📌 Get acquainted with the basic theory of machine learning.
6 |
7 | 📌 Install **jupyter notebook**, talk about preliminary work with datasets, libraries, data loading, different types of datasets.
8 |
9 | 📌 Analyse EDA (Exploratory Data Analysis): descriptive statistics, missing values, numerical & categorical data, outliers, feature-to-target correlation, data processing and transformation.
10 |
11 | 📌 See how to prepare a dataset for a machine learning model: train/test split, overfitting & underfitting, cross-validation.
12 |
13 | 📌 Talk about creating a machine learning model, how to fill it with data, how to tune its hyperparameters, how to evaluate the performance of the model, how to get and save predictions.
14 | ##
29 |
30 | ## Tutorial 01. AI: subsets; ML: types + tasks + lifecycle.
31 |
32 | [**Video Tutorial 01**](https://youtu.be/Cf_Yys2VHS4)
33 |
34 | ## Tutorial 02. Datasets, Libraries, Data Load, Train-Validation-Test datasets.
35 |
36 | [**Video Tutorial 02**](https://youtu.be/KYeSwj6V150)
37 |
38 | ## Tutorial 03. Exploratory Data Analysis.
39 |
40 | [**Video Tutorial 03**](https://youtu.be/S-ZBb4yvxAQ)
41 |
42 | ## Tutorial 04. Dataset Preparation for ML Model, ML Model, Hyper Parameters, ML Model Evaluation, Predictions.
43 |
44 | [**Video Tutorial 04**](https://youtu.be/Ypiv_2luYTU)
45 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 01/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 01. Regression: Theory and Algorithms
2 |
3 | In this tutorial, we will go through some regression theory and several of its algorithms:
4 |
5 | 📌 Build and Train ML model
6 |
7 | 📌 Linear Regression
8 |
9 | 📌 Ridge
10 |
11 | 📌 Lasso
12 |
13 | 📌 Elastic Net
14 |
15 | 📌 Support Vector Regression
16 |
17 | 📌 Decision Tree
18 |
19 | 📌 Random Forest
20 |
21 | [**Video Tutorial 01**](https://youtu.be/q7dQR_cd8pk)
22 |
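23 | For a feel of how these algorithms compare, here is a hedged sketch on synthetic data (the model settings are arbitrary illustrative choices; the lesson's theory explains when each algorithm is appropriate):
24 |
25 | ```python
26 | # Sketch: the regression algorithms above, fitted and scored on synthetic data.
27 | from sklearn.datasets import make_regression
28 | from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
29 | from sklearn.svm import SVR
30 | from sklearn.tree import DecisionTreeRegressor
31 | from sklearn.ensemble import RandomForestRegressor
32 | from sklearn.model_selection import train_test_split
33 |
34 | X, y = make_regression(n_samples=300, n_features=8, noise=15, random_state=0)
35 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
36 |
37 | models = {
38 |     'Linear Regression': LinearRegression(),
39 |     'Ridge': Ridge(alpha=1.0),
40 |     'Lasso': Lasso(alpha=0.1),
41 |     'Elastic Net': ElasticNet(alpha=0.1, l1_ratio=0.5),
42 |     'Support Vector Regression': SVR(kernel='rbf', C=100.0),
43 |     'Decision Tree': DecisionTreeRegressor(random_state=0),
44 |     'Random Forest': RandomForestRegressor(random_state=0),
45 | }
46 |
47 | # Train each model and report its R2 score on the held-out Test set
48 | for name, model in models.items():
49 |     model.fit(X_train, y_train)
50 |     print(f"{name}: R2 = {model.score(X_test, y_test):.3f}")
51 | ```
52 |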
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 02/Practice/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | ## Lesson 02. Regression: Practice.
2 |
3 | In this tutorial, we will work through a practical regression case.
4 |
5 | You will be given 5 files:
6 |
7 |
8 | * train.csv
9 |
10 | * test.csv
11 |
12 | * startup-profit-prediction - Practice.ipynb
13 |
14 | * startup-profit-prediction - Practice Code Part 1&2.ipynb
15 |
16 | * startup-profit-prediction - Practice Code Part 3&4.ipynb
17 | ##
18 |
19 | ## Practice:
20 |
21 |
22 | 
23 | * Open the files _startup-profit-prediction - Practice.ipynb_ and _startup-profit-prediction - Practice Code Part 1&2.ipynb_. Copy the code from _startup-profit-prediction - Practice Code Part 1&2.ipynb_ into _startup-profit-prediction - Practice.ipynb_. Compile the code block by block.
24 | * Open the file _startup-profit-prediction - Practice Code Part 3&4.ipynb_ and copy its code into _startup-profit-prediction - Practice.ipynb_, which now holds the already-compiled code. Compile everything together.
25 | * Get and save the predictions (they will be saved automatically to the folder that contains all your files).
26 | * Upload the following 4 files to your Github account:
27 |
28 | **train.csv**
29 |
30 | **test.csv**
31 |
32 | **startup-profit-prediction - Practice.ipynb** (with the new compiled code)
33 |
34 | **StartupPredictions.csv**
35 |
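36 | For orientation, the final "save predictions" step inside the compiled notebook typically looks like the hedged sketch below. The variable contents are hypothetical stand-ins; the actual code comes from the Practice Code notebooks, and 'StartupPredictions.csv' is the file name the homework expects:
37 |
38 | ```python
39 | # Sketch: how final predictions end up in 'StartupPredictions.csv'.
40 | # 'test_ids' and 'final_predictions' are hypothetical stand-in values.
41 | import pandas as pd
42 |
43 | test_ids = [0, 1, 2]                                # stand-in for the Test set 'ID' column
44 | final_predictions = [190000.0, 185000.0, 172000.0]  # stand-in model outputs
45 |
46 | submission = pd.DataFrame({'ID': test_ids, 'Profit': final_predictions})
47 | submission.to_csv('StartupPredictions.csv', index=False)  # saved next to your files
48 | ```
49 |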
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 02/Practice/startup-profit-prediction - Practice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# \"50 startups.\""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### _\"Predict which companies to invest in for maximizing profit\" (Regression task)._"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents\n",
22 | "\n",
23 | "\n",
24 | "## Part 0: Introduction\n",
25 | "\n",
26 | "### Overview\n",
27 | "The dataset we see here contains data about 50 startups. It has 7 columns: “ID”, “R&D Spend”, “Administration”, “Marketing Spend”, “State”, “Category”, “Profit”.\n",
28 | "\n",
29 | " \n",
30 | "**Metadata:**\n",
31 | " \n",
32 | "* **ID** - startup ID\n",
33 | "\n",
34 | "* **R&D Spend** - how much each startup spends on Research and Development\n",
35 | "\n",
36 | "* **Administration** - how much they spend on Administration cost\n",
37 | "\n",
38 | "* **Marketing Spend** - how much they spend on Marketing\n",
39 | "\n",
40 | "* **State** - which state the startup is based in\n",
41 | "\n",
42 | "* **Category** - which business category the startup belongs to\n",
43 | "\n",
44 | "* **Profit** - the profit made by the startup\n",
45 | " \n",
46 | "\n",
47 | "### Questions:\n",
48 | " \n",
49 | "\n",
50 | "* #### Predict which companies to invest in for maximizing profit (choose the model with the best score; create predictions; choose companies)\n",
51 | "\n",
52 | "\n",
53 | "## [Part 1: Import, Load Data](#Part-1:-Import,-Load-Data.)\n",
54 | "* ### Import libraries, Read data from ‘.csv’ file\n",
55 | "\n",
56 | "## [Part 2: Exploratory Data Analysis](#Part-2:-Exploratory-Data-Analysis.)\n",
57 | "* ### Info, Head\n",
58 | "* ### Observation of target variable (describe + visualisation:distplot)\n",
59 | "* ### Numerical and Categorical features\n",
60 | " * #### List of Numerical and Categorical features\n",
61 | "* ### Missing Data\n",
62 | " * #### List of data features with missing values \n",
63 | " * #### Filling missing values\n",
64 | "* ### Numerical and Categorical features \n",
65 | " * #### Visualisation of Numerical and categorical features (regplot + barplot)\n",
66 | "\n",
67 | "## [Part 3: Data Wrangling and Transformation](#Part-3:-Data-Wrangling-and-Transformation.)\n",
68 | "* ### One-Hot Encoding \n",
69 | "* ### Standard Scaler (optional)\n",
70 | "* ### Creating datasets for ML part\n",
71 | "* ### 'Train\\Test' splitting method\n",
72 | "\n",
73 | "## [Part 4: Machine Learning](#Part-4:-Machine-Learning.)\n",
74 | "* ### ML Models (Linear regression, Gradient Boosting Regression)\n",
75 | "* ### Build, train, evaluate and visualise models\n",
76 | "* ### Creating final predictions with Test set\n",
77 | "* ### Model comparison\n",
78 | "\n",
79 | "\n",
80 | "## [Conclusion](#Conclusion.)\n",
81 | "* ### Submission of ‘.csv’ file with predictions"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "## Part 1: Import, Load Data."
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "* ### Import "
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "# import standard libraries\n",
105 | " \n",
106 | "\n",
107 | "# import models and metrics\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "* ### Load Data"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 11,
120 | "metadata": {
121 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
122 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
123 | },
124 | "outputs": [],
125 | "source": [
126 | "# read data from '.csv' files\n",
127 | "\n",
128 | "\n",
129 | "# identify target\n"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "## Part 2: Exploratory Data Analysis."
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "* ### Info"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 12,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# print the full summary of the Train dataset\n"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 13,
158 | "metadata": {},
159 | "outputs": [],
160 | "source": [
161 | "# print the full summary of the Test dataset\n"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "* ### Head"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 14,
174 | "metadata": {
175 | "scrolled": false
176 | },
177 | "outputs": [],
178 | "source": [
179 | "# preview of the first 5 lines of the loaded Train data \n"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 15,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "# preview of the first 5 lines of the loaded Test data \n"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "* ### Observation of target variable"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 16,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "# target variable\n"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 17,
210 | "metadata": {
211 | "scrolled": true
212 | },
213 | "outputs": [],
214 | "source": [
215 | "# visualisation of 'Profit' distribution\n"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 18,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "# set 'ID' to index\n"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "* ### Numerical and Categorical features\n",
232 | "#### List of Numerical and Categorical features"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 19,
238 | "metadata": {
239 | "scrolled": true
240 | },
241 | "outputs": [],
242 | "source": [
243 | "# check for Numerical and Categorical features in Train\n"
244 | ]
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "* ### Missing values"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "#### List of data features with missing values"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": 20,
263 | "metadata": {
264 | "scrolled": true
265 | },
266 | "outputs": [],
267 | "source": [
268 | "# check the Train features with missing values \n"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 21,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "# check the Test features with missing values\n"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "#### Filling missing values"
285 | ]
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "Fields where NAN values have meaning.\n",
292 | "\n",
293 | "Explaining in further depth:\n",
294 | "\n",
295 | "* 'R&D Spend': Numerical - replacement of NAN by 'mean';\n",
296 | "* 'Administration': Numerical - replacement of NAN by 'mean';\n",
297 | "* 'Marketing Spend': Numerical - replacement of NAN by 'mean';\n",
298 | "* 'State': Categorical - replacement of NAN by 'None';\n",
299 | "* 'Category': Categorical - replacement of NAN by 'None'."
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": 22,
305 | "metadata": {},
306 | "outputs": [],
307 | "source": [
308 | " # Numerical NAN columns to fill in Train and Test datasets\n",
309 | "\n",
310 | "\n",
311 | "# replace 'NAN' with 'mean' in these columns\n",
312 | "\n",
313 | "\n",
314 | "# Categorical NAN columns to fill in Train and Test datasets\n",
315 | "\n",
316 | "\n",
317 | "# replace 'NAN' with 'None' in these columns\n"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 23,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "# check if there are any missing values left in Train\n"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": 24,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": [
335 | "# check if there are any missing values left in Test\n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "#### Visualisation of Numerical features (regplot)"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 25,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "# numerical features visualisation\n"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 26,
357 | "metadata": {
358 | "scrolled": true
359 | },
360 | "outputs": [],
361 | "source": [
362 | "# categorical features visualisation\n",
363 | "# 'Profit' split in 'State' level\n"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 27,
369 | "metadata": {
370 | "scrolled": false
371 | },
372 | "outputs": [],
373 | "source": [
374 | "# categorical features visualisation\n",
375 | "# 'Profit' split in 'Category' level\n"
376 | ]
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "## Part 3: Data Wrangling and Transformation."
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "* ### One-Hot Encoding"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 28,
395 | "metadata": {},
396 | "outputs": [],
397 | "source": [
398 | "# One-Hot Encoding Train dataset\n",
399 | "\n",
400 | "\n",
401 | "# Drop target variable \n",
402 | "\n"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": 29,
408 | "metadata": {},
409 | "outputs": [],
410 | "source": [
411 | "# preview of the first 5 lines of the loaded Train data \n"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 30,
417 | "metadata": {},
418 | "outputs": [],
419 | "source": [
420 | "# Train data shape\n"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 31,
426 | "metadata": {},
427 | "outputs": [],
428 | "source": [
429 | "# One-Hot Encoding Test dataset\n"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 32,
435 | "metadata": {
436 | "scrolled": true
437 | },
438 | "outputs": [],
439 | "source": [
440 | "# preview of the first 5 lines of the loaded Test data \n"
441 | ]
442 | },
443 | {
444 | "cell_type": "code",
445 | "execution_count": 33,
446 | "metadata": {},
447 | "outputs": [],
448 | "source": [
449 | "# Test data shape\n"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": 34,
455 | "metadata": {},
456 | "outputs": [],
457 | "source": [
458 | "# Drop unnecessary variables \n"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "* ### StandardScaler"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": []
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "* ### Creating datasets for ML part"
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": 35,
485 | "metadata": {},
486 | "outputs": [],
487 | "source": [
488 | "# set 'X' for features of scaled Train dataset 'sc_train'\n",
489 | "\n",
490 | "\n",
491 | "# set 'y' for the target 'Profit'\n",
492 | "\n",
493 | "\n",
494 | "# 'X_Test' for features of scaled Test dataset 'sc_test'\n"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "* ### 'Train\\Test' split"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {},
508 | "outputs": [],
509 | "source": []
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {},
515 | "outputs": [],
516 | "source": []
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": null,
521 | "metadata": {},
522 | "outputs": [],
523 | "source": []
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": null,
528 | "metadata": {},
529 | "outputs": [],
530 | "source": []
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "## Part 4: Machine Learning."
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
543 | "* ### Build, train, evaluate and visualise models"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "* #### Linear Regression"
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": 36,
556 | "metadata": {
557 | "scrolled": true
558 | },
559 | "outputs": [],
560 | "source": [
561 | "# Linear Regression model\n",
562 | "\n",
563 | "\n",
564 | "# Model Training\n",
565 | "\n",
566 | "# Model Prediction\n"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 37,
572 | "metadata": {},
573 | "outputs": [],
574 | "source": [
575 | "# Model R2 score\n"
576 | ]
577 | },
578 | {
579 | "cell_type": "code",
580 | "execution_count": 38,
581 | "metadata": {
582 | "scrolled": true
583 | },
584 | "outputs": [],
585 | "source": [
586 | "# Model Metrics\n"
587 | ]
588 | },
589 | {
590 | "cell_type": "code",
591 | "execution_count": 39,
592 | "metadata": {},
593 | "outputs": [],
594 | "source": [
595 | "# visualisation of Train dataset predictions\n",
596 | "\n",
597 | "# Plot outputs\n"
598 | ]
599 | },
600 | {
601 | "cell_type": "code",
602 | "execution_count": 40,
603 | "metadata": {
604 | "scrolled": true
605 | },
606 | "outputs": [],
607 | "source": [
608 | "# Test final predictions\n"
609 | ]
610 | },
611 | {
612 | "cell_type": "code",
613 | "execution_count": 41,
614 | "metadata": {},
615 | "outputs": [],
616 | "source": [
617 | "# Model Metrics\n"
618 | ]
619 | },
620 | {
621 | "cell_type": "code",
622 | "execution_count": 42,
623 | "metadata": {},
624 | "outputs": [],
625 | "source": [
626 | "# visualisation of Test dataset predictions\n",
627 | "\n",
628 | "# Plot outputs\n"
629 | ]
630 | },
631 | {
632 | "cell_type": "code",
633 | "execution_count": 43,
634 | "metadata": {
635 | "scrolled": false
636 | },
637 | "outputs": [],
638 | "source": [
639 | "# comparison between Actual 'Profit' from Train dataset and Predicted 'Profit' from Test dataset\n"
640 | ]
641 | },
642 | {
643 | "cell_type": "markdown",
644 | "metadata": {},
645 | "source": [
646 | "* #### Gradient Boosting Regressor"
647 | ]
648 | },
649 | {
650 | "cell_type": "code",
651 | "execution_count": 44,
652 | "metadata": {},
653 | "outputs": [],
654 | "source": [
655 | "# Gradient Boosting Regressor model\n",
656 | "\n",
657 | "\n",
658 | "# Model Training\n",
659 | "\n",
660 | "\n",
661 | "# Model Prediction\n",
662 | "\n",
663 | "\n",
664 | "# Model R2 score\n"
665 | ]
666 | },
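667 |   {
668 |    "cell_type": "markdown",
669 |    "metadata": {},
670 |    "source": [
671 |     "A sketch for the cell above, assuming `GradientBoostingRegressor` was imported from `sklearn.ensemble` in Part 1:"
672 |    ]
673 |   },
674 |   {
675 |    "cell_type": "code",
676 |    "execution_count": null,
677 |    "metadata": {},
678 |    "outputs": [],
679 |    "source": [
680 |     "# sketch: Gradient Boosting Regressor model\n",
681 |     "gbr = GradientBoostingRegressor(random_state=0)\n",
682 |     "\n",
683 |     "# Model Training\n",
684 |     "gbr.fit(X_train, y_train)\n",
685 |     "\n",
686 |     "# Model Prediction\n",
687 |     "gbr_pred = gbr.predict(X_valid)\n",
688 |     "\n",
689 |     "# Model R2 score\n",
690 |     "print('R2:', metrics.r2_score(y_valid, gbr_pred))\n"
691 |    ]
692 |   },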
667 | {
668 | "cell_type": "code",
669 | "execution_count": 45,
670 | "metadata": {},
671 | "outputs": [],
672 | "source": [
673 | "# Model Metrics\n"
674 | ]
675 | },
676 | {
677 | "cell_type": "code",
678 | "execution_count": 46,
679 | "metadata": {},
680 | "outputs": [],
681 | "source": [
682 | "# Test final predictions\n"
683 | ]
684 | },
685 | {
686 | "cell_type": "code",
687 | "execution_count": 47,
688 | "metadata": {},
689 | "outputs": [],
690 | "source": [
691 | "# Model Metrics\n"
692 | ]
693 | },
694 | {
695 | "cell_type": "code",
696 | "execution_count": 48,
697 | "metadata": {},
698 | "outputs": [],
699 | "source": [
700 | "# visualisation of Test dataset predictions\n",
701 | "\n",
702 | "# Plot outputs\n"
703 | ]
704 | },
705 | {
706 | "cell_type": "markdown",
707 | "metadata": {},
708 | "source": [
709 | "### Model comparison"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": 49,
715 | "metadata": {
716 | "scrolled": true
717 | },
718 | "outputs": [],
719 | "source": [
720 | "# score comparison of models\n"
721 | ]
722 | },
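723 |   {
724 |    "cell_type": "markdown",
725 |    "metadata": {},
726 |    "source": [
727 |     "A sketch of a side-by-side score table, assuming pandas was imported as `pd` and `lr_pred`/`gbr_pred` exist from the cells above:"
728 |    ]
729 |   },
730 |   {
731 |    "cell_type": "code",
732 |    "execution_count": null,
733 |    "metadata": {},
734 |    "outputs": [],
735 |    "source": [
736 |     "# sketch: collect the validation R2 scores of both models\n",
737 |     "scores = pd.DataFrame({\n",
738 |     "    'Model': ['Linear Regression', 'Gradient Boosting Regressor'],\n",
739 |     "    'R2 score': [metrics.r2_score(y_valid, lr_pred),\n",
740 |     "                 metrics.r2_score(y_valid, gbr_pred)]\n",
741 |     "})\n",
742 |     "scores\n"
743 |    ]
744 |   },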
723 | {
724 | "cell_type": "code",
725 | "execution_count": 50,
726 | "metadata": {
727 | "scrolled": false
728 | },
729 | "outputs": [],
730 | "source": [
731 |     "# comparison between Actual 'Profit' from Train dataset and Predicted 'Profit' from Test dataset\n"
732 | ]
733 | },
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "**Result**: The best model is **Gradient Boosting Regressor** with **R2 score = 0.972002**."
739 | ]
740 | },
741 | {
742 | "cell_type": "markdown",
743 | "metadata": {},
744 | "source": [
745 | "## Conclusion."
746 | ]
747 | },
748 | {
749 | "cell_type": "code",
750 | "execution_count": 51,
751 | "metadata": {},
752 | "outputs": [],
753 | "source": [
754 | "# submission of .csv file with final predictions\n"
755 | ]
756 | },
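757 |   {
758 |    "cell_type": "markdown",
759 |    "metadata": {},
760 |    "source": [
761 |     "A sketch of the submission step; the `test` frame name is an assumption, while the `ID` column comes from test.csv:"
762 |    ]
763 |   },
764 |   {
765 |    "cell_type": "code",
766 |    "execution_count": null,
767 |    "metadata": {},
768 |    "outputs": [],
769 |    "source": [
770 |     "# sketch: save the final Test-set predictions next to their IDs ('test' name assumed)\n",
771 |     "submission = pd.DataFrame({'ID': test['ID'], 'Profit': gbr.predict(X_Test)})\n",
772 |     "submission.to_csv('submission.csv', index=False)\n"
773 |    ]
774 |   },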
757 | {
758 | "cell_type": "code",
759 | "execution_count": null,
760 | "metadata": {},
761 | "outputs": [],
762 | "source": []
763 | }
764 | ],
765 | "metadata": {
766 | "kernelspec": {
767 | "display_name": "Python 3",
768 | "language": "python",
769 | "name": "python3"
770 | },
771 | "language_info": {
772 | "codemirror_mode": {
773 | "name": "ipython",
774 | "version": 3
775 | },
776 | "file_extension": ".py",
777 | "mimetype": "text/x-python",
778 | "name": "python",
779 | "nbconvert_exporter": "python",
780 | "pygments_lexer": "ipython3",
781 | "version": "3.9.1"
782 | }
783 | },
784 | "nbformat": 4,
785 | "nbformat_minor": 4
786 | }
787 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 02/Practice/test.csv:
--------------------------------------------------------------------------------
1 | ID,R&D Spend,Administration,Marketing Spend,State,Category
2 | 0,165349.2,136897.8,471784.1,New York,Industrials
3 | 1,162597.7,151377.59,443898.53,California,Technology
4 | 2,153441.51,101145.55,407934.54,Florida,Healthcare
5 | 3,144372.41,118671.85,383199.62,New York,Financials
6 | 4,142107.34,91391.77,366168.42,Florida,Industrials
7 | 5,131876.9,99814.71,362861.36,New York,Telecommunications
8 | 6,134615.46,147198.87,127716.82,California,Telecommunications
9 | 7,130298.13,145530.06,323876.68,Florida,Technology
10 | 8,120542.52,148718.95,311613.29,New York,Healthcare
11 | 9,123334.88,108679.17,304981.62,California,Healthcare
12 | 10,101913.08,,229160.95,Florida,Healthcare
13 | 11,100671.96,91790.61,249744.55,California,Technology
14 | 12,93863.75,127320.38,249839.44,Florida,Technology
15 | 13,91992.39,135495.07,252664.93,California,Industrials
16 | 14,119943.24,156547.42,256512.92,Florida,Technology
17 | 15,114523.61,122616.84,261776.23,New York,Technology
18 | 16,78013.11,121597.55,264346.06,California,Financials
19 | 17,94657.16,145077.58,282574.31,New York,Technology
20 | 18,91749.16,114175.79,294919.57,Florida,Technology
21 | 19,86419.7,153514.11,248839.44,New York,Technology
22 | 20,76253.86,113867.3,298664.47,California,Technology
23 | 21,78389.47,153773.43,299737.29,New York,Financials
24 | 22,73994.56,122782.75,303319.26,Florida,Healthcare
25 | 23,67532.53,105751.03,304768.73,Florida,Technology
26 | 24,77044.01,99281.34,140574.81,New York,Healthcare
27 | 25,64664.71,139553.16,137962.62,California,Telecommunications
28 | 26,75328.87,144135.98,134050.07,Florida,Technology
29 | 27,72107.6,127864.55,353183.81,New York,Technology
30 | 28,66051.52,182645.56,118148.2,,Technology
31 | 29,65605.48,153032.06,107138.38,New York,Financials
32 | 30,61994.48,115641.28,91131.24,Florida,Healthcare
33 | 31,61136.38,152701.92,88218.23,New York,Industrials
34 | 32,63408.86,129219.61,46085.25,California,Technology
35 | 33,55493.95,103057.49,214634.81,Florida,Technology
36 | 34,46426.07,157693.92,210797.67,California,Telecommunications
37 | 35,46014.02,85047.44,205517.64,New York,Technology
38 | 36,28663.76,127056.21,201126.82,Florida,Technology
39 | 37,44069.95,,197029.42,California,Technology
40 | 38,20229.59,65947.93,185265.1,New York,Healthcare
41 | 39,38558.51,82982.09,174999.3,California,Financials
42 | 40,28754.33,118546.05,351183.81,California,Technology
43 | 41,27892.92,84710.77,164470.71,Florida,Healthcare
44 | 42,23640.93,96189.63,148001.11,California,Healthcare
45 | 43,15505.73,127382.3,35534.17,New York,Oil & Gas
46 | 44,22177.74,154806.14,28334.72,California,Technology
47 | 45,1000.23,124153.04,1903.93,New York,Healthcare
48 | 46,1315.46,115816.21,297114.46,Florida,Financials
49 | 47,,135426.92,174779.3,California,Technology
50 | 48,542.05,51743.15,86718.23,New York,Technology
51 | 49,,116983.8,45173.06,California,Industrials
52 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 02/Practice/train.csv:
--------------------------------------------------------------------------------
1 | ID,R&D Spend,Administration,Marketing Spend,State,Category,Profit
2 | 0,165349.2,136897.8,471784.1,New York,Industrials,192261.83
3 | 1,162597.7,151377.59,443898.53,California,Technology,191792.06
4 | 2,153441.51,101145.55,407934.54,Florida,Healthcare,191050.39
5 | 3,144372.41,118671.85,383199.62,New York,Financials,182901.99
6 | 4,142107.34,91391.77,366168.42,Florida,Industrials,166187.94
7 | 5,131876.9,99814.71,362861.36,New York,Telecommunications,156991.12
8 | 6,134615.46,147198.87,127716.82,California,Telecommunications,156122.51
9 | 7,130298.13,145530.06,323876.68,Florida,Technology,155752.6
10 | 8,120542.52,148718.95,311613.29,New York,Healthcare,152211.77
11 | 9,123334.88,108679.17,304981.62,California,Healthcare,149759.96
12 | 10,101913.08,110594.11,229160.95,Florida,Healthcare,146121.95
13 | 11,100671.96,91790.61,249744.55,California,Technology,144259.4
14 | 12,93863.75,127320.38,249839.44,Florida,Technology,141585.52
15 | 13,91992.39,135495.07,252664.93,California,Industrials,134307.35
16 | 14,119943.24,156547.42,256512.92,Florida,Technology,132602.65
17 | 15,114523.61,122616.84,261776.23,New York,Technology,129917.04
18 | 16,78013.11,121597.55,264346.06,California,Financials,126992.93
19 | 17,94657.16,145077.58,282574.31,New York,Technology,125370.37
20 | 18,91749.16,114175.79,294919.57,Florida,Technology,124266.9
21 | 19,86419.7,153514.11,,New York,Technology,122776.86
22 | 20,76253.86,113867.3,298664.47,California,Technology,118474.03
23 | 21,78389.47,153773.43,299737.29,New York,Financials,111313.02
24 | 22,73994.56,122782.75,303319.26,Florida,Healthcare,110352.25
25 | 23,67532.53,105751.03,304768.73,Florida,,108733.99
26 | 24,77044.01,99281.34,140574.81,New York,Healthcare,108552.04
27 | 25,64664.71,139553.16,137962.62,California,Telecommunications,107404.34
28 | 26,75328.87,144135.98,134050.07,Florida,Technology,105733.54
29 | 27,72107.6,127864.55,353183.81,New York,Technology,105008.31
30 | 28,66051.52,182645.56,118148.2,Florida,Technology,103282.38
31 | 29,65605.48,153032.06,107138.38,New York,Financials,101004.64
32 | 30,61994.48,115641.28,91131.24,Florida,Healthcare,99937.59
33 | 31,61136.38,152701.92,88218.23,New York,Industrials,97483.56
34 | 32,63408.86,129219.61,46085.25,California,Technology,97427.84
35 | 33,55493.95,103057.49,214634.81,Florida,Technology,96778.92
36 | 34,46426.07,157693.92,210797.67,California,Telecommunications,96712.8
37 | 35,46014.02,85047.44,205517.64,New York,Technology,96479.51
38 | 36,28663.76,127056.21,201126.82,Florida,Technology,90708.19
39 | 37,44069.95,51283.14,197029.42,California,Technology,89949.14
40 | 38,20229.59,65947.93,185265.1,New York,Healthcare,81229.06
41 | 39,38558.51,82982.09,174999.3,California,Financials,81005.76
42 | 40,28754.33,118546.05,172795.67,California,Technology,78239.91
43 | 41,27892.92,84710.77,164470.71,Florida,Healthcare,77798.83
44 | 42,23640.93,96189.63,148001.11,California,Healthcare,71498.49
45 | 43,15505.73,127382.3,35534.17,New York,Oil & Gas,69758.98
46 | 44,22177.74,154806.14,28334.72,California,Technology,65200.33
47 | 45,1000.23,124153.04,1903.93,New York,Healthcare,64926.08
48 | 46,1315.46,115816.21,297114.46,Florida,Financials,49490.75
49 | 47,,135426.92,,California,Technology,42559.73
50 | 48,542.05,51743.15,,New York,Technology,35673.41
51 | 49,,116983.8,45173.06,California,Industrials,14681.4
52 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/Lesson 02/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 02. Regression: Practice.
2 | **Регрессия: Практика.**
3 |
4 |  In this tutorial we will work through a practical regression case.
5 |
6 | В этом уроке мы сделаем практический регрессионный кейс.
7 |
8 | [**Video Tutorial 02**](https://youtu.be/p2R8eK5ljAA)
9 |
10 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 02/ML-101 Module 02.md:
--------------------------------------------------------------------------------
1 |
2 | # Module 02: Regression / Регрессия.
3 |
4 |  The **second module** consists of **2 tutorials**. During this module we will:
5 |
6 | • Consider the Theory of Regression and some of its Algorithms.
7 |
8 | • Then we'll see how to do EDA and evaluate the performance of a machine learning model for Regression.
9 |
10 | • Finally, we will work through the first practical case - a simple Regression - the whole process, all stages.
11 |
12 | • As a result, you will have one **.ipynb file** that you put on Github.
13 |
14 | ## Tutorial 01. Regression: Theory, Algorithms, EDA, ML Model Evaluation.
15 |
16 | ## Tutorial 02. LAB 01: Regression (whole process: from dataset extraction to saving predictions).
17 |
18 | ##
19 |
20 | **Второй модуль** состоит из **2 уроков**. В процессе этого модуля мы:
21 |
22 | • Рассмотрим Теорию Регрессии и некоторые её Алгоритмы.
23 |
24 | • Затем увидим как сделать EDA и оценку работы модели машинного обучения для Регрессии.
25 |
26 | • И, наконец, мы сделаем первый практический кейс - простую Регрессию, весь процесс от и до, все этапы.
27 |
28 | • Как **итог**: у вас будет один **.ipynb файл**, который вы положите себе на **Github**.
29 |
30 | ## Tutorial 01. Регрессия: Теория, Алгоритмы, Общий Анализ Данных для Регрессии, Оценка Модели МО.
31 |
32 | ## Tutorial 02. LAB 01: Практический Кейс 01: Регрессия (весь процесс: от выгрузки датасета до сохранения ответов-инсайтов).
33 |
34 |
35 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 01/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 01. Classification: Theory and Algorithms
2 | **Классификация: Теория и Алгоритмы.**
3 |
4 |  In this tutorial, we will go through Classification theory and some of its algorithms:
5 |
6 | В этом уроке мы пройдем немного теории Классификации и некоторые её алгоритмы:
7 |
8 |
9 | 📌 Logistic Regression
10 |
11 | 📌 KNN
12 |
13 | 📌 Naive Bayes
14 |
15 | 📌 Support Vector Machine
16 |
17 | 📌 Decision Tree Classifier
18 |
19 | 📌 Random Forest Classifier
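20 |
21 | A minimal scikit-learn sketch of how these six classifiers are instantiated and compared (default hyperparameters; the `X_train`, `y_train`, `X_test`, `y_test` arrays are assumed to already exist):
22 |
23 | ```python
24 | from sklearn.linear_model import LogisticRegression
25 | from sklearn.neighbors import KNeighborsClassifier
26 | from sklearn.naive_bayes import GaussianNB
27 | from sklearn.svm import SVC
28 | from sklearn.tree import DecisionTreeClassifier
29 | from sklearn.ensemble import RandomForestClassifier
30 |
31 | models = {
32 |     'Logistic Regression': LogisticRegression(),
33 |     'KNN': KNeighborsClassifier(),
34 |     'Naive Bayes': GaussianNB(),
35 |     'Support Vector Machine': SVC(),
36 |     'Decision Tree Classifier': DecisionTreeClassifier(),
37 |     'Random Forest Classifier': RandomForestClassifier(),
38 | }
39 |
40 | # every classifier exposes the same fit/predict interface
41 | for name, model in models.items():
42 |     model.fit(X_train, y_train)
43 |     print(name, model.score(X_test, y_test))
44 | ```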
20 |
21 | [**Video Tutorial 01**](https://youtu.be/kFeY1zuGO7o)
22 |
23 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 1/Gender - Practice Code Part 3&4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Part 3: Data Wrangling and Transformation."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {
13 | "papermill": {
14 | "duration": 0.029702,
15 | "end_time": "2021-05-12T06:37:38.664858",
16 | "exception": false,
17 | "start_time": "2021-05-12T06:37:38.635156",
18 | "status": "completed"
19 | },
20 | "tags": []
21 | },
22 | "source": [
23 | "* ### Creating datasets for ML part"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 354,
29 | "metadata": {
30 | "execution": {
31 | "iopub.execute_input": "2021-05-12T06:37:38.722329Z",
32 | "iopub.status.busy": "2021-05-12T06:37:38.721379Z",
33 | "iopub.status.idle": "2021-05-12T06:37:38.806079Z",
34 | "shell.execute_reply": "2021-05-12T06:37:38.807281Z"
35 | },
36 | "papermill": {
37 | "duration": 0.117005,
38 | "end_time": "2021-05-12T06:37:38.807542",
39 | "exception": false,
40 | "start_time": "2021-05-12T06:37:38.690537",
41 | "status": "completed"
42 | },
43 | "tags": []
44 | },
45 | "outputs": [],
46 | "source": [
47 |     "# set 'X' for features and 'y' for the target ('gender')\n",
48 | "y = data['gender']\n",
49 | "X = data.drop(['gender'],axis=1)"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "* ### 'Train\\Test' split"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 355,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "# 'Train\\Test' splitting method\n",
66 | "X_train, X_test,y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=0) "
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Part 4: Machine Learning."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "* ### Build, train and evaluate model"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {
86 | "papermill": {
87 | "duration": 0.025229,
88 | "end_time": "2021-05-12T06:37:38.959430",
89 | "exception": false,
90 | "start_time": "2021-05-12T06:37:38.934201",
91 | "status": "completed"
92 | },
93 | "tags": []
94 | },
95 | "source": [
96 | "### Logistic Regression"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 356,
102 | "metadata": {
103 | "execution": {
104 | "iopub.execute_input": "2021-05-12T06:37:39.015844Z",
105 | "iopub.status.busy": "2021-05-12T06:37:39.015009Z",
106 | "iopub.status.idle": "2021-05-12T06:37:39.161675Z",
107 | "shell.execute_reply": "2021-05-12T06:37:39.162217Z"
108 | },
109 | "papermill": {
110 | "duration": 0.177869,
111 | "end_time": "2021-05-12T06:37:39.162420",
112 | "exception": false,
113 | "start_time": "2021-05-12T06:37:38.984551",
114 | "status": "completed"
115 | },
116 | "tags": []
117 | },
118 | "outputs": [],
119 | "source": [
120 | "# Logistic Regression model\n",
121 | "LR = LogisticRegression(random_state=0)\n",
122 | "LR.fit(X_train, y_train)\n",
123 | "LR_pred = LR.predict(X_test)"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 360,
129 | "metadata": {},
130 | "outputs": [
131 | {
132 | "data": {
133 | "text/plain": [
134 | "array([1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])"
135 | ]
136 | },
137 | "execution_count": 360,
138 | "metadata": {},
139 | "output_type": "execute_result"
140 | }
141 | ],
142 | "source": [
143 | "# LR predictions\n",
144 | "LR_pred"
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "* ### Visualisation of predictions"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 364,
157 | "metadata": {},
158 | "outputs": [
159 | {
160 | "data": {
266 |       "text/plain": [
267 |        "    Actual Gender  LR Predicted Gender\n",
268 |        "45              1                    1\n",
269 |        "28              0                    1\n",
270 |        "29              0                    1\n",
271 |        "55              1                    1\n",
272 |        "63              1                    1\n",
273 |        "31              0                    0\n",
274 |        "51              1                    0\n",
275 |        "46              1                    0\n",
276 |        "34              1                    1\n",
277 |        "4               0                    0"
278 |       ]
279 | },
280 | "execution_count": 364,
281 | "metadata": {},
282 | "output_type": "execute_result"
283 | }
284 | ],
285 | "source": [
286 | "# visual comparison between Actual 'Gender' and Predicted 'Gender'\n",
287 | "actualvspredicted = pd.DataFrame({\"Actual Gender\":y_test,\"LR Predicted Gender\":LR_pred})\n",
288 | "actualvspredicted.head(10).style.background_gradient(cmap='Blues')"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "* ### Classification report"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 357,
301 | "metadata": {
302 | "scrolled": false
303 | },
304 | "outputs": [
305 | {
306 | "name": "stdout",
307 | "output_type": "stream",
308 | "text": [
309 | "LR Classification Report: \n",
310 | " precision recall f1-score support\n",
311 | "\n",
312 | " 0 0.636364 0.777778 0.700000 9\n",
313 | " 1 0.777778 0.636364 0.700000 11\n",
314 | "\n",
315 | " accuracy 0.700000 20\n",
316 | " macro avg 0.707071 0.707071 0.700000 20\n",
317 | "weighted avg 0.714141 0.700000 0.700000 20\n",
318 | "\n"
319 | ]
320 | }
321 | ],
322 | "source": [
323 | "# classification report of LR model\n",
324 | "print(\"LR Classification Report: \\n\", classification_report(y_test, LR_pred, digits = 6))"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {},
330 | "source": [
331 | "* ### Confusion matrix"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 358,
337 | "metadata": {},
338 | "outputs": [
339 | {
340 | "name": "stdout",
341 | "output_type": "stream",
342 | "text": [
343 | "LR Confusion Matrix\n"
344 | ]
345 | },
346 | {
347 | "data": {
348 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWYAAAEMCAYAAAD3SRwqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAXg0lEQVR4nO3deZTcdZnv8XcliiEECAgcvBcwIPjAIKviILsK44B3BFeUGSQeYcIgsggjInJJci9cl4EEQbawDQzKCAZEFmFYLyLLiIDI8jgKMs4FmbAkLIaQkL5/VFWnk+murgqp/n2r83556tC/tZ7Gcz795anv71u1vr4+JEnlGFN1AZKkpRnMklQYg1mSCmMwS1JhDGZJKozBLEmFeUvVBUjSaBYRBwOHD9i1MXBpZh4+xCXUnMcsSSMjIrYErgY+kJnPDXWerQxJGjlnA19vFcpgK0OSlktETAQmDnJobmbOHeT8PYFVM/OK4e7dC8Fsr0VSu2pv5uJVtzu87bzZCKYBJw1yaBowdZD9U4DT2rl38cG86nZD9se1kpr/wJm8tqjqKlSacSsizWoddXdnAhcPsn+w0fIqwO7A5HZuXHwwS9KIqbU/4G60K/5LCA9ha+A3mflqOycbzJLU1NmIuRObAP/R7skGsyQ1dTBi7kRm/hD4YbvnG8yS1DRmbNUVAAazJC3RvVZGRwxmSWrqUiujUwazJDU5YpakwjhilqTCOGKWpMI4K0OSCuOIWZIKM8YesySVxRGzJBXGWRmSVBg//JOkwtjKkKTC2MqQpMI4YpakwjhilqTCOGKWpMI4K0OSCuOIWZIKY49ZkgrjiFmSCuOIWZIK44hZkspSG2MwS1JRarYyJKkwZeSywSxJTY6YJakwBrMkFWaMH/5JUmHKGDAbzJLUZCtDkgpjMEtSYQxmSSqMwSxJhamNMZglqSjdGjFHxF8BJwGrATdl5pGtzi9j0p4kFaBWq7X9aldEbAKcA+wHbA1sHxF7t7rGEbMkNXUwYI6IicDEQQ7Nzcy5A7Y/DvxzZv5H47r9gdda3XvIYI6IccB0YHPgVuC7mbm4/bIlqbd02Mo4inp7YlnTgKkDtjcFXo+Ia4CNgGuBE1vduNWI+Wzq/ZAbgP2BtYH/2XbJktRjOgzmmcDFg+yfu8z2W4DdgD2AV4BrgIOGuLb/gqG8LzO3AoiIy6mPmg1mSaNWJ2tlNNoVy4bwYP4I3JyZcwAi4irg/bQI5lZVLBxQwIsU8xS5JHVJrYNX+64FPhIREyNiLLA3cH+rCzqZlWF/WdKo1o1ZGZl5L/Bt4GfAo8BTwEWtrmnVypgYEZ8YsL3mwO3MnN12ZVouu753M246f/Dpjrffl+w95YwRrkileP6555hx6ne4++d3sWDBa7xnq2045qvHsdlm7666tJ7WrXnMmXkhcGG757cK5n8HvjzEdh9gMHfZPQ89waQ9j19q34d23JxZ0w7k1ItvrqgqVW3x4sUcfeTh9PX1MfOMsxg/fjznnHUGf/vFyVx1zXVMnLhW1SX2rF54JPsLmfnkiFWi/2Lhojd49vmX+7fXmDCOk4/cjxmX3MzNdz9WYWWqUubjPPTgA1x1zfVs8q53AXDyN7/Drju9nzvvuIO/2ne/iivsXaU8kt2qx/yjEatCbTn+kL15feEiTjnvhqpLUYXe8Y53cMZZ5zJp443799VqNejr46WX5lVYWe/rRo95ebQaMZfxp0MArLvWBA7dfzeOOOWfmf/awuEv0Kg1ceJa7Lb7Hkvt+/5ll7JgwQI+sNMu1RQ1SvRCK2PNiPg4QwS0H/6NrEM+vStzXniZH1x/X9WlqDC333oL351xGgce9IX+1oaWTy8E87rAEUMc88O/Efa5j+7AJdfcw6JFzlrUEj++ajbTp57IR/beh6OP+fuqy+l9ZeRyy2D+bWZ+cHluGhG3AW9bZncN6MvMnZbnniuzLTZZn003Wo8rbmw5J10rmVnnns2Z353JZw/4G7729W8UM9rrZaX8O+zW6nJfA2ZRX1VpUZfeY6Wx8/ab8syceeSTz1Zdigpx0QWzOPO7Mzns8COY8ndfqrqcUWNMIbMyWgXz5ct708y8NyIuBbbOzKuW9z6q2yY24JHfPl11GSrEb/Jxzjh9Bvt94pN88lOf4bk5c/qPjV9tNcaPH19hdb2t+BFzZn7rzdw4M7/zZq7XEuuvuyYvzHu16jJUiJ/ecD1vvPEGV8/+EVfPXnpW65e+fCR/e+hhFVXW+wrJZWp9fX1V19DSqtsdXnaBGnHzHziT12yQaRnj6sPMNxWtcdyNbedNfusjXYtxv8FEkhpKGTEbzJLU0Asf/knSSsVglqTC2MqQpMIUP11OklY2BrMkFaaQXDaYJanJD/8kqTC2MiSpMIXkssEsSU2OmCWpMIXkssEsSU2OmCWpMM7KkKTCFDJgNpglqclWhiQVppBcNpglqckRsyQVxmCWpMI4K0OSClPIgNlglqQmWxmSVJhCctlglqSmMV1K5oi4DVgPWNjYNSUz7x3qfINZkhq68eFfRNSAdwPvzMxF7VwzZDBHxCdaXZiZszsrT5LK1qVJGdH4500R8XZgVmae2eqCViPmL7c41gcYzJJGlU4+/IuIicDEQQ7Nzcy5A7bXAm6hnqlvBW6PiMzMfxnq3kMGc2Z+sO0KJWkU6LDFfBRw0iD7pwFTmxuZeTdwd3M7Ii4A9gE6D+YBN1kfuADYDNgFuBQ4KDP/2F7tktQbanSUzDOBiwfZP3C0TETsArwtM2/pf5slHwIOqp0P/84CrgYOB14EHqQe1B9t41pJ6hmd9Jgb7Yq5w55Yb3dMj4idqLcyDgIObVlHGzedlJmzgMWZuTAzjwM2auM6SeopY8bU2n61KzOvBa4DHgDuBy5stDeG1M6IeXFE9Ad4RKxOe4EuST2lW/OYM/NE4MS262jjnNnAZcCaETEFuBX44fKVJ0nlqtXaf3XTsMGcmacANwD/CuwFnAdM725ZkjTyarVa269uavfJv+8Dv6T+SeK/ZWZf90qSpGqUslbGsCPmiPhz4AnqzetbgMcjYqtuFyZJI21srdb2q5va6TGfDnwxM9+ZmRsAxwBnd7UqSapAKa2MdoJ5lYGPDmbmT4DVuleSJFVjTK39V1fraOOc+yPiU82NiNiHer9ZkkaVUkbMrVaXe5n6YkVjgS9GxIvAG8A6wLNdrUqSKlDKh3+tZmW8Z8SqkKQCFP/VUpn5VPPniNgOmEB98Y2xwKbArK5XJ0kjaGyvfEt2RMwC9gXGAU9TD+WfYTBLGmXKiOX2PvzbC9gYuIr6inJ7An/qZlGSVIUxtVrbr67W0cY5z2Tmq8DjwFaZeTuwQVerkqQK9MxaGcDrEbEb8CjwlxGxJvV+sySNKqVMl2snmI8DpgDXA9sCzwH/1M2iJKkKpYyYh/3wLzPvAe5pbO4YEWtm5rzuliVJI6/4WRkR8RPqD5gMdozM/FjXqhpg/gMtv+VbK6lx7a6LKHWg+HnMwJUjVkUL37vr91W
XoMJ8aedJrLrd4VWXocKsiEFcKV/N1OoBk38cyUIkqWq9MGKWpJVKIS1mg1mSmor/8E+SVjaF5HJba2WsD1wAbAbsClwCTM7MZ7pcmySNqEJazG19CHkWcDUwH3gBeBA4v5tFSVIVemmtjEmZOQtYnJkLM/M4YKOuViVJFRjTwaub2ukxL46I/joiYnXKme4nSStMKa2MdoJ5NnAZsGZETAEOBn7Y1aokqQKlzMoYduSbmacANwD/Sn1t5vOA6V2uS5JGXCnfkt3WdLnMvIT6bAxJGrW6/aFeu9qZLvcwgyxmlJlbd6UiSapIIbnc1oh54GoxqwCfBZ7oTjmSVJ1CWsxtrcd8x8DtiLgZ+DlwcreKkqQq1Ar5OtbleST77cB/W9GFSFLV3lLIROBOe8w16g+XnNvNoiSpCr207OcxwILGz33AnMx8rHslSVI1eqbHDHw7M7fteiWSVLFuD5gj4h+AdTJzcqvz2umovBoRG6yQqiSpYN1cxCgiPgwc1M657YyYVwOejIg/AK80dzqPWdJoM7aDD/8iYiIwcZBDczNz7jLnrk19JtspwDbD3budYD6ynSIlqdeN6Wy63FHASYPsnwZMXWbfucAJwIbt3HjIYI6ISzPzwGXnMUvSaNVhh2ImcPEg+5cdLR8M/CEzb4mIye3cuNWIect2q5Ok0aCTWRmNdsXcYU+E/YF3RMSDwNrAhIiYkZlHD3WB3/knSQ3dWMQoM/dq/twYMe/RKpShdTBvHREvDbK/BvRl5hrLVaUkFaqQ50taBnMC+4xUIZJUtW4vlJ+ZFzN4X3oprYJ5QWY+taIKkqTSFbJURstgnj9iVUhSAYpfKyMzdx7JQiSpamXEsrMyJKlfz3y1lCStLMqIZYNZkvqNKWTdT4NZkhp6YVaGJK1Uip+VIUkrmzJi2WCWpH6OmCWpMGMNZkkqSxmxbDBLUr9CBswGsyQ1dfjVUl1jMEtSgyNmSSpMzRGzJJXFWRmSVJhCctlglqQmg1mSCmOPWR175nePceX/+QofP/abbLD5NlWXo4rs+t7NuOn8Iwc9dvt9yd5TzhjhikaPQlb9NJh7xcIFr3HTrG/Tt3hx1aWoYvc89AST9jx+qX0f2nFzZk07kFMvvrmiqkYHv8FEHbnz8nOZsPY6zPvPp6suRRVbuOgNnn3+5f7tNSaM4+Qj92PGJTdz892PVVhZ7yullVHKutBq4fe/uo8nf3Ufux9wWNWlqEDHH7I3ry9cxCnn3VB1KT1vTK39V1fr6O7tISIM/zdh/svzuOWiGXx48lG8bfyEqstRYdZdawKH7r8bJ597A/NfW1h1OT2v1sH/uqkrrYyI2AQ4DXgfsKgRzg8DR2fmb7rxnqPVrZeczsbb7sikrXbg5RfmVF2OCnPIp3dlzgsv84Pr76u6lFGhkBZz62COiMOAzYHbMvOqDu57PnB8Zt474F47AhcBOy9PoSujx+76F+Y89TsOmH5O1aWoUJ/76A5ccs09LFrkh8IrQiG5PHQwR8RpwI7AncDJEfHOzJzZ5n3HDQxlgMy8JyKWv9KV0KN33cQrLz7HBUd/FoC+vj4AfjzjG2yx85586PODT5nSymGLTdZn043W44ob76+6lFGjFx7J3gvYLjMXRcTpwI+BdoP5oYi4EPgpMA9YHdgH+NWbKXZl85FDjmPR6wv6t/8070Wu/OYxfHjy0Wy05fYVVqYS7Lz9pjwzZx755LNVlzJ6lJHLLYN5YWYuAsjMpyNilQ7uexiwH7ALsAbwEnAt0Ek7ZKU3Ya11ltoe+9ZVGvvfzvg1JlZRkgqyTWzAI791+uSKVMp0uU4+/Huj3RMzs496CBvEUpesv+6avDDv1arLGFUK6WS0DOZVI2I7lgzul9rOzF92uzgtbfW11+WIC2+sugwV4tNHnVt1CaNOIbncOpiB2cvsa273AZt0pSJJqkohyTxkMGfmpBGsQ5IqV8paGUM+lRcR5w34eZ2hzpOk0aLWwasTETE9Ih6NiEci4ivDnd/qcekdBvx8U4d1SFLv6UIyR8TuwIeArak/Df3lGOahjnbXsShjfC9JXdSNtTIy8w7gg43px+tRbyG3nE7T7nS5vrarkKQe1UmLOSImAoM9UDA3M+cO3JGZCyNiGnAscAXw/1rdu9WIeUxErBURawNjmz83X+2XL0m9oVZr/wUcBTw5yOuowe6dmScB6wIbAoe0qqPViHkr4DmWtDGeH3CsDxg73C8pSb2kwyf/ZgIXD7J/qdFyRGxOff2gBzPzTxExm3q/eUitpsu5jrKklUonrYxGu2LusCfWn/mYFhG7UB/U7gtc2OoCw1eSGroxXS4zrweuAx4A7gd+npmXt7rG7/yTpKYuzT/LzKnA1HbPN5glqaEXV5eTpFGt21+y2i6DWZKaDGZJKoutDEkqTCGLyxnMktRUSC4bzJLUr5BkNpglqaGUhfINZklqKCOWDWZJWqKQZDaYJanB6XKSVJhCWswGsyQ1GcySVBhbGZJUGEfMklSYQnLZYJakJkfMklScMpLZYJakBhfKl6TC2MqQpMI4XU6SSlNGLhvMktRUSC4bzJLUZI9ZkgpTKySZDWZJaigjlg1mSepXyIDZYJakJqfLSVJhHDFLUmEMZkkqjK0MSSqMI2ZJKkwhuWwwS1K/QpLZYJakBnvMklSYbi2UHxEnAZ9pbF6XmV9tWUd3ypCkHlTr4NWmiNgT+AtgO2Bb4L0R8fFW1zhilqSGTloZETERmDjIobmZOXfA9jPAMZn5euO6x4CNWtbR19fXdiGSpLqImAqcNMihaZk5dYhrNgPuAnbOzH8b6t6OmCVp+cwELh5k/9xB9hERWwLXAX/fKpTBEbMkdV1E7Az8CDgqMy8f7nyDWZK6KCI2BH4J7J+Zt7Zzja0MSequY4FxwGkR0dx3TmaeM9QFjpglqTDOY5akwhjMklQYg1mSCmMwS1JhnJXRIyLiz4FvZeYeVdei6kXEGOAsYBtgAXBwZv622qq0ojhi7gER8VXgfOpTbiSA/YBxmfkB4GvAqRXXoxXIYO4NvwM+UXURKsouwE8BMvMe4H3VlqMVyWDuAZn5I2Bh1XWoKGsA8wZsvxERtiZHCYNZ6k0vAasP2B6TmYuqKkYrlsEs9aa7gH0AImJH4OFqy9GK5H/6SL3pKmCviPg59e/T+ELF9WgFcq0MSSqMrQxJKozBLEmFMZglqTAGsyQVxmCWpMI4Xa4iEdEH/Bp4Y8DuX2TmwRHxe+DOzDxwwPnvA67MzEkjUNvFwF7AHKAPeCv1x8IPycz/fBP37f8dIuJQYGJmfrPF+QcDq2TmWR2+z6+BwzPz9mX23w6cmZlXtrh2KrBOZh7ewftNAn6dmRM6qfPNakyVGw+sAgRL5jI/ApwAPEl9caMLBlxzLPCezJw8krWqMwZztT6Ymc8NcezTEXFjZv7TiFa0xIzM/IfmRkScSn01s0+tiJu3+r6zAXah/sdLg8jMnWCpPwzbNo819i0GTo2In2VmVlKklovBXK4TgDMi4q7MfLLqYoBbgG8DNEb09wJbA18H7gPOBDaiPr
q+PDNPaZz7d8DR1Nd16H86beDINCLeDZwLrEc9TP438DrwMeoPUczPzO9FxAnAJ6m34H4PHJaZT0fEnwEXUh89Pg6sNtwvExFfB/YFVm2cf2xmXtU4vEVE/F9gbeCBxvu8HBH/fajfs1Dzqa869/2I+EBmvl51QWqPPeZq3RYRDw54rTfg2B3UR6jfr3pxmohYFfg8cNuA3b/OzC0aYXYpcGFmvhd4P7BnRHwmIrYFpgK7ZeYO1MN2MJcDV2TmltQfMz6F+h+Ca6iP3L8XEZ8HtgLe3xgZXk99KVSAy4BZmbk1cDrwzmF+n3cCewJ7NK45AZg+4JRNqf8B2Ir6U3XfaOwf9Pds9V4FOBl4hfq/U/UIR8zVatXKADgJ+DD1cLt6RCpa4uiI+JvGz2+h/ofi+AHH7wSIiNWA3YG1I+J/NY5NALYFNgRuysw/NvafB/zlwDeJiLWpL/Z+PkBm/gF4V+PYwFP/B/Uw/EVj/1hgfES8nfrI/ZLG9Xc1esxDysynGkH/1xGxKbBjo+am2Zk5p1HDRcB3ImJ6i9/zvlbvV6XMXNz4//HBiLix6nrUHoO5YJm5KCIOAO4HXhjht1+qxzyIVxr/HEt9VLlTZv4JICLWAV4DpjSONQ22+llzX//aAFFP3n9f5ryx1L/B5ezGOW8D1hpwfLj36RcR2wM/BmYAN1H/o3P2gFMGfiA7hvqSq61+z3VavV/VMvMPETEF+Ecaf8BUNlsZhcvMJ4AjKPQ/RTPzJeAe4CsAETGR+spn+1IPvb+IiA0ap08e4vr7gYMa12/YuH5N6gH71sapNwIHR8Qaje3pwKWZ+Xzj+oMb129PvQXRym7UZ8CcRj2U96MevE0fi4i1ImIscAhwwzC/Z/EaM1FuAI6quhYNz2DuAZl5KTDkFK8CHADsGBEPU/9Q8AeZeVlmPgx8FbglIn7B0F+NdQDwmYh4CPgJ9Slef6QeJIdGxPHUWx3XAvdExCPU2xeTG9d/Dvhs4/1PBB4bpt4fAOtExGPAo9RH/2tHRHN940cb7/UwMBdoTukb9Pcc/l9PMY4Anqq6CA3P1eUkqTCOmCWpMAazJBXGYJakwhjMklQYg1mSCmMwS1JhDGZJKozBLEmF+f/xDcnPJjj6kAAAAABJRU5ErkJggg==\n",
349 | "text/plain": [
350 | ""
351 | ]
352 | },
353 | "metadata": {
354 | "needs_background": "light"
355 | },
356 | "output_type": "display_data"
357 | },
358 | {
359 | "name": "stdout",
360 | "output_type": "stream",
361 | "text": [
362 | "\n"
363 | ]
364 | }
365 | ],
366 | "source": [
367 | "# confusion matrix of LR model\n",
368 | "LR_confusion_mx = confusion_matrix(y_test, LR_pred)\n",
369 | "print('LR Confusion Matrix')\n",
370 | "\n",
371 | "# visualisation\n",
372 | "ax = plt.subplot()\n",
373 | "sns.heatmap(LR_confusion_mx, annot = True, fmt = 'd', cmap = 'Blues', ax = ax, linewidths = 0.5, annot_kws = {'size': 15})\n",
374 | "ax.set_ylabel('FP True label TP')\n",
375 | "ax.set_xlabel('FN Predicted label TN')\n",
376 | "ax.xaxis.set_ticklabels(['1', '0'], fontsize = 10)\n",
377 | "ax.yaxis.set_ticklabels(['1', '0'], fontsize = 10)\n",
378 | "plt.show()\n",
379 | "print() "
380 | ]
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "metadata": {},
385 | "source": [
386 | "* ### ROC-AUC score"
387 | ]
388 | },
389 | {
390 | "cell_type": "code",
391 | "execution_count": 359,
392 | "metadata": {
393 | "scrolled": false
394 | },
395 | "outputs": [
396 | {
397 | "data": {
398 | "text/plain": [
399 | "0.7070707070707071"
400 | ]
401 | },
402 | "execution_count": 359,
403 | "metadata": {},
404 | "output_type": "execute_result"
405 | }
406 | ],
407 | "source": [
408 | "# ROC-AUC score of LR model\n",
409 | "roc_auc_score(LR_pred, y_test)"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "## Conclusion."
417 | ]
418 | },
419 | {
420 | "cell_type": "markdown",
421 | "metadata": {},
422 | "source": [
423 |     "**The main question** was: Predict a person's gender based on their personal preferences (check balance of classes; calculate predictions).\n",
424 |     "\n",
425 |     "**Answers**: \n",
426 |     "\n",
427 |     "1. The dataset is very small: only 66 instances.\n",
428 |     "\n",
429 |     "2. The classes are balanced.\n",
430 |     "\n",
431 |     "3. A Logistic Regression model was chosen. Predictions (with a visual comparison) were made with a model accuracy of 0.7; no hyperparameter tuning was applied."
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": []
440 | }
441 | ],
442 | "metadata": {
443 | "kernelspec": {
444 | "display_name": "Python 3",
445 | "language": "python",
446 | "name": "python3"
447 | },
448 | "language_info": {
449 | "codemirror_mode": {
450 | "name": "ipython",
451 | "version": 3
452 | },
453 | "file_extension": ".py",
454 | "mimetype": "text/x-python",
455 | "name": "python",
456 | "nbconvert_exporter": "python",
457 | "pygments_lexer": "ipython3",
458 | "version": "3.7.3"
459 | },
460 | "papermill": {
461 | "default_parameters": {},
462 | "duration": 12.03125,
463 | "end_time": "2021-05-12T06:37:40.714635",
464 | "environment_variables": {},
465 | "exception": null,
466 | "input_path": "__notebook__.ipynb",
467 | "output_path": "__notebook__.ipynb",
468 | "parameters": {},
469 | "start_time": "2021-05-12T06:37:28.683385",
470 | "version": "2.3.2"
471 | }
472 | },
473 | "nbformat": 4,
474 | "nbformat_minor": 5
475 | }
476 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 1/Gender - Practice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "superior-hebrew",
6 | "metadata": {},
7 | "source": [
8 | "# \"Gender.\""
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "id": "searching-minute",
14 | "metadata": {},
15 | "source": [
16 | "### _\"Classifying gender based on personal preferences\" (Binary classification task)._"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "id": "loving-square",
22 | "metadata": {},
23 | "source": [
24 | "## Table of Contents\n",
25 | "\n",
26 | "\n",
27 | "## Part 0: Introduction\n",
28 | "\n",
29 | "### Overview\n",
30 |     "The dataset we see here contains 5 columns and 66 entries of data about personal preferences, split by gender.\n",
31 |     "\n",
32 |     "Gender is a social construct: males and females are treated differently from birth, which moulds their behaviour and personal preferences into what society expects for their gender.\n",
33 |     "\n",
34 |     "**Metadata:**\n",
35 | " \n",
36 | "* **Favorite Color** - Favorite color (colors reported by respondents were mapped to either warm, cool or neutral)\n",
37 | " \n",
38 | "* **Favorite Music Genre** - Favorite broad music genre\n",
39 | "\n",
40 | "* **Favorite Beverage** - Favorite alcoholic drink\n",
41 | "\n",
42 | "* **Favorite Soft Drink** - Favorite fizzy drink\n",
43 | "\n",
44 | "* **Gender** - Binary gender \n",
45 | "\n",
46 | "\n",
47 | "\n",
48 | "### Questions:\n",
49 | " \n",
50 |     "Predict a person's gender based on their personal preferences (check balance of classes; calculate predictions)\n",
51 | "\n",
52 | "\n",
53 | "## [Part 1: Import, Load Data](#Part-1:-Import,-Load-Data.)\n",
54 | "* ### Import libraries, Read data from ‘.csv’ file\n",
55 | "\n",
56 | "## [Part 2: Exploratory Data Analysis](#Part-2:-Exploratory-Data-Analysis.)\n",
57 | "* ### Info, Head\n",
58 | "* ### Rename Columns\n",
59 | "* ### Columns visualisation\n",
60 | "* ### 'gender' attribute value counts \n",
61 | "* ### Encode the Data\n",
62 | "\n",
63 | "## [Part 3: Data Wrangling and Transformation](#Part-3:-Data-Wrangling-and-Transformation.)\n",
64 | "* ### Creating datasets for ML part\n",
65 | "* ### 'Train\\Test' splitting method\n",
66 | "\n",
67 | "## [Part 4: Machine Learning](#Part-4:-Machine-Learning.)\n",
68 | "* ### Build, train and evaluate model \n",
69 | " * #### Logistic Regression\n",
70 | " * #### Visualisation of predictions\n",
71 | " * #### Classification report\n",
72 | " * #### Confusion Matrix\n",
73 | " * #### ROC-AUC score\n",
74 | "\n",
75 | "## [Conclusion](#Conclusion.)\n"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "id": "earlier-excerpt",
81 | "metadata": {},
82 | "source": [
83 | "## Part 1: Import, Load Data."
84 | ]
85 | },
86 | {
87 | "cell_type": "markdown",
88 | "id": "composite-training",
89 | "metadata": {},
90 | "source": [
91 | "* ### Import libraries"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 1,
97 | "id": "illegal-stockholm",
98 | "metadata": {
99 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
100 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
101 | "execution": {
102 | "iopub.execute_input": "2021-05-12T06:37:35.552073Z",
103 | "iopub.status.busy": "2021-05-12T06:37:35.550650Z",
104 | "iopub.status.idle": "2021-05-12T06:37:35.574536Z",
105 | "shell.execute_reply": "2021-05-12T06:37:35.575034Z"
106 | },
107 | "papermill": {
108 | "duration": 0.050276,
109 | "end_time": "2021-05-12T06:37:35.575327",
110 | "exception": false,
111 | "start_time": "2021-05-12T06:37:35.525051",
112 | "status": "completed"
113 | },
114 | "tags": []
115 | },
116 | "outputs": [],
117 | "source": [
118 | "# import standard libraries\n",
119 | "\n"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "id": "clinical-williams",
125 | "metadata": {
126 | "papermill": {
127 | "duration": 0.020256,
128 | "end_time": "2021-05-12T06:37:35.617997",
129 | "exception": false,
130 | "start_time": "2021-05-12T06:37:35.597741",
131 | "status": "completed"
132 | },
133 | "tags": []
134 | },
135 | "source": [
136 | "* ### Read data from ‘.csv’ file"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 2,
142 | "id": "latter-philadelphia",
143 | "metadata": {
144 | "execution": {
145 | "iopub.execute_input": "2021-05-12T06:37:35.662611Z",
146 | "iopub.status.busy": "2021-05-12T06:37:35.661903Z",
147 | "iopub.status.idle": "2021-05-12T06:37:35.712174Z",
148 | "shell.execute_reply": "2021-05-12T06:37:35.711447Z"
149 | },
150 | "papermill": {
151 | "duration": 0.073936,
152 | "end_time": "2021-05-12T06:37:35.712323",
153 | "exception": false,
154 | "start_time": "2021-05-12T06:37:35.638387",
155 | "status": "completed"
156 | },
157 | "tags": []
158 | },
159 | "outputs": [],
160 | "source": [
161 | "# read data from '.csv' file\n"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "id": "infinite-grain",
167 | "metadata": {},
168 | "source": [
169 | "## Part 2: Exploratory Data Analysis."
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "id": "local-grant",
175 | "metadata": {
176 | "papermill": {
177 | "duration": 0.021453,
178 | "end_time": "2021-05-12T06:37:35.942783",
179 | "exception": false,
180 | "start_time": "2021-05-12T06:37:35.921330",
181 | "status": "completed"
182 | },
183 | "tags": []
184 | },
185 | "source": [
186 | "* ### Info"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 3,
192 | "id": "saved-tragedy",
193 | "metadata": {
194 | "execution": {
195 | "iopub.execute_input": "2021-05-12T06:37:35.995840Z",
196 | "iopub.status.busy": "2021-05-12T06:37:35.994818Z",
197 | "iopub.status.idle": "2021-05-12T06:37:35.999418Z",
198 | "shell.execute_reply": "2021-05-12T06:37:35.998890Z"
199 | },
200 | "papermill": {
201 | "duration": 0.034193,
202 | "end_time": "2021-05-12T06:37:35.999559",
203 | "exception": false,
204 | "start_time": "2021-05-12T06:37:35.965366",
205 | "status": "completed"
206 | },
207 | "scrolled": true,
208 | "tags": []
209 | },
210 | "outputs": [],
211 | "source": [
212 | "# print the full summary of the dataset \n"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "id": "informational-adoption",
218 | "metadata": {},
219 | "source": [
220 | "Dataset consists of 66 rows and 5 columns;\n",
221 | "\n",
222 | "has 1 datatype: object(5);\n",
223 | "\n",
224 | "has no missing values."
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "id": "educational-occupation",
230 | "metadata": {},
231 | "source": [
232 | "* ### Head"
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 4,
238 | "id": "cathedral-police",
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "# preview of the first 5 lines of the loaded data \n"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "id": "applicable-butter",
248 | "metadata": {},
249 | "source": [
250 | "* ### Rename Columns"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": 5,
256 | "id": "narrative-worship",
257 | "metadata": {
258 | "scrolled": true
259 | },
260 | "outputs": [],
261 | "source": [
262 | "# columns rename\n"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "id": "mexican-rally",
268 | "metadata": {},
269 | "source": [
270 | "* ### Columns visualisation"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 6,
276 | "id": "through-revelation",
277 | "metadata": {
278 | "scrolled": false
279 | },
280 | "outputs": [],
281 | "source": [
282 | "# columns visualisation\n"
283 | ]
284 | },
285 | {
286 | "cell_type": "markdown",
287 | "id": "particular-basement",
288 | "metadata": {},
289 | "source": [
290 | "* ### 'gender' attribute value counts "
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 7,
296 | "id": "cosmetic-mechanics",
297 | "metadata": {
298 | "scrolled": true
299 | },
300 | "outputs": [],
301 | "source": [
302 | "# 'gender' value counts \n"
303 | ]
304 | },
305 | {
306 | "cell_type": "markdown",
307 | "id": "primary-agreement",
308 | "metadata": {},
309 | "source": [
310 |     "There are 33 'Female' and 33 'Male' entries in our dataset. This means that our dataset is balanced."
311 | ]
312 | },
313 | {
314 | "cell_type": "markdown",
315 | "id": "accepted-texas",
316 | "metadata": {
317 | "papermill": {
318 | "duration": 0.02239,
319 | "end_time": "2021-05-12T06:37:36.285297",
320 | "exception": false,
321 | "start_time": "2021-05-12T06:37:36.262907",
322 | "status": "completed"
323 | },
324 | "tags": []
325 | },
326 | "source": [
327 | "* ### Encode the Data"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 8,
333 | "id": "particular-collect",
334 | "metadata": {
335 | "execution": {
336 | "iopub.execute_input": "2021-05-12T06:37:36.333225Z",
337 | "iopub.status.busy": "2021-05-12T06:37:36.332597Z",
338 | "iopub.status.idle": "2021-05-12T06:37:37.594161Z",
339 | "shell.execute_reply": "2021-05-12T06:37:37.593612Z"
340 | },
341 | "papermill": {
342 | "duration": 1.286573,
343 | "end_time": "2021-05-12T06:37:37.594309",
344 | "exception": false,
345 | "start_time": "2021-05-12T06:37:36.307736",
346 | "status": "completed"
347 | },
348 | "scrolled": true,
349 | "tags": []
350 | },
351 | "outputs": [],
352 | "source": [
353 | "# label encoding\n"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "id": "sorted-mining",
359 | "metadata": {},
360 | "source": [
361 | "## Part 3: Data Wrangling and Transformation."
362 | ]
363 | },
364 | {
365 | "cell_type": "markdown",
366 | "id": "received-vocabulary",
367 | "metadata": {
368 | "papermill": {
369 | "duration": 0.029702,
370 | "end_time": "2021-05-12T06:37:38.664858",
371 | "exception": false,
372 | "start_time": "2021-05-12T06:37:38.635156",
373 | "status": "completed"
374 | },
375 | "tags": []
376 | },
377 | "source": [
378 | "* ### Creating datasets for ML part"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 9,
384 | "id": "acute-feeding",
385 | "metadata": {
386 | "execution": {
387 | "iopub.execute_input": "2021-05-12T06:37:38.722329Z",
388 | "iopub.status.busy": "2021-05-12T06:37:38.721379Z",
389 | "iopub.status.idle": "2021-05-12T06:37:38.806079Z",
390 | "shell.execute_reply": "2021-05-12T06:37:38.807281Z"
391 | },
392 | "papermill": {
393 | "duration": 0.117005,
394 | "end_time": "2021-05-12T06:37:38.807542",
395 | "exception": false,
396 | "start_time": "2021-05-12T06:37:38.690537",
397 | "status": "completed"
398 | },
399 | "tags": []
400 | },
401 | "outputs": [],
402 | "source": [
403 |     "# set 'X' for features and 'y' for the target ('gender')\n"
404 | ]
405 | },
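406 |   {
407 |    "cell_type": "markdown",
408 |    "id": "sketch-target-note",
409 |    "metadata": {},
410 |    "source": [
411 |     "A sketch for the cell above, matching the code in _Gender - Practice Code Part 3&4.ipynb_ (it assumes `data` is the encoded DataFrame):"
412 |    ]
413 |   },
414 |   {
415 |    "cell_type": "code",
416 |    "execution_count": null,
417 |    "id": "sketch-target-code",
418 |    "metadata": {},
419 |    "outputs": [],
420 |    "source": [
421 |     "# as in Practice Code Part 3&4\n",
422 |     "y = data['gender']\n",
423 |     "X = data.drop(['gender'], axis=1)\n"
424 |    ]
425 |   },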
406 | {
407 | "cell_type": "markdown",
408 | "id": "known-water",
409 | "metadata": {},
410 | "source": [
411 | "* ### 'Train\\Test' split"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 10,
417 | "id": "helpful-endorsement",
418 | "metadata": {},
419 | "outputs": [],
420 | "source": [
421 | "# 'Train\\Test' splitting method\n"
422 | ]
423 | },
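424 |   {
425 |    "cell_type": "markdown",
426 |    "id": "sketch-split-note",
427 |    "metadata": {},
428 |    "source": [
429 |     "A sketch for the cell above, matching _Gender - Practice Code Part 3&4.ipynb_ (it assumes `train_test_split` is imported from `sklearn.model_selection`):"
430 |    ]
431 |   },
432 |   {
433 |    "cell_type": "code",
434 |    "execution_count": null,
435 |    "id": "sketch-split-code",
436 |    "metadata": {},
437 |    "outputs": [],
438 |    "source": [
439 |     "# 'Train\\Test' splitting method, as in Practice Code Part 3&4\n",
440 |     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n"
441 |    ]
442 |   },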
424 | {
425 | "cell_type": "markdown",
426 | "id": "central-binding",
427 | "metadata": {},
428 | "source": [
429 | "## Part 4: Machine Learning."
430 | ]
431 | },
432 | {
433 | "cell_type": "markdown",
434 | "id": "pending-glucose",
435 | "metadata": {},
436 | "source": [
437 | "* ### Build, train and evaluate model"
438 | ]
439 | },
440 | {
441 | "cell_type": "markdown",
442 | "id": "sunset-restaurant",
443 | "metadata": {
444 | "papermill": {
445 | "duration": 0.025229,
446 | "end_time": "2021-05-12T06:37:38.959430",
447 | "exception": false,
448 | "start_time": "2021-05-12T06:37:38.934201",
449 | "status": "completed"
450 | },
451 | "tags": []
452 | },
453 | "source": [
454 | "### Logistic Regression"
455 | ]
456 | },
457 | {
458 | "cell_type": "code",
459 | "execution_count": 11,
460 | "id": "signal-hurricane",
461 | "metadata": {
462 | "execution": {
463 | "iopub.execute_input": "2021-05-12T06:37:39.015844Z",
464 | "iopub.status.busy": "2021-05-12T06:37:39.015009Z",
465 | "iopub.status.idle": "2021-05-12T06:37:39.161675Z",
466 | "shell.execute_reply": "2021-05-12T06:37:39.162217Z"
467 | },
468 | "papermill": {
469 | "duration": 0.177869,
470 | "end_time": "2021-05-12T06:37:39.162420",
471 | "exception": false,
472 | "start_time": "2021-05-12T06:37:38.984551",
473 | "status": "completed"
474 | },
475 | "tags": []
476 | },
477 | "outputs": [],
478 | "source": [
479 | "# Logistic Regression model\n"
480 | ]
481 | },
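482 |   {
483 |    "cell_type": "markdown",
484 |    "id": "sketch-lr-note",
485 |    "metadata": {},
486 |    "source": [
487 |     "A sketch for the cell above, matching _Gender - Practice Code Part 3&4.ipynb_ (it assumes `LogisticRegression` is imported from `sklearn.linear_model`):"
488 |    ]
489 |   },
490 |   {
491 |    "cell_type": "code",
492 |    "execution_count": null,
493 |    "id": "sketch-lr-code",
494 |    "metadata": {},
495 |    "outputs": [],
496 |    "source": [
497 |     "# Logistic Regression model, as in Practice Code Part 3&4\n",
498 |     "LR = LogisticRegression(random_state=0)\n",
499 |     "LR.fit(X_train, y_train)\n",
500 |     "LR_pred = LR.predict(X_test)\n"
501 |    ]
502 |   },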
482 | {
483 | "cell_type": "code",
484 | "execution_count": 12,
485 | "id": "nuclear-milan",
486 | "metadata": {},
487 | "outputs": [],
488 | "source": [
489 | "# LR predictions\n"
490 | ]
491 | },
492 | {
493 | "cell_type": "markdown",
494 | "id": "preliminary-looking",
495 | "metadata": {},
496 | "source": [
497 | "* ### Visualisation of predictions"
498 | ]
499 | },
500 | {
501 | "cell_type": "code",
502 | "execution_count": 13,
503 | "id": "little-republic",
504 | "metadata": {},
505 | "outputs": [],
506 | "source": [
507 | "# visual comparison between Actual 'Gender' and Predicted 'Gender'\n"
508 | ]
509 | },
510 | {
511 | "cell_type": "markdown",
512 | "id": "single-chest",
513 | "metadata": {},
514 | "source": [
515 | "* ### Classification report"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": 14,
521 | "id": "blank-horse",
522 | "metadata": {
523 | "scrolled": false
524 | },
525 | "outputs": [],
526 | "source": [
527 | "# classification report of LR model\n"
528 | ]
529 | },
530 | {
531 | "cell_type": "markdown",
532 | "id": "direct-planner",
533 | "metadata": {},
534 | "source": [
535 | "* ### Confusion matrix"
536 | ]
537 | },
538 | {
539 | "cell_type": "code",
540 | "execution_count": 15,
541 | "id": "specific-variety",
542 | "metadata": {},
543 | "outputs": [],
544 | "source": [
545 | "# confusion matrix of LR model\n",
546 | "\n",
547 | "\n",
548 | "# visualisation\n",
549 | " "
550 | ]
551 | },
552 | {
553 | "cell_type": "markdown",
554 | "id": "tropical-superintendent",
555 | "metadata": {},
556 | "source": [
557 | "* ### ROC-AUC score"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": 16,
563 | "id": "outer-council",
564 | "metadata": {
565 | "scrolled": false
566 | },
567 | "outputs": [],
568 | "source": [
569 | "# ROC-AUC score of LR model\n"
570 | ]
571 | },
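572 |   {
573 |    "cell_type": "markdown",
574 |    "id": "sketch-auc-note",
575 |    "metadata": {},
576 |    "source": [
577 |     "A sketch for the cell above; note that `roc_auc_score` expects the true labels as its first argument:"
578 |    ]
579 |   },
580 |   {
581 |    "cell_type": "code",
582 |    "execution_count": null,
583 |    "id": "sketch-auc-code",
584 |    "metadata": {},
585 |    "outputs": [],
586 |    "source": [
587 |     "# ROC-AUC score of LR model (true labels first, then predictions)\n",
588 |     "roc_auc_score(y_test, LR_pred)\n"
589 |    ]
590 |   },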
572 | {
573 | "cell_type": "markdown",
574 | "id": "coral-organ",
575 | "metadata": {},
576 | "source": [
577 | "## Conclusion."
578 | ]
579 | },
580 | {
581 | "cell_type": "markdown",
582 | "id": "legendary-poster",
583 | "metadata": {},
584 | "source": [
585 |     "**The main question** was: Predict a person's gender based on their personal preferences (check balance of classes; calculate predictions).\n",
586 |     "\n",
587 |     "**Answers**: \n",
588 |     "\n",
589 |     "1. The dataset is very small: only 66 instances.\n",
590 |     "\n",
591 |     "2. The classes are balanced.\n",
592 |     "\n",
593 |     "3. A Logistic Regression model was chosen. Predictions (with a visual comparison) were made with a model accuracy of 0.7; no hyperparameter tuning was applied."
594 | ]
595 | },
596 | {
597 | "cell_type": "code",
598 | "execution_count": null,
599 | "id": "generic-reading",
600 | "metadata": {},
601 | "outputs": [],
602 | "source": []
603 | }
604 | ],
605 | "metadata": {
606 | "kernelspec": {
607 | "display_name": "Python 3",
608 | "language": "python",
609 | "name": "python3"
610 | },
611 | "language_info": {
612 | "codemirror_mode": {
613 | "name": "ipython",
614 | "version": 3
615 | },
616 | "file_extension": ".py",
617 | "mimetype": "text/x-python",
618 | "name": "python",
619 | "nbconvert_exporter": "python",
620 | "pygments_lexer": "ipython3",
621 | "version": "3.9.1"
622 | },
623 | "papermill": {
624 | "default_parameters": {},
625 | "duration": 12.03125,
626 | "end_time": "2021-05-12T06:37:40.714635",
627 | "environment_variables": {},
628 | "exception": null,
629 | "input_path": "__notebook__.ipynb",
630 | "output_path": "__notebook__.ipynb",
631 | "parameters": {},
632 | "start_time": "2021-05-12T06:37:28.683385",
633 | "version": "2.3.2"
634 | }
635 | },
636 | "nbformat": 4,
637 | "nbformat_minor": 5
638 | }
639 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 1/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 02. Binary Classification: Practice.
2 | **Бинарная Классификация: Практика.**
3 |
4 |  In this tutorial, we will work through a practical classification case / В этом уроке мы сделаем практический классификационный кейс.
5 |
6 | You will be given 4 files / Вам будут даны 4 файла:
7 |
8 | * gender.csv
9 |
10 | * gender - Practice.ipynb
11 |
12 | * gender - Practice Code Part 1&2.ipynb
13 |
14 | * gender - Practice Code Part 3&4.ipynb
15 |
16 |
17 | ## Practice / Задание:
18 |
19 | 
20 |
21 | * Open the files _gender - Practice.ipynb_ and _gender - Practice Code Part 1&2.ipynb_ . Copy the code from the _gender - Practice Code Part 1&2.ipynb_ file into the _gender - Practice.ipynb_ file. Compile the code block by block.
22 | * Open the _gender - Practice Code Part 3&4.ipynb_ file and copy the code into the _gender - Practice.ipynb_ file, which already contains the compiled code. Compile everything together.
23 | * Upload the following 2 files to your Github account:
24 | ##
25 | * Откройте файлы _gender - Practice.ipynb_ и _gender - Practice Code Part 1&2.ipynb_ . Скопируйте код из файла _gender - Practice Code Part 1&2.ipynb_ в файл _gender - Practice.ipynb_ . Блок за блоком скомпилируйте код.
26 | * Откройте файл _gender - Practice Code Part 3&4.ipynb_ и скопируйте код в файл _gender - Practice.ipynb_ с уже скомпилированным кодом. Скомпилируйте всё вместе.
27 | * Загрузите в свой Github account следующие 2 файла:
28 |
29 | **gender.csv**
30 |
31 | **gender - Practice.ipynb** (with new compiled code / с новым скомпилированным кодом)
32 |
33 |
34 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 1/gender.csv:
--------------------------------------------------------------------------------
1 | Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
2 | Cool,Rock,Vodka,7UP/Sprite,F
3 | Neutral,Hip hop,Vodka,Coca Cola/Pepsi,F
4 | Warm,Rock,Wine,Coca Cola/Pepsi,F
5 | Warm,Folk/Traditional,Whiskey,Fanta,F
6 | Cool,Rock,Vodka,Coca Cola/Pepsi,F
7 | Warm,Jazz/Blues,Doesn't drink,Fanta,F
8 | Cool,Pop,Beer,Coca Cola/Pepsi,F
9 | Warm,Pop,Whiskey,Fanta,F
10 | Warm,Rock,Other,7UP/Sprite,F
11 | Neutral,Pop,Wine,Coca Cola/Pepsi,F
12 | Cool,Pop,Other,7UP/Sprite,F
13 | Warm,Pop,Other,7UP/Sprite,F
14 | Warm,Pop,Wine,7UP/Sprite,F
15 | Warm,Electronic,Wine,Coca Cola/Pepsi,F
16 | Cool,Rock,Beer,Coca Cola/Pepsi,F
17 | Warm,Jazz/Blues,Wine,Coca Cola/Pepsi,F
18 | Cool,Pop,Wine,7UP/Sprite,F
19 | Cool,Rock,Other,Coca Cola/Pepsi,F
20 | Cool,Rock,Other,Coca Cola/Pepsi,F
21 | Cool,Pop,Doesn't drink,7UP/Sprite,F
22 | Cool,Pop,Beer,Fanta,F
23 | Warm,Jazz/Blues,Whiskey,Fanta,F
24 | Cool,Rock,Vodka,Coca Cola/Pepsi,F
25 | Warm,Pop,Other,Coca Cola/Pepsi,F
26 | Cool,Folk/Traditional,Whiskey,7UP/Sprite,F
27 | Warm,R&B and soul,Whiskey,Coca Cola/Pepsi,F
28 | Cool,Pop,Beer,Other,F
29 | Cool,Pop,Doesn't drink,Other,F
30 | Cool,Pop,Doesn't drink,Coca Cola/Pepsi,F
31 | Cool,Electronic,Doesn't drink,Fanta,F
32 | Warm,Rock,Other,Coca Cola/Pepsi,F
33 | Neutral,Rock,Beer,Coca Cola/Pepsi,F
34 | Cool,R&B and soul,Beer,Coca Cola/Pepsi,F
35 | Warm,R&B and soul,Wine,Other,M
36 | Neutral,Hip hop,Beer,7UP/Sprite,M
37 | Warm,Electronic,Other,Coca Cola/Pepsi,M
38 | Neutral,Rock,Doesn't drink,Coca Cola/Pepsi,M
39 | Cool,Pop,Other,Fanta,M
40 | Cool,Pop,Whiskey,Fanta,M
41 | Warm,Rock,Vodka,7UP/Sprite,M
42 | Cool,Rock,Vodka,Coca Cola/Pepsi,M
43 | Neutral,Pop,Doesn't drink,7UP/Sprite,M
44 | Warm,R&B and soul,Doesn't drink,Coca Cola/Pepsi,M
45 | Cool,Rock,Wine,7UP/Sprite,M
46 | Cool,Folk/Traditional,Beer,Other,M
47 | Cool,Hip hop,Beer,Coca Cola/Pepsi,M
48 | Cool,Hip hop,Wine,Coca Cola/Pepsi,M
49 | Cool,R&B and soul,Whiskey,7UP/Sprite,M
50 | Cool,Rock,Doesn't drink,Other,M
51 | Warm,Hip hop,Beer,Coca Cola/Pepsi,M
52 | Cool,R&B and soul,Doesn't drink,Coca Cola/Pepsi,M
53 | Cool,Rock,Doesn't drink,Coca Cola/Pepsi,M
54 | Cool,Hip hop,Doesn't drink,Other,M
55 | Warm,Rock,Beer,Fanta,M
56 | Cool,Electronic,Doesn't drink,Fanta,M
57 | Cool,Electronic,Other,Fanta,M
58 | Warm,Folk/Traditional,Other,Fanta,M
59 | Warm,Electronic,Vodka,Fanta,M
60 | Warm,Jazz/Blues,Vodka,Coca Cola/Pepsi,M
61 | Cool,Pop,Whiskey,Other,M
62 | Cool,Electronic,Whiskey,Coca Cola/Pepsi,M
63 | Cool,Rock,Vodka,Coca Cola/Pepsi,M
64 | Cool,Hip hop,Beer,Coca Cola/Pepsi,M
65 | Neutral,Hip hop,Doesn't drink,Fanta,M
66 | Cool,Rock,Wine,Coca Cola/Pepsi,M
67 | Cool,Electronic,Beer,Coca Cola/Pepsi,M
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 2/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 02. Binary Classification: Practice.
2 | **Бинарная Классификация: Практика.**
3 |
4 |  In this tutorial, we will work through a practical classification case / В этом уроке мы сделаем практический классификационный кейс.
5 |
6 | You will be given 4 files / Вам будут даны 4 файла:
7 |
8 | * winequality.csv
9 |
10 | * winequality - Practice.ipynb
11 |
12 | * winequality - Practice Code Part 1&2.ipynb
13 |
14 | * winequality - Practice Code Part 3&4.ipynb
15 | ##
16 |
17 | ## Practice / Задание:
18 |
19 | 
20 | * Open the files _winequality - Practice.ipynb_ and _winequality - Practice Code Part 1&2.ipynb_ . Copy the code from _winequality - Practice Code Part 1&2.ipynb_ into _winequality - Practice.ipynb_ . Compile the code block by block.
21 | * Open the _winequality - Practice Code Part 3&4.ipynb_ file and copy the code into the _winequality - Practice.ipynb_ file, which already contains the compiled code. Compile everything together.
22 | * Upload the following 2 files to your Github account:
23 |
24 | ##
25 | * Откройте файлы _winequality - Practice.ipynb_ и _winequality - Practice Code Part 1&2.ipynb_ . Скопируйте код из файла _winequality - Practice Code Part 1&2.ipynb_ в файл _winequality - Practice.ipynb_ . Блок за блоком скомпилируйте код.
26 | * Откройте файл _winequality - Practice Code Part 3&4.ipynb_ и скопируйте код в файл _winequality - Practice.ipynb_ с уже скомпилированным кодом. Скомпилируйте всё вместе.
27 | * Загрузите в свой Github account следующие 2 файла:
28 |
29 | **winequality.csv**
30 |
31 | **winequality - Practice.ipynb** (with new compiled code / с новым скомпилированным кодом)
32 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 2/Winequality - Practice Code Part 1&2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# \"Wine Quality.\""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Part 1: Import, Load Data."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "* ### Import libraries"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 3,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "# import standard libraries\n",
31 | "\n",
32 | "import numpy as np\n",
33 | "import pandas as pd\n",
34 | "import matplotlib.pyplot as plt\n",
35 | "import seaborn as sns\n",
36 | "from scipy import stats\n",
37 | "from scipy.stats import norm\n",
38 | "%matplotlib inline\n",
39 | "sns.set()\n",
40 | "\n",
41 | "import sklearn.metrics as metrics\n",
42 | "from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score\n",
43 | "from sklearn.model_selection import train_test_split, GridSearchCV\n",
44 | "from sklearn.preprocessing import StandardScaler\n",
45 | "\n",
46 | "from sklearn.linear_model import LogisticRegression\n",
47 | "from sklearn.neighbors import KNeighborsClassifier\n",
48 | "from sklearn.tree import DecisionTreeClassifier\n",
49 | "\n",
50 | "import warnings\n",
51 | "warnings.filterwarnings('ignore')"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "* ### Read data from ‘.csv’ file"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 4,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "# read data from '.csv' file\n",
68 | "dataset = pd.read_csv('winequality.csv') "
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## Part 2: Exploratory Data Analysis."
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "* ### Info"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 5,
88 | "metadata": {
89 | "scrolled": true
90 | },
91 | "outputs": [
92 | {
93 | "name": "stdout",
94 | "output_type": "stream",
95 | "text": [
96 |        "<class 'pandas.core.frame.DataFrame'>\n",
97 | "RangeIndex: 4898 entries, 0 to 4897\n",
98 | "Data columns (total 12 columns):\n",
99 | "fixed acidity 4898 non-null float64\n",
100 | "volatile acidity 4898 non-null float64\n",
101 | "citric acid 4898 non-null float64\n",
102 | "residual sugar 4898 non-null float64\n",
103 | "chlorides 4898 non-null float64\n",
104 | "free sulfur dioxide 4898 non-null float64\n",
105 | "total sulfur dioxide 4898 non-null float64\n",
106 | "density 4898 non-null float64\n",
107 | "pH 4898 non-null float64\n",
108 | "sulphates 4898 non-null float64\n",
109 | "alcohol 4898 non-null float64\n",
110 | "quality 4898 non-null int64\n",
111 | "dtypes: float64(11), int64(1)\n",
112 | "memory usage: 459.2 KB\n"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "# print the full summary of the dataset \n",
118 | "dataset.info()"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "The dataset consists of 4898 rows and 12 columns; \n",
126 | "\n",
127 | "it has 2 datatypes: float64(11), int64(1);\n",
128 | "\n",
129 | "and it has no missing values."
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "* ### Head"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 6,
142 | "metadata": {},
143 | "outputs": [
144 | {
145 | "data": {
259 | "text/plain": [
260 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
261 | "0 7.0 0.27 0.36 20.7 0.045 \n",
262 | "1 6.3 0.30 0.34 1.6 0.049 \n",
263 | "2 8.1 0.28 0.40 6.9 0.050 \n",
264 | "3 7.2 0.23 0.32 8.5 0.058 \n",
265 | "4 7.2 0.23 0.32 8.5 0.058 \n",
266 | "\n",
267 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
268 | "0 45.0 170.0 1.0010 3.00 0.45 \n",
269 | "1 14.0 132.0 0.9940 3.30 0.49 \n",
270 | "2 30.0 97.0 0.9951 3.26 0.44 \n",
271 | "3 47.0 186.0 0.9956 3.19 0.40 \n",
272 | "4 47.0 186.0 0.9956 3.19 0.40 \n",
273 | "\n",
274 | " alcohol quality \n",
275 | "0 8.8 6 \n",
276 | "1 9.5 6 \n",
277 | "2 10.1 6 \n",
278 | "3 9.9 6 \n",
279 | "4 9.9 6 "
280 | ]
281 | },
282 | "execution_count": 6,
283 | "metadata": {},
284 | "output_type": "execute_result"
285 | }
286 | ],
287 | "source": [
288 | "# preview of the first 5 lines of the loaded data \n",
289 | "dataset.head()"
290 | ]
291 | },
292 | {
293 | "cell_type": "markdown",
294 | "metadata": {},
295 | "source": [
296 | "* ### Describe"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 7,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "data": {
464 | "text/plain": [
465 | " fixed acidity volatile acidity citric acid residual sugar \\\n",
466 | "count 4898.000000 4898.000000 4898.000000 4898.000000 \n",
467 | "mean 6.854788 0.278241 0.334192 6.391415 \n",
468 | "std 0.843868 0.100795 0.121020 5.072058 \n",
469 | "min 3.800000 0.080000 0.000000 0.600000 \n",
470 | "25% 6.300000 0.210000 0.270000 1.700000 \n",
471 | "50% 6.800000 0.260000 0.320000 5.200000 \n",
472 | "75% 7.300000 0.320000 0.390000 9.900000 \n",
473 | "max 14.200000 1.100000 1.660000 65.800000 \n",
474 | "\n",
475 | " chlorides free sulfur dioxide total sulfur dioxide density \\\n",
476 | "count 4898.000000 4898.000000 4898.000000 4898.000000 \n",
477 | "mean 0.045772 35.308085 138.360657 0.994027 \n",
478 | "std 0.021848 17.007137 42.498065 0.002991 \n",
479 | "min 0.009000 2.000000 9.000000 0.987110 \n",
480 | "25% 0.036000 23.000000 108.000000 0.991723 \n",
481 | "50% 0.043000 34.000000 134.000000 0.993740 \n",
482 | "75% 0.050000 46.000000 167.000000 0.996100 \n",
483 | "max 0.346000 289.000000 440.000000 1.038980 \n",
484 | "\n",
485 | " pH sulphates alcohol quality \n",
486 | "count 4898.000000 4898.000000 4898.000000 4898.000000 \n",
487 | "mean 3.188267 0.489847 10.514267 5.877909 \n",
488 | "std 0.151001 0.114126 1.230621 0.885639 \n",
489 | "min 2.720000 0.220000 8.000000 3.000000 \n",
490 | "25% 3.090000 0.410000 9.500000 5.000000 \n",
491 | "50% 3.180000 0.470000 10.400000 6.000000 \n",
492 | "75% 3.280000 0.550000 11.400000 6.000000 \n",
493 | "max 3.820000 1.080000 14.200000 9.000000 "
494 | ]
495 | },
496 | "execution_count": 7,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "dataset.describe()"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "Suppose you were given this dataset and asked a specific question: classify which wines are good and which are not.\n",
510 | "There is no \"Y\" attribute with the answer, but there is a useful auxiliary attribute, \"quality\", from which we can create our \"Y\" attribute with the answer for training the model.\n",
511 | "The \"quality\" attribute takes values from 3 to 9, where 3 is \"Not Good\" and 9 is \"Good\" wine quality: the higher the number, the better the wine."
512 | ]
513 | },
514 | {
515 | "cell_type": "markdown",
516 | "metadata": {},
517 | "source": [
518 | "* ### Encoding 'quality' attribute"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 8,
524 | "metadata": {},
525 | "outputs": [],
526 | "source": [
527 | "# lambda function; wine quality from 3-6 == 0, from 7-9 == 1.\n",
528 | "dataset['quality'] = dataset.quality.apply(lambda q: 0 if q <= 6 else 1)"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": 9,
534 | "metadata": {},
535 | "outputs": [
536 | {
537 | "data": {
651 | "text/plain": [
652 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
653 | "0 7.0 0.27 0.36 20.7 0.045 \n",
654 | "1 6.3 0.30 0.34 1.6 0.049 \n",
655 | "2 8.1 0.28 0.40 6.9 0.050 \n",
656 | "3 7.2 0.23 0.32 8.5 0.058 \n",
657 | "4 7.2 0.23 0.32 8.5 0.058 \n",
658 | "\n",
659 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
660 | "0 45.0 170.0 1.0010 3.00 0.45 \n",
661 | "1 14.0 132.0 0.9940 3.30 0.49 \n",
662 | "2 30.0 97.0 0.9951 3.26 0.44 \n",
663 | "3 47.0 186.0 0.9956 3.19 0.40 \n",
664 | "4 47.0 186.0 0.9956 3.19 0.40 \n",
665 | "\n",
666 | " alcohol quality \n",
667 | "0 8.8 0 \n",
668 | "1 9.5 0 \n",
669 | "2 10.1 0 \n",
670 | "3 9.9 0 \n",
671 | "4 9.9 0 "
672 | ]
673 | },
674 | "execution_count": 9,
675 | "metadata": {},
676 | "output_type": "execute_result"
677 | }
678 | ],
679 | "source": [
680 | "# preview of the first 5 lines of the loaded data \n",
681 | "dataset.head()"
682 | ]
683 | },
684 | {
685 | "cell_type": "markdown",
686 | "metadata": {},
687 | "source": [
688 | "* ### 'quality' attribute value counts and visualisation"
689 | ]
690 | },
691 | {
692 | "cell_type": "code",
693 | "execution_count": 10,
694 | "metadata": {},
695 | "outputs": [
696 | {
697 | "name": "stdout",
698 | "output_type": "stream",
699 | "text": [
700 | "Not good wine 78.36 % of the dataset\n",
701 | "Good wine 21.64 % of the dataset\n"
702 | ]
703 | },
704 | {
705 | "data": {
706 | "text/plain": [
707 | "0 3838\n",
708 | "1 1060\n",
709 | "Name: quality, dtype: int64"
710 | ]
711 | },
712 | "execution_count": 10,
713 | "metadata": {},
714 | "output_type": "execute_result"
715 | }
716 | ],
717 | "source": [
718 | "print('Not good wine', round(dataset['quality'].value_counts()[0]/len(dataset) * 100,2), '% of the dataset')\n",
719 | "print('Good wine', round(dataset['quality'].value_counts()[1]/len(dataset) * 100,2), '% of the dataset')\n",
720 | "\n",
721 | "dataset['quality'].value_counts()"
722 | ]
723 | },
724 | {
725 | "cell_type": "code",
726 | "execution_count": 11,
727 | "metadata": {
728 | "scrolled": false
729 | },
730 | "outputs": [
731 | {
732 | "data": {
733 | "text/plain": [
734 | ""
735 | ]
736 | },
737 | "execution_count": 11,
738 | "metadata": {},
739 | "output_type": "execute_result"
740 | },
741 | {
742 | "data": {
743 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD5CAYAAADP2jUWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAATyElEQVR4nO3df4xddZnH8fed0oVx2yKWIW2tgCz2gXWVGgWT5Zc/WBOipEuwEtsFu5GypGL4A38ltsuW6BqTpbIYqwaokHQVYrG6ULrZWFxAsIjyKxH7hOwWtHY2TMZdadFiy8z+cc+Eu2XaOTNzubcz3/crabj3Od9z53u4J5975jln7mkMDw8jSSpLT7cnIEnqPMNfkgpk+EtSgQx/SSqQ4S9JBTL8JalAR9UdGBH/BByfmSsiYjFwCzAHeAC4KjMPRMSJwEbgBCCB5Zm5NyJeD/wLcAowAHwkM/+7zdsiSaqpVvhHxPuBjwFbqtJG4IrM3B4RtwIrga8D64H1mXlHRKwB1gCfBb4APJiZH4yIy4B/Bi6tOcejgTOBfuDlmutIUulmAPOBR4GXDl44ZvhHxBuALwL/CJwREScBvZm5vRpyG7A2Im4BzgP+uqV+P83w/2C1DOA7wNciYmZm7q+xAWcCD9YYJ0l6tXOBHx9crNPz/ybweeB/qucLaB6Fj+gHFgLHAy9k5oGD6v9vnWr5C0BfzYn3jz1EknQIo2boYY/8I+IK4NeZuS0iVlTlHqD1OyEawNAodar6yJhWjZZlY3kZYHBwL0NDfhVFO/T1zWZgYE+3pyG9ivtm+/T0NJg7dxYcol0+VtvnUmB+RDwBvAGYRTPg57eMmQfsBp4Hjo2IGZn5cjVmdzXmN9W4XRFxFDAbGJzQFkmSJu2wbZ/M/KvM/IvMXAz8PfCvmfm3wL6IOLsadhmwterfP8grJ3IvB7ZWj++tnlMtf7Bmv1+S9BqofannQZYDN0fEHOAx4Kaqvgq4PSJWA78CPlrV1wC3RcQvgP+t1pckdUljCnyl88nATnv+7WNfVUcq9832aen5vxl49lXLOz0hSVL3Gf6SVCDDX5IKNNETvhrF7Dm9HHP01Phf2tc3u9tTGNO+lw6w54U/dHsa0rQ0NZJqijjm6KO46NofdHsa08bdNyzBU3/Sa8O2jyQVyPCXpAIZ/pJUIMNfkgpk+EtSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoFqfatnRFwPfBgYBm7NzHUR8S3gHODFatjazNwcERcA64Be4M7MXF29xmLgFmAO8ABwVWYeaOvWSJJqGfPIPyLOB94HvB14F/DJiIjq8XmZubj6tzkieoENwBLgdODMiLiweqmNwNWZuQhoACvbvzmSpDrGDP/MvB94b3WUfgLN3xb+AJwIbIiIpyJibUT0AGcBz2Tmzmr8RmBpRJwE9Gbm9uplbwOWtn9zJEl11Gr7ZOb+iFgLfAr4LjATuA9YBfwOuAf4OLAX6G9ZtR9YCCw4RL226i70KsxUuOOY2sv3vDNq38krM6+LiC8DdwPvz8yLR5ZFxFeBy4FNNM8LjGgAQzR/wxitXtvg4F6GhobHHthF7rTtNzDgvbxK0tc32/e8TXp6Goc9aK7T8z+tOllLZv4e+B5waURc0jKsAewHdgHzW+rzgN2HqUuSuqDOpZ6nADdHxNER8Sc0T+beD9wYEcdFxEzgSmAz8AgQEXFqRMwAlgFbM/M5YF9EnF295mXA1nZvjCSpnjonfO8FtgCPAz8HHs7M64EvAQ8BTwNPZOZ3MnMfsAK4q6rvoNkKAlgOfCUidgCzgJvauymSpLoaw8NHdh8dOBnYOVV6/hdd+4NuT2PauPuGJfZ/C2PPv31aev5vBp591fJOT0iS1H2GvyQVyPCXpAIZ/pJUIMNfkgpk+EtSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoEMf0kqkOEvSQUy/CWpQIa/JBXoqDqDIuJ64MPAMHBrZq6LiAuAdUAvcGdmrq7GLgZuAeYADwBXZeaBiDgR2AicACSwPDP3tnuDJEljG/PIPyLOB94HvB14F/DJiDgD2AAsAU4HzoyIC6tVNgJXZ+YioAGsrOrrgfWZeRrwM2BNOzdEklTfmOGfmfcD783MAzSP2o8CXg88k5k7q/pGYGlEnAT0Zub2avXbqvpM4DxgU2u9nRsiSaqvVtsnM/dHxFrgU8B3gQVAf8uQfmDhYerHAy9UHxSt9dqqu9CrMH19s7s9BXWY73ln1Ap/gMy8LiK+DNwNLKLZ/x/RAIZo/iZRp05Vr21wcC9DQwe/xJHFnbb9Bgb2dHsK6qC+vtm+523S09M47EFznZ7/adVJXDLz98D3gPcA81uGzQN2A7sOUX8eODYiZlT1+VVdktQFdS71PAW4OSKOjog/oXmS95tARMSpVaAvA7Zm5nPAvog4u1r3sqq+H3gQuLSqXw5sbeeGSJLqq3PC915gC/A48HPg4cy8A1gB3AU8DezglZO5y4GvRMQOYBZwU1VfBVwZEU8D5wKr27cZkqTxaAwPH9l9dOBkYOdU6flfdO0Puj2NaePuG5bY/y2MPf/2aen5vxl49lXLOz0hSVL3Gf6SVCDDX5IKZPhLUoEMf0kqkOEvSQUy/CWpQIa/JBXI8JekAhn+klQgw1+SCmT4S1KBDH9JKpDhL0kFMvwlqUCGvyQVyPCXpAIZ/pJUoKPqDIqI64CPVE+3ZOZnIuJbwDnAi1V9bWZujogLgHVAL3BnZq6uXmMxcAswB3gAuCozD7RvUyRJdY155F+F+QeAdwCLgXdGxMXAu4DzMnNx9W9zRPQCG4AlwOnAmRFxYfVSG4GrM3MR0ABWtn9zJEl11Dny7weuzcw/AkTEL4ETq38bIuKNwGZgLXAW8Exm7qzGbgSWRsTTQG9mbq9e87Zq/NfbuC2SpJrGDP/M/MXI44h4C832z7nAe4BVwO+Ae4CPA3tpfliM6AcWAgsOUa+tugu9CtPXN7vbU1CH+Z53Rq2eP0BEvBXYAnw6MxO4uGXZV4HLgU3AcMtqDWCIZntptHptg4N7GRoaHntgF7nTtt/AwJ5uT0Ed1Nc32/e8TXp6Goc9aK51tU9EnA1sAz6XmbdHxNsi4pKWIQ1gP7ALmN9SnwfsPkxdktQFdU74vgn4PrAsM++oyg3gxog4LiJmAlfS7Ps/0lwlTo2IGcAyYGtmPgfsqz5EAC4DtrZ5WyRJNdVp+3wKOAZYFxEjtW8AXwIeAmYCd2XmdwAiYgVwV7XOvTRbQQDLgZsjYg7wGHBTezZBkjRedU74XgNcc4jF60cZvw04Y5T6kzSvBpIkdZl/4StJBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoEMf0kqkOEvSQUy/CWpQIa/JBXI8JekAhn+klQgw1+SCmT4S1KBDH9JKpDhL0kFMvwlqUB17uFLRFwHfKR6uiUzPxMRFwDrgF7gzsxcXY1dDNwCzAE
eAK7KzAMRcSKwETgBSGB5Zu5t69ZIkmoZ88i/CvkPAO8AFgPvjIiPAhuAJcDpwJkRcWG1ykbg6sxcBDSAlVV9PbA+M08DfgasaeeGSJLqq9P26Qeuzcw/ZuZ+4JfAIuCZzNyZmQdoBv7SiDgJ6M3M7dW6t1X1mcB5wKbWevs2Q5I0HmO2fTLzFyOPI+ItNNs/X6X5oTCiH1gILDhE/XjgheqDorVe29y5s8YzXNNEX9/sbk9BHeZ73hm1ev4AEfFWYAvwaeAAzaP/EQ1giOZvEsM16lT12gYH9zI0dPBLHFncadtvYGBPt6egDurrm+173iY9PY3DHjTXutonIs4GtgGfy8zbgV3A/JYh84Ddh6k/DxwbETOq+vyqLknqgjonfN8EfB9Ylpl3VOVHmovi1CrQlwFbM/M5YF/1YQFwWVXfDzwIXFrVLwe2tnE7JEnjUKft8yngGGBdRIzUvgGsAO6qlt3LKydzlwM3R8Qc4DHgpqq+Crg9IlYDvwI+2ob5S5ImoM4J32uAaw6x+IxRxj8JnDVK/TngPeOcnyTpNeBf+EpSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoEMf0kqkOEvSQUy/CWpQIa/JBXI8JekAhn+klQgw1+SCmT4S1KBDH9JKlCde/gCUN2T92HgQ5n5bER8CzgHeLEasjYzN0fEBcA6oBe4MzNXV+svBm4B5gAPAFdl5oH2bYokqa5aR/4R8W7gx8CilvK7gPMyc3H1b3NE9AIbgCXA6cCZEXFhNX4jcHVmLgIawMp2bYQkaXzqtn1WAp8AdgNExOuAE4ENEfFURKyNiB6aN25/JjN3Vkf1G4GlEXES0JuZ26vXuw1Y2sbtkCSNQ622T2ZeARARI6V5wH3AKuB3wD3Ax4G9QH/Lqv3AQmDBIeq1zZ07azzDNU309c3u9hTUYb7nnVG7598qM/8LuHjkeUR8Fbgc2AQMtwxtAEM0f8MYrV7b4OBehoaGxx7YRe607TcwsKfbU1AH9fXN9j1vk56exmEPmid0tU9EvC0iLmkpNYD9wC5gfkt9Hs1W0aHqkqQumOilng3gxog4LiJmAlcCm4FHgIiIUyNiBrAM2JqZzwH7IuLsav3LgK2TnLskaYImFP6Z+RTwJeAh4Gngicz8TmbuA1YAd1X1HTRbQQDLga9ExA5gFnDT5KYuSZqocfX8M/PklsfrgfWjjNkGnDFK/UmaVwNJkrrMv/CVpAIZ/pJUIMNfkgpk+EtSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoEMf0kqkOEvSQUy/CWpQIa/JBXI8JekAhn+klSg2rdxjIg5wMPAhzLz2Yi4AFgH9AJ3Zubqatxi4BZgDvAAcFVmHoiIE4GNwAlAAsszc29bt0bSIc2e08sxR4/rzq1d0dc3u9tTGNO+lw6w54U/dHsak1JrT4iIdwM3A4uq573ABuB84NfAloi4MDO30gz4KzJze0TcCqwEvk7zfr/rM/OOiFgDrAE+2+4NkjS6Y44+iouu/UG3pzEt3H3DEvZ0exKTVLftsxL4BLC7en4W8Exm7szMAzQDf2lEnAT0Zub2atxtVX0mcB6wqbU++elLkiai1pF/Zl4BEBEjpQVAf8uQfmDhYerHAy9UHxSt9drmzp01nuGaJqZCC0Blmur75kQbgD3AcMvzBjA0jjpVvbbBwb0MDR38EkeWqb4zHIkGBqb6L9dHDvfP9jrS982ensZhD5onerXPLmB+y/N5NFtCh6o/DxwbETOq+nxeaSFJkjpsouH/CBARcWoV6MuArZn5HLAvIs6uxl1W1fcDDwKXVvXLga2TmLckaRImFP6ZuQ9YAdwFPA3s4JWTucuBr0TEDmAWcFNVXwVcGRFPA+cCqyc+bUnSZIyr55+ZJ7c83gacMcqYJ2leDXRw/TngPeOeoSSp7fwLX0kqkOEvSQUy/CWpQIa/JBXI8JekAhn+klQgw1+SCmT4S1KBDH9JKpDhL0kFMvwlqUCGvyQVyPCXpAIZ/pJUIMNfkgpk+EtSgQx/SSqQ4S9JBRrXbRwPFhE/Ak4A9lelvwP+jOb9eWcCN2bm16qxFwDrgF7gzsz0Hr6S1CUTDv+IaACLgJMy80BVeyNwB/BO4CXg4eoDYiewATgf+DWwJSIuzMytk5y/JGkCJnPkH9V//z0i5gI3A3uA+zLztwARsQn4MHA/8Exm7qzqG4GlgOEvSV0wmfA/DtgGfJJmi+c/gDuB/pYx/cBZwIJR6gvH88Pmzp01ialqqurrm93tKUijmur75oTDPzN/Avxk5HlE3Eqzp/+FlmENYIjmieXhUeq1DQ7uZWhoeOyBXTTVd4Yj0cDAnm5PYdpw/2yvI33f7OlpHPagecJX+0TEORHx/pZSA3gWmN9SmwfsBnYdoi5J6oLJtH1eD1wfEX9Js+3zMeBvgI0R0Qe8CFwCXAk8BUREnErz5O8ymieAJUldMOEj/8y8B9gCPA78HNiQmQ8Bnwd+BDwBfDszf5qZ+4AVwF3A08AOYNPkpi5JmqhJXeefmWuANQfVvg18e5Sx24AzJvPzJEnt4V/4SlKBDH9JKpDhL0kFMvwlqUCGvyQVyPCXpAIZ/pJUIMNfkgpk+EtSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCGf6SVCDDX5IKZPhLUoEMf0kq0KRu4zheEbEMWE3zhu83ZubXOvnzJUlNHTvyj4g3Al8EzgEWA1dGxJ936udLkl7RySP/C4D7MvO3ABGxCfgwcP0Y680A6OlpvLaza5MTjuvt9hSmlanyvk8V7p/tc6Tvmy3zmzHa8k6G/wKgv+V5P3BWjfXmAxx33J++FnNqu1tXf6DbU5hW5s6d1e0pTCvun+0zhfbN+cB/HlzsZPj3AMMtzxvAUI31HgXOpflh8fJrMC9Jmo5m0Az+R0db2Mnw30UzxEfMA3bXWO8l4MevyYwkaXp71RH/iE6G/w+Bf4iIPuBF4BLgyg7+fElSpWNX+2Tmb4DPAz8CngC+nZk/7dTPlyS9ojE8PDz2KEnStOJf+EpSgQx/SSqQ4S9JBTL8JalAhr8kFcjwl6QCdfQrndUdEXEazS/RW0jzKzV2A/+WmT/r6sQkdY1H/tNcRKwC7qiePgo8Vj2+OSKu7c6sJHWbf+Q1zUVEAu/IzN8fVH8d8FhmntadmUkQEScebnlm/qpTcymNbZ/p7wDNO6cdrBfY3+G5SAfbAryFZivy4C/IHwZO6fiMCmH4T39fBB6PiG00vxZ7mOa9Fd5H87uWpG46G3gQWJWZD3V7MiWx7VOAiFhA805qC2ie59kF/DAz63yltvSaioizgCsy02/57SDDX5IK5NU+klQgw1+SCmT4S1KBDH9JKtD/ASGyRVHeZtJaAAAAAElFTkSuQmCC\n",
744 | "text/plain": [
745 | ""
746 | ]
747 | },
748 | "metadata": {
749 | "needs_background": "light"
750 | },
751 | "output_type": "display_data"
752 | }
753 | ],
754 | "source": [
755 | "# visualisation plot\n",
756 | "dataset['quality'].value_counts().plot(kind='bar')"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "78.36 % of the wines in our dataset are 'Not Good' quality and only 21.64 % are 'Good', which means the dataset is imbalanced."
764 | ]
765 | },
766 | {
767 | "cell_type": "markdown",
768 | "metadata": {},
769 | "source": [
770 | "* ### Resampling of an imbalanced dataset"
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": 12,
776 | "metadata": {},
777 | "outputs": [],
778 | "source": [
779 | "# class count\n",
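780 | "# (the resampling code in this and the following cells is left commented out; uncomment it to rerun the notebook on a balanced dataset)\n",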
780 | "#count_class_0, count_class_1 = dataset.quality.value_counts()\n",
781 | "\n",
782 | "# divide by class\n",
783 | "#class_0 = dataset[dataset['quality'] == 0]\n",
784 | "#class_1 = dataset[dataset['quality'] == 1]"
785 | ]
786 | },
787 | {
788 | "cell_type": "markdown",
789 | "metadata": {},
790 | "source": [
791 | "* ### Random under-sampling of an imbalanced dataset"
792 | ]
793 | },
794 | {
795 | "cell_type": "code",
796 | "execution_count": 13,
797 | "metadata": {},
798 | "outputs": [],
799 | "source": [
800 | "#class_0_under = class_0.sample(count_class_1)\n",
801 | "#dataset_under = pd.concat([class_0_under, class_1], axis=0)\n",
802 | "\n",
803 | "#print('Random under-sampling:')\n",
804 | "#print(dataset_under.quality.value_counts())\n",
805 | "\n",
806 | "#dataset_under.quality.value_counts().plot(kind='bar', title='Count (target)');"
807 | ]
808 | },
809 | {
810 | "cell_type": "markdown",
811 | "metadata": {},
812 | "source": [
813 | "* ### Random over-sampling of an imbalanced dataset"
814 | ]
815 | },
816 | {
817 | "cell_type": "code",
818 | "execution_count": 14,
819 | "metadata": {},
820 | "outputs": [],
821 | "source": [
822 | "#class_1_over = class_1.sample(count_class_0, replace=True)\n",
823 | "#dataset_over = pd.concat([class_0, class_1_over], axis=0)\n",
824 | "\n",
825 | "#print('Random over-sampling:')\n",
826 | "#print(dataset_over.quality.value_counts())\n",
827 | "\n",
828 | "#dataset_over.quality.value_counts().plot(kind='bar', title='Count (target)');"
829 | ]
830 | },
831 | {
832 | "cell_type": "markdown",
833 | "metadata": {},
834 | "source": [
835 | "* ### Initialisation of target"
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 15,
841 | "metadata": {},
842 | "outputs": [],
843 | "source": [
844 | "# initialisation of target\n",
845 | "target = dataset['quality']\n",
846 | "\n",
847 | "# for under-sampling dataset\n",
848 | "#target_under = dataset_under['quality']\n",
849 | "\n",
850 | "# for over-sampling dataset\n",
851 | "#target_over = dataset_over['quality'] "
852 | ]
853 | },
854 | {
855 | "cell_type": "markdown",
856 | "metadata": {},
857 | "source": [
858 | "* ### Drop column 'quality'"
859 | ]
860 | },
861 | {
862 | "cell_type": "code",
863 | "execution_count": 16,
864 | "metadata": {},
865 | "outputs": [],
866 | "source": [
867 | "dataset = dataset.drop(columns=['quality'])\n",
868 | "\n",
869 | "# for under-sampling dataset\n",
870 | "#dataset_under = dataset_under.drop(columns=['quality'])\n",
871 | "\n",
872 | "# for over-sampling dataset\n",
873 | "#dataset_over = dataset_over.drop(columns=['quality'])"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": null,
879 | "metadata": {},
880 | "outputs": [],
881 | "source": []
882 | }
883 | ],
884 | "metadata": {
885 | "kernelspec": {
886 | "display_name": "Python 3",
887 | "language": "python",
888 | "name": "python3"
889 | },
890 | "language_info": {
891 | "codemirror_mode": {
892 | "name": "ipython",
893 | "version": 3
894 | },
895 | "file_extension": ".py",
896 | "mimetype": "text/x-python",
897 | "name": "python",
898 | "nbconvert_exporter": "python",
899 | "pygments_lexer": "ipython3",
900 | "version": "3.7.3"
901 | }
902 | },
903 | "nbformat": 4,
904 | "nbformat_minor": 2
905 | }
906 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 2/Winequality - Practice Code Part 3&4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Part 3: Data Wrangling and Transformation."
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "* ### StandardScaler"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [
22 | {
23 | "ename": "NameError",
24 | "evalue": "name 'StandardScaler' is not defined",
25 | "output_type": "error",
26 | "traceback": [
27 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
28 | "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
29 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;31m# StandardScaler\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0msc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mStandardScaler\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mdataset_sc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0msc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mfit_transform\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdataset\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
30 | "\u001b[1;31mNameError\u001b[0m: name 'StandardScaler' is not defined"
31 | ]
32 | }
33 | ],
34 | "source": [
35 | "# StandardScaler \n",
36 | "sc = StandardScaler()\n",
37 | "\n",
38 | "dataset_sc = sc.fit_transform(dataset)\n",
39 | "\n",
40 | "# for under-sampling dataset\n",
41 | "#dataset_sc = sc.fit_transform(dataset_under)\n",
42 | "\n",
43 | "# for over-sampling dataset\n",
44 | "#dataset_sc = sc.fit_transform(dataset_over)\n",
45 | "\n",
46 | "dataset_sc = pd.DataFrame(dataset_sc)\n",
47 | "dataset_sc.head()"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": []
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "* ### Creating datasets for ML part"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": null,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": [
70 | "# set 'X' for the features and 'y' for the target ('quality').\n",
71 | "y = target\n",
72 | "X = dataset_sc.copy()\n",
73 | "\n",
74 | "# for under-sampling dataset \n",
75 | "#y = target_under\n",
76 | "#X = dataset_sc.copy()\n",
77 | "\n",
78 | "# for over-sampling dataset \n",
79 | "#y = target_over\n",
80 | "#X = dataset_sc.copy()"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": null,
86 | "metadata": {
87 | "scrolled": true
88 | },
89 | "outputs": [],
90 | "source": [
91 | "# preview of the first 5 lines of the loaded data \n",
92 | "X.head()"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "* ### 'Train\\Test' split"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "# apply 'Train\\Test' splitting method\n",
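109 | "# note: with imbalanced classes, passing stratify=y here is an option to keep the class ratio in both splits\n",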
109 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "# print shape of X_train and y_train\n",
119 | "X_train.shape, y_train.shape"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {
126 | "scrolled": false
127 | },
128 | "outputs": [],
129 | "source": [
130 | "# print shape of X_test and y_test\n",
131 | "X_test.shape, y_test.shape"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "## Part 4: Machine Learning."
139 | ]
140 | },
141 | {
142 | "cell_type": "markdown",
143 | "metadata": {},
144 | "source": [
145 | "* ### Build, train and evaluate models without hyperparameters"
146 | ]
147 | },
148 | {
149 | "cell_type": "markdown",
150 | "metadata": {},
151 | "source": [
152 | "* Logistic Regression\n",
153 | "* K-Nearest Neighbors\n",
154 | "* Decision Trees\n"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "# Logistic Regression\n",
164 | "LR = LogisticRegression()\n",
165 | "LR.fit(X_train, y_train)\n",
166 | "LR_pred = LR.predict(X_test)\n",
167 | "\n",
168 | "# K-Nearest Neighbors\n",
169 | "KNN = KNeighborsClassifier()\n",
170 | "KNN.fit(X_train, y_train)\n",
171 | "KNN_pred = KNN.predict(X_test)\n",
172 | "\n",
173 | "# Decision Tree\n",
174 | "DT = DecisionTreeClassifier(random_state = 0)\n",
175 | "DT.fit(X_train, y_train)\n",
176 | "DT_pred = DT.predict(X_test)"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "* ### Classification report"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {
190 | "scrolled": true
191 | },
192 | "outputs": [],
193 | "source": [
194 | "print(\"LR Classification Report: \\n\", classification_report(y_test, LR_pred, digits = 6))\n",
195 | "print(\"KNN Classification Report: \\n\", classification_report(y_test, KNN_pred, digits = 6))\n",
196 | "print(\"DT Classification Report: \\n\", classification_report(y_test, DT_pred, digits = 6))"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {},
202 | "source": [
203 | "* ### Confusion matrix"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "metadata": {},
210 | "outputs": [],
211 | "source": [
212 | "LR_confusion_mx = confusion_matrix(y_test, LR_pred)\n",
213 | "print(\"LR Confusion Matrix: \\n\", LR_confusion_mx)\n",
214 | "print()\n",
215 | "KNN_confusion_mx = confusion_matrix(y_test, KNN_pred)\n",
216 | "print(\"KNN Confusion Matrix: \\n\", KNN_confusion_mx)\n",
217 | "print()\n",
218 | "DT_confusion_mx = confusion_matrix(y_test, DT_pred)\n",
219 | "print(\"DT Confusion Matrix: \\n\", DT_confusion_mx)\n",
220 | "print()"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "* ### ROC-AUC score"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": null,
233 | "metadata": {
234 | "scrolled": true
235 | },
236 | "outputs": [],
237 | "source": [
238 | "roc_auc_score(y_test, DT_pred)"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "* ### Build, train and evaluate models with hyperparameters"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": [
254 | "# Logistic Regression\n",
255 | "LR = LogisticRegression()\n",
256 | "LR_params = {'C':[1,2,3,4,5,6,7,8,9,10], 'penalty':['l1', 'l2', 'elasticnet', 'none'], 'solver':['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'], 'random_state':[0]}\n",
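257 | "# note: not every penalty/solver pair above is compatible (e.g. 'l1' needs 'liblinear' or 'saga', 'elasticnet' needs 'saga');\n",
258 | "# depending on the scikit-learn version, incompatible combinations either raise or are scored as NaN during the grid search\n",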
257 | "LR1 = GridSearchCV(LR, param_grid = LR_params)\n",
258 | "LR1.fit(X_train, y_train)\n",
259 | "LR1_pred = LR1.predict(X_test)\n",
260 | "\n",
261 | "# K-Nearest Neighbors\n",
262 | "KNN = KNeighborsClassifier()\n",
263 | "KNN_params = {'n_neighbors':[5,7,9,11]}\n",
264 | "KNN1 = GridSearchCV(KNN, param_grid = KNN_params) \n",
265 | "KNN1.fit(X_train, y_train)\n",
266 | "KNN1_pred = KNN1.predict(X_test)\n",
267 | "\n",
268 | "# Decision Tree\n",
269 | "DT = DecisionTreeClassifier()\n",
270 | "DT_params = {'max_depth':[2,10,15,20], 'criterion':['gini', 'entropy'], 'random_state':[0]}\n",
271 | "DT1 = GridSearchCV(DT, param_grid = DT_params)\n",
272 | "DT1.fit(X_train, y_train)\n",
273 | "DT1_pred = DT1.predict(X_test)"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": [
282 | "# print the best hyper parameters set\n",
283 | "print(\"Logistic Regression Best Hyper Parameters: \", LR1.best_params_)\n",
284 | "print(\"K-Nearest Neighbour Best Hyper Parameters: \", KNN1.best_params_)\n",
285 | "print(\"Decision Tree Best Hyper Parameters: \", DT1.best_params_)"
286 | ]
287 | },
288 | {
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "* ### Classification report"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "print(\"LR Classification Report: \\n\", classification_report(y_test, LR1_pred, digits = 6))\n",
302 | "print(\"KNN Classification Report: \\n\", classification_report(y_test, KNN1_pred, digits = 6))\n",
303 | "print(\"DT Classification Report: \\n\", classification_report(y_test, DT1_pred, digits = 6))"
304 | ]
305 | },
306 | {
307 | "cell_type": "markdown",
308 | "metadata": {},
309 | "source": [
310 | "* ### Confusion matrix"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": null,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "# confusion matrix of DT model\n",
320 | "DT_confusion_mx = confusion_matrix(y_test, DT1_pred)\n",
321 | "print('DT Confusion Matrix')\n",
322 | "\n",
323 | "# visualisation\n",
324 | "ax = plt.subplot()\n",
325 | "sns.heatmap(DT_confusion_mx, annot = True, fmt = 'd', cmap = 'Blues', ax = ax, linewidths = 0.5, annot_kws = {'size': 15})\n",
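326 | "# rows are true labels, columns are predicted labels; sklearn orders the classes ascending (0, 1)\n",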
326 | "ax.set_ylabel('True label')\n",
327 | "ax.set_xlabel('Predicted label')\n",
328 | "ax.xaxis.set_ticklabels(['0', '1'], fontsize = 10)\n",
329 | "ax.yaxis.set_ticklabels(['0', '1'], fontsize = 10)\n",
330 | "plt.show()\n",
331 | "print() "
332 | ]
333 | },
334 | {
335 | "cell_type": "markdown",
336 | "metadata": {},
337 | "source": [
338 | "* ### ROC-AUC score"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": null,
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "roc_auc_score(y_test, DT1_pred)"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "## Conclusion."
355 | ]
356 | },
357 | {
358 | "cell_type": "code",
359 | "execution_count": null,
360 | "metadata": {},
361 | "outputs": [],
362 | "source": [
363 | "# submission of .csv file with predictions\n",
364 | "sub = pd.DataFrame()\n",
365 | "sub['ID'] = X_test.index\n",
366 | "sub['quality'] = DT1_pred\n",
367 | "sub.to_csv('WinePredictionsTest.csv', index=False)"
368 | ]
369 | },
370 | {
371 | "cell_type": "markdown",
372 | "metadata": {},
373 | "source": [
374 | "**Question**: Predict which wines are 'Good/1' and 'Not Good/0' (use binary classification; check balance of classes; calculate predictions; choose the best model).\n",
375 | "\n",
376 | "**Answers**:\n",
377 | "\n",
378 | "1. Binary classification was applied.\n",
379 | "\n",
380 | "2. Classes were highly imbalanced with 78.36 % of '0' class and only 21.64 % of '1' class in our dataset. \n",
381 | "\n",
382 | "3. Three options were applied in order to calculate the best predictions:\n",
383 | "    * Calculate predictions on the imbalanced dataset\n",
384 | "    * Calculate predictions with random under-sampling of the imbalanced dataset\n",
385 | "    * Calculate predictions with random over-sampling of the imbalanced dataset\n",
386 | " \n",
387 | "4. Three ML models were used: Logistic Regression, KNN, Decision Tree (without and with hyperparameters).\n",
388 | "\n",
389 | "5. The best result was chosen: \n",
390 | "    * Random over-sampling dataset with 3838 entities in class '0' and 3838 entities in class '1', 7676 entities in total.\n",
391 | "    * Train/Test split: test_size=0.2, random_state=0\n",
392 | "    * Decision Tree model without hyperparameter tuning, with an accuracy score equal to ... and a ROC-AUC score equal to ... ."
393 | ]
394 | }
395 | ],
396 | "metadata": {
397 | "kernelspec": {
398 | "display_name": "Python 3",
399 | "language": "python",
400 | "name": "python3"
401 | },
402 | "language_info": {
403 | "codemirror_mode": {
404 | "name": "ipython",
405 | "version": 3
406 | },
407 | "file_extension": ".py",
408 | "mimetype": "text/x-python",
409 | "name": "python",
410 | "nbconvert_exporter": "python",
411 | "pygments_lexer": "ipython3",
412 | "version": "3.7.3"
413 | }
414 | },
415 | "nbformat": 4,
416 | "nbformat_minor": 2
417 | }
418 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/Practice 2/Winequality - Practice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# \"Wine Quality.\""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### _\"Quality ratings of Portuguese white wines\" (Classification task)._"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents\n",
22 | "\n",
23 | "\n",
24 | "## Part 0: Introduction\n",
25 | "\n",
26 | "### Overview\n",
27 | "The dataset we see here contains 12 columns and 4898 entries of data about Portuguese white wines.\n",
28 | " \n",
29 | "**Metadata:**\n",
30 | " \n",
31 | "* **fixed acidity** \n",
32 | "\n",
33 | "* **volatile acidity**\n",
34 | "\n",
35 | "* **citric acid** \n",
36 | "\n",
37 | "* **residual sugar** \n",
38 | "\n",
39 | "* **chlorides** \n",
40 | "\n",
41 | "* **free sulfur dioxide** \n",
42 | "\n",
43 | "* **total sulfur dioxide**\n",
44 | "\n",
45 | "* **density** \n",
46 | "\n",
47 | "* **pH** \n",
48 | "\n",
49 | "* **sulphates** \n",
50 | "\n",
51 | "* **alcohol** \n",
52 | "\n",
53 | "* **quality** - score between 3 and 9\n",
54 | "\n",
55 | "\n",
56 | "### Questions:\n",
57 | " \n",
58 | "Predict which wines are 'Good/1' and 'Not Good/0' (use binary classification; check balance of classes; calculate predictions; choose the best model)\n",
59 | "\n",
60 | "\n",
61 | "## [Part 1: Import, Load Data](#Part-1:-Import,-Load-Data.)\n",
62 | "* ### Import libraries, Read data from ‘.csv’ file\n",
63 | "\n",
64 | "## [Part 2: Exploratory Data Analysis](#Part-2:-Exploratory-Data-Analysis.)\n",
65 | "* ### Info, Head, Describe\n",
66 | "* ### Encoding 'quality' attribute\n",
67 | "* ### 'quality' attribute value counts and visualisation\n",
68 | "* ### Resampling of an imbalanced dataset\n",
69 | "* ### Random under-sampling of an imbalanced dataset\n",
70 | "* ### Random over-sampling of an imbalanced dataset\n",
71 | "* ### Initialisation of target\n",
72 | "* ### Drop column 'quality'\n",
73 | "\n",
74 | "## [Part 3: Data Wrangling and Transformation](#Part-3:-Data-Wrangling-and-Transformation.)\n",
75 | "* ### StandardScaler\n",
76 | "* ### Creating datasets for ML part\n",
77 | "* ### 'Train\\Test' splitting method\n",
78 | "\n",
79 | "## [Part 4: Machine Learning](#Part-4:-Machine-Learning.)\n",
80 | "* ### Build, train and evaluate models without hyperparameters\n",
81 | " * #### Logistic Regression, K-Nearest Neighbors, Decision Trees \n",
82 | " * #### Classification report\n",
83 | " * #### Confusion Matrix\n",
84 | " * #### ROC-AUC score\n",
85 | "* ### Build, train and evaluate models with hyperparameters\n",
86 | " * #### Logistic Regression, K-Nearest Neighbors, Decision Trees \n",
87 | " * #### Classification report\n",
88 | " * #### Confusion Matrix\n",
89 | " * #### ROC-AUC score\n",
90 | "\n",
91 | "## [Conclusion](#Conclusion.)\n",
92 | "\n"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "## Part 1: Import, Load Data."
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "* ### Import libraries"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 39,
112 | "metadata": {},
113 | "outputs": [],
114 | "source": [
115 | "# import standard libraries\n",
116 | "\n"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "* ### Read data from ‘.csv’ file"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 40,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "# read data from '.csv' file\n"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "## Part 2: Exploratory Data Analysis."
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "* ### Info"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 41,
152 | "metadata": {
153 | "scrolled": true
154 | },
155 | "outputs": [],
156 | "source": [
157 | "# print the full summary of the dataset \n"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "* ### Head"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 42,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "# preview of the first 5 lines of the loaded data \n"
174 | ]
175 | },
176 | {
177 | "cell_type": "markdown",
178 | "metadata": {},
179 | "source": [
180 | "* ### Describe"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": []
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "* ### Encoding 'quality' attribute"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 43,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "# lambda function; wine quality from 3-6 == 0, from 7-9 == 1.\n"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 44,
209 | "metadata": {},
210 | "outputs": [],
211 | "source": [
212 | "# preview of the first 5 lines of the loaded data \n"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "* ### 'quality' attribute value counts and visualisation"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": []
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 45,
232 | "metadata": {
233 | "scrolled": false
234 | },
235 | "outputs": [],
236 | "source": [
237 | "# visualisation plot\n"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "* ### Resampling of an imbalanced dataset"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 46,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "# class count\n",
254 | "\n",
255 | "\n",
256 | "# divide by class\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "* ### Random under-sampling of an imbalanced dataset"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": []
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "metadata": {},
276 | "source": [
277 | "* ### Random over-sampling of an imbalanced dataset"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": []
286 | },
287 | {
288 | "cell_type": "markdown",
289 | "metadata": {},
290 | "source": [
291 | "* ### Initialisation of target"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": null,
297 | "metadata": {},
298 | "outputs": [],
299 | "source": []
300 | },
301 | {
302 | "cell_type": "markdown",
303 | "metadata": {},
304 | "source": [
305 | "* ### Drop column 'quality'"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "execution_count": null,
311 | "metadata": {},
312 | "outputs": [],
313 | "source": []
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "## Part 3: Data Wrangling and Transformation."
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {},
325 | "source": [
326 | "* ### StandardScaler"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": [
335 | "# StandardScaler \n"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {},
341 | "source": [
342 | "* ### Creating datasets for ML part"
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": 47,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": [
351 | "# set 'X' for the features and 'y' for the target ('quality').\n",
352 | "\n",
353 | "\n",
354 | "# for under-sampling dataset \n",
355 | "\n",
356 | "\n",
357 | "# for over-sampling dataset \n"
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": 48,
363 | "metadata": {
364 | "scrolled": true
365 | },
366 | "outputs": [],
367 | "source": [
368 | "# preview of the first 5 lines of the loaded data \n"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "* ### 'Train\\Test' split"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": 50,
381 | "metadata": {},
382 | "outputs": [],
383 | "source": [
384 | "# apply 'Train\\Test' splitting method\n"
385 | ]
386 | },
387 | {
388 | "cell_type": "code",
389 | "execution_count": 51,
390 | "metadata": {},
391 | "outputs": [],
392 | "source": [
393 | "# print shape of X_train and y_train\n"
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": 52,
399 | "metadata": {
400 | "scrolled": false
401 | },
402 | "outputs": [],
403 | "source": [
404 | "# print shape of X_test and y_test\n"
405 | ]
406 | },
407 | {
408 | "cell_type": "markdown",
409 | "metadata": {},
410 | "source": [
411 | "## Part 4: Machine Learning."
412 | ]
413 | },
414 | {
415 | "cell_type": "markdown",
416 | "metadata": {},
417 | "source": [
418 | "* ### Build, train and evaluate models without hyperparameters"
419 | ]
420 | },
421 | {
422 | "cell_type": "markdown",
423 | "metadata": {},
424 | "source": [
425 | "* Logistic Regression\n",
426 | "* K-Nearest Neighbors\n",
427 | "* Decision Trees\n"
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": 53,
433 | "metadata": {},
434 | "outputs": [],
435 | "source": [
436 | "# Logistic Regression\n",
437 | "\n",
438 | "\n",
439 | "# K-Nearest Neighbors\n",
440 | "\n",
441 | "\n",
442 | "# Decision Tree\n"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "* ### Classification report"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {
456 | "scrolled": true
457 | },
458 | "outputs": [],
459 | "source": []
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {},
464 | "source": [
465 | "* ### Confusion matrix"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": []
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "* ### ROC-AUC score"
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {
486 | "scrolled": true
487 | },
488 | "outputs": [],
489 | "source": []
490 | },
491 | {
492 | "cell_type": "markdown",
493 | "metadata": {},
494 | "source": [
495 | "* ### Build, train and evaluate models with hyperparameters"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 54,
501 | "metadata": {},
502 | "outputs": [],
503 | "source": [
504 | "# Logistic Regression\n",
505 | "\n",
506 | "\n",
507 | "# K-Nearest Neighbors\n",
508 | "\n",
509 | "\n",
510 | "# Decision Tree\n"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": 55,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "# print the best hyper parameters set\n"
520 | ]
521 | },
522 | {
523 | "cell_type": "markdown",
524 | "metadata": {},
525 | "source": [
526 | "* ### Classification report"
527 | ]
528 | },
529 | {
530 | "cell_type": "code",
531 | "execution_count": null,
532 | "metadata": {},
533 | "outputs": [],
534 | "source": []
535 | },
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "* ### Confusion matrix"
541 | ]
542 | },
543 | {
544 | "cell_type": "code",
545 | "execution_count": 56,
546 | "metadata": {},
547 | "outputs": [],
548 | "source": [
549 | "# confusion matrix of DT model\n",
550 | "\n",
551 | "\n",
552 | "# visualisation\n"
553 | ]
554 | },
555 | {
556 | "cell_type": "markdown",
557 | "metadata": {},
558 | "source": [
559 | "* ### ROC-AUC score"
560 | ]
561 | },
562 | {
563 | "cell_type": "code",
564 | "execution_count": null,
565 | "metadata": {},
566 | "outputs": [],
567 | "source": []
568 | },
569 | {
570 | "cell_type": "markdown",
571 | "metadata": {},
572 | "source": [
573 | "## Conclusion."
574 | ]
575 | },
576 | {
577 | "cell_type": "code",
578 | "execution_count": 57,
579 | "metadata": {},
580 | "outputs": [],
581 | "source": [
582 | "# submission of .csv file with predictions\n"
583 | ]
584 | }
585 | ],
586 | "metadata": {
587 | "kernelspec": {
588 | "display_name": "Python 3",
589 | "language": "python",
590 | "name": "python3"
591 | },
592 | "language_info": {
593 | "codemirror_mode": {
594 | "name": "ipython",
595 | "version": 3
596 | },
597 | "file_extension": ".py",
598 | "mimetype": "text/x-python",
599 | "name": "python",
600 | "nbconvert_exporter": "python",
601 | "pygments_lexer": "ipython3",
602 | "version": "3.7.3"
603 | }
604 | },
605 | "nbformat": 4,
606 | "nbformat_minor": 2
607 | }
608 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 02/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 02. Binary Classification: Practice.
2 | **Binary Classification: Practice.**
3 |
4 |  In this tutorial, we will work through 2 practical classification cases.
7 |
8 | [**Video Tutorial 02**](https://youtu.be/RUG1PsKyBj8)
9 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/Practice 1/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 03. Multi-Class Classification: Practice.
2 | **Multi-Class Classification: Practice.**
3 |
4 |  In this tutorial, we will work through a practical classification case.
5 |
6 | You will be given 4 files:
7 |
8 | * vehicles.csv
9 |
10 | * vehicles - Practice.ipynb
11 |
12 | * vehicles - Practice Code Part 1.ipynb
13 |
14 | * vehicles - Practice Code Part 2.ipynb
15 | ##
16 |
17 | ## Practice:
18 |
19 | 
20 | * Open the files _vehicles - Practice.ipynb_ and _vehicles - Practice Code Part 1.ipynb_. Copy the code from _vehicles - Practice Code Part 1.ipynb_ into _vehicles - Practice.ipynb_. Compile the code block by block.
21 | * Open the _vehicles - Practice Code Part 2.ipynb_ file and copy the code into the _vehicles - Practice.ipynb_ file with the already compiled code. Compile everything together.
22 | * Upload the following 2 files to your Github account:
29 |
30 | **vehicles.csv**
31 |
32 | **vehicles - Practice.ipynb** (with the newly compiled code)
33 |
34 |
35 |
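36 | ##
37 |
38 | _Note:_ the vehicles case is a multi-class task, so the target column 'Class' holds text categories that the notebook converts to numbers with a label encoder. A minimal sketch of that step, assuming a `dataset` DataFrame read from _vehicles.csv_ (the variable name is illustrative, not part of the homework files):
39 |
40 | ```python
41 | # label-encode the text categories in 'Class' into integers 0..k-1
42 | import pandas as pd
43 | from sklearn.preprocessing import LabelEncoder
44 |
45 | dataset = pd.read_csv('vehicles.csv')  # assumes the csv sits next to the notebook
46 | le = LabelEncoder()
47 | dataset['Class'] = le.fit_transform(dataset['Class'])
48 | print(le.classes_)  # the original category names, in encoded order
49 | ```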
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/Practice 1/vehicles_Practice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# \"Vehicles.\""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### _\"Recognizing vehicle type from its silhouette\" (Classification task)._"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents\n",
22 | "\n",
23 | "\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "## Part 1: Import, Load Data."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "* ### Import libraries"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 27,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "# import standard libraries\n",
47 | "\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "* ### Read data from ‘.csv’ file"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 28,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "# read data from '.csv' file\n",
64 | "\n",
65 | "\n",
66 | "# initialisation of target\n"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Part 2: Exploratory Data Analysis."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "* ### Info"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 29,
86 | "metadata": {
87 | "scrolled": true
88 | },
89 | "outputs": [],
90 | "source": [
91 | "# print the full summary of the dataset \n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "* ### Head"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 30,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "# preview of the first 5 lines of the loaded data \n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "* ### Describe"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "* ### 'Class' attribute value counts and visualisation"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 31,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "# target attribute value counts\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 32,
143 | "metadata": {
144 | "scrolled": false
145 | },
146 | "outputs": [],
147 | "source": [
148 | "# target attribute visualisation plot\n"
149 | ]
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "metadata": {},
154 | "source": [
155 | "* ### Label encoder for 'Class' attribute"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 33,
161 | "metadata": {
162 | "scrolled": true
163 | },
164 | "outputs": [],
165 | "source": [
166 | "# label encoder for 'Class' attribute\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 |     "* ### Visualisation of all attributes"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 34,
179 | "metadata": {
180 | "scrolled": true
181 | },
182 | "outputs": [],
183 | "source": [
184 |     "# visualisation (first part of attributes)\n"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 35,
190 | "metadata": {
191 | "scrolled": true
192 | },
193 | "outputs": [],
194 | "source": [
195 |     "# visualisation (second part of attributes)\n"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 36,
201 | "metadata": {
202 | "scrolled": true
203 | },
204 | "outputs": [],
205 | "source": [
206 |     "# visualisation (third part of attributes)\n"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "* ### Correlation plot of each attribute"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 37,
219 | "metadata": {
220 | "scrolled": true
221 | },
222 | "outputs": [],
223 | "source": [
224 |     "# correlation plot\n"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "* ### Correlation list of each attribute"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 38,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "# correlation list\n"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "* ### Drop column 'Class'"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": []
256 | },
257 | {
258 | "cell_type": "markdown",
259 | "metadata": {},
260 | "source": [
261 | "## Part 3: Data Wrangling and Transformation."
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {},
267 | "source": [
268 | "* ### StandardScaler"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 39,
274 | "metadata": {},
275 | "outputs": [],
276 | "source": [
277 | "# StandardScaler \n"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "* ### Creating datasets for ML part"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 40,
290 | "metadata": {},
291 | "outputs": [],
292 | "source": [
293 |     "# set 'X' for the features and 'y' for the target ('Class')\n"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 |     "* ### 'Train/Test' split"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 41,
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 |     "# apply 'Train/Test' splitting method\n"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": 42,
315 | "metadata": {},
316 | "outputs": [],
317 | "source": [
318 | "# print shape of X_train and y_train\n"
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": 43,
324 | "metadata": {
325 | "scrolled": true
326 | },
327 | "outputs": [],
328 | "source": [
329 | "# print shape of X_test and y_test\n"
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "## Part 4: Machine Learning."
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {},
342 | "source": [
343 | "* ### Build, train and evaluate model"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "* SVC model"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": null,
356 | "metadata": {
357 | "scrolled": true
358 | },
359 | "outputs": [],
360 | "source": []
361 | },
362 | {
363 | "cell_type": "markdown",
364 | "metadata": {},
365 | "source": [
366 | "* ### Classification report"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": null,
372 | "metadata": {
373 | "scrolled": false
374 | },
375 | "outputs": [],
376 | "source": []
377 | },
378 | {
379 | "cell_type": "markdown",
380 | "metadata": {},
381 | "source": [
382 | "* ### Confusion matrix"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 44,
388 | "metadata": {
389 | "scrolled": true
390 | },
391 | "outputs": [],
392 | "source": [
393 | "# confusion matrix of SVC model\n"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "* ### Misclassification plot"
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 45,
406 | "metadata": {},
407 | "outputs": [],
408 | "source": [
409 | "# misclassification vehicle plot \n"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "* ### Comparison table between Actual 'Class' and Predicted 'Class'"
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": 46,
422 | "metadata": {},
423 | "outputs": [],
424 | "source": [
425 | "# comparison table between Actual 'Class' and Predicted 'Class'\n"
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {},
431 | "source": [
432 | "## Conclusion."
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": 47,
438 | "metadata": {},
439 | "outputs": [],
440 | "source": [
441 | "# submission of .csv file with test predictions\n"
442 | ]
443 | }
444 | ],
445 | "metadata": {
446 | "kernelspec": {
447 | "display_name": "Python 3",
448 | "language": "python",
449 | "name": "python3"
450 | },
451 | "language_info": {
452 | "codemirror_mode": {
453 | "name": "ipython",
454 | "version": 3
455 | },
456 | "file_extension": ".py",
457 | "mimetype": "text/x-python",
458 | "name": "python",
459 | "nbconvert_exporter": "python",
460 | "pygments_lexer": "ipython3",
461 | "version": "3.7.3"
462 | }
463 | },
464 | "nbformat": 4,
465 | "nbformat_minor": 2
466 | }
467 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/Practice 2/Practice-Homework.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 03. Multi-Class Classification: Practice.
2 |
3 |  In this tutorial, we will work through a practical classification case.
4 |
5 | You will be given 5 files:
6 |
7 | * vehicles.csv
8 |
9 | * vehicles - short version - Practice.ipynb
10 |
11 | * vehicles - short version - Practice Code Part 1.ipynb
12 |
13 | * vehicles - short version - Practice Code Part 2.ipynb
14 |
15 | * vehicles_helper.py
16 | ##
17 |
18 | ## Practice:
19 |
20 | 
21 | * Open the files _vehicles - short version - Practice.ipynb_ and _vehicles - short version - Practice Code Part 1.ipynb_. Copy the code from the _vehicles - short version - Practice Code Part 1.ipynb_ file into the _vehicles - short version - Practice.ipynb_ file and run it block by block.
22 | * Open the _vehicles - short version - Practice Code Part 2.ipynb_ file and copy its code into the _vehicles - short version - Practice.ipynb_ file, after the code you have already run. Then run everything together.
23 | * Upload the following 2 files to your Github account:
24 |
25 | **vehicles.csv**
26 |
27 | **vehicles - short version - Practice.ipynb** (with the newly run code)
28 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/Practice 2/vehicles - short version - Practice.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# \"Vehicles.\""
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### _\"Recognizing vehicle type from its silhouette\" (Classification task)._"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "## Table of Contents\n",
22 | "\n",
23 | "\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "## Part 1: Import, Load Data."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "* ### Import libraries"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 18,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "# import standard libraries\n",
47 | "\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "* ### Read data from ‘.csv’ file"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 19,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "# read data from '.csv' file\n",
64 | " \n",
65 | "\n",
66 | "# initialisation of target\n"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Part 2: Exploratory Data Analysis."
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "* ### Info"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 20,
86 | "metadata": {
87 | "scrolled": true
88 | },
89 | "outputs": [],
90 | "source": [
91 | "# print the full summary of the dataset \n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "* ### Head"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 21,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "# preview of the first 5 lines of the loaded data \n"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "* ### Describe"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": []
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "* ### 'Class' attribute value counts and visualisation"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 22,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "# target attribute value counts and visualisation\n"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {},
143 | "source": [
144 | "Our dataset is balanced."
145 | ]
146 | },
147 | {
148 | "cell_type": "markdown",
149 | "metadata": {},
150 | "source": [
151 | "* ### Label encoder for 'Class' attribute"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 23,
157 | "metadata": {
158 | "scrolled": true
159 | },
160 | "outputs": [],
161 | "source": [
162 | "# label encoder for 'Class' attribute\n"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 |     "* ### Visualisation of all attributes"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 24,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 |     "# visualisation of all attributes\n"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "scrolled": false
186 | },
187 | "outputs": [],
188 | "source": []
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "* ### Correlation list and plot of each attribute"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 25,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "# correlation list and plot of each attribute\n"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "* ### Drop column 'Class'"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": []
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "## Part 3: Data Wrangling and Transformation."
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "* ### StandardScaler"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 26,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "# StandardScaler \n"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "* ### Creating datasets for ML part"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 27,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 |     "# set 'X' for the features and 'y' for the target ('Class')\n"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 |     "* ### 'Train/Test' split"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 28,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 |     "# apply 'Train/Test' splitting method\n"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 29,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "# print shape of X_train and y_train\n"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 30,
287 | "metadata": {
288 | "scrolled": true
289 | },
290 | "outputs": [],
291 | "source": [
292 | "# print shape of X_test and y_test\n"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "## Part 4: Machine Learning."
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "* ### Build, train and evaluate model"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "* SVC model"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": []
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "* ### Classification report"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": null,
333 | "metadata": {},
334 | "outputs": [],
335 | "source": []
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "* ### Confusion matrix"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": 31,
347 | "metadata": {
348 | "scrolled": true
349 | },
350 | "outputs": [],
351 | "source": [
352 | "# confusion matrix of SVC model\n"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "* ### Misclassification plot"
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": 32,
365 | "metadata": {},
366 | "outputs": [],
367 | "source": [
368 | "# misclassification vehicle plot \n"
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "* ### Comparison table between Actual 'Class' and Predicted 'Class'"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": 33,
381 | "metadata": {},
382 | "outputs": [],
383 | "source": [
384 | "# comparison table between Actual 'Class' and Predicted 'Class'\n",
385 | "\n"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {},
391 | "source": [
392 | "## Conclusion."
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": 34,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "# submission of .csv file with test predictions\n"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": null,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": []
410 | }
411 | ],
412 | "metadata": {
413 | "kernelspec": {
414 | "display_name": "Python 3",
415 | "language": "python",
416 | "name": "python3"
417 | },
418 | "language_info": {
419 | "codemirror_mode": {
420 | "name": "ipython",
421 | "version": 3
422 | },
423 | "file_extension": ".py",
424 | "mimetype": "text/x-python",
425 | "name": "python",
426 | "nbconvert_exporter": "python",
427 | "pygments_lexer": "ipython3",
428 | "version": "3.7.3"
429 | }
430 | },
431 | "nbformat": 4,
432 | "nbformat_minor": 2
433 | }
434 |
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/Practice 2/vehicles_helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import matplotlib.pyplot as plt
4 | import seaborn as sns
5 | from scipy import stats
6 | from scipy.stats import norm
7 | get_ipython().run_line_magic('matplotlib', 'inline')  # note: available only when this file is imported from a Jupyter/IPython session
8 | sns.set()
9 |
10 | import sklearn.metrics as metrics
11 | from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.preprocessing import LabelEncoder, StandardScaler
14 |
15 | from sklearn.svm import SVC
16 |
17 | import warnings
18 | warnings.filterwarnings('ignore')
19 |
20 |
21 | def attributes_counts(dataset):
22 | """ Basic descriptions of the target attribute.
23 |
24 |     This function displays the 'Class' attribute value counts
25 |     and a visualisation plot of the target.
26 |
27 | Parameters:
28 | dataset.
29 | """
30 |     print("'Class' Value Counts: \n", dataset['Class'].value_counts())
31 |     print("\n Visualisation plot:")
32 |     dataset['Class'].value_counts().plot(kind='bar')  # draw the bar plot instead of printing the Axes object
33 |
34 |
35 | def all_attrubutes_vizual(dataset, one, two, three):
36 | """ Visual presentation of all attributes in the dataset.
37 |
38 |     This function shows all attributes, divided into 3 parts
39 | for better visualization.
40 |
41 | Parameters:
42 | dataset,
43 | one: selected attributes for part one
44 | two: selected attributes for part two
45 | three: selected attributes for part three
46 | """
47 | print("Part one:"+"\n")
48 | df1 = dataset[one]
49 | sns.pairplot(df1, kind="scatter", hue="Class", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
50 | plt.show()
51 | print("\n")
52 | print("Part two:"+"\n")
53 | df2 = dataset[two]
54 | sns.pairplot(df2, kind="scatter", hue="Class", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
55 | plt.show()
56 | print("\n")
57 | print("Part three:"+"\n")
58 | df3 = dataset[three]
59 | sns.pairplot(df3, kind="scatter", hue="Class", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
60 | plt.show()
61 |
62 |
63 | def corr_plot_list(dataset):
64 |     """ This function presents the correlation list and plot of each attribute against 'Class'."""
65 |
66 |     print("Correlation list of each attribute: ")
67 | corr = dataset.corr()
68 | corr_abs = corr.abs()
69 |     num_cols = len(dataset.columns)  # number of attributes (len(dataset) would count rows instead)
70 | num_corr = corr_abs.nlargest(num_cols, 'Class')['Class']
71 | print(num_corr)
72 |     print("\nCorrelation plot of each attribute:")
73 |     dataset.corr()['Class'].sort_values().plot(kind='bar', figsize=(18, 6))  # draw the bar plot instead of printing the Axes object
74 |
75 |
76 |
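77 | # Usage sketch (hypothetical, for reference only): assumes 'vehicles.csv' sits
78 | # next to this helper and still contains the 'Class' target column.
79 | #
80 | #   import pandas as pd
81 | #   from vehicles_helper import attributes_counts, all_attrubutes_vizual, corr_plot_list
82 | #
83 | #   dataset = pd.read_csv('vehicles.csv')
84 | #   attributes_counts(dataset)   # target value counts + bar plot
85 | #   corr_plot_list(dataset)      # requires a numeric 'Class' (label-encode it first)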
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/Lesson 03/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial 03. Multi-Class Classification: Practice.
2 |
3 |  In this tutorial, we will work through one practical classification case in two versions.
4 |
5 | [**Video Tutorial 03**](https://youtu.be/IQYLfhYsWL8)
6 |
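7 | Below is a minimal sketch of the pipeline the practice notebooks outline: load the data, encode the target, scale the features, fit an SVC and report the metrics. It is illustrative only, assuming a _vehicles.csv_ with a textual 'Class' column, and is not the official solution:
8 |
9 | ```python
10 | # Minimal end-to-end sketch of the lesson's multi-class workflow (illustrative).
11 | import pandas as pd
12 | from sklearn.model_selection import train_test_split
13 | from sklearn.preprocessing import LabelEncoder, StandardScaler
14 | from sklearn.svm import SVC
15 | from sklearn.metrics import classification_report
16 |
17 | dataset = pd.read_csv('vehicles.csv')
18 | dataset['Class'] = LabelEncoder().fit_transform(dataset['Class'])  # encode target labels
19 |
20 | X = StandardScaler().fit_transform(dataset.drop(columns='Class'))  # scale features (Part 3)
21 | y = dataset['Class']
22 |
23 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
24 |
25 | model = SVC().fit(X_train, y_train)  # the notebooks' model of choice (Part 4)
26 | print(classification_report(y_test, model.predict(X_test)))
27 | ```
28 |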
--------------------------------------------------------------------------------
/ML-101 Modules/Module 03/ML-101 Module 03.md:
--------------------------------------------------------------------------------
1 | # Module 03: Classification.
2 |
3 |  The **third module** consists of **3 tutorials**. During this module we will:
4 |
5 | • Study the Theory of Classification and some of its Algorithms.
6 |
7 | • As practice, we will work through 3 start-to-finish practical cases for different types of Classification.
8 |
9 | • As a **result**: **3 .ipynb files** for **Github**.
10 |
11 | ## Tutorial 01. Classification: Theory & Algorithms.
12 | ## Tutorial 02. LAB 02: Binary Classification (whole process: from dataset extraction to saving predictions; 2 datasets).
13 | ## Tutorial 03. LAB 03: Multi-Class Classification (whole process: from dataset extraction to saving predictions).
14 |
--------------------------------------------------------------------------------
/Modules/Module01/Lesson01/README.md:
--------------------------------------------------------------------------------
1 | ## Lesson 01. AI: subsets; ML: types + tasks + lifecycle.
2 |
3 | In this lesson you will learn:
4 |
5 | 📌 Some course logistics: How do you take the course? How is the learning process organised?
6 |
7 | 📌 Some introductory information about Artificial Intelligence (AI), Machine Learning (ML) and Data Science;
8 |
9 | 📌 AI and its subsets;
10 |
11 | 📌 Types of ML (Supervised, Unsupervised, Semi-supervised and Reinforcement Learning);
12 |
13 | 📌 Data with and without labels (labelled and unlabelled data);
14 |
15 | 📌 Which tasks can be solved with ML (Recommendation, Ranking, Regression, Classification, Clustering, Anomaly Detection);
16 |
17 | 📌 What the ML Lifecycle is and how it works.
18 |
--------------------------------------------------------------------------------
/Modules/Module01/README.md:
--------------------------------------------------------------------------------
1 | # Module 01: Theory of Machine Learning and Data Science.
2 |
3 | The **first module** consists of **4 lessons**. During this module we will:
4 |
5 | • Get acquainted with the basic theory of machine learning.
6 |
7 | • Install **Jupyter Notebook** and talk about preliminary work with datasets, libraries, loading data and different kinds of datasets.
8 |
9 | • Break down EDA (Exploratory Data Analysis), descriptive statistics, missing values, numerical & categorical data, outliers, feature-to-target correlation, and data wrangling and transformation.
10 |
11 | • See how to prepare a dataset for a machine learning model: train/test split, overfitting & underfitting, cross-validation (see the short sketch at the end of this page).
12 |
13 | • Talk about building a machine learning model: how to feed it data, tune its hyperparameters, evaluate its performance, and obtain and save predictions.
14 |
15 |
16 | ## Lesson 01. AI: subsets; ML: types + tasks + lifecycle.
17 |
18 | ## Lesson 02. Datasets, Libraries, Data Load, Train-Validation-Test datasets.
19 |
20 | ## Lesson 03. Exploratory Data Analysis.
21 |
22 | ## Lesson 04. Dataset Preparation for ML Model, ML Model, Hyper Parameters, ML Model Evaluation, Predictions.
23 |
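24 | A hypothetical mini-example of the dataset-preparation ideas named above, on synthetic data (not part of the course material):
25 |
26 | ```python
27 | # Hold-out split vs. cross-validation on synthetic data (illustrative sketch).
28 | from sklearn.datasets import make_classification
29 | from sklearn.model_selection import train_test_split, cross_val_score
30 | from sklearn.linear_model import LogisticRegression
31 |
32 | X, y = make_classification(n_samples=500, n_features=10, random_state=0)
33 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
34 |
35 | model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
36 | print('hold-out accuracy:', model.score(X_test, y_test))                 # one fixed split
37 | print('5-fold CV accuracy:', cross_val_score(model, X, y, cv=5).mean())  # averaged over folds
38 | ```
39 |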
--------------------------------------------------------------------------------
/Modules/Module02/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Module 02: Regression.
3 |
4 | The **second module** consists of **2 lessons**. During this module we will:
5 |
6 | • Cover the Theory of Regression and some of its Algorithms.
7 |
8 | • Then see how to do EDA and how to evaluate a machine learning model for Regression.
9 |
10 | • And finally, build our first practical case, a simple Regression: the whole process, every stage, from start to finish (a minimal sketch follows at the end of this page).
11 |
12 | • As a **result**: you will have one **.ipynb file** to put on your **Github**.
13 |
14 | ## Lesson 01. Regression: Theory, Algorithms, EDA, ML Model Evaluation.
15 |
16 | ## Lesson 02. LAB 01: Regression (whole process: from dataset extraction to saving predictions).
17 |
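18 | A hypothetical mini-example of a Regression workflow on synthetic data (the course's own case uses the startup-profit dataset instead):
19 |
20 | ```python
21 | # Fit and evaluate a linear regression on synthetic data (illustrative sketch).
22 | from sklearn.datasets import make_regression
23 | from sklearn.model_selection import train_test_split
24 | from sklearn.linear_model import LinearRegression
25 | from sklearn.metrics import r2_score
26 |
27 | X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
28 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
29 |
30 | model = LinearRegression().fit(X_train, y_train)   # train on the training split
31 | print('R^2 on the test split:', r2_score(y_test, model.predict(X_test)))
32 | ```
33 |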
--------------------------------------------------------------------------------
/Modules/Module03/README.md:
--------------------------------------------------------------------------------
1 | # Module 03: Classification.
2 |
3 | The **third module** consists of **3 lessons**. During this module we will:
4 |
5 | • Study the Theory of Classification and some of its Algorithms.
6 |
7 | • Look at EDA and evaluate a machine learning model for Classification.
8 |
9 | • As practice, we will work through two start-to-finish practical cases for different types of Classification.
10 |
11 | • As a **result**: **2 .ipynb files** for **Github**.
12 |
13 | ## Lesson 01. Classification: Theory, Algorithms, EDA, ML Model Evaluation.
14 | ## Lesson 02. LAB 02: Classification type 1 (whole process: from dataset extraction to saving predictions).
15 | ## Lesson 03. LAB 03: Classification type 2 (whole process: from dataset extraction to saving predictions).
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | 
3 |
4 | # Welcome to Data Learn
5 |
6 | **Data Learn** is an [open-source educational platform](https://datalearn.ru/) whose main goal is to share technical knowledge.
7 |
8 | Since we are a non-commercial organisation, we really want the time we spend creating these materials to help you. Our instructors are real practitioners who have worked for major companies in different countries and cities and want to help others become a little more successful ;)
9 |
10 | This is the repository of the [**ML-101**](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md) course. But before you move on to the course itself, let us briefly tell you about ourselves.
11 |
12 | Please read this page to the end; you will find all the necessary information here.
13 |
14 |
15 | ## Data Learn Courses
16 |
17 | Our platform offers a number of courses, and more will be added over time.
18 |
19 |  **Getting Started with Machine Learning and Data Science (ML-101)** - an introductory [course](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md) in Machine Learning and Data Science, with clear theory and practical real-life cases, by [Anastasia Rizzo](https://github.com/arizzogithub?tab=overview&from=2022-07-01&to=2022-07-03).
20 | The course includes 3 modules:
21 | Module 01: The theory of Machine Learning and Data Science;
22 | Module 02: Regression (theory and practice);
23 | Module 03: Classification (theory and 2 practical cases).
24 | The course lets you try the Data Scientist profession on for size and is especially suitable for those who may be unsure, but are very interested in starting to explore this topic.
25 |
26 | [**Intro Video**](https://youtu.be/g2azOLGzeNo)
27 |
28 | [**Registration: register for the ML-101 course**](https://datalearn.ru/kurs-po-ml-ds)
29 |
30 |
31 | **Getting Started with Analytics (Data) Engineering (DE-101)** - a [course](https://github.com/Data-Learn/data-engineering) by [Dmitry Anoshin](https://www.linkedin.com/in/dmitryanoshin/) about working as a data engineer, built on 10+ years of experience creating analytics solutions in Russia, Europe, Canada and the USA. The course covers the fundamentals: Business Intelligence tools, databases, ETL tools, cloud computing and much more. No prior experience with data is needed. The first few modules are devoted to the basics of analytics and classic tasks: Business Intelligence (reporting, visualisation, data warehousing, SQL, Excel, data integration). That alone is enough for a BI Developer or Analyst role. From around module 5-6 onward we dive into the Data Engineer's work proper, building on the knowledge gained in the earlier stages.
32 |
33 | [**Register for the DE-101 course**](https://datalearn.ru/kurs-po-getting-start-with-data-engineering)
34 |
35 |
36 | We would also like to highlight one more initiative: an **Analytics Community for Women**. We see strong demand for communities of this kind in the West and think it would be great to have one in the Russian-speaking community, so that women can study analytics and technology in their comfort zone and at their own pace. We are looking for motivated women to lead this effort, while we help with the content (at this stage nothing has been done in this direction yet).
37 |
38 | [**Lead the women's Data Community**](https://datalearn.ru/kurs-po-data-analytics-for-women)
39 |
40 | We also have a great opportunity to run webinars and invite speakers from all over the world. We may even manage to attract companies interested in hiring specialists, and the community itself should help with finding jobs and employees.
41 |
42 | **It is hard to accomplish all of this alone, so be proactive and help!**
43 |
44 |
45 | ## How to register for Data Learn courses
46 | 1. Register on the [Data Learn](https://datalearn.ru) website or directly on the course page: [**ML-101 course**](https://datalearn.ru/kurs-po-ml-ds).
47 |
48 | 2. Join the [Data Learn](https://t.me/datalearn_chat/3401) Telegram chat. Any question will be answered there! (We used to have Slack, and it still exists, but the Telegram chat has turned out to be more convenient for everyone!)
49 |
50 | 3. Take the [survey](https://docs.google.com/forms/d/1PT8B4_dYQH8X2yDu8ju0TOHPXVRtky6LPFAN-J1t5hY/edit?ts=60070d83&gxids=7628) if you have not done so yet, so we can learn even more about you!
51 |
52 |
53 | Thank you all, and see you in the **courses** on the [**Data Learn**](https://www.youtube.com/channel/UCWki7GBUE5lDMJCbn4e1XMg) YouTube channel and in our **Data Learn** community on **Telegram**.
54 |
55 | Go to the [**ML-101**](https://github.com/Data-Learn/data-science/blob/main/ML-101%20Guide.md) course on GitHub.
56 |
57 |
--------------------------------------------------------------------------------
/how_to/Github/How_to_Github.md:
--------------------------------------------------------------------------------
1 | # How to create a Github account, join the Datalearn courses and add your practice/homework (using the ML-101 course as an example).
2 |
3 | ## Step 01.
4 |
5 | • Go to https://github.com
6 |
7 | • On the page that opens, click the **“Sign up”** button.
8 |
9 | 
10 |
11 |
12 | ## Step 02.
13 |
14 | • Enter your **Username**, **Email address** and **Password**;
15 |
16 | • Click the **“Verify”** button (solve the puzzle you are offered);
17 |
18 | • Click the **“Create account”** button.
19 |
20 | 
21 |
22 |
23 | ## Step 03.
24 |
25 | • Go to your **email**. There will be a message from **Github**. You need to confirm that you created the account by clicking the **link** provided.
26 |
27 |
28 | ## Step 04.
29 |
30 | • Complete the final settings and click the **“Complete setup”** button.
31 |
32 | 
33 | 
34 |
35 |
36 | ## Step 05.
37 |
38 | • On the page that opens, type **Data-Learn** into the search box in the upper left corner.
39 |
40 | 
41 |
42 |
43 | ## Step 06.
44 |
45 | • On the page that opens, click **“Users”**.
46 |
47 | 
48 |
49 |
50 | ## Step 07.
51 |
52 | • On the page that opens, find **Data Learn** and click it.
53 |
54 | 
55 |
56 |
57 | ## Step 08.
58 |
59 | • On the **Data Learn** page there are **3 repositories** with different courses. Our course's repository is called **data-science**. Click it.
60 |
61 | 
62 |
63 |
64 | ## Step 09.
65 |
66 | • You are now on our course page. Its folders hold the course modules, practice assignments, video links and installation instructions;
67 |
68 | • To join the course, click the **“Fork”** button (look to the right) and wait a few seconds;
69 |
70 | 
71 |
72 | • After that the page will refresh and a few small changes will appear on it;
73 |
74 | • **Change 1:** in the left corner the name changes and the line **forked from Data-Learn/data-science** is added. This change means you now have your own repository with this course;
75 |
76 | • **Change 2:** in the right corner, next to the **“Fork”** button, the number increases by +1. This change counted you as one more member of our course repository;
77 |
78 | 
79 |
80 | • If you like our course, click the **“Star”** button (next to the **“Fork”** button) and leave a star.
81 |
82 |
83 | ## Step 10.
84 |
85 | • Now click the coloured **icon** with your profile in the upper right corner;
86 |
87 | • On the page that opens you will see your **profile**;
88 |
89 | • Your **timeline** automatically gets a green mark: the more useful actions you take, the darker the colour;
90 |
91 | • At the top you will see that you have **1 repository**. Click it.
92 |
93 | 
94 |
95 |
96 | ## Step 11.
97 |
98 | • On the page that opens you can see that you now have 1 repository;
99 |
100 | 
101 |
102 | • Clicking it takes you to the course repository page, but from your own account;
103 |
104 | 
105 |
106 | • Later on you will create your folders with practice cases here;
107 |
108 | • From now on, whenever you need to open the course repository, log in to your account and open the course from there.
109 |
110 |
111 | # WHAT'S NEXT
112 |
113 | You have already joined the course repository. You did this in order to add your practice examples, descriptions of the work you have done, module-completion badges and the final certificate. This is exactly what you will present as examples of what you did.
114 |
115 | Now you have your own repository of our course. It is a copy of the main repository.
116 |
117 | 
118 |
119 | **Your task**: fill your own repository.
120 |
121 |
122 | # Step 12. How to upload practice/homework.
123 |
124 | - Go to the module and lesson folder where you need to add your practice/homework (in our example it is Lesson01).
125 |
126 | - Click the **Add file** button and choose the **Upload files** option.
127 |
128 | 
129 |
130 | - Choose your practice/homework file, upload it and click the green **Commit changes** button.
131 |
132 | 
133 |
--------------------------------------------------------------------------------
/how_to/Jupyter Notebook/how_to_Jupyter_Notebook.md:
--------------------------------------------------------------------------------
1 | # How to: Install Jupyter Notebook.
2 |
3 | ## Introduction
4 |
5 | **Jupyter Notebook** is an open-source web application that lets you create and share documents containing interactive code, equations, visualisations and text. It is used for data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more.
6 |
7 | 
8 |
9 | ## Requirements
10 |
11 | **Operating system**: Windows, Mac, Linux.
12 |
13 | **Python 3.3** or higher.
14 |
15 | ## Installation
16 |
17 | First of all, we need **Python**.
18 |
19 | Go to https://www.python.org/downloads/ and download the latest version for your operating system.
20 |
21 | 
22 |
23 | Next, using **Windows 10 pro** as the example, run the installer and restart your computer.
24 |
25 | 
26 |
27 | Now we need to make sure that **python** version 3.3 or higher is installed. On **Mac** and in many **Linux** distributions **python** is already installed by default. On **Windows**, however, it has to be installed separately.
28 |
29 | To check, open the **command prompt** (**cmd**).
30 |
31 | 
32 |
33 | Type **python** and press **Enter**.
34 |
35 | 
36 |
37 | Now we need to install **Jupyter Notebook**.
38 |
39 | To do this, reopen the command prompt, type **python -m pip install jupyter** and press **Enter**.
40 | After that, the download of **jupyter notebook** will begin.
41 |
42 | 
43 |
44 | ## Getting started
45 |
46 | To start working, open the command prompt once again, type **jupyter-notebook** and press **Enter**.
47 |
48 | 
49 |
50 | This starts the server, and your browser opens a new tab with the following URL: http://localhost:8888/tree
51 |
52 | 
53 |
54 | You will work in this tab. The **cmd** window must stay open for the whole time you are working with **Jupyter Notebook**.
55 |
56 |
57 |
58 |
59 |
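60 | Once the tab is open, create a new **Python 3** notebook and run a first cell to check that everything works, for example:
61 |
62 | ```python
63 | # First cell to try in a fresh notebook: if this runs, Jupyter and Python are set up.
64 | import sys
65 | print(sys.version)  # should report Python 3.3 or higher
66 | ```
67 |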
--------------------------------------------------------------------------------
/how_to/README.md:
--------------------------------------------------------------------------------
1 | "How to" tutorial / Инструкции по установке.
2 |
--------------------------------------------------------------------------------
/img/readme.md:
--------------------------------------------------------------------------------
1 | Place for images and screenshots
2 |
--------------------------------------------------------------------------------