├── 10_visits_by_browser ├── README.md └── visits_by_browser.ipynb ├── 11_telegram_bot_airflow_reporting ├── README.md ├── mini_project.py ├── mini_project_dag.py └── report_2019-04-01.txt ├── 12_sql_task └── sql_tasks.ipynb ├── 13_nyc_taxi_timeit_optimization ├── README.md └── nyc_taxi_timeit.ipynb ├── 14_bikes_rent_chicago ├── README.md └── bike_rent_chicago.ipynb ├── 15_booking_in_london ├── .ipynb_checkpoints │ └── bookings_in_london-checkpoint.ipynb ├── README.md └── bookings_in_london.ipynb ├── 16_retail_dashboard ├── README.md ├── retail_dashboard.ipynb ├── task1.png ├── task2.png ├── task3.png ├── task_4.png ├── task_5.png ├── task_6.png ├── task_7.png └── task_8_dashboard.png ├── 17_video_games ├── README.md └── video_games.ipynb ├── 18_ads_conversion ├── README.md └── ads_conversion.ipynb ├── 19_yandex_music ├── README.md └── yandex_music_project.ipynb ├── 1_taxi_in_nyc ├── README.md └── taxi_in_nyc.ipynb ├── 20_bikes_rent_london ├── README.md └── bikes_rent_london.ipynb ├── 21_delivery_ab ├── README.md └── delivery_ab.ipynb ├── 22_app_interface_ab ├── README.md └── app_interface_ab.ipynb ├── 23_cars_sales ├── README.md └── cars_sales.ipynb ├── 24_bootstrap_ab ├── README.md └── bootstrap_ab.ipynb ├── 25_mobile_app_aa ├── README.md └── mobile_app_aa.ipynb ├── 26_taxi_churn ├── README.md ├── churn_abs.png ├── churn_city.png ├── churn_pct.png ├── churn_platform.png └── taxi_churn.ipynb ├── 27_ab_simulation ├── README.md └── ab_simulation.ipynb ├── 28_sales_monthly └── README.md ├── 29_profit_monthly └── README.md ├── 2_hotel_bookings ├── README.md └── hotel_bookings.ipynb ├── 30_vacancies_overview └── README.md ├── 31_sales_overview └── README.md ├── 32_airbnb_listings ├── README.md └── dashboard_canvas.pdf ├── 33_metrics_calc ├── README.md └── metrics_calc.ipynb ├── 34_retention_analysis ├── README.md └── retention_prep_data_for_tableau.ipynb ├── 35_rfm_analysis ├── README.md └── rfm_analysis.ipynb ├── 36_probabilities ├── .ipynb_checkpoints │ └── probabilities-checkpoint.ipynb ├── README.md └── probabilities.ipynb ├── 37_final_project ├── README.md ├── final_project.ipynb ├── my_project_slides.pdf └── retention_200901_200923.jpg ├── 3_user_logs ├── README.md └── user_logs.ipynb ├── 4_taxi_peru ├── README.md └── taxi_peru.ipynb ├── 5_raw_data_handling ├── README.md ├── data_tree.png └── raw_data_handling.ipynb ├── 6_retail_in_germany ├── README.md └── retail_in_germany.ipynb ├── 7_error_in_transaction_data ├── README.md └── error_in_transaction_data.ipynb ├── 8_avocado_price ├── .ipynb_checkpoints │ └── avocado_price-checkpoint.ipynb ├── README.md ├── avocado_EWA.jpg ├── avocado_SMA.jpg ├── avocado_price.ipynb └── plotly_result.png ├── 9_ads_campaign ├── README.md ├── ads_campaign.ipynb └── plotly_result.png └── README.md /10_visits_by_browser/README.md: -------------------------------------------------------------------------------- 1 | This is the tenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Visits by Browser -- analyzing website visits: defining the proportion of visits made by real users vs. bots, finding the most popular browser for users and for bots, and barplotting the results. The data was downloaded via the **Google Docs API** and merged into our dataframe. Read_csv, groupby, agg, query, sort_values, pivot, fillna, assign and merge methods were used for Exploratory Data Analysis.
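A hedged sketch of that browser/bot split (the `browser` and `user_type` columns are illustrative stand-ins, not necessarily the dataset's real names):

```python
import pandas as pd

# toy visits log; the real data was downloaded via the Google Docs API
visits = pd.DataFrame({'browser': ['Chrome', 'Chrome', 'Firefox', 'Safari'],
                       'user_type': ['user', 'bot', 'user', 'bot']})

# proportion of visits made by real users vs. bots
shares = visits.user_type.value_counts(normalize=True)

# most popular browser for each visitor type
top_browsers = (visits.groupby(['user_type', 'browser'])
                      .size()
                      .reset_index(name='visits')
                      .sort_values(['user_type', 'visits'], ascending=[True, False]))

top_browsers.plot.bar(x='browser', y='visits')  # barplotting the result
```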
5 | 6 | 7 | 8 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /11_telegram_bot_airflow_reporting/README.md: -------------------------------------------------------------------------------- 1 | This is the eleventh dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Telegram Bot Airflow Reporting -- reading advertising campaign data from a Google Docs spreadsheet and creating a pandas dataframe to calculate clicks, views, **CTR** and money spent on the campaign. Calculating the day-by-day change of these metrics, writing a report with the results to a txt file and sending that file via a telegram bot to your mobile phone. The script is executed by Airflow every Monday at 12:00 p.m. 5 | 6 | Do not forget to create your own json file with a Telegram token and chat_id to run the script properly. 7 | 8 | Files description: 9 | mini_project.py -- for running the script in a terminal 10 | mini_project_dag.py -- a DAG with the script to be executed by Airflow on a preset schedule 11 | report_2019-04-01.txt -- the result of running the script 12 | 13 | 14 | 15 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 16 | 17 | 18 | 19 | -------------------------------------------- 20 | Feel free to contact me via nktn.lx@gmal.com 21 | Follow me on twitter: @nktn_lx 22 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /11_telegram_bot_airflow_reporting/mini_project.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import requests 3 | import json 4 | from urllib.parse import urlencode 5 | import os 6 | 7 | 8 | 9 | # url to download data 10 | url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ti6Su94955DZ4Tky8EbwifpgZf_dTjpBdiVH0Ukhsq94jZdqoHuUytZsFZKfwpXEUCKRFteJRc9P/pub?gid=889004448&single=true&output=csv' 11 | df = pd.read_csv(url, parse_dates=['date', 'time']) 12 | 13 | 14 | # calculating dataframe with all necessary metrics for reporting 15 | def calc_metrics(df): 16 | # count views 17 | views = ( df.query('event == "view"') 18 | .groupby(['date'], as_index=False) 19 | .agg({'event': 'count'}) 20 | .rename(columns={'event': 'views_count'}) ) 21 | 22 | # count clicks 23 | clicks = ( df.query('event == "click"') 24 | .groupby(['date'], as_index=False) 25 | .agg({'event': 'count'}) 26 | .rename(columns={'event': 'clicks_count'}) ) 27 | # count ctr 28 | ctr = ( clicks.merge(views, on='date') 29 | .assign(ctr_percentage = round(100 * clicks.clicks_count / views.views_count, 2)) ) 30 | 31 | # count money_spent 32 | result = ( df.drop_duplicates(subset=['date'], keep='first') 33 | .loc[: , ['date','ad_cost']] 34 | .merge(ctr, on='date') ) 35 | result = result.assign(money_spent = (result.ad_cost / 1000) * result.views_count) 36 | 37 | # calc percentage difference between two dates 38 | result['clicks_dif'] = round(result.clicks_count.pct_change() * 100, 2) 39 | result['views_dif'] = round(result.views_count.pct_change() * 100, 2) 40 | result['ctr_dif'] = round(result.ctr_percentage.pct_change() * 100, 2) 41 | result['money_spent_dif'] = round(result.money_spent.pct_change() * 100, 2) 42 | 43 | return result
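# A toy illustration of pct_change(), which drives the *_dif columns above
# (not part of the pipeline, added for clarity):
#   pd.Series([100, 81]).pct_change() * 100  ->  0: NaN, 1: -19.0
# Each row is compared with the row before it, so the first reported date
# has no percentage change.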
44 | 45 | 46 | # creating a report and writing it to a txt-file 47 | def report(result): 48 | # creating variables for reporting 49 | date = result.date.astype('str')[0] 50 | ad_id = df.ad_id.iloc[0] 51 | money = round(result.money_spent[1], 2) 52 | money_dif = round(result.money_spent_dif[1], 2) 53 | views = round(result.views_count[1], 2) 54 | views_dif = round(result.views_dif[1], 2) 55 | clicks = round(result.clicks_count[1], 2) 56 | clicks_dif = round(result.clicks_dif[1], 2) 57 | ctr = round(result.ctr_percentage[1], 2) 58 | ctr_dif = round(result.ctr_dif[1], 2) 59 | 60 | # writing report to a file 61 | with open('report_{}.txt'.format(date), 'w') as rep: 62 | rep.write( 63 | f'''Published ad_id {ad_id} report for {date}: 64 | Expenditures: {money} RUB ({money_dif}%) 65 | Views: {views} ({views_dif}%) 66 | Clicks: {clicks} ({clicks_dif}%) 67 | CTR: {ctr} ({ctr_dif}%) 68 | ''') 69 | return 'report_{}.txt'.format(date) 70 | 71 | 72 | # sending file with the report via telegram bot 73 | def send_via_tlgrm(file_name): 74 | # reading token and chat_id from a file 75 | with open('ini.json') as src: # here should be the path to a file with your token and chat_id info!!! 76 | data = json.load(src) 77 | 78 | # @david8test_bot 79 | token = data['token'] 80 | chat_id = data['chat_id'] # your chat id 81 | 82 | # sending message 83 | message = 'Please find attached a report:' # the text you want to send 84 | params = {'chat_id': chat_id, 'text': message} 85 | base_url = f'https://api.telegram.org/bot{token}/' 86 | url = base_url + 'sendMessage?' + urlencode(params) 87 | resp = requests.get(url) 88 | 89 | # getting current working directory and defining the path to the file 90 | cwd = os.getcwd() 91 | filepath = cwd + '/' + file_name 92 | 93 | # sending a file with the report (Telegram's API accepts GET here, though POST is more conventional for uploads) 94 | url = base_url + 'sendDocument?' + urlencode(params) 95 | files = {'document': open(filepath, 'rb')} 96 | resp = requests.get(url, files=files) 97 | ## Now you'll receive a text message and the file with a report from your bot 98 | 99 | 100 | # function to execute the script 101 | def main(): 102 | result = calc_metrics(df) 103 | filename = report(result) 104 | send_via_tlgrm(filename) 105 | 106 | 107 | # boilerplate to run the script 108 | if __name__ == '__main__': 109 | main() -------------------------------------------------------------------------------- /11_telegram_bot_airflow_reporting/mini_project_dag.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.operators.python_operator import PythonOperator 3 | from datetime import datetime 4 | import pandas as pd 5 | 6 | 7 | # default arguments for airflow 8 | default_args = { 9 | 'owner': 'a.nikitin-8', 10 | 'depends_on_past': False, 11 | 'start_date': datetime(2021, 2, 11), 12 | 'retries': 0 13 | } 14 | 15 | # creating DAG 16 | dag = DAG('ad_report', 17 | default_args=default_args, 18 | catchup=False, 19 | schedule_interval='0 12 * * 1') # is executed every Monday at 12:00 p.m.
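# The cron string '0 12 * * 1' reads: minute 0, hour 12, any day of month,
# any month, day of week 1 (Monday). A hedged way to double-check it with
# croniter (not a dependency of this project, shown for illustration only):
#   from croniter import croniter
#   croniter('0 12 * * 1', datetime(2021, 2, 11)).get_next(datetime)
#   # -> datetime(2021, 2, 15, 12, 0), i.e. the next Monday at noon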
20 | 21 | 22 | # url to download data 23 | url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vR-ti6Su94955DZ4Tky8EbwifpgZf_dTjpBdiVH0Ukhsq94jZdqoHuUytZsFZKfwpXEUCKRFteJRc9P/pub?gid=889004448&single=true&output=csv' 24 | df = pd.read_csv(url, parse_dates=['date', 'time']) 25 | 26 | 27 | # calculating dataframe with all necessary metrics for reporting 28 | def calc_metrics(df): 29 | # count views 30 | views = ( df.query('event == "view"') 31 | .groupby(['date'], as_index=False) 32 | .agg({'event': 'count'}) 33 | .rename(columns={'event': 'views_count'}) ) 34 | 35 | # count clicks 36 | clicks = ( df.query('event == "click"') 37 | .groupby(['date'], as_index=False) 38 | .agg({'event': 'count'}) 39 | .rename(columns={'event': 'clicks_count'}) ) 40 | # count ctr 41 | ctr = ( clicks.merge(views, on='date') 42 | .assign(ctr_percentage = round(100 * clicks.clicks_count / views.views_count, 2)) ) 43 | 44 | # count money_spent 45 | result = ( df.drop_duplicates(subset=['date'], keep='first') 46 | .loc[: , ['date','ad_cost']] 47 | .merge(ctr, on='date') ) 48 | result = result.assign(money_spent = (result.ad_cost / 1000) * result.views_count) 49 | 50 | # calc percentage difference between two dates 51 | result['clicks_dif'] = round(result.clicks_count.pct_change() * 100, 2) 52 | result['views_dif'] = round(result.views_count.pct_change() * 100, 2) 53 | result['ctr_dif'] = round(result.ctr_percentage.pct_change() * 100, 2) 54 | result['money_spent_dif'] = round(result.money_spent.pct_change() * 100, 2) 55 | 56 | return result 57 | 58 | 59 | # creating a report and writing it to a txt-file 60 | def report(result): 61 | # creating variables for reporting 62 | date = result.date.astype('str')[0] 63 | ad_id = df.ad_id.iloc[0] 64 | money = round(result.money_spent[1], 2) 65 | money_dif = round(result.money_spent_dif[1], 2) 66 | views = round(result.views_count[1], 2) 67 | views_dif = round(result.views_dif[1], 2) 68 | clicks = round(result.clicks_count[1], 2) 69 | clicks_dif = round(result.clicks_dif[1], 2) 70 | ctr = round(result.ctr_percentage[1], 2) 71 | ctr_dif = round(result.ctr_dif[1], 2) 72 | 73 | # writing report to a file 74 | with open('report_{}.txt'.format(date), 'w') as rep: 75 | rep.write( 76 | f'''Published ad_id {ad_id} report for {date}: 77 | Expenditures: {money} RUB ({money_dif}%) 78 | Views: {views} ({views_dif}%) 79 | Clicks: {clicks} ({clicks_dif}%) 80 | CTR: {ctr} ({ctr_dif}%) 81 | ''') 82 | return 'report_{}.txt'.format(date) 83 | 84 | 85 | # sending file with the report via telegram bot 86 | def send_via_tlgrm(file_name): 87 | import requests 88 | import json 89 | from urllib.parse import urlencode 90 | import os 91 | # reading token and chat_id from a file 92 | with open('ini.json') as src: # here should be the path to a file with your token and chat_id info!!! 93 | data = json.load(src) 94 | 95 | # @david8test_bot 96 | token = data['token'] 97 | chat_id = data['chat_id'] # your chat id 98 | 99 | # sending message 100 | message = 'Please find attached a report:' # the text you want to send 101 | params = {'chat_id': chat_id, 'text': message} 102 | base_url = f'https://api.telegram.org/bot{token}/' 103 | url = base_url + 'sendMessage?' + urlencode(params) 104 | resp = requests.get(url) 105 | 106 | # getting current working directory and defining the path to the file 107 | cwd = os.getcwd() 108 | filepath = cwd + '/' + file_name 109 | 110 | # sending a file with the report 111 | url = base_url + 'sendDocument?' + urlencode(params)
112 | files = {'document': open(filepath, 'rb')} 113 | resp = requests.get(url, files=files) 114 | ## Now you'll receive a text message and the file with a report from your bot 115 | 116 | 117 | # function to execute the script 118 | def main(): 119 | result = calc_metrics(df) 120 | filename = report(result) 121 | send_via_tlgrm(filename) 122 | 123 | 124 | # task 125 | t1 = PythonOperator( 126 | task_id='write_ad_report', 127 | python_callable=main, 128 | dag=dag) -------------------------------------------------------------------------------- /11_telegram_bot_airflow_reporting/report_2019-04-01.txt: -------------------------------------------------------------------------------- 1 | Published ad_id 121288 report for 2019-04-01: 2 | Expenditures: 17.43 RUB (-81.06%) 3 | Views: 93 (-81.06%) 4 | Clicks: 6 (-64.71%) 5 | CTR: 6.45 (86.42%) 6 | -------------------------------------------------------------------------------- /13_nyc_taxi_timeit_optimization/README.md: -------------------------------------------------------------------------------- 1 | This is the thirteenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | NYC taxi & timeit optimization -- calculating ride distances from pick-up and drop-off coordinates. Compared several ways of applying the distance calculation to the dataframe; the optimization cut the calculation run-time by a factor of about 3276! Checked the calculation results, found outliers using boxplots and descriptive statistics, fixed the dataframe by removing the outliers and found the cost of the longest ride. 5 | 6 | 7 | 8 | 9 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 10 | 11 | 12 | 13 | -------------------------------------------- 14 | Feel free to contact me via nktn.lx@gmal.com 15 | Follow me on twitter: @nktn_lx 16 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /14_bikes_rent_chicago/README.md: -------------------------------------------------------------------------------- 1 | This is the fourteenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Bikes rent in Chicago -- converting dates to datetime format, resampling the data to aggregate by day, automatically merging data from separate files into one dataframe using os.walk(), breaking down bike rentals by user type, and finding the most popular destination points overall and by day of the week. 5 | 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /15_booking_in_london/README.md: -------------------------------------------------------------------------------- 1 | This is the fifteenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Bookings in London -- used Pandahouse and SQL queries to import data from Clickhouse into a pandas dataframe. Processed the imported data and performed Exploratory Data Analysis. Built scatterplot, distplot, lineplot and heatmap charts using Seaborn and Matplotlib.
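A minimal sketch of the Pandahouse pattern mentioned above (host, database, table and column names are placeholders, not the course's actual connection details):

```python
import pandahouse as ph

# placeholder Clickhouse connection settings
connection = {'host': 'http://clickhouse.example.com:8123',
              'database': 'default',
              'user': 'reader',
              'password': ''}

query = '''
    SELECT toStartOfDay(booking_date) AS day, COUNT(*) AS bookings
    FROM bookings
    GROUP BY day
    ORDER BY day
'''

# runs the SQL on Clickhouse and returns a regular pandas dataframe
df = ph.read_clickhouse(query, connection=connection)
```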
5 | 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /16_retail_dashboard/README.md: -------------------------------------------------------------------------------- 1 | This is the sixteenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Retail dashboard -- built a series of visualizations and a dashboard using SQL queries and Redash. Calculated and checked the dynamics of **MAU** and **AOV**. Found an anomaly in the data, identified the market generating the majority of revenue, and analyzed the most popular goods sold in the store. Wrote a dashboard summary with recommendations to push up sales. 5 | 6 | 7 | 8 | 9 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 10 | 11 | 12 | 13 | -------------------------------------------- 14 | Feel free to contact me via nktn.lx@gmal.com 15 | Follow me on twitter: @nktn_lx 16 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /16_retail_dashboard/retail_dashboard.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Retail Dashboard" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Visualization and dashboard were created using Redash (https://redash.io/)" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "**Create a table for our retail data** (note: the queries below read it as default.retail)" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "CREATE TABLE test.retail (\n", 33 | " InvoiceNo String,\n", 34 | " StockCode String,\n", 35 | " Description String,\n", 36 | " Quantity Int32,\n", 37 | " InvoiceDate DateTime('Europe/London'),\n", 38 | " UnitPrice Decimal64(3),\n", 39 | " CustomerID UInt32,\n", 40 | " Country String)\n", 41 | "ENGINE = MergeTree\n", 42 | "ORDER BY InvoiceDate, CustomerID" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "**Check the number of unique customers by country**" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "SELECT COUNT(DISTINCT CustomerID) AS uniq_customers,\n", 59 | " Country\n", 60 | "FROM default.retail\n", 61 | "GROUP BY Country" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "**Check the change of MAU (monthly active users) in UK, Australia and Netherlands**" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "SELECT COUNT(DISTINCT 
CustomerID) AS active_users,\n", 85 | " toStartOfMonth(InvoiceDate) AS MONTH,\n", 86 | " Country AS country\n", 87 | "FROM default.retail\n", 88 | "WHERE country IN ('United Kingdom',\n", 89 | " 'Australia',\n", 90 | " 'Netherlands')\n", 91 | "GROUP BY MONTH,\n", 92 | " country\n", 93 | "ORDER BY MONTH" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "**Check the change of MAU (monthly active users) for all countries except UK**" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "SELECT COUNT(DISTINCT CustomerID) AS active_users,\n", 117 | " toStartOfMonth(InvoiceDate) AS MONTH,\n", 118 | " Country AS country\n", 119 | "FROM default.retail\n", 120 | "WHERE country != 'United Kingdom'\n", 121 | "GROUP BY MONTH,\n", 122 | " country\n", 123 | "ORDER BY MONTH" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "**Calculate AOV (average order value) for each country**" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "SELECT AVG(order_value) AS AOV,\n", 147 | " Country\n", 148 | "FROM\n", 149 | " (SELECT InvoiceNo,\n", 150 | " SUM(TotalPrice) AS order_value,\n", 151 | " Country\n", 152 | " FROM\n", 153 | " (SELECT InvoiceNo,\n", 154 | " Quantity * UnitPrice AS TotalPrice,\n", 155 | " Country\n", 156 | " FROM default.retail)\n", 157 | " GROUP BY InvoiceNo,\n", 158 | " Country\n", 159 | " ORDER BY order_value DESC)\n", 160 | "GROUP BY Country\n", 161 | "ORDER BY AOV DESC" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "**Check the dynamics of AOV metrics for United Kingdom, Germany, France, Spain, Netherlands, Belgium, Switzerland, Portugal, Australia, USA**" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "SELECT AVG(order_value) AS AOV,\n", 185 | " Country,\n", 186 | " MONTH\n", 187 | "FROM\n", 188 | " (SELECT SUM(TotalPrice) AS order_value,\n", 189 | " MONTH,\n", 190 | " Country\n", 191 | " FROM\n", 192 | " (SELECT InvoiceNo,\n", 193 | " Quantity * UnitPrice AS TotalPrice,\n", 194 | " Country,\n", 195 | " toStartOfMonth(InvoiceDate) AS MONTH\n", 196 | " FROM default.retail\n", 197 | " WHERE Country IN ('United Kingdom',\n", 198 | " 'Germany',\n", 199 | " 'France',\n", 200 | " 'Spain',\n", 201 | " 'Netherlands',\n", 202 | " 'Belgium',\n", 203 | " 'Switzerland',\n", 204 | " 'Portugal',\n", 205 | " 'Australia',\n", 206 | " 'USA') )\n", 207 | " GROUP BY InvoiceNo,\n", 208 | " Country,\n", 209 | " MONTH\n", 210 | " ORDER BY MONTH ASC, order_value DESC)\n", 211 | "GROUP BY MONTH,\n", 212 | " Country\n", 213 | "ORDER BY MONTH ASC, AOV DESC" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "**Calculate an average number of items per order by country**" 228 | ] 229 | }, 230 | { 231 | 
"cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "SELECT AVG(quantity_per_invoice) AS average_items,\n", 237 | " Country\n", 238 | "FROM\n", 239 | " (SELECT InvoiceNo,\n", 240 | " SUM(Quantity) AS quantity_per_invoice,\n", 241 | " Country\n", 242 | " FROM default.retail\n", 243 | " GROUP BY InvoiceNo,\n", 244 | " Country)\n", 245 | "GROUP BY Country\n", 246 | "ORDER BY average_items DESC" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "**Investigate customers from the Netherlands. Find a customer who has bought the biggest number of items**" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "SELECT\n", 270 | " SUM(Quantity) AS overall_quantity,\n", 271 | " CustomerID\n", 272 | "FROM default.retail\n", 273 | "WHERE Country == 'Netherlands'\n", 274 | "GROUP BY CustomerID\n", 275 | "ORDER BY overall_quantity DESC\n", 276 | "LIMIT 100" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "Now we can see that an average number of items per invoice is so high for customers from the Netherlands because we have one customer (CustomerID 14646) who has purchased 196719 items!\n", 284 | "" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "**Calculate revenue deistribution per country**" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": null, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "SELECT\n", 301 | " Country,\n", 302 | " --UnitPrice, \n", 303 | " --Quantity,\n", 304 | " SUM(UnitPrice * Quantity) AS total_revenue\n", 305 | "FROM default.retail\n", 306 | "GROUP BY Country\n", 307 | "ORDER BY total_revenue DESC" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "**Calculate monthly revenue dynamics for UK**" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "SELECT\n", 324 | " toStartOfMonth(InvoiceDate) AS month,\n", 325 | " SUM(UnitPrice * Quantity) AS monthly_reveue\n", 326 | "FROM default.retail\n", 327 | "WHERE Country == 'United Kingdom'\n", 328 | " AND month < '2011-12-01'\n", 329 | "GROUP BY month\n", 330 | "ORDER BY month ASC\n", 331 | "LIMIT 100" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "**Find the top-10 sales value items in UK in November**" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [ 347 | "SELECT\n", 348 | " Description,\n", 349 | " SUM(Quantity) AS items_sold,\n", 350 | " UnitPrice,\n", 351 | " items_sold * UnitPrice AS total_sales_revenue\n", 352 | "FROM default.retail\n", 353 | "WHERE (toStartOfMonth(InvoiceDate) == '2011-11-01')\n", 354 | " AND Country == 'United Kingdom'\n", 355 | "GROUP BY Description, UnitPrice\n", 356 | "ORDER BY total_sales_revenue DESC\n", 357 | "LIMIT 10" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "**Create a dashboard using the last three queries**" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | 
"metadata": {}, 370 | "source": [ 371 | "" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "#### Dashboard summary\n", 379 | "The geography of the store's sales is shown in the upper left part of the dashboard. These are mainly Western Europe, Scandinavian countries, Great Britain, Canada, Australia and Brazil. The breakdown of revenue by country is shown in the bottom left corner. The first place by revenue is occupied by Great Britain. Moreover, the revenue indicators in Great Britain for the reporting period exceeded the revenue in the Netherlands (the second place) by 23.78 times. \n", 380 | "\n", 381 | "Since the major part of the store's revenue is generated in the UK, an analysis of sales in this country was carried out on the right side of the dashboard. The top right corner shows the breakdown of revenue by month. It can be seen that the peak of sales is in November. This is most likely due to the eve of the Christmas holidays. After analyzing the products sold in November, we can see that the top 10 items sold in terms of revenue include Christmas decorations, garlands and New Year's gifts (frames, thermo mugs, red Christmas bags). Details on cost, quantity, as well as total revenue from the top 10 products in November are shown on the dashboard. \n", 382 | "\n", 383 | "Therefore, it is recommended to consider: \n", 384 | "- avoiding a shortage in warehouses of Christmas goods in the month of November,\n", 385 | "- to diversify the assortment of the store, to conduct campaigns to attract customers in order to avoid a significant seasonal decline in revenue in the the UK (the main market) in the period from January to September. " 386 | ] 387 | } 388 | ], 389 | "metadata": { 390 | "kernelspec": { 391 | "display_name": "Python 3", 392 | "language": "python", 393 | "name": "python3" 394 | }, 395 | "language_info": { 396 | "codemirror_mode": { 397 | "name": "ipython", 398 | "version": 3 399 | }, 400 | "file_extension": ".py", 401 | "mimetype": "text/x-python", 402 | "name": "python", 403 | "nbconvert_exporter": "python", 404 | "pygments_lexer": "ipython3", 405 | "version": "3.7.6" 406 | } 407 | }, 408 | "nbformat": 4, 409 | "nbformat_minor": 4 410 | } 411 | -------------------------------------------------------------------------------- /16_retail_dashboard/task1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task1.png -------------------------------------------------------------------------------- /16_retail_dashboard/task2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task2.png -------------------------------------------------------------------------------- /16_retail_dashboard/task3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task3.png -------------------------------------------------------------------------------- /16_retail_dashboard/task_4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task_4.png -------------------------------------------------------------------------------- /16_retail_dashboard/task_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task_5.png -------------------------------------------------------------------------------- /16_retail_dashboard/task_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task_6.png -------------------------------------------------------------------------------- /16_retail_dashboard/task_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task_7.png -------------------------------------------------------------------------------- /16_retail_dashboard/task_8_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/16_retail_dashboard/task_8_dashboard.png -------------------------------------------------------------------------------- /17_video_games/README.md: -------------------------------------------------------------------------------- 1 | This is the seventeenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Video games -- analyzing video game sales dynamics with Pandas. Read_csv, head, columns, dtypes, info, isna, dropna, describe, mode, shape, groupby, agg, sort_values, rename, index, to_list, value_counts methods were used for Exploratory Data Analysis. Barplot, boxplot and lineplot were used for graphing the results. 5 | 6 | 7 | 8 | 9 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 10 | 11 | 12 | 13 | -------------------------------------------- 14 | Feel free to contact me via nktn.lx@gmal.com 15 | Follow me on twitter: @nktn_lx 16 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /18_ads_conversion/README.md: -------------------------------------------------------------------------------- 1 | This is the eighteenth dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Ads conversion -- calculating **CTR, CPC, CR** metrics. Plotting them using distplot, hist, displot and histplot methods.
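The three metrics are simple ratios; a hedged sketch with made-up numbers and column names:

```python
import pandas as pd

# toy advertising data, purely illustrative
ads = pd.DataFrame({'impressions': [1000, 500],
                    'clicks': [50, 10],
                    'spend': [25.0, 8.0],
                    'conversions': [5, 1]})

ads['ctr'] = ads.clicks / ads.impressions   # click-through rate
ads['cpc'] = ads.spend / ads.clicks         # cost per click
ads['cr'] = ads.conversions / ads.clicks    # conversion rate

ads.ctr.hist()  # distribution of CTR values across campaigns
```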
6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /19_yandex_music/README.md: -------------------------------------------------------------------------------- 1 | This dataset is from the [Yandex Praktikum data-analyst course](https://praktikum.yandex.ru/data-analyst/) I was taking in Feb-March 2021. 2 | 3 | 4 | 5 | Yandex Music -- analyzing song popularity on a music streaming platform, comparing the music preferences and listening patterns of Moscow and Saint-Petersburg. Reading and cleaning data, renaming columns, removing duplicates, dealing with missing data, slicing the dataframe to query the required portion of data. 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /1_taxi_in_nyc/README.md: -------------------------------------------------------------------------------- 1 | This is the first dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Analyzing NYC taxi orders with Pandas. Read_csv, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts, pivot methods were used for Exploratory Data Analysis. 5 | 6 | 7 | 8 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /20_bikes_rent_london/README.md: -------------------------------------------------------------------------------- 1 | This is the 20th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Bikes rent in London -- loading the dataset, plotting ride-count data, resampling timestamps, describing the main trends, looking for anomalous values by smoothing the data with a simple moving average, calculating the difference between the real and the smoothed data, finding the standard deviation and defining a 99% confidence interval. Then we compare values against the confidence interval to find data spikes and explain them.
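A compact sketch of that smoothing-and-confidence-interval idea (the window size and the `rides` series are illustrative assumptions):

```python
import pandas as pd

# rides: an hourly pd.Series of ride counts indexed by timestamp (placeholder)
window = 24
sma = rides.rolling(window).mean()     # simple moving average
residuals = rides - sma
std = residuals.std()

# ~99% interval under a rough normality assumption (z = 2.576)
upper = sma + 2.576 * std
lower = sma - 2.576 * std
spikes = rides[(rides > upper) | (rides < lower)]
```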
6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /21_delivery_ab/README.md: -------------------------------------------------------------------------------- 1 | This is the 21st dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Delivery A/B -- finding out how a new navigation algorithm changed the service's delivery time. Formulating null and alternative hypotheses and performing an A/B test with the help of a t-test. 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /22_app_interface_ab/README.md: -------------------------------------------------------------------------------- 1 | This is the 22nd dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | App interface A/B -- testing how image aspect ratio and a new order-button design influence the number of orders placed by customers. Performed Levene's test to check group variance equality, the Shapiro-Wilk test to check the groups for normality, one-way ANOVA to check for a statistically significant difference between the tested groups, Tukey's test to find which groups differ significantly, and a linear-model multivariate analysis of variance; visualized and interpreted the results and gave recommendations on whether to put the changes into production. 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /23_cars_sales/README.md: -------------------------------------------------------------------------------- 1 | This is the 23rd dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Cars sales -- predicting car sale prices using linear regression models (statsmodels.api & statsmodels.formula.api). Finding statistically significant predictors. 6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /24_bootstrap_ab/README.md: -------------------------------------------------------------------------------- 1 | This is the 24th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Bootstrap A/B -- comparing the results of the Mann-Whitney test and bootstrapped mean/median estimates run on data with and without outliers.
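A minimal sketch of the bootstrap part (the sample, seed and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=100, size=1000)  # stand-in for the real data

# resample with replacement many times and collect the means
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# 95% percentile confidence interval for the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

Swapping `.mean()` for a median gives the bootstrap-median variant, which is far less sensitive to outliers.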
6 | 7 | 8 | 9 | 10 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /25_mobile_app_aa/README.md: -------------------------------------------------------------------------------- 1 | This is the 25th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Mobile App A/A -- running an A/A test to check that the data-splitting system works well. Unfortunately, we were not able to pass the test (the FPR was greater than the significance level), so we had to dig into the data and find the reason for the malfunction. After removing the corrupted data we were able to pass the A/A test. 6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /26_taxi_churn/README.md: -------------------------------------------------------------------------------- 1 | This is the 26th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Taxi Churn -- performing Exploratory Data Analysis, defining churn, checking distributions for normality with the Shapiro-Wilk test, plotting data using plotly, and A/B testing four different hypotheses with the Chi-squared test, Dunn's test and the Mann-Whitney U non-parametric test.
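For instance, a platform-vs-churn hypothesis could be checked along these lines (the contingency table is invented for illustration):

```python
import numpy as np
from scipy import stats

# rows: platform (iOS / Android); columns: churned / retained - toy counts
table = np.array([[120, 880],
                  [210, 790]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f'chi2={chi2:.2f}, p={p_value:.4f}')  # reject independence if p < alpha
```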
6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /26_taxi_churn/churn_abs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/26_taxi_churn/churn_abs.png -------------------------------------------------------------------------------- /26_taxi_churn/churn_city.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/26_taxi_churn/churn_city.png -------------------------------------------------------------------------------- /26_taxi_churn/churn_pct.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/26_taxi_churn/churn_pct.png -------------------------------------------------------------------------------- /26_taxi_churn/churn_platform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/26_taxi_churn/churn_platform.png -------------------------------------------------------------------------------- /27_ab_simulation/README.md: -------------------------------------------------------------------------------- 1 | This is the 27th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | A/B simulation -- performed a range of A/B tests to simulate how sample size and the magnitude of the difference between samples influence A/B test performance. Investigated situations in which a false positive error can occur and drew valuable lessons about running A/B tests.
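The simulation logic can be sketched like this (group size, effect and run count are arbitrary illustrations):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, alpha, runs = 100, 0.0, 0.05, 5_000

# with effect = 0 both groups come from the same distribution, so the share
# of significant results estimates the false positive rate (expected ~ alpha)
false_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(effect, 1, n)).pvalue < alpha
    for _ in range(runs)
)
print(false_positives / runs)  # roughly 0.05
```

Raising `effect` or `n` shows how the share of significant results (the test's power) grows.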
6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /28_sales_monthly/README.md: -------------------------------------------------------------------------------- 1 | This is the 28th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | [Sales Monthly Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/SalesMonthlyOverviewpractice1/Dashboard1) -- a **Tableau Public dashboard** consisting of KPIs, a line chart, a bar chart and a table by category with bar charts. 6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /29_profit_monthly/README.md: -------------------------------------------------------------------------------- 1 | This is the 29th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | [Profit Monthly Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/ProfitMonthlyOverviewpractice2/ProfitMonthlyOverview) -- a **Tableau Public dashboard** consisting of KPIs, a line chart, a bar chart, a table by region with bar charts, and profit ratio by category with horizontal bar charts. 6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /2_hotel_bookings/README.md: -------------------------------------------------------------------------------- 1 | This is the second dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | Hotel Bookings -- analyzing hotel bookings with Pandas. Read_csv, info, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts, pivot methods were used for Exploratory Data Analysis. The customer **churn rate** was also calculated. 5 | 6 | 7 | 8 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /30_vacancies_overview/README.md: -------------------------------------------------------------------------------- 1 | This is the 30th dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | 6 | [Analytics Vacancies Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/AnalyticsVacanciesOverviewpractice3/Dashboard1) -- a **Tableau Public dashboard** consisting of a horizontal bar chart, a pie chart, a boxplot and a bubble chart. 7 | 8 | 9 | 10 | 11 | 12 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 13 | 14 | 15 | 16 | -------------------------------------------- 17 | Feel free to contact me via nktn.lx@gmal.com 18 | Follow me on twitter: @nktn_lx 19 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /31_sales_overview/README.md: -------------------------------------------------------------------------------- 1 | This is the 31st dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021.
2 | 3 | 4 | 5 | 6 | [Sales Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/SalesOverviewpractice4/published) -- a **Tableau Public dashboard** consisting of horizontal bar tables, sparklines, a KPI, line charts, and various filters and sort options to display the data. 7 | 8 | 9 | 10 | 11 | 12 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 13 | 14 | 15 | 16 | -------------------------------------------- 17 | Feel free to contact me via nktn.lx@gmal.com 18 | Follow me on twitter: @nktn_lx 19 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /32_airbnb_listings/README.md: -------------------------------------------------------------------------------- 1 | This is the 32nd dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | 6 | [Airbnb Listings Analytics](https://public.tableau.com/profile/nktn.lx#!/vizhome/LondonAirbnbListingsAnalyticalDashboardpractice5/Dashboard1) -- a **Tableau Public dashboard** consisting of: a calculated occupation rate for rental properties; an analytical chart for choosing the best property by occupation rate, review score and price per night; a ranked table of the top 10 listings by calculated potential annual revenue; KPIs for average price, average occupation rate and the number of unique listings; and filters by neighbourhood, occupation rate and the number of reviews over the last twelve months. 7 | 8 | **dashboard_canvas.pdf** -- the result of an interview with a client before creating the dashboard, reflecting the main features the dashboard will have and a general description of how it will be used. 9 | 10 | 11 | 12 | 13 | 14 | 15 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 16 | 17 | 18 | 19 | -------------------------------------------- 20 | Feel free to contact me via nktn.lx@gmal.com 21 | Follow me on twitter: @nktn_lx 22 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /32_airbnb_listings/dashboard_canvas.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/32_airbnb_listings/dashboard_canvas.pdf -------------------------------------------------------------------------------- /33_metrics_calc/README.md: -------------------------------------------------------------------------------- 1 | This is the 33rd dataset analyzed by me while taking the [data analysis course](https://karpov.courses/) I started in January 2021. 2 | 3 | 4 | 5 | Metrics calculations -- Google analytics data cleaning and calculation of the following metrics: number of unique users, conversion, average check, average purchases per user, ARPPU, ARPU.
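The metrics named above reduce to a handful of ratios; a hedged sketch (the variable names and numbers are illustrative, the notebook below computes the real values):

```python
# users - all unique visitors; payers - unique users with revenue > 0;
# purchases - number of transactions; revenue - total revenue (toy numbers)
users, payers, purchases, revenue = 10_000, 250, 300, 1_500_000

conversion = payers / users               # share of users who purchased
avg_check = revenue / purchases           # average check (order value)
purchases_per_payer = purchases / payers  # avg purchases per paying user
arppu = revenue / payers                  # avg revenue per paying user
arpu = revenue / users                    # avg revenue per user
```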
6 | 7 | 8 | 9 | 10 | 11 | Hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through. 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /33_metrics_calc/metrics_calc.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Metrics calculation (mini-project)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**Importing libraries**" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 98, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import pandas as pd" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "**Importing dataset from a zipped file**" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 99, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/html": [ 41 | "
\n", 42 | "\n", 55 | "\n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | "
ga:datega:clientiduserIDga:transaction_idga:revenueUnnamed: 5ga:user
028-08-2019 12:29:242.802509e+087,186,05438391928103NaN141000.0
128-08-2019 12:27:128.196637e+087,186,01097225177697NaNNaN
228-08-2019 11:43:241.751156e+097,184,85938508764892NaNNaN
328-08-2019 11:40:505.515333e+087,186,02938539238816NaNNaN
428-08-2019 11:25:314.527935e+087,183,5483858713112NaNNaN
........................
100601-08-2019 01:33:535.085028e+0871867813586929280NaNNaN
100701-08-2019 01:27:454.152444e+0871867803597922899NaNNaN
100801-08-2019 01:23:406.964930e+0871867823777518900NaNNaN
100901-08-2019 01:18:144.152444e+0871867803777218204NaNNaN
101001-08-2019 01:06:385.352736e+0870140623777483032NaNNaN
\n", 181 | "

1011 rows × 7 columns

\n", 182 | "
" 183 | ], 184 | "text/plain": [ 185 | " ga:date ga:clientid userID ga:transaction_id \\\n", 186 | "0 28-08-2019 12:29:24 2.802509e+08 7,186,054 383919 \n", 187 | "1 28-08-2019 12:27:12 8.196637e+08 7,186,010 97225 \n", 188 | "2 28-08-2019 11:43:24 1.751156e+09 7,184,859 385087 \n", 189 | "3 28-08-2019 11:40:50 5.515333e+08 7,186,029 385392 \n", 190 | "4 28-08-2019 11:25:31 4.527935e+08 7,183,548 385871 \n", 191 | "... ... ... ... ... \n", 192 | "1006 01-08-2019 01:33:53 5.085028e+08 7186781 358692 \n", 193 | "1007 01-08-2019 01:27:45 4.152444e+08 7186780 359792 \n", 194 | "1008 01-08-2019 01:23:40 6.964930e+08 7186782 377751 \n", 195 | "1009 01-08-2019 01:18:14 4.152444e+08 7186780 377721 \n", 196 | "1010 01-08-2019 01:06:38 5.352736e+08 7014062 377748 \n", 197 | "\n", 198 | " ga:revenue Unnamed: 5 ga:user \n", 199 | "0 28103 NaN 141000.0 \n", 200 | "1 177697 NaN NaN \n", 201 | "2 64892 NaN NaN \n", 202 | "3 38816 NaN NaN \n", 203 | "4 3112 NaN NaN \n", 204 | "... ... ... ... \n", 205 | "1006 9280 NaN NaN \n", 206 | "1007 2899 NaN NaN \n", 207 | "1008 8900 NaN NaN \n", 208 | "1009 8204 NaN NaN \n", 209 | "1010 3032 NaN NaN \n", 210 | "\n", 211 | "[1011 rows x 7 columns]" 212 | ] 213 | }, 214 | "execution_count": 99, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "df = pd.read_csv('august_data.zip', compression='zip')\n", 221 | "\n", 222 | "df" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "**Getting basic info about data types and missing values**" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 100, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "name": "stdout", 239 | "output_type": "stream", 240 | "text": [ 241 | "\n", 242 | "RangeIndex: 1011 entries, 0 to 1010\n", 243 | "Data columns (total 7 columns):\n", 244 | " # Column Non-Null Count Dtype \n", 245 | "--- ------ -------------- ----- \n", 246 | " 0 ga:date 1011 non-null object \n", 247 | " 1 ga:clientid 1011 non-null float64\n", 248 | " 2 userID 1011 non-null object \n", 249 | " 3 ga:transaction_id 1011 non-null object \n", 250 | " 4 ga:revenue 1011 non-null int64 \n", 251 | " 5 Unnamed: 5 0 non-null float64\n", 252 | " 6 ga:user 1 non-null float64\n", 253 | "dtypes: float64(3), int64(1), object(3)\n", 254 | "memory usage: 55.4+ KB\n" 255 | ] 256 | } 257 | ], 258 | "source": [ 259 | "df.info()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "###### Columns description\n", 267 | "* **ga:date** – date\n", 268 | "* **ga:clientid** – client id from Google Analytics\n", 269 | "* **userID** – client id from another analytical system\n", 270 | "* **ga:transaction_id** – transaction id\n", 271 | "* **ga:revenue** – revenue\n", 272 | "* **ga:user** – am overall number of unique users" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "**Show the number of unique users?**" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 101, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "141000.0" 291 | ] 292 | }, 293 | "execution_count": 101, 294 | "metadata": {}, 295 | "output_type": "execute_result" 296 | } 297 | ], 298 | "source": [ 299 | "uniques = df.iloc[0, 6]\n", 300 | "\n", 301 | "uniques" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "**Cleaning dataframe**" 309 | ] 310 
| }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 102, 314 | "metadata": {}, 315 | "outputs": [ 316 | { 317 | "data": { 318 | "text/plain": [ 319 | "Index(['ga:date', 'ga:clientid', 'userID', 'ga:transaction_id', 'ga:revenue'], dtype='object')" 320 | ] 321 | }, 322 | "execution_count": 102, 323 | "metadata": {}, 324 | "output_type": "execute_result" 325 | } 326 | ], 327 | "source": [ 328 | "# grabbing column names that we'll use\n", 329 | "column_names = df.columns[0:5]\n", 330 | "column_names" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 103, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "# slicing our dataset\n", 340 | "df = df[column_names]" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 104, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "# dropping 'userID' column (because we need data only from Google Analytics)\n", 350 | "df = df.drop(['userID'], axis=1)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 105, 356 | "metadata": {}, 357 | "outputs": [ 358 | { 359 | "data": { 360 | "text/html": [ 361 | "
\n", 362 | "\n", 375 | "\n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | "
dateclientidtransaction_idrevenue
028-08-2019 12:29:242.802509e+0838391928103
128-08-2019 12:27:128.196637e+0897225177697
228-08-2019 11:43:241.751156e+0938508764892
328-08-2019 11:40:505.515333e+0838539238816
428-08-2019 11:25:314.527935e+083858713112
...............
100601-08-2019 01:33:535.085028e+083586929280
100701-08-2019 01:27:454.152444e+083597922899
100801-08-2019 01:23:406.964930e+083777518900
100901-08-2019 01:18:144.152444e+083777218204
101001-08-2019 01:06:385.352736e+083777483032
\n", 465 | "

1011 rows × 4 columns

\n", 466 | "
" 467 | ], 468 | "text/plain": [ 469 | " date clientid transaction_id revenue\n", 470 | "0 28-08-2019 12:29:24 2.802509e+08 383919 28103\n", 471 | "1 28-08-2019 12:27:12 8.196637e+08 97225 177697\n", 472 | "2 28-08-2019 11:43:24 1.751156e+09 385087 64892\n", 473 | "3 28-08-2019 11:40:50 5.515333e+08 385392 38816\n", 474 | "4 28-08-2019 11:25:31 4.527935e+08 385871 3112\n", 475 | "... ... ... ... ...\n", 476 | "1006 01-08-2019 01:33:53 5.085028e+08 358692 9280\n", 477 | "1007 01-08-2019 01:27:45 4.152444e+08 359792 2899\n", 478 | "1008 01-08-2019 01:23:40 6.964930e+08 377751 8900\n", 479 | "1009 01-08-2019 01:18:14 4.152444e+08 377721 8204\n", 480 | "1010 01-08-2019 01:06:38 5.352736e+08 377748 3032\n", 481 | "\n", 482 | "[1011 rows x 4 columns]" 483 | ] 484 | }, 485 | "execution_count": 105, 486 | "metadata": {}, 487 | "output_type": "execute_result" 488 | } 489 | ], 490 | "source": [ 491 | "# renaming columns for our convenience\n", 492 | "df = df.rename(columns=lambda x: x[3:])\n", 493 | "\n", 494 | "df" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "**Find a number of clients**" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 119, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/plain": [ 512 | "685" 513 | ] 514 | }, 515 | "execution_count": 119, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "clients = df[df['revenue'] > 0].clientid.nunique()\n", 522 | "\n", 523 | "clients" 524 | ] 525 | }, 526 | { 527 | "cell_type": "markdown", 528 | "metadata": {}, 529 | "source": [ 530 | "**Calculate conversion**" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 154, 536 | "metadata": {}, 537 | "outputs": [ 538 | { 539 | "data": { 540 | "text/plain": [ 541 | "0.0049" 542 | ] 543 | }, 544 | "execution_count": 154, 545 | "metadata": {}, 546 | "output_type": "execute_result" 547 | } 548 | ], 549 | "source": [ 550 | "cr = round((clients / uniques), 4)\n", 551 | "\n", 552 | "cr" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "**Find an average check**" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 125, 565 | "metadata": {}, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/plain": [ 570 | "34458.0" 571 | ] 572 | }, 573 | "execution_count": 125, 574 | "metadata": {}, 575 | "output_type": "execute_result" 576 | } 577 | ], 578 | "source": [ 579 | "round(df.revenue.sum() / df[df['revenue'] > 0].shape[0], 0)" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": {}, 585 | "source": [ 586 | "**Calculate an average number of purchases per user**" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 142, 592 | "metadata": {}, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "1.36" 598 | ] 599 | }, 600 | "execution_count": 142, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "round(df[df['revenue'] > 0].groupby(['clientid'], as_index=False) \\\n", 607 | " .agg({'transaction_id': 'count'}) \\\n", 608 | " .transaction_id.mean(), 2)" 609 | ] 610 | }, 611 | { 612 | "cell_type": "markdown", 613 | "metadata": {}, 614 | "source": [ 615 | "**Find ARPPU**" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 150, 621 | "metadata": {}, 622 | "outputs": [ 623 | { 624 | "data": { 625 | "text/plain": [ 
626 | "46883.0" 627 | ] 628 | }, 629 | "execution_count": 150, 630 | "metadata": {}, 631 | "output_type": "execute_result" 632 | } 633 | ], 634 | "source": [ 635 | "arppu = round(df.revenue.sum() / clients, 0)\n", 636 | "\n", 637 | "arppu" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "**Calculate ARPU**" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 156, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "data": { 654 | "text/plain": [ 655 | "230.0" 656 | ] 657 | }, 658 | "execution_count": 156, 659 | "metadata": {}, 660 | "output_type": "execute_result" 661 | } 662 | ], 663 | "source": [ 664 | "arpu = round(arppu * cr, 0)\n", 665 | "\n", 666 | "arpu" 667 | ] 668 | } 669 | ], 670 | "metadata": { 671 | "kernelspec": { 672 | "display_name": "Python 3", 673 | "language": "python", 674 | "name": "python3" 675 | }, 676 | "language_info": { 677 | "codemirror_mode": { 678 | "name": "ipython", 679 | "version": 3 680 | }, 681 | "file_extension": ".py", 682 | "mimetype": "text/x-python", 683 | "name": "python", 684 | "nbconvert_exporter": "python", 685 | "pygments_lexer": "ipython3", 686 | "version": "3.7.6" 687 | } 688 | }, 689 | "nbformat": 4, 690 | "nbformat_minor": 4 691 | } 692 | -------------------------------------------------------------------------------- /34_retention_analysis/README.md: -------------------------------------------------------------------------------- 1 | This is the 32nd dataset analyzed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | 5 | 6 | [Retention Analysis](https://public.tableau.com/profile/nktn.lx#!/vizhome/RetentionAnalysispractice6/Dashboard1) -- **Tableau Public dashboard** contains users retention and ARPU highlight tables. 7 | 8 | **retention_prep_data_for_tableau.ipynb** - jupyter notebook with my code to clean data and prepare it for further use in Tableau. 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 17 | 18 | 19 | 20 | -------------------------------------------- 21 | Fill free to contact me via nktn.lx@gmal.com 22 | Follow me on twitter: @nktn_lx 23 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /34_retention_analysis/retention_prep_data_for_tableau.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "###### Cleaning data for Tableau" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# import libraries\n", 17 | "import pandas as pd" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/html": [ 28 | "
\n", 29 | "\n", 42 | "\n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | "
dateeventpurchase_sumos_namedevice_idgendercityutm_source
02020-01-01app_startNaNandroid669460femaleMoscow-
12020-01-01app_startNaNios833621maleMoscowvk_ads
22020-01-01app_startNaNandroid1579237maleSaint-Petersburgreferal
32020-01-01app_startNaNandroid1737182femaleMoscowfacebook_ads
42020-01-01app_startNaNios4029024femaleMoscowfacebook_ads
...........................
27479632020-03-31registerNaNandroid2984778maleSaint-Petersburgfacebook_ads
27479642020-03-31registerNaNios27301864maleMoscow-
27479652020-03-31registerNaNios1294285femaleSaint-Petersburggoogle_ads
27479662020-03-31registerNaNandroid3010574femaleSaint-Petersburggoogle_ads
27479672020-03-31registerNaNandroid11153353femaleSaint-Petersburg-
\n", 180 | "

2747968 rows × 8 columns

\n", 181 | "
" 182 | ], 183 | "text/plain": [ 184 | " date event purchase_sum os_name device_id gender \\\n", 185 | "0 2020-01-01 app_start NaN android 669460 female \n", 186 | "1 2020-01-01 app_start NaN ios 833621 male \n", 187 | "2 2020-01-01 app_start NaN android 1579237 male \n", 188 | "3 2020-01-01 app_start NaN android 1737182 female \n", 189 | "4 2020-01-01 app_start NaN ios 4029024 female \n", 190 | "... ... ... ... ... ... ... \n", 191 | "2747963 2020-03-31 register NaN android 2984778 male \n", 192 | "2747964 2020-03-31 register NaN ios 27301864 male \n", 193 | "2747965 2020-03-31 register NaN ios 1294285 female \n", 194 | "2747966 2020-03-31 register NaN android 3010574 female \n", 195 | "2747967 2020-03-31 register NaN android 11153353 female \n", 196 | "\n", 197 | " city utm_source \n", 198 | "0 Moscow - \n", 199 | "1 Moscow vk_ads \n", 200 | "2 Saint-Petersburg referal \n", 201 | "3 Moscow facebook_ads \n", 202 | "4 Moscow facebook_ads \n", 203 | "... ... ... \n", 204 | "2747963 Saint-Petersburg facebook_ads \n", 205 | "2747964 Moscow - \n", 206 | "2747965 Saint-Petersburg google_ads \n", 207 | "2747966 Saint-Petersburg google_ads \n", 208 | "2747967 Saint-Petersburg - \n", 209 | "\n", 210 | "[2747968 rows x 8 columns]" 211 | ] 212 | }, 213 | "execution_count": 2, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "# reading data from a zipped csv file\n", 220 | "df = pd.read_csv('data_for_tableau.zip', compression='zip')\n", 221 | "\n", 222 | "df" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 3, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "data": { 232 | "text/plain": [ 233 | "'2020-01-01'" 234 | ] 235 | }, 236 | "execution_count": 3, 237 | "metadata": {}, 238 | "output_type": "execute_result" 239 | } 240 | ], 241 | "source": [ 242 | "# checking min date\n", 243 | "df['date'].min()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 4, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "data": { 253 | "text/plain": [ 254 | "'2020-03-31'" 255 | ] 256 | }, 257 | "execution_count": 4, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "# checking max date\n", 264 | "df['date'].max()" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 5, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/html": [ 275 | "
\n", 276 | "\n", 289 | "\n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | "
dateeventpurchase_sumos_namedevice_idgendercityutm_source
02020-01-01app_startNaNandroid669460femaleMoscow-
131922020-01-01tap_basketNaNandroid17289661maleSaint-Petersburgvk_ads
131912020-01-01tap_basketNaNandroid12215118maleSaint-Petersburgfacebook_ads
131902020-01-01tap_basketNaNios9163079maleMoscowvk_ads
131892020-01-01tap_basketNaNandroid1948894femaleSaint-Petersburggoogle_ads
...........................
24976562020-03-31app_startNaNios5682834femaleMoscow-
24994672020-03-31app_startNaNandroid9339041maleSaint-Petersburg-
24997802020-03-31app_startNaNios32346202femaleMoscowyandex-direct
24989972020-03-31app_startNaNios23772077maleMoscowvk_ads
25027362020-03-31app_startNaNandroid1870546femaleMoscowinstagram_ads
\n", 427 | "

190884 rows × 8 columns

\n", 428 | "
" 429 | ], 430 | "text/plain": [ 431 | " date event purchase_sum os_name device_id gender \\\n", 432 | "0 2020-01-01 app_start NaN android 669460 female \n", 433 | "13192 2020-01-01 tap_basket NaN android 17289661 male \n", 434 | "13191 2020-01-01 tap_basket NaN android 12215118 male \n", 435 | "13190 2020-01-01 tap_basket NaN ios 9163079 male \n", 436 | "13189 2020-01-01 tap_basket NaN android 1948894 female \n", 437 | "... ... ... ... ... ... ... \n", 438 | "2497656 2020-03-31 app_start NaN ios 5682834 female \n", 439 | "2499467 2020-03-31 app_start NaN android 9339041 male \n", 440 | "2499780 2020-03-31 app_start NaN ios 32346202 female \n", 441 | "2498997 2020-03-31 app_start NaN ios 23772077 male \n", 442 | "2502736 2020-03-31 app_start NaN android 1870546 female \n", 443 | "\n", 444 | " city utm_source \n", 445 | "0 Moscow - \n", 446 | "13192 Saint-Petersburg vk_ads \n", 447 | "13191 Saint-Petersburg facebook_ads \n", 448 | "13190 Moscow vk_ads \n", 449 | "13189 Saint-Petersburg google_ads \n", 450 | "... ... ... \n", 451 | "2497656 Moscow - \n", 452 | "2499467 Saint-Petersburg - \n", 453 | "2499780 Moscow yandex-direct \n", 454 | "2498997 Moscow vk_ads \n", 455 | "2502736 Moscow instagram_ads \n", 456 | "\n", 457 | "[190884 rows x 8 columns]" 458 | ] 459 | }, 460 | "execution_count": 5, 461 | "metadata": {}, 462 | "output_type": "execute_result" 463 | } 464 | ], 465 | "source": [ 466 | "# sorting values and dropping duplicates to create cohorts\n", 467 | "mindate = df.sort_values('date').drop_duplicates('device_id')\n", 468 | "\n", 469 | "mindate" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 6, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "190884" 481 | ] 482 | }, 483 | "execution_count": 6, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "# checking that we haven't lost any data\n", 490 | "df['device_id'].nunique()" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 7, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "# slicing\n", 500 | "mindate = mindate[['device_id', 'date']]" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": 8, 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "data": { 510 | "text/html": [ 511 | "
\n", 512 | "\n", 525 | "\n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | "
device_idcohort_first_session
06694602020-01-01
13192172896612020-01-01
13191122151182020-01-01
1319091630792020-01-01
1318919488942020-01-01
.........
249765656828342020-03-31
249946793390412020-03-31
2499780323462022020-03-31
2498997237720772020-03-31
250273618705462020-03-31
\n", 591 | "

190884 rows × 2 columns

\n", 592 | "
" 593 | ], 594 | "text/plain": [ 595 | " device_id cohort_first_session\n", 596 | "0 669460 2020-01-01\n", 597 | "13192 17289661 2020-01-01\n", 598 | "13191 12215118 2020-01-01\n", 599 | "13190 9163079 2020-01-01\n", 600 | "13189 1948894 2020-01-01\n", 601 | "... ... ...\n", 602 | "2497656 5682834 2020-03-31\n", 603 | "2499467 9339041 2020-03-31\n", 604 | "2499780 32346202 2020-03-31\n", 605 | "2498997 23772077 2020-03-31\n", 606 | "2502736 1870546 2020-03-31\n", 607 | "\n", 608 | "[190884 rows x 2 columns]" 609 | ] 610 | }, 611 | "execution_count": 8, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "# renaming columns\n", 618 | "mindate = mindate.rename(columns={'date': 'cohort_first_session'})\n", 619 | "\n", 620 | "mindate" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": 9, 626 | "metadata": {}, 627 | "outputs": [ 628 | { 629 | "data": { 630 | "text/html": [ 631 | "
\n", 632 | "\n", 645 | "\n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | "
dateeventpurchase_sumos_namedevice_idgendercityutm_sourcecohort_first_session
02020-01-01app_startNaNandroid669460femaleMoscow-2020-01-01
12020-01-01app_startNaNios833621maleMoscowvk_ads2020-01-01
22020-01-01app_startNaNandroid1579237maleSaint-Petersburgreferal2020-01-01
32020-01-01app_startNaNandroid1737182femaleMoscowfacebook_ads2020-01-01
42020-01-01app_startNaNios4029024femaleMoscowfacebook_ads2020-01-01
..............................
27479632020-03-31registerNaNandroid2984778maleSaint-Petersburgfacebook_ads2020-03-28
27479642020-03-31registerNaNios27301864maleMoscow-2020-03-31
27479652020-03-31registerNaNios1294285femaleSaint-Petersburggoogle_ads2020-03-31
27479662020-03-31registerNaNandroid3010574femaleSaint-Petersburggoogle_ads2020-03-06
27479672020-03-31registerNaNandroid11153353femaleSaint-Petersburg-2020-03-31
\n", 795 | "

2747968 rows × 9 columns

\n", 796 | "
" 797 | ], 798 | "text/plain": [ 799 | " date event purchase_sum os_name device_id gender \\\n", 800 | "0 2020-01-01 app_start NaN android 669460 female \n", 801 | "1 2020-01-01 app_start NaN ios 833621 male \n", 802 | "2 2020-01-01 app_start NaN android 1579237 male \n", 803 | "3 2020-01-01 app_start NaN android 1737182 female \n", 804 | "4 2020-01-01 app_start NaN ios 4029024 female \n", 805 | "... ... ... ... ... ... ... \n", 806 | "2747963 2020-03-31 register NaN android 2984778 male \n", 807 | "2747964 2020-03-31 register NaN ios 27301864 male \n", 808 | "2747965 2020-03-31 register NaN ios 1294285 female \n", 809 | "2747966 2020-03-31 register NaN android 3010574 female \n", 810 | "2747967 2020-03-31 register NaN android 11153353 female \n", 811 | "\n", 812 | " city utm_source cohort_first_session \n", 813 | "0 Moscow - 2020-01-01 \n", 814 | "1 Moscow vk_ads 2020-01-01 \n", 815 | "2 Saint-Petersburg referal 2020-01-01 \n", 816 | "3 Moscow facebook_ads 2020-01-01 \n", 817 | "4 Moscow facebook_ads 2020-01-01 \n", 818 | "... ... ... ... \n", 819 | "2747963 Saint-Petersburg facebook_ads 2020-03-28 \n", 820 | "2747964 Moscow - 2020-03-31 \n", 821 | "2747965 Saint-Petersburg google_ads 2020-03-31 \n", 822 | "2747966 Saint-Petersburg google_ads 2020-03-06 \n", 823 | "2747967 Saint-Petersburg - 2020-03-31 \n", 824 | "\n", 825 | "[2747968 rows x 9 columns]" 826 | ] 827 | }, 828 | "execution_count": 9, 829 | "metadata": {}, 830 | "output_type": "execute_result" 831 | } 832 | ], 833 | "source": [ 834 | "# merging cohorts data with the main dataframe\n", 835 | "df = df.merge(mindate, how='left', on='device_id')\n", 836 | "df" 837 | ] 838 | }, 839 | { 840 | "cell_type": "code", 841 | "execution_count": 11, 842 | "metadata": {}, 843 | "outputs": [ 844 | { 845 | "name": "stdout", 846 | "output_type": "stream", 847 | "text": [ 848 | "Data export is successfully finished!\n" 849 | ] 850 | } 851 | ], 852 | "source": [ 853 | "# exporting results to a csv file\n", 854 | "df.to_csv('data_for_analysis.csv', index=False)\n", 855 | "\n", 856 | "print('Data export is successfully finished!')" 857 | ] 858 | } 859 | ], 860 | "metadata": { 861 | "kernelspec": { 862 | "display_name": "Python 3", 863 | "language": "python", 864 | "name": "python3" 865 | }, 866 | "language_info": { 867 | "codemirror_mode": { 868 | "name": "ipython", 869 | "version": 3 870 | }, 871 | "file_extension": ".py", 872 | "mimetype": "text/x-python", 873 | "name": "python", 874 | "nbconvert_exporter": "python", 875 | "pygments_lexer": "ipython3", 876 | "version": "3.7.6" 877 | } 878 | }, 879 | "nbformat": 4, 880 | "nbformat_minor": 4 881 | } 882 | -------------------------------------------------------------------------------- /35_rfm_analysis/README.md: -------------------------------------------------------------------------------- 1 | This is the 35th dataset analyzed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | 5 | RFM analysis -- performed RFM analysis, built LTV heatmap and found insights about users segmentation. 6 | 7 | 8 | 9 | 10 | 11 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 
12 | 13 | 14 | 15 | -------------------------------------------- 16 | Fill free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /36_probabilities/.ipynb_checkpoints/probabilities-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Основы теории вероятностей" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Сложение и произведение вероятностей" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Задача 1" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "Два стрелка одновременно стреляют по мишени. Вероятность попадания по мишени у первого стрелка равна 0,5, у второго - 0,8. Какова вероятность того, что в мишени будет только одна пробоина? " 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "**Решение**: \n", 36 | "Попадет первый, а второй промажет: $\\frac{1}{2}*\\frac{2}{10}=\\frac{1}{2}*\\frac{1}{5}=\\frac{1}{10}$ \n", 37 | "\n", 38 | "Попадет второй, а первый промажет: $\\frac{8}{10}*\\frac{1}{2}=\\frac{8}{20}=\\frac{2}{5}$ \n", 39 | "\n", 40 | "Попадет или первый, или второй (т.е. в мишени будет только одна пробоина): $\\frac{1}{10}+\\frac{2}{5}=\\frac{1}{10}+\\frac{4}{10}=\\frac{5}{10}=\\frac{1}{2}$ " 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Задача 2" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "Трое аналитиков на экзамене независимо друг от друга решают одну и ту же задачу. Вероятности ее решения этими аналитиками равны 0,8, 0,7 и 0,6 соответственно. Найдите вероятность того, что хотя бы один аналитик решит задачу." 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**Решение**: \n", 62 | "хотя бы один решит, противоположная ситуции никто не решит, т.е. сначала надо найти никто не решит, а потом эту вероятность вычесть из единицы. \n", 63 | "\n", 64 | "Никто не решит: $\\frac{2}{10}*\\frac{3}{10}*\\frac{4}{10}=\\frac{24}{1000}=\\frac{12}{500}=\\frac{6}{250}$ \n", 65 | "\n", 66 | "Решит хотя бы один: $1-\\frac{6}{250}=\\frac{250}{250}-\\frac{6}{250}=\\frac{244}{250}$" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Задача 3" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "В первой урне находятся 10 белых и 4 черных шаров, а во второй 5 белых и 9 черных шаров. Из каждой урны вынули по шару. Какова вероятность того, что оба шара окажутся черными?" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "**Решение:** \n", 88 | "Черный из первой: $\\frac{4}{14}$ \n", 89 | "\n", 90 | "Черный из второй: $\\frac{9}{14}$ \n", 91 | "\n", 92 | "Оба черные: $\\frac{4}{14}*\\frac{9}{14}=\\frac{36}{196}=\\frac{9}{49}$ " 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### Задача 4" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "Вероятность хотя бы одного попадания в цель при четырех выстрелах равна 0,9984. 
-------------------------------------------------------------------------------- /36_probabilities/README.md: -------------------------------------------------------------------------------- 1 | This is the 36th set of tasks done by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | 5 | Probabilities -- solving probability theory problems including AND/OR probabilities, Bernoulli trial and conditional probability (the Bayes theorem). 6 | 7 | 8 | 9 | 10 | 11 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 12 | 13 | 14 | 15 | -------------------------------------------- 16 | Feel free to contact me via nktn.lx@gmal.com 17 | Follow me on twitter: @nktn_lx 18 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /36_probabilities/probabilities.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Probability Theory Basics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Addition and Multiplication of Probabilities" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "### Problem 1" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "Two shooters fire at a target simultaneously. The probability of hitting the target is 0.5 for the first shooter and 0.8 for the second. What is the probability that there will be exactly one hole in the target? " 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "**Solution**: \n", 36 | "The first hits and the second misses: $\\frac{1}{2}*\\frac{2}{10}=\\frac{1}{2}*\\frac{1}{5}=\\frac{1}{10}$ \n", 37 | "\n", 38 | "The second hits and the first misses: $\\frac{8}{10}*\\frac{1}{2}=\\frac{8}{20}=\\frac{2}{5}$ \n", 39 | "\n", 40 | "Either the first or the second hits (i.e. there is exactly one hole in the target): $\\frac{1}{10}+\\frac{2}{5}=\\frac{1}{10}+\\frac{4}{10}=\\frac{5}{10}=\\frac{1}{2}$ " 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "### Problem 2" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "Three analysts at an exam solve the same problem independently of each other. The probabilities that they solve it are 0.8, 0.7 and 0.6 respectively. Find the probability that at least one analyst solves the problem."
55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**Solution**: \n", 62 | "\"At least one solves it\" is the opposite of \"no one solves it\", i.e. first find the probability that no one solves it and then subtract that probability from one. \n", 63 | "\n", 64 | "No one solves it: $\\frac{2}{10}*\\frac{3}{10}*\\frac{4}{10}=\\frac{24}{1000}=\\frac{12}{500}=\\frac{6}{250}$ \n", 65 | "\n", 66 | "At least one solves it: $1-\\frac{6}{250}=\\frac{250}{250}-\\frac{6}{250}=\\frac{244}{250}$" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Problem 3" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "The first urn contains 10 white and 4 black balls, the second one contains 5 white and 9 black balls. One ball is drawn from each urn. What is the probability that both balls turn out to be black?" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "**Solution:** \n", 88 | "A black ball from the first urn: $\\frac{4}{14}$ \n", 89 | "\n", 90 | "A black ball from the second urn: $\\frac{9}{14}$ \n", 91 | "\n", 92 | "Both black: $\\frac{4}{14}*\\frac{9}{14}=\\frac{36}{196}=\\frac{9}{49}$ " 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### Problem 4" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "The probability of at least one hit on a target in four shots is 0.9984. Find the probability of hitting the target with a single shot. " 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "**Solution**: \n", 114 | "First find the probability of missing with all four shots: 1 − 0.9984 = 0.0016 \n", 115 | "\n", 116 | "Now find the probability of a miss with a single shot: $\\sqrt[4]{0.0016}=0.2$ \n", 117 | "Accordingly, the probability of a hit with a single shot: 1 − 0.2 = 0.8 " 118 | ] 119 | },
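{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick numeric check of this answer (an illustrative cell, left unexecuted):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the single-shot miss probability is the 4th root of the all-four-miss probability\n", "p_hit = 1 - (1 - 0.9984) ** 0.25\n", "round(p_hit, 4) # expected: 0.8" ] },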
\n", 136 | "\n", 137 | "Выберите все пары зависимых событий: \n", 138 | "* B и F,\n", 139 | "* A и F,\n", 140 | "* A и B" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "**Ответ:** A и F" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "## Формула Бернулли" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "Биномиальный коэффициент:" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 1, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/plain": [ 172 | "120.0" 173 | ] 174 | }, 175 | "execution_count": 1, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "from math import factorial\n", 182 | "\n", 183 | "n = 10\n", 184 | "k = 3\n", 185 | "\n", 186 | "def num_of_successes(n, k):\n", 187 | " return factorial(n)/(factorial(k) * factorial(n - k))\n", 188 | "\n", 189 | "num_of_successes(n, k)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### Задача 1" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "Устройство, состоящее из пяти независимо работающих элементов, включается за время Т. Вероятность отказа каждого из них за это время равна 0,4. Найти вероятность того, что откажут три элемента. Ответ округлите до четырёх знаков после запятой." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 2, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "0.2304\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "# Solution:\n", 221 | "n = 5\n", 222 | "k = 3\n", 223 | "p = 0.4\n", 224 | "\n", 225 | "c = num_of_successes(n, k)\n", 226 | "\n", 227 | "result = c * p ** k * (1 - p) ** (n-k)\n", 228 | "\n", 229 | "print(round(result, 4))" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### Задача 2" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "Устройство, состоящее из пяти независимо работающих элементов, включается за время Т. Вероятность отказа каждого из них за это время равна 0,4. Найдите вероятность того, что хотя бы один элемент откажет. Ответ округлите до трёх знаков после запятой." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 3, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "0.922\n" 256 | ] 257 | } 258 | ], 259 | "source": [ 260 | "n = 5\n", 261 | "k = 3\n", 262 | "p = 0.4\n", 263 | "\n", 264 | "result = 1 - (1 - p) ** n\n", 265 | "\n", 266 | "print(round(result, 3))" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "### Задача 3" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "Производится 8 выстрелов по цели, в каждом из которых вероятность попадания равна 0,3. Найти вероятность того, что цель будет поражена хотя бы два раза. Ответ округлите до тысячных." 
192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### Task 1" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "A device consisting of five independently working elements is switched on for a time T. The probability of failure of each element during this time is 0.4. Find the probability that exactly three elements fail. Round the answer to four decimal places." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 2, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "0.2304\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "# Solution:\n", 221 | "n = 5\n", 222 | "k = 3\n", 223 | "p = 0.4\n", 224 | "\n", 225 | "c = num_of_successes(n, k)\n", 226 | "\n", 227 | "result = c * p ** k * (1 - p) ** (n-k)\n", 228 | "\n", 229 | "print(round(result, 4))" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### Task 2" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "A device consisting of five independently working elements is switched on for a time T. The probability of failure of each element during this time is 0.4. Find the probability that at least one element fails. Round the answer to three decimal places." 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 3, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "name": "stdout", 253 | "output_type": "stream", 254 | "text": [ 255 | "0.922\n" 256 | ] 257 | } 258 | ], 259 | "source": [ 260 | "n = 5\n", 261 | "# k is not needed here: we take the complement of the event 'no failures at all'\n", 262 | "p = 0.4\n", 263 | "\n", 264 | "result = 1 - (1 - p) ** n\n", 265 | "\n", 266 | "print(round(result, 3))" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "### Task 3" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "Eight shots are fired at a target, each hitting it with probability 0.3. Find the probability that the target is hit at least twice. Round the answer to three decimal places." 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "**Solution:** \n", 288 | "Here we need to calculate two probabilities and add them up:\n", 289 | "* the probability of no hits on the target at all,\n", 290 | "* the probability of exactly one hit.\n", 291 | "\n", 292 | "The sum of these probabilities is the complement of the event \"the target is hit at least twice\".\n" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 4, 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "No hits: 0.05764800999999997\n", 305 | "binomial coefficient: 8.0\n", 306 | "Exactly one hit: 0.1976503199999999\n", 307 | "Zero or one hit: 0.2552983299999999\n", 308 | "At least two hits: 0.745\n" 309 | ] 310 | } 311 | ], 312 | "source": [ 313 | "n = 8\n", 314 | "k = 1\n", 315 | "p = 0.3\n", 316 | "\n", 317 | "p0 = (1 - p) ** n\n", 318 | "print(f'No hits: {p0}')\n", 319 | "\n", 320 | "c = num_of_successes(n, k)\n", 321 | "print(f'binomial coefficient: {c}')\n", 322 | "\n", 323 | "p1 = c * p ** k * (1 - p) ** (n-k)\n", 324 | "print(f'Exactly one hit: {p1}')\n", 325 | "\n", 326 | "result = p0 + p1\n", 327 | "print(f'Zero or one hit: {result}')\n", 328 | "\n", 329 | "print(f'At least two hits: {round(1-result, 3)}')" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "### Task 4" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "There are 7 customers in a shop. Each makes a purchase with probability 0.4. Find the probability that at most two of them make a purchase. Round the answer to two decimal places." 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 5, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "name": "stdout", 353 | "output_type": "stream", 354 | "text": [ 355 | "No purchases: 0.027993599999999993\n", 356 | "Exactly one purchase: 0.13063679999999997\n", 357 | "Exactly two purchases: 0.2612736\n", 358 | "At most two: 0.42\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "n = 7\n", 364 | "p = 0.4\n", 365 | "\n", 366 | "p0 = (1 - p) ** n\n", 367 | "print(f'No purchases: {p0}')\n", 368 | "\n", 369 | "k = 1\n", 370 | "c = num_of_successes(n, k)\n", 371 | "p1 = c * p ** k * (1 - p) ** (n-k)\n", 372 | "print(f'Exactly one purchase: {p1}')\n", 373 | "\n", 374 | "k = 2\n", 375 | "c = num_of_successes(n, k)\n", 376 | "p2 = c * p ** k * (1 - p) ** (n-k)\n", 377 | "print(f'Exactly two purchases: {p2}')\n", 378 | "\n", 379 | "result = p0 + p1 + p2\n", 380 | "print(f'At most two: {round(result, 2)}')" 381 | ] 382 | },
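{ "cell_type": "markdown", "metadata": {}, "source": [ "The \"at most two\" sum can be cross-checked against the binomial CDF from `scipy.stats` (an added verification; it assumes `scipy` is installed in the environment):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import binom\n", "\n", "# P(X <= 2) for X ~ Binomial(n=7, p=0.4); should match p0 + p1 + p2 above, ~0.42\n", "round(binom.cdf(2, 7, 0.4), 2)" ] },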
383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### Task 5" 388 | ] 389 | }, 390 | { 391 | "cell_type": "markdown", 392 | "metadata": {}, 393 | "source": [ 394 | "A TV repair shop has 7 TV sets. For each TV set the probability that it is currently switched on is 0.5. Find the probability that at least three TV sets are currently switched on. Round the answer to three decimal places.\n" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 6, 400 | "metadata": {}, 401 | "outputs": [ 402 | { 403 | "name": "stdout", 404 | "output_type": "stream", 405 | "text": [ 406 | "All off: 0.0078125\n", 407 | "One on: 0.0546875\n", 408 | "Two on: 0.1640625\n", 409 | "Zero, one or two on: 0.2265625\n", 410 | "At least three on: 0.773\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "n = 7\n", 416 | "p = 0.5\n", 417 | "\n", 418 | "p0 = (1 - p) ** n\n", 419 | "print(f'All off: {p0}')\n", 420 | "\n", 421 | "k = 1\n", 422 | "c = num_of_successes(n, k)\n", 423 | "p1 = c * p ** k * (1 - p) ** (n-k)\n", 424 | "print(f'One on: {p1}')\n", 425 | "\n", 426 | "k = 2\n", 427 | "c = num_of_successes(n, k)\n", 428 | "p2 = c * p ** k * (1 - p) ** (n-k)\n", 429 | "print(f'Two on: {p2}')\n", 430 | "\n", 431 | "result = p0 + p1 + p2\n", 432 | "print(f'Zero, one or two on: {result}')\n", 433 | "\n", 434 | "print(f'At least three on: {round(1 - result, 3)}')" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "## Conditional Probability (the Bayes Theorem)" 442 | ] 443 | }, 444 | { 445 | "cell_type": "markdown", 446 | "metadata": {}, 447 | "source": [ 448 | "### Task 1" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "We have two groups of shooters: \n", 456 | "* the first one has 5 people who hit the target with probability 0.8,\n", 457 | "* the second one has 15 people who hit the target with probability 0.4. \n", 458 | "\n", 459 | "We walked up to the target and saw that it had been hit. What is the probability that the hit was made by a shooter from the first group? " 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "**Solution**: \n", 467 | "1. The probability that a randomly chosen shooter belongs to the first group: P(Y) = 5/20 = 0.25 \n", 468 | "2. The hit probability for a shooter from the first group is known from the problem statement: P(X|Y) = 0.8 \n", 469 | "3. Calculate the overall probability of hitting the target: P(X) = 5/20 * 0.8 + 15/20 * 0.4 = 0.5 \n", 470 | "4. Substitute everything into the Bayes formula: $\frac{P(Y)*P(X|Y)}{P(X)}=\frac{0.25 * 0.8}{0.5}=0.4$" 471 | ] 472 | },
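{ "cell_type": "markdown", "metadata": {}, "source": [ "The same calculation in code (an added sanity check):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_first = 5 / 20                   # P(Y): the shooter is from the first group\n", "p_hit_first = 0.8                  # P(X|Y): hit probability in the first group\n", "p_hit = 5/20 * 0.8 + 15/20 * 0.4   # P(X): overall hit probability, 0.5\n", "\n", "# Bayes formula: P(Y|X) = P(Y) * P(X|Y) / P(X)\n", "p_first * p_hit_first / p_hit  # 0.4" ] },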
473 | { 474 | "cell_type": "markdown", 475 | "metadata": {}, 476 | "source": [ 477 | "### Task 2" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": {}, 483 | "source": [ 484 | "A test is positive for a healthy person with probability 0.01. \n", 485 | "A test is positive for a sick person with probability 0.9. \n", 486 | "The probability of being sick is 0.001. \n", 487 | "What is the probability that you are healthy given that you got a positive test? " 488 | ] 489 | }, 490 | { 491 | "cell_type": "markdown", 492 | "metadata": {}, 493 | "source": [ 494 | "**Solution**: \n", 495 | "$P(healthy|test+)=\frac{P(healthy)*P(test+|healthy)}{P(test+)}$\n", 496 | "\n", 497 | "1. The probability of being healthy: P(healthy) = 1 - 0.001 = 0.999 \n", 498 | "2. The probability of a positive test for a healthy person is known from the problem statement: P(test+|healthy) = 0.01 \n", 499 | "3. Calculate the overall probability of getting a positive test: P(test+) = 0.999 * 0.01 + 0.001 * 0.9 = 0.01089 \n", 500 | "4. Substitute everything into the Bayes formula: $P(healthy|test+)=\frac{P(healthy)*P(test+|healthy)}{P(test+)}=\frac{0.999 * 0.01}{0.01089}=0.9173553719008265$" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "### Task 3" 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "Out of 500 computers, 180 belong to the first batch, 170 to the second batch, and the rest to the third one. The first batch has a 3% defect rate, the second 2%, and the third 6%. One computer is chosen at random. Find the probability that the chosen computer is defective. Round the answer to four decimal places. Use a dot as the decimal separator." 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "**Solution** (by the total probability formula; the third batch has 500 − 180 − 170 = 150 computers): \n", 522 | "$\frac{180 * 0.03}{500} + \frac{170 * 0.02}{500} + \frac{150 * 0.06}{500}=0.0356$" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "### Task 4" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "In a company there are 3 times as many programmers writing in Java as in C++, and 4 times as many C++ programmers as Python programmers. The probability of finishing the job in one day is 0.85 for a Java programmer, 0.9 for a C++ programmer and 0.8 for a Python programmer. Find the probability that a randomly chosen programmer finishes the job in one day. Round the answer to two decimal places. Use a dot as the decimal separator." 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "**Solution**: \n", 544 | "Java = 3C++ \n", 545 | "C++ = 4P \n", 546 | "Java = 3 * 4P = 12P \n", 547 | "\n", 548 | "$\frac{12P * 0.85}{17P} + \frac{4P * 0.9}{17P} + \frac{P * 0.8}{17P}=\frac{12 * 0.85}{17} + \frac{4 * 0.9}{17} + \frac{0.8}{17}\approx0.8588\approx0.86$" 549 | ] 550 | },
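{ "cell_type": "markdown", "metadata": {}, "source": [ "Verifying the arithmetic in code (an added check):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Java : C++ : Python programmers are in a 12 : 4 : 1 ratio, 17 parts in total\n", "result = (12 * 0.85 + 4 * 0.9 + 1 * 0.8) / 17\n", "round(result, 2)  # 0.86" ] },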
551 | { 552 | "cell_type": "markdown", 553 | "metadata": {}, 554 | "source": [ 555 | "### Task 5" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "Out of 40 snipers, 18 hit the target with probability 0.9, 8 with probability 0.4, and 14 with probability 0.7. A randomly chosen sniper fired a shot and hit the target. We want to find out from which group the sniper most likely made this shot. Write down the largest probability, rounded to three decimal places. Use a dot as the separator." 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "**Solution**: use the Bayes theorem to calculate the probability for each of the groups and pick the largest one. \n", 570 | "$P(A|B)=\frac{P(A)*P(B|A)}{P(B)}$" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 7, 576 | "metadata": {}, 577 | "outputs": [ 578 | { 579 | "name": "stdout", 580 | "output_type": "stream", 581 | "text": [ 582 | "0.73\n", 583 | "0.45 0.2 0.35\n", 584 | "0.555 0.11 0.336\n" 585 | ] 586 | } 587 | ], 588 | "source": [ 589 | "g1 = 18\n", 590 | "g2 = 8\n", 591 | "g3 = 14\n", 592 | "s = 40\n", 593 | "\n", 594 | "# P(B|A), given in the problem statement -- the probability of a hit for a sniper from each group\n", 595 | "v1 = 0.9\n", 596 | "v2 = 0.4\n", 597 | "v3 = 0.7\n", 598 | "\n", 599 | "# P(B) -- the denominator shared by all three cases: the overall probability that the shot was a hit\n", 600 | "pb = g1*v1/s + g2*v2/s + g3*v3/s\n", 601 | "\n", 602 | "# P(A) -- the probability of belonging to each of the groups\n", 603 | "pa1 = g1/s\n", 604 | "pa2 = g2/s\n", 605 | "pa3 = g3/s\n", 606 | "\n", 607 | "# P(A|B) for each of the cases\n", 608 | "pab1 = pa1*v1/pb\n", 609 | "pab2 = pa2*v2/pb\n", 610 | "pab3 = pa3*v3/pb\n", 611 | "\n", 612 | "print(pb)\n", 613 | "print(pa1, pa2, pa3)\n", 614 | "print(round(pab1, 3), round(pab2, 3), round(pab3, 3))" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "### Task 6" 622 | ] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": {}, 627 | "source": [ 628 | "There are 4 dice. Three of them have half of the faces painted white, while the fourth one has only one white face out of six. A randomly chosen die is rolled seven times. Find the probability that the fourth die was chosen, given that the white face came up exactly once in the seven rolls. Round the answer to three decimal places. Use a dot as the decimal separator." 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "**Solution**: apply the Bayes theorem \n", 636 | "$P(A|B)=\frac{P(A)*P(B|A)}{P(B)}$" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 8, 642 | "metadata": {}, 643 | "outputs": [ 644 | { 645 | "name": "stdout", 646 | "output_type": "stream", 647 | "text": [ 648 | "0.25\n" 649 | ] 650 | } 651 | ], 652 | "source": [ 653 | "# P(A) -- the prior probability that the fourth die was chosen\n", 654 | "pa = 1/4\n", 655 | "print(pa)" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 9, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "from math import factorial\n", 665 | "\n", 666 | "# helper function for calculating the binomial coefficient\n", 667 | "def binom_coef(n, k):\n", 668 | " return factorial(n) / (factorial(k) * factorial(n - k))" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 10, 674 | "metadata": {}, 675 | "outputs": [ 676 | { 677 | "name": "stdout", 678 | "output_type": "stream", 679 | "text": [ 680 | "0.3907143061271148\n" 681 | ] 682 | } 683 | ], 684 | "source": [ 685 | "# P(B|A) -- the probability of exactly one white face in seven rolls of the fourth die\n", 686 | "n = 7  # number of rolls\n", 687 | "k = 1  # number of white faces\n", 688 | "pba = binom_coef(n, k) * (1/6) ** k * (1 - 1/6) ** (n-k)\n", 689 | "print(pba)" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": 11, 695 | "metadata": {}, 696 | "outputs": [ 697 | { 698 | "name": "stdout", 699 | "output_type": "stream", 700 | "text": [ 701 | "0.1386942015317787\n" 702 | ] 703 | } 704 | ], 705 | "source": [ 706 | "# P(B) -- the overall probability of exactly one white face in seven rolls\n", 707 | "# (three of the four dice have p = 3/6 for a white face, the fourth one has p = 1/6)\n", 708 | "\n", 709 | "\n", 710 | "pb = pa * binom_coef(n, k) * (1/6) ** k * (1 - 1/6) ** (n-k) + (1 - pa) * binom_coef(n, k) * (3/6) ** k * (1 - 3/6) ** (n-k)\n", 711 | "\n", 712 | "print(pb)" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 12, 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "name": "stdout", 722 | "output_type": "stream", 723 | "text": [ 724 | "0.704\n" 725 | ] 726 | } 727 | ], 728 | "source": [ 729 | "# apply the Bayes theorem\n", 730 | "pab = pa * pba / pb\n", 731 | "\n", 732 | "print(round(pab, 3))" 733 | ] 734 | },
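{ "cell_type": "markdown", "metadata": {}, "source": [ "The same posterior can be cross-checked with `scipy.stats.binom` (an added verification; it assumes `scipy` is available):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import binom\n", "\n", "# P(exactly one white face in seven rolls) for the fourth die and for the other three\n", "pba_fourth = binom.pmf(1, 7, 1/6)\n", "pba_others = binom.pmf(1, 7, 3/6)\n", "\n", "# posterior probability of the fourth die; should print 0.704, as above\n", "posterior = 0.25 * pba_fourth / (0.25 * pba_fourth + 0.75 * pba_others)\n", "print(round(posterior, 3))" ] },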
735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "### Task 7" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "There are 12 scripts on GitHub written by analyst #1, 20 scripts written by analyst #2 and 18 scripts written by analyst #3. The probability that a script written by analyst #1 works without errors is 0.9; for scripts written by analysts #2 and #3 these probabilities are 0.6 and 0.9 respectively.\n", 747 | "\n", 748 | "Find the probability that a randomly taken script will work without errors. Write down the answer rounded to two decimal places. Use a dot as the decimal separator." 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "**Solution**: \n", 756 | "$\frac{12}{50}*0.9 + \frac{20}{50}*0.6 + \frac{18}{50}*0.9=0.78$" 757 | ] 758 | }
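, { "cell_type": "markdown", "metadata": {}, "source": [ "The same total-probability computation in code -- a small generic helper added for illustration (the name `total_probability` is ours, not part of the course materials):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def total_probability(counts, probs):\n", "    # P(B) = sum over groups of P(group) * P(B | group)\n", "    total = sum(counts)\n", "    return sum(c / total * p for c, p in zip(counts, probs))\n", "\n", "# 12, 20 and 18 scripts with per-analyst 'works without errors' probabilities\n", "round(total_probability([12, 20, 18], [0.9, 0.6, 0.9]), 2)  # 0.78" ] }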
759 | ], 760 | "metadata": { 761 | "kernelspec": { 762 | "display_name": "Python 3", 763 | "language": "python", 764 | "name": "python3" 765 | }, 766 | "language_info": { 767 | "codemirror_mode": { 768 | "name": "ipython", 769 | "version": 3 770 | }, 771 | "file_extension": ".py", 772 | "mimetype": "text/x-python", 773 | "name": "python", 774 | "nbconvert_exporter": "python", 775 | "pygments_lexer": "ipython3", 776 | "version": "3.7.6" 777 | } 778 | }, 779 | "nbformat": 4, 780 | "nbformat_minor": 4 781 | } 782 | -------------------------------------------------------------------------------- /37_final_project/README.md: -------------------------------------------------------------------------------- 1 | This is the final project done by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | 5 | Final project -- you're employed in a mobile games development company. A Product Manager gives you the following tasks: 6 | * Find and visualize retention, 7 | * Make a decision based on the A/B test data, 8 | * Suggest a number of metrics to evaluate the results of the last monthly campaign. 9 | 10 | 11 | 12 | 13 | 14 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 15 | 16 | 17 | 18 | -------------------------------------------- 19 | Feel free to contact me via nktn.lx@gmal.com 20 | Follow me on twitter: @nktn_lx 21 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /37_final_project/my_project_slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/37_final_project/my_project_slides.pdf -------------------------------------------------------------------------------- /37_final_project/retention_200901_200923.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/37_final_project/retention_200901_200923.jpg -------------------------------------------------------------------------------- /3_user_logs/README.md: -------------------------------------------------------------------------------- 1 | This is the third dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | User Logs -- analysing customer data. Finding the most popular platform and the most active users. Visualizing data with Seaborn distplot, barplot and countplot methods. 5 | 6 | 7 | 8 | 9 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 10 | 11 | 12 | 13 | -------------------------------------------- 14 | Feel free to contact me via nktn.lx@gmal.com 15 | Follow me on twitter: @nktn_lx 16 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /4_taxi_peru/README.md: -------------------------------------------------------------------------------- 1 | This is the fourth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Taxi in Peru -- analysing taxi orders in Peru with Pandas. An Exploratory Data Analysis was performed. Drivers' score, passengers' score, **DAU** and **MAU** metrics were calculated and plotted with Seaborn. 5 | 6 | 7 | 8 | 9 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 10 | 11 | 12 | 13 | -------------------------------------------- 14 | Feel free to contact me via nktn.lx@gmal.com 15 | Follow me on twitter: @nktn_lx 16 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /5_raw_data_handling/README.md: -------------------------------------------------------------------------------- 1 | This is the fifth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Raw Data Handling -- creating a dataframe from a set of csv-files stored in various folders. Practicing Python skills to automate data handling. 5 | 6 | Folders tree is shown below: 7 | 8 | 9 | 10 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 11 | 12 | 13 | 14 | -------------------------------------------- 15 | Feel free to contact me via nktn.lx@gmal.com 16 | Follow me on twitter: @nktn_lx 17 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /5_raw_data_handling/data_tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/5_raw_data_handling/data_tree.png -------------------------------------------------------------------------------- /5_raw_data_handling/raw_data_handling.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Raw Data Handling (mini-project)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**The data is in multiple files stored in various folders. Create a unified dataframe to store all the data. Add DATE and NAME columns to the dataframe, taking their values from the names of the corresponding folders.** \n", 15 | " **Folders tree is shown below:**\n", 16 | "" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "**The resulting dataframe should have the following columns with data:** \n", 24 | "* product_id \n", 25 | "* quantity\n", 26 | "* name\n", 27 | "* date " 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 3, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/html": [ 38 | "
\n", 39 | "\n", 52 | "\n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | "
product_idquantitynamedate
0793Anton_Smirnov2020-12-09
1331Anton_Smirnov2020-12-09
2813Anton_Smirnov2020-12-09
3704Anton_Smirnov2020-12-09
4565Alexey_Smirnov2020-12-09
...............
156735Alexey_Fedorov2020-12-08
157341Alexey_Fedorov2020-12-08
158711Alexey_Fedorov2020-12-08
159182Alexey_Fedorov2020-12-08
160672Alexey_Fedorov2020-12-08
\n", 142 | "

161 rows × 4 columns

\n", 143 | "
" 144 | ], 145 | "text/plain": [ 146 | " product_id quantity name date\n", 147 | "0 79 3 Anton_Smirnov 2020-12-09\n", 148 | "1 33 1 Anton_Smirnov 2020-12-09\n", 149 | "2 81 3 Anton_Smirnov 2020-12-09\n", 150 | "3 70 4 Anton_Smirnov 2020-12-09\n", 151 | "4 56 5 Alexey_Smirnov 2020-12-09\n", 152 | ".. ... ... ... ...\n", 153 | "156 73 5 Alexey_Fedorov 2020-12-08\n", 154 | "157 34 1 Alexey_Fedorov 2020-12-08\n", 155 | "158 71 1 Alexey_Fedorov 2020-12-08\n", 156 | "159 18 2 Alexey_Fedorov 2020-12-08\n", 157 | "160 67 2 Alexey_Fedorov 2020-12-08\n", 158 | "\n", 159 | "[161 rows x 4 columns]" 160 | ] 161 | }, 162 | "execution_count": 3, 163 | "metadata": {}, 164 | "output_type": "execute_result" 165 | } 166 | ], 167 | "source": [ 168 | "import pandas as pd\n", 169 | "import os\n", 170 | "\n", 171 | "\n", 172 | "# checking current working directory and pointing to the folder with our data\n", 173 | "cwd = os.getcwd()\n", 174 | "data_folder = 'data_mini_project'\n", 175 | "\n", 176 | "# creating list of folders\n", 177 | "folders = []\n", 178 | "for path, folder, file in os.walk(cwd + '/' + data_folder):\n", 179 | " if file:\n", 180 | " file = file[0]\n", 181 | " if folder:\n", 182 | " folders.append(folder)\n", 183 | " \n", 184 | " \n", 185 | "# creating dict with inner folders structure (key - for date, value - for names) \n", 186 | "folders_dict = {key: [] for key in folders[0]}\n", 187 | "i = 1\n", 188 | "for key in folders_dict.keys():\n", 189 | " folders_dict[key] = folders[i]\n", 190 | " i += 1\n", 191 | "\n", 192 | "\n", 193 | "# creating an empty dataframe\n", 194 | "df = pd.DataFrame()\n", 195 | "\n", 196 | "\n", 197 | "# reading csv files one by one and appending data to our dataframe\n", 198 | "for key in folders_dict:\n", 199 | " for name in folders_dict[key]:\n", 200 | " df_temp = pd.read_csv(cwd + '/' + data_folder + '/' + key + '/' + name + '/' + file, index_col=0)\n", 201 | " df_temp['name'] = name\n", 202 | " df_temp['date'] = key\n", 203 | " df = pd.concat([df, df_temp])\n", 204 | "\n", 205 | " \n", 206 | "# resetting indexes to have an index through\n", 207 | "df.reset_index(drop=True)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "**To check that you've created a proper dataset, please, show the total quantity of items. The sum should be equal to 480**" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 546, 220 | "metadata": {}, 221 | "outputs": [ 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "480" 226 | ] 227 | }, 228 | "execution_count": 546, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "# checking the overal quantity of items\n", 235 | "df.quantity.sum()" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "**Find out the user(s) who bought the most nubmer of items. 
If there is more than one such a user, list them in alphabetical order.**" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 4, 248 | "metadata": {}, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/plain": [ 253 | "'Alexey_Smirnov, Petr_Smirnov'" 254 | ] 255 | }, 256 | "execution_count": 4, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "best_user = df.groupby(['name'], as_index=False) \\\n", 263 | " .agg({'quantity': 'sum'}) \\\n", 264 | " .sort_values('name') \\\n", 265 | " .query('quantity == quantity.max()') \\\n", 266 | " .name \\\n", 267 | " .to_list()\n", 268 | "\n", 269 | "# string representation\n", 270 | "', '.join(best_user)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "**Find top-10 sold product_ids and plot the data**" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 619, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "data": { 287 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAEHCAYAAAC0pdErAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAYXUlEQVR4nO3dfZRddX3v8fdHwAeQFjDDg0CMD5QldRWkEbFULopS4FJQqy1ZSlOFRixY8dYq1nXB4uq6ilVqxSULBUVFxAdQSlFJqRXxKhJigCBYEGOJQRLF8lC8lxv6vX+cPeUw7EkOYfY+Q+b9Wuus2fu3f/v8vjmZmc/s51QVkiRN9YRxFyBJmp0MCElSKwNCktTKgJAktTIgJEmtthx3ATNp3rx5tWDBgnGXIUmPG9dee+3Pq2qibdlmFRALFixg2bJl4y5Dkh43kvxkumXuYpIktTIgJEmtDAhJUisDQpLUyoCQJLUyICRJrToLiCS7J/lGkpuS3JjkLU37DkmWJrml+br9NOsvbvrckmRxV3VKktp1uQWxHviLqnousD9wQpK9gJOBK6pqD+CKZv5hkuwAnAq8ENgPOHW6IJEkdaOzgKiqO6pqeTN9L3ATsCtwFHBe0+084BUtq/8esLSq7qqqXwJLgUO7qlWS9Ei9XEmdZAHwfOBqYKequgMGIZJkx5ZVdgVuH5pf3bS1vfcSYAnA/PnzZ65oSXPGz/7pT3sba+eXfay3sR6rzg9SJ3kq8CXgpKq6Z9TVWtpaH31XVWdX1cKqWjgx0Xo7EUnSJug0IJJsxSAczq+qi5rmO5Ps0izfBVjbsupqYPeh+d2ANV3WKkl6uC7PYgpwDnBTVX1waNElwORZSYuBr7Ss/nXgkCTbNwenD2naJEk96XIL4gDgGOClSVY0r8OB9wIvT3IL8PJmniQLk3wcoKruAt4DXNO8TmvaJEk96ewgdVVdRfuxBICDW/ovA44bmj8XOLeb6iRJG+OV1JKkVgaEJKmVASFJamVASJJaGRCSpFYGhCSplQEhSWplQEiSWhkQkqRWBoQkqZUBIUlqZUBIkloZEJKkVgaEJKmVASFJatXZ8yCk2ejSt5ze21hHfOjtvY0ldaGzgEhyLnAEsLaqnte0XQjs2XTZDvj3qtqnZd1VwL3Ag8D6qlrYVZ2SpHZdbkF8EjgT+NRkQ1X90eR0kg8Ad29g/ZdU1c87q06StEFdPnL0yiQL2pYlCfCHwEu7Gl+S9NiM6yD1i4E7q+qWaZYXcHmSa5Ms6bEuSVJjXAepFwEXbGD5AVW1JsmOwNIkN1fVlW0dmwBZAjB//vyZr1SS5qjetyCSbAm8Crhwuj5Vtab5uha4GNhvA33PrqqFVbVwYmJipsuVpDlrHLuYXgbcXFWr2xYm2SbJtpPTwCHAyh7rkyTRYUAkuQD4DrBnktVJjm0WHc2U3UtJnp7ksmZ2J+CqJNcB3wP+saq+1lWdkqR2XZ7FtGia9j9paVsDHN5M3wbs3VVdkqTReKsNSVIrA0KS1MqAkCS1MiAkSa0MCElSKwNCktTKgJAktTIgJEmtDAhJUisDQpLUyoCQJLUyICRJrQwISVIrA0KS1MqAkCS1MiAkSa0MCElSqy4fOXpukrVJVg61vTvJT5OsaF6HT7PuoUl+mOTWJCd3VaMkaXpdbkF8Eji0pf2MqtqneV02dWGSLYCPAIcBewGLkuzVYZ2SpBadBURVXQnctQmr7gfcWlW3VdUDwOeAo2a0OEnSRo3jGMSJSa5vdkFt37J8V+D2ofnVTZskqUdb9jzeR4H3ANV8/QDwhil90rJeTfeGSZYASwDmz58/M1VuZk77o/f3NtYpF/5lb2NJm5Of/fD03sbaec+3j9Sv1y2Iqrqzqh6sqv8EPsZgd9JUq4Hdh+Z3A9Zs4D3PrqqFVbVwYmJiZguWpDms14BIssvQ7CuBlS3drgH2SPLMJE8EjgYu6aM+SdJDOtvFlOQC4CBgXpLVwKnAQUn2YbDLaBXwxqbv04GPV9XhVbU+yYnA14EtgHOr6sau6pQktessIKpqUUvzOdP0XQMcPjR/GfCIU2AlSf3xSmpJUisDQpLUyoCQJLUyICRJrQwISVKrvq+kljRL3HpGf1e9P+et/V3Nr5njFoQkqZUBIUlqZUBIkloZEJKkVgaEJKmVASFJamVASJJaGRCSpFYGhCSp1WZ7JfWrfv+43sa66B8+3ttYj1cfO/ZvexvrT895W29j6bFZ/Zljextrt9e1Po5GG+AWhCSpVWcBkeTcJGuTrBxqe3+Sm5Ncn+TiJNtNs+6qJDckWZFkWVc1SpKm1+UWxCeBQ6e0LQWeV1W/Bfwr8M4NrP+SqtqnqhZ2VJ8kaQM6C4iquhK4a0rb5VW1vpn9LrBbV+NLkh6bcR6DeAPw1WmWFXB5kmuTLNnQmyRZkmRZkmXr1q2b8SIlaa4aS0AkeRewHjh/mi4HVNW+wGHACUkOnO69qurs
qlpYVQsnJiY6qFaS5qbeAyLJYuAI4LVVVW19qmpN83UtcDGwX38VSpJgxIBIckSSxxwmSQ4F3gEcWVX3T9NnmyTbTk4DhwAr2/pKkroz6i/9o4Fbkpye5LmjrJDkAuA7wJ5JVic5FjgT2BZY2pzCelbT9+lJLmtW3Qm4Ksl1wPeAf6yqrz2Kf5MkaQaMdCV1Vb0uya8Bi4BPJCngE8AFVXXvNOssamluvZSx2aV0eDN9G7D3KHVJkroz8q02quqeJF8CngKcBLwS+Mskf19VH+6qwMezE1/xnt7GOvPL/7O3sfTYLDulv++Lhaf5faFNN+oxiCOTXAz8M7AVsF9VHcbgL31vfCNJm6FRtyBeDZzRXPz2X6rq/iRvmPmyJEnjNupB6jumhkOS9wFU1RUzXpUkaexGDYiXt7QdNpOFSJJmlw3uYkryJuDPgGcnuX5o0bbAt7ssTJI0Xhs7BvFZBvdL+l/AyUPt91bVXe2rSJI2BxsLiKqqVUlOmLogyQ6GhCRtvkbZgjgCuJbBHVYztKyAZ3VUlyRpzDYYEFV1RPP1mf2UI0maLUa9UO4Rp7K2tUmSNh8bO4vpycDWwLwk2/PQLqZfA57ecW2SpDHa2DGINzK479LTGRyHmAyIe4CPdFiXJGnMNnYM4kPAh5K82RvySdLcMurtvj+c5HeABcPrVNWnOqpLkjRmIwVEkk8DzwZWAA82zQUYEJK0mRr1bq4Lgb2me4a0JGnzM+rN+lYCOz/aN09ybpK1SVYOte2QZGmSW5qv20+z7uKmzy1JFj/asSVJj82oATEP+EGSrye5ZPI1wnqfBA6d0nYycEVV7QFcwcPv8QQMQgQ4FXghsB9w6nRBIknqxqi7mN69KW9eVVcmWTCl+SjgoGb6POBfgHdM6fN7wNLJez0lWcogaC7YlDokSY/eqGcxfXMGx9ypqu5o3veOJDu29NkVuH1ofnXT9ghJlgBLAObPnz+DZUrS3DbqrTb2T3JNkvuSPJDkwST3dFhXWtpaD5BX1dlVtbCqFk5MTHRYkiTNLaMegzgTWATcAjwFOK5p2xR3JtkFoPm6tqXPamD3ofndgDWbOJ4kaROMGhBU1a3AFlX1YFV9goeOIzxalwCTZyUtBr7S0ufrwCFJtm8OTh/StEmSejLqQer7kzwRWJHkdOAOYJuNrZTkAgZBMi/JagZnJr0X+HySY4F/A17T9F0IHF9Vx1XVXUneA1zTvNVpPpxIkvo1akAcA2wBnAi8lcHunz/Y2EpVtWiaRQe39F3GYNfV5Py5wLkj1idJmmGjnsX0k2byV8Bfd1eOJGm2GPVeTD+m5SyiqvKRo5K0mXo092Ka9GQGxw12mPlyJEmzxUhnMVXVL4ZeP62qvwNe2nFtkqQxGnUX075Ds09gsEWxbScVSZJmhVF3MX2Ah45BrAdW0ZyeKknaPI0aEJcyCIjJW2AUcEQymK2qD858aZKkcRo1IH4beAGDq54D/D5wJQ+/oZ4kaTMyakDMA/atqnsBkrwb+EJVHbfBtSRJj1uj3otpPvDA0PwDwIIZr0aSNGuMugXxaeB7SS5mcPzhlQwe9iNJ2kyNequNv0nyVeDFTdPrq+r73ZUlSRq3UbcgqKrlwPIOa5EkzSIjPw9CkjS3GBCSpFYGhCSplQEhSWrVe0Ak2TPJiqHXPUlOmtLnoCR3D/U5pe86JWmuG/kspplSVT8E9gFIsgXwU+Dilq7fqqoj+qxNkvSQce9iOhj40dAjTSVJs8S4A+Jo4IJplr0oyXVJvprkN6d7gyRLkixLsmzdunXdVClJc9DYAiLJE4EjgS+0LF4OPKOq9gY+DHx5uvepqrOramFVLZyYmOimWEmag8a5BXEYsLyq7py6oKruqar7munLgK2SzOu7QEmay8YZEIuYZvdSkp3TPI0oyX4M6vxFj7VJ0pzX+1lMAEm2Bl4OvHGo7XiAqjoLeDXwpiTrgV8BR1dVtb2XJKkbYwmIqrofeNqUtrOGps8Ezuy7LknSQ8Z9FpMkaZYyICRJrQwISVIrA0KS1MqAkCS1MiAkSa0MCElSKwNCktTKgJAktTIgJEmtDAhJUisDQpLUyoCQJLUyICRJrQwISVIrA0KS1MqAkCS1GltAJFmV5IYkK5Isa1meJH+f5NYk1yfZdxx1StJcNZZHjg55SVX9fJplhwF7NK8XAh9tvkqSejCbdzEdBXyqBr4LbJdkl3EXJUlzxTgDooDLk1ybZEnL8l2B24fmVzdtD5NkSZJlSZatW7euo1Ilae4ZZ0AcUFX7MtiVdEKSA6csT8s69YiGqrOramFVLZyYmOiiTkmak8YWEFW1pvm6FrgY2G9Kl9XA7kPzuwFr+qlOkjSWgEiyTZJtJ6eBQ4CVU7pdAvxxczbT/sDdVXVHz6VK0pw1rrOYdgIuTjJZw2er6mtJjgeoqrOAy4DDgVuB+4HXj6lWSZqTxhIQVXUbsHdL+1lD0wWc0GddkqSHzObTXCVJY2RASJJaGRCSpFYGhCSplQEhSWplQEiSWhkQkqRWBoQkqZUBIUlqZUBIkloZEJKkVgaEJKmVASFJamVASJJaGRCSpFYGhCSplQEhSWrVe0Ak2T3JN5LclOTGJG9p6XNQkruTrGhep/RdpyTNdeN45Oh64C+qanmSbYFrkyytqh9M6fetqjpiDPVJkhjDFkRV3VFVy5vpe4GbgF37rkOStGFjPQaRZAHwfODqlsUvSnJdkq8m+c0NvMeSJMuSLFu3bl1HlUrS3DO2gEjyVOBLwElVdc+UxcuBZ1TV3sCHgS9P9z5VdXZVLayqhRMTE90VLElzzFgCIslWDMLh/Kq6aOryqrqnqu5rpi8Dtkoyr+cyJWlOG8dZTAHOAW6qqg9O02fnph9J9mNQ5y/6q1KSNI6zmA4AjgFuSLKiafsrYD5AVZ0FvBp4U5L1wK+Ao6uqxlCrJM1ZvQdEVV0FZCN9zgTO7KciSVIbr6SWJLUyICRJrQwISVIrA0KS1MqAkCS1MiAkSa0MCElSKwNCktTKgJAktTIgJEmtDAhJUisDQpLUyoCQJLUyICRJrQwISVIrA0KS1MqAkCS1GktAJDk0yQ+T3Jrk5JblT0pyYbP86iQL+q9Skua23gMiyRbAR4DDgL2ARUn2mtLtWOCXVfUc4Azgff1WKUkaxxbEfsCtVXVbVT0AfA44akqfo4DzmukvAgcn2eBzrCVJMytV1e+AyauBQ6vquGb+GOCFVXXiUJ+VTZ/VzfyPmj4/b3m/JcCSZnZP4IePobx5wCPGGIPZUMdsqAFmRx2zoQaYHXXMhhpgdtQxG2qAx17HM6pqom3Blo/hTTdV25bA1JQapc+gseps4OzHWhRAkmVVtXAm3uvxXsdsqGG21DEbapgtdcyGGmZLHbOhhq7rGMcuptXA7kPzuwFrpuuTZEvg14G7eqlOkgSMJyCuAfZI8swkTwSOBi6Z0ucSYHEz/Wrgn6vvfWGSNMf1voupqtYnORH4OrAFcG5V3ZjkNGBZVV0CnAN8OsmtDLYcju6pvBnZVTUDZkMds6EGmB11zIYaYHbUMRt
qgNlRx2yoATqso/eD1JKkxwevpJYktTIgJEmt5nRAJHlLkpVJbkxy0pRlb0tSSeb1XUOS9ye5Ocn1SS5Osl3HNeyZZMXQ657hz6PHz2K7JF9s/u03JXlR0/7m5tYsNyY5vcsamvFWJbmh+SyWDbX3VkdbDUl2SLI0yS3N1+27rKEZc4sk309yaTN/YnMLnM6/H5rxnpzke0muaz73v56y/MNJ7uuhjt2TfKP5vrwxyVua9guHfm5WJVnRdw1Dy2f+57Sq5uQLeB6wEtiawcH6fwL2aJbtzuAg+k+AeX3XABwCbNn0eR/wvh4/ly2AnzG4eKa3z6IZ6zzguGb6icB2wEuaz+VJTfuOPXwGq6b+W/uuY5oaTgdObqZP7uP7AvgfwGeBS5v55wML2urraPwAT22mtwKuBvZv5hcCnwbu66GOXYB9m+ltgX8F9prS5wPAKeOooauf07m8BfFc4LtVdX9VrQe+CbyyWXYG8HamuTiv6xqq6vJmHuC7DK4V6cvBwI+q6ifNfC+fRZJfAw5kcAYbVfVAVf078CbgvVX1f5v2tV3WsQGzoY7hW9CcB7yiy8GS7Ab8d+Djk21V9f2qWtXluMNqYHILYavmVc093d7P4HuzjzruqKrlzfS9wE3ArpPLm1sB/SFwwZhq6OTndC4HxErgwCRPS7I1cDiwe5IjgZ9W1XXjqmFKnzcAX+2hlklH03yT9/xZPAtYB3yi2aXx8STbAL8BvLi5q+83k7ygh1oKuDzJtc2tXBhDHW017FRVd8DglwWwY8c1/B2DXzr/2fE4G9Ts5loBrAWWVtXVwInAJZOfR8/1LGCwJXX1UPOLgTur6pa+a+jy53Qct9qYFarqpiTvA5YC9wHXAeuBdzHYxTPOGgBI8q5m/vw+6mkuXDwSeGcTWL19Fgy+F/cF3lxVVyf5EIPdKFsC2wP7Ay8APp/kWdVsV3fkgKpak2RHYGmSm8dQR1sNvUlyBLC2qq5NclCfY09VVQ8C+zTH4i5OciDwGqD3upI8FfgScFJV3TO0aBEdbj1MVwMd/86ay1sQVNU5VbVvVR3I4IK8VcAzgeuSrGKwa2d5kp17rOEWgCSLgSOA13b8y3DYYcDyqroTeDb9fhargdXNX4cwuIvvvk37Rc2uhu8x+Gu204OjVbWm+boWuJjBHYh7rWOaGu5MsgtA87XL3VwHAEc2//efA16a5DMdjrdRzS7Hf2FwPOg5wK1NfVtncFFtp5JsxeAX8/lVddFQ+5bAq4ALx1BDpz+nczogmr/OSDKfwX/wp6pqx6paUFULGPxS2LeqftZjDRckORR4B3BkVd3f1dgt/uuvoKq6oc/Ponnf25Ps2TQdDPwA+DLwUoAkv8Hg4HVnd9BMsk2SbSenGfxltrLPOjZQw/AtaBYDX+lifICqemdV7db83x/N4HY3r+tqvOkkmWi2HEjyFOBlwLVVtfPQ9+b9NXh2TJd1hMHxsZuq6oNTFr8MuLmau0/3WUPXP6dzdhdT40tJngb8P+CEqvrlbKghyZnAkxjsWoDBgezjuyyi2aX0cuCNXY6zEW8Gzm92dd0GvB74D+DcDG4B/wCwuOMtqp0Y7MaAwc/HZ6vqa01NfdUxXQ3XMNi1dSzwbwx2s/QqyZ8zOC6xM3B9ksuquXV/R3YBzmsOSj8B+HxVXdrheNM5ADgGuGHoVNa/qqrLGDpuN8YaOuGtNiRJreb0LiZJ0vQMCElSKwNCktTKgJAktTIgJEmtDAhJUisDQpohSRY010lsyroHJfmdjfQ5Pskfz+S40obM9QvlpI1KskVzP6AuHcTgflz/e7oOVXVWxzVID+MWhOa05q/vm5Ocl8EDmr6YZOsMHv5ySpKrgNck2SfJd/PQQ5y2b9b/7QweZvMd4ISh9/2T5or4yflLJ296l+TQJMub9a5o7sx5PPDWDB488+Jpan13krdtaFxpJhkQEuwJnF1VvwXcA/xZ0/5/qup3q+pzwKeAdzR9bgBObfp8AvjzqnrRKAMlmQA+BvxBVe0NvKZ5vsJZwBlVtU9VfWuEt3pU40qbwoCQ4Paq+nYz/Rngd5vpCwGS/DqwXVV9s2k/j8FzPKa2f3qEsfYHrqyqHwNU1V2PtthNHFd61AwI6ZFP4Zqc/4+NrJeWdSet5+E/X08eYZ1RzcR7SBtlQEgwP8nkrppFwFXDC6vqbuCXQ8cGjgG+2Tyf4O4kk1scrx1abRWDh9w8IcnuDJ7nAPAd4L8leSZAkh2a9nsZPGd4ozYyrjRjDAhp8GzfxUmuB3YAPtrSZzHw/qbPPsBpTfvrgY80B4t/NdT/28CPGRyv+Ftg8lnC64AlwEVJruOhh8z8A/DKDR2knmK6caUZ4+2+Nac1ZxBdWlXPG3Mp0qzjFoQkqZVbENIsk+RdPPJpcV+oqr8ZRz2auwwISVIrdzFJkloZEJKkVgaEJKmVASFJavX/AZiwjjwQNhBxAAAAAElFTkSuQmCC\n", 288 | "text/plain": [ 289 | "
" 290 | ] 291 | }, 292 | "metadata": { 293 | "needs_background": "light" 294 | }, 295 | "output_type": "display_data" 296 | } 297 | ], 298 | "source": [ 299 | "import seaborn as sns\n", 300 | "import matplotlib.pyplot as plt\n", 301 | "%matplotlib inline\n", 302 | "\n", 303 | "\n", 304 | "top_10 = df.groupby(['product_id'], as_index=False) \\\n", 305 | " .quantity.sum() \\\n", 306 | " .sort_values('quantity', ascending=False).head(10)\n", 307 | "\n", 308 | "# a list for reverse id oreder sorting\n", 309 | "for_order = sorted([24, 27, 34, 41, 50, 56, 66, 74, 92, 94], reverse=True)\n", 310 | "\n", 311 | "# alpha - transparency parameter\n", 312 | "ax = sns.barplot(x='product_id', y='quantity',data=top_10, palette='inferno', alpha=0.75, order=for_order)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "**Check daily sales of items. Plot the data.**" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 638, 325 | "metadata": {}, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtMAAAHgCAYAAABn8uGvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAakklEQVR4nO3df7Tt93zn8ddbLkVFk8hlUgkJK1UZq4o7YZhODTXFIDGiK5ZqaEyGalC1ijFTxozlx8z4UTU6afwIS/0KVozlx9KQ0VLRG+L3RIIgZLj1I9HqotHP/LG/dzluzk32fTvnfM859/FY66yz93d/99nv+7ln5T6zz+fsXWOMAAAAB+4Gcw8AAABblZgGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGjaMfcAP40jjzxyHHvssXOPAQDANnfRRRf9zRhj577Ht3RMH3vssdm9e/fcYwAAsM1V1ZdXO26bBwAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQtGPuAdjevv1fPjr3CFvOEf/pxLlHAACW5JlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBp3WK6ql5VVd+sqk+vOHZEVb2vqi6dPh8+Ha+q+qOquqyqPllVd12vuQAAYK2s5zPTr0ly/32OPT3J+WOM45OcP11PkgckOX76OCPJK9ZxLgAAWBPrFtNjjA8m+fY+h09Kcs50+ZwkJ684/tqx8JEkh1XVUes1GwAArIWN3jN9qzHGlUkyfb7ldPzWSb664rwrpmPXUlVnVNXuqtq9Z8+edR0WAACuy2b5BcRa5dhY7cQxxlljjF1jjF07d+5c57EAAGD/Njqmv7F3+8b0+ZvT8SuSHLPivKOTfH2DZwMAgAOy0TH9jiSnTZdPS3LeiuO/Nb2qxz2SXLV3OwgAAGxWO9brC1fVG5LcO8mRVXVFkmcleX6SN1fV6Um+kuTh0+nvSvLAJJcl+X6Sx6zXXAAAsFbWLabHGI/Yz033XeXckeQJ6zULAACsh83yC4gAALDliGkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJp2zD0AAKy1q/78FXOPsOX83K89fu4RYEvyzDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATbPEdFX9XlV9pqo+XVVvqKobV9VxVXVhVV1aVW+qqhvNMRsAACxrw2O6qm6d5IlJdo0x7pTkkCSnJnlBkhePMY5P8p0kp2/0bAAAcCDm2uaxI8lNqmpHkpsmuTLJfZKcO91+TpKTZ5oNAACWsuExPcb4WpL/nuQrWUT0VUkuSvLdMcY102lXJLn1Rs8GAAAHYo5tHocnOSnJcUl+PsnPJnnAKqeO/dz/jKraXVW79+zZs36DAgDA9Zhjm8evJfnSGGPPGOMfkrwtyT2THDZt+0iSo5N8fbU7jzHOGmPsGmPs2rlz58ZMDAAAq5gjpr+S5B5VddOqqiT3TfLZJB9Icsp0zmlJzpthNgAAWNoce6YvzOIXDT+W5FPTDGcleVqSp1TVZUlukeSVGz0bAAAciB3Xf8raG2M8K8mz9jn8xSQnzjAOAAC0eAdEAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwB
Ak5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgacfcAwAAsHb2XHLm3CNsKTvv8LKf6v6emQYAgCYxDQAATWIaAACa7JkGANbUtz/6nLlH2FKOOPEP5x6Bn4JnpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATUvFdFU9qKqENwAArLBsIJ+a5NKqemFV3XE9BwIAgK1iqZgeY/xmkrsk+UKSV1fVX1XVGVV16LpOBwAAm9jSWzfGGFcneWuSNyY5KslDk3ysqs5cp9kAAGBTW3bP9EOq6u1J3p/khklOHGM8IMmdkzx1HecDAIBNa8eS552S5MVjjA+uPDjG+H5V/fbajwUAAJvfsts8rtw3pKvqBUkyxjh/zacCAIAtYNmYvt8qxx6wloMAAMBWc53bPKrq8Ul+J8ntq+qTK246NMmH1nMwAADY7K5vz/SfJXl3kuclefqK498bY3x73aYCAIAt4PpieowxLq+qJ+x7Q1UdIagBADiYLfPM9IOSXJRkJKkVt40kt1unuQAAYNO7zpgeYzxo+nzcWj5oVR2W5Owkd8oiyn87ySVJ3pTk2CSXJ/mNMcZ31vJx4WBz1Z+8b+4RtpSfe9xqv2sNAPu37Ju2XOvl71Y7dgBemuQ9Y4xfzOKNXz6XxZ7s88cYxyc5Pz+5RxsAADad63s1jxsnuWmSI6vq8Px4m8fNk/x85wGr6uZJ/mWSRyfJGOOHSX5YVSclufd02jlJLkjytM5jAADARri+PdP/PsmTswjni/LjmL46ycubj3m7JHuSvLqq7jx93ScludUY48okGWNcWVW3bH59AADYENe3Z/qlSV5aVWeOMV62ho951yRnjjEurKqX5gC2dFTVGUnOSJLb3OY2S91nzxMvaYx58Nr5R3eYewQAgC3h+p6ZTpKMMV5WVffM4pcDd6w4/trGY16R5IoxxoXT9XOziOlvVNVR07PSRyX55n5mOSvJWUmya9eu0Xh8AABYE0vFdFW9Lsntk1yc5EfT4ZHkgGN6jPH/quqrVXWHMcYlSe6b5LPTx2lJnj99Pu9AvzYAAGykpWI6ya4kJ4wx1uqZ4DOTvL6qbpTki0kek8Uri7y5qk5P8pUkD1+jxwIAgHWxbEx/Osk/SXLlWjzoGOPiLAJ9X/ddi68PAAAbYdmYPjLJZ6vqo0l+sPfgGOMh6zIVAABsAcvG9LPXcwgAANiKln01j/+z3oMAAMBWs+zbid+jqv66qv62qn5YVT+qqqvXezgAANjMlorpJH+c5BFJLk1ykySPnY4BAMBBa9k90xljXFZVh4wxfpTFW4F/eB3nAgCATW/ZmP7+9JrQF1fVC7N4ibyfXb+xAABg81t2m8ejkhyS5HeT/F2SY5I8bL2GAgCArWDZV/P48nTx75P85/UbBwAAto6lYrqqvpTkWm8lPsa43ZpPBAAAW8Sye6ZXvvX3jZM8PMkRaz8OAABsHUvtmR5jfGvFx9fGGC9Jcp91ng0AADa1Zbd53HXF1Rtk8Uz1oesyEQAAbBHLbvP4H/nxnulrklyexVYPAAA4aC0b0+/MIqZruj6SPKhqcXWM8aK1Hw0AADa3ZWP6bkn+WZLzsgjqByf5YJKvrtNcAACw6S0b00cmuesY43tJUlXPTvKWMcZj12swAADY7JZ9B8TbJPnhius/THLsmk8DAABbyLLPTL8uyUer6u1Z7Jd+aJJz1m0qAADYApZ9O/HnVtW7k/zKdOgxY4yPr99YAACw+S37zHTGGB9L8rF1nAUAALaUZfdMAwAA+xDTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgabaYrqpDqurjVfXO6fpxVXVhVV1aVW+qqhvNNRsAACxjzmemn5TkcyuuvyDJi8cYxyf5TpLTZ5kKAACWNEtMV9XRSf5NkrOn65XkPknOnU45J8nJc8wGAADLmuuZ6Zck+YMk/zhdv0WS744xrpmuX5Hk1nMMBgAAy9rwmK6qByX55hjjopWHVzl17Of+Z1TV7qravWfPnnWZEQAAljHHM9P3SvKQqro8yRuz2N7xkiSHVdWO6Zyjk3x9tTuPMc4aY+waY+zauXPnRswLAACr2vCYHmM8Y4xx9Bjj2CSnJnn/GOORST6Q5JTptNOSnLfRswEAwIHYTK8z/bQkT6mqy7LYQ/3KmecBAIDrtOP6T1k/Y4wLklwwXf5ikhPnnAcAAA7EZnpmGgAAthQxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAICmDY/pqjqmqj5QVZ+rqs9U1ZOm40dU1fuq6tLp8+EbPRsAAByIOZ6ZvibJ748x7pjkHkmeUFUnJHl6kvPHGMcnOX+6DgAAm9aGx/QY48oxxsemy99L8rkkt05yUpJzptPOSXLyRs8GAAAHYtY901V1bJK7JLkwya3GGFcmi+BOcsv93OeMqtpdVbv37NmzUaMCAMC1zBbTVXWzJG9N8uQxxtXL3m+McdYYY9cYY9fOnTvXb0AAALges8R0Vd0wi5B+/RjjbdPhb1TVUdPtRyX55hyzAQDAsuZ4NY9K8soknxtjvGjFTe9Ictp0+bQk5230bAAAcCB2zPCY90ryqCSfqqqLp2P/Icnzk7y5qk5P8pUkD59hNgAAWNqGx/QY4y+T1H5uvu9GzgIAAD8N74AIAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0LRj7gEAtqu/feu5c4+wpdzsYafMPQLAAfPMNAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQ
CgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABoEtMAANAkpgEAoElMAwBAk5gGAIAmMQ0AAE1iGgAAmsQ0AAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDQAADSJaQAAaBLTAADQJKYBAKBJTAMAQJOYBgCAJjENAABNYhoAAJrENAAANIlpAABo2lQxXVX3r6pLquqyqnr63PMAAMB12TQxXVWHJHl5kgckOSHJI6rqhHmnAgCA/ds0MZ3kxCSXjTG+OMb4YZI3Jjlp5pkAAGC/NlNM3zrJV1dcv2I6BgAAm1KNMeaeIUlSVQ9P8utjjMdO1x+V5MQxxpn7nHdGkjOmq3dIcsmGDrq2jkzyN3MPcRCz/vOx9vOy/vOy/vOx9vPa6ut/2zHGzn0P7phjkv24IskxK64fneTr+540xjgryVkbNdR6qqrdY4xdc89xsLL+87H287L+87L+87H289qu67+Ztnn8dZLjq+q4qrpRklOTvGPmmQAAYL82zTPTY4xrqup3k7w3ySFJXjXG+MzMYwEAwH5tmphOkjHGu5K8a+45NtC22K6yhVn/+Vj7eVn/eVn/+Vj7eW3L9d80v4AIAABbzWbaMw0AAFuKmF6hqo6pqg9U1eeq6jNV9aTp+BFV9b6qunT6fPh0/JFV9cnp48NVdecVX2upt0avqvdU1Xer6p37HH/9dP9PV9WrquqG+7n/cVV14TTbm6Zf3kxVPa6qPlVVF1fVX272d5PcTmu/4vZTqmpU1ab/zeXttP5V9eiq2jN9719cVY9dizVaT9tp/afbfqOqPjv9Wf7sp12f9bSd1r6qXrzi+/7zVfXdtVij9bTN1v8205/l49N8D1yLNVpP22z9b1tV50+zXVBVR6/FGi1ljOFj+khyVJK7TpcPTfL5LN7a/IVJnj4df3qSF0yX75nk8OnyA5JcOF0+JMkXktwuyY2SfCLJCft5zPsmeXCSd+5z/IFJavp4Q5LH7+f+b05y6nT5T/ael+TmK855SJL3zL2+B8var/gzfDDJR5Lsmnt9D6b1T/LoJH8895oexOt/fJKPr5jvlnOv78Gy9vucc2YWv8g/+xofLOufxX7gvZdPSHL53Ot7kK3/W5KcNl2+T5LXbdg6zv0XuZk/kpyX5H5ZvDHMUSu+8S5Z5dzDk3xtuvzPk7x3xW3PSPKM63ice+/7TbXP7b+X5LmrHK8sXvx8x2qPu+K8RyR599zreTCtfZKXJHlQkguyBWJ6O61/tmBMb7P1f2GSx869hgfj2u9z3oeT3G/u9TyY1j/J/0rytBXHPzz3eh5k6/+ZJEevOO/qjVo32zz2o6qOTXKXJBcmudUY48okmT7fcpW7nJ7k3dPlNXtr9OnHHI9K8p5Vbr5Fku+OMa5Z7XGq6glV9YUs/nF7Yufx57DV176q7pLkmDHGO1e536a31dd/8rDpR33nVtUx17775rUN1v8XkvxCVX2oqj5SVffvPP4ctsHa773/bZMcl+T9ncefyzZY/2cn+c2quiKLVyY789p337y2wfp/IsnDpssPTXJoVd2iM8OBEtOrqKqbJXlrkiePMa5e4vx/lcU31dP2HlrltNEc538m+eAY4y9We+jrepwxxsvHGLef5vqPzcffUFt97avqBklenOT3m485q62+/tPn/53k2DHGLyX58yTnNB9/w22T9d+RxVaPe2fxU7Gzq+qw5gwbZpus/V6nJjl3jPGj5uNvuG2y/o9I8poxxtFZbFl43fRvwqa3Tdb/qUl+tao+nuRXk3wtyTWrnL/mtsRf8kaa/o/orUleP8Z423T4G1V11HT7UUm+ueL8X0pydpKTxhjfmg6v+tboVXX3+vEvhzxkiVmelWRnkqesOPbe6f5nZ/GjjsOqau/rha/6FuxJ3pjk5Ot7vLltk7U/NMmdklxQVZcnuUeSd9TW+CXE7bD+GWN8a4zxg+n4nya52/KrMJ/tsv7TDOeNMf5hjPGlLH5cfPzyK7HxttHa73VqFntOt4RttP6nZ7GfN2OMv0py4yRHLrsOc9ku6z/G+PoY49+OMe6S5JnTsasOaDG6Nmo/yVb4yOL/eF6b5CX7HP9v+cmN+C+cLt8myWVJ7rnP+TuSfDGLH7Pt3Yj/T6/jce+da2/Ef2wWe95ucj0zvyU/uRH/d6bLx68458FJds+9vgfL2u9zzgXZAnumt9P6Z9rnN11+aJKPzL2+B9n63z/JOdPlI7P40e8t5l7jg2Htp+t3SHJ5sngfic3+sZ3WP4stD4+eLt8xi8jb1H8P22z9j0xyg+nyc5M8Z8PWce6/yM30keRfZPHjgk8muXj6eGAWe3TOT3Lp9PmI6fyzk3xnxbm7V3ytB2bxW7FfSPLM63jMv0iyJ8nfZ/F/dr8+Hb9muu/er/2H+7n/7ZJ8dPrmfkuSn5mOvzSLzfgXJ/nAdX1Tb4aP7bT2+5xzQbZGTG+b9U/yvOl7/xPT9/4vzr2+B9n6V5IXJflskk9l+kdvs35sp7Wfbnt2kufPva4H4/pn8SoYH8rivz0XJ/nXc6/vQbb+p0zzfn6a81r/Jq/Xh3dABACAJnumAQCgSUwDAECTmAYAgCYxDQAATWIaAACaxDTANlRVz66qp17H7SdX1QkbORPAdiSmAQ5OJ2fxurgA/BS8zjTANlFVz0zyW1m86+CeJBcluSrJGVm8K9llSR6V5JeTvHO67aokD5u+xMuzeCvf7yf5d2OM/7uR8wNsRWIaYBuoqrsleU2Su2fx1r4fy+Ktdl89xvjWdM5/TfKNMcbLquo1Wbyd77nTbecnedwY49KqunuS540x7rPxfxKArWXH3AMAsCZ+JcnbxxjfT5Kqesd0/E5TRB+W5GZJ3rvvHavqZknumeQtVbX38M+s+8QA24CYBtg+VvtR42uSnDzG+ERVPTrJvVc55wZJvjvG+OX1Gw1ge/ILiADbwweTPLSqblJVhyZ58HT80CRXVtUNkzxyxfnfm27LGOPqJF+qqocnSS3ceeNGB9i67JkG2CZW/ALil5NckeSzSf4uyR9Mxz6V5NAxxqOr6l5J/jTJD5KckuQfk7wiyVFJbpjkjWOM52z4HwJgixHTAADQZJsHAAA0iWkAAGgS0wAA0CSmAQCgSUwDAECTmAYAgCYxDQAATWIaAACa/j9rrejWqIw1FwAAAABJRU5ErkJggg==\n", 330 | "text/plain": [ 331 | "
" 332 | ] 333 | }, 334 | "metadata": { 335 | "needs_background": "light" 336 | }, 337 | "output_type": "display_data" 338 | } 339 | ], 340 | "source": [ 341 | "days = df.groupby('date', as_index=False).agg({'quantity': 'sum'})\n", 342 | "\n", 343 | "plt.figure(figsize=(12, 8))\n", 344 | "ax = sns.barplot(x='date', y='quantity', data=days, palette='spring', alpha=0.75)" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "**Find users who has bought the same product more than once in different days**" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 672, 357 | "metadata": {}, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/html": [ 362 | "
\n", 363 | "\n", 376 | "\n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | "
nameproduct_iddate
37Anton_Ivanov152
92Petr_Fedorov942
\n", 400 | "
" 401 | ], 402 | "text/plain": [ 403 | " name product_id date\n", 404 | "37 Anton_Ivanov 15 2\n", 405 | "92 Petr_Fedorov 94 2" 406 | ] 407 | }, 408 | "execution_count": 672, 409 | "metadata": {}, 410 | "output_type": "execute_result" 411 | } 412 | ], 413 | "source": [ 414 | "fav_products = ( \n", 415 | " df.reset_index(drop=True) \n", 416 | " .drop_duplicates(subset=['product_id', 'date', 'name']) # removing duplicates by product_id, date, name -- in case someone bought the same product more than once per day\n", 417 | " .groupby(['name', 'product_id'], as_index=False) # grouping by name and product_id\n", 418 | " .agg({'date': 'count'}) # counting number of different dates when name and product_id are the same\n", 419 | " .query('date > 1') # filtering values less than one two obtain users who bought same products on different days\n", 420 | " ) \n", 421 | "\n", 422 | "fav_products" 423 | ] 424 | } 425 | ], 426 | "metadata": { 427 | "kernelspec": { 428 | "display_name": "Python 3", 429 | "language": "python", 430 | "name": "python3" 431 | }, 432 | "language_info": { 433 | "codemirror_mode": { 434 | "name": "ipython", 435 | "version": 3 436 | }, 437 | "file_extension": ".py", 438 | "mimetype": "text/x-python", 439 | "name": "python", 440 | "nbconvert_exporter": "python", 441 | "pygments_lexer": "ipython3", 442 | "version": "3.7.6" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 4 447 | } 448 | -------------------------------------------------------------------------------- /6_retail_in_germany/README.md: -------------------------------------------------------------------------------- 1 | This is the sixth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Retail in Germany -- having a dataset with purchases of clients from Europe. Count basic sales statistics for clients from Germany. Duplicated, drop_duplicates, groupby, agg, query, sort_values, assign, quantile and str methods were used for Exploratory Data Analysis. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Fill free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /7_error_in_transaction_data/README.md: -------------------------------------------------------------------------------- 1 | This is the seventh dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Error in Transactions Data -- we've found and corrected an error while analysing a dataset with transactions. Plotting data in logarithmic scale, converting data to datetime format as well as implementing describe, isna, sum, value_counts, groupby, agg, query, sort_values, rename, min, max and pivot methods were used for Exploratory Data Analysis. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 
 425 | ], 426 | "metadata": { 427 | "kernelspec": { 428 | "display_name": "Python 3", 429 | "language": "python", 430 | "name": "python3" 431 | }, 432 | "language_info": { 433 | "codemirror_mode": { 434 | "name": "ipython", 435 | "version": 3 436 | }, 437 | "file_extension": ".py", 438 | "mimetype": "text/x-python", 439 | "name": "python", 440 | "nbconvert_exporter": "python", 441 | "pygments_lexer": "ipython3", 442 | "version": "3.7.6" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 4 447 | } 448 | -------------------------------------------------------------------------------- /6_retail_in_germany/README.md: -------------------------------------------------------------------------------- 1 | This is the sixth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Retail in Germany -- having a dataset with purchases of clients from Europe, counting basic sales statistics for clients from Germany. Duplicated, drop_duplicates, groupby, agg, query, sort_values, assign, quantile and str methods were used for Exploratory Data Analysis. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /7_error_in_transaction_data/README.md: -------------------------------------------------------------------------------- 1 | This is the seventh dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Error in Transactions Data -- we've found and corrected an error while analysing a dataset with transactions. Plotting data in logarithmic scale, converting data to datetime format, as well as applying describe, isna, sum, value_counts, groupby, agg, query, sort_values, rename, min, max and pivot methods were used for Exploratory Data Analysis. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /8_avocado_price/README.md: -------------------------------------------------------------------------------- 1 | This is the eighth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Avocado Price -- Comparing avocado average, simple moving average and exponential weighted average price values. Categorizing delay data and labelling it. Plotting results with help of subplots and interactive Plotly plots. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through. 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /8_avocado_price/avocado_EWA.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/8_avocado_price/avocado_EWA.jpg -------------------------------------------------------------------------------- /8_avocado_price/avocado_SMA.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/8_avocado_price/avocado_SMA.jpg -------------------------------------------------------------------------------- /8_avocado_price/plotly_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/8_avocado_price/plotly_result.png -------------------------------------------------------------------------------- /9_ads_campaign/README.md: -------------------------------------------------------------------------------- 1 | This is the ninth dataset analysed by me while passing [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 2 | 3 | 4 | Ads Campaign -- Plotting data in logarithmic scale to find the type of data distribution, finding the ad_id with an anomalous number of views. Comparing average and simple moving average views data. Categorizing clients' registration data and labelling it. Plotting results with help of interactive Plotly plot. 5 | 6 | 7 | 8 | Hope this repo will help you to assess my coding, data analytics and SQL skills or will be just fun for you to look through.
 9 | 10 | 11 | 12 | -------------------------------------------- 13 | Feel free to contact me via nktn.lx@gmal.com 14 | Follow me on twitter: @nktn_lx 15 | And here on github: github.com/nktnlx -------------------------------------------------------------------------------- /9_ads_campaign/plotly_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/nktnlx/data_analysis_course/5aa4ef8983b6ba5dc15e370750fa5b9022407918/9_ads_campaign/plotly_result.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Hi there! 2 | 3 | 4 | I'm learning to code, interested in data analytics and data science, aiming to be hired by mid-2021. 5 | This repository is for [data analysis course](https://karpov.courses/) I've enrolled in January 2021. 6 | 7 | The course curriculum includes the following technologies and topics mastered by me: 8 | 1. Python 9 | 2. Pandas 10 | 3. Numpy 11 | 4. Seaborn 12 | 5. Google APIs 13 | 6. Git 14 | 7. Airflow 15 | 8. SQL 16 | 9. ClickHouse 17 | 10. PostgreSQL 18 | 11. Redash 19 | 12. Superset 20 | 13. Statistics 21 | 14. A/B-tests 22 | 15. Bootstrapping 23 | 16. Amplitude 24 | 17. Tableau 25 | 18. DAU, MAU, ARPU, LTV, Retention, CR and other metrics 26 | 19. Product Development basics 27 | 20. Product Management basics 28 | 21. Soft-skills 29 | 30 | 31 | 32 | 33 | 34 | **List of projects:** 35 | 1. [Taxi in NYC](https://github.com/nktnlx/data_analysis_course/tree/main/1_taxi_in_nyc) -- analysing NYC taxi orders with Pandas. Read_csv, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts, pivot methods were used for **Exploratory Data Analysis**. 36 | 2. [Hotel Bookings](https://github.com/nktnlx/data_analysis_course/tree/main/2_hotel_bookings) -- analysing hotel bookings with Pandas. Read_csv, info, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts, pivot methods were used for Exploratory Data Analysis. Customers **Churn rate** was calculated. 37 | 3. [User Logs](https://github.com/nktnlx/data_analysis_course/tree/main/3_user_logs) -- analysing customer data. Finding the most popular platform and the most active users. Visualizing data with **Seaborn** distplot, barplot and countplot methods. 38 | 4. [Taxi in Peru](https://github.com/nktnlx/data_analysis_course/tree/main/4_taxi_peru) -- analysing taxi orders in Peru with Pandas. An Exploratory Data Analysis was performed. Drivers' score, passengers' score, **DAU** and **MAU** metrics were calculated and plotted with Seaborn. 39 | 5. [Raw Data Handling](https://github.com/nktnlx/data_analysis_course/tree/main/5_raw_data_handling) -- creating a dataframe from a set of csv-files stored in various folders. Practicing Python skills to **automate data handling**. 40 | 6. [Retail in Germany](https://github.com/nktnlx/data_analysis_course/tree/main/6_retail_in_germany) -- having a dataset with purchases of clients from Europe, counting basic **sales statistics** for clients from Germany. Duplicated, drop_duplicates, groupby, agg, query, sort_values, assign, quantile and str methods were used for Exploratory Data Analysis. 41 | 7. [Error in Transactions Data](https://github.com/nktnlx/data_analysis_course/tree/main/7_error_in_transaction_data) -- we've found and corrected an error while analysing a dataset with transactions.
Plotting data in logarithmic scale, converting data to datetime format, as well as applying describe, isna, sum, value_counts, groupby, agg, query, sort_values, rename, min, max and **pivot** methods were used for Exploratory Data Analysis. 42 | 8. [Avocado Price](https://github.com/nktnlx/data_analysis_course/tree/main/8_avocado_price) -- comparing avocado average, simple **moving average** and exponential weighted average price values. Categorizing delay data and labeling it. Plotting results with help of subplots and interactive **Plotly** plots. 43 | 9. [Ads Campaign](https://github.com/nktnlx/data_analysis_course/tree/main/9_ads_campaign) -- plotting data in logarithmic scale to find the type of data distribution, finding the ad_id with an anomalous number of views. Comparing average and simple moving average views data. Calculating clients' registration to publishing ad conversion rate (**CR**). Categorizing clients' registration data and labeling it. Plotting results with help of interactive Plotly plot. 44 | 10. [Visits by Browser](https://github.com/nktnlx/data_analysis_course/tree/main/10_visits_by_browser) -- analysing web-site visits. Defining proportion of real users and visits by bots. Finding the most popular browser for users and for bots. Bar-plotting results, downloading data using **Google Docs API** and merging it to our dataframe. Read_csv, groupby, agg, query, sort_values, pivot, fillna, assign and merge methods were used for Exploratory Data Analysis. 45 | 11. [Telegram Bot Airflow Reporting](https://github.com/nktnlx/data_analysis_course/tree/main/11_telegram_bot_airflow_reporting) -- reading advertising campaign data from a Google Docs spreadsheet, creating pandas dataframe to calculate clicks, views, **CTR** and money spent on the campaign. Calculating day by day change of the metrics, writing report with results to a txt file and sending this file via **telegram bot** to your mobile phone. The script is executed by **Airflow** every Monday at 12:00 p.m. 46 | 12. [SQL Tasks](https://github.com/nktnlx/data_analysis_course/tree/main/12_sql_task) -- **SQL** exercises done by me while passing this data analysis course. Clickhouse (via Tabix) was used to solve the tasks. 47 | 13. [NYC taxi & timeit optimization](https://github.com/nktnlx/data_analysis_course/tree/main/13_nyc_taxi_timeit_optimization) -- calculating distance of a ride using pick-up and drop-off coordinates. Compared a couple of ways to apply distance calculation to the dataframe. The optimization helped to decrease calculation run-time about 3276 times! Checked calculation results, found outliers using boxplot graphs and **descriptive statistics**. Fixed dataframe by removing outliers and found the cost of the longest ride. 48 | 14. [Bikes rent in Chicago](https://github.com/nktnlx/data_analysis_course/tree/main/14_bikes_rent_chicago) -- dates to dateformat conversion, **resampling data** to aggregate by days, automatically merging data from distinct files into one dataframe using os.walk(), differentiating bike rents by user type, finding the most popular destination points overall and based on the day of the week. 49 | 15. [Bookings in London](https://github.com/nktnlx/data_analysis_course/tree/main/15_booking_in_london) -- used **Pandahouse** and SQL queries to import data from Clickhouse into pandas dataframe. Processed imported data and performed Exploratory Data Analysis. Built scatterplot, distplot, lineplot and heatmap using Seaborn and Matplotlib. 50 | 16.
[Retail dashboard](https://github.com/nktnlx/data_analysis_course/tree/main/16_retail_dashboard) -- built a series of visualizations and a **dashboard** using SQL queries and **Redash**. Calculated and checked the dynamics of **MAU** and **AOV**. Found an anomaly in the data, identified the market generating the majority of revenue, analyzed the most popular goods sold in the store. Wrote a dashboard summary with recommendations to boost sales. 51 | 17. [Video games](https://github.com/nktnlx/data_analysis_course/tree/main/17_video_games) -- analyzing video game sales dynamics with Pandas. Read_csv, head, columns, dtypes, info, isna, dropna, describe, mode, shape, groupby, agg, sort_values, rename, index, to_list, value_counts methods were used for **Exploratory Data Analysis**. Barplot, boxplot and lineplot were used for graphing the results. 52 | 18. [Ads conversion](https://github.com/nktnlx/data_analysis_course/tree/main/18_ads_conversion) -- calculating **CTR, CPC, CR** metrics (see the metrics sketch below). Plotting them using distplot, hist, displot and histplot methods. 53 | 19. [Yandex Music](https://github.com/nktnlx/data_analysis_course/tree/main/19_yandex_music) -- analyzing the popularity of songs on a music streaming platform, comparing music preferences and listening patterns in Moscow and Saint Petersburg. Reading and **cleaning data**, renaming columns, removing duplicates, dealing with missing data, slicing the dataframe to query the required portion of data. 54 | 20. [Bikes rent in London](https://github.com/nktnlx/data_analysis_course/tree/main/20_bikes_rent_london) -- loading the dataset, plotting ride-count data, resampling timestamps, describing the main trends, looking for anomalous values by smoothing the data with a **simple moving average**, calculating the difference between the real and smoothed data, finding the **standard deviation** and defining the **99% confidence interval**. Values were then compared against the confidence interval to find data spikes and explain them (the smoothing sketch below illustrates this pipeline). 55 | 21. [Delivery A/B](https://github.com/nktnlx/data_analysis_course/tree/main/21_delivery_ab) -- finding how a new navigation algorithm changed the delivery time of the service. Formulating null and alternative hypotheses and performing an A/B test with the help of a **t-test** (see the t-test sketch below). 56 | 22. [App interface A/B](https://github.com/nktnlx/data_analysis_course/tree/main/22_app_interface_ab) -- testing how image aspect ratio and a new order button design influence the number of orders placed by customers. Performed **Levene's test** to check the equality of group variances, the **Shapiro-Wilk test** to check the groups for normality, **one-way ANOVA** to check for a statistically significant difference between the tested groups, and **Tukey's test** to find which groups differ significantly; ran a linear-model multivariate analysis of variance, visualized and interpreted the results, and gave recommendations on whether to put the changes into production. 57 | 23. [Cars sales](https://github.com/nktnlx/data_analysis_course/tree/main/23_cars_sales) -- predicting car sale prices using **linear regression models** (statsmodels.api & statsmodels.formula.api). Finding statistically significant predictors. 58 | 24. [Bootstrap A/B](https://github.com/nktnlx/data_analysis_course/tree/main/24_bootstrap_ab) -- comparing the results of the **Mann-Whitney test** and **bootstrap** mean/median estimates on data with and without outliers (see the bootstrap sketch below). 59 | 25. [Mobile App A/A](https://github.com/nktnlx/data_analysis_course/tree/main/25_mobile_app_aa) -- running an **A/A test** to check that the data splitting system works correctly. 
At first the test failed (the FPR was greater than the significance level), so we had to dig into the data and find the cause of the malfunction; after removing the corrupted data, the A/A test passed (see the A/A simulation sketch below). 60 | 26. [Taxi Churn](https://github.com/nktnlx/data_analysis_course/tree/main/26_taxi_churn) -- performing Exploratory Data Analysis, defining churn, checking distributions for normality with the **Shapiro-Wilk test**, plotting data using Plotly and A/B testing four different hypotheses with the **Chi-squared test, Dunn's test and the Mann-Whitney U non-parametric test**. 61 | 27. [A/B simulation](https://github.com/nktnlx/data_analysis_course/tree/main/27_ab_simulation) -- performed a range of A/B tests to simulate how sample size and the magnitude of the difference between samples influence A/B test performance. Investigated situations in which a false positive error could occur. Gained valuable **lessons on A/B test performance**. 62 | 28. [Sales Monthly Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/SalesMonthlyOverviewpractice1/Dashboard1) -- **Tableau Public dashboard** consisting of: KPIs, a line chart, a bar chart, and a table by category with bar charts. 63 | 29. [Profit Monthly Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/ProfitMonthlyOverviewpractice2/ProfitMonthlyOverview) -- **Tableau Public dashboard** consisting of: KPIs, a line chart, a bar chart, a table by region with bar charts, and profit ratio by category with horizontal bar charts. 64 | 30. [Analytics Vacancies Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/AnalyticsVacanciesOverviewpractice3/Dashboard1) -- **Tableau Public dashboard** consisting of: a horizontal bar chart, a pie chart, a boxplot and a bubble chart. 65 | 31. [Sales Overview](https://public.tableau.com/profile/nktn.lx#!/vizhome/SalesOverviewpractice4/published) -- **Tableau Public dashboard** consisting of: horizontal bar tables, sparklines, a KPI, line charts and various filters and sortings to display the data. 66 | 32. [Airbnb Listings Analytics](https://public.tableau.com/profile/nktn.lx#!/vizhome/LondonAirbnbListingsAnalyticalDashboardpractice5/Dashboard1) -- **Tableau Public dashboard** consisting of: a calculated rental property occupation rate; an analytical chart to choose the best property by occupation rate, review score and price per night; a ranked table of the top 10 listings by calculated potential annual revenue; average price, average occupation rate and number of unique listings KPIs; filters by neighbourhood, occupation rate and number of reviews over the last twelve months. 67 | 33. [Metrics calculations](https://github.com/nktnlx/data_analysis_course/tree/main/33_metrics_calc) -- Google Analytics data cleaning and calculation of the following **metrics**: number of unique users, conversion, average check, average purchases per user, **ARPPU, ARPU** (the metrics sketch below covers ARPU and ARPPU as well). 68 | 34. [Retention Analysis](https://public.tableau.com/profile/nktn.lx#!/vizhome/RetentionAnalysispractice6/Dashboard1) -- **Tableau Public dashboard** containing user retention and ARPU highlight tables (the retention sketch below shows how such a table can be prepared in pandas). 69 | 35. [RFM analysis](https://github.com/nktnlx/data_analysis_course/tree/main/35_rfm_analysis) -- performed an **RFM analysis**, built an **LTV** heatmap and found insights about user segmentation (see the RFM sketch below). 70 | 36. [Probabilities](https://github.com/nktnlx/data_analysis_course/tree/main/36_probabilities) -- solving **probability theory problems** including AND/OR probabilities, Bernoulli trials and conditional probability (Bayes' theorem). 71 | 37. 
[Final project](https://github.com/nktnlx/data_analysis_course/tree/main/37_final_project) -- you're employed by a mobile games development company. A Product Manager gives you the following tasks: find and visualize retention, make a decision based on the A/B test data, and suggest a set of metrics to evaluate the results of the last monthly campaign. 72 | 73 | 74 |
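**A few technique sketches:** the code blocks below are minimal, illustrative sketches of techniques referenced in the list above. They run on toy data with hypothetical column names; none of them are excerpts from the actual notebooks. First, smoothing a daily series with a simple moving average and an exponentially weighted average, then flagging values outside a roughly 99% interval around the smoothed curve, as in the avocado price and London bike rent projects:

```python
import numpy as np
import pandas as pd

# Toy daily series with one artificial spike (all values are invented).
dates = pd.date_range('2021-01-01', periods=90, freq='D')
rides = pd.Series(1000 + 50 * np.random.default_rng(0).standard_normal(90), index=dates)
rides.iloc[45] += 400

sma = rides.rolling(window=7).mean()          # simple moving average
ewa = rides.ewm(span=7, adjust=False).mean()  # exponentially weighted average

# 2.576 is the two-sided z-score for a 99% interval under a normality assumption.
std = (rides - sma).std()
spikes = rides[(rides - sma).abs() > 2.576 * std]
print(spikes)  # the injected spike should be flagged
```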
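The timeit optimization project contrasts a row-by-row `apply` with a vectorized NumPy computation; this is a sketch of that idea using the standard haversine formula. The coordinate column names are assumptions, not the dataset's actual schema:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; works on scalars and NumPy arrays alike."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Two toy trips (hypothetical column names).
df = pd.DataFrame({'pickup_lat': [40.7614, 40.6413], 'pickup_lon': [-73.9776, -73.7781],
                   'dropoff_lat': [40.6413, 40.7614], 'dropoff_lon': [-73.7781, -73.9776]})

# Slow: calls the function once per row.
df['dist_slow'] = df.apply(lambda r: haversine_km(r.pickup_lat, r.pickup_lon,
                                                  r.dropoff_lat, r.dropoff_lon), axis=1)
# Fast: passes whole columns, so NumPy operates on entire arrays at once.
df['dist_fast'] = haversine_km(df.pickup_lat.values, df.pickup_lon.values,
                               df.dropoff_lat.values, df.dropoff_lon.values)
```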
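CTR, CPC, CR, ARPU and ARPPU all reduce to simple ratios once events and revenue are counted; a sketch on invented numbers:

```python
import pandas as pd

# Toy event log: ad cost is charged per event row (all numbers are invented).
ads = pd.DataFrame({'event': ['view'] * 1000 + ['click'] * 37, 'cost': 0.05})
views = (ads.event == 'view').sum()
clicks = (ads.event == 'click').sum()
ctr = 100 * clicks / views     # click-through rate, %
cpc = ads.cost.sum() / clicks  # cost per click

# Toy users table for conversion and per-user revenue metrics.
users = pd.DataFrame({'user_id': range(5), 'revenue': [0, 0, 10, 0, 25]})
cr = 100 * (users.revenue > 0).mean()                    # conversion to purchase, %
arpu = users.revenue.sum() / len(users)                  # average revenue per user
arppu = users.revenue.sum() / (users.revenue > 0).sum()  # ...per paying user
print(round(ctr, 2), round(cpc, 2), cr, arpu, arppu)
```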
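An independent two-sample t-test is the workhorse of the delivery A/B project; a minimal sketch on simulated delivery times (the means, scales and alpha here are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=40, scale=5, size=1000)    # delivery time, old algorithm
treatment = rng.normal(loc=39, scale=5, size=1000)  # delivery time, new algorithm

t_stat, p_value = stats.ttest_ind(control, treatment)
if p_value < 0.05:
    print(f'Reject H0: delivery times differ (p = {p_value:.4f})')
else:
    print(f'Fail to reject H0 (p = {p_value:.4f})')
```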
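The A/A test logic can be illustrated by running many tests on pairs of samples drawn from the same distribution: the share of "significant" results estimates the false positive rate and should stay close to alpha. A sketch (sample sizes and alpha are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims, false_positives = 0.05, 1000, 0

for _ in range(n_sims):
    # Both groups come from the same distribution, so every
    # "significant" result is a false positive by construction.
    a = rng.normal(size=500)
    b = rng.normal(size=500)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f'FPR = {false_positives / n_sims:.3f} (should be close to alpha = {alpha})')
```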
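The bootstrap comparison rests on resampling with replacement and taking percentiles of the resampled statistic; a percentile-bootstrap sketch showing why the median is more robust to outliers than the mean:

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=10_000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    boot = np.array([stat(rng.choice(data, size=len(data), replace=True))
                     for _ in range(n_boot)])
    return tuple(np.percentile(boot, [(100 - ci) / 2, 100 - (100 - ci) / 2]))

# Toy sample with a few extreme outliers (all values invented).
data = np.concatenate([np.random.default_rng(1).normal(100, 10, 990),
                       np.full(10, 10_000)])
print('mean CI:  ', bootstrap_ci(data, np.mean))    # dragged up by the outliers
print('median CI:', bootstrap_ci(data, np.median))  # barely affected
```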
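RFM segmentation scores each user on recency, frequency and monetary value, typically by splitting each dimension into quantile bins; a sketch on a toy transaction log (the schema and bin count are assumptions):

```python
import pandas as pd

# Toy transaction log (hypothetical schema).
tx = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'date': pd.to_datetime(['2021-01-05', '2021-02-20', '2021-01-10',
                            '2021-02-01', '2021-02-15', '2021-02-28']),
    'amount': [50, 30, 200, 10, 15, 20],
})
now = tx.date.max()

rfm = tx.groupby('user_id').agg(
    recency=('date', lambda d: (now - d.max()).days),  # days since last purchase
    frequency=('date', 'count'),                       # number of purchases
    monetary=('amount', 'sum'),                        # total spend
)

# Score each dimension 1-3; low recency is good, hence the reversed labels.
rfm['R'] = pd.qcut(rfm.recency, 3, labels=[3, 2, 1])
rfm['F'] = pd.qcut(rfm.frequency.rank(method='first'), 3, labels=[1, 2, 3])
rfm['M'] = pd.qcut(rfm.monetary, 3, labels=[1, 2, 3])
rfm['segment'] = rfm.R.astype(str) + rfm.F.astype(str) + rfm.M.astype(str)
print(rfm)
```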
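Retention tables like the ones behind the Tableau dashboard can be prepared in pandas by assigning each user to a cohort (the day of their first visit) and counting how many are still active N days later; a sketch on a toy activity log (hypothetical schema):

```python
import pandas as pd

# Toy activity log: one row per user per active day (hypothetical schema).
logs = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3],
    'date': pd.to_datetime(['2020-09-01', '2020-09-03', '2020-09-01',
                            '2020-09-02', '2020-09-05', '2020-09-02']),
})

logs['cohort'] = logs.groupby('user_id')['date'].transform('min')  # first-visit day
logs['lifetime'] = (logs.date - logs.cohort).dt.days               # days since first visit

cohorts = logs.pivot_table(index='cohort', columns='lifetime',
                           values='user_id', aggfunc='nunique')
retention = cohorts.div(cohorts[0], axis=0)  # share of the cohort active on day N
print(retention.round(2))
```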
75 | Hope this repo will help you to assess my coding, data analytics and SQL skills, or will simply be fun to look through. 76 | 77 | 78 | 79 | -------------------------------------------- 80 | Feel free to contact me via nktn.lx@gmal.com 81 | Follow me on twitter: @nktn_lx 82 | And here on github: github.com/nktnlx --------------------------------------------------------------------------------