├── .github ├── CODEOWNERS ├── ISSUE_TEMPLATE.md ├── PULL_REQUEST_TEMPLATE.md └── workflows │ └── main.yml ├── .gitignore ├── CONTRIBUTING.md ├── Ch01 ├── 01_02 │ ├── cart.csv │ └── missing.py ├── 01_03 │ ├── bad_values.py │ └── metrics.csv └── 01_04 │ ├── cart.csv │ └── duplicates.py ├── Ch02 ├── 02_01 │ └── payment_form.png ├── 02_03 │ └── payment_form.png ├── challenge │ └── payment_form.png └── solution │ └── payment-design.png ├── Ch03 ├── 03_01 │ ├── schema.sql │ ├── ships.csv │ └── ships.py ├── 03_02 │ ├── ships.csv │ └── ships.py ├── 03_03 │ ├── locations.csv │ ├── ships.csv │ └── ships.py ├── 03_04 │ ├── ships.csv │ └── ships.py ├── 03_05 │ ├── heights.csv │ └── heights.py ├── challenge │ ├── rides.csv │ └── rides.py └── solution │ ├── rides.csv │ └── rides.py ├── Ch04 ├── 04_01 │ └── metrics.py ├── 04_02 │ └── metrics.csv ├── 04_03 │ ├── rides.csv │ ├── rides.db │ └── tasks.py ├── 04_04 │ ├── etl.py │ └── ships.csv ├── 04_05 │ ├── metrics.csv │ └── metrics_wide.csv ├── 04_06 │ ├── orders.csv │ └── orders.py ├── challenge │ ├── etl.py │ └── traffic.csv └── solution │ ├── etl.py │ └── traffic.csv ├── Ch05 ├── 05_01 │ ├── columns.py │ ├── donations.csv │ └── weather.csv ├── 05_02 │ ├── points.csv │ └── points.py ├── 05_03 │ ├── 2021-06.csv │ └── work.py ├── 05_04 │ ├── rides.csv │ └── rides.py ├── 05_05 │ ├── cart.csv │ └── missing.py ├── 05_06 │ ├── metrics.csv │ └── metrics.py ├── challenge │ ├── workshops.csv │ ├── workshops.png │ └── workshops.py └── solution │ ├── workshops.csv │ └── workshops.py ├── LICENSE ├── NOTICE ├── README.md └── requirements.txt /.github/CODEOWNERS: -------------------------------------------------------------------------------- 1 | # Codeowners for these exercise files: 2 | # * (asterisk) denotes "all files and folders" 3 | # Example: * @producer @instructor 4 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE.md: 
-------------------------------------------------------------------------------- 1 | 7 | 8 | ## Issue Overview 9 | 10 | 11 | ## Describe your environment 12 | 13 | 14 | ## Steps to Reproduce 15 | 16 | 1. 17 | 2. 18 | 3. 19 | 4. 20 | 21 | ## Expected Behavior 22 | 23 | 24 | ## Current Behavior 25 | 26 | 27 | ## Possible Solution 28 | 29 | 30 | ## Screenshots / Video 31 | 32 | 33 | ## Related Issues 34 | 35 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /.github/workflows/main.yml: -------------------------------------------------------------------------------- 1 | name: Copy To Branches 2 | on: 3 | workflow_dispatch: 4 | jobs: 5 | copy-to-branches: 6 | runs-on: ubuntu-latest 7 | steps: 8 | - uses: actions/checkout@v2 9 | with: 10 | fetch-depth: 0 11 | - name: Copy To Branches Action 12 | uses: planetoftheweb/copy-to-branches@v1 13 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | node_modules 3 | .tmp 4 | npm-debug.log 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | 2 | Contribution Agreement 3 | ====================== 4 | 5 | This repository does not accept pull requests (PRs). All pull requests will be closed. 6 | 7 | However, if any contributions (through pull requests, issues, feedback or otherwise) are provided, as a contributor, you represent that the code you submit is your original work or that of your employer (in which case you represent you have the right to bind your employer). 
By submitting code (or otherwise providing feedback), you (and, if applicable, your employer) are licensing the submitted code (and/or feedback) to LinkedIn and the open source community subject to the BSD 2-Clause license. 8 | -------------------------------------------------------------------------------- /Ch01/01_02/cart.csv: -------------------------------------------------------------------------------- 1 | date,name,amount,price 2 | 2021-03-01,carrot,7,5.73 3 | 2021-03-01,egg,12,1.7 4 | 2021-03-01,milk,,3.57 5 | 2021-03-01,potato,2, 6 | ,tomato,6,1.52 7 | 2021-03-02,potato,3,2.17 8 | 2021-03-03,,5,3.68 -------------------------------------------------------------------------------- /Ch01/01_02/missing.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | # %% 5 | df = pd.read_csv('cart.csv', parse_dates=['date']) 6 | df 7 | 8 | # %% 9 | df['amount'].astype('Int32') 10 | 11 | # %% 12 | df.isnull() 13 | 14 | # %% 15 | df.isnull().any(axis=1) 16 | -------------------------------------------------------------------------------- /Ch01/01_03/bad_values.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | # %% 5 | df = pd.read_csv('metrics.csv', parse_dates=['time']) 6 | df.sample(10) 7 | 8 | # %% 9 | df.groupby('name').describe() 10 | 11 | # %% 12 | df['name'].value_counts() 13 | 14 | # %% 15 | pd.pivot(df, index='time', columns='name').plot(subplots=True) 16 | 17 | # %% 18 | df.query('name == "cpu" & (value < 0 | value > 100)') 19 | 20 | # %% 21 | mem = df[df['name'] == 'mem']['value'] 22 | z_score = (mem - mem.mean())/mem.std() 23 | bad_mem = mem[z_score.abs() > 2] 24 | # bad_mem 25 | df.loc[bad_mem.index] 26 | -------------------------------------------------------------------------------- /Ch01/01_03/metrics.csv: -------------------------------------------------------------------------------- 1 | time,name,value 2 | 
2021-07-13 14:36:52.380,mem,227517194.0 3 | 2021-07-13 14:36:52.380,cpu,31.57 4 | 2021-07-13 14:36:53.337,mem,227519176.0 5 | 2021-07-13 14:36:53.337,cpu,300.9 6 | 2021-07-13 14:36:54.294,mem,227515712.0 7 | 2021-07-13 14:36:54.294,cpu,31.64 8 | 2021-07-13 14:36:55.251,mem,295.0 9 | 2021-07-13 14:36:55.251,cpu,31.88 10 | 2021-07-13 14:36:56.208,mem,227531324.0 11 | 2021-07-13 14:36:56.208,cpu,-32.14 12 | 2021-07-13 14:36:57.165,cpu,31.73 13 | 2021-07-13 14:36:57.165,mem,227528488.0 14 | 2021-07-13 14:36:58.122,mem,227515595.0 15 | 2021-07-13 14:36:58.122,cpu,32.33 16 | 2021-07-13 14:36:59.079,CPU,30.4 17 | 2021-07-13 14:36:59.079,mem,227525066.0 18 | 2021-07-13 14:37:00.036,cpu,30.95 19 | 2021-07-13 14:37:00.036,mem,227531520.0 20 | 2021-07-13 14:37:00.993,cpu,31.23 21 | 2021-07-13 14:37:00.993,mem,227508323.0 22 | 2021-07-13 14:37:01.950,cpu,29.95 23 | 2021-07-13 14:37:01.950,mem,227519918.0 24 | 2021-07-13 14:37:02.907,cpu,29.62 25 | 2021-07-13 14:37:02.907,mem,227523134.0 26 | 2021-07-13 14:37:03.864,cpu,30.49 27 | 2021-07-13 14:37:03.864,mem,227517742.0 28 | 2021-07-13 14:37:04.821,mem,227526537.0 29 | 2021-07-13 14:37:04.821,cpu,29.52 30 | 2021-07-13 14:37:05.778,mem,227514284.0 31 | 2021-07-13 14:37:05.778,cpu,28.62 32 | 2021-07-13 14:37:06.735,mem,227528568.0 33 | 2021-07-13 14:37:06.735,cpu,29.78 34 | 2021-07-13 14:37:07.692,mem,227516744.0 35 | 2021-07-13 14:37:07.692,cpu,28.72 36 | 2021-07-13 14:37:08.649,cpu,27.87 37 | 2021-07-13 14:37:08.649,mem,227538721.0 38 | 2021-07-13 14:37:09.606,cpu,29.56 39 | 2021-07-13 14:37:09.606,mem,227537406.0 40 | 2021-07-13 14:37:10.563,cpu,27.37 41 | 2021-07-13 14:37:10.563,mem,227543777.0 42 | 2021-07-13 14:37:11.520,mem,227555962.0 43 | 2021-07-13 14:37:11.520,cpu,28.04 44 | 2021-07-13 14:37:12.477,mem,227548711.0 45 | 2021-07-13 14:37:12.477,cpu,28.39 46 | 2021-07-13 14:37:13.434,mem,227547875.0 47 | 2021-07-13 14:37:13.434,cpu,28.99 48 | 2021-07-13 14:37:14.391,mem,227540807.0 49 | 2021-07-13 14:37:14.391,cpu,28.11 
50 | 2021-07-13 14:37:15.348,cpu,28.63 51 | 2021-07-13 14:37:15.348,mem,227547996.0 52 | 2021-07-13 14:37:16.305,cpu,29.54 53 | 2021-07-13 14:37:16.305,mem,227556054.0 54 | 2021-07-13 14:37:17.262,cpu,29.97 55 | 2021-07-13 14:37:17.262,mem,227551174.0 56 | 2021-07-13 14:37:18.219,cpu,31.37 57 | 2021-07-13 14:37:18.219,mem,227552297.0 58 | 2021-07-13 14:37:19.176,cpu,30.6 59 | 2021-07-13 14:37:19.176,mem,227536294.0 60 | 2021-07-13 14:37:20.133,cpu,31.36 61 | 2021-07-13 14:37:20.133,mem,227529078.0 62 | 2021-07-13 14:37:21.090,mem,227534673.0 63 | 2021-07-13 14:37:21.090,cpu,32.13 64 | 2021-07-13 14:37:22.047,cpu,30.95 65 | 2021-07-13 14:37:22.047,mem,227532677.0 66 | 2021-07-13 14:37:23.004,mem,227529274.0 67 | 2021-07-13 14:37:23.004,cpu,32.98 68 | 2021-07-13 14:37:23.961,cpu,33.13 69 | 2021-07-13 14:37:23.961,mem,227517273.0 70 | 2021-07-13 14:37:24.918,cpu,33.37 71 | 2021-07-13 14:37:24.918,mem,227506948.0 72 | 2021-07-13 14:37:25.875,cpu,34.41 73 | 2021-07-13 14:37:25.875,mem,227497850.0 74 | 2021-07-13 14:37:26.832,mem,227493728.0 75 | 2021-07-13 14:37:26.832,cpu,35.29 76 | 2021-07-13 14:37:27.789,cpu,35.64 77 | 2021-07-13 14:37:27.789,mem,227495928.0 78 | 2021-07-13 14:37:28.746,cpu,34.94 79 | 2021-07-13 14:37:28.746,mem,227512097.0 80 | 2021-07-13 14:37:29.703,cpu,37.0 81 | 2021-07-13 14:37:29.703,mem,227518444.0 82 | 2021-07-13 14:37:30.660,cpu,37.64 83 | 2021-07-13 14:37:30.660,mem,227499510.0 84 | 2021-07-13 14:37:31.617,cpu,37.19 85 | 2021-07-13 14:37:31.617,mem,227507039.0 86 | 2021-07-13 14:37:32.574,cpu,36.45 87 | 2021-07-13 14:37:32.574,mem,227496995.0 88 | 2021-07-13 14:37:33.531,cpu,35.85 89 | 2021-07-13 14:37:33.531,mem,227496194.0 90 | 2021-07-13 14:37:34.488,mem,227509918.0 91 | 2021-07-13 14:37:34.488,cpu,39.46 92 | 2021-07-13 14:37:35.445,cpu,37.93 93 | 2021-07-13 14:37:35.445,mem,227501851.0 94 | 2021-07-13 14:37:36.402,cpu,38.24 95 | 2021-07-13 14:37:36.402,mem,227493116.0 96 | 2021-07-13 14:37:37.359,cpu,37.33 97 | 2021-07-13 
14:37:37.359,mem,227498512.0 98 | 2021-07-13 14:37:38.316,cpu,35.07 99 | 2021-07-13 14:37:38.316,mem,227492915.0 100 | 2021-07-13 14:37:39.273,mem,227496663.0 101 | 2021-07-13 14:37:39.273,cpu,34.77 102 | -------------------------------------------------------------------------------- /Ch01/01_04/cart.csv: -------------------------------------------------------------------------------- 1 | date,name,amount,price 2 | 2021-03-01,carrot,7,5.73 3 | 2021-03-01,egg,12,1.7 4 | 2021-03-01,egg,12,1.2 5 | 2021-03-01,milk,1,3.57 6 | 2021-03-02,potato,3,2.17 7 | 2021-03-02,potato,3,2.17 -------------------------------------------------------------------------------- /Ch01/01_04/duplicates.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | 5 | # %% 6 | df = pd.read_csv('cart.csv', parse_dates=['date']) 7 | df 8 | # %% 9 | df.duplicated() 10 | 11 | # %% 12 | df.duplicated(['date', 'name']) -------------------------------------------------------------------------------- /Ch02/02_01/payment_form.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch02/02_01/payment_form.png -------------------------------------------------------------------------------- /Ch02/02_03/payment_form.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch02/02_03/payment_form.png -------------------------------------------------------------------------------- /Ch02/challenge/payment_form.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch02/challenge/payment_form.png 
-------------------------------------------------------------------------------- /Ch02/solution/payment-design.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch02/solution/payment-design.png -------------------------------------------------------------------------------- /Ch03/03_01/schema.sql: -------------------------------------------------------------------------------- 1 | CREATE TABLE ships ( 2 | name TEXT, 3 | lat FLOAT, 4 | lng FLOAT 5 | ); -------------------------------------------------------------------------------- /Ch03/03_01/ships.csv: -------------------------------------------------------------------------------- 1 | name,lat,lng 2 | Black Pearl,20.664865,-80.709747 3 | Cobra,20.664868,-80.709740 4 | Flying Dutchman,20.664878,-80.709941 5 | Empress,, -------------------------------------------------------------------------------- /Ch03/03_01/ships.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | 5 | df = pd.read_csv('ships.csv') 6 | df 7 | # %% 8 | df.dtypes 9 | -------------------------------------------------------------------------------- /Ch03/03_02/ships.csv: -------------------------------------------------------------------------------- 1 | name,lat,lng 2 | Black Pearl,20.664865,-80.709747 3 | Cobra,20.664868,-80.709740 4 | Flying Dutchman,20.664878,-80.709941 5 | Empress,, -------------------------------------------------------------------------------- /Ch03/03_02/ships.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('ships.csv') 5 | df 6 | # %% 7 | import pandera as pa 8 | import numpy as np 9 | 10 | schema = pa.DataFrameSchema({ 11 | 'name': pa.Column(pa.String), 12 | 'lat': pa.Column(pa.Float), 13 | 'lng': 
pa.Column(pa.Float), 14 | }) 15 | 16 | schema.validate(df) 17 | -------------------------------------------------------------------------------- /Ch03/03_03/locations.csv: -------------------------------------------------------------------------------- 1 | date,lat,lng 2 | 1720-03-07,20.664865,-80.709747 3 | 1720-03-08,20.664866,-80.709746 4 | 1720-03-10,20.664868,-80.709745 5 | 1720-03-11,20.664869,-80.709744 -------------------------------------------------------------------------------- /Ch03/03_03/ships.csv: -------------------------------------------------------------------------------- 1 | name,lat,lng 2 | Black Pearl,20.664865,-80.709747 3 | Cobra,20.664868,-80.709740 4 | Flying Dutchman,20.664878,-80.709941 5 | Empress,, 6 | ,20.664875,-80.709777 -------------------------------------------------------------------------------- /Ch03/03_03/ships.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('ships.csv') 5 | df 6 | 7 | # %% 8 | df[df.isnull().any(axis=1)] 9 | # %% 10 | df.iloc[-1]['name'] 11 | # %% 12 | df['name'] = df['name'].str.strip() 13 | df.iloc[-1]['name'] 14 | 15 | # %% 16 | df[df.isnull().any(axis=1)] 17 | 18 | # %% 19 | import numpy as np 20 | mask = df['name'].str.strip() == '' 21 | df.loc[mask, 'name'] = np.nan 22 | # %% 23 | 24 | df[df.isnull().any(axis=1)] -------------------------------------------------------------------------------- /Ch03/03_04/ships.csv: -------------------------------------------------------------------------------- 1 | name,lat,lng 2 | Black Pearl,20.664865,-80.709747 3 | Cobra,20.664868,-80.709740 4 | Flying Dutchman,20.664878,-80.709941 5 | Empress,, -------------------------------------------------------------------------------- /Ch03/03_04/ships.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('ships.csv') 5 | df 6 | # %% 7 | import 
pandera as pa 8 | import numpy as np 9 | 10 | schema = pa.DataFrameSchema({ 11 | 'name': pa.Column(pa.String), 12 | 'lat': pa.Column( 13 | pa.Float, 14 | nullable=True, 15 | checks=pa.Check( 16 | lambda v: v >= -90 and v <= 90, 17 | element_wise=True, 18 | ), 19 | ), 20 | 'lng': pa.Column( 21 | pa.Float, 22 | nullable=True, 23 | checks=pa.Check( 24 | lambda v: v >= -180 and v <= 180, 25 | element_wise=True, 26 | ), 27 | ), 28 | }) 29 | 30 | schema.validate(df) -------------------------------------------------------------------------------- /Ch03/03_05/heights.csv: -------------------------------------------------------------------------------- 1 | name,grade,height 2 | Adam,1,31.7 3 | Beth,1,74.9 4 | Chris,12,72.3 5 | Dana,12,61.8 -------------------------------------------------------------------------------- /Ch03/03_05/heights.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('heights.csv') 5 | df 6 | 7 | # %% 8 | max_heights = pd.DataFrame([ 9 | [1, 32], 10 | ], columns=['grade', 'max_height']) 11 | max_heights 12 | 13 | # %% 14 | mdf = pd.merge(df, max_heights, how='left') 15 | mdf 16 | 17 | # %% 18 | df[mdf['height'] > mdf['max_height']] -------------------------------------------------------------------------------- /Ch03/challenge/rides.csv: -------------------------------------------------------------------------------- 1 | name,plate,distance 2 | Gomez,1XZ2,3.7 3 | Morticia,,2.1 4 | Fester, ,3.4 5 | Lurch,Q38X3,-3.2 6 | ,03A,14.3 7 | Wednesday,A,0.3 8 | Pugsley,ZF003,153.14 -------------------------------------------------------------------------------- /Ch03/challenge/rides.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('rides.csv') 5 | df 6 | # %% 7 | # Find out all the rows that have bad values 8 | # - Missing values are not allowed 9 | # - A plate must be a 
combination of at least 3 upper case letters or digits 10 | # - Distance must be bigger than 0 -------------------------------------------------------------------------------- /Ch03/solution/rides.csv: -------------------------------------------------------------------------------- 1 | name,plate,distance 2 | Gomez,1XZ2,3.7 3 | Morticia,,2.1 4 | Fester, ,3.4 5 | Lurch,Q38X3,-3.2 6 | ,03A,14.3 7 | Wednesday,A,0.3 8 | Pugsley,ZF003,153.14 -------------------------------------------------------------------------------- /Ch03/solution/rides.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('rides.csv') 5 | df 6 | # %% 7 | # Find out all the rows that have bad values 8 | # - Missing values are not allowed 9 | # - A plate must be a combination of at least 3 upper case letters or digits 10 | # - Distance must be bigger than 0 11 | null_mask = df.isnull().any(axis=1) 12 | df[null_mask] 13 | # %% 14 | plate_mask = ~df['plate'].str.match(r'^[0-9A-Z]{3,}', na=False) 15 | df[plate_mask] 16 | 17 | # %% 18 | dist_mask = df['distance'] < 0 19 | df[dist_mask] 20 | # %% 21 | mask = null_mask | plate_mask | dist_mask 22 | df[mask] -------------------------------------------------------------------------------- /Ch04/04_01/metrics.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | import numpy as np 4 | 5 | size = 5 6 | df = pd.DataFrame({ 7 | 'time': pd.date_range('2021', freq='17s', periods=size), 8 | 'name': ['cpu'] * size, 9 | 'value': np.random.rand(size) * 40, 10 | }) 11 | df 12 | 13 | # %% 14 | import pyarrow as pa 15 | 16 | schema = pa.schema([ 17 | ('time', pa.timestamp('ms')), 18 | ('name', pa.string()), 19 | ('value', pa.float64()), 20 | ]) 21 | 22 | # %% 23 | out_file = 'metrics.parquet' 24 | df.to_parquet(out_file, schema=schema) 25 | 26 | # %% 27 | pd.read_parquet(out_file) 28 | 29 | # %% 30 | df['time'] = 
df['time'].astype(str) 31 | df.to_parquet(out_file, schema=schema) -------------------------------------------------------------------------------- /Ch04/04_02/metrics.csv: -------------------------------------------------------------------------------- 1 | time,name,value 2 | 2021-07-13 14:36:52.380,mem,227517194.0 3 | 2021-07-13 14:36:52.380,cpu,31.57 4 | 2021-07-13 14:36:53.337,mem,227519176.0 5 | 2021-07-13 14:36:53.337,cpu,300.9 6 | 2021-07-13 14:36:54.294,mem,227515712.0 7 | 2021-07-13 14:36:54.294,cpu,31.64 8 | 2021-07-13 14:36:55.251,mem,295.0 9 | 2021-07-13 14:36:55.251,cpu,31.88 10 | 2021-07-13 14:36:56.208,mem,227531324.0 11 | 2021-07-13 14:36:56.208,cpu,-32.14 12 | 2021-07-13 14:36:57.165,cpu,31.73 13 | 2021-07-13 14:36:57.165,mem,227528488.0 14 | 2021-07-13 14:36:58.122,mem,227515595.0 15 | 2021-07-13 14:36:58.122,cpu,32.33 16 | 2021-07-13 14:36:59.079,CPU,30.4 17 | 2021-07-13 14:36:59.079,mem,227525066.0 18 | 2021-07-13 14:37:00.036,cpu,30.95 19 | 2021-07-13 14:37:00.036,mem,227531520.0 20 | 2021-07-13 14:37:00.993,cpu,31.23 21 | 2021-07-13 14:37:00.993,mem,227508323.0 22 | 2021-07-13 14:37:01.950,cpu,29.95 23 | 2021-07-13 14:37:01.950,mem,227519918.0 24 | 2021-07-13 14:37:02.907,cpu,29.62 25 | 2021-07-13 14:37:02.907,mem,227523134.0 26 | 2021-07-13 14:37:03.864,cpu,30.49 27 | 2021-07-13 14:37:03.864,mem,227517742.0 28 | 2021-07-13 14:37:04.821,mem,227526537.0 29 | 2021-07-13 14:37:04.821,cpu,29.52 30 | 2021-07-13 14:37:05.778,mem,227514284.0 31 | 2021-07-13 14:37:05.778,cpu,28.62 32 | 2021-07-13 14:37:06.735,mem,227528568.0 33 | 2021-07-13 14:37:06.735,cpu,29.78 34 | 2021-07-13 14:37:07.692,mem,227516744.0 35 | 2021-07-13 14:37:07.692,cpu,28.72 36 | 2021-07-13 14:37:08.649,cpu,27.87 37 | 2021-07-13 14:37:08.649,mem,227538721.0 38 | 2021-07-13 14:37:09.606,cpu,29.56 39 | 2021-07-13 14:37:09.606,mem,227537406.0 40 | 2021-07-13 14:37:10.563,cpu,27.37 41 | 2021-07-13 14:37:10.563,mem,227543777.0 42 | 2021-07-13 14:37:11.520,mem,227555962.0 43 | 
2021-07-13 14:37:11.520,cpu,28.04 44 | 2021-07-13 14:37:12.477,mem,227548711.0 45 | 2021-07-13 14:37:12.477,cpu,28.39 46 | 2021-07-13 14:37:13.434,mem,227547875.0 47 | 2021-07-13 14:37:13.434,cpu,28.99 48 | 2021-07-13 14:37:14.391,mem,227540807.0 49 | 2021-07-13 14:37:14.391,cpu,28.11 50 | 2021-07-13 14:37:15.348,cpu,28.63 51 | 2021-07-13 14:37:15.348,mem,227547996.0 52 | 2021-07-13 14:37:16.305,cpu,29.54 53 | 2021-07-13 14:37:16.305,mem,227556054.0 54 | 2021-07-13 14:37:17.262,cpu,29.97 55 | 2021-07-13 14:37:17.262,mem,227551174.0 56 | 2021-07-13 14:37:18.219,cpu,31.37 57 | 2021-07-13 14:37:18.219,mem,227552297.0 58 | 2021-07-13 14:37:19.176,cpu,30.6 59 | 2021-07-13 14:37:19.176,mem,227536294.0 60 | 2021-07-13 14:37:20.133,cpu,31.36 61 | 2021-07-13 14:37:20.133,mem,227529078.0 62 | 2021-07-13 14:37:21.090,mem,227534673.0 63 | 2021-07-13 14:37:21.090,cpu,32.13 64 | 2021-07-13 14:37:22.047,cpu,30.95 65 | 2021-07-13 14:37:22.047,mem,227532677.0 66 | 2021-07-13 14:37:23.004,mem,227529274.0 67 | 2021-07-13 14:37:23.004,cpu,32.98 68 | 2021-07-13 14:37:23.961,cpu,33.13 69 | 2021-07-13 14:37:23.961,mem,227517273.0 70 | 2021-07-13 14:37:24.918,cpu,33.37 71 | 2021-07-13 14:37:24.918,mem,227506948.0 72 | 2021-07-13 14:37:25.875,cpu,34.41 73 | 2021-07-13 14:37:25.875,mem,227497850.0 74 | 2021-07-13 14:37:26.832,mem,227493728.0 75 | 2021-07-13 14:37:26.832,cpu,35.29 76 | 2021-07-13 14:37:27.789,cpu,35.64 77 | 2021-07-13 14:37:27.789,mem,227495928.0 78 | 2021-07-13 14:37:28.746,cpu,34.94 79 | 2021-07-13 14:37:28.746,mem,227512097.0 80 | 2021-07-13 14:37:29.703,cpu,37.0 81 | 2021-07-13 14:37:29.703,mem,227518444.0 82 | 2021-07-13 14:37:30.660,cpu,37.64 83 | 2021-07-13 14:37:30.660,mem,227499510.0 84 | 2021-07-13 14:37:31.617,cpu,37.19 85 | 2021-07-13 14:37:31.617,mem,227507039.0 86 | 2021-07-13 14:37:32.574,cpu,36.45 87 | 2021-07-13 14:37:32.574,mem,227496995.0 88 | 2021-07-13 14:37:33.531,cpu,35.85 89 | 2021-07-13 14:37:33.531,mem,227496194.0 90 | 2021-07-13 
14:37:34.488,mem,227509918.0 91 | 2021-07-13 14:37:34.488,cpu,39.46 92 | 2021-07-13 14:37:35.445,cpu,37.93 93 | 2021-07-13 14:37:35.445,mem,227501851.0 94 | 2021-07-13 14:37:36.402,cpu,38.24 95 | 2021-07-13 14:37:36.402,mem,227493116.0 96 | 2021-07-13 14:37:37.359,cpu,37.33 97 | 2021-07-13 14:37:37.359,mem,227498512.0 98 | 2021-07-13 14:37:38.316,cpu,35.07 99 | 2021-07-13 14:37:38.316,mem,227492915.0 100 | 2021-07-13 14:37:39.273,mem,227496663.0 101 | 2021-07-13 14:37:39.273,cpu,34.77 102 | -------------------------------------------------------------------------------- /Ch04/04_03/rides.csv: -------------------------------------------------------------------------------- 1 | car,start,end,charge 2 | c1,2021-06-27T08:32,2021-06-27T08:47,17.3 3 | c2,2021-06-27T09:07,2021-06-27T09:10,3.3 4 | c2,2021-06-27T09:23,2021-06-27T09:42,30.6 5 | c1,2021-06-27T09:33,2021-06-27T09:47,5.8 -------------------------------------------------------------------------------- /Ch04/04_03/rides.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch04/04_03/rides.db -------------------------------------------------------------------------------- /Ch04/04_03/tasks.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | import pandas as pd 4 | from invoke import task 5 | 6 | 7 | def load_csv(csv_file): 8 | df = pd.read_csv(csv_file, parse_dates=['start', 'end']) 9 | return df 10 | 11 | 12 | def validate(df): 13 | bad_time = df.query('start >= end') 14 | if len(bad_time) > 0: 15 | raise ValueError(bad_time) 16 | 17 | 18 | @task 19 | def etl(ctx, csv_file): 20 | df = load_csv(csv_file) 21 | validate(df) 22 | 23 | db_file = f'rides.db' 24 | conn = sqlite3.connect(db_file) 25 | df.to_sql('rides', conn, index=False, if_exists='append') 26 | 
-------------------------------------------------------------------------------- /Ch04/04_04/etl.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('ships.csv') 5 | df 6 | 7 | # %% 8 | import sqlite3 9 | 10 | 11 | schema = ''' 12 | CREATE TABLE ships ( 13 | name TEXT, 14 | lat FLOAT NOT NULL, 15 | lng FLOAT NOT NULL 16 | ); 17 | ''' 18 | 19 | db_file = 'ships.db' 20 | conn = sqlite3.connect(db_file) 21 | conn.executescript(schema) 22 | 23 | try: 24 | with conn as cur: 25 | cur.execute('BEGIN') 26 | df.to_sql('ships', conn, if_exists='append', index=False) 27 | finally: 28 | conn.close() 29 | -------------------------------------------------------------------------------- /Ch04/04_04/ships.csv: -------------------------------------------------------------------------------- 1 | name,lat,lng 2 | Black Pearl,20.664865,-80.709747 3 | Cobra,20.664868,-80.709740 4 | Flying Dutchman,20.664878,-80.709941 5 | Empress,, -------------------------------------------------------------------------------- /Ch04/04_05/metrics.csv: -------------------------------------------------------------------------------- 1 | time,metric,value 2 | 2021-07-23 14:33:04,cpu,30.2 3 | 2021-07-23 14:44:05,cpu,32.9 4 | 2021-07-23 14:55:06,cpu,37.1 5 | 2021-07-23 14:33:04,memory,571.83 6 | 2021-07-23 14:44:05,memory,524.72 7 | 2021-07-23 14:55:06,memory,617.9 8 | -------------------------------------------------------------------------------- /Ch04/04_05/metrics_wide.csv: -------------------------------------------------------------------------------- 1 | time,cpu,memory 2 | 2021-07-23T14:33:04,30.2,571.83 3 | 2021-07-23T14:44:05,32.9,524.72 4 | 2021-07-23T14:55:06,37.1,617.9 -------------------------------------------------------------------------------- /Ch04/04_06/orders.csv: -------------------------------------------------------------------------------- 1 | time,symbol,price,side 2 | 
2027-06-01T08:02:00,MSFT,264.1252085586294,bid 3 | 2027-06-01T08:02:17,MSFT,265.5017453117783,ask 4 | 2027-06-01T08:02:34,IBM,145.9908889786,bid 5 | 2027-06-01T08:02:51,ORCL,78.80933934941555,ask 6 | 2027-06-01T08:03:08,ORCL,78.62171608293582,ask 7 | 2027-06-01T08:03:25,IBM,146.39460333104478,bid 8 | 2027-06-01T08:03:42,MSFT,264.5414625156951,bid 9 | 2027-06-01T08:03:59,MSFT,265.920655685467,ask 10 | 2027-06-01T08:04:16,ORCL,77.83246629647707,bid 11 | 2027-06-01T08:04:33,MSFT,265.6907276936156,ask 12 | 2027-06-01T08:04:50,ORCL,78.98301932754048,ask 13 | 2027-06-01T08:05:07,IBM,147.4352752915348,ask 14 | 2027-06-01T08:05:24,MSFT,265.4856221936908,ask 15 | 2027-06-01T08:05:41,MSFT,265.19491621016135,ask 16 | 2027-06-01T08:05:58,MSFT,264.66183803436746,bid 17 | 2027-06-01T08:06:15,MSFT,265.684404543366,ask 18 | 2027-06-01T08:06:32,,265.4432949473238,ask 19 | 2027-06-01T08:06:49,ORCL,79.27201640861864,ask 20 | 2027-06-01T08:07:06,IBM,146.31606679185106,bid 21 | 2027-06-01T08:07:23,ORCL,77.97591150838826,bid 22 | -------------------------------------------------------------------------------- /Ch04/04_06/orders.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('orders.csv', parse_dates=['time']) 5 | df 6 | 7 | # %% 8 | def is_valid_row(row): 9 | if row['time'] < pd.Timestamp('1900'): 10 | return False 11 | 12 | if pd.isnull(row['symbol']) or row['symbol'].strip() == '': 13 | return False 14 | 15 | if row['price'] <= 0: 16 | return False 17 | 18 | return True 19 | 20 | 21 | ok_df = df[df.apply(is_valid_row, axis=1)] 22 | 23 | # %% 24 | num_bad = len(df) - len(ok_df) 25 | percent_bad = num_bad/len(df) * 100 26 | print(f'{percent_bad:.2f}% bad rows') 27 | if num_bad > 0: 28 | bad_rows = df[~df.index.isin(ok_df.index)] 29 | print('bad rows:') 30 | print(bad_rows) -------------------------------------------------------------------------------- /Ch04/challenge/etl.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | Load traffic.csv into "traffic" table in sqlite3 database. 3 | 4 | Drop and report invalid rows. 5 | - ip should be valid IP (see ipaddress) 6 | - time must not be in the future 7 | - path can't be empty 8 | - status code must be a valid HTTP status code (see http.HTTPStatus) 9 | - size can't be negative or empty 10 | 11 | Report the percentage of bad rows. Fail the ETL if there are more than 5% bad rows 12 | """ -------------------------------------------------------------------------------- /Ch04/challenge/traffic.csv: -------------------------------------------------------------------------------- 1 | ip,time,path,status,size 2 | 108.66.146.1,2017-06-19T14:03:00,/images,200,1095 3 | 108.66.146.3,2017-06-19T14:03:21,/posts,200,1572 4 | 108.66.146.6,2017-06-19T14:03:42,/posts,200,1174 5 | 108.66.146.1,2017-06-19T14:04:03,/users,200,684 6 | 108.66.146.1,2017-06-19T14:04:24,/images,400,0 7 | 108.66.146.4,2017-06-19T14:04:45,/posts,200,1594 8 | 108.66.146.6,2017-06-19T14:05:06,/images,200,857 9 | 108.66.146.2,2017-06-19T14:05:27,/users,200,1834 10 | 108.66.146.1,2017-06-19T14:05:48,/images,200,1412 11 | 108.66.146.2,2017-06-19T14:06:09,/health,200,698 12 | 108.66.446.2,2017-06-19T14:06:30,/users,200,833 13 | 108.66.146.5,2017-06-19T14:06:51,/health,200,534 14 | 108.66.146.1,2017-06-19T14:07:12,/users,200,1833 15 | 108.66.146.3,2017-06-19T14:07:33,/posts,200,1574 16 | 108.66.146.5,2017-06-19T14:07:54,/posts,200,1472 17 | 108.66.146.2,2017-06-19T14:08:15,/health,200,2048 18 | 108.66.146.3,2017-06-19T14:08:36,/health,400,0 19 | 108.66.146.1,2037-06-19T14:08:57,/users,200,1658 20 | 108.66.146.4,2017-06-19T14:09:18,/posts,400,0 21 | 108.66.146.3,2017-06-19T14:09:39,/images,200,996 22 | 108.66.146.4,2017-06-19T14:10:00,/users,200,905 23 | 108.66.146.2,2017-06-19T14:10:21,,200,1987 24 | 108.66.146.6,2017-06-19T14:10:42,/images,200,2004 25 | 
108.66.146.4,2017-06-19T14:11:03,/users,200,1973 26 | 108.66.146.1,2017-06-19T14:11:24,/posts,200,674 27 | 108.66.146.5,2017-06-19T14:11:45,/posts,200,1928 28 | 108.66.146.3,2017-06-19T14:12:06,/posts,200,-1755 29 | 108.66.146.3,2017-06-19T14:12:27,/health,200,1702 30 | 108.66.146.6,2017-06-19T14:12:48,/health,200,1226 31 | 108.66.146.4,2017-06-19T14:13:09,/images,200,1691 32 | 108.66.146.5,2017-06-19T14:13:30,/health,200,741 33 | 108.66.146.2,2017-06-19T14:13:51,/posts,200,2000 34 | 108.66.146.2,2017-06-19T14:14:12,/health,200,1144 35 | 108.66.146.1,2017-06-19T14:14:33,/health,200,937 36 | 108.66.146.3,2017-06-19T14:14:54,/images,200,1233 37 | 108.66.146.2,2017-06-19T14:15:15,/images,200,1655 38 | 108.66.146.1,2017-06-19T14:15:36,/users,200,964 39 | 108.66.146.5,2017-06-19T14:15:57,/posts,200,1937 40 | 108.66.146.3,2017-06-19T14:16:18,/users,200,1613 41 | 108.66.146.3,2017-06-19T14:16:39,/images,200,1070 42 | 108.66.146.6,2017-06-19T14:17:00,/users,200,1199 43 | 108.66.146.2,2017-06-19T14:17:21,/health,200,1456 44 | 108.66.146.1,2017-06-19T14:17:42,/users,200,758 45 | 108.66.146.2,2017-06-19T14:18:03,/users,200,889 46 | 108.66.146.5,2017-06-19T14:18:24,/health,200,512 47 | 108.66.146.4,2017-06-19T14:18:45,/health,200,1131 48 | 108.66.146.3,2017-06-19T14:19:06,/health,200,1583 49 | 108.66.146.6,2017-06-19T14:19:27,/users,200,757 50 | 108.66.146.5,2017-06-19T14:19:48,/users,200,1632 51 | 108.66.146.3,2017-06-19T14:20:09,/posts,200,1063 52 | 108.66.146.6,2017-06-19T14:20:30,/images,200,1225 53 | 108.66.146.3,2017-06-19T14:20:51,/images,200,1832 54 | 108.66.146.1,2017-06-19T14:21:12,/users,200,623 55 | 108.66.146.3,2017-06-19T14:21:33,/health,200,1280 56 | 108.66.146.6,2017-06-19T14:21:54,/posts,200,637 57 | 108.66.146.4,2017-06-19T14:22:15,/health,200,1696 58 | 108.66.146.6,2017-06-19T14:22:36,/images,200,1701 59 | 108.66.146.5,2017-06-19T14:22:57,/health,200,681 60 | 108.66.146.2,2017-06-19T14:23:18,/health,200,1298 61 | 
108.66.146.4,2017-06-19T14:23:39,/images,900,1426 62 | 108.66.146.5,2017-06-19T14:24:00,/images,200,1754 63 | 108.66.146.5,2017-06-19T14:24:21,/posts,200,710 64 | 108.66.146.5,2017-06-19T14:24:42,/health,200,1565 65 | 108.66.146.2,2017-06-19T14:25:03,/users,200,850 66 | 108.66.146.1,2017-06-19T14:25:24,/posts,200,532 67 | 108.66.146.6,2017-06-19T14:25:45,/users,200,728 68 | 108.66.146.4,2017-06-19T14:26:06,/users,200,2033 69 | 108.66.146.2,2017-06-19T14:26:27,/users,200,1132 70 | 108.66.146.1,2017-06-19T14:26:48,/health,200,1892 71 | 108.66.146.5,2017-06-19T14:27:09,/users,200,1001 72 | 108.66.146.2,2017-06-19T14:27:30,/health,200,1318 73 | 108.66.146.4,2017-06-19T14:27:51,/users,400,0 74 | 108.66.146.5,2017-06-19T14:28:12,/images,200,1628 75 | 108.66.146.3,2017-06-19T14:28:33,/posts,200,1692 76 | 108.66.146.6,2017-06-19T14:28:54,/users,200,1702 77 | 108.66.146.2,2017-06-19T14:29:15,/users,400,0 78 | 108.66.146.3,2017-06-19T14:29:36,/users,200,814 79 | 108.66.146.3,2017-06-19T14:29:57,/users,200,1139 80 | 108.66.146.3,2017-06-19T14:30:18,/images,400,0 81 | 108.66.146.6,2017-06-19T14:30:39,/posts,200,1520 82 | 108.66.146.1,2017-06-19T14:31:00,/users,200,1261 83 | 108.66.146.3,2017-06-19T14:31:21,/posts,400,0 84 | 108.66.146.2,2017-06-19T14:31:42,/images,200,1554 85 | 108.66.146.5,2017-06-19T14:32:03,/images,200,711 86 | 108.66.146.4,2017-06-19T14:32:24,/health,200,603 87 | 108.66.146.4,2017-06-19T14:32:45,/images,200,995 88 | 108.66.146.6,2017-06-19T14:33:06,/health,200,546 89 | 108.66.146.6,2017-06-19T14:33:27,/posts,200,1374 90 | 108.66.146.3,2017-06-19T14:33:48,/users,200,1267 91 | 108.66.146.2,2017-06-19T14:34:09,/users,200,582 92 | 108.66.146.4,2017-06-19T14:34:30,/users,200,1605 93 | 108.66.146.1,2017-06-19T14:34:51,/posts,200,1250 94 | 108.66.146.5,2017-06-19T14:35:12,/images,200,1393 95 | 108.66.146.2,2017-06-19T14:35:33,/users,200,1153 96 | 108.66.146.5,2017-06-19T14:35:54,/posts,200,1635 97 | 108.66.146.5,2017-06-19T14:36:15,/images,200,1452 98 | 
108.66.146.1,2017-06-19T14:36:36,/health,200,1568 99 | 108.66.146.3,2017-06-19T14:36:57,/health,200,1450 100 | 108.66.146.1,2017-06-19T14:37:18,/posts,200,1231 101 | 108.66.146.1,2017-06-19T14:37:39,/health,200,793 102 | 108.66.146.3,2017-06-19T14:38:00,/users,200,842 103 | 108.66.146.6,2017-06-19T14:38:21,/images,200,1939 104 | 108.66.146.5,2017-06-19T14:38:42,/health,200,1583 105 | 108.66.146.1,2017-06-19T14:39:03,/users,200,1382 106 | 108.66.146.4,2017-06-19T14:39:24,/users,400,0 107 | 108.66.146.1,2017-06-19T14:39:45,/images,200,1066 108 | 108.66.146.2,2017-06-19T14:40:06,/images,200,1602 109 | 108.66.146.1,2017-06-19T14:40:27,/users,200,952 110 | 108.66.146.6,2017-06-19T14:40:48,/health,200,1348 111 | 108.66.146.5,2017-06-19T14:41:09,/health,200,1296 112 | 108.66.146.4,2017-06-19T14:41:30,/posts,200,1459 113 | 108.66.146.5,2017-06-19T14:41:51,/users,400,0 114 | 108.66.146.1,2017-06-19T14:42:12,/health,200,531 115 | 108.66.146.3,2017-06-19T14:42:33,/health,200,1702 116 | 108.66.146.5,2017-06-19T14:42:54,/users,200,1116 117 | 108.66.146.1,2017-06-19T14:43:15,/images,200,1304 118 | 108.66.146.3,2017-06-19T14:43:36,/posts,200,1008 119 | 108.66.146.4,2017-06-19T14:43:57,/health,200,1182 120 | 108.66.146.6,2017-06-19T14:44:18,/images,200,987 121 | 108.66.146.4,2017-06-19T14:44:39,/users,400,0 122 | 108.66.146.2,2017-06-19T14:45:00,/images,200,1770 123 | 108.66.146.1,2017-06-19T14:45:21,/users,200,1008 124 | 108.66.146.2,2017-06-19T14:45:42,/posts,200,981 125 | 108.66.146.5,2017-06-19T14:46:03,/posts,200,1750 126 | 108.66.146.3,1017-06-19T14:46:24,/users,200,1079 127 | 108.66.146.1,2017-06-19T14:46:45,/images,200,1064 128 | 108.66.146.5,2017-06-19T14:47:06,/health,200,1531 129 | 108.66.146.4,2017-06-19T14:47:27,/images,200,1029 130 | -------------------------------------------------------------------------------- /Ch04/solution/etl.py: -------------------------------------------------------------------------------- 1 | """ 2 | Load traffic.csv into "traffic" table in 
sqlite3 database. 3 | 4 | Drop and report invalid rows. 5 | - ip should be valid IP (see ipaddress) 6 | - time must not be in the future 7 | - path can't be empty 8 | - status code must be a valid HTTP status code (see http.HTTPStatus) 9 | - size can't be negative or empty 10 | 11 | Report the percentage of bad rows. Fail the ETL if there are more than 5% bad rows 12 | """ 13 | 14 | 15 | import sqlite3 16 | from contextlib import closing 17 | from http import HTTPStatus 18 | from ipaddress import ip_address 19 | 20 | import pandas as pd 21 | 22 | status_codes = set(HTTPStatus) 23 | 24 | max_bad_percent = 5 25 | 26 | 27 | def is_valid_row(row): 28 | # ip should be valid IP (see ipaddress) 29 | try: 30 | ip_address(row['ip']) 31 | except ValueError: 32 | return False 33 | 34 | # time must not be in the future 35 | now = pd.Timestamp.now() 36 | if row['time'] > now: 37 | return False 38 | 39 | # path can't be empty 40 | if pd.isnull(row['path']) or not row['path'].strip(): 41 | return False 42 | 43 | # status code must be a valid HTTP status code (see http.HTTPStatus) 44 | if row['status'] not in status_codes: 45 | return False 46 | 47 | # size can't be negative or empty 48 | if pd.isnull(row['size']) or row['size'] < 0: 49 | return False 50 | 51 | return True 52 | 53 | 54 | def etl(csv_file, db_file): 55 | df = pd.read_csv(csv_file, parse_dates=['time']) 56 | 57 | bad_rows = df[~df.apply(is_valid_row, axis=1)] 58 | if len(bad_rows) > 0: 59 | percent_bad = len(bad_rows)/len(df) * 100 60 | print(f'{len(bad_rows)} ({percent_bad:.2f}%) bad rows') 61 | if percent_bad >= max_bad_percent: 62 | raise ValueError(f'too many bad rows ({percent_bad:.2f}%)') 63 | 64 | df = df[~df.index.isin(bad_rows.index)] 65 | with closing(sqlite3.connect(db_file)) as conn: 66 | conn.execute('BEGIN') 67 | with conn: 68 | df.to_sql('traffic', conn, if_exists='append', index=False) 69 | 70 | if __name__ == '__main__': 71 | etl('traffic.csv', 'traffic.db') 
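The row-level checks in etl.py lean on the two standard-library helpers the docstring points to. A minimal standalone sketch of those checks, independent of pandas, using sample values that appear in traffic.csv:

```python
from http import HTTPStatus
from ipaddress import ip_address


def is_valid_ip(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ip_address(text)
        return True
    except ValueError:
        return False


def is_valid_status(code):
    """Return True if code is a standard HTTP status code."""
    # HTTPStatus members are IntEnums, so plain ints compare and hash equal
    return code in set(HTTPStatus)
```

For example, `is_valid_ip('108.66.446.2')` is False (446 is out of range for an IPv4 octet), and `is_valid_status(900)` is False since 900 is not a standard HTTP status code.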
-------------------------------------------------------------------------------- /Ch04/solution/traffic.csv: -------------------------------------------------------------------------------- 1 | ip,time,path,status,size 2 | 108.66.146.1,2017-06-19T14:03:00,/images,200,1095 3 | 108.66.146.3,2017-06-19T14:03:21,/posts,200,1572 4 | 108.66.146.6,2017-06-19T14:03:42,/posts,200,1174 5 | 108.66.146.1,2017-06-19T14:04:03,/users,200,684 6 | 108.66.146.1,2017-06-19T14:04:24,/images,400,0 7 | 108.66.146.4,2017-06-19T14:04:45,/posts,200,1594 8 | 108.66.146.6,2017-06-19T14:05:06,/images,200,857 9 | 108.66.146.2,2017-06-19T14:05:27,/users,200,1834 10 | 108.66.146.1,2017-06-19T14:05:48,/images,200,1412 11 | 108.66.146.2,2017-06-19T14:06:09,/health,200,698 12 | 108.66.446.2,2017-06-19T14:06:30,/users,200,833 13 | 108.66.146.5,2017-06-19T14:06:51,/health,200,534 14 | 108.66.146.1,2017-06-19T14:07:12,/users,200,1833 15 | 108.66.146.3,2017-06-19T14:07:33,/posts,200,1574 16 | 108.66.146.5,2017-06-19T14:07:54,/posts,200,1472 17 | 108.66.146.2,2017-06-19T14:08:15,/health,200,2048 18 | 108.66.146.3,2017-06-19T14:08:36,/health,400,0 19 | 108.66.146.1,2037-06-19T14:08:57,/users,200,1658 20 | 108.66.146.4,2017-06-19T14:09:18,/posts,400,0 21 | 108.66.146.3,2017-06-19T14:09:39,/images,200,996 22 | 108.66.146.4,2017-06-19T14:10:00,/users,200,905 23 | 108.66.146.2,2017-06-19T14:10:21,,200,1987 24 | 108.66.146.6,2017-06-19T14:10:42,/images,200,2004 25 | 108.66.146.4,2017-06-19T14:11:03,/users,200,1973 26 | 108.66.146.1,2017-06-19T14:11:24,/posts,200,674 27 | 108.66.146.5,2017-06-19T14:11:45,/posts,200,1928 28 | 108.66.146.3,2017-06-19T14:12:06,/posts,200,-1755 29 | 108.66.146.3,2017-06-19T14:12:27,/health,200,1702 30 | 108.66.146.6,2017-06-19T14:12:48,/health,200,1226 31 | 108.66.146.4,2017-06-19T14:13:09,/images,200,1691 32 | 108.66.146.5,2017-06-19T14:13:30,/health,200,741 33 | 108.66.146.2,2017-06-19T14:13:51,/posts,200,2000 34 | 108.66.146.2,2017-06-19T14:14:12,/health,200,1144 35 | 
108.66.146.1,2017-06-19T14:14:33,/health,200,937 36 | 108.66.146.3,2017-06-19T14:14:54,/images,200,1233 37 | 108.66.146.2,2017-06-19T14:15:15,/images,200,1655 38 | 108.66.146.1,2017-06-19T14:15:36,/users,200,964 39 | 108.66.146.5,2017-06-19T14:15:57,/posts,200,1937 40 | 108.66.146.3,2017-06-19T14:16:18,/users,200,1613 41 | 108.66.146.3,2017-06-19T14:16:39,/images,200,1070 42 | 108.66.146.6,2017-06-19T14:17:00,/users,200,1199 43 | 108.66.146.2,2017-06-19T14:17:21,/health,200,1456 44 | 108.66.146.1,2017-06-19T14:17:42,/users,200,758 45 | 108.66.146.2,2017-06-19T14:18:03,/users,200,889 46 | 108.66.146.5,2017-06-19T14:18:24,/health,200,512 47 | 108.66.146.4,2017-06-19T14:18:45,/health,200,1131 48 | 108.66.146.3,2017-06-19T14:19:06,/health,200,1583 49 | 108.66.146.6,2017-06-19T14:19:27,/users,200,757 50 | 108.66.146.5,2017-06-19T14:19:48,/users,200,1632 51 | 108.66.146.3,2017-06-19T14:20:09,/posts,200,1063 52 | 108.66.146.6,2017-06-19T14:20:30,/images,200,1225 53 | 108.66.146.3,2017-06-19T14:20:51,/images,200,1832 54 | 108.66.146.1,2017-06-19T14:21:12,/users,200,623 55 | 108.66.146.3,2017-06-19T14:21:33,/health,200,1280 56 | 108.66.146.6,2017-06-19T14:21:54,/posts,200,637 57 | 108.66.146.4,2017-06-19T14:22:15,/health,200,1696 58 | 108.66.146.6,2017-06-19T14:22:36,/images,200,1701 59 | 108.66.146.5,2017-06-19T14:22:57,/health,200,681 60 | 108.66.146.2,2017-06-19T14:23:18,/health,200,1298 61 | 108.66.146.4,2017-06-19T14:23:39,/images,900,1426 62 | 108.66.146.5,2017-06-19T14:24:00,/images,200,1754 63 | 108.66.146.5,2017-06-19T14:24:21,/posts,200,710 64 | 108.66.146.5,2017-06-19T14:24:42,/health,200,1565 65 | 108.66.146.2,2017-06-19T14:25:03,/users,200,850 66 | 108.66.146.1,2017-06-19T14:25:24,/posts,200,532 67 | 108.66.146.6,2017-06-19T14:25:45,/users,200,728 68 | 108.66.146.4,2017-06-19T14:26:06,/users,200,2033 69 | 108.66.146.2,2017-06-19T14:26:27,/users,200,1132 70 | 108.66.146.1,2017-06-19T14:26:48,/health,200,1892 71 | 108.66.146.5,2017-06-19T14:27:09,/users,200,1001 
72 | 108.66.146.2,2017-06-19T14:27:30,/health,200,1318 73 | 108.66.146.4,2017-06-19T14:27:51,/users,400,0 74 | 108.66.146.5,2017-06-19T14:28:12,/images,200,1628 75 | 108.66.146.3,2017-06-19T14:28:33,/posts,200,1692 76 | 108.66.146.6,2017-06-19T14:28:54,/users,200,1702 77 | 108.66.146.2,2017-06-19T14:29:15,/users,400,0 78 | 108.66.146.3,2017-06-19T14:29:36,/users,200,814 79 | 108.66.146.3,2017-06-19T14:29:57,/users,200,1139 80 | 108.66.146.3,2017-06-19T14:30:18,/images,400,0 81 | 108.66.146.6,2017-06-19T14:30:39,/posts,200,1520 82 | 108.66.146.1,2017-06-19T14:31:00,/users,200,1261 83 | 108.66.146.3,2017-06-19T14:31:21,/posts,400,0 84 | 108.66.146.2,2017-06-19T14:31:42,/images,200,1554 85 | 108.66.146.5,2017-06-19T14:32:03,/images,200,711 86 | 108.66.146.4,2017-06-19T14:32:24,/health,200,603 87 | 108.66.146.4,2017-06-19T14:32:45,/images,200,995 88 | 108.66.146.6,2017-06-19T14:33:06,/health,200,546 89 | 108.66.146.6,2017-06-19T14:33:27,/posts,200,1374 90 | 108.66.146.3,2017-06-19T14:33:48,/users,200,1267 91 | 108.66.146.2,2017-06-19T14:34:09,/users,200,582 92 | 108.66.146.4,2017-06-19T14:34:30,/users,200,1605 93 | 108.66.146.1,2017-06-19T14:34:51,/posts,200,1250 94 | 108.66.146.5,2017-06-19T14:35:12,/images,200,1393 95 | 108.66.146.2,2017-06-19T14:35:33,/users,200,1153 96 | 108.66.146.5,2017-06-19T14:35:54,/posts,200,1635 97 | 108.66.146.5,2017-06-19T14:36:15,/images,200,1452 98 | 108.66.146.1,2017-06-19T14:36:36,/health,200,1568 99 | 108.66.146.3,2017-06-19T14:36:57,/health,200,1450 100 | 108.66.146.1,2017-06-19T14:37:18,/posts,200,1231 101 | 108.66.146.1,2017-06-19T14:37:39,/health,200,793 102 | 108.66.146.3,2017-06-19T14:38:00,/users,200,842 103 | 108.66.146.6,2017-06-19T14:38:21,/images,200,1939 104 | 108.66.146.5,2017-06-19T14:38:42,/health,200,1583 105 | 108.66.146.1,2017-06-19T14:39:03,/users,200,1382 106 | 108.66.146.4,2017-06-19T14:39:24,/users,400,0 107 | 108.66.146.1,2017-06-19T14:39:45,/images,200,1066 108 | 
108.66.146.2,2017-06-19T14:40:06,/images,200,1602 109 | 108.66.146.1,2017-06-19T14:40:27,/users,200,952 110 | 108.66.146.6,2017-06-19T14:40:48,/health,200,1348 111 | 108.66.146.5,2017-06-19T14:41:09,/health,200,1296 112 | 108.66.146.4,2017-06-19T14:41:30,/posts,200,1459 113 | 108.66.146.5,2017-06-19T14:41:51,/users,400,0 114 | 108.66.146.1,2017-06-19T14:42:12,/health,200,531 115 | 108.66.146.3,2017-06-19T14:42:33,/health,200,1702 116 | 108.66.146.5,2017-06-19T14:42:54,/users,200,1116 117 | 108.66.146.1,2017-06-19T14:43:15,/images,200,1304 118 | 108.66.146.3,2017-06-19T14:43:36,/posts,200,1008 119 | 108.66.146.4,2017-06-19T14:43:57,/health,200,1182 120 | 108.66.146.6,2017-06-19T14:44:18,/images,200,987 121 | 108.66.146.4,2017-06-19T14:44:39,/users,400,0 122 | 108.66.146.2,2017-06-19T14:45:00,/images,200,1770 123 | 108.66.146.1,2017-06-19T14:45:21,/users,200,1008 124 | 108.66.146.2,2017-06-19T14:45:42,/posts,200,981 125 | 108.66.146.5,2017-06-19T14:46:03,/posts,200,1750 126 | 108.66.146.3,2007-06-19T14:46:24,/users,200,1079 127 | 108.66.146.1,2017-06-19T14:46:45,/images,200,1064 128 | 108.66.146.5,2017-06-19T14:47:06,/health,200,1531 129 | 108.66.146.4,2017-06-19T14:47:27,/images,200,1029 -------------------------------------------------------------------------------- /Ch05/05_01/columns.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('weather.csv', parse_dates=['DATE']) 5 | df 6 | # %% 7 | df.rename(columns={ 8 | 'DATE': 'date', 9 | 'TMIN': 'min_temp', 10 | 'TMAX': 'max_temp', 11 | }, inplace=True) 12 | df 13 | # %% 14 | df = pd.read_csv('donations.csv') 15 | df 16 | # %% 17 | import re 18 | 19 | 20 | def fix_col(col): 21 | """Fix column name 22 | >>> fix_col('1. 
First Name') 23 | 'first_name' 24 | """ 25 | return ( 26 | re.sub(r'\d+\.\s+', '', col) 27 | .lower() 28 | .replace(' ', '_') 29 | ) 30 | 31 | df.rename(columns=fix_col, inplace=True) 32 | df -------------------------------------------------------------------------------- /Ch05/05_01/donations.csv: -------------------------------------------------------------------------------- 1 | 1. First Name,2. Last Name,3. Donation Amount 2 | Amy,Wang,200 3 | Bender,Rodriguez,12 4 | Philip,Fry,70 5 | -------------------------------------------------------------------------------- /Ch05/05_01/weather.csv: -------------------------------------------------------------------------------- 1 | DATE,TMIN,TMAX 2 | 2021-04-25,18,28 3 | 2021-04-26,16,23 4 | 2021-04-27,17,24 5 | 2021-04-28,15,25 6 | 2021-04-29,17,28 -------------------------------------------------------------------------------- /Ch05/05_02/points.csv: -------------------------------------------------------------------------------- 1 | x,y,color,visible 2 | 1,1,0xFF0000,yes 3 | 2,2,0x00FF00,no 4 | 3,3,0x0000FF,yes -------------------------------------------------------------------------------- /Ch05/05_02/points.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('points.csv') 5 | df.dtypes 6 | 7 | # %% 8 | def asint(val): 9 | return int(val, base=0) 10 | 11 | df['color'] = df['color'].apply(asint) 12 | df.dtypes 13 | 14 | # %% 15 | bools = { 16 | 'yes': True, 17 | 'no': False, 18 | } 19 | df['visible'] = df['visible'].map(bools) 20 | df.dtypes 21 | 22 | # %% 23 | df -------------------------------------------------------------------------------- /Ch05/05_03/2021-06.csv: -------------------------------------------------------------------------------- 1 | day,time,client 2 | 01,09:00-11:00,ecorp 3 | 01,12:00-18:00,allsafe 4 | 02,10:00-19:30,allsafe 5 | 03,11:30-17:00,ecorp 
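The `asint` converter in points.py relies on `int` with `base=0`, which infers the base from the literal's prefix (`0x` for hex, `0o` for octal, `0b` for binary, otherwise decimal). A quick standalone check of that behavior:

```python
def asint(val):
    # base=0 tells int() to infer the base from the Python literal prefix
    return int(val, base=0)


# the color values in points.csv are hex strings such as '0xFF0000'
print(asint('0xFF0000'))  # 16711680
print(asint('42'))        # 42
```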
-------------------------------------------------------------------------------- /Ch05/05_03/work.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | csv_file = '2021-06.csv' 5 | 6 | df = pd.read_csv(csv_file) 7 | df 8 | # %% 9 | df['date'] = csv_file[:-len('.csv')] + '-' + df['day'].astype(str).str.zfill(2) 10 | df 11 | 12 | # %% 13 | times = df['time'].str.split('-', expand=True) 14 | times.columns = ['start', 'end'] 15 | times 16 | # %% 17 | df = pd.concat([df, times], axis=1) 18 | df 19 | 20 | # %% 21 | df['start'] = pd.to_datetime( 22 | df['date'].str.cat(df['start'], sep='T') 23 | ) 24 | df['end'] = pd.to_datetime( 25 | df['date'].str.cat(df['end'], sep='T') 26 | ) 27 | df 28 | 29 | # %% 30 | (df['end'] - df['start']).sum() 31 | -------------------------------------------------------------------------------- /Ch05/05_04/rides.csv: -------------------------------------------------------------------------------- 1 | name,plate,distance 2 | Gomez,1XZ2,3.7 3 | Morticia,,2.1 4 | Fester, ,3.4 5 | Lurch,Q38X3,-3.2 6 | ,03A,14.3 7 | Wednesday,A,0.3 8 | Pugsley,ZF003,153.14 -------------------------------------------------------------------------------- /Ch05/05_04/rides.py: -------------------------------------------------------------------------------- 1 | # %% 2 | 3 | import pandas as pd 4 | 5 | df = pd.read_csv('rides.csv') 6 | df 7 | 8 | # %% 9 | mask = df.eval('name.isnull() | distance <= 0') 10 | mask 11 | 12 | # %% 13 | df = df[~mask] 14 | df 15 | -------------------------------------------------------------------------------- /Ch05/05_05/cart.csv: -------------------------------------------------------------------------------- 1 | date,name,amount,price 2 | 2021-03-01,carrot,7,5.73 3 | 2021-03-01,egg,12,1.7 4 | 2021-03-01,milk,,3.57 5 | 2021-03-01,potato,2, 6 | ,tomato,6,1.52 7 | 2021-03-02,potato,3,2.17 8 | 2021-03-03,,5,3.68 -------------------------------------------------------------------------------- 
/Ch05/05_05/missing.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | 5 | df = pd.read_csv('cart.csv', parse_dates=['date']) 6 | df 7 | 8 | # %% 9 | df['amount'].fillna(1, inplace=True) 10 | df 11 | 12 | # %% 13 | most_common = df['name'].mode()[0] 14 | df['name'].fillna(most_common, inplace=True) 15 | df 16 | 17 | # %% 18 | df['date'].fillna(method='ffill', inplace=True) 19 | df 20 | 21 | # %% 22 | import numpy as np 23 | prices = df.groupby('name')['price'].transform(np.mean) 24 | prices 25 | 26 | # %% 27 | df['price'].fillna(prices, inplace=True) 28 | df -------------------------------------------------------------------------------- /Ch05/05_06/metrics.csv: -------------------------------------------------------------------------------- 1 | time,cpu,memory 2 | 2021-07-23T14:33:04,30.2,571.83 3 | 2021-07-23T14:44:05,32.9,524.72 4 | 2021-07-23T14:55:06,37.1,617.9 -------------------------------------------------------------------------------- /Ch05/05_06/metrics.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('metrics.csv', parse_dates=['time']) 5 | df 6 | # %% 7 | 8 | df = pd.melt( 9 | df, 10 | value_vars=['cpu', 'memory'], 11 | id_vars=['time'], 12 | var_name='metric', 13 | ) 14 | df 15 | -------------------------------------------------------------------------------- /Ch05/challenge/workshops.csv: -------------------------------------------------------------------------------- 1 | Year,Month,Start,End,Name,Earnings 2 | 2021,,,, 3 | ,June,,,, 4 | ,,1,3,gRPC in Go,"$33,019" 5 | ,,7,10,Optimizing Python,"$42,238" 6 | ,,28,30,python Foundations,"$24,372" 7 | ,July,,,, 8 | ,,5,8,go concurrency,"$46,382" 9 | ,,21,22,Writing Secure Go,"$27,038" 10 | -------------------------------------------------------------------------------- /Ch05/challenge/workshops.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/LinkedInLearning/data_cleaning_python_2883183/3c2fea80c9eb6bfd91bcced06f7fc78eeeab8688/Ch05/challenge/workshops.png -------------------------------------------------------------------------------- /Ch05/challenge/workshops.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('workshops.csv') 5 | df 6 | # %% 7 | """ 8 | Fix the data frame. At the end, each row should have the following columns: 9 | - start: pd.Timestamp 10 | - end: pd.Timestamp 11 | - name: str 12 | - topic: str (python or go) 13 | - earnings: np.float64 14 | """ -------------------------------------------------------------------------------- /Ch05/solution/workshops.csv: -------------------------------------------------------------------------------- 1 | Year,Month,Start,End,Name,Earnings 2 | 2021,,,, 3 | ,June,,,, 4 | ,,1,3,gRPC in Go,"$33,019" 5 | ,,7,10,Optimizing Python,"$42,238" 6 | ,,28,30,python Foundations,"$24,372" 7 | ,July,,,, 8 | ,,5,8,go concurrency,"$46,382" 9 | ,,21,22,Writing Secure Go,"$27,038" 10 | -------------------------------------------------------------------------------- /Ch05/solution/workshops.py: -------------------------------------------------------------------------------- 1 | # %% 2 | import pandas as pd 3 | 4 | df = pd.read_csv('workshops.csv') 5 | df 6 | 7 | # %% Fill Year & Month 8 | """ 9 | Fix the data frame. 
At the end, each row should have the following columns: 10 | - start: pd.Timestamp 11 | - end: pd.Timestamp 12 | - name: str 13 | - topic: str (python or go) 14 | - earnings: np.float64 15 | """ 16 | df['Year'].fillna(method='ffill', inplace=True) 17 | df['Month'].fillna(method='ffill', inplace=True) 18 | df 19 | 20 | # %% Drop year & month rows 21 | df = df[pd.notnull(df['Earnings'])].copy() 22 | df 23 | 24 | # %% 25 | def as_date(row, col): 26 | year = int(row['Year']) 27 | month = row['Month'] 28 | day = int(row[col]) 29 | ts = f'{month} {day}, {year}' 30 | return pd.to_datetime(ts, format='%B %d, %Y') 31 | 32 | df['start'] = df.apply(as_date, axis=1, args=('Start',)) 33 | df['end'] = df.apply(as_date, axis=1, args=('End',)) 34 | df 35 | 36 | # %% Extract topic 37 | def topic(name): 38 | if 'go' in name: 39 | return 'go' 40 | if 'python' in name: 41 | return 'python' 42 | 43 | df['topic'] = df['Name'].str.lower().apply(topic) 44 | df 45 | 46 | # %% Earnings 47 | import numpy as np 48 | df['earnings'] = pd.to_numeric( 49 | df['Earnings'].str.replace(r'[$,]', '', regex=True) 50 | ).astype(np.float64) 51 | df 52 | 53 | # %% Cleanup 54 | df = df[['start', 'end', 'Name', 'topic', 'earnings']] 55 | df.rename(columns={'Name': 'name'}, inplace=True) 56 | df -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | LinkedIn Learning Exercise Files License Agreement 2 | ================================================== 3 | 4 | This License Agreement (the "Agreement") is a binding legal agreement 5 | between you (as an individual or entity, as applicable) and LinkedIn 6 | Corporation (“LinkedIn”). By downloading or using the LinkedIn Learning 7 | exercise files in this repository (“Licensed Materials”), you agree to 8 | be bound by the terms of this Agreement. If you do not agree to these 9 | terms, do not download or use the Licensed Materials. 10 | 11 | 1. License. 
12 | - a. Subject to the terms of this Agreement, LinkedIn hereby grants LinkedIn 13 | members during their LinkedIn Learning subscription a non-exclusive, 14 | non-transferable copyright license, for internal use only, to 1) make a 15 | reasonable number of copies of the Licensed Materials, and 2) make 16 | derivative works of the Licensed Materials for the sole purpose of 17 | practicing skills taught in LinkedIn Learning courses. 18 | - b. Distribution. Unless otherwise noted in the Licensed Materials, subject 19 | to the terms of this Agreement, LinkedIn hereby grants LinkedIn members 20 | with a LinkedIn Learning subscription a non-exclusive, non-transferable 21 | copyright license to distribute the Licensed Materials, except the 22 | Licensed Materials may not be included in any product or service (or 23 | otherwise used) to instruct or educate others. 24 | 25 | 2. Restrictions and Intellectual Property. 26 | - a. You may not to use, modify, copy, make derivative works of, publish, 27 | distribute, rent, lease, sell, sublicense, assign or otherwise transfer the 28 | Licensed Materials, except as expressly set forth above in Section 1. 29 | - b. Linkedin (and its licensors) retains its intellectual property rights 30 | in the Licensed Materials. Except as expressly set forth in Section 1, 31 | LinkedIn grants no licenses. 32 | - c. You indemnify LinkedIn and its licensors and affiliates for i) any 33 | alleged infringement or misappropriation of any intellectual property rights 34 | of any third party based on modifications you make to the Licensed Materials, 35 | ii) any claims arising from your use or distribution of all or part of the 36 | Licensed Materials and iii) a breach of this Agreement. 
You will defend, hold 37 | harmless, and indemnify LinkedIn and its affiliates (and our and their 38 | respective employees, shareholders, and directors) from any claim or action 39 | brought by a third party, including all damages, liabilities, costs and 40 | expenses, including reasonable attorneys’ fees, to the extent resulting from, 41 | alleged to have resulted from, or in connection with: (a) your breach of your 42 | obligations herein; or (b) your use or distribution of any Licensed Materials. 43 | 44 | 3. Open source. This code may include open source software, which may be 45 | subject to other license terms as provided in the files. 46 | 47 | 4. Warranty Disclaimer. LINKEDIN PROVIDES THE LICENSED MATERIALS ON AN “AS IS” 48 | AND “AS AVAILABLE” BASIS. LINKEDIN MAKES NO REPRESENTATION OR WARRANTY, 49 | WHETHER EXPRESS OR IMPLIED, ABOUT THE LICENSED MATERIALS, INCLUDING ANY 50 | REPRESENTATION THAT THE LICENSED MATERIALS WILL BE FREE OF ERRORS, BUGS OR 51 | INTERRUPTIONS, OR THAT THE LICENSED MATERIALS ARE ACCURATE, COMPLETE OR 52 | OTHERWISE VALID. TO THE FULLEST EXTENT PERMITTED BY LAW, LINKEDIN AND ITS 53 | AFFILIATES DISCLAIM ANY IMPLIED OR STATUTORY WARRANTY OR CONDITION, INCLUDING 54 | ANY IMPLIED WARRANTY OR CONDITION OF MERCHANTABILITY OR FITNESS FOR A 55 | PARTICULAR PURPOSE, AVAILABILITY, SECURITY, TITLE AND/OR NON-INFRINGEMENT. 56 | YOUR USE OF THE LICENSED MATERIALS IS AT YOUR OWN DISCRETION AND RISK, AND 57 | YOU WILL BE SOLELY RESPONSIBLE FOR ANY DAMAGE THAT RESULTS FROM USE OF THE 58 | LICENSED MATERIALS TO YOUR COMPUTER SYSTEM OR LOSS OF DATA. NO ADVICE OR 59 | INFORMATION, WHETHER ORAL OR WRITTEN, OBTAINED BY YOU FROM US OR THROUGH OR 60 | FROM THE LICENSED MATERIALS WILL CREATE ANY WARRANTY OR CONDITION NOT 61 | EXPRESSLY STATED IN THESE TERMS. 62 | 63 | 5. Limitation of Liability. 
LINKEDIN SHALL NOT BE LIABLE FOR ANY INDIRECT, 64 | INCIDENTAL, SPECIAL, PUNITIVE, CONSEQUENTIAL OR EXEMPLARY DAMAGES, INCLUDING 65 | BUT NOT LIMITED TO, DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, DATA OR OTHER 66 | INTANGIBLE LOSSES . IN NO EVENT WILL LINKEDIN'S AGGREGATE LIABILITY TO YOU 67 | EXCEED $100. THIS LIMITATION OF LIABILITY SHALL: 68 | - i. APPLY REGARDLESS OF WHETHER (A) YOU BASE YOUR CLAIM ON CONTRACT, TORT, 69 | STATUTE, OR ANY OTHER LEGAL THEORY, (B) WE KNEW OR SHOULD HAVE KNOWN ABOUT 70 | THE POSSIBILITY OF SUCH DAMAGES, OR (C) THE LIMITED REMEDIES PROVIDED IN THIS 71 | SECTION FAIL OF THEIR ESSENTIAL PURPOSE; AND 72 | - ii. NOT APPLY TO ANY DAMAGE THAT LINKEDIN MAY CAUSE YOU INTENTIONALLY OR 73 | KNOWINGLY IN VIOLATION OF THESE TERMS OR APPLICABLE LAW, OR AS OTHERWISE 74 | MANDATED BY APPLICABLE LAW THAT CANNOT BE DISCLAIMED IN THESE TERMS. 75 | 76 | 6. Termination. This Agreement automatically terminates upon your breach of 77 | this Agreement or termination of your LinkedIn Learning subscription. On 78 | termination, all licenses granted under this Agreement will terminate 79 | immediately and you will delete the Licensed Materials. Sections 2-7 of this 80 | Agreement survive any termination of this Agreement. LinkedIn may discontinue 81 | the availability of some or all of the Licensed Materials at any time for any 82 | reason. 83 | 84 | 7. Miscellaneous. This Agreement will be governed by and construed in 85 | accordance with the laws of the State of California without regard to conflict 86 | of laws principles. The exclusive forum for any disputes arising out of or 87 | relating to this Agreement shall be an appropriate federal or state court 88 | sitting in the County of Santa Clara, State of California. If LinkedIn does 89 | not act to enforce a breach of this Agreement, that does not mean that 90 | LinkedIn has waived its right to enforce this Agreement. 
The Agreement does 91 | not create a partnership, agency relationship, or joint venture between the 92 | parties. Neither party has the power or authority to bind the other or to 93 | create any obligation or responsibility on behalf of the other. You may not, 94 | without LinkedIn’s prior written consent, assign or delegate any rights or 95 | obligations under these terms, including in connection with a change of 96 | control. Any purported assignment and delegation shall be ineffective. The 97 | Agreement shall bind and inure to the benefit of the parties, their respective 98 | successors and permitted assigns. If any provision of the Agreement is 99 | unenforceable, that provision will be modified to render it enforceable to the 100 | extent possible to give effect to the parties’ intentions and the remaining 101 | provisions will not be affected. This Agreement is the only agreement between 102 | you and LinkedIn regarding the Licensed Materials, and supersedes all prior 103 | agreements relating to the Licensed Materials. 104 | 105 | Last Updated: March 2019 106 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Copyright 2021 LinkedIn Corporation 2 | All Rights Reserved. 3 | 4 | Licensed under the LinkedIn Learning Exercise File License (the "License"). 5 | See LICENSE in the project root for license information. 6 | 7 | Please note, this project may automatically load third party code from external 8 | repositories (for example, NPM modules, Composer packages, or other dependencies). 9 | If so, such third party code may be subject to other license terms than as set 10 | forth above. In addition, such third party code may also depend on and load 11 | multiple tiers of dependencies. Please review the applicable licenses of the 12 | additional dependencies. 
13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Cleaning in Python Essential Training 2 | This is the repository for the LinkedIn Learning course Data Cleaning in Python Essential Training. The full course is available from [LinkedIn Learning][lil-course-url]. 3 | 4 | ![Data Cleaning in Python Essential Training][lil-thumbnail-url] 5 | 6 | Do you need to understand how to keep data clean and well-organized for your company? In this course, instructor Miki Tebeka explains why clean data is so important, what can cause errors, and how to detect, prevent, and fix errors to keep your data clean. Miki explains the types of errors that can occur in data, as well as missing values or bad values in the data. He goes over how human errors, machine-introduced errors, and design errors can find their way into your data, then shows you how to detect these errors. Miki dives into error prevention, with techniques like digital signatures, data pipelines and automation, and transactions. He concludes with ways you can fix errors, including renaming fields, fixing types, joining and splitting data, and more. 7 | 8 | ## Instructions 9 | This repository has branches for each of the videos in the course. You can use the branch pop-up menu in GitHub to switch to a specific branch and take a look at the course at that stage, or you can add `/tree/BRANCH_NAME` to the URL to go to the branch you want to access. 10 | 11 | ## Branches 12 | The branches are structured to correspond to the videos in the course. The naming convention is `CHAPTER#_MOVIE#`. As an example, the branch named `02_03` corresponds to the second chapter and the third video in that chapter. 13 | Some branches will have a beginning and an end state. These are marked with the letters `b` for "beginning" and `e` for "end". 
The `b` branch contains the code as it is at the beginning of the movie. The `e` branch contains the code as it is at the end of the movie. The `main` branch holds the final state of the code at the end of the course. 14 | 15 | When switching from one exercise files branch to the next after making changes to the files, you may get a message like this: 16 | 17 | error: Your local changes to the following files would be overwritten by checkout: [files] 18 | Please commit your changes or stash them before you switch branches. 19 | Aborting 20 | 21 | To resolve this issue: 22 | 23 | Add changes to git using this command: git add . 24 | Commit changes using this command: git commit -m "some message" 25 | 26 | ## Installing 27 | 1. To use these exercise files, you must have the following installed: 28 | - Python 3.6 and up 29 | 2. Clone this repository to your local machine using the terminal (Mac), CMD (Windows), or a GUI tool like SourceTree. 30 | 3. Install the dependencies 31 | - `python -m pip install -r requirements.txt` 32 | 33 | 34 | ### Instructor 35 | 36 | Miki Tebeka 37 | 38 | 39 | Check out my other courses on [LinkedIn Learning](https://www.linkedin.com/learning/instructors/miki-tebeka). 40 | 41 | [lil-course-url]: https://www.linkedin.com/learning/data-cleaning-in-python-essential-training 42 | [lil-thumbnail-url]: https://cdn.lynda.com/course/2883183/2883183-1632766207382-16x9.jpg 43 | 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | invoke~=1.5 2 | ipykernel~=5.5 3 | matplotlib~=3.4 4 | pandas~=1.2 5 | pandera~=0.6 6 | pyarrow~=4.0 7 | --------------------------------------------------------------------------------
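The `~=` specifiers in requirements.txt are compatible-release pins (for example, `pandas~=1.2` allows any 1.x release at or above 1.2, but not 2.0). A small sketch for confirming that the pinned packages are actually installed before running the exercise files; it assumes Python 3.8+ for `importlib.metadata`:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# package names from requirements.txt
for pkg in ('invoke', 'ipykernel', 'matplotlib', 'pandas', 'pandera', 'pyarrow'):
    print(pkg, installed_version(pkg) or 'NOT INSTALLED')
```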