├── README.md
└── pandas.md


/pandas.md:
--------------------------------------------------------------------------------
## Creating data

### DataFrame
A dataframe is a table.
It contains an array of individual entries,
each of which has a certain value.
Each entry corresponds to a row and a column.

```
import pandas as pd
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
```

![image](https://user-images.githubusercontent.com/95273765/219207825-cad209bd-8cbc-4cec-93a5-489bffdae851.png)

```
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
```

![image](https://user-images.githubusercontent.com/95273765/219208281-719abedb-6bcf-48ae-90f3-04b2fa742b8b.png)

### Series
A series, by contrast, is a sequence of data values.
If a dataframe is a table, a series is a list.

![image](https://user-images.githubusercontent.com/95273765/219208506-31e638b7-0a5e-4e1e-a8e6-5499dc379aba.png)

A series is, in essence, a single column of a dataframe.
So we can assign row labels to the series the same way as before, using an `index` parameter.
However, a series does not have a column name; it only has one overall `name`.

![image](https://user-images.githubusercontent.com/95273765/219208837-93449377-a30c-484c-8350-923e58f8c2c8.png)

## Reading data files
Read a CSV file:
``` python
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv")
```

We can use the `shape` attribute to check how large the resulting DataFrame is:
```
wine_reviews.shape

# returns a (rows, columns) tuple: the number of records and the number of columns they are split across
```

We can examine the contents of the resultant DataFrame using the `head()` method, which grabs the first five rows.
``` python
wine_reviews.head()
```

## Native accessors
In Python, we can access the property of an object by accessing it as an attribute.
A `book` object, for example, might have a `title` property, which we can access by calling `book.title`.

Hence to access the `country` property of `reviews` we can use:
```
reviews.country
```

![image](https://user-images.githubusercontent.com/95273765/219213679-7daeb44a-5917-42df-94a9-24ed9e0dcfa0.png)

We can also use the indexing operator:
```
reviews['country']
```

## Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem.

### Index-based selection
Pandas indexing works in one of two paradigms.
The first is index-based selection, which `iloc` follows:
selecting data based on its numerical position in the data.
The second is label-based selection, which `loc` follows:
selecting data based on its index labels.
Both `loc` and `iloc` are row-first, column-second.

`loc` and `iloc` are two important indexers in Pandas that allow selecting rows and columns of a DataFrame.

`loc` is used to select data based on labels of rows or columns. It takes the form `df.loc[row_label, column_label]`.
The `row_label` and `column_label` can be a single label, a list of labels, or a slice of labels.

``` python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [25, 28, 19, 31, 22],
        'gender': ['F', 'M', 'M', 'M', 'F']}

df = pd.DataFrame(data)
df.set_index('name', inplace=True)   # 'name' becomes the row labels

print(df.loc['Bob', 'age'])   # Output: 28
```

`iloc` is used to select data based on the integer position of rows or columns.
It takes the form `df.iloc[row_index, column_index]`.

The `row_index` and `column_index` can be a single index, a list of indexes, or a slice of indexes.
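
As a minimal sketch of those three forms, assuming a small hypothetical DataFrame (the `fruit`, `price`, and `stock` columns are made up for illustration and are not part of the data sets used elsewhere in these notes):

``` python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry', 'date'],
                   'price': [1.2, 0.5, 3.0, 2.4],
                   'stock': [10, 25, 0, 7]})

# Single index for both row and column: one scalar value.
single = df.iloc[2, 0]

# List of row indexes with a single column index: a Series of the chosen rows.
picked = df.iloc[[0, 3], 1]

# Slice of rows and slice of columns: a smaller DataFrame.
# As in native Python, the end of an integer slice is excluded.
block = df.iloc[1:3, 0:2]

print(single)
print(picked.tolist())
print(block.shape)
```

Note that a `loc` slice behaves differently: with labels, both ends of the slice are included.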

``` python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [25, 28, 19, 31, 22],
        'gender': ['F', 'M', 'M', 'M', 'F']}

df = pd.DataFrame(data)
df.set_index('name', inplace=True)   # 'name' is now the index, so 'age' is column 0

print(df.iloc[1, 0])   # Output: 28
```

On its own, the `:` operator, which also comes from native Python, means 'everything'.
Combined with other selectors, it indicates a range of values. For example, this selects the first column of the first three rows:
``` python
reviews.iloc[:3, 0]
```

![image](https://user-images.githubusercontent.com/95273765/219215404-05bb3082-5b2a-4f1f-a381-f251c6ac2b87.png)

## Manipulating the index
Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The `set_index()` method can be used to do the job.

## Info about the data
A DataFrame object has a method called `info()` that gives us more information about the data set.

``` python
df.info()   # prints the summary itself and returns None, so no print() is needed
```

The `info()` method also tells us how many non-null values are present in each column.
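
As a small sketch of what that looks like, assuming a hypothetical DataFrame with one missing value (the `name` and `age` columns here are made up for illustration):

``` python
import io
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [25.0, None, 19.0]})

# info() writes its report to stdout by default and returns None;
# passing a buffer captures the report as a string instead.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same per-column non-null counts are available directly via count().
non_null = df.count()
print(non_null)
```

Here `age` should report one fewer non-null value than `name`, which is a quick way to spot columns with missing data.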

## Get columns
``` python
print(df[['Name', 'Type 1', 'HP']])
```

iterate over rows

``` python
for index, row in df.iterrows():
    print(index, row['Name'])
```

## Conditional getting
``` python
df.loc[df['Type 1'] == 'Grass']
```

## Sort values
``` python
df.sort_values(['Name', 'HP'], ascending=[True, False])
```

## Make changes
``` python
df['Total'] = df['HP'] + df['Attack']
```

drop column

``` python
df = df.drop(columns=['Total'])
```

another way of getting the total

``` python
# sums the columns at positions 4 through 8 (the slice end is excluded)
df['Total'] = df.iloc[:, 4:9].sum(axis=1)
```

reorder the columns so that the last one (`Total`) comes right after the first four

``` python
cols = list(df.columns.values)
df = df[cols[0:4] + [cols[-1]] + cols[4:12]]
```

## Convert to csv
``` python
df.to_csv('modified.csv')
```

## Filtering data
``` python
df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')]
```

exclude rows whose name contains a substring (`~` negates the condition)

``` python
df.loc[~df['Name'].str.contains('Mega')]
```

## Reset index
``` python
new_df = new_df.reset_index()
```

## Regex filtering
``` python
df.loc[df['Type 1'].str.contains('Fire|Grass', regex=True)]
```

ignore case

``` python
import re
df.loc[df['Type 1'].str.contains('Fire|Grass', flags=re.I, regex=True)]
```

--------------------------------------------------------------------------------