├── PySpark └── Introduction to PySpark │ ├── Getting to know PySpark │ ├── Machine Learning Pipelines │ └── Manipulating Data ├── Python ├── Analyzing Police Activity with pandas │ ├── Analyzing the effect of weather on policing │ ├── Exploring the relationship between gender and policing │ ├── Preparing the data for analysis │ └── Visual exploratory data analysis ├── Cleaning Data in Python │ ├── Case study │ ├── Cleaning data for analysis │ ├── Combining data for analysis │ ├── Exploring your data │ └── Tidying data for analysis ├── Conda Essentials │ └── Installing Packages ├── Importing Data in Python -Part 1 │ ├── Importing data from other file types │ ├── Introduction and flat files │ └── Introduction to relational databases ├── Importing Data in Python -Part 2 │ ├── Diving deep into the Twitter API │ ├── Importing data from the Internet │ └── Interacting with APIs to import data from the web ├── Interactive Data Visualization with Bokeh │ ├── Basic plotting with Bokeh │ ├── Building interactive apps with Bokeh │ ├── Layouts, Interactions, and Annotations │ └── Putting It All Together! A Case Study ├── Intermediate-Python-for-Data-Science │ ├── Case-Study-Hacker-Statistics │ ├── Dictionaries-Pandas │ ├── Logic-ControlFlow-Filtering │ ├── Matplotlib │ └── loops ├── Intro to SQL for Data Science │ ├── Aggregate Functions │ ├── Filtering rows │ ├── Selecting columns │ └── Sorting, grouping and joins ├── Intro-to-data-science │ ├── Functions-Strings │ ├── Numpy-Statistics │ ├── Python-Basics │ └── Python-Lists ├── Introduction to Data Visualization with Python │ ├── Analyzing time series and images │ ├── Customizing plots │ ├── Plotting 2D arrays │ └── Statistical plots with Seaborn ├── Introduction to Databases in Python │ ├── Advanced SQLAlchemy Queries │ ├── Applying Filtering, Ordering and Grouping to Queries │ ├── Basics of Relational Databases │ ├── Creating and Manipulating your own Databases │ └── Putting it all together ├── Introduction to Relational Databases in SQL │ ├── Enforce data consistency with attribute constraints │ ├── Glue together tables with foreign keys │ ├── Uniquely identify records with key constraints │ └── Your first database ├── Introduction to Shell for Data Science │ └── Manipulating files and directories ├── Joining Data in SQL │ ├── Introduction to joins │ ├── Outer joins and cross joins │ ├── Set theory clauses │ └── Subqueries ├── Machine Learning with the Experts: School Budgets │ ├── Creating a simple first model │ ├── Exploring the raw data │ ├── Improving your model │ └── Learning from the experts ├── Manipulating DataFrames with pandas │ ├── Advanced indexing │ ├── Bringing it all together │ ├── Extracting and transforming data │ ├── Grouping data │ └── Rearranging and reshaping data ├── Merging DataFrames with pandas │ ├── Case Study - Summer Olympics │ ├── Concatenating data │ ├── Merging data │ └── Preparing data ├── Network Analysis in Python (Part 1) │ ├── Bringing it all together │ ├── Important nodes │ ├── Introduction to networks │ └── Structures ├── Python Data Science Toolbox -Part 1 │ ├── Default arguments, variable-length arguments and scope │ ├── Lambda functions and error-handling │ └── Writing your own functions ├── Python Data Science Toolbox -Part 2 │ ├── Case Study │ ├── List comprehensions and generators │ └── Using iterators in PythonLand ├── Python Data Science Toolbox -Part │ └── Case Study ├── Statistical Thinking in Python (Part 2) │ ├── Bootstrap confidence intervals │ ├── Hypothesis test examples │ ├── Introduction to 
hypothesis testing │ ├── Parameter estimation by optimization │ └── Putting it all together: a case study ├── Statistical Thinking in Python -Part 1 │ ├── Graphical exploratory data analysis │ ├── Quantitative exploratory data analysis │ ├── Thinking probabilistically-- Continuous variables │ └── Thinking probabilistically-- Discrete variables ├── Supervised Learning with scikit-learn │ ├── Classification │ ├── Fine-tuning your model │ ├── Preprocessing and pipelines │ └── Regression ├── Unsupervised Learning in Python │ ├── Clustering for dataset exploration │ ├── Decorrelating your data and dimension reduction │ ├── Discovering interpretable features │ └── Visualization with hierarchical clustering and t-SNE └── pandas Foundations │ ├── Case Study - Sunlight in Austin │ ├── Data ingestion & inspection │ ├── Exploratory data analysis │ └── Time series in pandas ├── README.md ├── SparkR └── Introduction to Spark in R using sparklyr │ ├── Going Native: Use The Native Interface to Manipulate Spark DataFrames │ ├── Light My Fire: Starting To Use Spark With dplyr Syntax │ └── Tools of the Trade: Advanced dplyr Usage └── Spoken Language Processing in Python └── Introduction to Spoken Language Processing with Python /PySpark/Introduction to PySpark/Getting to know PySpark: -------------------------------------------------------------------------------- 1 | Q1:- 2 | How do you connect to a Spark cluster from PySpark? 3 | 4 | Solution:- 5 | Create an instance of the SparkContext class. 6 | 7 | Q2:- 8 | Get to know the SparkContext. 9 | Call print() on sc to verify there's a SparkContext in your environment. 10 | print() sc.version to see what version of Spark is running on your cluster. 11 | 12 | Solution:- 13 | # Verify SparkContext 14 | print(sc) 15 | 16 | # Print Spark version 17 | print(sc.version) 18 | 19 | Q3:- 20 | Which of the following is an advantage of Spark DataFrames over RDDs? 21 | 22 | Solution:- 23 | Operations using DataFrames are automatically optimized. 24 | 25 | Q4:- 26 | Import SparkSession from pyspark.sql. 27 | Make a new SparkSession called my_spark using SparkSession.builder.getOrCreate(). 28 | Print my_spark to the console to verify it's a SparkSession. 29 | 30 | Solution:- 31 | # Import SparkSession from pyspark.sql 32 | from pyspark.sql import SparkSession 33 | 34 | # Create my_spark 35 | my_spark = SparkSession.builder.getOrCreate() 36 | 37 | # Print my_spark 38 | print(my_spark) 39 | 40 | Q5:- 41 | See what tables are in your cluster by calling spark.catalog.listTables() and printing the result! 42 | 43 | Solution:- 44 | # Print the tables in the catalog 45 | print(spark.catalog.listTables()) 46 | 47 | Q6:- 48 | Use the .sql() method to get the first 10 rows of the flights table and save the result to flights10. The variable query contains the appropriate SQL query. 49 | Use the DataFrame method .show() to print flights10 50 | 51 | Solution:- 52 | # Don't change this query 53 | query = "FROM flights SELECT * LIMIT 10" 54 | 55 | # Get the first 10 rows of flights 56 | flights10 = spark.sql(query) 57 | 58 | # Show the results 59 | flights10.show() 60 | 61 | Q7:- 62 | Run the query using the .sql() method. Save the result in flight_counts. 63 | Use the .toPandas() method on flight_counts to create a pandas DataFrame called pd_counts. 64 | Print the .head() of pd_counts to the console. 
65 |
66 | Solution:-
67 | # Don't change this query
68 | query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"
69 |
70 | # Run the query
71 | flight_counts = spark.sql(query)
72 |
73 | # Convert the results to a pandas DataFrame
74 | pd_counts = flight_counts.toPandas()
75 |
76 | # Print the head of pd_counts
77 | print(pd_counts.head())
78 |
79 | Q8:-
80 | The code to create a pandas DataFrame of random numbers has already been provided and saved under pd_temp.
81 | Create a Spark DataFrame called spark_temp by calling the .createDataFrame() method with pd_temp as the argument.
82 | Examine the list of tables in your Spark cluster and verify that the new DataFrame is not present. Remember you can use spark.catalog.listTables() to do so.
83 | Register spark_temp as a temporary table named "temp" using the .createOrReplaceTempView() method. Remember that the table name is set by passing it as the only argument!
84 | Examine the list of tables again!
85 |
86 | Solution:-
87 | # Create pd_temp
88 | pd_temp = pd.DataFrame(np.random.random(10))
89 |
90 | # Create spark_temp from pd_temp
91 | spark_temp = spark.createDataFrame(pd_temp)
92 |
93 | # Examine the tables in the catalog
94 | print(spark.catalog.listTables())
95 |
96 | # Add spark_temp to the catalog
97 | spark_temp.createOrReplaceTempView("temp")
98 |
99 | # Examine the tables in the catalog again
100 | print(spark.catalog.listTables())
101 |
102 | Q9:-
103 | Use the .read.csv() method to create a Spark DataFrame called airports.
104 | The first argument is file_path.
105 | Pass the argument header=True so that Spark knows to take the column names from the first line of the file.
106 | Print out this DataFrame by calling .show().
107 |
108 | Solution:-
109 | # Don't change this file path
110 | file_path = "/usr/local/share/datasets/airports.csv"
111 |
112 | # Read in the airports data
113 | airports = spark.read.csv(file_path, header=True)
114 |
115 | # Show the data
116 | airports.show()
117 |
118 |
--------------------------------------------------------------------------------
/PySpark/Introduction to PySpark/Machine Learning Pipelines:
--------------------------------------------------------------------------------
1 | Q1:-
2 | First, rename the year column of planes to plane_year to avoid duplicate column names.
3 | Create a new DataFrame called model_data by joining the flights table with planes using the tailnum column as the key.
4 |
5 | Solution:-
6 |
--------------------------------------------------------------------------------
/Python/Analyzing Police Activity with pandas/Analyzing the effect of weather on policing:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Read weather.csv into a DataFrame named weather.
3 | Select the temperature columns (TMIN, TAVG, TMAX) and print their summary statistics using the .describe() method.
4 | Create a box plot to visualize the temperature columns.
5 | Display the plot.
6 |
7 | Solution:-
8 | # Read 'weather.csv' into a DataFrame named 'weather'
9 | weather = pd.read_csv('weather.csv')
10 |
11 | # Describe the temperature columns
12 | print(weather[['TMIN', 'TAVG', 'TMAX']].describe())
13 |
14 | # Create a box plot of the temperature columns
15 | weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')
16 |
17 | # Display the plot
18 | plt.show()
19 |
20 | Q2:-
21 | Create a new column in the weather DataFrame named TDIFF that represents the difference between the maximum and minimum temperatures.
22 | Print the summary statistics for TDIFF using the .describe() method. 23 | Create a histogram with 20 bins to visualize TDIFF. 24 | Display the plot. 25 | 26 | Solution:- 27 | # Create a 'TDIFF' column that represents temperature difference 28 | weather['TDIFF'] =weather.TMAX - weather.TMIN 29 | 30 | # Describe the 'TDIFF' column 31 | print(weather['TDIFF'].describe()) 32 | 33 | # Create a histogram with 20 bins to visualize 'TDIFF' 34 | weather.TDIFF.plot(kind='hist',bins=20) 35 | 36 | # Display the plot 37 | plt.show() 38 | 39 | Q3:- 40 | Copy the columns WT01 through WT22 from weather to a new DataFrame named WT. 41 | Calculate the sum of each row in WT, and store the results in a new weather column named bad_conditions. 42 | Replace any missing values in bad_conditions with a 0. (This has been done for you.) 43 | Create a histogram to visualize bad_conditions, and then display the plot. 44 | 45 | Solution:- 46 | # Copy 'WT01' through 'WT22' to a new DataFrame 47 | WT = weather.loc[:,'WT01':'WT22'] 48 | 49 | # Calculate the sum of each row in 'WT' 50 | weather['bad_conditions'] = WT.sum(axis='columns') 51 | 52 | # Replace missing values in 'bad_conditions' with '0' 53 | weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int') 54 | 55 | # Create a histogram to visualize 'bad_conditions' 56 | weather.bad_conditions.plot(kind='hist') 57 | 58 | # Display the plot 59 | plt.show() 60 | 61 | Q4:- 62 | Count the unique values in the bad_conditions column and sort the index. (This has been done for you.) 63 | Create a dictionary called mapping that maps the bad_conditions integers to strings as specified above. 64 | Convert the bad_conditions integers to strings using the mapping and store the results in a new column called rating. 65 | Count the unique values in rating to verify that the integers were properly converted to strings. 66 | 67 | Solution:- 68 | # Count the unique values in 'bad_conditions' and sort the index 69 | print(weather.bad_conditions.value_counts().sort_index()) 70 | 71 | # Create a dictionary that maps integers to strings 72 | mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad',4:'bad',5:'worse',6:'worse',7:'worse',8:'worse',9:'worse'} 73 | 74 | # Convert the 'bad_conditions' integers to strings using the 'mapping' 75 | weather['rating'] = weather.bad_conditions.map(mapping) 76 | 77 | # Count the unique values in 'rating' 78 | print(weather.rating.value_counts()) 79 | 80 | Q5:- 81 | Create a list object called cats that lists the weather ratings in a logical order: 'good', 'bad', 'worse'. 82 | Change the data type of the rating column from object to category. Make sure to use the cats list to define the category ordering. 83 | Examine the head of the rating column to confirm that the categories are logically ordered. 84 | 85 | Solution:- 86 | # Create a list of weather ratings in logical order 87 | cats= ['good','bad','worse'] 88 | 89 | # Change the data type of 'rating' to category 90 | weather['rating'] = weather['rating'].astype('category').cat.reorder_categories(cats, ordered=True) 91 | 92 | # Examine the head of 'rating' 93 | print(weather['rating'].head()) 94 | 95 | Q6:- 96 | Reset the index of the ri DataFrame. 97 | Examine the head of ri to verify that stop_datetime is now a DataFrame column, and the index is now the default integer index. 98 | Create a new DataFrame named weather_rating that contains only the DATE and rating columns from the weather DataFrame. 99 | Examine the head of weather_rating to verify that it contains the proper columns. 
100 | 101 | Solution:- 102 | # Reset the index of 'ri' 103 | ri.reset_index(inplace=True) 104 | 105 | # Examine the head of 'ri' 106 | print(ri.head()) 107 | 108 | # Create a DataFrame from the 'DATE' and 'rating' columns 109 | weather_rating = weather[['DATE','rating']] 110 | 111 | # Examine the head of 'weather_rating' 112 | print(weather_rating.head()) 113 | 114 | Q7:- 115 | Examine the shape of the ri DataFrame. 116 | Merge the ri and weather_rating DataFrames using a left join. 117 | Examine the shape of ri_weather to confirm that it has two more columns but the same number of rows as ri. 118 | Replace the index of ri_weather with the stop_datetime column. 119 | 120 | Solution:- 121 | # Examine the shape of 'ri' 122 | print(ri.shape) 123 | 124 | # Merge 'ri' and 'weather_rating' using a left join 125 | ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left') 126 | 127 | # Examine the shape of 'ri_weather' 128 | print(ri_weather.shape) 129 | 130 | # Set 'stop_datetime' as the index of 'ri_weather' 131 | ri_weather.set_index('stop_datetime', inplace=True) 132 | 133 | Q8:- 134 | Calculate the overall arrest rate by taking the mean of the is_arrested Series. 135 | 136 | Solution:- 137 | # Calculate the overall arrest rate 138 | print(ri_weather.is_arrested.mean()) 139 | 140 | Q9:- 141 | Calculate the arrest rate for each weather rating using a .groupby(). 142 | 143 | Solution:- 144 | # Calculate the arrest rate for each 'rating' 145 | print(ri_weather.groupby('rating').is_arrested.mean()) 146 | 147 | Q10:- 148 | Calculate the arrest rate for each combination of violation and rating. How do the arrest rates differ by group? 149 | 150 | Solution- 151 | # Calculate the arrest rate for each 'violation' and 'rating' 152 | print(ri_weather.groupby(['violation','rating']).is_arrested.mean()) 153 | 154 | Q11:- 155 | Save the output of the .groupby() operation from the last exercise as a new object, arrest_rate. (This has been done for you.) 156 | Print the arrest_rate Series and examine it. 157 | Print the arrest rate for moving violations in bad weather. 158 | Print the arrest rates for speeding violations in all three weather conditions. 159 | 160 | Solution:- 161 | # Save the output of the groupby operation from the last exercise 162 | arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean() 163 | 164 | # Print the 'arrest_rate' Series 165 | print(arrest_rate) 166 | 167 | # Print the arrest rate for moving violations in bad weather 168 | print(arrest_rate.loc['Moving violation','bad']) 169 | 170 | # Print the arrest rates for speeding violations in all three weather conditions 171 | print(arrest_rate.loc['Speeding']) 172 | 173 | Q12:- 174 | Unstack the arrest_rate Series to reshape it into a DataFrame. 175 | Create the exact same DataFrame using a pivot table! Each of the three .pivot_table() parameters should be specified as one of the ri_weather columns. 
176 | 177 | Solution:- 178 | # Unstack the 'arrest_rate' Series into a DataFrame 179 | print(arrest_rate.unstack()) 180 | 181 | # Create the same DataFrame using a pivot table 182 | print(ri_weather.pivot_table(index='violation', columns=['rating'], values='is_arrested')) 183 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Exploring the relationship between gender and policing: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers. 3 | Express the violation counts as proportions of the total. 4 | 5 | Solution:- 6 | # Count the unique values in 'violation' 7 | print(ri.violation.value_counts()) 8 | 9 | # Express the counts as proportions 10 | print(ri.violation.value_counts(normalize=True)) 11 | 12 | Q2:- 13 | Create a DataFrame, female, that only contains rows in which driver_gender is 'F'. 14 | Create a DataFrame, male, that only contains rows in which driver_gender is 'M'. 15 | Count the violations committed by female drivers and express them as proportions. 16 | Count the violations committed by male drivers and express them as proportions. 17 | 18 | Solution:- 19 | # Create a DataFrame of female drivers 20 | female = ri[ri['driver_gender'] == 'F'] 21 | 22 | # Create a DataFrame of male drivers 23 | male = ri[ri['driver_gender'] == 'M'] 24 | 25 | # Compute the violations by female drivers (as proportions) 26 | print(female.violation.value_counts(normalize=True)) 27 | 28 | # Compute the violations by male drivers (as proportions) 29 | print(male.violation.value_counts(normalize=True)) 30 | 31 | Q3:- 32 | Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding. 33 | Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding. 34 | Count the stop outcomes for the female drivers and express them as proportions. 35 | Count the stop outcomes for the male drivers and express them as proportions. 36 | 37 | Solution:- 38 | # Create a DataFrame of female drivers stopped for speeding 39 | female_and_speeding = ri[(ri.driver_gender=='F') & (ri.violation=='Speeding')] 40 | 41 | # Create a DataFrame of male drivers stopped for speeding 42 | male_and_speeding = ri[(ri.driver_gender=='M') & (ri.violation=='Speeding')] 43 | 44 | # Compute the stop outcomes for female drivers (as proportions) 45 | print(female_and_speeding.stop_outcome.value_counts(normalize=True)) 46 | 47 | # Compute the stop outcomes for male drivers (as proportions) 48 | print(male_and_speeding.stop_outcome.value_counts(normalize=True)) 49 | 50 | Q4:- 51 | Check the data type of search_conducted to confirm that it's a Boolean Series. 52 | Calculate the search rate by counting the Series values and expressing them as proportions. 53 | Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.) 
54 | 55 | Solution:- 56 | # Check the data type of 'search_conducted' 57 | print(ri.search_conducted.dtype) 58 | 59 | # Calculate the search rate by counting the values 60 | print(ri.search_conducted.value_counts(normalize=True)) 61 | 62 | # Calculate the search rate by taking the mean 63 | print(ri.search_conducted.mean()) 64 | 65 | Q5:- 66 | Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of search_conducted. 67 | 68 | Solution:- 69 | # Calculate the search rate for female drivers 70 | print(ri[ri.driver_gender=='F'].search_conducted.mean()) 71 | 72 | Q6:- 73 | Filter the DataFrame to only include male drivers, and then repeat the search rate calculation. 74 | 75 | Solution:- 76 | # Calculate the search rate for male drivers 77 | print(ri[ri.driver_gender=='M'].search_conducted.mean()) 78 | 79 | Q7:- 80 | Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.) 81 | 82 | Solution:- 83 | # Calculate the search rate for both groups simultaneously 84 | print(ri.groupby('driver_gender').search_conducted.mean()) 85 | 86 | Q8:- 87 | Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation? 88 | 89 | Solution:- 90 | # Calculate the search rate for each combination of gender and violation 91 | print(ri.groupby(['driver_gender','violation']).search_conducted.mean()) 92 | 93 | Q9:- 94 | Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way. 95 | 96 | Solution:- 97 | # Reverse the ordering to group by violation before gender 98 | print(ri.groupby(['violation','driver_gender']).search_conducted.mean()) 99 | 100 | Q10:- 101 | Count the search_type values to see how many times "Protective Frisk" was the only search type. 102 | Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise. 103 | Check the data type of frisk to confirm that it's a Boolean Series. 104 | Take the sum of frisk to count the total number of frisks. 105 | 106 | Solution:- 107 | # Count the 'search_type' values 108 | print(ri.search_type.value_counts()) 109 | 110 | # Check if 'search_type' contains the string 'Protective Frisk' 111 | ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False) 112 | 113 | # Check the data type of 'frisk' 114 | print(ri.frisk.dtype) 115 | 116 | # Take the sum of 'frisk' 117 | print(ri.frisk.sum()) 118 | 119 | Q11:- 120 | Create a DataFrame, searched, that only contains rows in which search_conducted is True. 121 | Take the mean of the frisk column to find out what percentage of searches included a frisk. 122 | Calculate the frisk rate for each gender using a .groupby(). 123 | 124 | Solution:- 125 | # Create a DataFrame of stops in which a search was conducted 126 | searched = ri[ri.search_conducted == True] 127 | 128 | # Calculate the overall frisk rate by taking the mean of 'frisk' 129 | print(searched.frisk.mean()) 130 | 131 | # Calculate the frisk rate for each gender 132 | print(searched.groupby(['driver_gender']).frisk.mean()) 133 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Preparing the data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import pandas using the alias pd. 
3 | Read the file police.csv into a DataFrame named ri. 4 | Examine the first 5 rows of the DataFrame (known as the "head"). 5 | Count the number of missing values in each column: Use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column. 6 | 7 | Solution:- 8 | # Import the pandas library as pd 9 | import pandas as pd 10 | 11 | # Read 'police.csv' into a DataFrame named ri 12 | ri = pd.read_csv('police.csv') 13 | 14 | # Examine the head of the DataFrame 15 | print(ri.head()) 16 | 17 | # Count the number of missing values in each column 18 | print(ri.isnull().sum()) 19 | 20 | Q2:- 21 | Count the number of missing values in each column. (This has been done for you.) 22 | Examine the DataFrame's .shape to find out the number of rows and columns. 23 | Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings. 24 | Examine the .shape again to verify that there are now two fewer columns. 25 | 26 | Solution:- 27 | # Count the number of missing values in each column 28 | print(ri.isnull().sum()) 29 | 30 | # Examine the shape of the DataFrame 31 | print(ri.shape) 32 | 33 | # Drop the 'county_name' and 'state' columns 34 | ri.drop(['county_name', 'state'], axis='columns', inplace=True) 35 | 36 | # Examine the shape of the DataFrame (again) 37 | print(ri.shape) 38 | 39 | Q3:- 40 | Count the number of missing values in each column. 41 | Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna(). 42 | Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender. 43 | Examine the DataFrame's .shape to see how many rows and columns remain. 44 | 45 | Solution:- 46 | # Count the number of missing values in each column 47 | print(ri.isnull().sum()) 48 | 49 | # Drop all rows that are missing 'driver_gender' 50 | ri.dropna(subset=['driver_gender'], inplace=True) 51 | 52 | # Count the number of missing values in each column (again) 53 | print(ri.isnull().sum()) 54 | 55 | # Examine the shape of the DataFrame 56 | print(ri.shape) 57 | 58 | Q4:- 59 | Examine the head of the is_arrested column to verify that it contains True and False values. 60 | Check the current data type of is_arrested. 61 | Use the .astype() method to convert is_arrested to a bool column. 62 | Check the new data type of is_arrested, to confirm that it is now a bool column. 63 | 64 | Solution:- 65 | # Examine the head of the 'is_arrested' column 66 | print(ri.is_arrested.head()) 67 | 68 | # Check the data type of 'is_arrested' 69 | print(ri.is_arrested.dtype) 70 | 71 | # Change the data type of 'is_arrested' to 'bool' 72 | ri['is_arrested'] = ri.is_arrested.astype('bool') 73 | 74 | # Check the data type of 'is_arrested' (again) 75 | print(ri.is_arrested.dtype) 76 | 77 | Q5:- 78 | Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined. 79 | Convert combined to datetime format, and store the result in a new column named stop_datetime. 80 | Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column. 
81 | 82 | Solution:- 83 | # Concatenate 'stop_date' and 'stop_time' (separated by a space) 84 | combined = ri.stop_date.str.cat(ri.stop_time, sep=' ') 85 | 86 | # Convert 'combined' to datetime format 87 | ri['stop_datetime'] = pd.to_datetime(combined) 88 | 89 | # Examine the data types of the DataFrame 90 | print(ri.dtypes) 91 | 92 | Q6:- 93 | Set stop_datetime as the DataFrame index. 94 | Examine the index to verify that it is a DatetimeIndex. 95 | Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns. 96 | 97 | Solution:- 98 | # Set 'stop_datetime' as the index 99 | ri.set_index('stop_datetime', inplace=True) 100 | 101 | # Examine the index 102 | print(ri.index) 103 | 104 | # Examine the columns 105 | print(ri.columns) 106 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Visual exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Take the mean of the is_arrested column to calculate the overall arrest rate. 3 | Group by the hour attribute of the DataFrame index to calculate the hourly arrest rate. 4 | Save the hourly arrest rate Series as a new object, hourly_arrest_rate. 5 | 6 | Solution:- 7 | # Calculate the overall arrest rate 8 | print(ri.is_arrested.mean()) 9 | 10 | # Calculate the hourly arrest rate 11 | print(ri.groupby(ri.index.hour).is_arrested.mean()) 12 | 13 | # Save the hourly arrest rate 14 | hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean() 15 | 16 | Q2:- 17 | Import matplotlib.pyplot using the alias plt. 18 | Create a line plot of hourly_arrest_rate using the .plot() method. 19 | Label the x-axis as 'Hour', label the y-axis as 'Arrest Rate', and title the plot 'Arrest Rate by Time of Day'. 20 | Display the plot using the .show() function. 21 | 22 | Solution:- 23 | # Import matplotlib.pyplot as plt 24 | import matplotlib.pyplot as plt 25 | 26 | # Create a line plot of 'hourly_arrest_rate' 27 | plt.plot(hourly_arrest_rate) 28 | 29 | # Add the xlabel, ylabel, and title 30 | plt.xlabel('Hour') 31 | plt.ylabel('Arrest Rate') 32 | plt.title('Arrest Rate by Time of Day') 33 | 34 | # Display the plot 35 | plt.show() 36 | 37 | Q3:- 38 | Calculate the annual rate of drug-related stops by resampling the drugs_related_stop column (on the 'A' frequency) and taking the mean. 39 | Save the annual drug rate Series as a new object, annual_drug_rate. 40 | Create a line plot of annual_drug_rate using the .plot() method. 41 | Display the plot using the .show() function. 42 | 43 | Solution:- 44 | # Calculate the annual rate of drug-related stops 45 | print(ri.drugs_related_stop.resample('A').mean()) 46 | 47 | # Save the annual rate of drug-related stops 48 | annual_drug_rate = ri.drugs_related_stop.resample('A').mean() 49 | 50 | # Create a line plot of 'annual_drug_rate' 51 | plt.plot(annual_drug_rate) 52 | 53 | # Display the plot 54 | plt.show() 55 | 56 | Q4:- 57 | Calculate the annual search rate by resampling the search_conducted column, and save the result as annual_search_rate. 58 | Concatenate annual_drug_rate and annual_search_rate along the columns axis, and save the result as annual. 59 | Create subplots of the drug and search rates from the annual DataFrame. 60 | Display the subplots. 
61 | 62 | Solution:- 63 | # Calculate and save the annual search rate 64 | annual_search_rate = ri.search_conducted.resample('A').mean() 65 | 66 | # Concatenate 'annual_drug_rate' and 'annual_search_rate' 67 | annual = pd.concat([annual_drug_rate,annual_search_rate], axis=1) 68 | 69 | # Create subplots from 'annual' 70 | annual.plot(subplots=True) 71 | 72 | # Display the subplots 73 | plt.show() 74 | 75 | Q5:- 76 | Create a frequency table from the district and violation columns using the pd.crosstab() function. 77 | Save the frequency table as a new object, all_zones. 78 | Select rows 'Zone K1' through 'Zone K3' from all_zones using the .loc[] accessor. 79 | Save the smaller table as a new object, k_zones. 80 | 81 | Solution:- 82 | # Create a frequency table of districts and violations 83 | print(pd.crosstab(ri.district,ri.violation)) 84 | 85 | # Save the frequency table as 'all_zones' 86 | all_zones = pd.crosstab(ri.district,ri.violation) 87 | 88 | # Select rows 'Zone K1' through 'Zone K3' 89 | print(all_zones.loc['Zone K1':'Zone K3']) 90 | 91 | # Save the smaller table as 'k_zones' 92 | k_zones = all_zones.loc['Zone K1':'Zone K3'] 93 | 94 | Q6:- 95 | Create a bar plot of k_zones. 96 | Display the plot and examine it. What do you notice about each of the zones? 97 | 98 | Solution:- 99 | # Create a bar plot of 'k_zones' 100 | k_zones.plot(kind='bar') 101 | 102 | # Display the plot 103 | plt.show() 104 | 105 | Q7:- 106 | Create a stacked bar plot of k_zones. 107 | Display the plot and examine it. Do you notice anything different about the data than you did previously? 108 | 109 | Solution:- 110 | # Create a stacked bar plot of 'k_zones' 111 | k_zones.plot(kind='bar',stacked=True) 112 | 113 | # Display the plot 114 | plt.show() 115 | 116 | Q8:- 117 | Print the unique values in the stop_duration column. (This has been done for you.) 118 | Create a dictionary called mapping that maps the stop_duration strings to the integers specified above. 119 | Convert the stop_duration strings to integers using the mapping, and store the results in a new column called stop_minutes. 120 | Print the unique values in the stop_minutes column, to verify that the durations were properly converted to integers. 121 | 122 | Solution:- 123 | # Print the unique values in 'stop_duration' 124 | print(ri.stop_duration.unique()) 125 | 126 | # Create a dictionary that maps strings to integers 127 | mapping = {'0-15 Min':8,'16-30 Min':23,'30+ Min':45} 128 | 129 | # Convert the 'stop_duration' strings to integers using the 'mapping' 130 | ri['stop_minutes'] = ri.stop_duration.map(mapping) 131 | 132 | # Print the unique values in 'stop_minutes' 133 | print(ri.stop_minutes.unique()) 134 | 135 | Q9:- 136 | For each value in the violation_raw column, calculate the mean number of stop_minutes that a driver is detained. 137 | Save the resulting Series as a new object, stop_length. 138 | Sort stop_length by its values, and then visualize it using a horizontal bar plot. 139 | Display the plot. 
140 | 141 | Solution:- 142 | # Calculate the mean 'stop_minutes' for each value in 'violation_raw' 143 | print(ri.groupby(['violation_raw']).stop_minutes.mean()) 144 | 145 | # Save the resulting Series as 'stop_length' 146 | stop_length = ri.groupby(['violation_raw']).stop_minutes.mean() 147 | 148 | # Sort 'stop_length' by its values and create a horizontal bar plot 149 | stop_length.sort_values().plot(kind='barh') 150 | 151 | # Display the plot 152 | plt.show() 153 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Combining data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Concatenate uber1, uber2, and uber3 together using pd.concat(). You'll have to pass the DataFrames in as a list. 3 | Print the shape and then the head of the concatenated DataFrame, row_concat. 4 | 5 | Solution:- 6 | # Concatenate uber1, uber2, and uber3: row_concat 7 | row_concat = pd.concat([uber1,uber2,uber3]) 8 | 9 | # Print the shape of row_concat 10 | print(row_concat.shape) 11 | 12 | # Print the head of row_concat 13 | print(row_concat.head()) 14 | 15 | Q2:- 16 | Concatenate ebola_melt and status_country column-wise into a single DataFrame called ebola_tidy. Be sure to specify axis=1 and to pass the two DataFrames in as a list. 17 | Print the shape and then the head of the concatenated DataFrame, ebola_tidy. 18 | 19 | Solution:- 20 | # Concatenate ebola_melt and status_country column-wise: ebola_tidy 21 | ebola_tidy = pd.concat([ebola_melt,status_country], axis=1) 22 | 23 | # Print the shape of ebola_tidy 24 | print(ebola_tidy.shape) 25 | 26 | # Print the head of ebola_tidy 27 | print(ebola_tidy.head()) 28 | 29 | 30 | Q3:- 31 | Import the glob module along with pandas (as its usual alias pd). 32 | Write a pattern to match all .csv files. 33 | Save all files that match the pattern using the glob() function within the glob module. That is, by using glob.glob(). 34 | Print the list of file names. This has been done for you. 35 | Read the second file in csv_files (i.e., index 1) into a DataFrame called csv2. 36 | Hit 'Submit Answer' to print the head of csv2. Does it look familiar? 37 | 38 | Solution:- 39 | # Import necessary modules 40 | import glob 41 | import pandas as pd 42 | 43 | # Write the pattern: pattern 44 | pattern = '*.csv' 45 | 46 | # Save all file matches: csv_files 47 | csv_files = glob.glob(pattern) 48 | 49 | # Print the file names 50 | print(csv_files) 51 | 52 | # Load the second file into a DataFrame: csv2 53 | csv2 = pd.read_csv(csv_files[1]) 54 | 55 | # Print the head of csv2 56 | print(csv2.head()) 57 | 58 | 59 | Q4:- 60 | Write a for loop to iterate though csv_files: 61 | In each iteration of the loop, read csv into a DataFrame called df. 62 | After creating df, append it to the list frames using the .append() method. 63 | Concatenate frames into a single DataFrame called uber. 64 | Hit 'Submit Answer' to see the head and shape of the concatenated DataFrame! 
65 | 66 | Solution:- 67 | # Create an empty list: frames 68 | frames = [] 69 | 70 | # Iterate over csv_files 71 | for csv in csv_files: 72 | 73 | # Read csv into a DataFrame: df 74 | df = pd.read_csv(csv) 75 | 76 | # Append df to frames 77 | frames.append(df) 78 | 79 | # Concatenate frames into a single DataFrame: uber 80 | uber = pd.concat(frames) 81 | 82 | # Print the shape of uber 83 | print(uber.shape) 84 | 85 | # Print the head of uber 86 | print(uber.head()) 87 | 88 | Q5:- 89 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited. 90 | Print the merged DataFrame o2o. 91 | 92 | Solution:- 93 | # Merge the DataFrames: o2o 94 | o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site') 95 | 96 | # Print o2o 97 | print(o2o) 98 | 99 | Q6:- 100 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited, exactly as you did in the previous exercise. 101 | Print the merged DataFrame and then hit 'Submit Answer' to see the different output produced by this merge! 102 | 103 | Solution:- 104 | # Merge the DataFrames: m2o 105 | m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site') 106 | 107 | # Print m2o 108 | print(m2o) 109 | 110 | Q7:- 111 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited, exactly as you did in the previous two exercises. Save the result as m2m. 112 | Merge the m2m and survey DataFrames on the 'ident' column of m2m and 'taken' column of survey. 113 | Hit 'Submit Answer' to print the first 20 lines of the merged DataFrame! 114 | 115 | Solution:- 116 | # Merge site and visited: m2m 117 | m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site') 118 | 119 | # Merge m2m and survey: m2m 120 | m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken') 121 | 122 | # Print the first 20 lines of m2m 123 | print(m2m.head(20)) 124 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Exploring your data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import pandas as pd. 3 | Read 'dob_job_application_filings_subset.csv' into a DataFrame called df. 4 | Print the head and tail of df. 5 | Print the shape of df and its columns. Note: .shape and .columns are attributes, not methods, so you don't need to follow these with parentheses (). 6 | Hit 'Submit Answer' to view the results! Notice the suspicious number of 0 values. Perhaps these represent missing data. 7 | 8 | Solution:- 9 | # Import pandas 10 | import pandas as pd 11 | 12 | # Read the file into a DataFrame: df 13 | df = pd.read_csv('dob_job_application_filings_subset.csv') 14 | 15 | # Print the head of df 16 | print(df.head()) 17 | 18 | # Print the tail of df 19 | print(df.tail()) 20 | 21 | # Print the shape of df 22 | print(df.shape) 23 | 24 | # Print the columns of df 25 | print(df.columns) 26 | 27 | # Print the head and tail of df_subset 28 | print(df_subset.head()) 29 | print(df_subset.tail()) 30 | 31 | Q2:- 32 | 33 | Print the info of df. 34 | Print the info of the subset dataframe, df_subset. 35 | 36 | Solution:- 37 | # Print the info of df 38 | print(df.info()) 39 | 40 | # Print the info of df_subset 41 | print(df_subset.info()) 42 | 43 | Q3:- 44 | Print the value counts for: 45 | The 'Borough' column. 46 | The 'State' column. 47 | The 'Site Fill' column. 
48 | 49 | Solution:- 50 | # Print the value counts for 'Borough' 51 | print(df['Borough'].value_counts(dropna=False)) 52 | 53 | # Print the value_counts for 'State' 54 | print(df['State'].value_counts(dropna=False)) 55 | 56 | # Print the value counts for 'Site Fill' 57 | print(df['Site Fill'].value_counts(dropna=False)) 58 | 59 | Q4:- 60 | Import matplotlib.pyplot as plt. 61 | Create a histogram of the 'Existing Zoning Sqft' column. Rotate the axis labels by 70 degrees and use a log scale for both axes. 62 | Display the histogram using plt.show(). 63 | 64 | Solution:- 65 | # Import matplotlib.pyplot 66 | import matplotlib.pyplot as plt 67 | 68 | # Plot the histogram 69 | df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True) 70 | 71 | # Display the histogram 72 | plt.show() 73 | 74 | Q5:- 75 | Using the .boxplot() method of df, create a boxplot of 'initial_cost' across the different values of 'Borough'. 76 | Display the plot. 77 | 78 | Solution:- 79 | # Import necessary modules 80 | import pandas as pd 81 | import matplotlib.pyplot as plt 82 | 83 | # Create the boxplot 84 | df.boxplot(column='initial_cost', by='Borough', rot=90) 85 | 86 | # Display the plot 87 | plt.show() 88 | 89 | Q6:- 90 | Using df, create a scatter plot (kind='scatter') with 'initial_cost' on the x-axis and the 'total_est_fee' on the y-axis. 91 | Rotate the x-axis labels by 70 degrees. 92 | Create another scatter plot exactly as above, substituting df_subset in place of df. 93 | 94 | Solution:- 95 | # Import necessary modules 96 | import pandas as pd 97 | import matplotlib.pyplot as plt 98 | 99 | # Create and display the first scatter plot 100 | df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70) 101 | plt.show() 102 | 103 | # Create and display the second scatter plot 104 | df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70) 105 | plt.show() 106 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Tidying data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Print the head of airquality. 3 | Use pd.melt() to melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows. 4 | Do this by using id_vars to specify the columns you do not wish to melt: 'Month' and 'Day'. 5 | Print the head of airquality_melt. 6 | 7 | Solution:- 8 | # Print the head of airquality 9 | print(airquality.head()) 10 | 11 | # Melt airquality: airquality_melt 12 | airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day']) 13 | 14 | # Print the head of airquality_melt 15 | print(airquality_melt.head()) 16 | 17 | Q2:- 18 | Print the head of airquality. 19 | Melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows, with the default variable column renamed to 'measurement' and the default value column renamed to 'reading'. You can do this by specifying, respectively, the var_name and value_name parameters. 20 | Print the head of airquality_melt. 21 | 22 | Solution:- 23 | # Print the head of airquality 24 | print(airquality.head()) 25 | 26 | # Melt airquality: airquality_melt 27 | airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading') 28 | 29 | # Print the head of airquality_melt 30 | print(airquality_melt.head()) 31 | 32 | Q3:- 33 | Print the head of airquality_melt. 
34 | Pivot airquality_melt by using .pivot_table() with the rows indexed by 'Month' and 'Day', the columns indexed by 'measurement', and the values populated with 'reading'. 35 | Print the head of airquality_pivot. 36 | 37 | Solution:- 38 | # Print the head of airquality_melt 39 | print(airquality_melt.head()) 40 | 41 | # Pivot airquality_melt: airquality_pivot 42 | airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading') 43 | 44 | # Print the head of airquality_pivot 45 | print(airquality_pivot.head()) 46 | 47 | Q4:- 48 | Print the index of airquality_pivot by accessing its .index attribute. This has been done for you. 49 | Reset the index of airquality_pivot using its .reset_index() method. 50 | Print the new index of airquality_pivot. 51 | Print the head of airquality_pivot. 52 | 53 | Solution:- 54 | # Print the index of airquality_pivot 55 | print(airquality_pivot.index) 56 | 57 | # Reset the index of airquality_pivot: airquality_pivot 58 | airquality_pivot = airquality_pivot.reset_index() 59 | 60 | # Print the new index of airquality_pivot 61 | print(airquality_pivot.index) 62 | 63 | # Print the head of airquality_pivot 64 | print(airquality_pivot.head()) 65 | 66 | Q5:- 67 | Pivot airquality_dup by using .pivot_table() with the rows indexed by 'Month' and 'Day', the columns indexed by 'measurement', and the values populated with 'reading'. Use np.mean for the aggregation function. 68 | Flatten airquality_pivot by resetting its index. 69 | Print the head of airquality_pivot and then the original airquality DataFrame to compare their structure. 70 | 71 | Solution:- 72 | # Pivot airquality_dup: airquality_pivot 73 | airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean) 74 | 75 | # Reset the index of airquality_pivot 76 | airquality_pivot = airquality_pivot.reset_index() 77 | 78 | # Print the head of airquality_pivot 79 | print(airquality_pivot.head()) 80 | 81 | # Print the head of airquality 82 | print(airquality.head()) 83 | 84 | Q6:- 85 | Melt tb keeping 'country' and 'year' fixed. 86 | Create a 'gender' column by slicing the first letter of the variable column of tb_melt. 87 | Create an 'age_group' column by slicing the rest of the variable column of tb_melt. 88 | Print the head of tb_melt 89 | 90 | Solution:- 91 | # Melt tb: tb_melt 92 | tb_melt = pd.melt(tb, id_vars=['country', 'year']) 93 | 94 | # Create the 'gender' column 95 | tb_melt['gender'] = tb_melt.variable.str[0] 96 | 97 | # Create the 'age_group' column 98 | tb_melt['age_group'] = tb_melt.variable.str[1:] 99 | 100 | # Print the head of tb_melt 101 | print(tb_melt.head()) 102 | 103 | Q7:- 104 | Create a column called 'str_split' by splitting the 'type_country' column of ebola_melt on '_'. Note that you will first have to access the str attribute of type_country before you can use .split(). 105 | Create a column called 'type' by using the .get() method to retrieve index 0 of the 'str_split' column of ebola_melt. 106 | Create a column called 'country' by using the .get() method to retrieve index 1 of the 'str_split' column of ebola_melt. 107 | Print the head of ebola. This has been done for you, so hit 'Submit Answer' to view the results! 
108 | 109 | Solution:- 110 | # Melt ebola: ebola_melt 111 | ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts') 112 | 113 | # Create the 'str_split' column 114 | ebola_melt['str_split'] = ebola_melt.type_country.str.split('_') 115 | 116 | # Create the 'type' column 117 | ebola_melt['type'] = ebola_melt.str_split.str.get(0) 118 | 119 | # Create the 'country' column 120 | ebola_melt['country'] = ebola_melt.str_split.str.get(1) 121 | 122 | # Print the head of ebola_melt 123 | print(ebola_melt.head()) 124 | 125 | -------------------------------------------------------------------------------- /Python/Conda Essentials/Installing Packages: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Importing Data in Python -Part 1/Introduction and flat files: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Open the file moby_dick.txt as read-only and store it in the variable file. Make sure to pass the filename enclosed in quotation marks ''. 3 | Print the contents of the file to the shell using the print() function. As Hugo showed in the video, you'll need to apply the method read() to the object file. 4 | Check whether the file is closed by executing print(file.closed). 5 | Close the file using the close() method. 6 | Check again that the file is closed as you did above. 7 | 8 | Solution:- 9 | # Open a file: file 10 | file = open("moby_dick.txt","r") 11 | 12 | # Print it 13 | print(file.read()) 14 | 15 | # Check whether file is closed 16 | print(file.closed) 17 | 18 | # Close file 19 | file.close() 20 | 21 | # Check whether file is closed 22 | print(file.closed) 23 | 24 | Q2:- 25 | Open moby_dick.txt using the with context manager and the variable file. 26 | Print the first three lines of the file to the shell by using readline() three times within the context manager. 27 | 28 | Solution:- 29 | # Read & print the first 3 lines 30 | with open('moby_dick.txt') as file: 31 | print(file.readline()) 32 | print(file.readline()) 33 | print(file.readline()) 34 | 35 | Q3:- 36 | Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the delimiter. 37 | Fill in the argument of print() to print the type of the object digits. Use the function type(). 38 | Execute the rest of the code to visualize one of the rows of the data. 39 | 40 | Solution:- 41 | # Import package 42 | import numpy as np 43 | 44 | # Assign filename to variable: file 45 | file = 'digits.csv' 46 | 47 | # Load file as array: digits 48 | digits = np.loadtxt(file, delimiter=',') 49 | 50 | # Print datatype of digits 51 | print(type(digits)) 52 | 53 | # Select and reshape a row 54 | im = digits[21, 1:] 55 | im_sq = np.reshape(im, (28, 28)) 56 | 57 | # Plot reshaped data (matplotlib.pyplot already loaded as plt) 58 | plt.imshow(im_sq, cmap='Greys', interpolation='nearest') 59 | plt.show() 60 | 61 | Q4:- 62 | Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns. 63 | Complete the argument of the print() call in order to print the entire array that you just imported. 
64 | 65 | Solution:- 66 | # Import numpy 67 | import numpy as np 68 | 69 | # Assign the filename: file 70 | file = 'digits_header.txt' 71 | 72 | # Load the data: data 73 | data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2]) 74 | 75 | # Print data 76 | print(data) 77 | 78 | Q5:- 79 | Complete the first call to np.loadtxt() by passing file as the first argument. 80 | Execute print(data[0]) to print the first element of data. 81 | Complete the second call to np.loadtxt(). The file you're importing is tab-delimited, the datatype is float, and you want to skip the first row. 82 | Print the 10th element of data_float by completing the print() command. Be guided by the previous print() call. 83 | Execute the rest of the code to visualize the data. 84 | 85 | Solution:- 86 | # Assign filename: file 87 | file = 'seaslug.txt' 88 | 89 | # Import file: data 90 | data = np.loadtxt(file, delimiter='\t', dtype=str) 91 | 92 | # Print the first element of data 93 | print(data[0]) 94 | 95 | # Import data as floats and skip the first row: data_float 96 | data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 97 | 98 | # Print the 10th element of data_float 99 | print(data_float[9]) 100 | 101 | # Plot a scatterplot of the data 102 | plt.scatter(data_float[:, 0], data_float[:, 1]) 103 | plt.xlabel('time (min.)') 104 | plt.ylabel('percentage of larvae') 105 | plt.show() 106 | 107 | Q6:- 108 | Import titanic.csv using the function np.recfromcsv() and assign it to the variable, d. You'll only need to pass file to it because it has the defaults delimiter=',' and names=True in addition to dtype=None! 109 | Run the remaining code to print the first three entries of the resulting array d. 110 | 111 | Solution:- 112 | # Assign the filename: file 113 | file = 'titanic.csv' 114 | 115 | # Import file using np.recfromcsv: d 116 | d = np.recfromcsv(file,delimiter=',',names=True,dtype=None) 117 | 118 | # Print out first three entries of d 119 | print(d[:3]) 120 | 121 | Q7:- 122 | Import the pandas package using the alias pd. 123 | Read titanic.csv into a DataFrame called df. The file name is already stored in the file object. 124 | In a print() call, view the head of the DataFrame. 125 | 126 | Solution:- 127 | # Import pandas as pd 128 | import pandas as pd 129 | 130 | # Assign the filename: file 131 | file = 'titanic.csv' 132 | 133 | # Read the file into a DataFrame: df 134 | df = pd.read_csv(file) 135 | 136 | # View the head of the DataFrame 137 | print(df.head()) 138 | 139 | Q8:- 140 | Import the first 5 rows of the file into a DataFrame using the function pd.read_csv() and assign the result to data. You'll need to use the arguments nrows and header (there is no header in this file). 141 | Build a numpy array from the resulting DataFrame in data and assign to data_array. 142 | Execute print(type(data_array)) to print the datatype of data_array. 143 | 144 | Solution:- 145 | # Assign the filename: file 146 | file = 'digits.csv' 147 | 148 | # Read the first 5 rows of the file into a DataFrame: data 149 | data = pd.read_csv(file,nrows=5,header=None) 150 | 151 | # Build a numpy array from the DataFrame: data_array 152 | data_array = data.values 153 | 154 | # Print the datatype of data_array to the shell 155 | print(type(data_array)) 156 | 157 | Q9:- 158 | Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(). 159 | comment takes characters that comments occur after in the file, which in this case is '#'. 
na_values takes a list of strings to recognize as NA/NaN,
160 | in this case the string 'Nothing'.
161 | Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
162 |
163 | Solution:-
164 | # Import matplotlib.pyplot as plt
165 | import matplotlib.pyplot as plt
166 |
167 | # Assign filename: file
168 | file = 'titanic_corrupt.txt'
169 |
170 | # Import file: data
171 | data = pd.read_csv(file, sep='\t', comment="#", na_values=["Nothing"])
172 |
173 | # Print the head of the DataFrame
174 | print(data.head())
175 |
176 | # Plot 'Age' variable in a histogram
177 | pd.DataFrame.hist(data[['Age']])
178 | plt.xlabel('Age (years)')
179 | plt.ylabel('count')
180 | plt.show()
181 |
--------------------------------------------------------------------------------
/Python/Importing Data in Python -Part 2/Diving deep into the Twitter API:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Import the package tweepy.
3 | Pass the parameters consumer_key and consumer_secret to the function tweepy.OAuthHandler().
4 | Complete the passing of OAuth credentials to the OAuth handler auth by applying to it the method set_access_token(),
5 | along with arguments access_token and access_token_secret.
6 |
7 | Solution:-
8 | # Import packages
9 | import tweepy, json
10 |
11 | # Store OAuth authentication credentials in relevant variables
12 | access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
13 | access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
14 | consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
15 | consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
16 |
17 | # Pass OAuth details to tweepy's OAuth handler
18 | auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
19 | auth.set_access_token(access_token, access_token_secret)
20 |
21 | Q2:-
22 | Create your Stream object with authentication by passing tweepy.Stream() the authentication handler auth and the Stream listener l;
23 | To filter Twitter streams,
24 | pass to the track argument in stream.filter() a list containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.
25 |
26 | Solution:-
27 | # Initialize Stream listener
28 | l = MyStreamListener()
29 |
30 | # Create your Stream object with authentication
31 | stream = tweepy.Stream(auth, l)
32 |
33 |
34 | # Filter Twitter Streams to capture data by the keywords:
35 | s = ['clinton', 'trump', 'sanders', 'cruz']
36 | stream.filter(track=s)
37 |
38 | Q3:-
39 | Assign the filename 'tweets.txt' to the variable tweets_data_path.
40 | Initialize tweets_data as an empty list to store the tweets in.
41 | Within the for loop initiated by for line in tweets_file:, load each tweet into a variable, tweet, using json.loads(),
42 | then append tweet to tweets_data using the append() method.
43 | Hit 'Submit Answer' and check out the keys of the first tweet dictionary printed to the shell.
44 | 45 | Solution:- 46 | # Import package 47 | import json 48 | 49 | # String of path to file: tweets_data_path 50 | tweets_data_path = 'tweets.txt' 51 | 52 | # Initialize empty list to store tweets: tweets_data 53 | tweets_data = [] 54 | 55 | # Open connection to file 56 | tweets_file = open(tweets_data_path, "r") 57 | 58 | # Read in tweets and store in list: tweets_data 59 | for line in tweets_file: 60 | tweet = json.loads(line) 61 | tweets_data.append(tweet) 62 | 63 | # Close connection to file 64 | tweets_file.close() 65 | 66 | # Print the keys of the first tweet dict 67 | print(tweets_data[0].keys()) 68 | 69 | Q4:- 70 | Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so, the first argument should be tweets_data, a list of dictionaries. 71 | The second argument to pd.DataFrame() is a list of the keys you wish to have as columns. Assign the result of the pd.DataFrame() call to df. 72 | 73 | Solution:- 74 | # Import package 75 | import pandas as pd 76 | 77 | # Build DataFrame of tweet texts and languages 78 | df = pd.DataFrame(tweets_data, columns=['text','lang']) 79 | 80 | # Print head of DataFrame 81 | print(df.head()) 82 | 83 | Q5:- 84 | Within the for loop for index, row in df.iterrows():, the code currently increases the value of clinton by 1 each time a tweet mentioning 'Clinton' is encountered; 85 | complete the code so that the same happens for trump, sanders and cruz. 86 | 87 | Solution:- 88 | # Initialize list to store tweet counts 89 | [clinton, trump, sanders, cruz] = [0, 0, 0, 0] 90 | 91 | # Iterate through df, counting the number of tweets in which 92 | # each candidate is mentioned 93 | for index, row in df.iterrows(): 94 | clinton += word_in_text('clinton', row['text']) 95 | trump += word_in_text('trump', row['text']) 96 | sanders += word_in_text('sanders', row['text']) 97 | cruz += word_in_text('cruz', row['text']) 98 | 99 | Q6:- 100 | Import both matplotlib.pyplot and seaborn using the aliases plt and sns, respectively. 101 | Complete the arguments of sns.barplot: the first argument should be the labels to appear on the x-axis; 102 | the second argument should be the list of the variables you wish to plot, as produced in the previous exercise. 103 | 104 | solution:- 105 | # Import packages 106 | import seaborn as sns 107 | import matplotlib.pyplot as plt 108 | 109 | # Set seaborn style 110 | sns.set(color_codes=True) 111 | 112 | # Create a list of labels:cd 113 | cd = ['clinton', 'trump', 'sanders', 'cruz'] 114 | 115 | # Plot histogram 116 | ax = sns.barplot(cd, [clinton, trump, sanders, cruz]) 117 | ax.set(ylabel="count") 118 | plt.show() 119 | 120 | Print the head of the DataFrame. 121 | -------------------------------------------------------------------------------- /Python/Importing Data in Python -Part 2/Interacting with APIs to import data from the web: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Load the JSON 'a_movie.json' into the variable json_data within the context provided by the with statement. 3 | To do so, use the function json.load() within the context manager. 4 | Use a for loop to print all key-value pairs in the dictionary json_data. 5 | Recall that you can access a value in a dictionary using the syntax: dictionary[key]. 
6 | 7 | Solution:- 8 | # Load JSON: json_data 9 | with open("a_movie.json") as json_file: 10 | json_data = json.load(json_file) 11 | 12 | # Print each key-value pair in json_data 13 | for k in json_data.keys(): 14 | print(k + ': ', json_data[k]) 15 | 16 | Q2:- 17 | Import the requests package. 18 | Assign to the variable url the URL of interest in order to query 'http://www.omdbapi.com' for the data corresponding to the movie The Social Network. The query string should have two arguments: apikey=ff21610b and t=social+network. You can combine them as follows: apikey=ff21610b&t=social+network. 19 | Print the text of the reponse object r by using its text attribute and passing the result to the print() function. 20 | 21 | Solution:- 22 | # Import requests package 23 | import requests 24 | 25 | # Assign URL to variable: url 26 | url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network' 27 | 28 | # Package the request, send the request and catch the response: r 29 | r = requests.get(url) 30 | 31 | # Print the text of the response 32 | print(r.text) 33 | 34 | Q3:- 35 | Pass the variable url to the requests.get() function in order to send the relevant request and catch the response, assigning the resultant response message to the variable r. 36 | Apply the json() method to the response object r and store the resulting dictionary in the variable json_data. 37 | Hit Submit Answer to print the key-value pairs of the dictionary json_data to the shell 38 | 39 | Solution:- 40 | # Import package 41 | import requests 42 | 43 | # Assign URL to variable: url 44 | url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network' 45 | 46 | # Package the request, send the request and catch the response: r 47 | r = requests.get(url) 48 | 49 | # Decode the JSON data into a dictionary: json_data 50 | json_data = r.json() 51 | 52 | # Print each key-value pair in json_data 53 | for k in json_data.keys(): 54 | print(k + ': ', json_data[k]) 55 | 56 | Q4:- 57 | Assign the relevant URL to the variable url. 58 | Apply the json() method to the response object r and store the resulting dictionary in the variable json_data. 59 | The variable pizza_extract holds the HTML of an extract from Wikipedia's Pizza page as a string; use the function print() to print this string to the shell. 60 | 61 | Solution:- 62 | # Import package 63 | import requests 64 | 65 | # Assign URL to variable: url 66 | url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza' 67 | 68 | # Package the request, send the request and catch the response: r 69 | r = requests.get(url) 70 | 71 | # Decode the JSON data into a dictionary: json_data 72 | json_data = r.json() 73 | 74 | # Print the Wikipedia page extract 75 | pizza_extract = json_data['query']['pages']['24768']['extract'] 76 | print(pizza_extract) 77 | -------------------------------------------------------------------------------- /Python/Intermediate-Python-for-Data-Science/Logic-ControlFlow-Filtering: -------------------------------------------------------------------------------- 1 | Q1:- 2 | In the editor on the right, write code to see if True equals False. 3 | Write Python code to check if -5 * 15 is not equal to 75. 4 | Ask Python whether the strings "pyscript" and "PyScript" are equal. 5 | What happens if you compare booleans and integers? Write code to see if True and 1 are equal. 
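A brief aside on the last question, since the behaviour can be surprising: in Python, bool is a subclass of int, so True compares equal to 1 and False to 0. This is standard language behaviour, not something specific to the exercise; a quick check:

# bool is a subclass of int, so True behaves like 1 and False like 0
print(isinstance(True, int))   # True
print(True == 1)               # True
print(False == 0)              # True
print(True + True)             # 2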
6 | 7 | Solution:- 8 | # Comparison of booleans 9 | print(True==False) 10 | 11 | # Comparison of integers 12 | print(-5 * 15 != 75) 13 | 14 | # Comparison of strings 15 | print("pyscript" == "PyScript") 16 | 17 | # Compare a boolean with an integer 18 | print(True == 1) 19 | 20 | Q2:- 21 | Write Python expressions, wrapped in a print() function, to check whether: 22 | x is greater than or equal to -10. x has already been defined for you. 23 | "test" is less than or equal to y. y has already been defined for you. 24 | True is greater than False. 25 | 26 | Solution:- 27 | # Comparison of integers 28 | x = -3 * 6 29 | print(x >= -10) 30 | 31 | # Comparison of strings 32 | y = "test" 33 | print("test" <= y) 34 | 35 | # Comparison of booleans 36 | print(True > False) 37 | 38 | Q3:- 39 | Using comparison operators, generate boolean arrays that answer the following questions: 40 | Which areas in my_house are greater than or equal to 18? 41 | You can also compare two Numpy arrays element-wise. Which areas in my_house are smaller than the ones in your_house? 42 | Make sure to wrap both commands in a print() statement, so that you can inspect the output. 43 | 44 | Solution:- 45 | # Create arrays 46 | import numpy as np 47 | my_house = np.array([18.0, 20.0, 10.75, 9.50]) 48 | your_house = np.array([14.0, 24.0, 14.25, 9.0]) 49 | 50 | # my_house greater than or equal to 18 51 | print(my_house >= 18) 52 | 53 | # my_house less than your_house 54 | print(my_house < your_house) 55 | 56 | Q4:- 57 | Write Python expressions, wrapped in a print() function, to check whether: 58 | my_kitchen is bigger than 10 and smaller than 18. 59 | my_kitchen is smaller than 14 or bigger than 17. 60 | double the area of my_kitchen is smaller than triple the area of your_kitchen 61 | 62 | Solution:- 63 | # Define variables 64 | my_kitchen = 18.0 65 | your_kitchen = 14.0 66 | 67 | # my_kitchen bigger than 10 and smaller than 18? 68 | print(my_kitchen > 10 and my_kitchen < 18) 69 | 70 | # my_kitchen smaller than 14 or bigger than 17? 71 | print(my_kitchen < 14 or my_kitchen > 17) 72 | 73 | # Double my_kitchen smaller than triple your_kitchen? 74 | print(my_kitchen*2 < your_kitchen*3) 75 | 76 | Q5:- 77 | Generate boolean arrays that answer the following questions: 78 | Which areas in my_house are greater than 18.5 or smaller than 10? 79 | Which areas are smaller than 11 in both my_house and your_house? Make sure to wrap both commands in print() statement, so that you can inspect the output. 80 | 81 | Solution:- 82 | # Create arrays 83 | import numpy as np 84 | my_house = np.array([18.0, 20.0, 10.75, 9.50]) 85 | your_house = np.array([14.0, 24.0, 14.25, 9.0]) 86 | 87 | # my_house greater than 18.5 or smaller than 10 88 | print(np.logical_or(my_house > 18.5,my_house < 10)) 89 | 90 | # Both my_house and your_house smaller than 11 91 | print(np.logical_and(my_house < 11, your_house < 11)) 92 | 93 | Q6;-Examine the if statement that prints out "Looking around in the kitchen." if room equals "kit". 94 | Write another if statement that prints out "big place!" if area is greater than 15. 95 | 96 | Solution:- 97 | # Define variables 98 | room = "kit" 99 | area = 14.0 100 | 101 | # if statement for room 102 | if room == "kit" : 103 | print("looking around in the kitchen.") 104 | 105 | # if statement for area 106 | if area>15: 107 | print("big place!") 108 | 109 | Q7:- 110 | Add an else statement to the second control structure so that "pretty small." is printed out if area > 15 evaluates to False. 
111 | 112 | Solution:- 113 | # Define variables 114 | room = "kit" 115 | area = 14.0 116 | 117 | # if-else construct for room 118 | if room == "kit" : 119 | print("looking around in the kitchen.") 120 | else : 121 | print("looking around elsewhere.") 122 | 123 | # if-else construct for area 124 | if area > 15 : 125 | print("big place!") 126 | else: 127 | print("pretty small.") 128 | 129 | Q8:- 130 | Add an elif to the second control structure such that "medium size, nice!" is printed out if area is greater than 10. 131 | 132 | Solution:- 133 | # Define variables 134 | room = "bed" 135 | area = 14.0 136 | 137 | # if-elif-else construct for room 138 | if room == "kit" : 139 | print("looking around in the kitchen.") 140 | elif room == "bed": 141 | print("looking around in the bedroom.") 142 | else : 143 | print("looking around elsewhere.") 144 | 145 | # if-elif-else construct for area 146 | if area > 15 : 147 | print("big place!") 148 | elif area > 10: 149 | print("medium size, nice!") 150 | else : 151 | print("pretty small.") 152 | 153 | Q9:- 154 | Extract the drives_right column as a Pandas Series and store it as dr. 155 | Use dr, a boolean Series, to subset the cars DataFrame. Store the resulting selection in sel. 156 | Print sel, and assert that drives_right is True for all observations. 157 | 158 | Solution:- 159 | # Import cars data 160 | import pandas as pd 161 | cars = pd.read_csv('cars.csv', index_col = 0) 162 | 163 | # Extract drives_right column as Series: dr 164 | dr = cars["drives_right"] 165 | 166 | # Use dr to subset cars: sel 167 | sel = cars[dr] 168 | # Print sel 169 | print(sel) 170 | 171 | Q10:- 172 | Convert the code on the right to a one-liner that calculates the variable sel as before. 173 | 174 | Solution:- 175 | # Import cars data 176 | import pandas as pd 177 | cars = pd.read_csv('cars.csv', index_col = 0) 178 | 179 | # Convert code to a one-liner 180 | 181 | sel = cars[cars['drives_right']] 182 | 183 | # Print sel 184 | print(sel) 185 | 186 | Q11:- 187 | Select the cars_per_cap column from cars as a Pandas Series and store it as cpc. 188 | Use cpc in combination with a comparison operator and 500. You want to end up with a boolean Series that's True if the corresponding country has a cars_per_cap of more than 500 and False otherwise. Store this boolean Series as many_cars. 189 | Use many_cars to subset cars, similar to what you did before. Store the result as car_maniac. 190 | Print out car_maniac to see if you got it right. 191 | 192 | Solution:- 193 | # Import cars data 194 | import pandas as pd 195 | cars = pd.read_csv('cars.csv', index_col = 0) 196 | 197 | # Create car_maniac: observations that have a cars_per_cap over 500 198 | cpc = cars["cars_per_cap"] 199 | many_cars = cpc > 500 200 | car_maniac = cars[many_cars] 201 | 202 | # Print car_maniac 203 | print(car_maniac) 204 | 205 | Q12:- 206 | Use the code sample above to create a DataFrame medium, that includes all the observations of cars that have a cars_per_cap between 100 and 500. 207 | Print out medium. 
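Before the course solution (which applies np.logical_and to the cars.csv data), here is a self-contained sketch of the same range filter on a small made-up DataFrame, with the equivalent pandas boolean-operator form shown alongside. The values and index labels are invented for illustration.

import numpy as np
import pandas as pd

# Made-up stand-in for the cars data
cars = pd.DataFrame({'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
                    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# Range filter with np.logical_and, as in the exercise
medium = cars[np.logical_and(cars['cars_per_cap'] > 100, cars['cars_per_cap'] < 500)]

# Equivalent filter using pandas' & operator (note the parentheses around each comparison)
medium_alt = cars[(cars['cars_per_cap'] > 100) & (cars['cars_per_cap'] < 500)]

print(medium)
print(medium.equals(medium_alt))   # True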
208 | 209 | Solution:- 210 | # Import cars data 211 | import pandas as pd 212 | cars = pd.read_csv('cars.csv', index_col = 0) 213 | 214 | # Import numpy, you'll need this 215 | import numpy as np 216 | 217 | # Create medium: observations with cars_per_cap between 100 and 500 218 | medium = cars[np.logical_and(cars["cars_per_cap"] >100, cars["cars_per_cap"] < 500)] 219 | 220 | 221 | 222 | # Print medium 223 | print(medium) 224 | 225 | 226 | -------------------------------------------------------------------------------- /Python/Intermediate-Python-for-Data-Science/loops: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create the variable offset with an initial value of 8. 3 | Code a while loop that keeps running as long as offset is not equal to 0. Inside the while loop: 4 | Print out the sentence "correcting...". 5 | Next, decrease the value of offset by 1. You can do this with offset = offset - 1. 6 | Finally, print out offset so you can see how it changes. 7 | 8 | Solution:- 9 | # Initialize offset 10 | offset = 8 11 | 12 | # Code the while loop 13 | while offset != 0: 14 | print("correcting...") 15 | offset = offset - 1 16 | print(offset) 17 | 18 | Q2:- 19 | Inside the while loop, replace offset = offset - 1 by an if-else statement: 20 | If offset > 0, you should decrease offset by 1. 21 | Else, you should increase offset by 1. 22 | If you've coded things correctly, hitting Submit Answer should work this time. 23 | 24 | Solution:- 25 | # Initialize offset 26 | offset = -6 27 | 28 | # Code the while loop 29 | while offset != 0 : 30 | print("correcting...") 31 | #offset = offset - 1 32 | if offset > 0: 33 | offset = offset - 1 34 | else: 35 | offset = offset + 1 36 | print(offset) 37 | 38 | Q3:- 39 | Write a for loop that iterates over all elements of the areas list and prints out every element separately. 40 | 41 | Solution:- 42 | # areas list 43 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 44 | 45 | # Code the for loop 46 | for i in areas: 47 | print(i) 48 | 49 | Q4:- 50 | Adapt the for loop in the sample code to use enumerate(). On each run, a line of the form "room x: y" should be printed, where x is the index of the list element and y is the actual list element, i.e. the area. 51 | Make sure to print out this exact string, with the correct spacing. 52 | 53 | Solution:- 54 | # areas list 55 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 56 | 57 | # Change for loop to use enumerate() 58 | for i,a in enumerate(areas) : 59 | print("room " + str(i) + ": " + str(a)) 60 | 61 | Q5:- 62 | Adapt the print() function in the for loop on the right so that the first printout becomes "room 1: 11.25", the second one "room 2: 18.0" and so on. 63 | 64 | Solution:- 65 | # areas list 66 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 67 | 68 | # Code the for loop 69 | for index, area in enumerate(areas) : 70 | print("room " + str(index+1) + ": " + str(area)) 71 | 72 | Q6:- 73 | Write a for loop that goes through each sublist of house and prints out the x is y sqm, where x is the name of the room and y is the area of the room. 74 | 75 | Solution:- 76 | # house list of lists 77 | house = [["hallway", 11.25], 78 | ["kitchen", 18.0], 79 | ["living room", 20.0], 80 | ["bedroom", 10.75], 81 | ["bathroom", 9.50]] 82 | 83 | # Build a for loop from scratch 84 | for i in house: 85 | print("the " + i[0] + " is " + str(i[1]) + " sqm") 86 | 87 | Q7:- 88 | Write a for loop that goes through each key:value pair of europe. 
On each iteration, "the capital of x is y" should be printed out, where x is the key and y is the value of the pair. 89 | Solution:- 90 | # Definition of dictionary 91 | europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn', 92 | 'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'australia':'vienna' } 93 | 94 | # Iterate over europe 95 | for key,value in europe.items(): 96 | print("the capital of " + key + " is " + str(value)) 97 | 98 | Q8:- 99 | Import the numpy package under the local alias np. 100 | Write a for loop that iterates over all elements in np_height and prints out "x inches" for each element, where x is the value in the array. 101 | Write a for loop that visits every element of the np_baseball array and prints it out. 102 | 103 | Solution:- 104 | # Import numpy as np 105 | import numpy as np 106 | 107 | # For loop over np_height 108 | for i in np_height: 109 | print(str(i) + " inches") 110 | 111 | # For loop over np_baseball 112 | for j in np.nditer(np_baseball): 113 | print(j) 114 | 115 | Q9:- 116 | Write a for loop that iterates over the rows of cars and on each iteration perform two print() calls: one to print out the row label and one to print out all of the rows contents. 117 | Solution:- 118 | # Import cars data 119 | import pandas as pd 120 | cars = pd.read_csv('cars.csv', index_col = 0) 121 | 122 | # Iterate over rows of cars 123 | for lab,i in cars.iterrows(): 124 | print(lab) 125 | print(i) 126 | 127 | Q10:- 128 | Adapt the code in the for loop such that the first iteration prints out "US: 809", the second iteration "AUS: 731", and so on. 129 | The output should be in the form "country: cars_per_cap". 130 | Make sure to print out this exact string, with the correct spacing. 131 | 132 | Solution:- 133 | # Import cars data 134 | import pandas as pd 135 | cars = pd.read_csv('cars.csv', index_col = 0) 136 | 137 | # Adapt for loop 138 | for lab, row in cars.iterrows() : 139 | print(lab + ": " + str(row['cars_per_cap'])) 140 | 141 | Q11:- 142 | Use a for loop to add a new column, named COUNTRY, that contains a uppercase version of the country names in the "country" column. You can use the string method upper() for this. 143 | To see if your code worked, print out cars. Don't indent this code, so that it's not part of the for loop. 144 | 145 | Solution:- 146 | # Import cars data 147 | import pandas as pd 148 | cars = pd.read_csv('cars.csv', index_col = 0) 149 | 150 | # Code for loop that adds COUNTRY column 151 | for lab,row in cars.iterrows(): 152 | cars.loc[lab,"COUNTRY"] = row["country"].upper() 153 | 154 | 155 | # Print cars 156 | print(cars) 157 | 158 | Q12:- 159 | Replace the for loop with a one-liner that uses .apply(str.upper). The call should give the same result: a column COUNTRY should be added to cars, containing an uppercase version of the country names. 160 | As usual, print out cars to see the fruits of your hard labor 161 | 162 | Solution:- 163 | # Import cars data 164 | import pandas as pd 165 | cars = pd.read_csv('cars.csv', index_col = 0) 166 | 167 | # Use .apply(str.upper) 168 | #for lab, row in cars.iterrows() : 169 | cars["COUNTRY"] = cars["country"].apply(str.upper) 170 | print(cars) 171 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Aggregate Functions: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Use the SUM function to get the total duration of all films. 
3 | -SELECT SUM(DURATION) 4 | FROM FILMS; 5 | 6 | Get the average duration of all films. 7 | -SELECT AVG(DURATION) 8 | FROM FILMS; 9 | 10 | Get the duration of the shortest film. 11 | -SELECT MIN(DURATION) 12 | FROM FILMS; 13 | 14 | Get the duration of the longest film. 15 | -SELECT MAX(DURATION) 16 | FROM FILMS; 17 | 18 | Q2:- 19 | Use the SUM function to get the total amount grossed by all films. 20 | -SELECT SUM(gross) 21 | FROM FILMS; 22 | 23 | Get the average amount grossed by all films. 24 | -SELECT AVG(gross) 25 | FROM FILMS; 26 | 27 | Get the amount grossed by the worst performing film. 28 | -SELECT MIN(gross) 29 | FROM FILMS; 30 | 31 | Get the amount grossed by the best performing film. 32 | -SELECT MAX(gross) 33 | FROM FILMS; 34 | 35 | Q3:- 36 | Use the SUM function to get the total amount grossed by all films made in the year 2000 or later. 37 | -SELECT SUM(gross) 38 | FROM FILMS 39 | WHERE release_year >= 2000; 40 | 41 | Get the average amount grossed by all films whose titles start with the letter 'A'. 42 | -SELECT AVG(gross) 43 | FROM FILMS 44 | WHERE title LIKE 'A%'; 45 | 46 | Get the amount grossed by the worst performing film in 1994. 47 | -SELECT MIN(gross) 48 | FROM FILMS 49 | WHERE release_year = 1994; 50 | 51 | Get the amount grossed by the best performing film between 2000 and 2012, inclusive. 52 | -SELECT MAX(gross) 53 | FROM films 54 | WHERE release_year BETWEEN 2000 AND 2012; 55 | 56 | Q4:- 57 | Get the title and net profit (the amount a film grossed, minus its budget) for all films. Alias the net profit as net_profit. 58 | -SELECT title,gross-budget AS net_profit 59 | FROM films; 60 | 61 | Get the title and duration in hours for all films. The duration is in minutes, so you'll need to divide by 60.0 to get the duration in hours. Alias the duration in hours as duration_hours. 62 | -SELECT title,duration/60.0 AS duration_hours 63 | FROM films; 64 | 65 | Get the average duration in hours for all films, aliased as avg_duration_hours. 66 | -SELECT AVG(duration/60.0) AS avg_duration_hours 67 | FROM films; 68 | 69 | Q5:- 70 | Get the percentage of people who are no longer alive. Alias the result as percentage_dead. Remember to use 100.0 and not 100! 71 | --- get the count(deathdate) and multiply by 100.0 72 | -- then divide by count(*) 73 | SELECT COUNT(deathdate)*100.0/COUNT(*) AS percentage_dead 74 | FROM people; 75 | 76 | Get the number of years between the newest film and oldest film. Alias the result as difference. 77 | -SELECT MAX(release_year) - MIN(release_year) AS difference 78 | FROM films; 79 | 80 | Get the number of decades the films table covers. Alias the result as number_of_decades. The top half of your fraction should be enclosed in parentheses. 81 | -SELECT (MAX(release_year) - MIN(release_year))/10.0 AS number_of_decades 82 | FROM films; 83 | 84 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Filtering rows: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get all details for all films released in 2016. 3 | -SELECT * 4 | FROM films 5 | WHERE release_year = 2016; 6 | 7 | Get the number of films released before 2000. 8 | -SELECT COUNT(*) 9 | FROM films 10 | WHERE release_year < 2000; 11 | 12 | Get the title and release year of films released after 2000. 13 | -SELECT title,release_year 14 | FROM films 15 | WHERE release_year > 2000; 16 | 17 | Q2:- 18 | Get all details for all French language films. 
19 | -SELECT * 20 | FROM films 21 | WHERE language='French'; 22 | 23 | Get the name and birth date of the person born on November 11th, 1974. Remember to use ISO date format ('1974-11-11')! 24 | -SELECT name,birthdate 25 | FROM people 26 | WHERE birthdate='1974-11-11'; 27 | 28 | Get the number of Hindi language films. 29 | -SELECT COUNT(*) 30 | FROM films 31 | WHERE language='Hindi'; 32 | 33 | Get all details for all films with an R certification. 34 | -SELECT * 35 | FROM films 36 | WHERE certification='R'; 37 | 38 | Q3:- 39 | Get the title and release year for all Spanish language films released before 2000. 40 | -SELECT title,release_year 41 | FROM films 42 | WHERE language='Spanish' 43 | AND release_year < 2000; 44 | 45 | Get all details for Spanish language films released after 2000. 46 | -SELECT * 47 | FROM films 48 | WHERE language='Spanish' 49 | AND release_year > 2000; 50 | 51 | Get all details for Spanish language films released after 2000, but before 2010. 52 | -SELECT * 53 | FROM films 54 | WHERE language='Spanish' 55 | AND release_year > 2000 56 | AND release_year < 2010; 57 | 58 | Q4:- 59 | Get the title and release year for films released in the 90s. 60 | -SELECT title,release_year 61 | FROM films 62 | WHERE release_year>='1990' 63 | AND release_year<'2000'; 64 | 65 | Now, build on your query to filter the records to only include French or Spanish language films. 66 | -SELECT title,release_year 67 | FROM films 68 | WHERE (release_year>='1990'AND release_year<'2000') 69 | AND (language='Spanish' OR language='French'); 70 | 71 | Finally, restrict the query to only return films that took in more than $2M gross. 72 | -SELECT title,release_year 73 | FROM films 74 | WHERE (release_year>='1990'AND release_year<'2000') 75 | AND (language='Spanish' OR language='French') 76 | AND gross > 2000000; 77 | 78 | Q5:- 79 | Get the title and release year of all films released between 1990 and 2000 (inclusive). 80 | -SELECT title,release_year 81 | FROM films 82 | WHERE release_year BETWEEN 1990 AND 2000; 83 | 84 | Now, build on your previous query to select only films that have budgets over $100 million 85 | -SELECT title,release_year 86 | FROM films 87 | WHERE release_year BETWEEN 1990 AND 2000 88 | AND budget >100000000; 89 | 90 | Now restrict the query to only return Spanish language films. 91 | -SELECT title,release_year 92 | FROM films 93 | WHERE release_year BETWEEN 1990 AND 2000 94 | AND budget >100000000 95 | AND language='Spanish'; 96 | 97 | Finally, modify to your previous query to include all Spanish language or French language films with the same criteria as before. Don't forget your parentheses! 98 | -SELECT title,release_year 99 | FROM films 100 | WHERE release_year BETWEEN 1990 AND 2000 101 | AND budget >100000000 102 | AND (language='Spanish' OR language='French'); 103 | 104 | Q6:- 105 | Get the title and release year of all films released in 1990 or 2000 that were longer than two hours. Remember, duration is in minutes! 106 | -SELECT title,release_year 107 | FROM films 108 | WHERE release_year IN (1990,2000) 109 | AND duration >120; 110 | 111 | Get the title and language of all films which were in English, Spanish, or French. 112 | -SELECT title,language 113 | FROM films 114 | WHERE language IN ('English','Spanish','French'); 115 | 116 | Get the title and certification of all films with an NC-17 or R certification. 
117 | -SELECT title,certification 118 | FROM films 119 | WHERE certification IN ('R','NC-17'); 120 | 121 | Q7:- 122 | Get the names of people who are still alive, i.e. whose death date is missing. 123 | -SELECT name 124 | FROM people 125 | WHERE deathdate IS NULL; 126 | 127 | Get the title of every film which doesn't have a budget associated with it. 128 | -SELECT title 129 | FROM films 130 | WHERE budget IS NULL; 131 | 132 | Get the number of films which don't have a language associated with them. 133 | -SELECT COUNT(*) 134 | FROM films 135 | WHERE language IS NULL; 136 | 137 | Q8:- 138 | Get the names of all people whose names begin with 'B'. The pattern you need is 'B%'. 139 | -SELECT name 140 | FROM people 141 | WHERE name LIKE 'B%'; 142 | 143 | Get the names of people whose names have 'r' as the second letter. The pattern you need is '_r%'. 144 | -SELECT name 145 | FROM people 146 | WHERE name LIKE '_r%'; 147 | 148 | Get the names of people whose names don't start with A. The pattern you need is 'A%'. 149 | -SELECT name 150 | FROM people 151 | WHERE name NOT LIKE 'A%'; 152 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Selecting columns: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Select the title column from the films table. 3 | - SELECT title FROM films; 4 | Select the release_year column from the films table. 5 | -SELECT release_year FROM films; 6 | Select the name of each person in the people table. 7 | -SELECT name FROM people; 8 | 9 | Q2:- 10 | Get the title of every film from the films table. 11 | -SELECT title FROM films; 12 | Get the title and release year for every film. 13 | -SELECT title,release_year FROM films; 14 | Get the title, release year and country for every film. 15 | -SELECT title,release_year,country FROM films; 16 | Get all columns from the films table. 17 | -SELECT * FROM films; 18 | 19 | Q3:- 20 | Get all the unique countries represented in the films table. 21 | -SELECT DISTINCT country FROM films; 22 | Get all the different film certifications from the films table. 23 | -SELECT DISTINCT certification FROM films; 24 | Get the different types of film roles from the roles table. 25 | -SELECT DISTINCT role FROM roles; 26 | 27 | Q4:- 28 | Count the number of rows in the people table. 29 | -SELECT COUNT(*) FROM people; 30 | Count the number of (non-missing) birth dates in the people table. 31 | -SELECT COUNT(birthdate) FROM people; 32 | Count the number of unique birth dates in the people table. 33 | -SELECT COUNT(DISTINCT birthdate) FROM people; 34 | Count the number of unique languages in the films table. 35 | -SELECT COUNT(DISTINCT language) FROM films; 36 | Count the number of unique countries in the films table. 37 | -SELECT COUNT(DISTINCT country) FROM films; 38 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Sorting, grouping and joins: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get the names of people from the people table, sorted alphabetically. 3 | -SELECT name 4 | FROM people 5 | ORDER BY name; 6 | 7 | Get the names of people, sorted by birth date. 8 | -SELECT name 9 | FROM people 10 | ORDER BY birthdate; 11 | 12 | Get the birth date and name for every person, in order of when they were born. 
13 | -SELECT birthdate,name 14 | FROM people 15 | ORDER BY birthdate; 16 | 17 | Q2:- 18 | Get the title of films released in 2000 or 2012, in the order they were released. 19 | -SELECT title 20 | FROM films 21 | WHERE release_year IN (2000,2012) 22 | ORDER BY release_year; 23 | 24 | Get all details for all films except those released in 2015 and order them by duration. 25 | -SELECT * 26 | FROM films 27 | WHERE release_year NOT IN (2015) 28 | ORDER BY duration; 29 | 30 | Get the title and gross earnings for movies which begin with the letter 'M' and order the results alphabetically. 31 | -SELECT title,gross 32 | FROM films 33 | WHERE title LIKE 'M%' 34 | ORDER BY title; 35 | 36 | Q3:- 37 | Get the IMDB score and film ID for every film from the reviews table, sorted from highest to lowest score. 38 | -SELECT imdb_score,film_id 39 | FROM reviews 40 | ORDER BY imdb_score DESC; 41 | 42 | Get the title for every film, in reverse order. 43 | -SELECT title 44 | FROM films 45 | ORDER BY title DESC; 46 | 47 | Get the title and duration for every film, in order of longest duration to shortest. 48 | -SELECT title,duration 49 | FROM films 50 | ORDER BY duration DESC; 51 | 52 | Q4:- 53 | Get the birth date and name of people in the people table, in order of when they were born and alphabetically by name. 54 | -SELECT birthdate,name 55 | FROM people 56 | ORDER BY birthdate,name; 57 | 58 | Get the release year, duration, and title of films ordered by their release year and duration. 59 | -SELECT release_year,duration,title 60 | FROM films 61 | ORDER BY release_year,duration; 62 | 63 | Get certifications, release years, and titles of films ordered by certification (alphabetically) and release year. 64 | -SELECT certification,release_year,title 65 | FROM films 66 | ORDER BY certification,release_year; 67 | 68 | Get the names and birthdates of people ordered by name and birth date. 69 | -SELECT name,birthdate 70 | FROM people 71 | ORDER BY name,birthdate; 72 | 73 | Q5:- 74 | Get the release year and count of films released in each year. 75 | -SELECT release_year,COUNT(*) 76 | FROM films 77 | GROUP BY release_year; 78 | 79 | Get the release year and average duration of all films, grouped by release year. 80 | -SELECT release_year,AVG(duration) 81 | FROM films 82 | GROUP BY release_year; 83 | 84 | Get the release year and largest budget for all films, grouped by release year. 85 | -SELECT release_year,MAX(budget) 86 | FROM films 87 | GROUP BY release_year; 88 | 89 | Get the IMDB score and count of film reviews grouped by IMDB score in the reviews table. 90 | -SELECT imdb_score,COUNT(film_id) 91 | FROM reviews 92 | GROUP BY imdb_score; 93 | 94 | Q6:- 95 | Get the release year and lowest gross earnings per release year. 96 | -SELECT release_year,MIN(gross) 97 | FROM films 98 | GROUP BY release_year; 99 | 100 | Get the language and the total gross amount made by films in each language. 101 | -SELECT language,SUM(gross) 102 | FROM films 103 | GROUP BY language; 104 | 105 | Get the country and total budget spent making movies in each country. 106 | -SELECT country,SUM(budget) 107 | FROM films 108 | GROUP BY country; 109 | 110 | Get the release year, country, and highest budget spent making a film for each year, for each country. Sort your results by release year and country. 111 | -SELECT release_year,country,MAX(budget) 112 | FROM films 113 | GROUP BY release_year,country 114 | ORDER BY release_year,country; 115 | 116 | Get the country, release year, and lowest amount grossed per release year per country.
Order your results by country and release year. 117 | -SELECT country,release_year,MIN(gross) 118 | FROM films 119 | GROUP BY release_year,country 120 | ORDER BY country,release_year; 121 | 122 | Q7:- 123 | Get the release year, budget and gross earnings for each film in the films table. 124 | -SELECT release_year,budget,gross 125 | FROM films; 126 | 127 | Modify your query so that only results after 1990 are included. 128 | -SELECT release_year,budget,gross 129 | FROM films 130 | WHERE release_year > 1990; 131 | 132 | Remove the budget and gross columns, and group your results by release year. 133 | -SELECT release_year 134 | FROM films 135 | WHERE release_year > 1990 136 | GROUP BY release_year; 137 | 144 | Modify your query to add in the average budget and average gross earnings for the results you have so far. Alias your results as avg_budget and avg_gross, respectively. 145 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 146 | FROM films 147 | WHERE release_year > 1990 148 | GROUP BY release_year; 149 | 150 | Modify your query so that only years with an average budget of greater than $60 million are included. 151 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 152 | FROM films 153 | WHERE release_year > 1990 154 | GROUP BY release_year 155 | HAVING AVG(budget) > 60000000; 156 | 157 | Finally, modify your query to order the results from highest average gross earnings to lowest. 158 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 159 | FROM films 160 | WHERE release_year > 1990 161 | GROUP BY release_year 162 | HAVING AVG(budget) > 60000000 163 | ORDER BY avg_gross DESC; 164 | 165 | Q8:- 166 | Get the country, average budget, and average gross take of countries that have made more than 10 films. Order the result by country name, and limit the number of results displayed to 5. You should alias the averages as avg_budget and avg_gross respectively. 167 | --- select country, average budget, average gross 168 | SELECT country,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 169 | 170 | -- from the films table 171 | FROM films 172 | -- group by country 173 | GROUP BY country 174 | -- where the country has a title count greater than 10 175 | HAVING COUNT(title) > 10 176 | -- order by country 177 | ORDER BY country 178 | -- limit to only show 5 results 179 | LIMIT 5; 180 | 181 | Joins:- 182 | SELECT title, imdb_score 183 | FROM films 184 | JOIN reviews 185 | ON films.id = reviews.film_id 186 | WHERE title = 'To Kill a Mockingbird'; 187 | 188 | 189 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Numpy-Statistics: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a numpy array np_height that is equal to the first column of np_baseball. 3 | Print out the mean of np_height. 4 | Print out the median of np_height. 5 | 6 | Solution:- 7 | # np_baseball is available 8 | 9 | # Import numpy 10 | import numpy as np 11 | 12 | # Create np_height from np_baseball 13 | np_height = np.array(np_baseball[:,0]) 14 | 15 | # Print out the mean of np_height 16 | print(np.mean(np_height)) 17 | 18 | # Print out the median of np_height 19 | print(np.median(np_height)) 20 | 21 | Q2:- 22 | The code to print out the mean height is already included.
Complete the code for the median height. Replace None with the correct code. 23 | Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code. 24 | Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. 25 | Replace None with the correct code. 26 | 27 | Solution:- 28 | # np_baseball is available 29 | 30 | # Import numpy 31 | import numpy as np 32 | 33 | # Print mean height (first column) 34 | avg = np.mean(np_baseball[:,0]) 35 | print("Average: " + str(avg)) 36 | 37 | # Print median height. Replace 'None' 38 | med = np.median(np_baseball[:,0]) 39 | print("Median: " + str(med)) 40 | 41 | # Print out the standard deviation on height. Replace 'None' 42 | stddev = np.std(np_baseball[:,0]) 43 | 44 | Q3:- 45 | The code to print out the mean height is already included. Complete the code for the median height. Replace None with the correct code. 46 | Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code. 47 | Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. 48 | Replace None with the correct code. 49 | 50 | Solution:- 51 | # np_baseball is available 52 | 53 | # Import numpy 54 | import numpy as np 55 | 56 | # Print mean height (first column) 57 | avg = np.mean(np_baseball[:,0]) 58 | print("Average: " + str(avg)) 59 | 60 | # Print median height. Replace 'None' 61 | med = np.median(np_baseball[:,0]) 62 | print("Median: " + str(med)) 63 | 64 | # Print out the standard deviation on height. Replace 'None' 65 | stddev = np.std(np_baseball[:,0]) 66 | print("Standard Deviation: " + str(stddev)) 67 | 68 | # Print out correlation between first and second column. Replace 'None' 69 | corr = np.corrcoef(np_baseball[:,0],np_baseball[:,1]) 70 | print("Correlation: " + str(corr)) 71 | 72 | Q4:- 73 | Convert heights and positions, which are regular lists, to numpy arrays. Call them np_heights and np_positions. 74 | Extract all the heights of the goalkeepers. You can use a little trick here: use np_positions == 'GK' as an index for np_heights. Assign the result to gk_heights. 75 | Extract all the heights of all the other players. This time use np_positions != 'GK' as an index for np_heights. Assign the result to other_heights. 76 | Print out the median height of the goalkeepers using np.median(). Replace None with the correct code. 77 | Do the same for the other players. Print out their median height. Replace None with the correct code. 78 | 79 | Solution:- 80 | # heights and positions are available as lists 81 | 82 | # Import numpy 83 | import numpy as np 84 | 85 | # Convert positions and heights to numpy arrays: np_positions, np_heights 86 | np_positions = np.array(positions) 87 | np_heights = np.array(heights) 88 | 89 | 90 | # Heights of the goalkeepers: gk_heights 91 | gk_heights = np_heights[np_positions == 'GK'] 92 | 93 | # Heights of the other players: other_heights 94 | other_heights = np_heights[np_positions != 'GK'] 95 | 96 | # Print out the median height of goalkeepers. Replace 'None' 97 | print("Median height of goalkeepers: " + str(np.median(gk_heights))) 98 | 99 | # Print out the median height of other players. 
Replace 'None' 100 | print("Median height of other players: " + str(np.median(other_heights))) 101 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Python-Basics: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Experiment in the IPython Shell; type 5 / 8, for example. 3 | Add another line of code to the Python script: print(7 + 10). 4 | 5 | Solution:- 6 | # Example, do not modify! 7 | print(5 / 8) 8 | 9 | # Put code below here 10 | print(7 + 10) 11 | 12 | Q2:- 13 | Suppose you have $100, which you can invest with a 10% return each year. After one year, it's 100 × 1.1 = 110 dollars, and after two years it's 100 × 1.1 × 1.1 = 121. 14 | Add code on the right to calculate how much money you end up with after 7 years. 15 | 16 | Solution:- 17 | # Addition and subtraction 18 | print(5 + 5) 19 | print(5 - 5) 20 | 21 | # Multiplication and division 22 | print(3 * 5) 23 | print(10 / 2) 24 | 25 | # Exponentiation 26 | print(4 ** 2) 27 | 28 | # Modulo 29 | print(18 % 7) 30 | 31 | # How much is your $100 worth after 7 years? 32 | print(100*1.1**7) 33 | 34 | Q3:- 35 | Calculate the product of savings and factor. Store the result in year1. 36 | What do you think the resulting type will be? Find out by printing out the type of year1. 37 | Calculate the sum of desc and desc and store the result in a new variable doubledesc. 38 | Print out doubledesc. Did you expect this? 39 | 40 | Solution:- 41 | # Several variables to experiment with 42 | savings = 100 43 | factor = 1.1 44 | desc = "compound interest" 45 | 46 | # Assign product of factor and savings to year1 47 | year1 = factor * savings 48 | 49 | # Print the type of year1 50 | print(type(year1)) 51 | 52 | # Assign sum of desc and desc to doubledesc 53 | doubledesc = desc + desc 54 | 55 | # Print out doubledesc 56 | print(doubledesc) 57 | 58 | Q4:- 59 | Fix the code on the right such that the printout runs without errors; use the function str() to convert the variables to strings. 60 | Convert the variable pi_string to a float and store this float as a new variable, pi_float. 61 | 62 | Solution:- 63 | # Definition of savings and result 64 | savings = 100 65 | result = 100 * 1.10 ** 7 66 | 67 | # Fix the printout 68 | print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!") 69 | 70 | # Definition of pi_string 71 | pi_string = "3.1415926" 72 | 73 | # Convert pi_string into float: pi_float 74 | pi_float = float(pi_string) 75 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Python-Lists: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list, areas, that contains the area of the hallway (hall), kitchen (kit), living room (liv), bedroom (bed) and bathroom (bath), in this order. Use the predefined variables. 3 | Print areas with the print() function. 4 | 5 | Solution:- 6 | # area variables (in square meters) 7 | hall = 11.25 8 | kit = 18.0 9 | liv = 20.0 10 | bed = 10.75 11 | bath = 9.50 12 | 13 | # Create list areas 14 | areas = [hall,kit,liv,bed,bath] 15 | 16 | # Print areas 17 | print(areas) 18 | 19 | Q2:- 20 | Finish the line of code that creates the areas list such that the list first contains the name of each room as a string and then its area. More specifically, add the strings "hallway", "kitchen" and "bedroom" at the appropriate locations.
21 | Print areas again; is the printout more informative this time? 22 | 23 | Solution:- 24 | # area variables (in square meters) 25 | hall = 11.25 26 | kit = 18.0 27 | liv = 20.0 28 | bed = 10.75 29 | bath = 9.50 30 | 31 | # Adapt list areas 32 | areas = ["hallway",hall,"kitchen", kit, "living room", liv, "bedroom",bed, "bathroom", bath] 33 | 34 | # Print areas 35 | print(areas) 36 | 37 | Q3:- 38 | Finish the list of lists so that it also contains the bedroom and bathroom data. Make sure you enter these in order! 39 | Print out house; does this way of structuring your data make more sense? 40 | Print out the type of house. Are you still dealing with a list? 41 | 42 | Solution:- 43 | # area variables (in square meters) 44 | hall = 11.25 45 | kit = 18.0 46 | liv = 20.0 47 | bed = 10.75 48 | bath = 9.50 49 | 50 | # house information as list of lists 51 | house = [["hallway", hall], 52 | ["kitchen", kit], 53 | ["living room", liv], 54 | ["bedroom",bed], 55 | ["bathroom",bath]] 56 | 57 | # Print out house 58 | print(house) 59 | 60 | # Print out the type of house 61 | print(type(house)) 62 | 63 | Q4:- 64 | Print out the second element from the areas list, so 11.25. 65 | Subset and print out the last element of areas, being 9.50. Using a negative index makes sense here! 66 | Select the number representing the area of the living room and print it out. 67 | 68 | Solution:- 69 | # Create the areas list 70 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 71 | 72 | # Print out second element from areas 73 | print(areas[1]) 74 | 75 | # Print out last element from areas 76 | print(areas[-1]) 77 | 78 | # Print out the area of the living room 79 | print(areas[5]) 80 | 81 | Q5:- 82 | Using a combination of list subsetting and variable assignment, create a new variable, eat_sleep_area, that contains the sum of the area of the kitchen and the area of the bedroom. 83 | Print the new variable eat_sleep_area. 84 | 85 | Solution:- 86 | # Create the areas list 87 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 88 | 89 | # Sum of kitchen and bedroom area: eat_sleep_area 90 | eat_sleep_area = areas[3] + areas[7] 91 | 92 | # Print the variable eat_sleep_area 93 | print(eat_sleep_area) 94 | 95 | Q6:- 96 | Use slicing to create a list, downstairs, that contains the first 6 elements of areas. 97 | Do a similar thing to create a new variable, upstairs, that contains the last 4 elements of areas. 98 | Print both downstairs and upstairs using print(). 99 | 100 | Solution:- 101 | # Create the areas list 102 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 103 | 104 | # Use slicing to create downstairs 105 | downstairs = areas[:6] 106 | 107 | # Use slicing to create upstairs 108 | upstairs = areas[6:11] 109 | 110 | # Print out downstairs and upstairs 111 | print(downstairs) 112 | print(upstairs) 113 | 114 | Q7:- 115 | Use slicing to create the lists downstairs and upstairs again, but this time without using indexes if it's not necessary. 116 | Remember downstairs is the first 6 elements of areas and upstairs is the last 4 elements of areas. 
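A quick supplementary note on the slicing defaults the next solution relies on (general Python behaviour, shown on a throwaway list): omitting the start index means "from the beginning" and omitting the end index means "through the end".

nums = [0, 1, 2, 3, 4, 5]

print(nums[:3])    # [0, 1, 2]  - start defaults to 0
print(nums[3:])    # [3, 4, 5]  - end defaults to len(nums)
print(nums[:])     # a shallow copy of the whole list
print(nums[-2:])   # [4, 5]     - negative indices count from the end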
117 | 118 | Solution:- 119 | # Create the areas list 120 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 121 | 122 | # Alternative slicing to create downstairs 123 | downstairs = areas[:6] 124 | 125 | # Alternative slicing to create upstairs 126 | upstairs = areas[6:] 127 | 128 | Q8:- 129 | You did a miscalculation when determining the area of the bathroom; it's 10.50 square meters instead of 9.50. Can you make the changes? 130 | Make the areas list more trendy! Change "living room" to "chill zone". 131 | 132 | Solution:- 133 | # Create the areas list 134 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 135 | 136 | # Correct the bathroom area 137 | areas[-1] = 10.50 138 | 139 | # Change "living room" to "chill zone" 140 | areas[4] = "chill zone" 141 | 142 | Q9:- 143 | Use the + operator to paste the list ["poolhouse", 24.5] to the end of the areas list. Store the resulting list as areas_1. 144 | Further extend areas_1 by adding data on your garage. Add the string "garage" and float 15.45. Name the resulting list areas_2. 145 | 146 | Solution:- 147 | # Create the areas list and make some changes 148 | areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0, 149 | "bedroom", 10.75, "bathroom", 10.50] 150 | 151 | # Add poolhouse data to areas, new list is areas_1 152 | areas_1 = areas + ["poolhouse", 24.5] 153 | 154 | # Add garage data to areas_1, new list is areas_2 155 | areas_2 = areas_1 + ["garage", 15.45] 156 | 157 | Q10:- 158 | Change the second command, that creates the variable areas_copy, such that areas_copy is an explicit copy of areas 159 | Now, changes made to areas_copy shouldn't affect areas. Hit Submit Answer to check this. 160 | 161 | Solution:- 162 | # Create list areas 163 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 164 | 165 | # Create areas_copy 166 | areas_copy = list(areas) 167 | 168 | # Change areas_copy 169 | areas_copy[0] = 5.0 170 | 171 | # Print areas 172 | print(areas) 173 | 174 | Q11:- 175 | Use print() in combination with type() to print out the type of var1. 176 | Use len() to get the length of the list var1. Wrap it in a print() call to directly print it out. 177 | Use int() to convert var2 to an integer. Store the output as out2. 178 | 179 | Solution:- 180 | # Create variables var1 and var2 181 | var1 = [1, 2, 3, 4] 182 | var2 = True 183 | 184 | # Print out type of var1 185 | print(type(var1)) 186 | 187 | # Print out length of var1 188 | print(len(var1)) 189 | 190 | # Convert var2 to an integer: out2 191 | out2 = int(var2) 192 | 193 | Q12:- 194 | Use + to merge the contents of first and second into a new list: full. 195 | Call sorted() on full and specify the reverse argument to be True. Save the sorted list as full_sorted. 196 | Finish off by printing out full_sorted. 197 | 198 | Solution:- 199 | # Create lists first and second 200 | first = [11.25, 18.0, 20.0] 201 | second = [10.75, 9.50] 202 | 203 | # Paste together first and second: full 204 | full = first + second 205 | 206 | # Sort full in descending order: full_sorted 207 | full_sorted = sorted(full,reverse=True) 208 | 209 | # Print out full_sorted 210 | print(full_sorted) 211 | -------------------------------------------------------------------------------- /Python/Introduction to Databases in Python/Basics of Relational Databases: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import create_engine from the sqlalchemy module. 
3 | Using the create_engine() function, create an engine for a local file named census.sqlite with sqlite as the driver. Be sure to enclose the connection string within quotation marks. 4 | Print the output from the .table_names() method on the engine. 5 | 6 | Solution:- 7 | # Import create_engine 8 | from sqlalchemy import create_engine 9 | 10 | # Create an engine that connects to the census.sqlite file: engine 11 | engine = create_engine('sqlite:///census.sqlite') 12 | 13 | # Print table names 14 | print(engine.table_names()) 15 | 16 | Q2:- 17 | Import the Table object from sqlalchemy. 18 | Reflect the census table by using the Table object with the arguments: 19 | The name of the table as a string ('census'). 20 | The metadata, contained in the variable metadata. 21 | autoload=True 22 | The engine to autoload with - in this case, engine. 23 | Print the details of census using the repr() function. 24 | 25 | Solution:- 26 | # Import Table 27 | from sqlalchemy import Table 28 | 29 | # Reflect census table from the engine: census 30 | census = Table('census', metadata, autoload=True, autoload_with=engine) 31 | 32 | # Print census table metadata 33 | print(repr(census)) 34 | 35 | Q3:- 36 | Reflect the census table as you did in the previous exercise using the Table() function. 37 | Print a list of column names of the census table by applying the .keys() method to census.columns. 38 | Print the details of the census table using the metadata.tables dictionary along with the repr() function. To do this, first access the 'census' key of the metadata.tables dictionary, and place this inside the provided repr() function. 39 | 40 | Solution:- 41 | # Reflect the census table from the engine: census 42 | census = Table('census', metadata, autoload=True, autoload_with=engine) 43 | 44 | # Print the column names 45 | print(census.columns.keys()) 46 | 47 | # Print full table metadata 48 | print(repr(metadata.tables['census'])) 49 | 50 | Q3:- 51 | Build a SQL statement to query all the columns from census and store it in stmt. Note that your SQL statement must be a string. 52 | Use the .execute() and .fetchall() methods on connection and store the result in results. Remember that .execute() comes before .fetchall() and that stmt needs to be passed to .execute(). 53 | Print results. 54 | 55 | Solution:- 56 | # Build select statement for census table: stmt 57 | stmt = 'select * from census' 58 | 59 | # Execute the statement and fetch the results: results 60 | results = connection.execute(stmt).fetchall() 61 | 62 | # Print results 63 | print(results) 64 | 65 | Q4:- 66 | Import select from the sqlalchemy module. 67 | Reflect the census table. This code is already written for you. 68 | Create a query using the select() function to retrieve the census table. To do so, pass a list to select() containing a single element: census. 69 | Print stmt to see the actual SQL query being created. This code has been written for you. 70 | Using the provided print() function, print all the records from the census table. To do this: 71 | Use the .execute() method on connection with stmt as the argument to retrieve the ResultProxy. 72 | Use .fetchall() on connection.execute(stmt) to retrieve the ResultSet. 
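The exercises assume census has already been reflected and a connection is open. For readers who want something runnable end to end, here is a hedged sketch of the same select/execute/fetchall pattern against a throwaway in-memory SQLite database; the table layout and rows are invented, and the select([...]) list form matches the SQLAlchemy 1.x style used in the course (newer SQLAlchemy versions write select(census) instead).

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

# Throwaway in-memory database standing in for census.sqlite
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

census = Table('census', metadata,
               Column('state', String(30)),
               Column('age', Integer()),
               Column('pop2008', Integer()))
metadata.create_all(engine)

connection = engine.connect()
connection.execute(census.insert(), [
    {'state': 'Illinois', 'age': 30, 'pop2008': 100},
    {'state': 'New York', 'age': 40, 'pop2008': 200},
])

# Same pattern as the exercise: build a select, execute it, fetch the rows
stmt = select([census])
print(stmt)                                  # shows the SQL being emitted
print(connection.execute(stmt).fetchall())   # rows print like tuples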
73 | 74 | Solution:- 75 | # Import select 76 | from sqlalchemy import select 77 | 78 | # Reflect census table via engine: census 79 | census = Table('census', metadata, autoload=True, autoload_with=engine) 80 | 81 | # Build select statement for census table: stmt 82 | stmt = select([census]) 83 | 84 | # Print the emitted statement to see the SQL emitted 85 | print(stmt) 86 | 87 | # Execute the statement and print the results 88 | print(connection.execute(stmt).fetchall()) 89 | 90 | Q5:- 91 | Extract the first row of results and assign it to the variable first_row. 92 | Print the value of the first column in first_row. 93 | Print the value of the 'state' column in first_row. 94 | 95 | Solution:- 96 | # Get the first row of the results by using an index: first_row 97 | first_row = results[0] 98 | 99 | # Print the first row of the results 100 | print(first_row) 101 | 102 | # Print the first column of the first row by using an index 103 | print(first_row[0]) 104 | 105 | # Print the 'state' column of the first row by using its name 106 | print(first_row['state']) 107 | -------------------------------------------------------------------------------- /Python/Introduction to Databases in Python/Putting it all together: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import create_engine and MetaData from sqlalchemy. 3 | Create an engine to the chapter 5 database by using 'sqlite:///chapter5.sqlite' as the connection string. 4 | Create a MetaData object as metadata. 5 | 6 | Solution:- 7 | # Import create_engine, MetaData 8 | from sqlalchemy import create_engine,MetaData 9 | 10 | # Define an engine to connect to chapter5.sqlite: engine 11 | engine = create_engine('sqlite:///chapter5.sqlite') 12 | 13 | # Initialize MetaData: metadata 14 | metadata = MetaData() 15 | 16 | Q2:- 17 | Import Table, Column, String, and Integer from sqlalchemy. 18 | Define a census table with the following columns: 19 | 'state' - String - length of 30 20 | 'sex' - String - length of 1 21 | 'age' - Integer 22 | 'pop2000' - Integer 23 | 'pop2008' - Integer 24 | Create the table in the database using the metadata and engine. 25 | 26 | Solution:- 27 | # Import Table, Column, String, and Integer 28 | from sqlalchemy import Table, Column,String,Integer 29 | 30 | # Build a census table: census 31 | census = Table('census', metadata, 32 | Column('state', String(30)), 33 | Column('sex', String(1)), 34 | Column('age',Integer()), 35 | Column('pop2000', Integer()), 36 | Column('pop2008', Integer())) 37 | 38 | # Create the table in the database 39 | metadata.create_all(engine) 40 | 41 | Q3:- 42 | Create an empty list called values_list. 43 | Iterate over the rows of csv_reader with a for loop, creating a dictionary called data for each row and append it to values_list. 44 | Within the for loop, row will be a list whose entries are 'state' , 'sex', 'age', 'pop2000' and 'pop2008' (in that order). 45 | 46 | Solution:- 47 | # Create an empty list: values_list 48 | values_list = [] 49 | 50 | # Iterate over the rows 51 | for row in csv_reader: 52 | # Create a dictionary with the values 53 | data = {'state': row[0], 'sex': row[1], 'age':row[2], 'pop2000': row[3], 54 | 'pop2008': row[4]} 55 | # Append the dictionary to the values list 56 | values_list.append(data) 57 | 58 | Q4:- 59 | Import insert from sqlalchemy. 60 | Build an insert statement for the census table. 61 | Execute the statement stmt along with values_list. You will need to pass them both as arguments to connection.execute(). 
62 | Print the rowcount attribute of results. 63 | 64 | Solution:- 65 | # Import insert 66 | from sqlalchemy import insert 67 | 68 | # Build insert statement: stmt 69 | stmt = insert(census) 70 | 71 | # Use values_list to insert data: results 72 | results = connection.execute(stmt, values_list) 73 | 74 | # Print rowcount 75 | print(results.rowcount) 76 | 77 | Q5:- 78 | Import select from sqlalchemy. 79 | Build a statement to: 80 | Select sex from the census table. 81 | Select the average age weighted by the population in 2008 (pop2008). See the example given in the assignment text to see how you can do this. Label this average age calculation as 'average_age'. 82 | Group the query by sex. 83 | Execute the query and store it as results. 84 | Loop over results and print the sex and average_age for each record. 85 | 86 | Solution:- 87 | # Import select 88 | from sqlalchemy import select 89 | 90 | # Calculate weighted average age: stmt 91 | stmt = select([census.columns.sex, 92 | (func.sum(census.columns.pop2008 * census.columns.age) / 93 | func.sum(census.columns.pop2008)).label('average_age') 94 | ]) 95 | 96 | # Group by sex 97 | stmt = stmt.group_by(census.columns.sex) 98 | 99 | # Execute the query and store the results: results 100 | results = connection.execute(stmt).fetchall() 101 | 102 | # Print the average age by sex 103 | for row in results: 104 | print(row.sex, row.average_age) 105 | 106 | Q6:- 107 | Import case, cast and Float from sqlalchemy. 108 | Define a statement to select state and the percentage of females in 2000. 109 | Inside func.sum(), use case() to select females (using the sex column) from pop2000. Remember to specify else_=0 if the sex is not 'F'. 110 | To get the percentage, divide the number of females in the year 2000 by the overall population in 2000. Cast the divisor - census.columns.pop2000 - to Float before multiplying by 100. 111 | Group the query by state. 112 | Execute the query and store it as results. 113 | Print state and percent_female for each record. This has been done for you, so hit 'Submit Answer' to see the result. 114 | 115 | Solution:- 116 | # import case, cast and Float from sqlalchemy 117 | from sqlalchemy import case, cast, Float 118 | 119 | # Build a query to calculate the percentage of females in 2000: stmt 120 | stmt = select([census.columns.state, 121 | (func.sum( 122 | case([ 123 | (census.columns.sex == 'F', census.columns.pop2000) 124 | ], else_=0)) / 125 | cast(func.sum(census.columns.pop2000), Float) * 100).label('percent_female') 126 | ]) 127 | 128 | # Group By state 129 | stmt = stmt.group_by(census.columns.state) 130 | 131 | # Execute the query and store the results: results 132 | results = connection.execute(stmt).fetchall() 133 | 134 | # Print the percentage 135 | for result in results: 136 | print(result.state, result.percent_female) 137 | 138 | Q7:- 139 | Build a statement to: 140 | Select state. 141 | Calculate the difference in population between 2008 (pop2008) and 2000 (pop2000). 142 | Group the query by census.columns.state using the .group_by() method on stmt. 143 | Order by 'pop_change' in descending order using the .order_by() method with the desc() function on 'pop_change'. 144 | Limit the query to the top 10 states using the .limit() method. 145 | Execute the query and store it as results. 146 | Print the state and the population change for each result. This has been done for you, so hit 'Submit Answer' to see the result! 
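One handy habit when chaining .group_by(), .order_by() and .limit() as this exercise does: printing a SQLAlchemy statement shows the SQL it will emit, which you can sanity-check before executing. Below is a small self-contained sketch; the census table is declared locally just to build the statement (no database connection is needed), and the select([...]) form again assumes SQLAlchemy 1.x.

from sqlalchemy import MetaData, Table, Column, Integer, String, select, desc

# Local table definition mirroring the exercise's census table
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)),
               Column('pop2000', Integer()),
               Column('pop2008', Integer()))

stmt = select([census.columns.state,
               (census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
stmt = stmt.group_by(census.columns.state)
stmt = stmt.order_by(desc('pop_change'))
stmt = stmt.limit(10)

# Prints the SELECT ... GROUP BY ... ORDER BY ... LIMIT statement SQLAlchemy will run
print(stmt)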
147 | 148 | Solution:- 149 | # Build query to return state name and population difference from 2008 to 2000 150 | stmt = select([census.columns.state, 151 | (census.columns.pop2008-census.columns.pop2000).label('pop_change') 152 | ]) 153 | 154 | # Group by State 155 | stmt = stmt.group_by(census.columns.state) 156 | 157 | # Order by Population Change 158 | stmt = stmt.order_by(desc('pop_change')) 159 | 160 | # Limit to top 10 161 | stmt = stmt.limit(10) 162 | 163 | # Use connection to execute the statement and fetch all results 164 | results = connection.execute(stmt).fetchall() 165 | 166 | # Print the state and population change for each record 167 | for result in results: 168 | print('{}:{}'.format(result.state, result.pop_change)) 169 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Enforce data consistency with attribute constraints: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Execute the given sample code. 3 | As it doesn't work, have a look at the error message and correct the statement accordingly – then execute it again. 4 | 5 | Solution:- 6 | -- Let's add a record to the table 7 | INSERT INTO transactions (transaction_date, amount, fee) 8 | VALUES ('2018-09-24', 5454, '30'); 9 | 10 | -- Doublecheck the contents 11 | SELECT * 12 | FROM transactions; 13 | 14 | Q2:- 15 | Execute the given sample code. 16 | As it doesn't work, add an integer type cast at the right place and execute it again. 17 | 18 | Solution:- 19 | -- Calculate the net amount as amount + fee 20 | SELECT transaction_date, amount + cast(fee as integer) AS net_amount 21 | FROM transactions; 22 | 23 | Q3:- 24 | Have a look at the distinct university_shortname values and take note of the length of the strings. 25 | 26 | Solution:- 27 | -- Select the university_shortname column 28 | SELECT distinct(university_shortname) 29 | FROM professors; 30 | 31 | Q4:- 32 | Now specify a fixed-length character type with the correct length for university_shortname 33 | 34 | Solution:- 35 | -- Specify the correct fixed-length character type 36 | ALTER TABLE professors 37 | ALTER COLUMN university_shortname 38 | TYPE char(3); 39 | 40 | Q5:- 41 | Change the type of the firstname column to varchar(64) 42 | 43 | Solution:- 44 | -- Change the type of firstname 45 | alter table professors 46 | alter column firstname 47 | type varchar(64); 48 | 49 | Q5:- 50 | Run the sample code as is and take note of the error. 51 | Now use SUBSTRING() to reduce firstname to 16 characters so its type can be altered to varchar(16). 52 | 53 | Solution:- 54 | -- Convert the values in firstname to a max. of 16 characters 55 | ALTER TABLE professors 56 | ALTER COLUMN firstname 57 | TYPE varchar(16) 58 | using substring(firstname from 1 for 16) 59 | 60 | Q6:- 61 | Add a not-null constraint for the firstname column. 62 | 63 | Solution:- 64 | -- Disallow NULL values in firstname 65 | alter table professors 66 | ALTER COLUMN firstname SET NOT NULL; 67 | 68 | Q7:- 69 | Add a not-null constraint for the lastname column. 70 | 71 | Solution:- 72 | -- Disallow NULL values in lastname 73 | alter table professors 74 | alter column lastname set not null; 75 | 76 | Q8:- 77 | Add a unique constraint to the university_shortname column in universities. 
Give it the name university_shortname_unq 78 | 79 | Solution:- 80 | -- Make universities.university_shortname unique 81 | ALTER table universities 82 | ADD constraint university_shortname_unq UNIQUE(university_shortname); 83 | 84 | Q9:- 85 | Add a unique constraint to the organization column in organizations. Give it the name organization_unq 86 | 87 | Solution:- 88 | -- Make organizations.organization unique 89 | alter table organizations 90 | add constraint organization_unq unique(organization) 91 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Uniquely identify records with key constraints: -------------------------------------------------------------------------------- 1 | Q1:- 2 | First, find out the number of rows in universities. 3 | 4 | Solution:- 5 | -- Count the number of rows in universities 6 | SELECT count(*) 7 | FROM universities; 8 | 9 | Q2:- 10 | Then, find out how many unique values there are in the university_city column. 11 | 12 | Solution:- 13 | -- Count the number of distinct values in the university_city column 14 | SELECT count(distinct(university_city)) 15 | FROM universities; 16 | 17 | Q3:- 18 | Using the above steps, identify the candidate key by trying out different combination of columns. 19 | 20 | Solution:- 21 | -- Try out different combinations 22 | select COUNT(distinct(firstname,lastname)) 23 | FROM professors; 24 | 25 | Q4:- 26 | Rename the organization column to id in organizations. 27 | Make id a primary key and name it organization_pk. 28 | 29 | Solution:- 30 | -- Rename the organization column to id 31 | ALTER TABLE organizations 32 | RENAME COLUMN organization TO id; 33 | 34 | -- Make id a primary key 35 | ALTER TABLE organizations 36 | ADD CONSTRAINT organization_pk PRIMARY KEY (id); 37 | 38 | Q5:- 39 | Rename the university_shortname column to id in universities. 40 | Make id a primary key and name it university_pk. 41 | 42 | Solution:- 43 | -- Rename the university_shortname column to id 44 | alter table universities 45 | rename column university_shortname to id; 46 | 47 | -- Make id a primary key 48 | alter table universities 49 | add constraint university_pk primary key (id); 50 | 51 | Q6:- 52 | Add a new column id with data type serial to the professors table. 53 | 54 | Solution:- 55 | -- Add the new column to the table 56 | ALTER TABLE professors 57 | add column id serial; 58 | 59 | Q7:- 60 | Make id a primary key and name it professors_pkey 61 | 62 | solution:- 63 | -- Add the new column to the table 64 | ALTER TABLE professors 65 | ADD COLUMN id serial; 66 | 67 | -- Make id a primary key 68 | ALTER table professors 69 | add CONSTRAINT professors_pkey primary key (id); 70 | 71 | Q8:- 72 | Write a query that returns all the columns and 10 rows from professors. 73 | 74 | solution:- 75 | -- Add the new column to the table 76 | ALTER TABLE professors 77 | ADD COLUMN id serial; 78 | 79 | -- Make id a primary key 80 | ALTER TABLE professors 81 | ADD CONSTRAINT professors_pkey PRIMARY KEY (id); 82 | 83 | -- Have a look at the first 10 rows of professors 84 | select * from professors limit 10; 85 | 86 | Q9:- 87 | Count the number of distinct rows with a combination of the make and model columns. 88 | 89 | Solution:- 90 | -- Count the number of distinct rows with columns make, model 91 | select count(distinct(make,model)) 92 | FROM cars; 93 | 94 | Q10:- 95 | Add a new column id with the data type varchar(128). 
96 | 97 | Solution:- 98 | -- Count the number of distinct rows with columns make, model 99 | SELECT COUNT(DISTINCT(make, model)) 100 | FROM cars; 101 | 102 | -- Add the id column 103 | ALTER TABLE cars 104 | add column id varchar(128); 105 | 106 | Q11:- 107 | Concatenate make and model into id using an UPDATE query and the CONCAT() function. 108 | 109 | Solution:- 110 | -- Count the number of distinct rows with columns make, model 111 | SELECT COUNT(DISTINCT(make, model)) 112 | FROM cars; 113 | 114 | -- Add the id column 115 | ALTER TABLE cars 116 | ADD COLUMN id varchar(128); 117 | 118 | -- Update id with make + model 119 | UPDATE cars 120 | set id = concat(make, model); 121 | 122 | Q12:- 123 | Make id a primary key and name it id_pk 124 | 125 | Solution:- 126 | -- Count the number of distinct rows with columns make, model 127 | SELECT COUNT(DISTINCT(make, model)) 128 | FROM cars; 129 | 130 | -- Add the id column 131 | ALTER TABLE cars 132 | ADD COLUMN id varchar(128); 133 | 134 | -- Update id with make + model 135 | UPDATE cars 136 | SET id = CONCAT(make, model); 137 | 138 | -- Make id a primary key 139 | alter table cars 140 | add constraint id_pk primary key(id); 141 | 142 | -- Have a look at the table 143 | SELECT * FROM cars; 144 | 145 | Q13:- 146 | Given the above description of a student entity, create a table students with the correct column types. 147 | Add a primary key for the social security number. 148 | 149 | Solution:- 150 | -- Create the table 151 | create table students ( 152 | last_name varchar(128) not null, 153 | ssn integer primary key, 154 | phone_no char(12) 155 | ); 156 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Your first database: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get information on all table names in the current database, while limiting your query to the 'public' table_schema. 3 | 4 | Solution:- 5 | -- Query the right table in information_schema 6 | SELECT table_name 7 | FROM information_schema.tables 8 | -- Specify the correct table_schema value 9 | WHERE table_schema = 'public'; 10 | 11 | Q2:- 12 | Now have a look at the columns in university_professors by selecting all entries in information_schema.columns that correspond to that table. 13 | 14 | Solution:- 15 | -- Query the right table in information_schema to get columns 16 | SELECT column_name, data_type 17 | FROM information_schema.columns 18 | WHERE table_name = 'university_professors' AND table_schema = 'public'; 19 | 20 | Q3:- 21 | Finally, print the first five rows of the university_professors table. 22 | 23 | Solution:- 24 | -- Query the first five rows of our table 25 | select * 26 | from university_professors 27 | LIMIT 5; 28 | 29 | Q4:- 30 | Create a table professors with two text columns: firstname and lastname. 31 | 32 | Solution:- 33 | -- Create a table for the professors entity type 34 | CREATE TABLE professors ( 35 | firstname text, 36 | lastname text 37 | ); 38 | 39 | -- Print the contents of this table 40 | SELECT * 41 | FROM professors 42 | 43 | Q5:- 44 | Create a table universities with three text columns: university_shortname, university, and university_city. 
45 | 46 | Solution:- 47 | -- Create a table for the universities entity type 48 | create table universities( 49 | university_shortname text, 50 | university text, 51 | university_city text 52 | ); 53 | 54 | 55 | 56 | 57 | 58 | -- Print the contents of this table 59 | SELECT * 60 | FROM universities 61 | 62 | Q6:- 63 | Alter professors to add the text column university_shortname. 64 | 65 | Solution:- 66 | -- Add the university_shortname column 67 | alter table professors 68 | add column university_shortname text; 69 | 70 | -- Print the contents of this table 71 | SELECT * 72 | FROM professors 73 | 74 | Q7:- 75 | Rename the organisation column to organization in affiliations. 76 | 77 | Solution:- 78 | -- Rename the organisation column 79 | ALTER TABLE affiliations 80 | RENAME column organisation TO organization; 81 | 82 | Q8:- 83 | Delete the university_shortname column in affiliations. 84 | 85 | Solution:- 86 | -- Rename the organisation column 87 | ALTER TABLE affiliations 88 | RENAME COLUMN organisation TO organization; 89 | 90 | -- Delete the university_shortname column 91 | alter table affiliations 92 | drop column university_shortname; 93 | 94 | Q9:- 95 | Insert all DISTINCT professors from university_professors into professors. 96 | Print all the rows in professors. 97 | 98 | Solution:- 99 | -- Insert unique professors into the new table 100 | insert into professors 101 | SELECT DISTINCT firstname, lastname, university_shortname 102 | FROM university_professors; 103 | 104 | -- Doublecheck the contents of professors 105 | SELECT * 106 | FROM professors; 107 | 108 | Q10:- 109 | Insert all DISTINCT affiliations into affiliations. 110 | 111 | Solution:- 112 | -- Insert unique affiliations into the new table 113 | INSERT INTO affiliations 114 | SELECT DISTINCT firstname, lastname, function, organization 115 | FROM university_professors; 116 | 117 | -- Doublecheck the contents of affiliations 118 | SELECT * 119 | FROM affiliations; 120 | 121 | Q11:- 122 | Delete the university_professors table. 123 | 124 | Solution:- 125 | -- Delete the university_professors table 126 | drop table university_professors; 127 | -------------------------------------------------------------------------------- /Python/Introduction to Shell for Data Science/Manipulating files and directories: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Machine Learning with the Experts: School Budgets/Exploring the raw data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Print summary statistics of the numeric columns in the DataFrame df using the .describe() method. 3 | Import matplotlib.pyplot as plt. 4 | Create a histogram of the non-null 'FTE' column. You can do this by passing df['FTE'].dropna() to plt.hist(). 5 | The title has been specified and axes have been labeled, so hit 'Submit Answer' to see how often school employees work full-time! 
6 | 7 | Solution:- 8 | 9 | # Print the summary statistics 10 | print(df.describe()) 11 | 12 | # Import matplotlib.pyplot as plt 13 | import matplotlib.pyplot as plt 14 | 15 | # Create the histogram 16 | plt.hist(df['FTE'].dropna()) 17 | 18 | # Add title and labels 19 | plt.title('Distribution of %full-time \n employee works') 20 | plt.xlabel('% of full-time') 21 | plt.ylabel('num employees') 22 | 23 | # Display the histogram 24 | plt.show() 25 | 26 | Q2:- 27 | Define the lambda function categorize_label to convert column x into x.astype('category'). 28 | Use the LABELS list provided to convert the subset of data df[LABELS] to categorical types using the .apply() method and categorize_label. Don't forget axis=0. 29 | Print the converted .dtypes attribute of df[LABELS] 30 | 31 | Solution:- 32 | # Define the lambda function: categorize_label 33 | categorize_label = lambda x: x.astype('category') 34 | 35 | # Convert df[LABELS] to a categorical type 36 | df[LABELS] = df[LABELS].apply(categorize_label,axis=0) 37 | 38 | # Print the converted dtypes 39 | print(df[LABELS].dtypes) 40 | 41 | Q3:- 42 | Create the DataFrame num_unique_labels by using the .apply() method on df[LABELS] with pd.Series.nunique as the argument. 43 | Create a bar plot of num_unique_labels using pandas' .plot(kind='bar') method. 44 | The axes have been labeled for you, so hit 'Submit Answer' to see the number of unique values for each label. 45 | 46 | Solution:- 47 | # Import matplotlib.pyplot 48 | import matplotlib.pyplot as plt 49 | 50 | # Calculate number of unique values for each label: num_unique_labels 51 | num_unique_labels = df[LABELS].apply(pd.Series.nunique) 52 | 53 | # Plot number of unique values for each label 54 | num_unique_labels.plot(kind='bar') 55 | 56 | # Label the axes 57 | plt.xlabel('Labels') 58 | plt.ylabel('Number of unique values') 59 | 60 | # Display the plot 61 | plt.show() 62 | 63 | Q4:- 64 | Using the compute_log_loss() function, compute the log loss for the following predicted values (in each case, the actual values are contained in actual_labels): 65 | correct_confident. 66 | correct_not_confident. 67 | wrong_not_confident. 68 | wrong_confident. 69 | actual_labels. 
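Note: compute_log_loss() is supplied by the course and is never defined in this file. A plausible sketch of such a helper, for context only (the clipping constant eps is an assumption, not necessarily the course's exact value):

import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss between predicted probabilities and actual labels.

    Predictions are clipped away from 0 and 1 so the logarithm stays finite.
    (Hypothetical helper; the course supplies its own implementation.)
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))
    return loss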
70 | 71 | Solution:- 72 | # Compute and print log loss for 1st case 73 | correct_confident_loss = compute_log_loss(correct_confident, actual_labels) 74 | print("Log loss, correct and confident: {}".format(correct_confident_loss)) 75 | 76 | # Compute log loss for 2nd case 77 | correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels) 78 | print("Log loss, correct and not confident: {}".format(correct_not_confident_loss)) 79 | 80 | # Compute and print log loss for 3rd case 81 | wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels) 82 | print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss)) 83 | 84 | # Compute and print log loss for 4th case 85 | wrong_confident_loss = compute_log_loss(wrong_confident,actual_labels) 86 | print("Log loss, wrong and confident: {}".format(wrong_confident_loss)) 87 | 88 | # Compute and print log loss for actual labels 89 | actual_labels_loss = compute_log_loss(actual_labels, actual_labels) 90 | print("Log loss, actual labels: {}".format(actual_labels_loss)) 91 | -------------------------------------------------------------------------------- /Python/Machine Learning with the Experts: School Budgets/Learning from the experts: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create text_vector by preprocessing X_train using combine_text_columns. This is important, or else you won't get any tokens! 3 | Instantiate CountVectorizer as text_features. Specify the keyword argument token_pattern=TOKENS_ALPHANUMERIC. 4 | Fit text_features to the text_vector. 5 | 6 | Solution:- 7 | # Import the CountVectorizer 8 | from sklearn.feature_extraction.text import CountVectorizer 9 | 10 | # Create the text vector 11 | text_vector = combine_text_columns(X_train) 12 | 13 | # Create the token pattern: TOKENS_ALPHANUMERIC 14 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 15 | 16 | # Instantiate the CountVectorizer: text_features 17 | text_features = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC) 18 | 19 | # Fit text_features to the text vector 20 | text_features.fit(text_vector) 21 | 22 | # Print the first 10 tokens 23 | print(text_features.get_feature_names()[:10]) 24 | 25 | Q2:- 26 | Import CountVectorizer from sklearn.feature_extraction.text. 27 | Add a CountVectorizer step to the pipeline with the name 'vectorizer'. 28 | Set the token pattern to be TOKENS_ALPHANUMERIC. 29 | Set the ngram_range to be (1, 2). 
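Note: combine_text_columns() is written in an earlier chapter of the course and only called here. A sketch of what such a helper typically does (illustrative, not the exact course code; the course version defaults to_drop to NUMERIC_COLUMNS + LABELS, which are also course-provided names):

def combine_text_columns(data_frame, to_drop=()):
    """Combine all text columns of data_frame into a single
    space-separated string per row (sketch of the course helper)."""
    # Drop any requested non-text columns that actually exist in the frame
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # Replace NaNs with empty strings so the join below does not fail
    text_data.fillna('', inplace=True)

    # Join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)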
30 | 31 | Solution:- 32 | # Import pipeline 33 | from sklearn.pipeline import Pipeline 34 | 35 | # Import classifiers 36 | from sklearn.linear_model import LogisticRegression 37 | from sklearn.multiclass import OneVsRestClassifier 38 | 39 | # Import CountVectorizer 40 | from sklearn.feature_extraction.text import CountVectorizer 41 | 42 | # Import other preprocessing modules 43 | from sklearn.preprocessing import Imputer 44 | from sklearn.feature_selection import chi2, SelectKBest 45 | 46 | # Select 300 best features 47 | chi_k = 300 48 | 49 | # Import functional utilities 50 | from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler 51 | from sklearn.pipeline import FeatureUnion 52 | 53 | # Perform preprocessing 54 | get_text_data = FunctionTransformer(combine_text_columns, validate=False) 55 | get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False) 56 | 57 | # Create the token pattern: TOKENS_ALPHANUMERIC 58 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 59 | 60 | # Instantiate pipeline: pl 61 | pl = Pipeline([ 62 | ('union', FeatureUnion( 63 | transformer_list = [ 64 | ('numeric_features', Pipeline([ 65 | ('selector', get_numeric_data), 66 | ('imputer', Imputer()) 67 | ])), 68 | ('text_features', Pipeline([ 69 | ('selector', get_text_data), 70 | ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 71 | ngram_range=(1,2))), 72 | ('dim_red', SelectKBest(chi2, chi_k)) 73 | ])) 74 | ] 75 | )), 76 | ('scale', MaxAbsScaler()), 77 | ('clf', OneVsRestClassifier(LogisticRegression())) 78 | ]) 79 | 80 | Q3:- 81 | Add the interaction terms step using SparseInteractions() with degree=2. Give it a name of 'int', and make sure it is after the preprocessing step but before scaling. 82 | 83 | Solution:- 84 | # Instantiate pipeline: pl 85 | pl = Pipeline([ 86 | ('union', FeatureUnion( 87 | transformer_list = [ 88 | ('numeric_features', Pipeline([ 89 | ('selector', get_numeric_data), 90 | ('imputer', Imputer()) 91 | ])), 92 | ('text_features', Pipeline([ 93 | ('selector', get_text_data), 94 | ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 95 | ngram_range=(1, 2))), 96 | ('dim_red', SelectKBest(chi2, chi_k)) 97 | ])) 98 | ] 99 | )), 100 | ('int', SparseInteractions(degree=2)), 101 | ('scale', MaxAbsScaler()), 102 | ('clf', OneVsRestClassifier(LogisticRegression())) 103 | ]) 104 | 105 | Q4:- 106 | Import HashingVectorizer from sklearn.feature_extraction.text. 107 | Instantiate the HashingVectorizer as hashing_vec using the TOKENS_ALPHANUMERIC pattern. 108 | Fit and transform hashing_vec using text_data. Save the result as hashed_text. 109 | Hit 'Submit Answer' to see some of the resulting hash values. 110 | 111 | Solution:- 112 | # Import HashingVectorizer 113 | from sklearn.feature_extraction.text import HashingVectorizer 114 | 115 | # Get text data: text_data 116 | text_data = combine_text_columns(X_train) 117 | 118 | # Create the token pattern: TOKENS_ALPHANUMERIC 119 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 120 | 121 | # Instantiate the HashingVectorizer: hashing_vec 122 | hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC) 123 | 124 | # Fit and transform the Hashing Vectorizer 125 | hashed_text = hashing_vec.fit_transform(text_data) 126 | 127 | # Create DataFrame and print the head 128 | hashed_df = pd.DataFrame(hashed_text.data) 129 | print(hashed_df.head()) 130 | 131 | Q5:- 132 | Import HashingVectorizer from sklearn.feature_extraction.text. 133 | Add a HashingVectorizer step to the pipeline. 
134 | Name the step 'vectorizer'. 135 | Use the TOKENS_ALPHANUMERIC token pattern. 136 | Specify the ngram_range to be (1, 2) 137 | 138 | Solution:- 139 | # Import the hashing vectorizer 140 | from sklearn.feature_extraction.text import HashingVectorizer 141 | 142 | # Instantiate the winning model pipeline: pl 143 | pl = Pipeline([ 144 | ('union', FeatureUnion( 145 | transformer_list = [ 146 | ('numeric_features', Pipeline([ 147 | ('selector', get_numeric_data), 148 | ('imputer', Imputer()) 149 | ])), 150 | ('text_features', Pipeline([ 151 | ('selector', get_text_data), 152 | ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 153 | non_negative=True, norm=None, binary=False, 154 | ngram_range=(1,2))), 155 | ('dim_red', SelectKBest(chi2, chi_k)) 156 | ])) 157 | ] 158 | )), 159 | ('int', SparseInteractions(degree=2)), 160 | ('scale', MaxAbsScaler()), 161 | ('clf', OneVsRestClassifier(LogisticRegression())) 162 | ]) 163 | -------------------------------------------------------------------------------- /Python/Manipulating DataFrames with pandas/Advanced indexing: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list new_idx with the same elements as in sales.index, but with all characters capitalized. 3 | Assign new_idx to sales.index. 4 | Print the sales dataframe. This has been done for you, so hit 'Submit Answer' and to see how the index changed. 5 | 6 | Solution:- 7 | # Create the list of new indexes: new_idx 8 | new_idx = [i.upper() for i in sales.index] 9 | 10 | # Assign new_idx to sales.index 11 | sales.index = new_idx 12 | 13 | # Print the sales DataFrame 14 | print(sales) 15 | 16 | Q2:- 17 | Assign the string 'MONTHS' to sales.index.name to create a name for the index. 18 | Print the sales dataframe to see the index name you just created. 19 | Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns. 20 | Print the sales dataframe again to see the columns name you just created. 21 | 22 | Solution:- 23 | # Assign the string 'MONTHS' to sales.index.name 24 | sales.index.name = 'MONTHS' 25 | 26 | # Print the sales DataFrame 27 | print(sales) 28 | 29 | # Assign the string 'PRODUCTS' to sales.columns.name 30 | sales.columns.name = 'PRODUCTS' 31 | 32 | # Print the sales dataframe again 33 | print(sales) 34 | 35 | Q3:- 36 | Generate a list months with the data ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']. This has been done for you. 37 | Assign months to sales.index. 38 | Print the modified sales dataframe and verify that you now have month information in the index. 39 | 40 | Solution:- 41 | # Generate the list of months: months 42 | months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'] 43 | 44 | # Assign months to sales.index 45 | sales.index = months 46 | 47 | # Print the modified sales DataFrame 48 | print(sales) 49 | 50 | Q4:- 51 | Create a MultiIndex by setting the index to be the columns ['state', 'month']. 52 | Sort the MultiIndex using the .sort_index() method. 53 | Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month! 54 | 55 | Solution:- 56 | # Set the index to be the columns ['state', 'month']: sales 57 | sales = sales.set_index(['state', 'month']) 58 | 59 | # Sort the MultiIndex: sales 60 | sales = sales.sort_index() 61 | 62 | # Print the sales DataFrame 63 | print(sales) 64 | 65 | Q5:- 66 | Set the index of sales to be the column 'state'. 
67 | Print the sales DataFrame to verify that indeed you have an index with state values. 68 | Access the data from 'NY' and print it to verify that you obtain two rows. 69 | 70 | Solution:- 71 | # Set the index to the column 'state': sales 72 | sales = sales.set_index(['state']) 73 | 74 | # Print the sales DataFrame 75 | print(sales) 76 | 77 | # Access the data from 'NY' 78 | print(sales.loc['NY']) 79 | 80 | Q6:- 81 | Look up data for the New York column ('NY') in month 1. 82 | Look up data for the California and Texas columns ('CA', 'TX') in month 2. 83 | Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2. 84 | 85 | Solution:- 86 | # Look up data for NY in month 1: NY_month1 87 | NY_month1 = sales.loc[('NY',1)] 88 | 89 | # Look up data for CA and TX in month 2: CA_TX_month2 90 | CA_TX_month2 = sales.loc[(['CA','TX'],2),:] 91 | 92 | # Look up data for all states in month 2: all_month2 93 | all_month2 = sales.loc[(slice(None),2),:] 94 | -------------------------------------------------------------------------------- /Python/Merging DataFrames with pandas/Merging data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Using pd.merge(), merge the DataFrames revenue and managers on the 'city' column of each. Store the result as merge_by_city. 3 | Print the DataFrame merge_by_city. This has been done for you. 4 | Merge the DataFrames revenue and managers on the 'branch_id' column of each. Store the result as merge_by_id. 5 | Print the DataFrame merge_by_id. This has been done for you, so hit 'Submit Answer' to see the result! 6 | 7 | Solution:- 8 | # Merge revenue with managers on 'city': merge_by_city 9 | merge_by_city = pd.merge(revenue,managers,on='city') 10 | 11 | # Print merge_by_city 12 | print(merge_by_city) 13 | 14 | # Merge revenue with managers on 'branch_id': merge_by_id 15 | merge_by_id = pd.merge(revenue,managers,on='branch_id') 16 | 17 | # Print merge_by_id 18 | print(merge_by_id) 19 | 20 | Q2:- 21 | Merge the DataFrames revenue and managers into a single DataFrame called combined using the 'city' and 'branch' columns from the appropriate DataFrames. 22 | In your call to pd.merge(), you will have to specify the parameters left_on and right_on appropriately. 23 | Print the new DataFrame combined. 24 | 25 | Solution:- 26 | # Merge revenue & managers on 'city' & 'branch': combined 27 | combined = pd.merge(revenue,managers,left_on='city',right_on='branch') 28 | 29 | # Print combined 30 | print(combined) 31 | 32 | Q3:- 33 | Create a column called 'state' in the DataFrame revenue, consisting of the list ['TX','CO','IL','CA']. 34 | Create a column called 'state' in the DataFrame managers, consisting of the list ['TX','CO','CA','MO']. 35 | Merge the DataFrames revenue and managers using three columns :'branch_id', 'city', and 'state'. Pass them in as a list to the on paramater of pd.merge(). 36 | 37 | Solution:- 38 | # Add 'state' column to revenue: revenue['state'] 39 | revenue['state'] = ['TX','CO','IL','CA'] 40 | 41 | # Add 'state' column to managers: managers['state'] 42 | managers['state'] = ['TX','CO','CA','MO'] 43 | 44 | # Merge revenue & managers on 'branch_id', 'city', & 'state': combined 45 | combined = pd.merge(revenue,managers,on=['branch_id', 'city', 'state']) 46 | 47 | # Print combined 48 | print(combined) 49 | 50 | Q4:- 51 | Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame revenue_and_sales. 52 | Use how='right' and on=['city', 'state']. 
53 | Print the new DataFrame revenue_and_sales. This has been done for you. 54 | Execute a left merge with sales and managers to yield a new DataFrame sales_and_managers. 55 | Use how='left', left_on=['city', 'state'], and right_on=['branch', 'state']. 56 | Print the new DataFrame sales_and_managers. This has been done for you, so hit 'Submit Answer' to see the result! 57 | 58 | Solution:- 59 | # Merge revenue and sales: revenue_and_sales 60 | revenue_and_sales = pd.merge(revenue,sales ,how='right',on=['city','state']) 61 | 62 | # Print revenue_and_sales 63 | print(revenue_and_sales) 64 | 65 | # Merge sales and managers: sales_and_managers 66 | sales_and_managers = pd.merge(sales,managers,how='left',left_on=['city','state'],right_on=['branch','state']) 67 | 68 | # Print sales_and_managers 69 | print(sales_and_managers) 70 | 71 | Q5:- 72 | Merge sales_and_managers with revenue_and_sales. Store the result as merge_default. 73 | Print merge_default. This has been done for you. 74 | Merge sales_and_managers with revenue_and_sales using how='outer'. Store the result as merge_outer. 75 | Print merge_outer. This has been done for you. 76 | Merge sales_and_managers with revenue_and_sales only on ['city','state'] using an outer join. Store the result as merge_outer_on and hit 'Submit Answer' to see what the merged DataFrames look like! 77 | 78 | Solution:- 79 | # Perform the first merge: merge_default 80 | merge_default = pd.merge(sales_and_managers,revenue_and_sales) 81 | 82 | # Print merge_default 83 | print(merge_default) 84 | 85 | # Perform the second merge: merge_outer 86 | merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer') 87 | 88 | # Print merge_outer 89 | print(merge_outer) 90 | 91 | # Perform the third merge: merge_outer_on 92 | merge_outer_on = pd.merge(sales_and_managers,revenue_and_sales,on=['city','state'],how='outer') 93 | 94 | # Print merge_outer_on 95 | print(merge_outer_on) 96 | 97 | Q6:- 98 | Perform an ordered merge on austin and houston using pd.merge_ordered(). Store the result as tx_weather. 99 | Print tx_weather. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city. 100 | Perform another ordered merge on austin and houston. 101 | This time, specify the keyword arguments on='date' and suffixes=['_aus','_hus'] so that the rows can be distinguished. Store the result as tx_weather_suff. 102 | Print tx_weather_suff to examine its contents. This has been done for you. 103 | Perform a third ordered merge on austin and houston. 104 | This time, in addition to the on and suffixes parameters, specify the keyword argument fill_method='ffill' to use forward-filling to replace NaN entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames! 
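Note: for readers without the course's austin and houston DataFrames, the effect of fill_method='ffill' in pd.merge_ordered() can be reproduced with two small toy frames (the data below is made up purely for illustration):

import pandas as pd

# Toy stand-ins for the course's Austin/Houston weather tables
austin = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-17', '2016-02-08']),
                       'ratings': ['Cloudy', 'Sunny', 'Cloudy']})
houston = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-04', '2016-03-01']),
                        'ratings': ['Cloudy', 'Rainy', 'Sunny']})

# Without ffill, dates present in only one frame get NaN in the other frame's column;
# with fill_method='ffill' the most recent non-null value is carried forward instead.
print(pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus']))
print(pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'],
                       fill_method='ffill'))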
105 | 106 | Solution:- 107 | # Perform the first ordered merge: tx_weather 108 | tx_weather = pd.merge_ordered(austin,houston) 109 | 110 | # Print tx_weather 111 | print(tx_weather) 112 | 113 | # Perform the second ordered merge: tx_weather_suff 114 | tx_weather_suff = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus']) 115 | 116 | # Print tx_weather_suff 117 | print(tx_weather_suff) 118 | 119 | # Perform the third ordered merge: tx_weather_ffill 120 | tx_weather_ffill = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'],fill_method='ffill') 121 | 122 | # Print tx_weather_ffill 123 | print(tx_weather_ffill) 124 | 125 | Q7:- 126 | Merge auto and oil using pd.merge_asof() with left_on='yr' and right_on='Date'. Store the result as merged. 127 | Print the tail of merged. This has been done for you. 128 | Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and aggregate the mean. Store the result as yearly. 129 | Hit Submit Answer to examine the contents of yearly and yearly.corr(), which shows the Pearson correlation between the resampled 'Price' and 'mpg'. 130 | 131 | Solution:- 132 | # Merge auto and oil: merged 133 | merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date') 134 | 135 | # Print the tail of merged 136 | print(merged.tail()) 137 | 138 | # Resample merged: yearly 139 | yearly = merged.resample('A',on='Date')[['mpg','Price']].mean() 140 | 141 | # Print yearly 142 | print(yearly) 143 | 144 | # print yearly.corr() 145 | print(yearly.corr()) 146 | -------------------------------------------------------------------------------- /Python/Network Analysis in Python (Part 1)/Introduction to networks: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import matplotlib.pyplot as plt and networkx as nx. 3 | Draw T_sub to the screen by using the nx.draw() function, and don't forget to also use plt.show() to display it. 4 | 5 | Solution:- 6 | # Import necessary modules 7 | import matplotlib.pyplot as plt 8 | import networkx as nx 9 | 10 | 11 | # Draw the graph to screen 12 | nx.draw(T_sub) 13 | plt.show() 14 | 15 | Q2:- 16 | Use a list comprehension to get a list of nodes from the graph T that have the 'occupation' label of 'scientist'. 17 | The output expression n has been specified for you, along with the iterator variables n and d. Your task is to fill in the iterable and the conditional expression. 18 | Use the .nodes() method of T access its nodes, and be sure to specify data=True to obtain the metadata for the nodes. 19 | The iterator variable d is a dictionary. The key of interest here is 'occupation' and value of interest is 'scientist'. 20 | Use a list comprehension to get a list of edges from the graph T that were formed for at least 6 years, i.e., from before 1 Jan 2010. 21 | Your task once again is to fill in the iterable and conditional expression. 22 | Use the .edges() method of T to access its edges. Be sure to obtain the metadata for the edges as well. 23 | The dates are stored as datetime.date objects in the metadata dictionary d, under the key 'date'. To access the date 1 Jan 2009, for example, the dictionary value would be date(2009, 1, 1). 
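Note: the graph T is loaded by the course environment. To try the node and edge comprehensions below outside that environment, a tiny stand-in graph with the same metadata keys can be built like this (toy data, not the actual Twitter network):

import networkx as nx
from datetime import date

# Toy stand-in for the course's Twitter graph T
T = nx.Graph()
T.add_node(1, occupation='scientist')
T.add_node(2, occupation='politician')
T.add_node(3, occupation='scientist')
T.add_edge(1, 2, date=date(2008, 5, 17))
T.add_edge(2, 3, date=date(2012, 1, 2))

# The comprehensions from the solution that follows work unchanged on this graph
noi = [n for n, d in T.nodes(data=True) if d['occupation'] == 'scientist']
eoi = [(u, v) for u, v, d in T.edges(data=True) if d['date'] < date(2010, 1, 1)]
print(noi, eoi)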
24 | 25 | Solution:- 26 | # Use a list comprehension to get the nodes of interest: noi 27 | noi = [n for n, d in T.nodes(data=True) if d['occupation'] == 'scientist'] 28 | 29 | # Use a list comprehension to get the edges of interest: eoi 30 | eoi = [(u, v) for u, v, d in T.edges(data=True) if d['date'] < date(2010,1,1)] 31 | 32 | Q3:- 33 | Set the 'weight' attribute of the edge between node 1 and 10 of T to be equal to 2. Refer to the following template to set an attribute of an edge: network_name.edges[node1, node2]['attribute'] = value. Here, the 'attribute' is 'weight'. 34 | Set the weight of every edge involving node 293 to be equal to 1.1. To do this: 35 | Using a for loop, iterate over all the edges of T, including the metadata. 36 | If 293 is involved in the list of nodes [u, v]: 37 | Set the weight of the edge between u and v to be 1.1. 38 | 39 | Solution:- 40 | # Set the weight of the edge 41 | T.edges[1,10]['weight'] = 2 42 | 43 | # Iterate over all the edges (with metadata) 44 | for u, v, d in T.edges(data=True): 45 | 46 | # Check if node 293 is involved 47 | if 293 in [u,v]: 48 | 49 | # Set the weight to 1.1 50 | T.edges[u,v]['weight'] = 1.1 51 | 52 | Q4:- 53 | Define a function called find_selfloop_nodes() which takes one argument: G. 54 | Using a for loop, iterate over all the edges in G (excluding the metadata). 55 | If node u is equal to node v: 56 | Append u to the list nodes_in_selfloops. 57 | Return the list nodes_in_selfloops. 58 | Check that the number of self loops in the graph equals the number of nodes in self loops. This has been done for you, so hit 'Submit Answer' to see the result! 59 | 60 | Solution:- 61 | # Define find_selfloop_nodes() 62 | def find_selfloop_nodes(T): 63 | """ 64 | Finds all nodes that have self-loops in the graph G. 65 | """ 66 | nodes_in_selfloops = [] 67 | 68 | # Iterate over all the edges of G 69 | for u, v in T.edges(): 70 | 71 | # Check if node u and node v are the same 72 | if u==v: 73 | 74 | # Append node u to nodes_in_selfloops 75 | nodes_in_selfloops.append(u) 76 | 77 | return nodes_in_selfloops 78 | 79 | # Check whether number of self loops equals the number of nodes in self loops 80 | assert T.number_of_selfloops() == len(find_selfloop_nodes(T)) 81 | 82 | Q5:- 83 | Import nxviz as nv. 84 | Plot the graph T as a matrix plot. To do this: 85 | Create the MatrixPlot object called m using the nv.MatrixPlot() function with T passed in as an argument. 86 | Draw the m to the screen using the .draw() method. 87 | Display the plot using plt.show(). 88 | Convert the graph to a matrix format, and then convert the graph to back to the NetworkX form from the matrix as a directed graph. This has been done for you. 89 | Check that the category metadata field is lost from each node. This has also been done for you, so hit 'Submit Answer' to see the results! 90 | 91 | Solution:- 92 | # Import nxviz 93 | import nxviz as nv 94 | 95 | # Create the MatrixPlot object: m 96 | m = nv.MatrixPlot(T) 97 | 98 | # Draw m to the screen 99 | m.draw() 100 | 101 | # Display the plot 102 | plt.show() 103 | 104 | # Convert T to a matrix format: A 105 | A = nx.to_numpy_matrix(T) 106 | 107 | # Convert A back to the NetworkX form as a directed graph: T_conv 108 | T_conv = nx.from_numpy_matrix(A, create_using=nx.DiGraph()) 109 | 110 | # Check that the `category` metadata field is lost from each node 111 | for n, d in T_conv.nodes(data=True): 112 | assert 'category' not in d.keys() 113 | 114 | Q6:- 115 | Import CircosPlot from nxviz. 
116 | Plot the Twitter network T as a Circos plot without any styling. Use the CircosPlot() function to do this. Don't forget to draw it to the screen using .draw() and then display it using plt.show(). 117 | 118 | Solution:- 119 | # Import necessary modules 120 | import matplotlib.pyplot as plt 121 | import nxviz as nv 122 | from nxviz import CircosPlot 123 | 124 | # Create the CircosPlot object: c 125 | c = nv.CircosPlot(T) 126 | 127 | # Draw c to the screen 128 | c.draw() 129 | 130 | # Display the plot 131 | plt.show() 132 | 133 | Q7:- 134 | Import ArcPlot from nxviz. 135 | Create an un-customized ArcPlot of T. To do this, use the ArcPlot() function with just T as the argument. 136 | Create another ArcPlot of T in which the nodes are ordered and colored by the 'category' keyword. You'll have to specify the node_order and node_color parameters to do this. For both plots, be sure to draw them to the screen and display them with plt.show(). 137 | 138 | Solution:- 139 | # Import necessary modules 140 | import matplotlib.pyplot as plt 141 | import nxviz as nv 142 | from nxviz import ArcPlot 143 | 144 | # Create the un-customized ArcPlot object: a 145 | a = nv.ArcPlot(T) 146 | 147 | # Draw a to the screen 148 | a.draw() 149 | 150 | # Display the plot 151 | plt.show() 152 | 153 | # Create the customized ArcPlot object: a2 154 | a2 = nv.ArcPlot(T,node_order='category',node_color='category') 155 | 156 | # Draw a2 to the screen 157 | a2.draw() 158 | 159 | # Display the plot 160 | plt.show() 161 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part 1/Writing your own functions: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Complete the function header by adding the appropriate function name, shout. 3 | In the function body, concatenate the string, 'congratulations' with another string, '!!!'. Assign the result to shout_word. 4 | Print the value of shout_word. 5 | Call the shout function. 6 | 7 | Solution:- 8 | # Define the function shout 9 | def shout(): 10 | """Print a string with three exclamation marks""" 11 | # Concatenate the strings: shout_word 12 | shout_word = "congratulations" + "!!!" 13 | 14 | # Print shout_word 15 | print(shout_word) 16 | 17 | # Call shout 18 | shout() 19 | 20 | Q2:- 21 | Complete the function header by adding the parameter name, word. 22 | Assign the result of concatenating word with '!!!' to shout_word. 23 | Print the value of shout_word. 24 | Call the shout() function, passing to it the string, 'congratulations' 25 | 26 | Solution:- 27 | # Define shout with the parameter, word 28 | def shout(word): 29 | """Print a string with three exclamation marks""" 30 | # Concatenate the strings: shout_word 31 | shout_word = word + '!!!' 32 | 33 | # Print shout_word 34 | print(shout_word) 35 | 36 | # Call shout with the string 'congratulations' 37 | shout("congratulations") 38 | 39 | Q3:- 40 | In the function body, concatenate the string in word with '!!!' and assign to shout_word. 41 | Replace the print() statement with the appropriate return statement. 42 | Call the shout() function, passing to it the string, 'congratulations', and assigning the call to the variable, yell. 43 | To check if yell contains the value returned by shout(), print the value of yell. 
44 | 45 | Solution:- 46 | # Define shout with the parameter, word 47 | def shout(word): 48 | """Return a string with three exclamation marks""" 49 | # Concatenate the strings: shout_word 50 | shout_word = word + "!!!" 51 | 52 | # Replace print with return 53 | return shout_word 54 | 55 | # Pass 'congratulations' to shout: yell 56 | yell = shout("congratulations") 57 | 58 | # Print yell 59 | print(yell) 60 | 61 | Q4:- 62 | Modify the function header such that it accepts two parameters, word1 and word2, in that order. 63 | Concatenate each of word1 and word2 with '!!!' and assign to shout1 and shout2, respectively. 64 | Concatenate shout1 and shout2 together, in that order, and assign to new_shout. 65 | Pass the strings 'congratulations' and 'you', in that order, to a call to shout(). Assign the return value to yell. 66 | 67 | Solution:- 68 | # Define shout with parameters word1 and word2 69 | def shout(word1, word2): 70 | """Concatenate strings with three exclamation marks""" 71 | # Concatenate word1 with '!!!': shout1 72 | shout1 = word1 + "!!!" 73 | 74 | # Concatenate word2 with '!!!': shout2 75 | shout2 = word2 + "!!!" 76 | 77 | # Concatenate shout1 with shout2: new_shout 78 | new_shout = shout1 + shout2 79 | 80 | # Return new_shout 81 | return new_shout 82 | 83 | # Pass 'congratulations' and 'you' to shout(): yell 84 | yell = shout("congratulations","you") 85 | 86 | # Print yell 87 | print(yell) 88 | 89 | Q5:- 90 | Unpack nums to the variables num1, num2, and num3. 91 | Construct a new tuple, even_nums composed of the same elements in nums, but with the 1st element replaced with the value, 2. 92 | 93 | Solution:- 94 | # Unpack nums into num1, num2, and num3 95 | num1,num2,num3 = nums 96 | 97 | # Construct even_nums 98 | even_nums = (2, num2, num3) 99 | 100 | Q6:- 101 | Modify the function header such that the function name is now shout_all, and it accepts two parameters, word1 and word2, in that order. 102 | Concatenate the string '!!!' to each of word1 and word2 and assign to shout1 and shout2, respectively. 103 | Construct a tuple shout_words, composed of shout1 and shout2. 104 | Call shout_all() with the strings 'congratulations' and 'you' and assign the result to yell1 and yell2 (remember, shout_all returns 2 variables!). 105 | 106 | Solution:- 107 | # Define shout_all with parameters word1 and word2 108 | def shout_all(word1, word2): 109 | 110 | # Concatenate word1 with '!!!': shout1 111 | shout1 = word1 + "!!!" 112 | 113 | # Concatenate word2 with '!!!': shout2 114 | shout2 = word2 + "!!!" 115 | 116 | # Construct a tuple with shout1 and shout2: shout_words 117 | shout_words = (shout1,shout2) 118 | 119 | # Return shout_words 120 | return shout_words 121 | 122 | # Pass 'congratulations' and 'you' to shout_all(): yell1, yell2 123 | yell1, yell2 = shout_all("congratulations","you") 124 | 125 | # Print yell1 and yell2 126 | print(yell1) 127 | print(yell2) 128 | 129 | Q7:- 130 | Import the pandas package with the alias pd. 131 | Import the file 'tweets.csv' using the pandas function read_csv(). Assign the resulting DataFrame to df. 132 | Complete the for loop by iterating over col, the 'lang' column in the DataFrame df. 133 | Complete the bodies of the if-else statements in the for loop: if the key is in the dictionary langs_count, add 1 to its current value, else add the key to langs_count and set its value to 1. 134 | Use the loop variable entry in your code. 
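Aside (not the exercise's intended answer, which is the explicit dictionary loop): the same per-language tally can be produced directly with collections.Counter or pandas' value_counts(); a minimal sketch, assuming df has been read from tweets.csv as above:

from collections import Counter

# One-line equivalents of the manual counting loop
langs_count = Counter(df['lang'])    # dict-like counts per language
print(langs_count)
print(df['lang'].value_counts())     # same counts as a pandas Series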
135 | 136 | Solution:- 137 | # Import pandas 138 | import pandas as pd 139 | 140 | # Import Twitter data as DataFrame: df 141 | df = pd.read_csv("tweets.csv") 142 | 143 | # Initialize an empty dictionary: langs_count 144 | langs_count = {} 145 | 146 | # Extract column from DataFrame: col 147 | col = df['lang'] 148 | 149 | # Iterate over lang column in DataFrame 150 | for entry in col: 151 | 152 | # If the language is in langs_count, add 1 153 | if entry in langs_count.keys(): 154 | langs_count[entry] +=1 155 | # Else add the language to langs_count, set the value to 1 156 | else: 157 | langs_count[entry] = 1 158 | 159 | # Print the populated dictionary 160 | print(langs_count) 161 | 162 | Q8:- 163 | Define the function count_entries(), which has two parameters. The first parameter is df for the DataFrame and the second is col_name for the column name. 164 | Complete the bodies of the if-else statements in the for loop: if the key is in the dictionary langs_count, add 1 to its current value, else add the key to langs_count and set its value to 1. Use the loop variable entry in your code. 165 | Return the langs_count dictionary from inside the count_entries() function. 166 | Call the count_entries() function by passing to it tweets_df and the name of the column, 'lang'. Assign the result of the call to the variable result. 167 | 168 | Solution:- 169 | # Define count_entries() 170 | def count_entries(df, col_name): 171 | """Return a dictionary with counts of 172 | occurrences as value for each key.""" 173 | 174 | # Initialize an empty dictionary: langs_count 175 | langs_count = {} 176 | 177 | # Extract column from DataFrame: col 178 | col = df[col_name] 179 | 180 | # Iterate over lang column in DataFrame 181 | for entry in col: 182 | 183 | # If the language is in langs_count, add 1 184 | if entry in langs_count.keys(): 185 | langs_count[entry] +=1 186 | # Else add the language to langs_count, set the value to 1 187 | else: 188 | langs_count[entry] = 1 189 | 190 | # Return the langs_count dictionary 191 | return langs_count 192 | 193 | # Call count_entries(): result 194 | result = count_entries(tweets_df,"lang") 195 | 196 | # Print the result 197 | print(result) 198 | 199 | 200 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part 2/List comprehensions and generators: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Using the range of numbers from 0 to 9 as your iterable and i as your iterator variable, write a list comprehension that produces a list of numbers consisting of the squared values of i. 3 | 4 | Solution:- 5 | # Create list comprehension: squares 6 | squares = [i*i for i in range(0,10)] 7 | 8 | Q2:- 9 | In the inner list comprehension - that is, the output expression of the nested list comprehension - create a list of values from 0 to 4 using range(). Use col as the iterator variable. 10 | In the iterable part of your nested list comprehension, use range() to count 5 rows - that is, create a list of values from 0 to 4. 11 | Use row as the iterator variable; note that you won't be needing this to create values in the list of lists. 12 | 13 | Solution:- 14 | # Create a 5 x 5 matrix using a list of lists: matrix 15 | matrix = [[col for col in range(0,5)] for row in range(0,5)] 16 | 17 | # Print the matrix 18 | for row in matrix: 19 | print(row) 20 | 21 | Q3:- 22 | Use member as the iterator variable in the list comprehension. 
For the conditional, use len() to evaluate the iterator variable. 23 | Note that you only want strings with 7 characters or more. 24 | 25 | Solution:- 26 | # Create a list of strings: fellowship 27 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 28 | 29 | # Create list comprehension: new_fellowship 30 | new_fellowship = [member for member in fellowship if len(member) >= 7] 31 | 32 | # Print the new list 33 | print(new_fellowship) 34 | 35 | Q4:- 36 | In the output expression, keep the string as-is if the number of characters is >= 7, else replace it with an empty string - that is, '' or "". 37 | 38 | Solution:- 39 | # Create a list of strings: fellowship 40 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 41 | 42 | # Create list comprehension: new_fellowship 43 | new_fellowship = [member if len(member) >= 7 else "" for member in fellowship] 44 | 45 | # Print the new list 46 | print(new_fellowship) 47 | 48 | Q5:- 49 | Create a dict comprehension where the key is a string in fellowship and the value is the length of the string. 50 | Remember to use the syntax key:value in the output expression part of the comprehension to create the members of the dictionary. 51 | Use member as the iterator variable. 52 | 53 | Solution:- 54 | # Create a list of strings: fellowship 55 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 56 | 57 | # Create dict comprehension: new_fellowship 58 | new_fellowship = {member:len(member) for member in fellowship} 59 | 60 | # Print the new list 61 | print(new_fellowship) 62 | 63 | Q6:- 64 | Create a generator object that will produce values from 0 to 30. Assign the result to result and use num as the iterator variable in the generator expression. 65 | Print the first 5 values by using next() appropriately in print(). 66 | Print the rest of the values by using a for loop to iterate over the generator object. 67 | 68 | Solution:- 69 | # Create generator object: result 70 | result = (num for num in range(0,31)) 71 | 72 | # Print the first 5 values 73 | print(next(result)) 74 | print(next(result)) 75 | print(next(result)) 76 | print(next(result)) 77 | print(next(result)) 78 | 79 | # Print the rest of the values 80 | for value in result: 81 | print(value) 82 | 83 | Q7:- 84 | Write a generator expression that will generate the lengths of each string in lannister. Use person as the iterator variable. Assign the result to lengths. 85 | Supply the correct iterable in the for loop for printing the values in the generator object. 86 | 87 | Solution:- 88 | # Create a list of strings: lannister 89 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 90 | 91 | # Create a generator object: lengths 92 | lengths = (len(person) for person in lannister ) 93 | 94 | # Iterate over and print the values in lengths 95 | for value in lengths: 96 | print(value) 97 | 98 | Q8:- 99 | Complete the function header for the function get_lengths() that has a single parameter, input_list. 100 | In the for loop in the function definition, yield the length of the strings in input_list. 101 | Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. 102 | Supply the call to get_lengths(), passing in the list lannister. 
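Aside: the reason this exercise asks for a generator rather than a list is memory. A generator yields one value at a time instead of materialising every value up front, which is the same trade-off that separates generator expressions from list comprehensions. A quick, self-contained illustration (sizes are CPython-specific and machine-dependent):

import sys

nums_list = [n * n for n in range(100000)]   # built eagerly, all values stored at once
nums_gen = (n * n for n in range(100000))    # built lazily, values produced on demand

print(sys.getsizeof(nums_list))   # roughly 800 KB on CPython
print(sys.getsizeof(nums_gen))    # on the order of 100 bytes, independent of range size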
103 | 104 | Solution:- 105 | # Define generator function get_lengths 106 | def get_lengths(input_list): 107 | """Generator function that yields the 108 | length of the strings in input_list.""" 109 | 110 | # Yield the length of a string 111 | for person in input_list: 112 | yield len(person) 113 | 114 | # Print the values generated by get_lengths() 115 | for value in get_lengths(lannister): 116 | print(value) 117 | 118 | Q9:- 119 | Extract the column 'created_at' from df and assign the result to tweet_time. Fun fact: the extracted column in tweet_time here is a Series data structure! 120 | Create a list comprehension that extracts the time from each row in tweet_time. Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. 121 | Use entry as the iterator variable and assign the result to tweet_clock_time. Remember that Python uses 0-based indexing! 122 | 123 | Solution:- 124 | # Extract the created_at column from df: tweet_time 125 | tweet_time = df['created_at'] 126 | 127 | # Extract the clock time: tweet_clock_time 128 | tweet_clock_time = [entry[11:19] for entry in tweet_time] 129 | 130 | # Print the extracted times 131 | print(tweet_clock_time) 132 | 133 | Q10:- 134 | Extract the column 'created_at' from df and assign the result to tweet_time. 135 | Create a list comprehension that extracts the time from each row in tweet_time. 136 | Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. 137 | Use entry as the iterator variable and assign the result to tweet_clock_time. 138 | 139 | Solution:- 140 | # Extract the created_at column from df: tweet_time 141 | tweet_time = df['created_at'] 142 | 143 | # Extract the clock time: tweet_clock_time 144 | tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19'] 145 | 146 | # Print the extracted times 147 | print(tweet_clock_time) 148 | 149 | 150 | Additionally, add a conditional expression that checks whether entry[17:19] is equal to '19'. 151 | 152 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part/Case Study: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a zip object by calling zip() and passing to it feature_names and row_vals. Assign the result to zipped_lists. 3 | Create a dictionary from the zipped_lists zip object by calling dict() with zipped_lists. Assign the resulting dictionary to rs_dict. 4 | 5 | Solution:- 6 | # Zip lists: zipped_lists 7 | zipped_lists = zip(feature_names,row_vals) 8 | 9 | # Create a dictionary: rs_dict 10 | rs_dict = dict(zipped_lists) 11 | 12 | # Print the dictionary 13 | print(rs_dict) 14 | 15 | Q2:- 16 | Define the function lists2dict() with two parameters: first is list1 and second is list2. 17 | Return the resulting dictionary rs_dict in lists2dict(). 18 | Call the lists2dict() function with the arguments feature_names and row_vals. Assign the result of the function call to rs_fxn. 
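Aside: in Python 3, zip() returns a lazy iterator that is exhausted after a single pass, so the zipped_lists object from Q1 can only be turned into a dictionary once. A quick illustration with made-up values (the real feature_names and row_vals come from the course's dataset):

feature_names = ['CountryName', 'CountryCode']   # illustrative values only
row_vals = ['Arab World', 'ARB']

zipped_lists = zip(feature_names, row_vals)
print(dict(zipped_lists))   # {'CountryName': 'Arab World', 'CountryCode': 'ARB'}
print(dict(zipped_lists))   # {} -- the iterator is already exhausted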
19 | 20 | Solution:- 21 | # Define lists2dict() 22 | def lists2dict(list1, list2): 23 | """Return a dictionary where list1 provides 24 | the keys and list2 provides the values.""" 25 | 26 | # Zip lists: zipped_lists 27 | zipped_lists = zip(list1, list2) 28 | 29 | # Create a dictionary: rs_dict 30 | rs_dict = dict(zipped_lists) 31 | 32 | # Return the dictionary 33 | return rs_dict 34 | 35 | # Call lists2dict: rs_fxn 36 | rs_fxn = lists2dict(feature_names,row_vals) 37 | 38 | # Print rs_fxn 39 | print(rs_fxn) 40 | 41 | Q3:- 42 | Inspect the contents of row_lists by printing the first two lists in row_lists. 43 | Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts. 44 | Look at the first two dictionaries in list_of_dicts by printing them out. 45 | 46 | Solution:- 47 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python (Part 2)/Hypothesis test examples: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Construct Boolean arrays, dems and reps that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries. 3 | Write a function, frac_yea_dems(dems, reps) that returns the fraction of Democrats that voted yea. The first input is an array of Booleans, Two inputs are required to use your draw_perm_reps() function, but the second is not used. 4 | Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yea votes. 5 | Compute and print the p-value. 6 | 7 | Solution:- 8 | # Construct arrays of data: dems, reps 9 | dems = np.array([True] * 153 + [False] * 91) 10 | reps = np.array([True]* 136 + [False]*35) 11 | 12 | def frac_yea_dems(dems, reps): 13 | """Compute fraction of Democrat yea votes.""" 14 | frac = np.sum(dems) / len(dems) 15 | return frac 16 | 17 | # Acquire permutation samples: perm_replicates 18 | perm_replicates = draw_perm_reps(dems, reps, frac_yea_dems, size=10000) 19 | 20 | # Compute and print p-value: p 21 | p = np.sum(perm_replicates <= 153/244) / len(perm_replicates) 22 | print('p-value =', p) 23 | 24 | Q2:- 25 | Compute the observed difference in mean inter-nohitter time using diff_of_means(). 26 | Generate 10,000 permutation replicates of the difference of means using draw_perm_reps(). 27 | Compute and print the p-value. 28 | 29 | Solution:- 30 | # Compute the observed difference in mean inter-no-hitter times: nht_diff_obs 31 | nht_diff_obs = diff_of_means(nht_dead,nht_live) 32 | 33 | # Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates 34 | perm_replicates = draw_perm_reps(nht_dead,nht_live,diff_of_means,size=10000) 35 | 36 | 37 | # Compute and print the p-value: p 38 | p = np.sum(perm_replicates <= nht_diff_obs)/len(perm_replicates) 39 | print('p-val =', p) 40 | 41 | Q3:- 42 | Compute the observed Pearson correlation between illiteracy and fertility. 43 | Initialize an array to store your permutation replicates. 44 | Write a for loop to draw 10,000 replicates: 45 | Permute the illiteracy measurements using np.random.permutation(). 46 | Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility. 47 | Compute and print the p-value from the replicates. 
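Note: pearson_r(), diff_of_means(), draw_perm_reps(), ecdf() and draw_bs_reps() come from earlier chapters of the course and are only called in this file. Plausible sketches of these helpers, for context (illustrative, not the course's exact code):

import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    return np.mean(data_1) - np.mean(data_2)

def draw_perm_reps(data_1, data_2, func, size=1):
    """Draw `size` permutation replicates of func(perm_1, perm_2)."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Permute the pooled data and split it back into two samples
        permuted = np.random.permutation(np.concatenate((data_1, data_2)))
        perm_replicates[i] = func(permuted[:len(data_1)], permuted[len(data_1):])
    return perm_replicates

def ecdf(data):
    """x, y values of the empirical CDF of a 1-D array."""
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of func applied to resampled data."""
    return np.array([func(np.random.choice(data, size=len(data)))
                     for _ in range(size)])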
48 | 49 | Solution:- 50 | # Compute observed correlation: r_obs 51 | r_obs = pearson_r(illiteracy,fertility) 52 | 53 | # Initialize permutation replicates: perm_replicates 54 | perm_replicates = np.empty(10000) 55 | 56 | # Draw replicates 57 | for i in range(10000): 58 | # Permute illiteracy measurments: illiteracy_permuted 59 | illiteracy_permuted = np.random.permutation(illiteracy) 60 | 61 | # Compute Pearson correlation 62 | perm_replicates[i] = pearson_r(illiteracy_permuted,fertility) 63 | 64 | # Compute p-value: p 65 | p = np.sum(perm_replicates >= 1)/len(perm_replicates) 66 | print('p-val =', p) 67 | 68 | Q4:- 69 | Use your ecdf() function to generate x,y values from the control and treated arrays for plotting the ECDFs. 70 | Plot the ECDFs on the same plot. 71 | The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result! 72 | 73 | Solution:- 74 | # Compute x,y values for ECDFs 75 | x_control, y_control = ecdf(control) 76 | x_treated, y_treated = ecdf(treated) 77 | 78 | # Plot the ECDFs 79 | plt.plot(x_control, y_control, marker='.', linestyle='none') 80 | plt.plot(x_treated, y_treated, marker='.', linestyle='none') 81 | 82 | # Set the margins 83 | plt.margins(0.02) 84 | 85 | # Add a legend 86 | plt.legend(('control', 'treated'), loc='lower right') 87 | 88 | # Label axes and show plot 89 | plt.xlabel('millions of alive sperm per mL') 90 | plt.ylabel('ECDF') 91 | plt.show() 92 | 93 | Q5:- 94 | Compute the mean alive sperm count of control minus that of treated. 95 | Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array. 96 | Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you. 97 | Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your draw_bs_reps() function. 98 | Compute the bootstrap replicates of the difference of means. 99 | The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result! 100 | 101 | Solution:- 102 | # Compute the difference in mean sperm count: diff_means 103 | diff_means = np.mean(control) - np.mean(treated) 104 | 105 | # Compute mean of pooled data: mean_count 106 | mean_count = np.mean(np.concatenate((control,treated))) 107 | 108 | # Generate shifted data sets 109 | control_shifted = control - np.mean(control) + mean_count 110 | treated_shifted = treated - np.mean(treated) + mean_count 111 | 112 | # Generate bootstrap replicates 113 | bs_reps_control = draw_bs_reps(control_shifted, 114 | np.mean, size=10000) 115 | bs_reps_treated = draw_bs_reps(treated_shifted, 116 | np.mean, size=10000) 117 | 118 | # Get replicates of difference of means: bs_replicates 119 | bs_replicates = bs_reps_control- bs_reps_treated 120 | 121 | # Compute and print p-value: p 122 | p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \ 123 | / len(bs_replicates) 124 | print('p-value =', p) 125 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python (Part 2)/Parameter estimation by optimization: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Seed the random number generator with 42. 3 | Compute the mean time (in units of number of games) between no-hitters. 
4 | Draw 100,000 samples from an Exponential distribution with the parameter you computed from the mean of the inter-no-hitter times. 5 | Plot the theoretical PDF using plt.hist(). Remember to use keyword arguments bins=50, normed=True, and histtype='step'. Be sure to label your axes. 6 | Show your plot. 7 | 8 | Solution:- 9 | # Seed random number generator 10 | np.random.seed(42) 11 | 12 | # Compute mean no-hitter time: tau 13 | tau = np.mean(nohitter_times) 14 | 15 | # Draw out of an exponential distribution with parameter tau: inter_nohitter_time 16 | inter_nohitter_time = np.random.exponential(tau, 100000) 17 | 18 | # Plot the PDF and label axes 19 | _ = plt.hist(inter_nohitter_time, 20 | bins=50, normed=True, histtype='step') 21 | _ = plt.xlabel('Games between no-hitters') 22 | _ = plt.ylabel('PDF') 23 | 24 | # Show the plot 25 | plt.show() 26 | 27 | Q2:- 28 | # Create an ECDF from real data: x, y 29 | x, y = ecdf(nohitter_times) 30 | 31 | # Create a CDF from theoretical samples: x_theor, y_theor 32 | x_theor, y_theor = ecdf(inter_nohitter_time) 33 | 34 | # Overlay the plots 35 | plt.plot(x_theor, y_theor) 36 | plt.plot(x, y, marker='.', linestyle='none') 37 | 38 | # Margins and axis labels 39 | plt.margins(0.02) 40 | plt.xlabel('Games between no-hitters') 41 | plt.ylabel('CDF') 42 | 43 | # Show the plot 44 | plt.show() 45 | 46 | Q3:- 47 | Take 10000 samples out of an Exponential distribution with parameter τ1/2 = tau/2. 48 | Take 10000 samples out of an Exponential distribution with parameter τ2 = 2*tau. 49 | Generate CDFs from these two sets of samples using your ecdf() function. 50 | Add these two CDFs as lines to your plot. This has been done for you, so hit 'Submit Answer' to view the plot! 51 | 52 | Solution:- 53 | # Plot the theoretical CDFs 54 | plt.plot(x_theor, y_theor) 55 | plt.plot(x, y, marker='.', linestyle='none') 56 | plt.margins(0.02) 57 | plt.xlabel('Games between no-hitters') 58 | plt.ylabel('CDF') 59 | 60 | # Take samples with half tau: samples_half 61 | samples_half = np.random.exponential(tau/2,10000) 62 | 63 | # Take samples with double tau: samples_double 64 | samples_double = np.random.exponential(2*tau,10000) 65 | 66 | # Generate CDFs from these samples 67 | x_half, y_half = ecdf(samples_half) 68 | x_double, y_double = ecdf(samples_double) 69 | 70 | # Plot these CDFs as lines 71 | _ = plt.plot(x_half, y_half) 72 | _ = plt.plot(x_double, y_double) 73 | 74 | # Show the plot 75 | plt.show() 76 | 77 | Q4:- 78 | Plot fertility (y-axis) versus illiteracy (x-axis) as a scatter plot. 79 | Set a 2% margin. 80 | Compute and print the Pearson correlation coefficient between illiteracy and fertility. 81 | 82 | Solution:- 83 | # Plot the illiteracy rate versus fertility 84 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none') 85 | 86 | # Set the margins and label axes 87 | plt.margins(0.02) 88 | _ = plt.xlabel('percent illiterate') 89 | _ = plt.ylabel('fertility') 90 | 91 | # Show the plot 92 | plt.show() 93 | 94 | # Show the Pearson correlation coefficient 95 | print(pearson_r(illiteracy, fertility)) 96 | 97 | Q5:- 98 | Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis. 99 | Print out the slope and intercept from the linear regression. 100 | To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters. I.e., y = a * x + b. 
101 | Plot the data and the regression line on the same plot. Be sure to label your axes. 102 | Hit 'Submit Answer' to display your plot. 103 | 104 | Solution:- 105 | # Plot the illiteracy rate versus fertility 106 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none') 107 | plt.margins(0.02) 108 | _ = plt.xlabel('percent illiterate') 109 | _ = plt.ylabel('fertility') 110 | 111 | # Perform a linear regression using np.polyfit(): a, b 112 | a, b = np.polyfit(illiteracy,fertility,1) 113 | 114 | # Print the results to the screen 115 | print('slope =', a, 'children per woman / percent illiterate') 116 | print('intercept =', b, 'children per woman') 117 | 118 | # Make theoretical line to plot 119 | x = np.array([0,100]) 120 | y = a * x + b 121 | 122 | # Add regression line to your plot 123 | _ = plt.plot(x, y) 124 | 125 | # Draw the plot 126 | plt.show() 127 | 128 | Q6:- 129 | Specify the values of the slope to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100). 130 | Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals). 131 | Write a for loop to compute the sum of RSS of the slope. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data. 132 | Plot the RSS (rss) versus slope (a_vals). 133 | 134 | Solution:- 135 | # Specify slopes to consider: a_vals 136 | a_vals = np.linspace(0,0.1,200) 137 | 138 | # Initialize sum of square of residuals: rss 139 | rss = np.empty_like(a_vals) 140 | 141 | # Compute sum of square of residuals for each value of a_vals 142 | for i, a in enumerate(a_vals): 143 | rss[i] = np.sum((fertility - a*illiteracy - b)**2) 144 | 145 | # Plot the RSS 146 | plt.plot(a_vals, rss, '-') 147 | plt.xlabel('slope (children per woman / percent illiterate)') 148 | plt.ylabel('sum of square of residuals') 149 | 150 | plt.show() 151 | 152 | Q7:- 153 | Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y. 154 | Print the slope a and intercept b. 155 | Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept. 156 | Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when to plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line. 
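Background for the np.polyfit() exercises in this file: for a straight line, the slope and intercept that minimize the RSS defined above have a closed form (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)), and np.polyfit(x, y, 1) returns the same pair. A quick standalone check with made-up data:

import numpy as np

x = np.array([0., 1., 2., 3., 4.])          # made-up data for illustration
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least-squares estimates (matching ddof so the n-1 factors cancel)
a_closed = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_closed = np.mean(y) - a_closed * np.mean(x)

# np.polyfit() should agree up to floating-point error
a_fit, b_fit = np.polyfit(x, y, 1)
print(a_closed, b_closed)
print(a_fit, b_fit)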
157 | 158 | Solution:- 159 | # Perform linear regression: a, b 160 | a, b = np.polyfit(x,y,1) 161 | 162 | # Print the slope and intercept 163 | print(a, b) 164 | 165 | # Generate theoretical x and y data: x_theor, y_theor 166 | x_theor = np.array([3, 15]) 167 | y_theor = a * x_theor + b 168 | 169 | # Plot the Anscombe data and theoretical line 170 | _ = plt.plot(x,y,marker='.',linestyle='none') 171 | _ = plt.plot(x_theor,y_theor) 172 | 173 | # Label the axes 174 | plt.xlabel('x') 175 | plt.ylabel('y') 176 | 177 | # Show the plot 178 | plt.show() 179 | 180 | Q7:- 181 | Write a for loop to do the following for each Anscombe data set. 182 | Compute the slope and intercept. 183 | Print the slope and intercept. 184 | 185 | Solution:- 186 | # Iterate through x,y pairs 187 | for x, y in zip(anscombe_x , anscombe_y ): 188 | # Compute the slope and intercept: a, b 189 | a, b = np.polyfit(x,y,1) 190 | 191 | # Print the result 192 | print('slope:', a, 'intercept:', b) 193 | 194 | 195 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Graphical exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns). 3 | Use seaborn to set the plotting defaults. 4 | Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length. 5 | Show the histogram using plt.show(). 6 | 7 | Solution:- 8 | # Import plotting modules 9 | import matplotlib.pyplot as plt 10 | import seaborn as sns 11 | 12 | 13 | # Set default Seaborn style 14 | sns.set() 15 | 16 | # Plot histogram of versicolor petal lengths 17 | plt.hist(versicolor_petal_length) 18 | 19 | # Show histogram 20 | plt.show() 21 | 22 | Q2:- 23 | Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential! 24 | Display the plot constructed in the above steps using plt.show(). 25 | 26 | Solution:- 27 | # Plot histogram of versicolor petal lengths 28 | _ = plt.hist(versicolor_petal_length) 29 | 30 | # Label axes 31 | plt.xlabel('petal length (cm)') 32 | plt.ylabel('count') 33 | 34 | # Show histogram 35 | plt.show() 36 | 37 | Q3:- 38 | Import numpy as np. This gives access to the square root function, np.sqrt(). 39 | Determine how many data points you have using len(). 40 | Compute the number of bins using the square root rule. 41 | Convert the number of bins to an integer using the built in int() function. 42 | Generate the histogram and make sure to use the bins keyword argument. 43 | Hit 'Submit Answer' to plot the figure and see the fruit of your labors! 44 | 45 | Solution:- 46 | # Import numpy 47 | import numpy as np 48 | 49 | # Compute number of data points: n_data 50 | n_data = len(versicolor_petal_length) 51 | 52 | # Number of bins is the square root of number of data points: n_bins 53 | n_bins = np.sqrt(n_data) 54 | 55 | # Convert number of bins to integer: n_bins 56 | n_bins = int(n_bins) 57 | 58 | # Plot the histogram 59 | plt.hist(versicolor_petal_length, bins= n_bins) 60 | 61 | # Label axes 62 | _ = plt.xlabel('petal length (cm)') 63 | _ = plt.ylabel('count') 64 | 65 | # Show histogram 66 | plt.show() 67 | 68 | Q4:- 69 | In the IPython Shell, inspect the DataFrame df using df.head(). 
This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot(). 70 | Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths. 71 | Label the axes. 72 | Show your plot. 73 | 74 | Solution:- 75 | # Create bee swarm plot with Seaborn's default settings 76 | df.head() 77 | 78 | # Label the axes 79 | sns.swarmplot(x = 'species', y = 'petal length (cm)' , data = df) 80 | _ = plt.xlabel('species') 81 | _ = plt.ylabel('petal length (cm)') 82 | # Show the plot 83 | plt.show() 84 | 85 | Q5:- 86 | Define a function with the signature ecdf(data). Within the function definition, 87 | Compute the number of data points, n, using the len() function. 88 | The x-values are the sorted data. Use the np.sort() function to perform the sorting. 89 | The y data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n. 90 | The function returns the values x and y. 91 | 92 | Solution:- 93 | def ecdf(data): 94 | """Compute ECDF for a one-dimensional array of measurements.""" 95 | # Number of data points: n 96 | n = len(data) 97 | 98 | # x-data for the ECDF: x 99 | x = np.sort(data) 100 | 101 | # y-data for the ECDF: y 102 | y = np.arange(1, n+1) / n 103 | 104 | return x, y 105 | 106 | Q6:- 107 | Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers. 108 | Plot the ECDF as dots. Remember to include marker = '.' and linestyle = 'none' in addition to x_vers and y_vers as arguments inside plt.plot(). 109 | Label the axes. You can label the y-axis 'ECDF'. 110 | Show your plot 111 | 112 | Solution:- 113 | # Compute ECDF for versicolor data: x_vers, y_vers 114 | x_vers, y_vers = ecdf(versicolor_petal_length) 115 | 116 | # Generate plot 117 | _ = plt.plot(x_vers, y_vers,marker='.',linestyle='none') 118 | 119 | # Label the axes 120 | _ = plt.xlabel('length') 121 | _ = plt.ylabel('ECDF') 122 | 123 | 124 | # Display the plot 125 | plt.show() 126 | 127 | Q7:- 128 | Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively. 129 | Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _. 130 | A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs! 
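A quick standalone check of the ecdf() helper defined in Q5, run on a tiny made-up array (assuming NumPy is imported as np and ecdf() is defined as above):

import numpy as np

data = np.array([3.0, 1.0, 2.0])   # made-up values
x, y = ecdf(data)
print(x)   # [1. 2. 3.]                 the sorted data
print(y)   # [0.333... 0.666... 1.0]    i.e. 1/n, 2/n, ..., n/n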
131 | 132 | Solution:- 133 | # Compute ECDFs 134 | # Compute ECDFs 135 | x_set, y_set = ecdf(setosa_petal_length) 136 | x_vers, y_vers = ecdf(versicolor_petal_length) 137 | x_virg, y_virg = ecdf(virginica_petal_length) 138 | 139 | # Plot all ECDFs on the same plot 140 | _ = plt.plot(x_set, y_set, marker = '.', linestyle = 'none') 141 | _ = plt.plot(x_vers, y_vers, marker = '.', linestyle = 'none') 142 | _ = plt.plot(x_virg, y_virg, marker = '.', linestyle = 'none') 143 | 144 | 145 | # Annotate the plot 146 | plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right') 147 | _ = plt.xlabel('petal length (cm)') 148 | _ = plt.ylabel('ECDF') 149 | 150 | # Display the plot 151 | plt.show() 152 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Quantitative exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Compute the mean petal length of Iris versicolor from Anderson's classic data set. The variable versicolor_petal_length is provided in your namespace. Assign the mean to mean_length_vers. 3 | 4 | Solution:- 5 | # Compute the mean: mean_length_vers 6 | 7 | mean_length_vers = versicolor_petal_length.mean() 8 | # Print the result with some nice formatting 9 | print('I. versicolor:', mean_length_vers, 'cm') 10 | 11 | Q2:- 12 | Create percentiles, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles. 13 | Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace. 14 | Print the percentiles. 15 | 16 | Solution:- 17 | # Specify array of percentiles: percentiles 18 | percentiles = np.array([2.5,25,50,75,97.5]) 19 | 20 | # Compute percentiles: ptiles_vers 21 | ptiles_vers = np.percentile(versicolor_petal_length,percentiles) 22 | 23 | # Print the result 24 | print(ptiles_vers) 25 | 26 | Q3:- 27 | Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates - ptiles_vers and percentiles/100 - as positional arguments and specify the marker='D', color='red' and linestyle='none' keyword arguments. The argument for the y-axis - percentiles/100 has been specified for you. 28 | 29 | Solution:- 30 | # Plot the ECDF 31 | _ = plt.plot(x_vers, y_vers, '.') 32 | _ = plt.xlabel('petal length (cm)') 33 | _ = plt.ylabel('ECDF') 34 | 35 | # Overlay percentiles as red diamonds. 36 | _ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red', 37 | linestyle='none') 38 | 39 | # Show the plot 40 | plt.show() 41 | 42 | Q4:- 43 | The set-up is exactly the same as for the bee swarm plot; you just call sns.boxplot() with the same keyword arguments as you would sns.swarmplot(). The x-axis is 'species' and y-axis is 'petal length (cm)'. 44 | Don't forget to label your axes! 45 | Display the figure using the normal call. 
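Aside for the box plot below: the box edges are the 25th and 75th percentiles and the centre line is the median, the same quantities computed with np.percentile() in the previous exercises. A quick numeric check on made-up data:

import numpy as np

data = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])   # made-up values
print(np.percentile(data, [25, 50, 75]))   # [3. 5. 7.]
print(np.median(data))                     # 5.0, identical to the 50th percentile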
46 | 47 | Solution:- 48 | # Create box plot with Seaborn's default settings 49 | _ = sns.boxplot(x='species', y='petal length (cm)', data=df) 50 | 51 | # Label the axes 52 | _ = plt.xlabel('species') 53 | 54 | _ = plt.ylabel('petal length (cm)') 55 | 56 | 57 | # Show the plot 58 | plt.show() 59 | 60 | Q5:- 61 | Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array so you can take advantage of NumPy's vectorized operations. 62 | Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq. 63 | Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit. 64 | Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np. 65 | Print both variance_explicit and variance_np in one print call to make sure they are consistent. 66 | 67 | Solution:- 68 | # Array of differences to mean: differences 69 | differences = np.array(versicolor_petal_length - np.mean(versicolor_petal_length)) 70 | 71 | # Square the differences: diff_sq 72 | diff_sq = differences **2 73 | 74 | # Compute the mean square difference: variance_explicit 75 | variance_explicit = np.mean(diff_sq) 76 | 77 | # Compute the variance using NumPy: variance_np 78 | variance_np = np.var(versicolor_petal_length) 79 | 80 | # Print the results 81 | print(variance_explicit,variance_np) 82 | 83 | Q6:- 84 | Compute the variance of the data in the versicolor_petal_length array using np.var() and store it in a variable called variance. 85 | 86 | Print the square root of this value. 87 | 88 | Print the standard deviation of the data in the versicolor_petal_length array using np.std(). 89 | 90 | Solution:- 91 | # Compute the variance: variance 92 | variance = np.var(versicolor_petal_length) 93 | 94 | # Print the square root of the variance 95 | print(np.sqrt(variance)) 96 | 97 | # Print the standard deviation 98 | print(np.std(versicolor_petal_length)) 99 | 100 | Q7:- 101 | Use plt.plot() with the appropriate keyword arguments to make a scatter plot of versicolor petal length (x-axis) versus petal width (y-axis). The variables versicolor_petal_length and versicolor_petal_width are already in your namespace. Do not forget to use the marker='.' and linestyle='none' keyword arguments. 102 | Label the axes. 103 | Display the plot. 104 | 105 | Solution:- 106 | # Make a scatter plot 107 | _ = plt.plot(versicolor_petal_length,versicolor_petal_width,marker='.',linestyle='none') 108 | 109 | 110 | # Label the axes 111 | _ = plt.xlabel('versicolor petal length') 112 | 113 | _ = plt.ylabel('versicolor petal width') 114 | 115 | 116 | 117 | # Show the result 118 | plt.show() 119 | 120 | Q8:- 121 | Use np.cov() to compute the covariance matrix for the petal length (versicolor_petal_length) and width (versicolor_petal_width) of I. versicolor. 122 | Print the covariance matrix. 123 | Extract the covariance from entry [0,1] of the covariance matrix. Note that by symmetry, entry [1,0] is the same as entry [0,1]. 124 | Print the covariance. 
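A reminder of what entry [0,1] of the covariance matrix holds: it is the covariance of the two inputs, i.e. the mean product of their deviations from their means (np.cov() uses an n-1 denominator by default). A small check with made-up arrays:

import numpy as np

x = np.array([1., 2., 3., 4.])   # made-up data
y = np.array([2., 1., 4., 3.])

manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(manual_cov)
print(np.cov(x, y)[0, 1])        # same value; entries [0,0] and [1,1] hold the variances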
125 | 126 | Solution:- 127 | # Compute the covariance matrix: covariance_matrix 128 | covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width) 129 | 130 | # Print covariance matrix 131 | print(covariance_matrix) 132 | 133 | # Extract covariance of length and width of petals: petal_cov 134 | petal_cov = covariance_matrix[0,1] 135 | 136 | # Print the length/width covariance 137 | print(petal_cov) 138 | 139 | Q9:- 140 | Define a function with signature pearson_r(x, y). 141 | Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order). 142 | The function returns entry [0,1] of the correlation matrix. 143 | Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r. 144 | Print the result. 145 | 146 | Solution:- 147 | def pearson_r(x, y): 148 | """Compute Pearson correlation coefficient between two arrays.""" 149 | # Compute correlation matrix: corr_mat 150 | corr_mat = np.corrcoef(x,y) 151 | 152 | 153 | # Return entry [0,1] 154 | return corr_mat[0,1] 155 | 156 | # Compute Pearson correlation coefficient for I. versicolor: r 157 | r = pearson_r(versicolor_petal_length,versicolor_petal_width) 158 | 159 | # Print the result 160 | print(r) 161 | 162 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Thinking probabilistically-- Continuous variables: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3 and samples_std10, respectively. 3 | Plot a histograms of each of the samples; for each, use 100 bins, also using the keyword arguments normed=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls. 4 | Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs. 5 | 6 | Solution:- 7 | # Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10 8 | samples_std1 = np.random.normal(20,1,100000) 9 | samples_std3 = np.random.normal(20,3,100000) 10 | samples_std10 = np.random.normal(20,10,100000) 11 | 12 | 13 | # Make histograms 14 | plt.hist(samples_std1,bins=100, normed=True,histtype='step') 15 | plt.hist(samples_std3,bins=100, normed=True,histtype='step') 16 | plt.hist(samples_std10,bins=100, normed=True,histtype='step') 17 | 18 | # Make a legend, set limits and show plot 19 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10')) 20 | plt.ylim(-0.01, 0.42) 21 | plt.show() 22 | 23 | Q2:- 24 | Use your ecdf() function to generate x and y values for CDFs: x_std1, y_std1, x_std3, y_std3 and x_std10, y_std10, respectively. 25 | Plot all three CDFs as dots (do not forget the marker and linestyle keyword arguments!). 26 | Hit submit to make a legend, showing which standard deviations you used, and to show your plot. 
There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of CDFs. 27 | 28 | Solution:- 29 | # Generate CDFs 30 | x_std1, y_std1 = ecdf(samples_std1) 31 | x_std3, y_std3 = ecdf(samples_std3) 32 | x_std10, y_std10 = ecdf(samples_std10) 33 | 34 | # Plot CDFs 35 | _ = plt.plot(x_std1, y_std1 , marker='.', linestyle='none') 36 | _ = plt.plot(x_std3, y_std3 , marker='.', linestyle='none') 37 | _ = plt.plot(x_std10, y_std10 , marker='.', linestyle='none') 38 | 39 | 40 | # Make a legend and show the plot 41 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right') 42 | plt.show() 43 | 44 | Q3:- 45 | Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data. 46 | Take 10,000 samples out of a normal distribution with this mean and standard deviation using np.random.normal(). 47 | Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively. 48 | Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot. 49 | 50 | Solution:- 51 | # Compute mean and standard deviation: mu, sigma 52 | mu, sigma = np.mean(belmont_no_outliers), np.std(belmont_no_outliers) 53 | 54 | 55 | # Sample out of a normal distribution with this mu and sigma: samples 56 | samples = np.random.normal(mu,sigma,10000) 57 | 58 | # Get the CDF of the samples and of the data 59 | x_theor, y_theor = ecdf(samples) 60 | x,y = ecdf(belmont_no_outliers) 61 | 62 | 63 | # Plot the CDFs and show the plot 64 | _ = plt.plot(x_theor, y_theor) 65 | _ = plt.plot(x, y, marker='.', linestyle='none') 66 | _ = plt.xlabel('Belmont winning time (sec.)') 67 | _ = plt.ylabel('CDF') 68 | plt.show() 69 | 70 | Q4:- 71 | Take 1,000,000 samples from the normal distribution using the np.random.normal() function. The mean mu and standard deviation sigma are already loaded into the namespace of your IPython instance. 72 | Compute the fraction of samples that have a time less than or equal to Secretariat's time of 144 seconds. 73 | 74 | Solution:- 75 | # Take a million samples out of the Normal distribution: samples 76 | samples = np.random.normal(mu,sigma,1000000) 77 | 78 | # Compute the fraction that are faster than 144 seconds: prob 79 | prob = sum(samples <= 144)/1000000 80 | 81 | # Print the result 82 | print('Probability of besting Secretariat:', prob) 83 | 84 | Q5:- 85 | Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle. 86 | Draw waiting times tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1. 87 | Draw waiting times tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2. 88 | The function returns the sum of the waiting times for the two events. 
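One detail the solution below relies on: np.random.exponential() is parameterized by the scale, i.e. the mean waiting time tau, not by the rate 1/tau. A quick sanity check with a made-up tau:

import numpy as np

np.random.seed(42)
tau = 764                                    # made-up mean waiting time
samples = np.random.exponential(tau, size=100000)
print(np.mean(samples))                      # close to 764, confirming scale = mean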
89 | 90 | Solution:- 91 | def successive_poisson(tau1, tau2, size=1): 92 | """Compute time for arrival of 2 successive Poisson processes.""" 93 | # Draw samples out of first exponential distribution: t1 94 | t1 = np.random.exponential(tau1, size) 95 | 96 | # Draw samples out of second exponential distribution: t2 97 | t2 = np.random.exponential(tau2, size) 98 | 99 | return t1 + t2 100 | 101 | Q6:- 102 | Use your successive_poisson() function to draw 100,000 out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle. 103 | Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, normed=True, and histtype='step'. 104 | Label the axes. 105 | Show your plot. 106 | 107 | Solution:- 108 | # Draw samples of waiting times: waiting_times 109 | waiting_times = waiting_times = np.array(successive_poisson(764, 715, 100000)) 110 | 111 | # Make the histogram 112 | plt.hist(waiting_times, bins=100,normed=True,histtype='step') 113 | 114 | 115 | # Label axes 116 | plt.xlabel('x') 117 | plt.ylabel('y') 118 | 119 | 120 | # Show the plot 121 | plt.show() 122 | 123 | -------------------------------------------------------------------------------- /Python/Supervised Learning with scikit-learn/Classification: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import KNeighborsClassifier from sklearn.neighbors. 3 | Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape. 4 | Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter. 5 | Fit the classifier to the data using the .fit() method. 6 | 7 | Solution:- 8 | # Import KNeighborsClassifier from sklearn.neighbors 9 | from sklearn.neighbors import KNeighborsClassifier 10 | 11 | # Create arrays for the features and the response variable 12 | y = df['party'].values 13 | X = df.drop('party', axis=1).values 14 | 15 | # Create a k-NN classifier with 6 neighbors 16 | knn = KNeighborsClassifier(n_neighbors=6) 17 | 18 | # Fit the classifier to the data 19 | knn.fit(X,y) 20 | 21 | Q2:- 22 | Create arrays for the features and the target variable from df. As a reminder, the target variable is 'party'. 23 | Instantiate a KNeighborsClassifier with 6 neighbors. 24 | Fit the classifier to the data. 25 | Predict the labels of the training data, X. 26 | Predict the label of the new data point X_new. 
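Note on the prediction step below: scikit-learn's .predict() expects a 2-D array of shape (n_samples, n_features), so X_new is assumed to already have that shape; a single observation stored as a 1-D array would need reshaping first. A minimal sketch with made-up data (the names here are illustrative, not from the exercise):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(20, 3)            # 20 made-up samples with 3 features
y = np.random.randint(0, 2, 20)      # made-up binary labels

knn = KNeighborsClassifier(n_neighbors=6).fit(X, y)

one_point = np.random.rand(3)                    # a single 1-D observation
print(knn.predict(one_point.reshape(1, -1)))     # reshape to (1, 3) before predicting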
27 | 28 | Solution:- 29 | # Import KNeighborsClassifier from sklearn.neighbors 30 | from sklearn.neighbors import KNeighborsClassifier 31 | 32 | # Create arrays for the features and the response variable 33 | y = df['party'].values 34 | X = df.drop('party',axis=1).values 35 | 36 | # Create a k-NN classifier with 6 neighbors: knn 37 | knn = KNeighborsClassifier(n_neighbors=6) 38 | 39 | # Fit the classifier to the data 40 | knn.fit(X,y) 41 | 42 | # Predict the labels for the training data X 43 | y_pred = knn.predict(X) 44 | 45 | # Predict and print the label for the new data point X_new 46 | new_prediction = knn.predict(X_new) 47 | print("Prediction: {}".format(new_prediction)) 48 | 49 | Q3:- 50 | Import datasets from sklearn and matplotlib.pyplot as plt. 51 | Load the digits dataset using the .load_digits() method on datasets. 52 | Print the keys and DESCR of digits. 53 | Print the shape of images and data keys using the . notation. 54 | Display the 1011th image using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be! 55 | 56 | Solution:- 57 | # Import necessary modules 58 | from sklearn import datasets 59 | import matplotlib.pyplot as plt 60 | 61 | # Load the digits dataset: digits 62 | digits = datasets.load_digits() 63 | 64 | # Print the keys and DESCR of the dataset 65 | print(digits.DESCR) 66 | print(digits.keys()) 67 | 68 | # Print the shape of the images and data keys 69 | print(digits.images.shape) 70 | print(digits.data.shape) 71 | 72 | # Display digit 1010 73 | plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest') 74 | plt.show() 75 | 76 | Q4:- 77 | Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection. 78 | Create an array for the features using digits.data and an array for the target using digits.target. 79 | Create stratified training and test sets using 0.2 for the size of the test set. Use a random state of 42. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset. 80 | Create a k-NN classifier with 7 neighbors and fit it to the training data. 81 | Compute and print the accuracy of the classifier's predictions using the .score() method. 82 | 83 | Solution:- 84 | # Import necessary modules 85 | from sklearn.neighbors import KNeighborsClassifier 86 | from sklearn.model_selection import train_test_split 87 | digits = datasets.load_digits() 88 | 89 | # Create feature and target arrays 90 | X = digits.data 91 | y = digits.target 92 | 93 | # Split into training and test set 94 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y) 95 | 96 | # Create a k-NN classifier with 7 neighbors: knn 97 | knn = KNeighborsClassifier(n_neighbors=7) 98 | 99 | # Fit the classifier to the training data 100 | knn.fit(X_train,y_train) 101 | 102 | # Print the accuracy 103 | print(knn.score(X_test, y_test)) 104 | 105 | Q5:- 106 | Inside the for loop: 107 | Setup a k-NN classifier with the number of neighbors equal to k. 108 | Fit the classifier with k neighbors to the training data. 109 | Compute accuracy scores the training set and test set separately using the .score() method and assign the results to the train_accuracy and test_accuracy arrays respectively. 
110 | 111 | Solution:- 112 | # Setup arrays to store train and test accuracies 113 | neighbors = np.arange(1, 9) 114 | train_accuracy = np.empty(len(neighbors)) 115 | test_accuracy = np.empty(len(neighbors)) 116 | 117 | # Loop over different values of k 118 | for i, k in enumerate(neighbors): 119 | # Setup a k-NN Classifier with k neighbors: knn 120 | knn = KNeighborsClassifier(n_neighbors=k) 121 | 122 | # Fit the classifier to the training data 123 | knn.fit(X_train,y_train) 124 | 125 | #Compute accuracy on the training set 126 | train_accuracy[i] = knn.score(X_train, y_train) 127 | 128 | #Compute accuracy on the testing set 129 | test_accuracy[i] = knn.score(X_test, y_test) 130 | 131 | # Generate plot 132 | plt.title('k-NN: Varying Number of Neighbors') 133 | plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy') 134 | plt.plot(neighbors, train_accuracy, label = 'Training Accuracy') 135 | plt.legend() 136 | plt.xlabel('Number of Neighbors') 137 | plt.ylabel('Accuracy') 138 | plt.show() 139 | -------------------------------------------------------------------------------- /Python/Unsupervised Learning in Python/Discovering interpretable features: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Unsupervised Learning in Python/Visualization with hierarchical clustering and t-SNE: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import: 3 | linkage and dendrogram from scipy.cluster.hierarchy. 4 | matplotlib.pyplot as plt. 5 | Perform hierarchical clustering on samples using the linkage() function with the method='complete' keyword argument. Assign the result to mergings. 6 | Plot a dendrogram using the dendrogram() function on mergings. Specify the keyword arguments labels=varieties, leaf_rotation=90, and leaf_font_size=6. 7 | 8 | Solution:- 9 | # Perform the necessary imports 10 | from scipy.cluster.hierarchy import linkage, dendrogram 11 | import matplotlib.pyplot as plt 12 | 13 | # Calculate the linkage: mergings 14 | mergings = linkage(samples, method='complete') 15 | 16 | # Plot the dendrogram, using varieties as labels 17 | dendrogram(mergings, 18 | labels=varieties, 19 | leaf_rotation=90, 20 | leaf_font_size=6, 21 | ) 22 | plt.show() 23 | 24 | Q2:- 25 | Import normalize from sklearn.preprocessing. 26 | Rescale the price movements for each stock by using the normalize() function on movements. 27 | Apply the linkage() function to normalized_movements, using 'complete' linkage, to calculate the hierarchical clustering. Assign the result to mergings. 28 | Plot a dendrogram of the hierarchical clustering, using the list companies of company names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you did in the previous exercise. 29 | 30 | Solution:- 31 | # Import normalize 32 | from sklearn.preprocessing import normalize 33 | 34 | # Normalize the movements: normalized_movements 35 | normalized_movements = normalize(movements) 36 | 37 | # Calculate the linkage: mergings 38 | mergings = linkage(normalized_movements,method='complete') 39 | 40 | # Plot the dendrogram 41 | dendrogram(mergings,labels=companies,leaf_rotation=90,leaf_font_size=6) 42 | plt.show() 43 | 44 | Q3:- 45 | Import: 46 | linkage and dendrogram from scipy.cluster.hierarchy. 47 | matplotlib.pyplot as plt. 
48 | Perform hierarchical clustering on samples using the linkage() function with the method='single' keyword argument. Assign the result to mergings. 49 | Plot a dendrogram of the hierarchical clustering, using the list country_names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you have done earlier. 50 | 51 | Solution:- 52 | # Perform the necessary imports 53 | import matplotlib.pyplot as plt 54 | from scipy.cluster.hierarchy import linkage, dendrogram 55 | 56 | # Calculate the linkage: mergings 57 | mergings = linkage(samples,method='single') 58 | 59 | # Plot the dendrogram 60 | dendrogram(mergings,labels=country_names,leaf_rotation=90,leaf_font_size=6) 61 | plt.show() 62 | 63 | Q4:- 64 | Import: 65 | pandas as pd. 66 | fcluster from scipy.cluster.hierarchy. 67 | Perform a flat hierarchical clustering by using the fcluster() function on mergings. Specify a maximum height of 6 and the keyword argument criterion='distance'. 68 | Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you. 69 | Create a cross-tabulation ct between df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. 70 | 71 | Solution:- 72 | # Perform the necessary imports 73 | import pandas as pd 74 | from scipy.cluster.hierarchy import fcluster 75 | 76 | # Use fcluster to extract labels: labels 77 | labels = fcluster(mergings,6,criterion='distance') 78 | 79 | # Create a DataFrame with labels and varieties as columns: df 80 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 81 | 82 | # Create crosstab: ct 83 | ct = pd.crosstab(df['labels'],df['varieties']) 84 | 85 | # Display ct 86 | print(ct) 87 | 88 | Q5:- 89 | Import TSNE from sklearn.manifold. 90 | Create a TSNE instance called model with learning_rate=200. 91 | Apply the .fit_transform() method of model to samples. Assign the result to tsne_features. 92 | Select the column 0 of tsne_features. Assign the result to xs. 93 | Select the column 1 of tsne_features. Assign the result to ys. 94 | Make a scatter plot of the t-SNE features xs and ys. To color the points by the grain variety, specify the additional keyword argument c=variety_numbers. 95 | 96 | Solution:- 97 | # Import TSNE 98 | from sklearn.manifold import TSNE 99 | 100 | # Create a TSNE instance: model 101 | model = TSNE(learning_rate=200) 102 | 103 | # Apply fit_transform to samples: tsne_features 104 | tsne_features = model.fit_transform(samples) 105 | 106 | # Select the 0th feature: xs 107 | xs = tsne_features[:,0] 108 | 109 | # Select the 1st feature: ys 110 | ys = tsne_features[:,1] 111 | 112 | # Scatter plot, coloring by variety_numbers 113 | plt.scatter(xs,ys,c=variety_numbers) 114 | plt.show() 115 | 116 | Q6:- 117 | Import TSNE from sklearn.manifold. 118 | Create a TSNE instance called model with learning_rate=50. 119 | Apply the .fit_transform() method of model to normalized_movements. Assign the result to tsne_features. 120 | Select column 0 and column 1 of tsne_features. 121 | Make a scatter plot of the t-SNE features xs and ys. Specify the additional keyword argument alpha=0.5. 122 | Code to label each point with its company name has been written for you using plt.annotate(), so just hit 'Submit Answer' to see the visualization! 
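For orientation before the solution: TSNE.fit_transform() returns one 2-D point per input sample, which is why columns 0 and 1 can be used directly as scatter-plot coordinates. A standalone sketch on made-up data (defaults such as learning_rate have changed in newer scikit-learn releases, so warnings and results may differ by version):

import numpy as np
from sklearn.manifold import TSNE

samples = np.random.rand(50, 5)          # made-up feature matrix
model = TSNE(learning_rate=200)
tsne_features = model.fit_transform(samples)
print(tsne_features.shape)               # (50, 2): one 2-D embedding per sample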
123 | 124 | Solution:- 125 | # Import TSNE 126 | from sklearn.manifold import TSNE 127 | 128 | # Create a TSNE instance: model 129 | model = TSNE(learning_rate=50) 130 | 131 | # Apply fit_transform to normalized_movements: tsne_features 132 | tsne_features = model.fit_transform(normalized_movements) 133 | 134 | # Select the 0th feature: xs 135 | xs = tsne_features[:,0] 136 | 137 | # Select the 1th feature: ys 138 | ys = tsne_features[:,1] 139 | 140 | # Scatter plot 141 | plt.scatter(xs,ys,alpha=0.5) 142 | 143 | # Annotate the points 144 | for x, y, company in zip(xs, ys, companies): 145 | plt.annotate(company, (x, y), fontsize=5, alpha=0.75) 146 | plt.show() 147 | -------------------------------------------------------------------------------- /Python/pandas Foundations/Data ingestion & inspection: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import numpy using the standard alias np. 3 | Assign the numerical values in the DataFrame df to an array np_vals using the attribute values. 4 | Pass np_vals into the NumPy method log10() and store the results in np_vals_log10. 5 | Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10. 6 | Inspect the output of the print() code to see the type() of the variables that you created. 7 | 8 | Solution:- 9 | # Import numpy 10 | import numpy as np 11 | 12 | # Create array of DataFrame values: np_vals 13 | np_vals = df.values 14 | 15 | # Create new array of base 10 logarithm values: np_vals_log10 16 | np_vals_log10 = np.log10(np_vals) 17 | 18 | # Create array of new DataFrame by passing df to np.log10(): df_log10 19 | df_log10 = np.log10(df) 20 | 21 | # Print original and new data containers 22 | [print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']] 23 | 24 | Q2:- 25 | Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped. 26 | Inspect the contents of zipped using print(). This has been done for you. 27 | Construct a dictionary using zipped. Store the result as data. 28 | Construct a DataFrame using the dictionary. Store the result as df. 29 | 30 | Solution:- 31 | # Zip the 2 lists together into one list of (key,value) tuples: zipped 32 | zipped = list(zip(list_keys,list_values)) 33 | 34 | # Inspect the list using print() 35 | print(zipped) 36 | 37 | # Build a dictionary with the zipped list: data 38 | data = dict(zipped) 39 | 40 | # Build and inspect a DataFrame from the dictionary: df 41 | df = pd.DataFrame(data) 42 | print(df) 43 | 44 | Q3:- 45 | Create a list of new column labels with 'year', 'artist', 'song', 'chart weeks', and assign it to list_labels. 46 | Assign your list of labels to df.columns. 47 | 48 | Solution:- 49 | # Build a list of labels: list_labels 50 | list_labels = ['year','artist','song','chart weeks'] 51 | 52 | # Assign the list of labels to the columns attribute: df.columns 53 | df.columns = list_labels 54 | 55 | Q4:- 56 | Make a string object with the value 'PA' and assign it to state. 57 | Construct a dictionary with 2 key:value pairs: 'state':state and 'city':cities. 58 | Construct a pandas DataFrame from the dictionary you created and assign it to df. 
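Note for the solution below: when one dictionary value is a scalar (the string 'PA') and another is a list, pd.DataFrame() broadcasts the scalar down every row. A self-contained sketch with a made-up cities list (the real list is supplied by the exercise):

import pandas as pd

cities = ['Philadelphia', 'Pittsburgh', 'Erie']   # made-up stand-in
df = pd.DataFrame({'state': 'PA', 'city': cities})
print(df)
# 'PA' is repeated on every row alongside each city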
59 | 60 | Solution:- 61 | # Make a string with the value 'PA': state 62 | state = "PA" 63 | 64 | # Construct a dictionary: data 65 | data = {'state':state, 'city':cities} 66 | 67 | # Construct a DataFrame from dictionary data: df 68 | df = pd.DataFrame(data) 69 | 70 | # Print the DataFrame 71 | print(df) 72 | 73 | Q5:- 74 | Use pd.read_csv() with the string 'world_population.csv' to read the CSV file into a DataFrame and assign it to df1. 75 | Create a list of new column labels - 'year', 'population' - and assign it to the variable new_labels. 76 | Reread the same file, again using pd.read_csv(), but this time, add the keyword arguments header=0 and names=new_labels. Assign the resulting DataFrame to df2. 77 | Print both the df1 and df2 DataFrames to see the change in column names. This has already been done for you. 78 | 79 | Solution:- 80 | # Read in the file: df1 81 | df1 = pd.read_csv('world_population.csv') 82 | 83 | # Create a list of the new column labels: new_labels 84 | new_labels = ['year','population'] 85 | 86 | # Read in the file, specifying the header and names parameters: df2 87 | df2 = pd.read_csv('world_population.csv', header=0, names=new_labels) 88 | 89 | # Print both the DataFrames 90 | print(df1) 91 | print(df2) 92 | 93 | Q6:- 94 | Use pd.read_csv() without using any keyword arguments to read file_messy into a pandas DataFrame df1. 95 | Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess. 96 | Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2. 97 | Print the output of df2.head() to verify the file was read correctly. 98 | Use the DataFrame method .to_csv() to save the DataFrame df2 to the variable file_clean. Be sure to specify index=False. 99 | Use the DataFrame method .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False. 100 | 101 | Solution:- 102 | # Read the raw file as-is: df1 103 | df1 = pd.read_csv(file_messy) 104 | 105 | # Print the output of df1.head() 106 | print(df1.head()) 107 | 108 | # Read in the file with the correct parameters: df2 109 | df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#') 110 | 111 | # Print the output of df2.head() 112 | print(df2.head()) 113 | 114 | # Save the cleaned up DataFrame to a CSV file without the index 115 | df2.to_csv(file_clean, index=False) 116 | 117 | # Save the cleaned up DataFrame to an excel file without the index 118 | df2.to_excel('file_clean.xlsx', index=False) 119 | 120 | Q7:- 121 | Create the plot with the DataFrame method df.plot(). Specify a color of 'red'. 122 | Note: c and color are interchangeable as parameters here, but we ask you to be explicit and specify color. 123 | Use plt.title() to give the plot a title of 'Temperature in Austin'. 124 | Use plt.xlabel() to give the plot an x-axis label of 'Hours since midnight August 1, 2010'. 125 | Use plt.ylabel() to give the plot a y-axis label of 'Temperature (degrees F)'. 126 | Finally, display the plot using plt.show(). 
127 | 128 | Solution:- 129 | # Create a plot with color='red' 130 | df.plot(color='red') 131 | 132 | # Add a title 133 | plt.title('Temperature in Austin') 134 | 135 | # Specify the x-axis label 136 | plt.xlabel('Hours since midnight August 1, 2010') 137 | 138 | # Specify the y-axis label 139 | plt.ylabel('Temperature (degrees F)') 140 | 141 | # Display the plot 142 | plt.show() 143 | 144 | Q8:- 145 | Plot all columns together on one figure by calling df.plot(), and noting the vertical scaling problem. 146 | Plot all columns as subplots. To do so, you need to specify subplots=True inside .plot(). 147 | Plot a single column of dew point data. To do this, define a column list containing a single column name 'Dew Point (deg F)', and call df[column_list1].plot(). 148 | Plot two columns of data, 'Temperature (deg F)' and 'Dew Point (deg F)'. To do this, define a list containing those column names and pass it into df[], as df[column_list2].plot(). 149 | 150 | Solution:- 151 | # Plot all columns (default) 152 | df.plot() 153 | plt.show() 154 | 155 | # Plot all columns as subplots 156 | df.plot(subplots=True) 157 | plt.show() 158 | 159 | # Plot just the Dew Point data 160 | column_list1 = ['Dew Point (deg F)'] 161 | df[column_list1].plot() 162 | plt.show() 163 | 164 | # Plot the Dew Point and Temperature data, but not the Pressure data 165 | column_list2 = ['Temperature (deg F)','Dew Point (deg F)'] 166 | df[column_list2].plot() 167 | plt.show() 168 | 169 | -------------------------------------------------------------------------------- /Python/pandas Foundations/Exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list of y-axis column names called y_columns consisting of 'AAPL' and 'IBM'. 3 | Generate a line plot with x='Month' and y=y_columns as inputs. 4 | Give the plot a title of 'Monthly stock prices'. 5 | Specify the y-axis label. 6 | Display the plot. 7 | 8 | Solution:- 9 | # Create a list of y-axis column names: y_columns 10 | y_columns = ['AAPL','IBM'] 11 | 12 | # Generate a line plot 13 | df.plot(x='Month', y=y_columns) 14 | 15 | # Add the title 16 | plt.title('Monthly stock prices') 17 | 18 | # Add the y-axis label 19 | plt.ylabel('Price ($US)') 20 | 21 | # Display the plot 22 | plt.show() 23 | 24 | Q2:- 25 | Generate a scatter plot with 'hp' on the x-axis and 'mpg' on the y-axis. Specify s=sizes. 26 | Add a title to the plot. 27 | Specify the x-axis and y-axis labels. 28 | 29 | Solution:- 30 | # Generate a scatter plot 31 | df.plot(kind='scatter', x='hp', y='mpg', s=sizes) 32 | 33 | # Add the title 34 | plt.title('Fuel efficiency vs Horse-power') 35 | 36 | # Add the x-axis label 37 | plt.xlabel('Horse-power') 38 | 39 | # Add the y-axis label 40 | plt.ylabel('Fuel efficiency (mpg)') 41 | 42 | # Display the plot 43 | plt.show() 44 | 45 | Q3:- 46 | Make a list called cols of the column names to be plotted: 'weight' and 'mpg'. You can then access it using df[cols]. 47 | Generate a box plot of the two columns in a single figure. To do this, specify subplots=True 48 | 49 | Solution:- 50 | # Make a list of the column names to be plotted: cols 51 | cols = ['weight','mpg'] 52 | 53 | # Generate the box plots 54 | df[cols].plot(kind='box',subplots=True) 55 | 56 | # Display the plot 57 | plt.show() 58 | 59 | Q4:- 60 | Plot a PDF for the values in fraction with 30 bins between 0 and 30%. 61 | The range has been taken care of for you. ax=axes[0] means that this plot will appear in the first row. 
62 | Plot a CDF for the values in fraction with 30 bins between 0 and 30%. 63 | Again, the range has been specified for you. To make the CDF appear on the second row, you need to specify ax=axes[1]. 64 | 65 | Solution:- 66 | # This formats the plots such that they appear on separate rows 67 | fig, axes = plt.subplots(nrows=2, ncols=1) 68 | 69 | # Plot the PDF 70 | df.fraction.plot(ax=axes[0], kind='hist', bins=30, normed=True, range=(0,.3)) 71 | plt.show() 72 | 73 | # Plot the CDF 74 | df.fraction.plot(kind='hist', bins=30, cumulative=True, normed=True, ax=axes[1], range=(0,.3)) 75 | plt.show() 76 | 77 | Q5:- 78 | Print the minimum value of the 'Engineering' column. 79 | Print the maximum value of the 'Engineering' column. 80 | Construct the mean percentage per year with .mean(axis='columns'). Assign the result to mean. 81 | Plot the average percentage per year. 82 | Since 'Year' is the index of df, it will appear on the x-axis of the plot. No keyword arguments are needed in your call to .plot(). 83 | 84 | Solution:- 85 | # Print the minimum value of the Engineering column 86 | print(df['Engineering'].min()) 87 | 88 | # Print the maximum value of the Engineering column 89 | print(df['Engineering'].max()) 90 | 91 | # Construct the mean percentage per year: mean 92 | mean = df.mean(axis='columns') 93 | 94 | # Plot the average percentage per year 95 | mean.plot() 96 | 97 | # Display the plot 98 | plt.show() 99 | 100 | Q6:- 101 | Print summary statistics of the 'fare' column of df with .describe() and print(). Note: df.fare and df['fare'] are equivalent. 102 | Generate a box plot of the 'fare' column. 103 | 104 | Solution:- 105 | # Print summary statistics of the fare column with .describe() 106 | print(df['fare'].describe()) 107 | 108 | # Generate a box plot of the fare column 109 | df.fare.plot(kind='box') 110 | 111 | # Show the plot 112 | plt.show() 113 | 114 | Q7:- 115 | Print the number of countries reported in 2015. To do this, use the .count() method on the '2015' column of df. 116 | Print the 5th and 95th percentiles of df. To do this, use the .quantile() method with the list [0.05, 0.95]. 117 | Generate a box plot using the list of columns provided in years. 118 | This has already been done for you, so click on 'Submit Answer' to view the result! 119 | 120 | Solution- 121 | # Print the number of countries reported in 2015 122 | print(df['2015'].count()) 123 | 124 | # Print the 5th and 95th percentiles 125 | print(df.quantile([0.05, 0.95])) 126 | 127 | # Generate a box plot 128 | years = ['1800','1850','1900','1950','2000'] 129 | df[years].plot(kind='box') 130 | plt.show() 131 | 132 | Q8:- 133 | Compute and print the means of the January and March data using the .mean() method. 134 | Compute and print the standard deviations of the January and March data using the .std() method. 135 | 136 | Solution:- 137 | # Print the mean of the January and March data 138 | print(january.mean(), march.mean()) 139 | 140 | # Print the standard deviation of the January and March data 141 | print(january.std(), march.std()) 142 | 143 | Q9:- 144 | Filtering and counting 145 | How many automobiles were manufactured in Asia in the automobile dataset? 146 | The DataFrame has been provided for you as df. Use filtering and the .count() member method to determine the number of rows where the 'origin' column has the value 'Asia'. 147 | As an example, you can extract the rows that contain 'US' as the country of origin using df[df['origin'] == 'US']. 
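On the counting step below: .count() returns the number of non-null entries per column, so filtering and then counting yields one count per column rather than a single number; len() or .shape[0] give the plain row count. A sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'origin': ['US', 'Asia', 'Asia', 'Europe'],
                   'mpg': [18.0, 24.0, None, 26.0]})

print(df[df['origin'] == 'Asia'].count())   # per-column non-null counts (mpg shows 1 here)
print(len(df[df['origin'] == 'Asia']))      # number of matching rows: 2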
148 | 149 | Solution:- 150 | df[df['origin'] == 'Asia'].count() 151 | 152 | Q10:- 153 | Compute the global mean and global standard deviations of df using the .mean() and .std() methods. 154 | Assign the results to global_mean and global_std. 155 | Filter the 'US' population from the 'origin' column and assign the result to us. 156 | Compute the US mean and US standard deviations of us using the .mean() and .std() methods. Assign the results to us_mean and us_std. 157 | Print the differences between us_mean and global_mean and us_std and global_std. This has already been done for you. 158 | 159 | Solution:- 160 | # Compute the global mean and global standard deviation: global_mean, global_std 161 | global_mean = df.mean() 162 | global_std = df.std() 163 | 164 | # Filter the US population from the origin column: us 165 | us = df[df['origin']=='US'] 166 | 167 | # Compute the US mean and US standard deviation: us_mean, us_std 168 | us_mean = us.mean() 169 | us_std = us.std() 170 | 171 | # Print the differences 172 | print(us_mean - global_mean) 173 | print(us_std - global_std) 174 | 175 | Q11:- 176 | Inside plt.subplots(), specify the nrows and ncols parameters so that there are 3 rows and 1 column. 177 | Filter the rows where the 'pclass' column has the values 1 and generate a box plot of the 'fare' column. 178 | Filter the rows where the 'pclass' column has the values 2 and generate a box plot of the 'fare' column. 179 | Filter the rows where the 'pclass' column has the values 3 and generate a box plot of the 'fare' column. 180 | 181 | Solution:- 182 | # Display the box plots on 3 separate rows and 1 column 183 | fig, axes = plt.subplots(nrows=3, ncols=1) 184 | 185 | # Generate a box plot of the fare prices for the First passenger class 186 | titanic.loc[titanic['pclass'] == 1].plot(ax=axes[0], y='fare', kind='box') 187 | 188 | # Generate a box plot of the fare prices for the Second passenger class 189 | titanic.loc[titanic['pclass'] == 2].plot(ax=axes[1], y='fare', kind='box') 190 | 191 | # Generate a box plot of the fare prices for the Third passenger class 192 | titanic.loc[titanic['pclass'] == 3].plot(ax=axes[2], y='fare', kind='box') 193 | 194 | # Display the plot 195 | plt.show() 196 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DataCamp 2 | This repository contains assignments on courses related to data science from Data camp 3 | -------------------------------------------------------------------------------- /SparkR/Introduction to Spark in R using sparklyr/Going Native: Use The Native Interface to Manipulate Spark DataFrames: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Which of these statements is true? 3 | 4 | sparklyr's dplyr methods convert code into Scala code before running it on Spark. 5 | Converting R code into SQL code limits the number of supported computations. 6 | Most Spark MLlib modeling functions require DoubleType inputs and return DoubleType outputs. 7 | Most Spark MLlib modeling functions require IntegerType inputs and return BooleanType outputs 8 | 9 | Solution:- 10 | 2 and 3. 11 | 12 | Q2:- 13 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. 14 | 15 | Create a variable named hotttnesss from track_metadata_tbl. 16 | Select the artist_hotttnesss field. 
17 | Use ft_binarizer() to create a new field, is_hottt_or_nottt, which is true when artist_hotttnesss is greater than 0.5.
18 | Collect the result.
19 | Convert the is_hottt_or_nottt field to be logical.
20 | Draw a ggplot() bar plot of is_hottt_or_nottt.
21 | The first argument to ggplot() is the data argument, hotttnesss.
22 | The second argument to ggplot() is the aesthetic, is_hottt_or_nottt wrapped in aes().
23 | Add geom_bar() to draw the bars.
24 | 
25 | Solution:-
26 | # One possible solution (a sketch; assumes dplyr and ggplot2 are loaded)
27 | hotttnesss <- track_metadata_tbl %>%
28 |   # Select the artist_hotttnesss field
29 |   select(artist_hotttnesss) %>%
30 |   # Binarize artist_hotttnesss at the 0.5 threshold
31 |   ft_binarizer("artist_hotttnesss", "is_hottt_or_nottt", threshold = 0.5) %>%
32 |   # Collect the result back into R
33 |   collect() %>%
34 |   # Convert is_hottt_or_nottt to a logical
35 |   mutate(is_hottt_or_nottt = as.logical(is_hottt_or_nottt))
36 | 
37 | # Draw a bar plot of is_hottt_or_nottt
38 | ggplot(hotttnesss, aes(is_hottt_or_nottt)) +
39 |   geom_bar()
40 | 
--------------------------------------------------------------------------------
/SparkR/Introduction to Spark in R using sparklyr/Light My Fire: Starting To Use Spark With dplyr Syntax:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Load the sparklyr package with library(). Connect to Spark by calling spark_connect(), with argument master = "local". Assign the result to spark_conn.
3 | Get the Spark version using spark_version(), with argument sc = spark_conn. Disconnect from Spark using spark_disconnect(), with argument sc = spark_conn.
4 | 
5 | Solution:-
6 | # Load sparklyr
7 | library(sparklyr)
8 | 
9 | # Connect to your Spark cluster
10 | spark_conn <- spark_connect(master = "local")
11 | 
12 | # Print the version of Spark
13 | print(spark_version(sc = spark_conn))
14 | 
15 | # Disconnect from Spark
16 | spark_disconnect(sc = spark_conn)
17 | 
18 | Q2:-
19 | track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.
20 | Use str() to explore the track_metadata dataset. Connect to your local Spark cluster, storing the connection in spark_conn.
21 | Copy track_metadata to the Spark cluster using copy_to(). See which data frames are available in Spark, using src_tbls().
22 | Disconnect from Spark.
23 | 
24 | Solution:-
25 | # Load dplyr
26 | library(dplyr)
27 | 
28 | # Explore track_metadata structure
29 | str(track_metadata)
30 | 
31 | # Connect to your Spark cluster
32 | spark_conn <- spark_connect(master = "local")
33 | 
34 | # Copy track_metadata to Spark
35 | track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)
36 | 
37 | # List the data frames available in Spark
38 | src_tbls(spark_conn)
39 | 
40 | # Disconnect from Spark
41 | spark_disconnect(spark_conn)
42 | 
43 | Q3:-
44 | A Spark connection has been created for you as spark_conn. The track metadata for 1,000 tracks is stored in the Spark cluster in the table "track_metadata".
45 | Link to the "track_metadata" table using tbl(). Assign the result to track_metadata_tbl. See how big the dataset is, using dim() on track_metadata_tbl.
46 | See how small the tibble is, using object_size() on track_metadata_tbl.
47 | 
48 | Solution:-
49 | # Link to the track_metadata table in Spark
50 | track_metadata_tbl <- tbl(spark_conn, "track_metadata")
51 | 
52 | # See how big the dataset is
53 | dim(track_metadata_tbl)
54 | 
55 | # See how small the tibble is
56 | object_size(track_metadata_tbl)
57 | 
58 | Q4:-
59 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined
60 | as track_metadata_tbl. Print the first 5 rows and all the columns of the track metadata. Examine the structure of the tibble using str().
61 | Examine the structure of the track metadata using glimpse().
62 | 
63 | Solution:-
64 | # Print 5 rows, all columns
65 | print(track_metadata_tbl, n = 5, width = Inf)
66 | 
67 | # Examine the structure of the tibble
68 | str(track_metadata_tbl)
69 | 
70 | # Examine the structure of the data
71 | glimpse(track_metadata_tbl)
72 | 
73 | Q5:-
74 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
75 | track_metadata_tbl. Select the artist_name, release, title, and year using select(). Try to do the same thing using square bracket
76 | indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch().
77 | 
78 | Solution:-
79 | # track_metadata_tbl has been pre-defined
80 | track_metadata_tbl
81 | 
82 | # Manipulate the track metadata
83 | track_metadata_tbl %>%
84 |   # Select columns
85 |   select('artist_name', 'release', 'title', 'year')
86 | 
87 | # Try to select columns using [ ]
88 | tryCatch({
89 |   # Selection code here
90 |   track_metadata_tbl[, c("artist_name", "release", "title", "year")]
91 |   },
92 |   error = print
93 | )
94 | 
95 | Q6:-
96 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
97 | track_metadata_tbl. As in the previous exercise, select the artist_name, release, title, and year using select(). Pipe the result of this to filter() to get the tracks
98 | from the 1960s.
99 | 
100 | Solution:-
101 | # track_metadata_tbl has been pre-defined
102 | glimpse(track_metadata_tbl)
103 | 
104 | # Manipulate the track metadata
105 | track_metadata_tbl %>%
106 |   # Select columns
107 |   select(artist_name, release, title, year) %>%
108 |   # Filter rows
109 |   filter(year >= 1960, year < 1970)
110 | 
111 | Q7:-
112 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
113 | track_metadata_tbl. Select the artist_name, release, title, and year fields. Pipe the result of this to filter() to keep only tracks from the 1960s.
114 | Pipe the result of this to arrange() to order by artist_name, then descending year, then title.
115 | 
116 | Solution:-
117 | # track_metadata_tbl has been pre-defined
118 | track_metadata_tbl
119 | # Manipulate the track metadata
120 | track_metadata_tbl %>%
121 |   # Select columns
122 |   select(artist_name, release, title, year) %>%
123 |   # Filter rows
124 |   filter(year >= 1960, year < 1970) %>%
125 |   # Arrange rows
126 |   arrange(artist_name, desc(year), title)
127 | 
128 | Q8:-
129 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined
130 | as track_metadata_tbl. Select the title and duration fields. Note that the durations are in seconds. Pipe the result of this to mutate() to create a new field, duration_minutes,
131 | that contains the track duration in minutes.
132 | 
133 | Solution:-
134 | # track_metadata_tbl has been pre-defined
135 | track_metadata_tbl
136 | 
137 | # Manipulate the track metadata
138 | track_metadata_tbl %>%
139 |   # Select columns
140 |   select(title, duration) %>%
141 |   # Mutate columns
142 |   mutate(
143 |     duration_minutes = duration / 60
144 |   )
145 | 
146 | Q9:-
147 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
148 | track_metadata_tbl. Select the title and duration fields. Pipe the result of this to create a new field, duration_minutes, that contains the track duration in minutes.
149 | Pipe the result of this to summarize() to calculate the mean duration in minutes, in a field named mean_duration_minutes.
150 | 
151 | Solution:-
152 | # track_metadata_tbl has been pre-defined
153 | track_metadata_tbl
154 | 
155 | # Manipulate the track metadata
156 | track_metadata_tbl %>%
157 |   # Select columns
158 |   select(title, duration) %>%
159 |   # Mutate columns
160 |   mutate(
161 |     duration_minutes = duration / 60
162 |   ) %>%
163 |   # Summarize columns
164 |   summarize(
165 |     mean_duration_minutes = mean(duration_minutes)
166 |   )
167 | 
168 | 
169 | 
--------------------------------------------------------------------------------
/Spoken Language Processing in Python/Introduction to Spoken Language Processing with Python:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Import the Python wave library.
3 | Read in the good_morning.wav audio file and save it to good_morning.
4 | Create signal_gm by reading all the frames from good_morning using readframes().
5 | See what the first 10 frames of audio look like by slicing signal_gm.
6 | 
7 | Solution:-
8 | import wave
9 | 
10 | # Create an audio file wave object
11 | good_morning = wave.open("good_morning.wav", 'r')
12 | 
13 | # Read all frames from the wave object
14 | signal_gm = good_morning.readframes(-1)
15 | 
16 | # View the first 10 frames
17 | print(signal_gm[:10])
18 | 
--------------------------------------------------------------------------------
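A natural follow-up to the exercise above (not part of the original solutions): the frames returned by readframes() are raw bytes, so before any numeric analysis they are usually converted to integers. A minimal sketch, assuming NumPy is installed and good_morning.wav contains 16-bit PCM audio; the soundwave_gm name is just illustrative:

import wave
import numpy as np

# Re-open the audio file and read all of its raw frames
good_morning = wave.open("good_morning.wav", 'r')
signal_gm = good_morning.readframes(-1)

# Interpret the raw bytes as 16-bit signed integers (assumes 16-bit PCM audio)
soundwave_gm = np.frombuffer(signal_gm, dtype='int16')

# View the first 10 integer sound-wave values
print(soundwave_gm[:10])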