├── PySpark └── Introduction to PySpark │ ├── Getting to know PySpark │ ├── Machine Learning Pipelines │ └── Manipulating Data ├── Python ├── Analyzing Police Activity with pandas │ ├── Analyzing the effect of weather on policing │ ├── Exploring the relationship between gender and policing │ ├── Preparing the data for analysis │ └── Visual exploratory data analysis ├── Cleaning Data in Python │ ├── Case study │ ├── Cleaning data for analysis │ ├── Combining data for analysis │ ├── Exploring your data │ └── Tidying data for analysis ├── Conda Essentials │ └── Installing Packages ├── Importing Data in Python -Part 1 │ ├── Importing data from other file types │ ├── Introduction and flat files │ └── Introduction to relational databases ├── Importing Data in Python -Part 2 │ ├── Diving deep into the Twitter API │ ├── Importing data from the Internet │ └── Interacting with APIs to import data from the web ├── Interactive Data Visualization with Bokeh │ ├── Basic plotting with Bokeh │ ├── Building interactive apps with Bokeh │ ├── Layouts, Interactions, and Annotations │ └── Putting It All Together! A Case Study ├── Intermediate-Python-for-Data-Science │ ├── Case-Study-Hacker-Statistics │ ├── Dictionaries-Pandas │ ├── Logic-ControlFlow-Filtering │ ├── Matplotlib │ └── loops ├── Intro to SQL for Data Science │ ├── Aggregate Functions │ ├── Filtering rows │ ├── Selecting columns │ └── Sorting, grouping and joins ├── Intro-to-data-science │ ├── Functions-Strings │ ├── Numpy-Statistics │ ├── Python-Basics │ └── Python-Lists ├── Introduction to Data Visualization with Python │ ├── Analyzing time series and images │ ├── Customizing plots │ ├── Plotting 2D arrays │ └── Statistical plots with Seaborn ├── Introduction to Databases in Python │ ├── Advanced SQLAlchemy Queries │ ├── Applying Filtering, Ordering and Grouping to Queries │ ├── Basics of Relational Databases │ ├── Creating and Manipulating your own Databases │ └── Putting it all together ├── Introduction to Relational Databases in SQL │ ├── Enforce data consistency with attribute constraints │ ├── Glue together tables with foreign keys │ ├── Uniquely identify records with key constraints │ └── Your first database ├── Introduction to Shell for Data Science │ └── Manipulating files and directories ├── Joining Data in SQL │ ├── Introduction to joins │ ├── Outer joins and cross joins │ ├── Set theory clauses │ └── Subqueries ├── Machine Learning with the Experts: School Budgets │ ├── Creating a simple first model │ ├── Exploring the raw data │ ├── Improving your model │ └── Learning from the experts ├── Manipulating DataFrames with pandas │ ├── Advanced indexing │ ├── Bringing it all together │ ├── Extracting and transforming data │ ├── Grouping data │ └── Rearranging and reshaping data ├── Merging DataFrames with pandas │ ├── Case Study - Summer Olympics │ ├── Concatenating data │ ├── Merging data │ └── Preparing data ├── Network Analysis in Python (Part 1) │ ├── Bringing it all together │ ├── Important nodes │ ├── Introduction to networks │ └── Structures ├── Python Data Science Toolbox -Part 1 │ ├── Default arguments, variable-length arguments and scope │ ├── Lambda functions and error-handling │ └── Writing your own functions ├── Python Data Science Toolbox -Part 2 │ ├── Case Study │ ├── List comprehensions and generators │ └── Using iterators in PythonLand ├── Python Data Science Toolbox -Part │ └── Case Study ├── Statistical Thinking in Python (Part 2) │ ├── Bootstrap confidence intervals │ ├── Hypothesis test examples │ ├── Introduction to 
hypothesis testing │ ├── Parameter estimation by optimization │ └── Putting it all together: a case study ├── Statistical Thinking in Python -Part 1 │ ├── Graphical exploratory data analysis │ ├── Quantitative exploratory data analysis │ ├── Thinking probabilistically-- Continuous variables │ └── Thinking probabilistically-- Discrete variables ├── Supervised Learning with scikit-learn │ ├── Classification │ ├── Fine-tuning your model │ ├── Preprocessing and pipelines │ └── Regression ├── Unsupervised Learning in Python │ ├── Clustering for dataset exploration │ ├── Decorrelating your data and dimension reduction │ ├── Discovering interpretable features │ └── Visualization with hierarchical clustering and t-SNE └── pandas Foundations │ ├── Case Study - Sunlight in Austin │ ├── Data ingestion & inspection │ ├── Exploratory data analysis │ └── Time series in pandas ├── README.md ├── SparkR └── Introduction to Spark in R using sparklyr │ ├── Going Native: Use The Native Interface to Manipulate Spark DataFrames │ ├── Light My Fire: Starting To Use Spark With dplyr Syntax │ └── Tools of the Trade: Advanced dplyr Usage └── Spoken Language Processing in Python └── Introduction to Spoken Language Processing with Python /PySpark/Introduction to PySpark/Getting to know PySpark: -------------------------------------------------------------------------------- 1 | Q1:- 2 | How do you connect to a Spark cluster from PySpark? 3 | 4 | Solution:- 5 | Create an instance of the SparkContext class. 6 | 7 | Q2:- 8 | Get to know the SparkContext. 9 | Call print() on sc to verify there's a SparkContext in your environment. 10 | print() sc.version to see what version of Spark is running on your cluster. 11 | 12 | Solution:- 13 | # Verify SparkContext 14 | print(sc) 15 | 16 | # Print Spark version 17 | print(sc.version) 18 | 19 | Q3:- 20 | Which of the following is an advantage of Spark DataFrames over RDDs? 21 | 22 | Solution:- 23 | Operations using DataFrames are automatically optimized. 24 | 25 | Q4:- 26 | Import SparkSession from pyspark.sql. 27 | Make a new SparkSession called my_spark using SparkSession.builder.getOrCreate(). 28 | Print my_spark to the console to verify it's a SparkSession. 29 | 30 | Solution:- 31 | # Import SparkSession from pyspark.sql 32 | from pyspark.sql import SparkSession 33 | 34 | # Create my_spark 35 | my_spark = SparkSession.builder.getOrCreate() 36 | 37 | # Print my_spark 38 | print(my_spark) 39 | 40 | Q5:- 41 | See what tables are in your cluster by calling spark.catalog.listTables() and printing the result! 42 | 43 | Solution:- 44 | # Print the tables in the catalog 45 | print(spark.catalog.listTables()) 46 | 47 | Q6:- 48 | Use the .sql() method to get the first 10 rows of the flights table and save the result to flights10. The variable query contains the appropriate SQL query. 49 | Use the DataFrame method .show() to print flights10 50 | 51 | Solution:- 52 | # Don't change this query 53 | query = "FROM flights SELECT * LIMIT 10" 54 | 55 | # Get the first 10 rows of flights 56 | flights10 = spark.sql(query) 57 | 58 | # Show the results 59 | flights10.show() 60 | 61 | Q7:- 62 | Run the query using the .sql() method. Save the result in flight_counts. 63 | Use the .toPandas() method on flight_counts to create a pandas DataFrame called pd_counts. 64 | Print the .head() of pd_counts to the console. 
65 |
66 | Solution:-
67 | # Don't change this query
68 | query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"
69 |
70 | # Run the query
71 | flight_counts = spark.sql(query)
72 |
73 | # Convert the results to a pandas DataFrame
74 | pd_counts = flight_counts.toPandas()
75 |
76 | # Print the head of pd_counts
77 | print(pd_counts.head())
78 |
79 | Q8:-
80 | The code to create a pandas DataFrame of random numbers has already been provided and saved under pd_temp.
81 | Create a Spark DataFrame called spark_temp by calling the .createDataFrame() method with pd_temp as the argument.
82 | Examine the list of tables in your Spark cluster and verify that the new DataFrame is not present. Remember you can use spark.catalog.listTables() to do so.
83 | Register spark_temp as a temporary table named "temp" using the .createOrReplaceTempView() method. Remember that the table name is set by passing it as the only argument!
84 | Examine the list of tables again!
85 |
86 | Solution:-
87 | # Create pd_temp
88 | pd_temp = pd.DataFrame(np.random.random(10))
89 |
90 | # Create spark_temp from pd_temp
91 | spark_temp = spark.createDataFrame(pd_temp)
92 |
93 | # Examine the tables in the catalog
94 | print(spark.catalog.listTables())
95 |
96 | # Add spark_temp to the catalog
97 | spark_temp.createOrReplaceTempView("temp")
98 |
99 | # Examine the tables in the catalog again
100 | print(spark.catalog.listTables())
101 |
102 | Q9:-
103 | Use the .read.csv() method to create a Spark DataFrame called airports.
104 | The first argument is file_path.
105 | Pass the argument header=True so that Spark knows to take the column names from the first line of the file.
106 | Print out this DataFrame by calling .show().
107 |
108 | Solution:-
109 | # Don't change this file path
110 | file_path = "/usr/local/share/datasets/airports.csv"
111 |
112 | # Read in the airports data
113 | airports = spark.read.csv(file_path, header=True)
114 |
115 | # Show the data
116 | airports.show()
117 |
118 |
--------------------------------------------------------------------------------
/PySpark/Introduction to PySpark/Machine Learning Pipelines:
--------------------------------------------------------------------------------
1 | Q1:-
2 | First, rename the year column of planes to plane_year to avoid duplicate column names.
3 | Create a new DataFrame called model_data by joining the flights table with planes using the tailnum column as the key.
4 |
5 | Solution:-
6 |
--------------------------------------------------------------------------------
/Python/Analyzing Police Activity with pandas/Analyzing the effect of weather on policing:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Read weather.csv into a DataFrame named weather.
3 | Select the temperature columns (TMIN, TAVG, TMAX) and print their summary statistics using the .describe() method.
4 | Create a box plot to visualize the temperature columns.
5 | Display the plot.
6 |
7 | Solution:-
8 | # Read 'weather.csv' into a DataFrame named 'weather'
9 | weather = pd.read_csv('weather.csv')
10 |
11 | # Describe the temperature columns
12 | print(weather[['TMIN', 'TAVG', 'TMAX']].describe())
13 |
14 | # Create a box plot of the temperature columns
15 | weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')
16 |
17 | # Display the plot
18 | plt.show()
19 |
20 | Q2:-
21 | Create a new column in the weather DataFrame named TDIFF that represents the difference between the maximum and minimum temperatures.
22 | Print the summary statistics for TDIFF using the .describe() method. 23 | Create a histogram with 20 bins to visualize TDIFF. 24 | Display the plot. 25 | 26 | Solution:- 27 | # Create a 'TDIFF' column that represents temperature difference 28 | weather['TDIFF'] =weather.TMAX - weather.TMIN 29 | 30 | # Describe the 'TDIFF' column 31 | print(weather['TDIFF'].describe()) 32 | 33 | # Create a histogram with 20 bins to visualize 'TDIFF' 34 | weather.TDIFF.plot(kind='hist',bins=20) 35 | 36 | # Display the plot 37 | plt.show() 38 | 39 | Q3:- 40 | Copy the columns WT01 through WT22 from weather to a new DataFrame named WT. 41 | Calculate the sum of each row in WT, and store the results in a new weather column named bad_conditions. 42 | Replace any missing values in bad_conditions with a 0. (This has been done for you.) 43 | Create a histogram to visualize bad_conditions, and then display the plot. 44 | 45 | Solution:- 46 | # Copy 'WT01' through 'WT22' to a new DataFrame 47 | WT = weather.loc[:,'WT01':'WT22'] 48 | 49 | # Calculate the sum of each row in 'WT' 50 | weather['bad_conditions'] = WT.sum(axis='columns') 51 | 52 | # Replace missing values in 'bad_conditions' with '0' 53 | weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int') 54 | 55 | # Create a histogram to visualize 'bad_conditions' 56 | weather.bad_conditions.plot(kind='hist') 57 | 58 | # Display the plot 59 | plt.show() 60 | 61 | Q4:- 62 | Count the unique values in the bad_conditions column and sort the index. (This has been done for you.) 63 | Create a dictionary called mapping that maps the bad_conditions integers to strings as specified above. 64 | Convert the bad_conditions integers to strings using the mapping and store the results in a new column called rating. 65 | Count the unique values in rating to verify that the integers were properly converted to strings. 66 | 67 | Solution:- 68 | # Count the unique values in 'bad_conditions' and sort the index 69 | print(weather.bad_conditions.value_counts().sort_index()) 70 | 71 | # Create a dictionary that maps integers to strings 72 | mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad',4:'bad',5:'worse',6:'worse',7:'worse',8:'worse',9:'worse'} 73 | 74 | # Convert the 'bad_conditions' integers to strings using the 'mapping' 75 | weather['rating'] = weather.bad_conditions.map(mapping) 76 | 77 | # Count the unique values in 'rating' 78 | print(weather.rating.value_counts()) 79 | 80 | Q5:- 81 | Create a list object called cats that lists the weather ratings in a logical order: 'good', 'bad', 'worse'. 82 | Change the data type of the rating column from object to category. Make sure to use the cats list to define the category ordering. 83 | Examine the head of the rating column to confirm that the categories are logically ordered. 84 | 85 | Solution:- 86 | # Create a list of weather ratings in logical order 87 | cats= ['good','bad','worse'] 88 | 89 | # Change the data type of 'rating' to category 90 | weather['rating'] = weather['rating'].astype('category').cat.reorder_categories(cats, ordered=True) 91 | 92 | # Examine the head of 'rating' 93 | print(weather['rating'].head()) 94 | 95 | Q6:- 96 | Reset the index of the ri DataFrame. 97 | Examine the head of ri to verify that stop_datetime is now a DataFrame column, and the index is now the default integer index. 98 | Create a new DataFrame named weather_rating that contains only the DATE and rating columns from the weather DataFrame. 99 | Examine the head of weather_rating to verify that it contains the proper columns. 
100 | 101 | Solution:- 102 | # Reset the index of 'ri' 103 | ri.reset_index(inplace=True) 104 | 105 | # Examine the head of 'ri' 106 | print(ri.head()) 107 | 108 | # Create a DataFrame from the 'DATE' and 'rating' columns 109 | weather_rating = weather[['DATE','rating']] 110 | 111 | # Examine the head of 'weather_rating' 112 | print(weather_rating.head()) 113 | 114 | Q7:- 115 | Examine the shape of the ri DataFrame. 116 | Merge the ri and weather_rating DataFrames using a left join. 117 | Examine the shape of ri_weather to confirm that it has two more columns but the same number of rows as ri. 118 | Replace the index of ri_weather with the stop_datetime column. 119 | 120 | Solution:- 121 | # Examine the shape of 'ri' 122 | print(ri.shape) 123 | 124 | # Merge 'ri' and 'weather_rating' using a left join 125 | ri_weather = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left') 126 | 127 | # Examine the shape of 'ri_weather' 128 | print(ri_weather.shape) 129 | 130 | # Set 'stop_datetime' as the index of 'ri_weather' 131 | ri_weather.set_index('stop_datetime', inplace=True) 132 | 133 | Q8:- 134 | Calculate the overall arrest rate by taking the mean of the is_arrested Series. 135 | 136 | Solution:- 137 | # Calculate the overall arrest rate 138 | print(ri_weather.is_arrested.mean()) 139 | 140 | Q9:- 141 | Calculate the arrest rate for each weather rating using a .groupby(). 142 | 143 | Solution:- 144 | # Calculate the arrest rate for each 'rating' 145 | print(ri_weather.groupby('rating').is_arrested.mean()) 146 | 147 | Q10:- 148 | Calculate the arrest rate for each combination of violation and rating. How do the arrest rates differ by group? 149 | 150 | Solution- 151 | # Calculate the arrest rate for each 'violation' and 'rating' 152 | print(ri_weather.groupby(['violation','rating']).is_arrested.mean()) 153 | 154 | Q11:- 155 | Save the output of the .groupby() operation from the last exercise as a new object, arrest_rate. (This has been done for you.) 156 | Print the arrest_rate Series and examine it. 157 | Print the arrest rate for moving violations in bad weather. 158 | Print the arrest rates for speeding violations in all three weather conditions. 159 | 160 | Solution:- 161 | # Save the output of the groupby operation from the last exercise 162 | arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean() 163 | 164 | # Print the 'arrest_rate' Series 165 | print(arrest_rate) 166 | 167 | # Print the arrest rate for moving violations in bad weather 168 | print(arrest_rate.loc['Moving violation','bad']) 169 | 170 | # Print the arrest rates for speeding violations in all three weather conditions 171 | print(arrest_rate.loc['Speeding']) 172 | 173 | Q12:- 174 | Unstack the arrest_rate Series to reshape it into a DataFrame. 175 | Create the exact same DataFrame using a pivot table! Each of the three .pivot_table() parameters should be specified as one of the ri_weather columns. 
176 | 177 | Solution:- 178 | # Unstack the 'arrest_rate' Series into a DataFrame 179 | print(arrest_rate.unstack()) 180 | 181 | # Create the same DataFrame using a pivot table 182 | print(ri_weather.pivot_table(index='violation', columns=['rating'], values='is_arrested')) 183 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Exploring the relationship between gender and policing: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers. 3 | Express the violation counts as proportions of the total. 4 | 5 | Solution:- 6 | # Count the unique values in 'violation' 7 | print(ri.violation.value_counts()) 8 | 9 | # Express the counts as proportions 10 | print(ri.violation.value_counts(normalize=True)) 11 | 12 | Q2:- 13 | Create a DataFrame, female, that only contains rows in which driver_gender is 'F'. 14 | Create a DataFrame, male, that only contains rows in which driver_gender is 'M'. 15 | Count the violations committed by female drivers and express them as proportions. 16 | Count the violations committed by male drivers and express them as proportions. 17 | 18 | Solution:- 19 | # Create a DataFrame of female drivers 20 | female = ri[ri['driver_gender'] == 'F'] 21 | 22 | # Create a DataFrame of male drivers 23 | male = ri[ri['driver_gender'] == 'M'] 24 | 25 | # Compute the violations by female drivers (as proportions) 26 | print(female.violation.value_counts(normalize=True)) 27 | 28 | # Compute the violations by male drivers (as proportions) 29 | print(male.violation.value_counts(normalize=True)) 30 | 31 | Q3:- 32 | Create a DataFrame, female_and_speeding, that only includes female drivers who were stopped for speeding. 33 | Create a DataFrame, male_and_speeding, that only includes male drivers who were stopped for speeding. 34 | Count the stop outcomes for the female drivers and express them as proportions. 35 | Count the stop outcomes for the male drivers and express them as proportions. 36 | 37 | Solution:- 38 | # Create a DataFrame of female drivers stopped for speeding 39 | female_and_speeding = ri[(ri.driver_gender=='F') & (ri.violation=='Speeding')] 40 | 41 | # Create a DataFrame of male drivers stopped for speeding 42 | male_and_speeding = ri[(ri.driver_gender=='M') & (ri.violation=='Speeding')] 43 | 44 | # Compute the stop outcomes for female drivers (as proportions) 45 | print(female_and_speeding.stop_outcome.value_counts(normalize=True)) 46 | 47 | # Compute the stop outcomes for male drivers (as proportions) 48 | print(male_and_speeding.stop_outcome.value_counts(normalize=True)) 49 | 50 | Q4:- 51 | Check the data type of search_conducted to confirm that it's a Boolean Series. 52 | Calculate the search rate by counting the Series values and expressing them as proportions. 53 | Calculate the search rate by taking the mean of the Series. (It should match the proportion of True values calculated above.) 
54 | 55 | Solution:- 56 | # Check the data type of 'search_conducted' 57 | print(ri.search_conducted.dtype) 58 | 59 | # Calculate the search rate by counting the values 60 | print(ri.search_conducted.value_counts(normalize=True)) 61 | 62 | # Calculate the search rate by taking the mean 63 | print(ri.search_conducted.mean()) 64 | 65 | Q5:- 66 | Filter the DataFrame to only include female drivers, and then calculate the search rate by taking the mean of search_conducted. 67 | 68 | Solution:- 69 | # Calculate the search rate for female drivers 70 | print(ri[ri.driver_gender=='F'].search_conducted.mean()) 71 | 72 | Q6:- 73 | Filter the DataFrame to only include male drivers, and then repeat the search rate calculation. 74 | 75 | Solution:- 76 | # Calculate the search rate for male drivers 77 | print(ri[ri.driver_gender=='M'].search_conducted.mean()) 78 | 79 | Q7:- 80 | Group by driver gender to calculate the search rate for both groups simultaneously. (It should match the previous results.) 81 | 82 | Solution:- 83 | # Calculate the search rate for both groups simultaneously 84 | print(ri.groupby('driver_gender').search_conducted.mean()) 85 | 86 | Q8:- 87 | Use a .groupby() to calculate the search rate for each combination of gender and violation. Are males and females searched at about the same rate for each violation? 88 | 89 | Solution:- 90 | # Calculate the search rate for each combination of gender and violation 91 | print(ri.groupby(['driver_gender','violation']).search_conducted.mean()) 92 | 93 | Q9:- 94 | Reverse the ordering to group by violation before gender. The results may be easier to compare when presented this way. 95 | 96 | Solution:- 97 | # Reverse the ordering to group by violation before gender 98 | print(ri.groupby(['violation','driver_gender']).search_conducted.mean()) 99 | 100 | Q10:- 101 | Count the search_type values to see how many times "Protective Frisk" was the only search type. 102 | Create a new column, frisk, that is True if search_type contains the string "Protective Frisk" and False otherwise. 103 | Check the data type of frisk to confirm that it's a Boolean Series. 104 | Take the sum of frisk to count the total number of frisks. 105 | 106 | Solution:- 107 | # Count the 'search_type' values 108 | print(ri.search_type.value_counts()) 109 | 110 | # Check if 'search_type' contains the string 'Protective Frisk' 111 | ri['frisk'] = ri.search_type.str.contains('Protective Frisk', na=False) 112 | 113 | # Check the data type of 'frisk' 114 | print(ri.frisk.dtype) 115 | 116 | # Take the sum of 'frisk' 117 | print(ri.frisk.sum()) 118 | 119 | Q11:- 120 | Create a DataFrame, searched, that only contains rows in which search_conducted is True. 121 | Take the mean of the frisk column to find out what percentage of searches included a frisk. 122 | Calculate the frisk rate for each gender using a .groupby(). 123 | 124 | Solution:- 125 | # Create a DataFrame of stops in which a search was conducted 126 | searched = ri[ri.search_conducted == True] 127 | 128 | # Calculate the overall frisk rate by taking the mean of 'frisk' 129 | print(searched.frisk.mean()) 130 | 131 | # Calculate the frisk rate for each gender 132 | print(searched.groupby(['driver_gender']).frisk.mean()) 133 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Preparing the data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import pandas using the alias pd. 
3 | Read the file police.csv into a DataFrame named ri. 4 | Examine the first 5 rows of the DataFrame (known as the "head"). 5 | Count the number of missing values in each column: Use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column. 6 | 7 | Solution:- 8 | # Import the pandas library as pd 9 | import pandas as pd 10 | 11 | # Read 'police.csv' into a DataFrame named ri 12 | ri = pd.read_csv('police.csv') 13 | 14 | # Examine the head of the DataFrame 15 | print(ri.head()) 16 | 17 | # Count the number of missing values in each column 18 | print(ri.isnull().sum()) 19 | 20 | Q2:- 21 | Count the number of missing values in each column. (This has been done for you.) 22 | Examine the DataFrame's .shape to find out the number of rows and columns. 23 | Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings. 24 | Examine the .shape again to verify that there are now two fewer columns. 25 | 26 | Solution:- 27 | # Count the number of missing values in each column 28 | print(ri.isnull().sum()) 29 | 30 | # Examine the shape of the DataFrame 31 | print(ri.shape) 32 | 33 | # Drop the 'county_name' and 'state' columns 34 | ri.drop(['county_name', 'state'], axis='columns', inplace=True) 35 | 36 | # Examine the shape of the DataFrame (again) 37 | print(ri.shape) 38 | 39 | Q3:- 40 | Count the number of missing values in each column. 41 | Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna(). 42 | Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender. 43 | Examine the DataFrame's .shape to see how many rows and columns remain. 44 | 45 | Solution:- 46 | # Count the number of missing values in each column 47 | print(ri.isnull().sum()) 48 | 49 | # Drop all rows that are missing 'driver_gender' 50 | ri.dropna(subset=['driver_gender'], inplace=True) 51 | 52 | # Count the number of missing values in each column (again) 53 | print(ri.isnull().sum()) 54 | 55 | # Examine the shape of the DataFrame 56 | print(ri.shape) 57 | 58 | Q4:- 59 | Examine the head of the is_arrested column to verify that it contains True and False values. 60 | Check the current data type of is_arrested. 61 | Use the .astype() method to convert is_arrested to a bool column. 62 | Check the new data type of is_arrested, to confirm that it is now a bool column. 63 | 64 | Solution:- 65 | # Examine the head of the 'is_arrested' column 66 | print(ri.is_arrested.head()) 67 | 68 | # Check the data type of 'is_arrested' 69 | print(ri.is_arrested.dtype) 70 | 71 | # Change the data type of 'is_arrested' to 'bool' 72 | ri['is_arrested'] = ri.is_arrested.astype('bool') 73 | 74 | # Check the data type of 'is_arrested' (again) 75 | print(ri.is_arrested.dtype) 76 | 77 | Q5:- 78 | Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined. 79 | Convert combined to datetime format, and store the result in a new column named stop_datetime. 80 | Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column. 
81 | 82 | Solution:- 83 | # Concatenate 'stop_date' and 'stop_time' (separated by a space) 84 | combined = ri.stop_date.str.cat(ri.stop_time, sep=' ') 85 | 86 | # Convert 'combined' to datetime format 87 | ri['stop_datetime'] = pd.to_datetime(combined) 88 | 89 | # Examine the data types of the DataFrame 90 | print(ri.dtypes) 91 | 92 | Q6:- 93 | Set stop_datetime as the DataFrame index. 94 | Examine the index to verify that it is a DatetimeIndex. 95 | Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns. 96 | 97 | Solution:- 98 | # Set 'stop_datetime' as the index 99 | ri.set_index('stop_datetime', inplace=True) 100 | 101 | # Examine the index 102 | print(ri.index) 103 | 104 | # Examine the columns 105 | print(ri.columns) 106 | -------------------------------------------------------------------------------- /Python/Analyzing Police Activity with pandas/Visual exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Take the mean of the is_arrested column to calculate the overall arrest rate. 3 | Group by the hour attribute of the DataFrame index to calculate the hourly arrest rate. 4 | Save the hourly arrest rate Series as a new object, hourly_arrest_rate. 5 | 6 | Solution:- 7 | # Calculate the overall arrest rate 8 | print(ri.is_arrested.mean()) 9 | 10 | # Calculate the hourly arrest rate 11 | print(ri.groupby(ri.index.hour).is_arrested.mean()) 12 | 13 | # Save the hourly arrest rate 14 | hourly_arrest_rate = ri.groupby(ri.index.hour).is_arrested.mean() 15 | 16 | Q2:- 17 | Import matplotlib.pyplot using the alias plt. 18 | Create a line plot of hourly_arrest_rate using the .plot() method. 19 | Label the x-axis as 'Hour', label the y-axis as 'Arrest Rate', and title the plot 'Arrest Rate by Time of Day'. 20 | Display the plot using the .show() function. 21 | 22 | Solution:- 23 | # Import matplotlib.pyplot as plt 24 | import matplotlib.pyplot as plt 25 | 26 | # Create a line plot of 'hourly_arrest_rate' 27 | plt.plot(hourly_arrest_rate) 28 | 29 | # Add the xlabel, ylabel, and title 30 | plt.xlabel('Hour') 31 | plt.ylabel('Arrest Rate') 32 | plt.title('Arrest Rate by Time of Day') 33 | 34 | # Display the plot 35 | plt.show() 36 | 37 | Q3:- 38 | Calculate the annual rate of drug-related stops by resampling the drugs_related_stop column (on the 'A' frequency) and taking the mean. 39 | Save the annual drug rate Series as a new object, annual_drug_rate. 40 | Create a line plot of annual_drug_rate using the .plot() method. 41 | Display the plot using the .show() function. 42 | 43 | Solution:- 44 | # Calculate the annual rate of drug-related stops 45 | print(ri.drugs_related_stop.resample('A').mean()) 46 | 47 | # Save the annual rate of drug-related stops 48 | annual_drug_rate = ri.drugs_related_stop.resample('A').mean() 49 | 50 | # Create a line plot of 'annual_drug_rate' 51 | plt.plot(annual_drug_rate) 52 | 53 | # Display the plot 54 | plt.show() 55 | 56 | Q4:- 57 | Calculate the annual search rate by resampling the search_conducted column, and save the result as annual_search_rate. 58 | Concatenate annual_drug_rate and annual_search_rate along the columns axis, and save the result as annual. 59 | Create subplots of the drug and search rates from the annual DataFrame. 60 | Display the subplots. 
61 | 62 | Solution:- 63 | # Calculate and save the annual search rate 64 | annual_search_rate = ri.search_conducted.resample('A').mean() 65 | 66 | # Concatenate 'annual_drug_rate' and 'annual_search_rate' 67 | annual = pd.concat([annual_drug_rate,annual_search_rate], axis=1) 68 | 69 | # Create subplots from 'annual' 70 | annual.plot(subplots=True) 71 | 72 | # Display the subplots 73 | plt.show() 74 | 75 | Q5:- 76 | Create a frequency table from the district and violation columns using the pd.crosstab() function. 77 | Save the frequency table as a new object, all_zones. 78 | Select rows 'Zone K1' through 'Zone K3' from all_zones using the .loc[] accessor. 79 | Save the smaller table as a new object, k_zones. 80 | 81 | Solution:- 82 | # Create a frequency table of districts and violations 83 | print(pd.crosstab(ri.district,ri.violation)) 84 | 85 | # Save the frequency table as 'all_zones' 86 | all_zones = pd.crosstab(ri.district,ri.violation) 87 | 88 | # Select rows 'Zone K1' through 'Zone K3' 89 | print(all_zones.loc['Zone K1':'Zone K3']) 90 | 91 | # Save the smaller table as 'k_zones' 92 | k_zones = all_zones.loc['Zone K1':'Zone K3'] 93 | 94 | Q6:- 95 | Create a bar plot of k_zones. 96 | Display the plot and examine it. What do you notice about each of the zones? 97 | 98 | Solution:- 99 | # Create a bar plot of 'k_zones' 100 | k_zones.plot(kind='bar') 101 | 102 | # Display the plot 103 | plt.show() 104 | 105 | Q7:- 106 | Create a stacked bar plot of k_zones. 107 | Display the plot and examine it. Do you notice anything different about the data than you did previously? 108 | 109 | Solution:- 110 | # Create a stacked bar plot of 'k_zones' 111 | k_zones.plot(kind='bar',stacked=True) 112 | 113 | # Display the plot 114 | plt.show() 115 | 116 | Q8:- 117 | Print the unique values in the stop_duration column. (This has been done for you.) 118 | Create a dictionary called mapping that maps the stop_duration strings to the integers specified above. 119 | Convert the stop_duration strings to integers using the mapping, and store the results in a new column called stop_minutes. 120 | Print the unique values in the stop_minutes column, to verify that the durations were properly converted to integers. 121 | 122 | Solution:- 123 | # Print the unique values in 'stop_duration' 124 | print(ri.stop_duration.unique()) 125 | 126 | # Create a dictionary that maps strings to integers 127 | mapping = {'0-15 Min':8,'16-30 Min':23,'30+ Min':45} 128 | 129 | # Convert the 'stop_duration' strings to integers using the 'mapping' 130 | ri['stop_minutes'] = ri.stop_duration.map(mapping) 131 | 132 | # Print the unique values in 'stop_minutes' 133 | print(ri.stop_minutes.unique()) 134 | 135 | Q9:- 136 | For each value in the violation_raw column, calculate the mean number of stop_minutes that a driver is detained. 137 | Save the resulting Series as a new object, stop_length. 138 | Sort stop_length by its values, and then visualize it using a horizontal bar plot. 139 | Display the plot. 
140 | 141 | Solution:- 142 | # Calculate the mean 'stop_minutes' for each value in 'violation_raw' 143 | print(ri.groupby(['violation_raw']).stop_minutes.mean()) 144 | 145 | # Save the resulting Series as 'stop_length' 146 | stop_length = ri.groupby(['violation_raw']).stop_minutes.mean() 147 | 148 | # Sort 'stop_length' by its values and create a horizontal bar plot 149 | stop_length.sort_values().plot(kind='barh') 150 | 151 | # Display the plot 152 | plt.show() 153 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Combining data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Concatenate uber1, uber2, and uber3 together using pd.concat(). You'll have to pass the DataFrames in as a list. 3 | Print the shape and then the head of the concatenated DataFrame, row_concat. 4 | 5 | Solution:- 6 | # Concatenate uber1, uber2, and uber3: row_concat 7 | row_concat = pd.concat([uber1,uber2,uber3]) 8 | 9 | # Print the shape of row_concat 10 | print(row_concat.shape) 11 | 12 | # Print the head of row_concat 13 | print(row_concat.head()) 14 | 15 | Q2:- 16 | Concatenate ebola_melt and status_country column-wise into a single DataFrame called ebola_tidy. Be sure to specify axis=1 and to pass the two DataFrames in as a list. 17 | Print the shape and then the head of the concatenated DataFrame, ebola_tidy. 18 | 19 | Solution:- 20 | # Concatenate ebola_melt and status_country column-wise: ebola_tidy 21 | ebola_tidy = pd.concat([ebola_melt,status_country], axis=1) 22 | 23 | # Print the shape of ebola_tidy 24 | print(ebola_tidy.shape) 25 | 26 | # Print the head of ebola_tidy 27 | print(ebola_tidy.head()) 28 | 29 | 30 | Q3:- 31 | Import the glob module along with pandas (as its usual alias pd). 32 | Write a pattern to match all .csv files. 33 | Save all files that match the pattern using the glob() function within the glob module. That is, by using glob.glob(). 34 | Print the list of file names. This has been done for you. 35 | Read the second file in csv_files (i.e., index 1) into a DataFrame called csv2. 36 | Hit 'Submit Answer' to print the head of csv2. Does it look familiar? 37 | 38 | Solution:- 39 | # Import necessary modules 40 | import glob 41 | import pandas as pd 42 | 43 | # Write the pattern: pattern 44 | pattern = '*.csv' 45 | 46 | # Save all file matches: csv_files 47 | csv_files = glob.glob(pattern) 48 | 49 | # Print the file names 50 | print(csv_files) 51 | 52 | # Load the second file into a DataFrame: csv2 53 | csv2 = pd.read_csv(csv_files[1]) 54 | 55 | # Print the head of csv2 56 | print(csv2.head()) 57 | 58 | 59 | Q4:- 60 | Write a for loop to iterate though csv_files: 61 | In each iteration of the loop, read csv into a DataFrame called df. 62 | After creating df, append it to the list frames using the .append() method. 63 | Concatenate frames into a single DataFrame called uber. 64 | Hit 'Submit Answer' to see the head and shape of the concatenated DataFrame! 
65 | 66 | Solution:- 67 | # Create an empty list: frames 68 | frames = [] 69 | 70 | # Iterate over csv_files 71 | for csv in csv_files: 72 | 73 | # Read csv into a DataFrame: df 74 | df = pd.read_csv(csv) 75 | 76 | # Append df to frames 77 | frames.append(df) 78 | 79 | # Concatenate frames into a single DataFrame: uber 80 | uber = pd.concat(frames) 81 | 82 | # Print the shape of uber 83 | print(uber.shape) 84 | 85 | # Print the head of uber 86 | print(uber.head()) 87 | 88 | Q5:- 89 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited. 90 | Print the merged DataFrame o2o. 91 | 92 | Solution:- 93 | # Merge the DataFrames: o2o 94 | o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site') 95 | 96 | # Print o2o 97 | print(o2o) 98 | 99 | Q6:- 100 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited, exactly as you did in the previous exercise. 101 | Print the merged DataFrame and then hit 'Submit Answer' to see the different output produced by this merge! 102 | 103 | Solution:- 104 | # Merge the DataFrames: m2o 105 | m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site') 106 | 107 | # Print m2o 108 | print(m2o) 109 | 110 | Q7:- 111 | Merge the site and visited DataFrames on the 'name' column of site and 'site' column of visited, exactly as you did in the previous two exercises. Save the result as m2m. 112 | Merge the m2m and survey DataFrames on the 'ident' column of m2m and 'taken' column of survey. 113 | Hit 'Submit Answer' to print the first 20 lines of the merged DataFrame! 114 | 115 | Solution:- 116 | # Merge site and visited: m2m 117 | m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site') 118 | 119 | # Merge m2m and survey: m2m 120 | m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken') 121 | 122 | # Print the first 20 lines of m2m 123 | print(m2m.head(20)) 124 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Exploring your data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import pandas as pd. 3 | Read 'dob_job_application_filings_subset.csv' into a DataFrame called df. 4 | Print the head and tail of df. 5 | Print the shape of df and its columns. Note: .shape and .columns are attributes, not methods, so you don't need to follow these with parentheses (). 6 | Hit 'Submit Answer' to view the results! Notice the suspicious number of 0 values. Perhaps these represent missing data. 7 | 8 | Solution:- 9 | # Import pandas 10 | import pandas as pd 11 | 12 | # Read the file into a DataFrame: df 13 | df = pd.read_csv('dob_job_application_filings_subset.csv') 14 | 15 | # Print the head of df 16 | print(df.head()) 17 | 18 | # Print the tail of df 19 | print(df.tail()) 20 | 21 | # Print the shape of df 22 | print(df.shape) 23 | 24 | # Print the columns of df 25 | print(df.columns) 26 | 27 | # Print the head and tail of df_subset 28 | print(df_subset.head()) 29 | print(df_subset.tail()) 30 | 31 | Q2:- 32 | 33 | Print the info of df. 34 | Print the info of the subset dataframe, df_subset. 35 | 36 | Solution:- 37 | # Print the info of df 38 | print(df.info()) 39 | 40 | # Print the info of df_subset 41 | print(df_subset.info()) 42 | 43 | Q3:- 44 | Print the value counts for: 45 | The 'Borough' column. 46 | The 'State' column. 47 | The 'Site Fill' column. 
48 | 49 | Solution:- 50 | # Print the value counts for 'Borough' 51 | print(df['Borough'].value_counts(dropna=False)) 52 | 53 | # Print the value_counts for 'State' 54 | print(df['State'].value_counts(dropna=False)) 55 | 56 | # Print the value counts for 'Site Fill' 57 | print(df['Site Fill'].value_counts(dropna=False)) 58 | 59 | Q4:- 60 | Import matplotlib.pyplot as plt. 61 | Create a histogram of the 'Existing Zoning Sqft' column. Rotate the axis labels by 70 degrees and use a log scale for both axes. 62 | Display the histogram using plt.show(). 63 | 64 | Solution:- 65 | # Import matplotlib.pyplot 66 | import matplotlib.pyplot as plt 67 | 68 | # Plot the histogram 69 | df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True) 70 | 71 | # Display the histogram 72 | plt.show() 73 | 74 | Q5:- 75 | Using the .boxplot() method of df, create a boxplot of 'initial_cost' across the different values of 'Borough'. 76 | Display the plot. 77 | 78 | Solution:- 79 | # Import necessary modules 80 | import pandas as pd 81 | import matplotlib.pyplot as plt 82 | 83 | # Create the boxplot 84 | df.boxplot(column='initial_cost', by='Borough', rot=90) 85 | 86 | # Display the plot 87 | plt.show() 88 | 89 | Q6:- 90 | Using df, create a scatter plot (kind='scatter') with 'initial_cost' on the x-axis and the 'total_est_fee' on the y-axis. 91 | Rotate the x-axis labels by 70 degrees. 92 | Create another scatter plot exactly as above, substituting df_subset in place of df. 93 | 94 | Solution:- 95 | # Import necessary modules 96 | import pandas as pd 97 | import matplotlib.pyplot as plt 98 | 99 | # Create and display the first scatter plot 100 | df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70) 101 | plt.show() 102 | 103 | # Create and display the second scatter plot 104 | df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70) 105 | plt.show() 106 | -------------------------------------------------------------------------------- /Python/Cleaning Data in Python/Tidying data for analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Print the head of airquality. 3 | Use pd.melt() to melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows. 4 | Do this by using id_vars to specify the columns you do not wish to melt: 'Month' and 'Day'. 5 | Print the head of airquality_melt. 6 | 7 | Solution:- 8 | # Print the head of airquality 9 | print(airquality.head()) 10 | 11 | # Melt airquality: airquality_melt 12 | airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day']) 13 | 14 | # Print the head of airquality_melt 15 | print(airquality_melt.head()) 16 | 17 | Q2:- 18 | Print the head of airquality. 19 | Melt the Ozone, Solar.R, Wind, and Temp columns of airquality into rows, with the default variable column renamed to 'measurement' and the default value column renamed to 'reading'. You can do this by specifying, respectively, the var_name and value_name parameters. 20 | Print the head of airquality_melt. 21 | 22 | Solution:- 23 | # Print the head of airquality 24 | print(airquality.head()) 25 | 26 | # Melt airquality: airquality_melt 27 | airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading') 28 | 29 | # Print the head of airquality_melt 30 | print(airquality_melt.head()) 31 | 32 | Q3:- 33 | Print the head of airquality_melt. 
34 | Pivot airquality_melt by using .pivot_table() with the rows indexed by 'Month' and 'Day', the columns indexed by 'measurement', and the values populated with 'reading'. 35 | Print the head of airquality_pivot. 36 | 37 | Solution:- 38 | # Print the head of airquality_melt 39 | print(airquality_melt.head()) 40 | 41 | # Pivot airquality_melt: airquality_pivot 42 | airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading') 43 | 44 | # Print the head of airquality_pivot 45 | print(airquality_pivot.head()) 46 | 47 | Q4:- 48 | Print the index of airquality_pivot by accessing its .index attribute. This has been done for you. 49 | Reset the index of airquality_pivot using its .reset_index() method. 50 | Print the new index of airquality_pivot. 51 | Print the head of airquality_pivot. 52 | 53 | Solution:- 54 | # Print the index of airquality_pivot 55 | print(airquality_pivot.index) 56 | 57 | # Reset the index of airquality_pivot: airquality_pivot 58 | airquality_pivot = airquality_pivot.reset_index() 59 | 60 | # Print the new index of airquality_pivot 61 | print(airquality_pivot.index) 62 | 63 | # Print the head of airquality_pivot 64 | print(airquality_pivot.head()) 65 | 66 | Q5:- 67 | Pivot airquality_dup by using .pivot_table() with the rows indexed by 'Month' and 'Day', the columns indexed by 'measurement', and the values populated with 'reading'. Use np.mean for the aggregation function. 68 | Flatten airquality_pivot by resetting its index. 69 | Print the head of airquality_pivot and then the original airquality DataFrame to compare their structure. 70 | 71 | Solution:- 72 | # Pivot airquality_dup: airquality_pivot 73 | airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean) 74 | 75 | # Reset the index of airquality_pivot 76 | airquality_pivot = airquality_pivot.reset_index() 77 | 78 | # Print the head of airquality_pivot 79 | print(airquality_pivot.head()) 80 | 81 | # Print the head of airquality 82 | print(airquality.head()) 83 | 84 | Q6:- 85 | Melt tb keeping 'country' and 'year' fixed. 86 | Create a 'gender' column by slicing the first letter of the variable column of tb_melt. 87 | Create an 'age_group' column by slicing the rest of the variable column of tb_melt. 88 | Print the head of tb_melt 89 | 90 | Solution:- 91 | # Melt tb: tb_melt 92 | tb_melt = pd.melt(tb, id_vars=['country', 'year']) 93 | 94 | # Create the 'gender' column 95 | tb_melt['gender'] = tb_melt.variable.str[0] 96 | 97 | # Create the 'age_group' column 98 | tb_melt['age_group'] = tb_melt.variable.str[1:] 99 | 100 | # Print the head of tb_melt 101 | print(tb_melt.head()) 102 | 103 | Q7:- 104 | Create a column called 'str_split' by splitting the 'type_country' column of ebola_melt on '_'. Note that you will first have to access the str attribute of type_country before you can use .split(). 105 | Create a column called 'type' by using the .get() method to retrieve index 0 of the 'str_split' column of ebola_melt. 106 | Create a column called 'country' by using the .get() method to retrieve index 1 of the 'str_split' column of ebola_melt. 107 | Print the head of ebola. This has been done for you, so hit 'Submit Answer' to view the results! 
108 | 109 | Solution:- 110 | # Melt ebola: ebola_melt 111 | ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts') 112 | 113 | # Create the 'str_split' column 114 | ebola_melt['str_split'] = ebola_melt.type_country.str.split('_') 115 | 116 | # Create the 'type' column 117 | ebola_melt['type'] = ebola_melt.str_split.str.get(0) 118 | 119 | # Create the 'country' column 120 | ebola_melt['country'] = ebola_melt.str_split.str.get(1) 121 | 122 | # Print the head of ebola_melt 123 | print(ebola_melt.head()) 124 | 125 | -------------------------------------------------------------------------------- /Python/Conda Essentials/Installing Packages: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Importing Data in Python -Part 1/Introduction and flat files: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Open the file moby_dick.txt as read-only and store it in the variable file. Make sure to pass the filename enclosed in quotation marks ''. 3 | Print the contents of the file to the shell using the print() function. As Hugo showed in the video, you'll need to apply the method read() to the object file. 4 | Check whether the file is closed by executing print(file.closed). 5 | Close the file using the close() method. 6 | Check again that the file is closed as you did above. 7 | 8 | Solution:- 9 | # Open a file: file 10 | file = open("moby_dick.txt","r") 11 | 12 | # Print it 13 | print(file.read()) 14 | 15 | # Check whether file is closed 16 | print(file.closed) 17 | 18 | # Close file 19 | file.close() 20 | 21 | # Check whether file is closed 22 | print(file.closed) 23 | 24 | Q2:- 25 | Open moby_dick.txt using the with context manager and the variable file. 26 | Print the first three lines of the file to the shell by using readline() three times within the context manager. 27 | 28 | Solution:- 29 | # Read & print the first 3 lines 30 | with open('moby_dick.txt') as file: 31 | print(file.readline()) 32 | print(file.readline()) 33 | print(file.readline()) 34 | 35 | Q3:- 36 | Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the delimiter. 37 | Fill in the argument of print() to print the type of the object digits. Use the function type(). 38 | Execute the rest of the code to visualize one of the rows of the data. 39 | 40 | Solution:- 41 | # Import package 42 | import numpy as np 43 | 44 | # Assign filename to variable: file 45 | file = 'digits.csv' 46 | 47 | # Load file as array: digits 48 | digits = np.loadtxt(file, delimiter=',') 49 | 50 | # Print datatype of digits 51 | print(type(digits)) 52 | 53 | # Select and reshape a row 54 | im = digits[21, 1:] 55 | im_sq = np.reshape(im, (28, 28)) 56 | 57 | # Plot reshaped data (matplotlib.pyplot already loaded as plt) 58 | plt.imshow(im_sq, cmap='Greys', interpolation='nearest') 59 | plt.show() 60 | 61 | Q4:- 62 | Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns. 63 | Complete the argument of the print() call in order to print the entire array that you just imported. 
64 | 65 | Solution:- 66 | # Import numpy 67 | import numpy as np 68 | 69 | # Assign the filename: file 70 | file = 'digits_header.txt' 71 | 72 | # Load the data: data 73 | data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2]) 74 | 75 | # Print data 76 | print(data) 77 | 78 | Q5:- 79 | Complete the first call to np.loadtxt() by passing file as the first argument. 80 | Execute print(data[0]) to print the first element of data. 81 | Complete the second call to np.loadtxt(). The file you're importing is tab-delimited, the datatype is float, and you want to skip the first row. 82 | Print the 10th element of data_float by completing the print() command. Be guided by the previous print() call. 83 | Execute the rest of the code to visualize the data. 84 | 85 | Solution:- 86 | # Assign filename: file 87 | file = 'seaslug.txt' 88 | 89 | # Import file: data 90 | data = np.loadtxt(file, delimiter='\t', dtype=str) 91 | 92 | # Print the first element of data 93 | print(data[0]) 94 | 95 | # Import data as floats and skip the first row: data_float 96 | data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 97 | 98 | # Print the 10th element of data_float 99 | print(data_float[9]) 100 | 101 | # Plot a scatterplot of the data 102 | plt.scatter(data_float[:, 0], data_float[:, 1]) 103 | plt.xlabel('time (min.)') 104 | plt.ylabel('percentage of larvae') 105 | plt.show() 106 | 107 | Q6:- 108 | Import titanic.csv using the function np.recfromcsv() and assign it to the variable, d. You'll only need to pass file to it because it has the defaults delimiter=',' and names=True in addition to dtype=None! 109 | Run the remaining code to print the first three entries of the resulting array d. 110 | 111 | Solution:- 112 | # Assign the filename: file 113 | file = 'titanic.csv' 114 | 115 | # Import file using np.recfromcsv: d 116 | d = np.recfromcsv(file,delimiter=',',names=True,dtype=None) 117 | 118 | # Print out first three entries of d 119 | print(d[:3]) 120 | 121 | Q7:- 122 | Import the pandas package using the alias pd. 123 | Read titanic.csv into a DataFrame called df. The file name is already stored in the file object. 124 | In a print() call, view the head of the DataFrame. 125 | 126 | Solution:- 127 | # Import pandas as pd 128 | import pandas as pd 129 | 130 | # Assign the filename: file 131 | file = 'titanic.csv' 132 | 133 | # Read the file into a DataFrame: df 134 | df = pd.read_csv(file) 135 | 136 | # View the head of the DataFrame 137 | print(df.head()) 138 | 139 | Q8:- 140 | Import the first 5 rows of the file into a DataFrame using the function pd.read_csv() and assign the result to data. You'll need to use the arguments nrows and header (there is no header in this file). 141 | Build a numpy array from the resulting DataFrame in data and assign to data_array. 142 | Execute print(type(data_array)) to print the datatype of data_array. 143 | 144 | Solution:- 145 | # Assign the filename: file 146 | file = 'digits.csv' 147 | 148 | # Read the first 5 rows of the file into a DataFrame: data 149 | data = pd.read_csv(file,nrows=5,header=None) 150 | 151 | # Build a numpy array from the DataFrame: data_array 152 | data_array = data.values 153 | 154 | # Print the datatype of data_array to the shell 155 | print(type(data_array)) 156 | 157 | Q9:- 158 | Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(). 159 | comment takes characters that comments occur after in the file, which in this case is '#'. 
na_values takes a list of strings to recognize as NA/NaN,
160 | in this case the string 'Nothing'.
161 | Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
162 |
163 | Solution:-
164 | # Import matplotlib.pyplot as plt
165 | import matplotlib.pyplot as plt
166 |
167 | # Assign filename: file
168 | file = 'titanic_corrupt.txt'
169 |
170 | # Import file: data
171 | data = pd.read_csv(file, sep='\t', comment="#", na_values=["Nothing"])
172 |
173 | # Print the head of the DataFrame
174 | print(data.head())
175 |
176 | # Plot 'Age' variable in a histogram
177 | pd.DataFrame.hist(data[['Age']])
178 | plt.xlabel('Age (years)')
179 | plt.ylabel('count')
180 | plt.show()
181 |
--------------------------------------------------------------------------------
/Python/Importing Data in Python -Part 2/Diving deep into the Twitter API:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Import the package tweepy.
3 | Pass the parameters consumer_key and consumer_secret to the function tweepy.OAuthHandler().
4 | Complete the passing of OAuth credentials to the OAuth handler auth by applying to it the method set_access_token(),
5 | along with arguments access_token and access_token_secret.
6 |
7 | Solution:-
8 | # Import packages
9 | import tweepy, json
10 |
11 | # Store OAuth authentication credentials in relevant variables
12 | access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
13 | access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
14 | consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
15 | consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
16 |
17 | # Pass OAuth details to tweepy's OAuth handler
18 | auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
19 | auth.set_access_token(access_token, access_token_secret)
20 |
21 | Q2:-
22 | Create your Stream object with authentication by passing tweepy.Stream() the authentication handler auth and the Stream listener l;
23 | To filter Twitter streams,
24 | pass to the track argument in stream.filter() a list containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.
25 |
26 | Solution:-
27 | # Initialize Stream listener
28 | l = MyStreamListener()
29 |
30 | # Create your Stream object with authentication
31 | stream = tweepy.Stream(auth, l)
32 |
33 |
34 | # Filter Twitter Streams to capture data by the keywords:
35 | s = ['clinton', 'trump', 'sanders', 'cruz']
36 | stream.filter(track=s)
37 |
38 | Q3:-
39 | Assign the filename 'tweets.txt' to the variable tweets_data_path.
40 | Initialize tweets_data as an empty list to store the tweets in.
41 | Within the for loop initiated by for line in tweets_file:, load each tweet into a variable, tweet, using json.loads(),
42 | then append tweet to tweets_data using the append() method.
43 | Hit 'Submit Answer' and check out the keys of the first tweet dictionary printed to the shell.
44 | 45 | Solution:- 46 | # Import package 47 | import json 48 | 49 | # String of path to file: tweets_data_path 50 | tweets_data_path = 'tweets.txt' 51 | 52 | # Initialize empty list to store tweets: tweets_data 53 | tweets_data = [] 54 | 55 | # Open connection to file 56 | tweets_file = open(tweets_data_path, "r") 57 | 58 | # Read in tweets and store in list: tweets_data 59 | for line in tweets_file: 60 | tweet = json.loads(line) 61 | tweets_data.append(tweet) 62 | 63 | # Close connection to file 64 | tweets_file.close() 65 | 66 | # Print the keys of the first tweet dict 67 | print(tweets_data[0].keys()) 68 | 69 | Q4:- 70 | Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so, the first argument should be tweets_data, a list of dictionaries. 71 | The second argument to pd.DataFrame() is a list of the keys you wish to have as columns. Assign the result of the pd.DataFrame() call to df. 72 | 73 | Solution:- 74 | # Import package 75 | import pandas as pd 76 | 77 | # Build DataFrame of tweet texts and languages 78 | df = pd.DataFrame(tweets_data, columns=['text','lang']) 79 | 80 | # Print head of DataFrame 81 | print(df.head()) 82 | 83 | Q5:- 84 | Within the for loop for index, row in df.iterrows():, the code currently increases the value of clinton by 1 each time a tweet mentioning 'Clinton' is encountered; 85 | complete the code so that the same happens for trump, sanders and cruz. 86 | 87 | Solution:- 88 | # Initialize list to store tweet counts 89 | [clinton, trump, sanders, cruz] = [0, 0, 0, 0] 90 | 91 | # Iterate through df, counting the number of tweets in which 92 | # each candidate is mentioned 93 | for index, row in df.iterrows(): 94 | clinton += word_in_text('clinton', row['text']) 95 | trump += word_in_text('trump', row['text']) 96 | sanders += word_in_text('sanders', row['text']) 97 | cruz += word_in_text('cruz', row['text']) 98 | 99 | Q6:- 100 | Import both matplotlib.pyplot and seaborn using the aliases plt and sns, respectively. 101 | Complete the arguments of sns.barplot: the first argument should be the labels to appear on the x-axis; 102 | the second argument should be the list of the variables you wish to plot, as produced in the previous exercise. 103 | 104 | solution:- 105 | # Import packages 106 | import seaborn as sns 107 | import matplotlib.pyplot as plt 108 | 109 | # Set seaborn style 110 | sns.set(color_codes=True) 111 | 112 | # Create a list of labels:cd 113 | cd = ['clinton', 'trump', 'sanders', 'cruz'] 114 | 115 | # Plot histogram 116 | ax = sns.barplot(cd, [clinton, trump, sanders, cruz]) 117 | ax.set(ylabel="count") 118 | plt.show() 119 | 120 | Print the head of the DataFrame. 121 | -------------------------------------------------------------------------------- /Python/Importing Data in Python -Part 2/Interacting with APIs to import data from the web: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Load the JSON 'a_movie.json' into the variable json_data within the context provided by the with statement. 3 | To do so, use the function json.load() within the context manager. 4 | Use a for loop to print all key-value pairs in the dictionary json_data. 5 | Recall that you can access a value in a dictionary using the syntax: dictionary[key]. 
6 | 7 | Solution:- 8 | # Load JSON: json_data 9 | with open("a_movie.json") as json_file: 10 | json_data = json.load(json_file) 11 | 12 | # Print each key-value pair in json_data 13 | for k in json_data.keys(): 14 | print(k + ': ', json_data[k]) 15 | 16 | Q2:- 17 | Import the requests package. 18 | Assign to the variable url the URL of interest in order to query 'http://www.omdbapi.com' for the data corresponding to the movie The Social Network. The query string should have two arguments: apikey=ff21610b and t=social+network. You can combine them as follows: apikey=ff21610b&t=social+network. 19 | Print the text of the reponse object r by using its text attribute and passing the result to the print() function. 20 | 21 | Solution:- 22 | # Import requests package 23 | import requests 24 | 25 | # Assign URL to variable: url 26 | url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network' 27 | 28 | # Package the request, send the request and catch the response: r 29 | r = requests.get(url) 30 | 31 | # Print the text of the response 32 | print(r.text) 33 | 34 | Q3:- 35 | Pass the variable url to the requests.get() function in order to send the relevant request and catch the response, assigning the resultant response message to the variable r. 36 | Apply the json() method to the response object r and store the resulting dictionary in the variable json_data. 37 | Hit Submit Answer to print the key-value pairs of the dictionary json_data to the shell 38 | 39 | Solution:- 40 | # Import package 41 | import requests 42 | 43 | # Assign URL to variable: url 44 | url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network' 45 | 46 | # Package the request, send the request and catch the response: r 47 | r = requests.get(url) 48 | 49 | # Decode the JSON data into a dictionary: json_data 50 | json_data = r.json() 51 | 52 | # Print each key-value pair in json_data 53 | for k in json_data.keys(): 54 | print(k + ': ', json_data[k]) 55 | 56 | Q4:- 57 | Assign the relevant URL to the variable url. 58 | Apply the json() method to the response object r and store the resulting dictionary in the variable json_data. 59 | The variable pizza_extract holds the HTML of an extract from Wikipedia's Pizza page as a string; use the function print() to print this string to the shell. 60 | 61 | Solution:- 62 | # Import package 63 | import requests 64 | 65 | # Assign URL to variable: url 66 | url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza' 67 | 68 | # Package the request, send the request and catch the response: r 69 | r = requests.get(url) 70 | 71 | # Decode the JSON data into a dictionary: json_data 72 | json_data = r.json() 73 | 74 | # Print the Wikipedia page extract 75 | pizza_extract = json_data['query']['pages']['24768']['extract'] 76 | print(pizza_extract) 77 | -------------------------------------------------------------------------------- /Python/Intermediate-Python-for-Data-Science/Logic-ControlFlow-Filtering: -------------------------------------------------------------------------------- 1 | Q1:- 2 | In the editor on the right, write code to see if True equals False. 3 | Write Python code to check if -5 * 15 is not equal to 75. 4 | Ask Python whether the strings "pyscript" and "PyScript" are equal. 5 | What happens if you compare booleans and integers? Write code to see if True and 1 are equal. 
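A brief aside on the last question, since the behaviour can be surprising: in Python, bool is a subclass of int, so True compares equal to 1 and False to 0. This is standard language behaviour, not something specific to the exercise; a quick check:

# bool is a subclass of int, so True behaves like 1 and False like 0
print(isinstance(True, int))   # True
print(True == 1)               # True
print(False == 0)              # True
print(True + True)             # 2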
6 | 7 | Solution:- 8 | # Comparison of booleans 9 | print(True==False) 10 | 11 | # Comparison of integers 12 | print(-5 * 15 != 75) 13 | 14 | # Comparison of strings 15 | print("pyscript" == "PyScript") 16 | 17 | # Compare a boolean with an integer 18 | print(True == 1) 19 | 20 | Q2:- 21 | Write Python expressions, wrapped in a print() function, to check whether: 22 | x is greater than or equal to -10. x has already been defined for you. 23 | "test" is less than or equal to y. y has already been defined for you. 24 | True is greater than False. 25 | 26 | Solution:- 27 | # Comparison of integers 28 | x = -3 * 6 29 | print(x >= -10) 30 | 31 | # Comparison of strings 32 | y = "test" 33 | print("test" <= y) 34 | 35 | # Comparison of booleans 36 | print(True > False) 37 | 38 | Q3:- 39 | Using comparison operators, generate boolean arrays that answer the following questions: 40 | Which areas in my_house are greater than or equal to 18? 41 | You can also compare two Numpy arrays element-wise. Which areas in my_house are smaller than the ones in your_house? 42 | Make sure to wrap both commands in a print() statement, so that you can inspect the output. 43 | 44 | Solution:- 45 | # Create arrays 46 | import numpy as np 47 | my_house = np.array([18.0, 20.0, 10.75, 9.50]) 48 | your_house = np.array([14.0, 24.0, 14.25, 9.0]) 49 | 50 | # my_house greater than or equal to 18 51 | print(my_house >= 18) 52 | 53 | # my_house less than your_house 54 | print(my_house < your_house) 55 | 56 | Q4:- 57 | Write Python expressions, wrapped in a print() function, to check whether: 58 | my_kitchen is bigger than 10 and smaller than 18. 59 | my_kitchen is smaller than 14 or bigger than 17. 60 | double the area of my_kitchen is smaller than triple the area of your_kitchen 61 | 62 | Solution:- 63 | # Define variables 64 | my_kitchen = 18.0 65 | your_kitchen = 14.0 66 | 67 | # my_kitchen bigger than 10 and smaller than 18? 68 | print(my_kitchen > 10 and my_kitchen < 18) 69 | 70 | # my_kitchen smaller than 14 or bigger than 17? 71 | print(my_kitchen < 14 or my_kitchen > 17) 72 | 73 | # Double my_kitchen smaller than triple your_kitchen? 74 | print(my_kitchen*2 < your_kitchen*3) 75 | 76 | Q5:- 77 | Generate boolean arrays that answer the following questions: 78 | Which areas in my_house are greater than 18.5 or smaller than 10? 79 | Which areas are smaller than 11 in both my_house and your_house? Make sure to wrap both commands in print() statement, so that you can inspect the output. 80 | 81 | Solution:- 82 | # Create arrays 83 | import numpy as np 84 | my_house = np.array([18.0, 20.0, 10.75, 9.50]) 85 | your_house = np.array([14.0, 24.0, 14.25, 9.0]) 86 | 87 | # my_house greater than 18.5 or smaller than 10 88 | print(np.logical_or(my_house > 18.5,my_house < 10)) 89 | 90 | # Both my_house and your_house smaller than 11 91 | print(np.logical_and(my_house < 11, your_house < 11)) 92 | 93 | Q6;-Examine the if statement that prints out "Looking around in the kitchen." if room equals "kit". 94 | Write another if statement that prints out "big place!" if area is greater than 15. 95 | 96 | Solution:- 97 | # Define variables 98 | room = "kit" 99 | area = 14.0 100 | 101 | # if statement for room 102 | if room == "kit" : 103 | print("looking around in the kitchen.") 104 | 105 | # if statement for area 106 | if area>15: 107 | print("big place!") 108 | 109 | Q7:- 110 | Add an else statement to the second control structure so that "pretty small." is printed out if area > 15 evaluates to False. 
111 | 112 | Solution:- 113 | # Define variables 114 | room = "kit" 115 | area = 14.0 116 | 117 | # if-else construct for room 118 | if room == "kit" : 119 | print("looking around in the kitchen.") 120 | else : 121 | print("looking around elsewhere.") 122 | 123 | # if-else construct for area 124 | if area > 15 : 125 | print("big place!") 126 | else: 127 | print("pretty small.") 128 | 129 | Q8:- 130 | Add an elif to the second control structure such that "medium size, nice!" is printed out if area is greater than 10. 131 | 132 | Solution:- 133 | # Define variables 134 | room = "bed" 135 | area = 14.0 136 | 137 | # if-elif-else construct for room 138 | if room == "kit" : 139 | print("looking around in the kitchen.") 140 | elif room == "bed": 141 | print("looking around in the bedroom.") 142 | else : 143 | print("looking around elsewhere.") 144 | 145 | # if-elif-else construct for area 146 | if area > 15 : 147 | print("big place!") 148 | elif area > 10: 149 | print("medium size, nice!") 150 | else : 151 | print("pretty small.") 152 | 153 | Q9:- 154 | Extract the drives_right column as a Pandas Series and store it as dr. 155 | Use dr, a boolean Series, to subset the cars DataFrame. Store the resulting selection in sel. 156 | Print sel, and assert that drives_right is True for all observations. 157 | 158 | Solution:- 159 | # Import cars data 160 | import pandas as pd 161 | cars = pd.read_csv('cars.csv', index_col = 0) 162 | 163 | # Extract drives_right column as Series: dr 164 | dr = cars["drives_right"] 165 | 166 | # Use dr to subset cars: sel 167 | sel = cars[dr] 168 | # Print sel 169 | print(sel) 170 | 171 | Q10:- 172 | Convert the code on the right to a one-liner that calculates the variable sel as before. 173 | 174 | Solution:- 175 | # Import cars data 176 | import pandas as pd 177 | cars = pd.read_csv('cars.csv', index_col = 0) 178 | 179 | # Convert code to a one-liner 180 | 181 | sel = cars[cars['drives_right']] 182 | 183 | # Print sel 184 | print(sel) 185 | 186 | Q11:- 187 | Select the cars_per_cap column from cars as a Pandas Series and store it as cpc. 188 | Use cpc in combination with a comparison operator and 500. You want to end up with a boolean Series that's True if the corresponding country has a cars_per_cap of more than 500 and False otherwise. Store this boolean Series as many_cars. 189 | Use many_cars to subset cars, similar to what you did before. Store the result as car_maniac. 190 | Print out car_maniac to see if you got it right. 191 | 192 | Solution:- 193 | # Import cars data 194 | import pandas as pd 195 | cars = pd.read_csv('cars.csv', index_col = 0) 196 | 197 | # Create car_maniac: observations that have a cars_per_cap over 500 198 | cpc = cars["cars_per_cap"] 199 | many_cars = cpc > 500 200 | car_maniac = cars[many_cars] 201 | 202 | # Print car_maniac 203 | print(car_maniac) 204 | 205 | Q12:- 206 | Use the code sample above to create a DataFrame medium, that includes all the observations of cars that have a cars_per_cap between 100 and 500. 207 | Print out medium. 
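Before the course solution (which applies np.logical_and to the cars.csv data), here is a self-contained sketch of the same range filter on a small made-up DataFrame, with the equivalent pandas boolean-operator form shown alongside. The values and index labels are invented for illustration.

import numpy as np
import pandas as pd

# Made-up stand-in for the cars data
cars = pd.DataFrame({'cars_per_cap': [809, 731, 588, 18, 200, 70, 45]},
                    index=['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG'])

# Range filter with np.logical_and, as in the exercise
medium = cars[np.logical_and(cars['cars_per_cap'] > 100, cars['cars_per_cap'] < 500)]

# Equivalent filter using pandas' & operator (note the parentheses around each comparison)
medium_alt = cars[(cars['cars_per_cap'] > 100) & (cars['cars_per_cap'] < 500)]

print(medium)
print(medium.equals(medium_alt))   # True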
208 | 209 | Solution:- 210 | # Import cars data 211 | import pandas as pd 212 | cars = pd.read_csv('cars.csv', index_col = 0) 213 | 214 | # Import numpy, you'll need this 215 | import numpy as np 216 | 217 | # Create medium: observations with cars_per_cap between 100 and 500 218 | medium = cars[np.logical_and(cars["cars_per_cap"] >100, cars["cars_per_cap"] < 500)] 219 | 220 | 221 | 222 | # Print medium 223 | print(medium) 224 | 225 | 226 | -------------------------------------------------------------------------------- /Python/Intermediate-Python-for-Data-Science/loops: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create the variable offset with an initial value of 8. 3 | Code a while loop that keeps running as long as offset is not equal to 0. Inside the while loop: 4 | Print out the sentence "correcting...". 5 | Next, decrease the value of offset by 1. You can do this with offset = offset - 1. 6 | Finally, print out offset so you can see how it changes. 7 | 8 | Solution:- 9 | # Initialize offset 10 | offset = 8 11 | 12 | # Code the while loop 13 | while offset != 0: 14 | print("correcting...") 15 | offset = offset - 1 16 | print(offset) 17 | 18 | Q2:- 19 | Inside the while loop, replace offset = offset - 1 by an if-else statement: 20 | If offset > 0, you should decrease offset by 1. 21 | Else, you should increase offset by 1. 22 | If you've coded things correctly, hitting Submit Answer should work this time. 23 | 24 | Solution:- 25 | # Initialize offset 26 | offset = -6 27 | 28 | # Code the while loop 29 | while offset != 0 : 30 | print("correcting...") 31 | #offset = offset - 1 32 | if offset > 0: 33 | offset = offset - 1 34 | else: 35 | offset = offset + 1 36 | print(offset) 37 | 38 | Q3:- 39 | Write a for loop that iterates over all elements of the areas list and prints out every element separately. 40 | 41 | Solution:- 42 | # areas list 43 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 44 | 45 | # Code the for loop 46 | for i in areas: 47 | print(i) 48 | 49 | Q4:- 50 | Adapt the for loop in the sample code to use enumerate(). On each run, a line of the form "room x: y" should be printed, where x is the index of the list element and y is the actual list element, i.e. the area. 51 | Make sure to print out this exact string, with the correct spacing. 52 | 53 | Solution:- 54 | # areas list 55 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 56 | 57 | # Change for loop to use enumerate() 58 | for i,a in enumerate(areas) : 59 | print("room " + str(i) + ": " + str(a)) 60 | 61 | Q5:- 62 | Adapt the print() function in the for loop on the right so that the first printout becomes "room 1: 11.25", the second one "room 2: 18.0" and so on. 63 | 64 | Solution:- 65 | # areas list 66 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 67 | 68 | # Code the for loop 69 | for index, area in enumerate(areas) : 70 | print("room " + str(index+1) + ": " + str(area)) 71 | 72 | Q6:- 73 | Write a for loop that goes through each sublist of house and prints out the x is y sqm, where x is the name of the room and y is the area of the room. 74 | 75 | Solution:- 76 | # house list of lists 77 | house = [["hallway", 11.25], 78 | ["kitchen", 18.0], 79 | ["living room", 20.0], 80 | ["bedroom", 10.75], 81 | ["bathroom", 9.50]] 82 | 83 | # Build a for loop from scratch 84 | for i in house: 85 | print("the " + i[0] + " is " + str(i[1]) + " sqm") 86 | 87 | Q7:- 88 | Write a for loop that goes through each key:value pair of europe. 
On each iteration, "the capital of x is y" should be printed out, where x is the key and y is the value of the pair. 89 | Solution:- 90 | # Definition of dictionary 91 | europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn', 92 | 'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'australia':'vienna' } 93 | 94 | # Iterate over europe 95 | for key,value in europe.items(): 96 | print("the capital of " + key + " is " + str(value)) 97 | 98 | Q8:- 99 | Import the numpy package under the local alias np. 100 | Write a for loop that iterates over all elements in np_height and prints out "x inches" for each element, where x is the value in the array. 101 | Write a for loop that visits every element of the np_baseball array and prints it out. 102 | 103 | Solution:- 104 | # Import numpy as np 105 | import numpy as np 106 | 107 | # For loop over np_height 108 | for i in np_height: 109 | print(str(i) + " inches") 110 | 111 | # For loop over np_baseball 112 | for j in np.nditer(np_baseball): 113 | print(j) 114 | 115 | Q9:- 116 | Write a for loop that iterates over the rows of cars and on each iteration perform two print() calls: one to print out the row label and one to print out all of the rows contents. 117 | Solution:- 118 | # Import cars data 119 | import pandas as pd 120 | cars = pd.read_csv('cars.csv', index_col = 0) 121 | 122 | # Iterate over rows of cars 123 | for lab,i in cars.iterrows(): 124 | print(lab) 125 | print(i) 126 | 127 | Q10:- 128 | Adapt the code in the for loop such that the first iteration prints out "US: 809", the second iteration "AUS: 731", and so on. 129 | The output should be in the form "country: cars_per_cap". 130 | Make sure to print out this exact string, with the correct spacing. 131 | 132 | Solution:- 133 | # Import cars data 134 | import pandas as pd 135 | cars = pd.read_csv('cars.csv', index_col = 0) 136 | 137 | # Adapt for loop 138 | for lab, row in cars.iterrows() : 139 | print(lab + ": " + str(row['cars_per_cap'])) 140 | 141 | Q11:- 142 | Use a for loop to add a new column, named COUNTRY, that contains a uppercase version of the country names in the "country" column. You can use the string method upper() for this. 143 | To see if your code worked, print out cars. Don't indent this code, so that it's not part of the for loop. 144 | 145 | Solution:- 146 | # Import cars data 147 | import pandas as pd 148 | cars = pd.read_csv('cars.csv', index_col = 0) 149 | 150 | # Code for loop that adds COUNTRY column 151 | for lab,row in cars.iterrows(): 152 | cars.loc[lab,"COUNTRY"] = row["country"].upper() 153 | 154 | 155 | # Print cars 156 | print(cars) 157 | 158 | Q12:- 159 | Replace the for loop with a one-liner that uses .apply(str.upper). The call should give the same result: a column COUNTRY should be added to cars, containing an uppercase version of the country names. 160 | As usual, print out cars to see the fruits of your hard labor 161 | 162 | Solution:- 163 | # Import cars data 164 | import pandas as pd 165 | cars = pd.read_csv('cars.csv', index_col = 0) 166 | 167 | # Use .apply(str.upper) 168 | #for lab, row in cars.iterrows() : 169 | cars["COUNTRY"] = cars["country"].apply(str.upper) 170 | print(cars) 171 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Aggregate Functions: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Use the SUM function to get the total duration of all films. 
3 | -SELECT SUM(DURATION) 4 | FROM FILMS; 5 | 6 | Get the average duration of all films. 7 | -SELECT AVG(DURATION) 8 | FROM FILMS; 9 | 10 | Get the duration of the shortest film. 11 | -SELECT MIN(DURATION) 12 | FROM FILMS; 13 | 14 | Get the duration of the longest film. 15 | -SELECT MAX(DURATION) 16 | FROM FILMS; 17 | 18 | Q2:- 19 | Use the SUM function to get the total amount grossed by all films. 20 | -SELECT SUM(gross) 21 | FROM FILMS; 22 | 23 | Get the average amount grossed by all films. 24 | -SELECT AVG(gross) 25 | FROM FILMS; 26 | 27 | Get the amount grossed by the worst performing film. 28 | -SELECT MIN(gross) 29 | FROM FILMS; 30 | 31 | Get the amount grossed by the best performing film. 32 | -SELECT MAX(gross) 33 | FROM FILMS; 34 | 35 | Q3:- 36 | Use the SUM function to get the total amount grossed by all films made in the year 2000 or later. 37 | -SELECT SUM(gross) 38 | FROM FILMS 39 | WHERE release_year >= 2000; 40 | 41 | Get the average amount grossed by all films whose titles start with the letter 'A'. 42 | -SELECT AVG(gross) 43 | FROM FILMS 44 | WHERE title LIKE 'A%'; 45 | 46 | Get the amount grossed by the worst performing film in 1994. 47 | -SELECT MIN(gross) 48 | FROM FILMS 49 | WHERE release_year = 1994; 50 | 51 | Get the amount grossed by the best performing film between 2000 and 2012, inclusive. 52 | -SELECT MAX(gross) 53 | FROM films 54 | WHERE release_year BETWEEN 2000 AND 2012; 55 | 56 | Q4:- 57 | Get the title and net profit (the amount a film grossed, minus its budget) for all films. Alias the net profit as net_profit. 58 | -SELECT title,gross-budget AS net_profit 59 | FROM films; 60 | 61 | Get the title and duration in hours for all films. The duration is in minutes, so you'll need to divide by 60.0 to get the duration in hours. Alias the duration in hours as duration_hours. 62 | -SELECT title,duration/60.0 AS duration_hours 63 | FROM films; 64 | 65 | Get the average duration in hours for all films, aliased as avg_duration_hours. 66 | -SELECT AVG(duration/60.0) AS avg_duration_hours 67 | FROM films; 68 | 69 | Q5:- 70 | Get the percentage of people who are no longer alive. Alias the result as percentage_dead. Remember to use 100.0 and not 100! 71 | --- get the count(deathdate) and multiply by 100.0 72 | -- then divide by count(*) 73 | SELECT COUNT(deathdate)*100.0/COUNT(*) AS percentage_dead 74 | FROM people; 75 | 76 | Get the number of years between the newest film and oldest film. Alias the result as difference. 77 | -SELECT MAX(release_year) - MIN(release_year) AS difference 78 | FROM films; 79 | 80 | Get the number of decades the films table covers. Alias the result as number_of_decades. The top half of your fraction should be enclosed in parentheses. 81 | -SELECT (MAX(release_year) - MIN(release_year))/10.0 AS number_of_decades 82 | FROM films; 83 | 84 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Filtering rows: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get all details for all films released in 2016. 3 | -SELECT * 4 | FROM films 5 | WHERE release_year = 2016; 6 | 7 | Get the number of films released before 2000. 8 | -SELECT COUNT(*) 9 | FROM films 10 | WHERE release_year < 2000; 11 | 12 | Get the title and release year of films released after 2000. 13 | -SELECT title,release_year 14 | FROM films 15 | WHERE release_year > 2000; 16 | 17 | Q2:- 18 | Get all details for all French language films. 
19 | -SELECT * 20 | FROM films 21 | WHERE language='French'; 22 | 23 | Get the name and birth date of the person born on November 11th, 1974. Remember to use ISO date format ('1974-11-11')! 24 | -SELECT name,birthdate 25 | FROM people 26 | WHERE birthdate='1974-11-11'; 27 | 28 | Get the number of Hindi language films. 29 | -SELECT COUNT(*) 30 | FROM films 31 | WHERE language='Hindi'; 32 | 33 | Get all details for all films with an R certification. 34 | -SELECT * 35 | FROM films 36 | WHERE certification='R'; 37 | 38 | Q3:- 39 | Get the title and release year for all Spanish language films released before 2000. 40 | -SELECT title,release_year 41 | FROM films 42 | WHERE language='Spanish' 43 | AND release_year < 2000; 44 | 45 | Get all details for Spanish language films released after 2000. 46 | -SELECT * 47 | FROM films 48 | WHERE language='Spanish' 49 | AND release_year > 2000; 50 | 51 | Get all details for Spanish language films released after 2000, but before 2010. 52 | -SELECT * 53 | FROM films 54 | WHERE language='Spanish' 55 | AND release_year > 2000 56 | AND release_year < 2010; 57 | 58 | Q4:- 59 | Get the title and release year for films released in the 90s. 60 | -SELECT title,release_year 61 | FROM films 62 | WHERE release_year>='1990' 63 | AND release_year<'2000'; 64 | 65 | Now, build on your query to filter the records to only include French or Spanish language films. 66 | -SELECT title,release_year 67 | FROM films 68 | WHERE (release_year>='1990'AND release_year<'2000') 69 | AND (language='Spanish' OR language='French'); 70 | 71 | Finally, restrict the query to only return films that took in more than $2M gross. 72 | -SELECT title,release_year 73 | FROM films 74 | WHERE (release_year>='1990'AND release_year<'2000') 75 | AND (language='Spanish' OR language='French') 76 | AND gross > 2000000; 77 | 78 | Q5:- 79 | Get the title and release year of all films released between 1990 and 2000 (inclusive). 80 | -SELECT title,release_year 81 | FROM films 82 | WHERE release_year BETWEEN 1990 AND 2000; 83 | 84 | Now, build on your previous query to select only films that have budgets over $100 million 85 | -SELECT title,release_year 86 | FROM films 87 | WHERE release_year BETWEEN 1990 AND 2000 88 | AND budget >100000000; 89 | 90 | Now restrict the query to only return Spanish language films. 91 | -SELECT title,release_year 92 | FROM films 93 | WHERE release_year BETWEEN 1990 AND 2000 94 | AND budget >100000000 95 | AND language='Spanish'; 96 | 97 | Finally, modify to your previous query to include all Spanish language or French language films with the same criteria as before. Don't forget your parentheses! 98 | -SELECT title,release_year 99 | FROM films 100 | WHERE release_year BETWEEN 1990 AND 2000 101 | AND budget >100000000 102 | AND (language='Spanish' OR language='French'); 103 | 104 | Q6:- 105 | Get the title and release year of all films released in 1990 or 2000 that were longer than two hours. Remember, duration is in minutes! 106 | -SELECT title,release_year 107 | FROM films 108 | WHERE release_year IN (1990,2000) 109 | AND duration >120; 110 | 111 | Get the title and language of all films which were in English, Spanish, or French. 112 | -SELECT title,language 113 | FROM films 114 | WHERE language IN ('English','Spanish','French'); 115 | 116 | Get the title and certification of all films with an NC-17 or R certification. 
117 | -SELECT title,certification 118 | FROM films 119 | WHERE certification IN ('R','NC-17'); 120 | 121 | Q7:- 122 | Get the names of people who are still alive, i.e. whose death date is missing. 123 | -SELECT name 124 | FROM people 125 | WHERE deathdate IS NULL; 126 | 127 | Get the title of every film which doesn't have a budget associated with it. 128 | -SELECT title 129 | FROM films 130 | WHERE budget IS NULL; 131 | 132 | Get the number of films which don't have a language associated with them. 133 | -SELECT COUNT(*) 134 | FROM films 135 | WHERE language IS NULL; 136 | 137 | Q8:- 138 | Get the names of all people whose names begin with 'B'. The pattern you need is 'B%'. 139 | -SELECT name 140 | FROM people 141 | WHERE name LIKE 'B%'; 142 | 143 | Get the names of people whose names have 'r' as the second letter. The pattern you need is '_r%'. 144 | -SELECT name 145 | FROM people 146 | WHERE name LIKE '_r%'; 147 | 148 | Get the names of people whose names don't start with A. The pattern you need is 'A%'. 149 | -SELECT name 150 | FROM people 151 | WHERE name NOT LIKE 'A%'; 152 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Selecting columns: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Select the title column from the films table. 3 | - SELECT title FROM films; 4 | Select the release_year column from the films table. 5 | -SELECT release_year FROM films; 6 | Select the name of each person in the people table. 7 | -SELECT name FROM people; 8 | 9 | Q2:- 10 | Get the title of every film from the films table. 11 | -SELECT title FROM films; 12 | Get the title and release year for every film. 13 | -SELECT title,release_year FROM films; 14 | Get the title, release year and country for every film. 15 | -SELECT title,release_year,country FROM films; 16 | Get all columns from the films table. 17 | -SELECT * FROM films; 18 | 19 | Q3:- 20 | Get all the unique countries represented in the films table. 21 | -SELECT DISTINCT country FROM films; 22 | Get all the different film certifications from the films table. 23 | -SELECT DISTINCT certification FROM films; 24 | Get the different types of film roles from the roles table. 25 | -SELECT DISTINCT role FROM roles; 26 | 27 | Q4:- 28 | Count the number of rows in the people table. 29 | -SELECT COUNT(*) FROM people; 30 | Count the number of (non-missing) birth dates in the people table. 31 | -SELECT COUNT(birthdate) FROM people; 32 | Count the number of unique birth dates in the people table. 33 | -SELECT COUNT(DISTINCT birthdate) FROM people; 34 | Count the number of unique languages in the films table. 35 | -SELECT COUNT(DISTINCT language) FROM films; 36 | Count the number of unique countries in the films table. 37 | -SELECT COUNT(DISTINCT country) FROM films; 38 | -------------------------------------------------------------------------------- /Python/Intro to SQL for Data Science/Sorting, grouping and joins: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get the names of people from the people table, sorted alphabetically. 3 | -SELECT name 4 | FROM people 5 | ORDER BY name; 6 | 7 | Get the names of people, sorted by birth date. 8 | -SELECT name 9 | FROM people 10 | ORDER BY birthdate; 11 | 12 | Get the birth date and name for every person, in order of when they were born. 
13 | -SELECT birthdate,name 14 | FROM people 15 | ORDER BY birthdate; 16 | 17 | Q2:- 18 | Get the title of films released in 2000 or 2012, in the order they were released. 19 | -SELECT title 20 | FROM films 21 | WHERE release_year IN (2000,2012) 22 | ORDER BY release_year; 23 | 24 | Get all details for all films except those released in 2015 and order them by duration. 25 | -SELECT * 26 | FROM films 27 | WHERE release_year NOT IN (2015) 28 | ORDER BY duration; 29 | 30 | Get the title and gross earnings for movies which begin with the letter 'M' and order the results alphabetically. 31 | -SELECT title,gross 32 | FROM films 33 | WHERE title LIKE 'M%' 34 | ORDER BY title; 35 | 36 | Q3:- 37 | Get the IMDB score and film ID for every film from the reviews table, sorted from highest to lowest score. 38 | -SELECT imdb_score,film_id 39 | FROM reviews 40 | ORDER BY imdb_score DESC; 41 | 42 | Get the title for every film, in reverse order. 43 | -SELECT title 44 | FROM films 45 | ORDER BY title DESC; 46 | 47 | Get the title and duration for every film, in order of longest duration to shortest. 48 | -SELECT title,duration 49 | FROM films 50 | ORDER BY duration DESC; 51 | 52 | Q4:- 53 | Get the birth date and name of people in the people table, in order of when they were born and alphabetically by name. 54 | -SELECT birthdate,name 55 | FROM people 56 | ORDER BY birthdate,name; 57 | 58 | Get the release year, duration, and title of films ordered by their release year and duration. 59 | -SELECT release_year,duration,title 60 | FROM films 61 | ORDER BY release_year,duration; 62 | 63 | Get certifications, release years, and titles of films ordered by certification (alphabetically) and release year. 64 | -SELECT certification,release_year,title 65 | FROM films 66 | ORDER BY certification,release_year; 67 | 68 | Get the names and birthdates of people ordered by name and birth date. 69 | -SELECT name,birthdate 70 | FROM people 71 | ORDER BY name,birthdate; 72 | 73 | Q5:- 74 | Get the release year and count of films released in each year. 75 | -SELECT release_year,COUNT(*) 76 | FROM films 77 | GROUP BY release_year; 78 | 79 | Get the release year and average duration of all films, grouped by release year. 80 | -SELECT release_year,AVG(duration) 81 | FROM films 82 | GROUP BY release_year; 83 | 84 | Get the release year and largest budget for all films, grouped by release year. 85 | -SELECT release_year,MAX(budget) 86 | FROM films 87 | GROUP BY release_year; 88 | 89 | Get the IMDB score and count of film reviews grouped by IMDB score in the reviews table. 90 | -SELECT imdb_score,COUNT(film_id) 91 | FROM reviews 92 | GROUP BY imdb_score; 93 | 94 | Q6:- 95 | Get the release year and lowest gross earnings per release year. 96 | -SELECT release_year,MIN(gross) 97 | FROM films 98 | GROUP BY release_year; 99 | 100 | Get the language and the total gross amount made by films in each language. 101 | -SELECT language,SUM(gross) 102 | FROM films 103 | GROUP BY language; 104 | 105 | Get the country and total budget spent making movies in each country. 106 | -SELECT country,SUM(budget) 107 | FROM films 108 | GROUP BY country; 109 | 110 | Get the release year, country, and highest budget spent making a film for each year, for each country. Sort your results by release year and country. 111 | -SELECT release_year,country,MAX(budget) 112 | FROM films 113 | GROUP BY release_year,country 114 | ORDER BY release_year,country; 115 | 116 | Get the country, release year, and lowest amount grossed per release year per country.
Order your results by country and release year. 117 | -SELECT country,release_year,MIN(gross) 118 | FROM films 119 | GROUP BY release_year,country 120 | ORDER BY country,release_year; 121 | 122 | Q7:- 123 | Get the release year, budget and gross earnings for each film in the films table. 124 | -SELECT release_year,budget,gross 125 | FROM films; 126 | 127 | Modify your query so that only results after 1990 are included. 128 | -SELECT release_year,budget,gross 129 | FROM films 130 | WHERE release_year > 1990; 131 | 132 | Remove the budget and gross columns, and group your results by release year. 133 | -SELECT release_year 134 | FROM films 135 | WHERE release_year > 1990 136 | GROUP BY release_year; 137 | 144 | Modify your query to add in the average budget and average gross earnings for the results you have so far. Alias your results as avg_budget and avg_gross, respectively. 145 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 146 | FROM films 147 | WHERE release_year > 1990 148 | GROUP BY release_year; 149 | 150 | Modify your query so that only years with an average budget of greater than $60 million are included. 151 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 152 | FROM films 153 | WHERE release_year > 1990 154 | GROUP BY release_year 155 | HAVING AVG(budget) > 60000000; 156 | 157 | Finally, modify your query to order the results from highest average gross earnings to lowest. 158 | -SELECT release_year,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 159 | FROM films 160 | WHERE release_year > 1990 161 | GROUP BY release_year 162 | HAVING AVG(budget) > 60000000 163 | ORDER BY avg_gross DESC; 164 | 165 | Q8:- 166 | Get the country, average budget, and average gross take of countries that have made more than 10 films. Order the result by country name, and limit the number of results displayed to 5. You should alias the averages as avg_budget and avg_gross respectively. 167 | --- select country, average budget, average gross 168 | SELECT country,AVG(budget) AS avg_budget,AVG(gross) AS avg_gross 169 | 170 | -- from the films table 171 | FROM films 172 | -- group by country 173 | GROUP BY country 174 | -- where the country has a title count greater than 10 175 | HAVING COUNT(title) > 10 176 | -- order by country 177 | ORDER BY country 178 | -- limit to only show 5 results 179 | LIMIT 5; 180 | 181 | Joins:- 182 | SELECT title, imdb_score 183 | FROM films 184 | JOIN reviews 185 | ON films.id = reviews.film_id 186 | WHERE title = 'To Kill a Mockingbird'; 187 | 188 | 189 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Numpy-Statistics: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a numpy array np_height that is equal to the first column of np_baseball. 3 | Print out the mean of np_height. 4 | Print out the median of np_height. 5 | 6 | Solution:- 7 | # np_baseball is available 8 | 9 | # Import numpy 10 | import numpy as np 11 | 12 | # Create np_height from np_baseball 13 | np_height = np.array(np_baseball[:,0]) 14 | 15 | # Print out the mean of np_height 16 | print(np.mean(np_height)) 17 | 18 | # Print out the median of np_height 19 | print(np.median(np_height)) 20 | 21 | Q2:- 22 | The code to print out the mean height is already included.
Complete the code for the median height. Replace None with the correct code. 23 | Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code. 24 | Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. 25 | Replace None with the correct code. 26 | 27 | Solution:- 28 | # np_baseball is available 29 | 30 | # Import numpy 31 | import numpy as np 32 | 33 | # Print mean height (first column) 34 | avg = np.mean(np_baseball[:,0]) 35 | print("Average: " + str(avg)) 36 | 37 | # Print median height. Replace 'None' 38 | med = np.median(np_baseball[:,0]) 39 | print("Median: " + str(med)) 40 | 41 | # Print out the standard deviation on height. Replace 'None' 42 | stddev = np.std(np_baseball[:,0]) 43 | 44 | Q3:- 45 | The code to print out the mean height is already included. Complete the code for the median height. Replace None with the correct code. 46 | Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code. 47 | Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. 48 | Replace None with the correct code. 49 | 50 | Solution:- 51 | # np_baseball is available 52 | 53 | # Import numpy 54 | import numpy as np 55 | 56 | # Print mean height (first column) 57 | avg = np.mean(np_baseball[:,0]) 58 | print("Average: " + str(avg)) 59 | 60 | # Print median height. Replace 'None' 61 | med = np.median(np_baseball[:,0]) 62 | print("Median: " + str(med)) 63 | 64 | # Print out the standard deviation on height. Replace 'None' 65 | stddev = np.std(np_baseball[:,0]) 66 | print("Standard Deviation: " + str(stddev)) 67 | 68 | # Print out correlation between first and second column. Replace 'None' 69 | corr = np.corrcoef(np_baseball[:,0],np_baseball[:,1]) 70 | print("Correlation: " + str(corr)) 71 | 72 | Q4:- 73 | Convert heights and positions, which are regular lists, to numpy arrays. Call them np_heights and np_positions. 74 | Extract all the heights of the goalkeepers. You can use a little trick here: use np_positions == 'GK' as an index for np_heights. Assign the result to gk_heights. 75 | Extract all the heights of all the other players. This time use np_positions != 'GK' as an index for np_heights. Assign the result to other_heights. 76 | Print out the median height of the goalkeepers using np.median(). Replace None with the correct code. 77 | Do the same for the other players. Print out their median height. Replace None with the correct code. 78 | 79 | Solution:- 80 | # heights and positions are available as lists 81 | 82 | # Import numpy 83 | import numpy as np 84 | 85 | # Convert positions and heights to numpy arrays: np_positions, np_heights 86 | np_positions = np.array(positions) 87 | np_heights = np.array(heights) 88 | 89 | 90 | # Heights of the goalkeepers: gk_heights 91 | gk_heights = np_heights[np_positions == 'GK'] 92 | 93 | # Heights of the other players: other_heights 94 | other_heights = np_heights[np_positions != 'GK'] 95 | 96 | # Print out the median height of goalkeepers. Replace 'None' 97 | print("Median height of goalkeepers: " + str(np.median(gk_heights))) 98 | 99 | # Print out the median height of other players. 
Replace 'None' 100 | print("Median height of other players: " + str(np.median(other_heights))) 101 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Python-Basics: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Experiment in the IPython Shell; type 5 / 8, for example. 3 | Add another line of code to the Python script: print(7 + 10). 4 | 5 | Solution:- 6 | # Example, do not modify! 7 | print(5 / 8) 8 | 9 | # Put code below here 10 | print(7 + 10) 11 | 12 | Q2:- 13 | Suppose you have $100, which you can invest with a 10% return each year. After one year, it's 100 × 1.1 = 110 dollars, and after two years it's 100 × 1.1 × 1.1 = 121. 14 | Add code on the right to calculate how much money you end up with after 7 years. 15 | 16 | Solution:- 17 | # Addition and subtraction 18 | print(5 + 5) 19 | print(5 - 5) 20 | 21 | # Multiplication and division 22 | print(3 * 5) 23 | print(10 / 2) 24 | 25 | # Exponentiation 26 | print(4 ** 2) 27 | 28 | # Modulo 29 | print(18 % 7) 30 | 31 | # How much is your $100 worth after 7 years? 32 | print(100*1.1**7) 33 | 34 | Q3:- 35 | Calculate the product of savings and factor. Store the result in year1. 36 | What do you think the resulting type will be? Find out by printing out the type of year1. 37 | Calculate the sum of desc and desc and store the result in a new variable doubledesc. 38 | Print out doubledesc. Did you expect this? 39 | 40 | Solution:- 41 | # Several variables to experiment with 42 | savings = 100 43 | factor = 1.1 44 | desc = "compound interest" 45 | 46 | # Assign product of factor and savings to year1 47 | year1 = factor * savings 48 | 49 | # Print the type of year1 50 | print(type(year1)) 51 | 52 | # Assign sum of desc and desc to doubledesc 53 | doubledesc = desc + desc 54 | 55 | # Print out doubledesc 56 | print(doubledesc) 57 | 58 | Q4:- 59 | Fix the code on the right such that the printout runs without errors; use the function str() to convert the variables to strings. 60 | Convert the variable pi_string to a float and store this float as a new variable, pi_float. 61 | 62 | Solution:- 63 | # Definition of savings and result 64 | savings = 100 65 | result = 100 * 1.10 ** 7 66 | 67 | # Fix the printout 68 | print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!") 69 | 70 | # Definition of pi_string 71 | pi_string = "3.1415926" 72 | 73 | # Convert pi_string into float: pi_float 74 | pi_float = float(pi_string) 75 | -------------------------------------------------------------------------------- /Python/Intro-to-data-science/Python-Lists: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list, areas, that contains the area of the hallway (hall), kitchen (kit), living room (liv), bedroom (bed) and bathroom (bath), in this order. Use the predefined variables. 3 | Print areas with the print() function. 4 | 5 | Solution:- 6 | # area variables (in square meters) 7 | hall = 11.25 8 | kit = 18.0 9 | liv = 20.0 10 | bed = 10.75 11 | bath = 9.50 12 | 13 | # Create list areas 14 | areas = [hall,kit,liv,bed,bath] 15 | 16 | # Print areas 17 | print(areas) 18 | 19 | Q2:- 20 | Finish the line of code that creates the areas list such that the list first contains the name of each room as a string and then its area. More specifically, add the strings "hallway", "kitchen" and "bedroom" at the appropriate locations.
21 | Print areas again; is the printout more informative this time? 22 | 23 | Solution:- 24 | # area variables (in square meters) 25 | hall = 11.25 26 | kit = 18.0 27 | liv = 20.0 28 | bed = 10.75 29 | bath = 9.50 30 | 31 | # Adapt list areas 32 | areas = ["hallway",hall,"kitchen", kit, "living room", liv, "bedroom",bed, "bathroom", bath] 33 | 34 | # Print areas 35 | print(areas) 36 | 37 | Q3:- 38 | Finish the list of lists so that it also contains the bedroom and bathroom data. Make sure you enter these in order! 39 | Print out house; does this way of structuring your data make more sense? 40 | Print out the type of house. Are you still dealing with a list? 41 | 42 | Solution:- 43 | # area variables (in square meters) 44 | hall = 11.25 45 | kit = 18.0 46 | liv = 20.0 47 | bed = 10.75 48 | bath = 9.50 49 | 50 | # house information as list of lists 51 | house = [["hallway", hall], 52 | ["kitchen", kit], 53 | ["living room", liv], 54 | ["bedroom",bed], 55 | ["bathroom",bath]] 56 | 57 | # Print out house 58 | print(house) 59 | 60 | # Print out the type of house 61 | print(type(house)) 62 | 63 | Q4:- 64 | Print out the second element from the areas list, so 11.25. 65 | Subset and print out the last element of areas, being 9.50. Using a negative index makes sense here! 66 | Select the number representing the area of the living room and print it out. 67 | 68 | Solution:- 69 | # Create the areas list 70 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 71 | 72 | # Print out second element from areas 73 | print(areas[1]) 74 | 75 | # Print out last element from areas 76 | print(areas[-1]) 77 | 78 | # Print out the area of the living room 79 | print(areas[5]) 80 | 81 | Q5:- 82 | Using a combination of list subsetting and variable assignment, create a new variable, eat_sleep_area, that contains the sum of the area of the kitchen and the area of the bedroom. 83 | Print the new variable eat_sleep_area. 84 | 85 | Solution:- 86 | # Create the areas list 87 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 88 | 89 | # Sum of kitchen and bedroom area: eat_sleep_area 90 | eat_sleep_area = areas[3] + areas[7] 91 | 92 | # Print the variable eat_sleep_area 93 | print(eat_sleep_area) 94 | 95 | Q6:- 96 | Use slicing to create a list, downstairs, that contains the first 6 elements of areas. 97 | Do a similar thing to create a new variable, upstairs, that contains the last 4 elements of areas. 98 | Print both downstairs and upstairs using print(). 99 | 100 | Solution:- 101 | # Create the areas list 102 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 103 | 104 | # Use slicing to create downstairs 105 | downstairs = areas[:6] 106 | 107 | # Use slicing to create upstairs 108 | upstairs = areas[6:11] 109 | 110 | # Print out downstairs and upstairs 111 | print(downstairs) 112 | print(upstairs) 113 | 114 | Q7:- 115 | Use slicing to create the lists downstairs and upstairs again, but this time without using indexes if it's not necessary. 116 | Remember downstairs is the first 6 elements of areas and upstairs is the last 4 elements of areas. 
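A quick supplementary note on the slicing defaults the next solution relies on (general Python behaviour, shown on a throwaway list): omitting the start index means "from the beginning" and omitting the end index means "through the end".

nums = [0, 1, 2, 3, 4, 5]

print(nums[:3])    # [0, 1, 2]  - start defaults to 0
print(nums[3:])    # [3, 4, 5]  - end defaults to len(nums)
print(nums[:])     # a shallow copy of the whole list
print(nums[-2:])   # [4, 5]     - negative indices count from the end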
117 | 118 | Solution:- 119 | # Create the areas list 120 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 121 | 122 | # Alternative slicing to create downstairs 123 | downstairs = areas[:6] 124 | 125 | # Alternative slicing to create upstairs 126 | upstairs = areas[6:] 127 | 128 | Q8:- 129 | You did a miscalculation when determining the area of the bathroom; it's 10.50 square meters instead of 9.50. Can you make the changes? 130 | Make the areas list more trendy! Change "living room" to "chill zone". 131 | 132 | Solution:- 133 | # Create the areas list 134 | areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50] 135 | 136 | # Correct the bathroom area 137 | areas[-1] = 10.50 138 | 139 | # Change "living room" to "chill zone" 140 | areas[4] = "chill zone" 141 | 142 | Q9:- 143 | Use the + operator to paste the list ["poolhouse", 24.5] to the end of the areas list. Store the resulting list as areas_1. 144 | Further extend areas_1 by adding data on your garage. Add the string "garage" and float 15.45. Name the resulting list areas_2. 145 | 146 | Solution:- 147 | # Create the areas list and make some changes 148 | areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0, 149 | "bedroom", 10.75, "bathroom", 10.50] 150 | 151 | # Add poolhouse data to areas, new list is areas_1 152 | areas_1 = areas + ["poolhouse", 24.5] 153 | 154 | # Add garage data to areas_1, new list is areas_2 155 | areas_2 = areas_1 + ["garage", 15.45] 156 | 157 | Q10:- 158 | Change the second command, that creates the variable areas_copy, such that areas_copy is an explicit copy of areas 159 | Now, changes made to areas_copy shouldn't affect areas. Hit Submit Answer to check this. 160 | 161 | Solution:- 162 | # Create list areas 163 | areas = [11.25, 18.0, 20.0, 10.75, 9.50] 164 | 165 | # Create areas_copy 166 | areas_copy = list(areas) 167 | 168 | # Change areas_copy 169 | areas_copy[0] = 5.0 170 | 171 | # Print areas 172 | print(areas) 173 | 174 | Q11:- 175 | Use print() in combination with type() to print out the type of var1. 176 | Use len() to get the length of the list var1. Wrap it in a print() call to directly print it out. 177 | Use int() to convert var2 to an integer. Store the output as out2. 178 | 179 | Solution:- 180 | # Create variables var1 and var2 181 | var1 = [1, 2, 3, 4] 182 | var2 = True 183 | 184 | # Print out type of var1 185 | print(type(var1)) 186 | 187 | # Print out length of var1 188 | print(len(var1)) 189 | 190 | # Convert var2 to an integer: out2 191 | out2 = int(var2) 192 | 193 | Q12:- 194 | Use + to merge the contents of first and second into a new list: full. 195 | Call sorted() on full and specify the reverse argument to be True. Save the sorted list as full_sorted. 196 | Finish off by printing out full_sorted. 197 | 198 | Solution:- 199 | # Create lists first and second 200 | first = [11.25, 18.0, 20.0] 201 | second = [10.75, 9.50] 202 | 203 | # Paste together first and second: full 204 | full = first + second 205 | 206 | # Sort full in descending order: full_sorted 207 | full_sorted = sorted(full,reverse=True) 208 | 209 | # Print out full_sorted 210 | print(full_sorted) 211 | -------------------------------------------------------------------------------- /Python/Introduction to Databases in Python/Basics of Relational Databases: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import create_engine from the sqlalchemy module. 
3 | Using the create_engine() function, create an engine for a local file named census.sqlite with sqlite as the driver. Be sure to enclose the connection string within quotation marks. 4 | Print the output from the .table_names() method on the engine. 5 | 6 | Solution:- 7 | # Import create_engine 8 | from sqlalchemy import create_engine 9 | 10 | # Create an engine that connects to the census.sqlite file: engine 11 | engine = create_engine('sqlite:///census.sqlite') 12 | 13 | # Print table names 14 | print(engine.table_names()) 15 | 16 | Q2:- 17 | Import the Table object from sqlalchemy. 18 | Reflect the census table by using the Table object with the arguments: 19 | The name of the table as a string ('census'). 20 | The metadata, contained in the variable metadata. 21 | autoload=True 22 | The engine to autoload with - in this case, engine. 23 | Print the details of census using the repr() function. 24 | 25 | Solution:- 26 | # Import Table 27 | from sqlalchemy import Table 28 | 29 | # Reflect census table from the engine: census 30 | census = Table('census', metadata, autoload=True, autoload_with=engine) 31 | 32 | # Print census table metadata 33 | print(repr(census)) 34 | 35 | Q3:- 36 | Reflect the census table as you did in the previous exercise using the Table() function. 37 | Print a list of column names of the census table by applying the .keys() method to census.columns. 38 | Print the details of the census table using the metadata.tables dictionary along with the repr() function. To do this, first access the 'census' key of the metadata.tables dictionary, and place this inside the provided repr() function. 39 | 40 | Solution:- 41 | # Reflect the census table from the engine: census 42 | census = Table('census', metadata, autoload=True, autoload_with=engine) 43 | 44 | # Print the column names 45 | print(census.columns.keys()) 46 | 47 | # Print full table metadata 48 | print(repr(metadata.tables['census'])) 49 | 50 | Q3:- 51 | Build a SQL statement to query all the columns from census and store it in stmt. Note that your SQL statement must be a string. 52 | Use the .execute() and .fetchall() methods on connection and store the result in results. Remember that .execute() comes before .fetchall() and that stmt needs to be passed to .execute(). 53 | Print results. 54 | 55 | Solution:- 56 | # Build select statement for census table: stmt 57 | stmt = 'select * from census' 58 | 59 | # Execute the statement and fetch the results: results 60 | results = connection.execute(stmt).fetchall() 61 | 62 | # Print results 63 | print(results) 64 | 65 | Q4:- 66 | Import select from the sqlalchemy module. 67 | Reflect the census table. This code is already written for you. 68 | Create a query using the select() function to retrieve the census table. To do so, pass a list to select() containing a single element: census. 69 | Print stmt to see the actual SQL query being created. This code has been written for you. 70 | Using the provided print() function, print all the records from the census table. To do this: 71 | Use the .execute() method on connection with stmt as the argument to retrieve the ResultProxy. 72 | Use .fetchall() on connection.execute(stmt) to retrieve the ResultSet. 
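The exercises assume census has already been reflected and a connection is open. For readers who want something runnable end to end, here is a hedged sketch of the same select/execute/fetchall pattern against a throwaway in-memory SQLite database; the table layout and rows are invented, and the select([...]) list form matches the SQLAlchemy 1.x style used in the course (newer SQLAlchemy versions write select(census) instead).

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

# Throwaway in-memory database standing in for census.sqlite
engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

census = Table('census', metadata,
               Column('state', String(30)),
               Column('age', Integer()),
               Column('pop2008', Integer()))
metadata.create_all(engine)

connection = engine.connect()
connection.execute(census.insert(), [
    {'state': 'Illinois', 'age': 30, 'pop2008': 100},
    {'state': 'New York', 'age': 40, 'pop2008': 200},
])

# Same pattern as the exercise: build a select, execute it, fetch the rows
stmt = select([census])
print(stmt)                                  # shows the SQL being emitted
print(connection.execute(stmt).fetchall())   # rows print like tuples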
73 | 74 | Solution:- 75 | # Import select 76 | from sqlalchemy import select 77 | 78 | # Reflect census table via engine: census 79 | census = Table('census', metadata, autoload=True, autoload_with=engine) 80 | 81 | # Build select statement for census table: stmt 82 | stmt = select([census]) 83 | 84 | # Print the emitted statement to see the SQL emitted 85 | print(stmt) 86 | 87 | # Execute the statement and print the results 88 | print(connection.execute(stmt).fetchall()) 89 | 90 | Q5:- 91 | Extract the first row of results and assign it to the variable first_row. 92 | Print the value of the first column in first_row. 93 | Print the value of the 'state' column in first_row. 94 | 95 | Solution:- 96 | # Get the first row of the results by using an index: first_row 97 | first_row = results[0] 98 | 99 | # Print the first row of the results 100 | print(first_row) 101 | 102 | # Print the first column of the first row by using an index 103 | print(first_row[0]) 104 | 105 | # Print the 'state' column of the first row by using its name 106 | print(first_row['state']) 107 | -------------------------------------------------------------------------------- /Python/Introduction to Databases in Python/Putting it all together: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import create_engine and MetaData from sqlalchemy. 3 | Create an engine to the chapter 5 database by using 'sqlite:///chapter5.sqlite' as the connection string. 4 | Create a MetaData object as metadata. 5 | 6 | Solution:- 7 | # Import create_engine, MetaData 8 | from sqlalchemy import create_engine,MetaData 9 | 10 | # Define an engine to connect to chapter5.sqlite: engine 11 | engine = create_engine('sqlite:///chapter5.sqlite') 12 | 13 | # Initialize MetaData: metadata 14 | metadata = MetaData() 15 | 16 | Q2:- 17 | Import Table, Column, String, and Integer from sqlalchemy. 18 | Define a census table with the following columns: 19 | 'state' - String - length of 30 20 | 'sex' - String - length of 1 21 | 'age' - Integer 22 | 'pop2000' - Integer 23 | 'pop2008' - Integer 24 | Create the table in the database using the metadata and engine. 25 | 26 | Solution:- 27 | # Import Table, Column, String, and Integer 28 | from sqlalchemy import Table, Column,String,Integer 29 | 30 | # Build a census table: census 31 | census = Table('census', metadata, 32 | Column('state', String(30)), 33 | Column('sex', String(1)), 34 | Column('age',Integer()), 35 | Column('pop2000', Integer()), 36 | Column('pop2008', Integer())) 37 | 38 | # Create the table in the database 39 | metadata.create_all(engine) 40 | 41 | Q3:- 42 | Create an empty list called values_list. 43 | Iterate over the rows of csv_reader with a for loop, creating a dictionary called data for each row and append it to values_list. 44 | Within the for loop, row will be a list whose entries are 'state' , 'sex', 'age', 'pop2000' and 'pop2008' (in that order). 45 | 46 | Solution:- 47 | # Create an empty list: values_list 48 | values_list = [] 49 | 50 | # Iterate over the rows 51 | for row in csv_reader: 52 | # Create a dictionary with the values 53 | data = {'state': row[0], 'sex': row[1], 'age':row[2], 'pop2000': row[3], 54 | 'pop2008': row[4]} 55 | # Append the dictionary to the values list 56 | values_list.append(data) 57 | 58 | Q4:- 59 | Import insert from sqlalchemy. 60 | Build an insert statement for the census table. 61 | Execute the statement stmt along with values_list. You will need to pass them both as arguments to connection.execute(). 
62 | Print the rowcount attribute of results. 63 | 64 | Solution:- 65 | # Import insert 66 | from sqlalchemy import insert 67 | 68 | # Build insert statement: stmt 69 | stmt = insert(census) 70 | 71 | # Use values_list to insert data: results 72 | results = connection.execute(stmt, values_list) 73 | 74 | # Print rowcount 75 | print(results.rowcount) 76 | 77 | Q5:- 78 | Import select from sqlalchemy. 79 | Build a statement to: 80 | Select sex from the census table. 81 | Select the average age weighted by the population in 2008 (pop2008). See the example given in the assignment text to see how you can do this. Label this average age calculation as 'average_age'. 82 | Group the query by sex. 83 | Execute the query and store it as results. 84 | Loop over results and print the sex and average_age for each record. 85 | 86 | Solution:- 87 | # Import select 88 | from sqlalchemy import select 89 | 90 | # Calculate weighted average age: stmt 91 | stmt = select([census.columns.sex, 92 | (func.sum(census.columns.pop2008 * census.columns.age) / 93 | func.sum(census.columns.pop2008)).label('average_age') 94 | ]) 95 | 96 | # Group by sex 97 | stmt = stmt.group_by(census.columns.sex) 98 | 99 | # Execute the query and store the results: results 100 | results = connection.execute(stmt).fetchall() 101 | 102 | # Print the average age by sex 103 | for row in results: 104 | print(row.sex, row.average_age) 105 | 106 | Q6:- 107 | Import case, cast and Float from sqlalchemy. 108 | Define a statement to select state and the percentage of females in 2000. 109 | Inside func.sum(), use case() to select females (using the sex column) from pop2000. Remember to specify else_=0 if the sex is not 'F'. 110 | To get the percentage, divide the number of females in the year 2000 by the overall population in 2000. Cast the divisor - census.columns.pop2000 - to Float before multiplying by 100. 111 | Group the query by state. 112 | Execute the query and store it as results. 113 | Print state and percent_female for each record. This has been done for you, so hit 'Submit Answer' to see the result. 114 | 115 | Solution:- 116 | # import case, cast and Float from sqlalchemy 117 | from sqlalchemy import case, cast, Float 118 | 119 | # Build a query to calculate the percentage of females in 2000: stmt 120 | stmt = select([census.columns.state, 121 | (func.sum( 122 | case([ 123 | (census.columns.sex == 'F', census.columns.pop2000) 124 | ], else_=0)) / 125 | cast(func.sum(census.columns.pop2000), Float) * 100).label('percent_female') 126 | ]) 127 | 128 | # Group By state 129 | stmt = stmt.group_by(census.columns.state) 130 | 131 | # Execute the query and store the results: results 132 | results = connection.execute(stmt).fetchall() 133 | 134 | # Print the percentage 135 | for result in results: 136 | print(result.state, result.percent_female) 137 | 138 | Q7:- 139 | Build a statement to: 140 | Select state. 141 | Calculate the difference in population between 2008 (pop2008) and 2000 (pop2000). 142 | Group the query by census.columns.state using the .group_by() method on stmt. 143 | Order by 'pop_change' in descending order using the .order_by() method with the desc() function on 'pop_change'. 144 | Limit the query to the top 10 states using the .limit() method. 145 | Execute the query and store it as results. 146 | Print the state and the population change for each result. This has been done for you, so hit 'Submit Answer' to see the result! 
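One handy habit when chaining .group_by(), .order_by() and .limit() as this exercise does: printing a SQLAlchemy statement shows the SQL it will emit, which you can sanity-check before executing. Below is a small self-contained sketch; the census table is declared locally just to build the statement (no database connection is needed), and the select([...]) form again assumes SQLAlchemy 1.x.

from sqlalchemy import MetaData, Table, Column, Integer, String, select, desc

# Local table definition mirroring the exercise's census table
metadata = MetaData()
census = Table('census', metadata,
               Column('state', String(30)),
               Column('pop2000', Integer()),
               Column('pop2008', Integer()))

stmt = select([census.columns.state,
               (census.columns.pop2008 - census.columns.pop2000).label('pop_change')])
stmt = stmt.group_by(census.columns.state)
stmt = stmt.order_by(desc('pop_change'))
stmt = stmt.limit(10)

# Prints the SELECT ... GROUP BY ... ORDER BY ... LIMIT statement SQLAlchemy will run
print(stmt)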
147 | 148 | Solution:- 149 | # Build query to return state name and population difference from 2008 to 2000 150 | stmt = select([census.columns.state, 151 | (census.columns.pop2008-census.columns.pop2000).label('pop_change') 152 | ]) 153 | 154 | # Group by State 155 | stmt = stmt.group_by(census.columns.state) 156 | 157 | # Order by Population Change 158 | stmt = stmt.order_by(desc('pop_change')) 159 | 160 | # Limit to top 10 161 | stmt = stmt.limit(10) 162 | 163 | # Use connection to execute the statement and fetch all results 164 | results = connection.execute(stmt).fetchall() 165 | 166 | # Print the state and population change for each record 167 | for result in results: 168 | print('{}:{}'.format(result.state, result.pop_change)) 169 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Enforce data consistency with attribute constraints: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Execute the given sample code. 3 | As it doesn't work, have a look at the error message and correct the statement accordingly – then execute it again. 4 | 5 | Solution:- 6 | -- Let's add a record to the table 7 | INSERT INTO transactions (transaction_date, amount, fee) 8 | VALUES ('2018-09-24', 5454, '30'); 9 | 10 | -- Doublecheck the contents 11 | SELECT * 12 | FROM transactions; 13 | 14 | Q2:- 15 | Execute the given sample code. 16 | As it doesn't work, add an integer type cast at the right place and execute it again. 17 | 18 | Solution:- 19 | -- Calculate the net amount as amount + fee 20 | SELECT transaction_date, amount + cast(fee as integer) AS net_amount 21 | FROM transactions; 22 | 23 | Q3:- 24 | Have a look at the distinct university_shortname values and take note of the length of the strings. 25 | 26 | Solution:- 27 | -- Select the university_shortname column 28 | SELECT distinct(university_shortname) 29 | FROM professors; 30 | 31 | Q4:- 32 | Now specify a fixed-length character type with the correct length for university_shortname 33 | 34 | Solution:- 35 | -- Specify the correct fixed-length character type 36 | ALTER TABLE professors 37 | ALTER COLUMN university_shortname 38 | TYPE char(3); 39 | 40 | Q5:- 41 | Change the type of the firstname column to varchar(64) 42 | 43 | Solution:- 44 | -- Change the type of firstname 45 | alter table professors 46 | alter column firstname 47 | type varchar(64); 48 | 49 | Q5:- 50 | Run the sample code as is and take note of the error. 51 | Now use SUBSTRING() to reduce firstname to 16 characters so its type can be altered to varchar(16). 52 | 53 | Solution:- 54 | -- Convert the values in firstname to a max. of 16 characters 55 | ALTER TABLE professors 56 | ALTER COLUMN firstname 57 | TYPE varchar(16) 58 | using substring(firstname from 1 for 16) 59 | 60 | Q6:- 61 | Add a not-null constraint for the firstname column. 62 | 63 | Solution:- 64 | -- Disallow NULL values in firstname 65 | alter table professors 66 | ALTER COLUMN firstname SET NOT NULL; 67 | 68 | Q7:- 69 | Add a not-null constraint for the lastname column. 70 | 71 | Solution:- 72 | -- Disallow NULL values in lastname 73 | alter table professors 74 | alter column lastname set not null; 75 | 76 | Q8:- 77 | Add a unique constraint to the university_shortname column in universities. 
Give it the name university_shortname_unq 78 | 79 | Solution:- 80 | -- Make universities.university_shortname unique 81 | ALTER table universities 82 | ADD constraint university_shortname_unq UNIQUE(university_shortname); 83 | 84 | Q9:- 85 | Add a unique constraint to the organization column in organizations. Give it the name organization_unq 86 | 87 | Solution:- 88 | -- Make organizations.organization unique 89 | alter table organizations 90 | add constraint organization_unq unique(organization) 91 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Uniquely identify records with key constraints: -------------------------------------------------------------------------------- 1 | Q1:- 2 | First, find out the number of rows in universities. 3 | 4 | Solution:- 5 | -- Count the number of rows in universities 6 | SELECT count(*) 7 | FROM universities; 8 | 9 | Q2:- 10 | Then, find out how many unique values there are in the university_city column. 11 | 12 | Solution:- 13 | -- Count the number of distinct values in the university_city column 14 | SELECT count(distinct(university_city)) 15 | FROM universities; 16 | 17 | Q3:- 18 | Using the above steps, identify the candidate key by trying out different combination of columns. 19 | 20 | Solution:- 21 | -- Try out different combinations 22 | select COUNT(distinct(firstname,lastname)) 23 | FROM professors; 24 | 25 | Q4:- 26 | Rename the organization column to id in organizations. 27 | Make id a primary key and name it organization_pk. 28 | 29 | Solution:- 30 | -- Rename the organization column to id 31 | ALTER TABLE organizations 32 | RENAME COLUMN organization TO id; 33 | 34 | -- Make id a primary key 35 | ALTER TABLE organizations 36 | ADD CONSTRAINT organization_pk PRIMARY KEY (id); 37 | 38 | Q5:- 39 | Rename the university_shortname column to id in universities. 40 | Make id a primary key and name it university_pk. 41 | 42 | Solution:- 43 | -- Rename the university_shortname column to id 44 | alter table universities 45 | rename column university_shortname to id; 46 | 47 | -- Make id a primary key 48 | alter table universities 49 | add constraint university_pk primary key (id); 50 | 51 | Q6:- 52 | Add a new column id with data type serial to the professors table. 53 | 54 | Solution:- 55 | -- Add the new column to the table 56 | ALTER TABLE professors 57 | add column id serial; 58 | 59 | Q7:- 60 | Make id a primary key and name it professors_pkey 61 | 62 | solution:- 63 | -- Add the new column to the table 64 | ALTER TABLE professors 65 | ADD COLUMN id serial; 66 | 67 | -- Make id a primary key 68 | ALTER table professors 69 | add CONSTRAINT professors_pkey primary key (id); 70 | 71 | Q8:- 72 | Write a query that returns all the columns and 10 rows from professors. 73 | 74 | solution:- 75 | -- Add the new column to the table 76 | ALTER TABLE professors 77 | ADD COLUMN id serial; 78 | 79 | -- Make id a primary key 80 | ALTER TABLE professors 81 | ADD CONSTRAINT professors_pkey PRIMARY KEY (id); 82 | 83 | -- Have a look at the first 10 rows of professors 84 | select * from professors limit 10; 85 | 86 | Q9:- 87 | Count the number of distinct rows with a combination of the make and model columns. 88 | 89 | Solution:- 90 | -- Count the number of distinct rows with columns make, model 91 | select count(distinct(make,model)) 92 | FROM cars; 93 | 94 | Q10:- 95 | Add a new column id with the data type varchar(128). 
96 | 97 | Solution:- 98 | -- Count the number of distinct rows with columns make, model 99 | SELECT COUNT(DISTINCT(make, model)) 100 | FROM cars; 101 | 102 | -- Add the id column 103 | ALTER TABLE cars 104 | add column id varchar(128); 105 | 106 | Q11:- 107 | Concatenate make and model into id using an UPDATE query and the CONCAT() function. 108 | 109 | Solution:- 110 | -- Count the number of distinct rows with columns make, model 111 | SELECT COUNT(DISTINCT(make, model)) 112 | FROM cars; 113 | 114 | -- Add the id column 115 | ALTER TABLE cars 116 | ADD COLUMN id varchar(128); 117 | 118 | -- Update id with make + model 119 | UPDATE cars 120 | set id = concat(make, model); 121 | 122 | Q12:- 123 | Make id a primary key and name it id_pk 124 | 125 | Solution:- 126 | -- Count the number of distinct rows with columns make, model 127 | SELECT COUNT(DISTINCT(make, model)) 128 | FROM cars; 129 | 130 | -- Add the id column 131 | ALTER TABLE cars 132 | ADD COLUMN id varchar(128); 133 | 134 | -- Update id with make + model 135 | UPDATE cars 136 | SET id = CONCAT(make, model); 137 | 138 | -- Make id a primary key 139 | alter table cars 140 | add constraint id_pk primary key(id); 141 | 142 | -- Have a look at the table 143 | SELECT * FROM cars; 144 | 145 | Q13:- 146 | Given the above description of a student entity, create a table students with the correct column types. 147 | Add a primary key for the social security number. 148 | 149 | Solution:- 150 | -- Create the table 151 | create table students ( 152 | last_name varchar(128) not null, 153 | ssn integer primary key, 154 | phone_no char(12) 155 | ); 156 | -------------------------------------------------------------------------------- /Python/Introduction to Relational Databases in SQL/Your first database: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Get information on all table names in the current database, while limiting your query to the 'public' table_schema. 3 | 4 | Solution:- 5 | -- Query the right table in information_schema 6 | SELECT table_name 7 | FROM information_schema.tables 8 | -- Specify the correct table_schema value 9 | WHERE table_schema = 'public'; 10 | 11 | Q2:- 12 | Now have a look at the columns in university_professors by selecting all entries in information_schema.columns that correspond to that table. 13 | 14 | Solution:- 15 | -- Query the right table in information_schema to get columns 16 | SELECT column_name, data_type 17 | FROM information_schema.columns 18 | WHERE table_name = 'university_professors' AND table_schema = 'public'; 19 | 20 | Q3:- 21 | Finally, print the first five rows of the university_professors table. 22 | 23 | Solution:- 24 | -- Query the first five rows of our table 25 | select * 26 | from university_professors 27 | LIMIT 5; 28 | 29 | Q4:- 30 | Create a table professors with two text columns: firstname and lastname. 31 | 32 | Solution:- 33 | -- Create a table for the professors entity type 34 | CREATE TABLE professors ( 35 | firstname text, 36 | lastname text 37 | ); 38 | 39 | -- Print the contents of this table 40 | SELECT * 41 | FROM professors 42 | 43 | Q5:- 44 | Create a table universities with three text columns: university_shortname, university, and university_city. 
45 | 46 | Solution:- 47 | -- Create a table for the universities entity type 48 | create table universities( 49 | university_shortname text, 50 | university text, 51 | university_city text 52 | ); 53 | 54 | 55 | 56 | 57 | 58 | -- Print the contents of this table 59 | SELECT * 60 | FROM universities 61 | 62 | Q6:- 63 | Alter professors to add the text column university_shortname. 64 | 65 | Solution:- 66 | -- Add the university_shortname column 67 | alter table professors 68 | add column university_shortname text; 69 | 70 | -- Print the contents of this table 71 | SELECT * 72 | FROM professors 73 | 74 | Q7:- 75 | Rename the organisation column to organization in affiliations. 76 | 77 | Solution:- 78 | -- Rename the organisation column 79 | ALTER TABLE affiliations 80 | RENAME column organisation TO organization; 81 | 82 | Q8:- 83 | Delete the university_shortname column in affiliations. 84 | 85 | Solution:- 86 | -- Rename the organisation column 87 | ALTER TABLE affiliations 88 | RENAME COLUMN organisation TO organization; 89 | 90 | -- Delete the university_shortname column 91 | alter table affiliations 92 | drop column university_shortname; 93 | 94 | Q9:- 95 | Insert all DISTINCT professors from university_professors into professors. 96 | Print all the rows in professors. 97 | 98 | Solution:- 99 | -- Insert unique professors into the new table 100 | insert into professors 101 | SELECT DISTINCT firstname, lastname, university_shortname 102 | FROM university_professors; 103 | 104 | -- Doublecheck the contents of professors 105 | SELECT * 106 | FROM professors; 107 | 108 | Q10:- 109 | Insert all DISTINCT affiliations into affiliations. 110 | 111 | Solution:- 112 | -- Insert unique affiliations into the new table 113 | INSERT INTO affiliations 114 | SELECT DISTINCT firstname, lastname, function, organization 115 | FROM university_professors; 116 | 117 | -- Doublecheck the contents of affiliations 118 | SELECT * 119 | FROM affiliations; 120 | 121 | Q11:- 122 | Delete the university_professors table. 123 | 124 | Solution:- 125 | -- Delete the university_professors table 126 | drop table university_professors; 127 | -------------------------------------------------------------------------------- /Python/Introduction to Shell for Data Science/Manipulating files and directories: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Machine Learning with the Experts: School Budgets/Exploring the raw data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Print summary statistics of the numeric columns in the DataFrame df using the .describe() method. 3 | Import matplotlib.pyplot as plt. 4 | Create a histogram of the non-null 'FTE' column. You can do this by passing df['FTE'].dropna() to plt.hist(). 5 | The title has been specified and axes have been labeled, so hit 'Submit Answer' to see how often school employees work full-time! 
6 | 7 | Solution:- 8 | 9 | # Print the summary statistics 10 | print(df.describe()) 11 | 12 | # Import matplotlib.pyplot as plt 13 | import matplotlib.pyplot as plt 14 | 15 | # Create the histogram 16 | plt.hist(df['FTE'].dropna()) 17 | 18 | # Add title and labels 19 | plt.title('Distribution of %full-time \n employee works') 20 | plt.xlabel('% of full-time') 21 | plt.ylabel('num employees') 22 | 23 | # Display the histogram 24 | plt.show() 25 | 26 | Q2:- 27 | Define the lambda function categorize_label to convert column x into x.astype('category'). 28 | Use the LABELS list provided to convert the subset of data df[LABELS] to categorical types using the .apply() method and categorize_label. Don't forget axis=0. 29 | Print the converted .dtypes attribute of df[LABELS] 30 | 31 | Solution:- 32 | # Define the lambda function: categorize_label 33 | categorize_label = lambda x: x.astype('category') 34 | 35 | # Convert df[LABELS] to a categorical type 36 | df[LABELS] = df[LABELS].apply(categorize_label,axis=0) 37 | 38 | # Print the converted dtypes 39 | print(df[LABELS].dtypes) 40 | 41 | Q3:- 42 | Create the DataFrame num_unique_labels by using the .apply() method on df[LABELS] with pd.Series.nunique as the argument. 43 | Create a bar plot of num_unique_labels using pandas' .plot(kind='bar') method. 44 | The axes have been labeled for you, so hit 'Submit Answer' to see the number of unique values for each label. 45 | 46 | Solution:- 47 | # Import matplotlib.pyplot 48 | import matplotlib.pyplot as plt 49 | 50 | # Calculate number of unique values for each label: num_unique_labels 51 | num_unique_labels = df[LABELS].apply(pd.Series.nunique) 52 | 53 | # Plot number of unique values for each label 54 | num_unique_labels.plot(kind='bar') 55 | 56 | # Label the axes 57 | plt.xlabel('Labels') 58 | plt.ylabel('Number of unique values') 59 | 60 | # Display the plot 61 | plt.show() 62 | 63 | Q4:- 64 | Using the compute_log_loss() function, compute the log loss for the following predicted values (in each case, the actual values are contained in actual_labels): 65 | correct_confident. 66 | correct_not_confident. 67 | wrong_not_confident. 68 | wrong_confident. 69 | actual_labels. 
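Note: compute_log_loss() is supplied by the course and is never defined in this file. A plausible sketch of such a helper, for context only (the clipping constant eps is an assumption, not necessarily the course's exact value):

import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss between predicted probabilities and actual labels.

    Predictions are clipped away from 0 and 1 so the logarithm stays finite.
    (Hypothetical helper; the course supplies its own implementation.)
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))
    return loss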
70 | 71 | Solution:- 72 | # Compute and print log loss for 1st case 73 | correct_confident_loss = compute_log_loss(correct_confident, actual_labels) 74 | print("Log loss, correct and confident: {}".format(correct_confident_loss)) 75 | 76 | # Compute log loss for 2nd case 77 | correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels) 78 | print("Log loss, correct and not confident: {}".format(correct_not_confident_loss)) 79 | 80 | # Compute and print log loss for 3rd case 81 | wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels) 82 | print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss)) 83 | 84 | # Compute and print log loss for 4th case 85 | wrong_confident_loss = compute_log_loss(wrong_confident,actual_labels) 86 | print("Log loss, wrong and confident: {}".format(wrong_confident_loss)) 87 | 88 | # Compute and print log loss for actual labels 89 | actual_labels_loss = compute_log_loss(actual_labels, actual_labels) 90 | print("Log loss, actual labels: {}".format(actual_labels_loss)) 91 | -------------------------------------------------------------------------------- /Python/Machine Learning with the Experts: School Budgets/Learning from the experts: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create text_vector by preprocessing X_train using combine_text_columns. This is important, or else you won't get any tokens! 3 | Instantiate CountVectorizer as text_features. Specify the keyword argument token_pattern=TOKENS_ALPHANUMERIC. 4 | Fit text_features to the text_vector. 5 | 6 | Solution:- 7 | # Import the CountVectorizer 8 | from sklearn.feature_extraction.text import CountVectorizer 9 | 10 | # Create the text vector 11 | text_vector = combine_text_columns(X_train) 12 | 13 | # Create the token pattern: TOKENS_ALPHANUMERIC 14 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 15 | 16 | # Instantiate the CountVectorizer: text_features 17 | text_features = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC) 18 | 19 | # Fit text_features to the text vector 20 | text_features.fit(text_vector) 21 | 22 | # Print the first 10 tokens 23 | print(text_features.get_feature_names()[:10]) 24 | 25 | Q2:- 26 | Import CountVectorizer from sklearn.feature_extraction.text. 27 | Add a CountVectorizer step to the pipeline with the name 'vectorizer'. 28 | Set the token pattern to be TOKENS_ALPHANUMERIC. 29 | Set the ngram_range to be (1, 2). 
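Note: combine_text_columns() is written in an earlier chapter of the course and only called here. A sketch of what such a helper typically does (illustrative, not the exact course code; the course version defaults to_drop to NUMERIC_COLUMNS + LABELS, which are also course-provided names):

def combine_text_columns(data_frame, to_drop=()):
    """Combine all text columns of data_frame into a single
    space-separated string per row (sketch of the course helper)."""
    # Drop any requested non-text columns that actually exist in the frame
    to_drop = set(to_drop) & set(data_frame.columns.tolist())
    text_data = data_frame.drop(to_drop, axis=1)

    # Replace NaNs with empty strings so the join below does not fail
    text_data.fillna('', inplace=True)

    # Join all text items in a row, separated by spaces
    return text_data.apply(lambda x: " ".join(x), axis=1)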
30 | 31 | Solution:- 32 | # Import pipeline 33 | from sklearn.pipeline import Pipeline 34 | 35 | # Import classifiers 36 | from sklearn.linear_model import LogisticRegression 37 | from sklearn.multiclass import OneVsRestClassifier 38 | 39 | # Import CountVectorizer 40 | from sklearn.feature_extraction.text import CountVectorizer 41 | 42 | # Import other preprocessing modules 43 | from sklearn.preprocessing import Imputer 44 | from sklearn.feature_selection import chi2, SelectKBest 45 | 46 | # Select 300 best features 47 | chi_k = 300 48 | 49 | # Import functional utilities 50 | from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler 51 | from sklearn.pipeline import FeatureUnion 52 | 53 | # Perform preprocessing 54 | get_text_data = FunctionTransformer(combine_text_columns, validate=False) 55 | get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False) 56 | 57 | # Create the token pattern: TOKENS_ALPHANUMERIC 58 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 59 | 60 | # Instantiate pipeline: pl 61 | pl = Pipeline([ 62 | ('union', FeatureUnion( 63 | transformer_list = [ 64 | ('numeric_features', Pipeline([ 65 | ('selector', get_numeric_data), 66 | ('imputer', Imputer()) 67 | ])), 68 | ('text_features', Pipeline([ 69 | ('selector', get_text_data), 70 | ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 71 | ngram_range=(1,2))), 72 | ('dim_red', SelectKBest(chi2, chi_k)) 73 | ])) 74 | ] 75 | )), 76 | ('scale', MaxAbsScaler()), 77 | ('clf', OneVsRestClassifier(LogisticRegression())) 78 | ]) 79 | 80 | Q3:- 81 | Add the interaction terms step using SparseInteractions() with degree=2. Give it a name of 'int', and make sure it is after the preprocessing step but before scaling. 82 | 83 | Solution:- 84 | # Instantiate pipeline: pl 85 | pl = Pipeline([ 86 | ('union', FeatureUnion( 87 | transformer_list = [ 88 | ('numeric_features', Pipeline([ 89 | ('selector', get_numeric_data), 90 | ('imputer', Imputer()) 91 | ])), 92 | ('text_features', Pipeline([ 93 | ('selector', get_text_data), 94 | ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 95 | ngram_range=(1, 2))), 96 | ('dim_red', SelectKBest(chi2, chi_k)) 97 | ])) 98 | ] 99 | )), 100 | ('int', SparseInteractions(degree=2)), 101 | ('scale', MaxAbsScaler()), 102 | ('clf', OneVsRestClassifier(LogisticRegression())) 103 | ]) 104 | 105 | Q4:- 106 | Import HashingVectorizer from sklearn.feature_extraction.text. 107 | Instantiate the HashingVectorizer as hashing_vec using the TOKENS_ALPHANUMERIC pattern. 108 | Fit and transform hashing_vec using text_data. Save the result as hashed_text. 109 | Hit 'Submit Answer' to see some of the resulting hash values. 110 | 111 | Solution:- 112 | # Import HashingVectorizer 113 | from sklearn.feature_extraction.text import HashingVectorizer 114 | 115 | # Get text data: text_data 116 | text_data = combine_text_columns(X_train) 117 | 118 | # Create the token pattern: TOKENS_ALPHANUMERIC 119 | TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)' 120 | 121 | # Instantiate the HashingVectorizer: hashing_vec 122 | hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC) 123 | 124 | # Fit and transform the Hashing Vectorizer 125 | hashed_text = hashing_vec.fit_transform(text_data) 126 | 127 | # Create DataFrame and print the head 128 | hashed_df = pd.DataFrame(hashed_text.data) 129 | print(hashed_df.head()) 130 | 131 | Q5:- 132 | Import HashingVectorizer from sklearn.feature_extraction.text. 133 | Add a HashingVectorizer step to the pipeline. 
134 | Name the step 'vectorizer'. 135 | Use the TOKENS_ALPHANUMERIC token pattern. 136 | Specify the ngram_range to be (1, 2) 137 | 138 | Solution:- 139 | # Import the hashing vectorizer 140 | from sklearn.feature_extraction.text import HashingVectorizer 141 | 142 | # Instantiate the winning model pipeline: pl 143 | pl = Pipeline([ 144 | ('union', FeatureUnion( 145 | transformer_list = [ 146 | ('numeric_features', Pipeline([ 147 | ('selector', get_numeric_data), 148 | ('imputer', Imputer()) 149 | ])), 150 | ('text_features', Pipeline([ 151 | ('selector', get_text_data), 152 | ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, 153 | non_negative=True, norm=None, binary=False, 154 | ngram_range=(1,2))), 155 | ('dim_red', SelectKBest(chi2, chi_k)) 156 | ])) 157 | ] 158 | )), 159 | ('int', SparseInteractions(degree=2)), 160 | ('scale', MaxAbsScaler()), 161 | ('clf', OneVsRestClassifier(LogisticRegression())) 162 | ]) 163 | -------------------------------------------------------------------------------- /Python/Manipulating DataFrames with pandas/Advanced indexing: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list new_idx with the same elements as in sales.index, but with all characters capitalized. 3 | Assign new_idx to sales.index. 4 | Print the sales dataframe. This has been done for you, so hit 'Submit Answer' and to see how the index changed. 5 | 6 | Solution:- 7 | # Create the list of new indexes: new_idx 8 | new_idx = [i.upper() for i in sales.index] 9 | 10 | # Assign new_idx to sales.index 11 | sales.index = new_idx 12 | 13 | # Print the sales DataFrame 14 | print(sales) 15 | 16 | Q2:- 17 | Assign the string 'MONTHS' to sales.index.name to create a name for the index. 18 | Print the sales dataframe to see the index name you just created. 19 | Now assign the string 'PRODUCTS' to sales.columns.name to give a name to the set of columns. 20 | Print the sales dataframe again to see the columns name you just created. 21 | 22 | Solution:- 23 | # Assign the string 'MONTHS' to sales.index.name 24 | sales.index.name = 'MONTHS' 25 | 26 | # Print the sales DataFrame 27 | print(sales) 28 | 29 | # Assign the string 'PRODUCTS' to sales.columns.name 30 | sales.columns.name = 'PRODUCTS' 31 | 32 | # Print the sales dataframe again 33 | print(sales) 34 | 35 | Q3:- 36 | Generate a list months with the data ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']. This has been done for you. 37 | Assign months to sales.index. 38 | Print the modified sales dataframe and verify that you now have month information in the index. 39 | 40 | Solution:- 41 | # Generate the list of months: months 42 | months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'] 43 | 44 | # Assign months to sales.index 45 | sales.index = months 46 | 47 | # Print the modified sales DataFrame 48 | print(sales) 49 | 50 | Q4:- 51 | Create a MultiIndex by setting the index to be the columns ['state', 'month']. 52 | Sort the MultiIndex using the .sort_index() method. 53 | Print the sales DataFrame. This has been done for you, so hit 'Submit Answer' to verify that indeed you have an index with the fields state and month! 54 | 55 | Solution:- 56 | # Set the index to be the columns ['state', 'month']: sales 57 | sales = sales.set_index(['state', 'month']) 58 | 59 | # Sort the MultiIndex: sales 60 | sales = sales.sort_index() 61 | 62 | # Print the sales DataFrame 63 | print(sales) 64 | 65 | Q5:- 66 | Set the index of sales to be the column 'state'. 
67 | Print the sales DataFrame to verify that indeed you have an index with state values. 68 | Access the data from 'NY' and print it to verify that you obtain two rows. 69 | 70 | Solution:- 71 | # Set the index to the column 'state': sales 72 | sales = sales.set_index(['state']) 73 | 74 | # Print the sales DataFrame 75 | print(sales) 76 | 77 | # Access the data from 'NY' 78 | print(sales.loc['NY']) 79 | 80 | Q6:- 81 | Look up data for the New York column ('NY') in month 1. 82 | Look up data for the California and Texas columns ('CA', 'TX') in month 2. 83 | Look up data for all states in month 2. Use (slice(None), 2) to extract all rows in month 2. 84 | 85 | Solution:- 86 | # Look up data for NY in month 1: NY_month1 87 | NY_month1 = sales.loc[('NY',1)] 88 | 89 | # Look up data for CA and TX in month 2: CA_TX_month2 90 | CA_TX_month2 = sales.loc[(['CA','TX'],2),:] 91 | 92 | # Look up data for all states in month 2: all_month2 93 | all_month2 = sales.loc[(slice(None),2),:] 94 | -------------------------------------------------------------------------------- /Python/Merging DataFrames with pandas/Merging data: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Using pd.merge(), merge the DataFrames revenue and managers on the 'city' column of each. Store the result as merge_by_city. 3 | Print the DataFrame merge_by_city. This has been done for you. 4 | Merge the DataFrames revenue and managers on the 'branch_id' column of each. Store the result as merge_by_id. 5 | Print the DataFrame merge_by_id. This has been done for you, so hit 'Submit Answer' to see the result! 6 | 7 | Solution:- 8 | # Merge revenue with managers on 'city': merge_by_city 9 | merge_by_city = pd.merge(revenue,managers,on='city') 10 | 11 | # Print merge_by_city 12 | print(merge_by_city) 13 | 14 | # Merge revenue with managers on 'branch_id': merge_by_id 15 | merge_by_id = pd.merge(revenue,managers,on='branch_id') 16 | 17 | # Print merge_by_id 18 | print(merge_by_id) 19 | 20 | Q2:- 21 | Merge the DataFrames revenue and managers into a single DataFrame called combined using the 'city' and 'branch' columns from the appropriate DataFrames. 22 | In your call to pd.merge(), you will have to specify the parameters left_on and right_on appropriately. 23 | Print the new DataFrame combined. 24 | 25 | Solution:- 26 | # Merge revenue & managers on 'city' & 'branch': combined 27 | combined = pd.merge(revenue,managers,left_on='city',right_on='branch') 28 | 29 | # Print combined 30 | print(combined) 31 | 32 | Q3:- 33 | Create a column called 'state' in the DataFrame revenue, consisting of the list ['TX','CO','IL','CA']. 34 | Create a column called 'state' in the DataFrame managers, consisting of the list ['TX','CO','CA','MO']. 35 | Merge the DataFrames revenue and managers using three columns :'branch_id', 'city', and 'state'. Pass them in as a list to the on paramater of pd.merge(). 36 | 37 | Solution:- 38 | # Add 'state' column to revenue: revenue['state'] 39 | revenue['state'] = ['TX','CO','IL','CA'] 40 | 41 | # Add 'state' column to managers: managers['state'] 42 | managers['state'] = ['TX','CO','CA','MO'] 43 | 44 | # Merge revenue & managers on 'branch_id', 'city', & 'state': combined 45 | combined = pd.merge(revenue,managers,on=['branch_id', 'city', 'state']) 46 | 47 | # Print combined 48 | print(combined) 49 | 50 | Q4:- 51 | Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame revenue_and_sales. 52 | Use how='right' and on=['city', 'state']. 
53 | Print the new DataFrame revenue_and_sales. This has been done for you. 54 | Execute a left merge with sales and managers to yield a new DataFrame sales_and_managers. 55 | Use how='left', left_on=['city', 'state'], and right_on=['branch', 'state']. 56 | Print the new DataFrame sales_and_managers. This has been done for you, so hit 'Submit Answer' to see the result! 57 | 58 | Solution:- 59 | # Merge revenue and sales: revenue_and_sales 60 | revenue_and_sales = pd.merge(revenue,sales ,how='right',on=['city','state']) 61 | 62 | # Print revenue_and_sales 63 | print(revenue_and_sales) 64 | 65 | # Merge sales and managers: sales_and_managers 66 | sales_and_managers = pd.merge(sales,managers,how='left',left_on=['city','state'],right_on=['branch','state']) 67 | 68 | # Print sales_and_managers 69 | print(sales_and_managers) 70 | 71 | Q5:- 72 | Merge sales_and_managers with revenue_and_sales. Store the result as merge_default. 73 | Print merge_default. This has been done for you. 74 | Merge sales_and_managers with revenue_and_sales using how='outer'. Store the result as merge_outer. 75 | Print merge_outer. This has been done for you. 76 | Merge sales_and_managers with revenue_and_sales only on ['city','state'] using an outer join. Store the result as merge_outer_on and hit 'Submit Answer' to see what the merged DataFrames look like! 77 | 78 | Solution:- 79 | # Perform the first merge: merge_default 80 | merge_default = pd.merge(sales_and_managers,revenue_and_sales) 81 | 82 | # Print merge_default 83 | print(merge_default) 84 | 85 | # Perform the second merge: merge_outer 86 | merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer') 87 | 88 | # Print merge_outer 89 | print(merge_outer) 90 | 91 | # Perform the third merge: merge_outer_on 92 | merge_outer_on = pd.merge(sales_and_managers,revenue_and_sales,on=['city','state'],how='outer') 93 | 94 | # Print merge_outer_on 95 | print(merge_outer_on) 96 | 97 | Q6:- 98 | Perform an ordered merge on austin and houston using pd.merge_ordered(). Store the result as tx_weather. 99 | Print tx_weather. You should notice that the rows are sorted by the date but it is not possible to tell which observation came from which city. 100 | Perform another ordered merge on austin and houston. 101 | This time, specify the keyword arguments on='date' and suffixes=['_aus','_hus'] so that the rows can be distinguished. Store the result as tx_weather_suff. 102 | Print tx_weather_suff to examine its contents. This has been done for you. 103 | Perform a third ordered merge on austin and houston. 104 | This time, in addition to the on and suffixes parameters, specify the keyword argument fill_method='ffill' to use forward-filling to replace NaN entries with the most recent non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames! 
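Note: for readers without the course's austin and houston DataFrames, the effect of fill_method='ffill' in pd.merge_ordered() can be reproduced with two small toy frames (the data below is made up purely for illustration):

import pandas as pd

# Toy stand-ins for the course's Austin/Houston weather tables
austin = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-17', '2016-02-08']),
                       'ratings': ['Cloudy', 'Sunny', 'Cloudy']})
houston = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-04', '2016-03-01']),
                        'ratings': ['Cloudy', 'Rainy', 'Sunny']})

# Without ffill, dates present in only one frame get NaN in the other frame's column;
# with fill_method='ffill' the most recent non-null value is carried forward instead.
print(pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus']))
print(pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'],
                       fill_method='ffill'))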
105 | 106 | Solution:- 107 | # Perform the first ordered merge: tx_weather 108 | tx_weather = pd.merge_ordered(austin,houston) 109 | 110 | # Print tx_weather 111 | print(tx_weather) 112 | 113 | # Perform the second ordered merge: tx_weather_suff 114 | tx_weather_suff = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus']) 115 | 116 | # Print tx_weather_suff 117 | print(tx_weather_suff) 118 | 119 | # Perform the third ordered merge: tx_weather_ffill 120 | tx_weather_ffill = pd.merge_ordered(austin,houston,on='date',suffixes=['_aus','_hus'],fill_method='ffill') 121 | 122 | # Print tx_weather_ffill 123 | print(tx_weather_ffill) 124 | 125 | Q7:- 126 | Merge auto and oil using pd.merge_asof() with left_on='yr' and right_on='Date'. Store the result as merged. 127 | Print the tail of merged. This has been done for you. 128 | Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and aggregate the mean. Store the result as yearly. 129 | Hit Submit Answer to examine the contents of yearly and yearly.corr(), which shows the Pearson correlation between the resampled 'Price' and 'mpg'. 130 | 131 | Solution:- 132 | # Merge auto and oil: merged 133 | merged = pd.merge_asof(auto,oil,left_on='yr',right_on='Date') 134 | 135 | # Print the tail of merged 136 | print(merged.tail()) 137 | 138 | # Resample merged: yearly 139 | yearly = merged.resample('A',on='Date')[['mpg','Price']].mean() 140 | 141 | # Print yearly 142 | print(yearly) 143 | 144 | # print yearly.corr() 145 | print(yearly.corr()) 146 | -------------------------------------------------------------------------------- /Python/Network Analysis in Python (Part 1)/Introduction to networks: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import matplotlib.pyplot as plt and networkx as nx. 3 | Draw T_sub to the screen by using the nx.draw() function, and don't forget to also use plt.show() to display it. 4 | 5 | Solution:- 6 | # Import necessary modules 7 | import matplotlib.pyplot as plt 8 | import networkx as nx 9 | 10 | 11 | # Draw the graph to screen 12 | nx.draw(T_sub) 13 | plt.show() 14 | 15 | Q2:- 16 | Use a list comprehension to get a list of nodes from the graph T that have the 'occupation' label of 'scientist'. 17 | The output expression n has been specified for you, along with the iterator variables n and d. Your task is to fill in the iterable and the conditional expression. 18 | Use the .nodes() method of T access its nodes, and be sure to specify data=True to obtain the metadata for the nodes. 19 | The iterator variable d is a dictionary. The key of interest here is 'occupation' and value of interest is 'scientist'. 20 | Use a list comprehension to get a list of edges from the graph T that were formed for at least 6 years, i.e., from before 1 Jan 2010. 21 | Your task once again is to fill in the iterable and conditional expression. 22 | Use the .edges() method of T to access its edges. Be sure to obtain the metadata for the edges as well. 23 | The dates are stored as datetime.date objects in the metadata dictionary d, under the key 'date'. To access the date 1 Jan 2009, for example, the dictionary value would be date(2009, 1, 1). 
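Note: the graph T is loaded by the course environment. To try the node and edge comprehensions below outside that environment, a tiny stand-in graph with the same metadata keys can be built like this (toy data, not the actual Twitter network):

import networkx as nx
from datetime import date

# Toy stand-in for the course's Twitter graph T
T = nx.Graph()
T.add_node(1, occupation='scientist')
T.add_node(2, occupation='politician')
T.add_node(3, occupation='scientist')
T.add_edge(1, 2, date=date(2008, 5, 17))
T.add_edge(2, 3, date=date(2012, 1, 2))

# The comprehensions from the solution that follows work unchanged on this graph
noi = [n for n, d in T.nodes(data=True) if d['occupation'] == 'scientist']
eoi = [(u, v) for u, v, d in T.edges(data=True) if d['date'] < date(2010, 1, 1)]
print(noi, eoi)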
24 | 25 | Solution:- 26 | # Use a list comprehension to get the nodes of interest: noi 27 | noi = [n for n, d in T.nodes(data=True) if d['occupation'] == 'scientist'] 28 | 29 | # Use a list comprehension to get the edges of interest: eoi 30 | eoi = [(u, v) for u, v, d in T.edges(data=True) if d['date'] < date(2010,1,1)] 31 | 32 | Q3:- 33 | Set the 'weight' attribute of the edge between node 1 and 10 of T to be equal to 2. Refer to the following template to set an attribute of an edge: network_name.edges[node1, node2]['attribute'] = value. Here, the 'attribute' is 'weight'. 34 | Set the weight of every edge involving node 293 to be equal to 1.1. To do this: 35 | Using a for loop, iterate over all the edges of T, including the metadata. 36 | If 293 is involved in the list of nodes [u, v]: 37 | Set the weight of the edge between u and v to be 1.1. 38 | 39 | Solution:- 40 | # Set the weight of the edge 41 | T.edges[1,10]['weight'] = 2 42 | 43 | # Iterate over all the edges (with metadata) 44 | for u, v, d in T.edges(data=True): 45 | 46 | # Check if node 293 is involved 47 | if 293 in [u,v]: 48 | 49 | # Set the weight to 1.1 50 | T.edges[u,v]['weight'] = 1.1 51 | 52 | Q4:- 53 | Define a function called find_selfloop_nodes() which takes one argument: G. 54 | Using a for loop, iterate over all the edges in G (excluding the metadata). 55 | If node u is equal to node v: 56 | Append u to the list nodes_in_selfloops. 57 | Return the list nodes_in_selfloops. 58 | Check that the number of self loops in the graph equals the number of nodes in self loops. This has been done for you, so hit 'Submit Answer' to see the result! 59 | 60 | Solution:- 61 | # Define find_selfloop_nodes() 62 | def find_selfloop_nodes(T): 63 | """ 64 | Finds all nodes that have self-loops in the graph G. 65 | """ 66 | nodes_in_selfloops = [] 67 | 68 | # Iterate over all the edges of G 69 | for u, v in T.edges(): 70 | 71 | # Check if node u and node v are the same 72 | if u==v: 73 | 74 | # Append node u to nodes_in_selfloops 75 | nodes_in_selfloops.append(u) 76 | 77 | return nodes_in_selfloops 78 | 79 | # Check whether number of self loops equals the number of nodes in self loops 80 | assert T.number_of_selfloops() == len(find_selfloop_nodes(T)) 81 | 82 | Q5:- 83 | Import nxviz as nv. 84 | Plot the graph T as a matrix plot. To do this: 85 | Create the MatrixPlot object called m using the nv.MatrixPlot() function with T passed in as an argument. 86 | Draw the m to the screen using the .draw() method. 87 | Display the plot using plt.show(). 88 | Convert the graph to a matrix format, and then convert the graph to back to the NetworkX form from the matrix as a directed graph. This has been done for you. 89 | Check that the category metadata field is lost from each node. This has also been done for you, so hit 'Submit Answer' to see the results! 90 | 91 | Solution:- 92 | # Import nxviz 93 | import nxviz as nv 94 | 95 | # Create the MatrixPlot object: m 96 | m = nv.MatrixPlot(T) 97 | 98 | # Draw m to the screen 99 | m.draw() 100 | 101 | # Display the plot 102 | plt.show() 103 | 104 | # Convert T to a matrix format: A 105 | A = nx.to_numpy_matrix(T) 106 | 107 | # Convert A back to the NetworkX form as a directed graph: T_conv 108 | T_conv = nx.from_numpy_matrix(A, create_using=nx.DiGraph()) 109 | 110 | # Check that the `category` metadata field is lost from each node 111 | for n, d in T_conv.nodes(data=True): 112 | assert 'category' not in d.keys() 113 | 114 | Q6:- 115 | Import CircosPlot from nxviz. 
116 | Plot the Twitter network T as a Circos plot without any styling. Use the CircosPlot() function to do this. Don't forget to draw it to the screen using .draw() and then display it using plt.show(). 117 | 118 | Solution:- 119 | # Import necessary modules 120 | import matplotlib.pyplot as plt 121 | import nxviz as nv 122 | from nxviz import CircosPlot 123 | 124 | # Create the CircosPlot object: c 125 | c = nv.CircosPlot(T) 126 | 127 | # Draw c to the screen 128 | c.draw() 129 | 130 | # Display the plot 131 | plt.show() 132 | 133 | Q7:- 134 | Import ArcPlot from nxviz. 135 | Create an un-customized ArcPlot of T. To do this, use the ArcPlot() function with just T as the argument. 136 | Create another ArcPlot of T in which the nodes are ordered and colored by the 'category' keyword. You'll have to specify the node_order and node_color parameters to do this. For both plots, be sure to draw them to the screen and display them with plt.show(). 137 | 138 | Solution:- 139 | # Import necessary modules 140 | import matplotlib.pyplot as plt 141 | import nxviz as nv 142 | from nxviz import ArcPlot 143 | 144 | # Create the un-customized ArcPlot object: a 145 | a = nv.ArcPlot(T) 146 | 147 | # Draw a to the screen 148 | a.draw() 149 | 150 | # Display the plot 151 | plt.show() 152 | 153 | # Create the customized ArcPlot object: a2 154 | a2 = nv.ArcPlot(T,node_order='category',node_color='category') 155 | 156 | # Draw a2 to the screen 157 | a2.draw() 158 | 159 | # Display the plot 160 | plt.show() 161 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part 1/Writing your own functions: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Complete the function header by adding the appropriate function name, shout. 3 | In the function body, concatenate the string, 'congratulations' with another string, '!!!'. Assign the result to shout_word. 4 | Print the value of shout_word. 5 | Call the shout function. 6 | 7 | Solution:- 8 | # Define the function shout 9 | def shout(): 10 | """Print a string with three exclamation marks""" 11 | # Concatenate the strings: shout_word 12 | shout_word = "congratulations" + "!!!" 13 | 14 | # Print shout_word 15 | print(shout_word) 16 | 17 | # Call shout 18 | shout() 19 | 20 | Q2:- 21 | Complete the function header by adding the parameter name, word. 22 | Assign the result of concatenating word with '!!!' to shout_word. 23 | Print the value of shout_word. 24 | Call the shout() function, passing to it the string, 'congratulations' 25 | 26 | Solution:- 27 | # Define shout with the parameter, word 28 | def shout(word): 29 | """Print a string with three exclamation marks""" 30 | # Concatenate the strings: shout_word 31 | shout_word = word + '!!!' 32 | 33 | # Print shout_word 34 | print(shout_word) 35 | 36 | # Call shout with the string 'congratulations' 37 | shout("congratulations") 38 | 39 | Q3:- 40 | In the function body, concatenate the string in word with '!!!' and assign to shout_word. 41 | Replace the print() statement with the appropriate return statement. 42 | Call the shout() function, passing to it the string, 'congratulations', and assigning the call to the variable, yell. 43 | To check if yell contains the value returned by shout(), print the value of yell. 
44 | 45 | Solution:- 46 | # Define shout with the parameter, word 47 | def shout(word): 48 | """Return a string with three exclamation marks""" 49 | # Concatenate the strings: shout_word 50 | shout_word = word + "!!!" 51 | 52 | # Replace print with return 53 | return shout_word 54 | 55 | # Pass 'congratulations' to shout: yell 56 | yell = shout("congratulations") 57 | 58 | # Print yell 59 | print(yell) 60 | 61 | Q4:- 62 | Modify the function header such that it accepts two parameters, word1 and word2, in that order. 63 | Concatenate each of word1 and word2 with '!!!' and assign to shout1 and shout2, respectively. 64 | Concatenate shout1 and shout2 together, in that order, and assign to new_shout. 65 | Pass the strings 'congratulations' and 'you', in that order, to a call to shout(). Assign the return value to yell. 66 | 67 | Solution:- 68 | # Define shout with parameters word1 and word2 69 | def shout(word1, word2): 70 | """Concatenate strings with three exclamation marks""" 71 | # Concatenate word1 with '!!!': shout1 72 | shout1 = word1 + "!!!" 73 | 74 | # Concatenate word2 with '!!!': shout2 75 | shout2 = word2 + "!!!" 76 | 77 | # Concatenate shout1 with shout2: new_shout 78 | new_shout = shout1 + shout2 79 | 80 | # Return new_shout 81 | return new_shout 82 | 83 | # Pass 'congratulations' and 'you' to shout(): yell 84 | yell = shout("congratulations","you") 85 | 86 | # Print yell 87 | print(yell) 88 | 89 | Q5:- 90 | Unpack nums to the variables num1, num2, and num3. 91 | Construct a new tuple, even_nums composed of the same elements in nums, but with the 1st element replaced with the value, 2. 92 | 93 | Solution:- 94 | # Unpack nums into num1, num2, and num3 95 | num1,num2,num3 = nums 96 | 97 | # Construct even_nums 98 | even_nums = (2, num2, num3) 99 | 100 | Q6:- 101 | Modify the function header such that the function name is now shout_all, and it accepts two parameters, word1 and word2, in that order. 102 | Concatenate the string '!!!' to each of word1 and word2 and assign to shout1 and shout2, respectively. 103 | Construct a tuple shout_words, composed of shout1 and shout2. 104 | Call shout_all() with the strings 'congratulations' and 'you' and assign the result to yell1 and yell2 (remember, shout_all returns 2 variables!). 105 | 106 | Solution:- 107 | # Define shout_all with parameters word1 and word2 108 | def shout_all(word1, word2): 109 | 110 | # Concatenate word1 with '!!!': shout1 111 | shout1 = word1 + "!!!" 112 | 113 | # Concatenate word2 with '!!!': shout2 114 | shout2 = word2 + "!!!" 115 | 116 | # Construct a tuple with shout1 and shout2: shout_words 117 | shout_words = (shout1,shout2) 118 | 119 | # Return shout_words 120 | return shout_words 121 | 122 | # Pass 'congratulations' and 'you' to shout_all(): yell1, yell2 123 | yell1, yell2 = shout_all("congratulations","you") 124 | 125 | # Print yell1 and yell2 126 | print(yell1) 127 | print(yell2) 128 | 129 | Q7:- 130 | Import the pandas package with the alias pd. 131 | Import the file 'tweets.csv' using the pandas function read_csv(). Assign the resulting DataFrame to df. 132 | Complete the for loop by iterating over col, the 'lang' column in the DataFrame df. 133 | Complete the bodies of the if-else statements in the for loop: if the key is in the dictionary langs_count, add 1 to its current value, else add the key to langs_count and set its value to 1. 134 | Use the loop variable entry in your code. 
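Aside (not the exercise's intended answer, which is the explicit dictionary loop): the same per-language tally can be produced directly with collections.Counter or pandas' value_counts(); a minimal sketch, assuming df has been read from tweets.csv as above:

from collections import Counter

# One-line equivalents of the manual counting loop
langs_count = Counter(df['lang'])    # dict-like counts per language
print(langs_count)
print(df['lang'].value_counts())     # same counts as a pandas Series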
135 | 136 | Solution:- 137 | # Import pandas 138 | import pandas as pd 139 | 140 | # Import Twitter data as DataFrame: df 141 | df = pd.read_csv("tweets.csv") 142 | 143 | # Initialize an empty dictionary: langs_count 144 | langs_count = {} 145 | 146 | # Extract column from DataFrame: col 147 | col = df['lang'] 148 | 149 | # Iterate over lang column in DataFrame 150 | for entry in col: 151 | 152 | # If the language is in langs_count, add 1 153 | if entry in langs_count.keys(): 154 | langs_count[entry] +=1 155 | # Else add the language to langs_count, set the value to 1 156 | else: 157 | langs_count[entry] = 1 158 | 159 | # Print the populated dictionary 160 | print(langs_count) 161 | 162 | Q8:- 163 | Define the function count_entries(), which has two parameters. The first parameter is df for the DataFrame and the second is col_name for the column name. 164 | Complete the bodies of the if-else statements in the for loop: if the key is in the dictionary langs_count, add 1 to its current value, else add the key to langs_count and set its value to 1. Use the loop variable entry in your code. 165 | Return the langs_count dictionary from inside the count_entries() function. 166 | Call the count_entries() function by passing to it tweets_df and the name of the column, 'lang'. Assign the result of the call to the variable result. 167 | 168 | Solution:- 169 | # Define count_entries() 170 | def count_entries(df, col_name): 171 | """Return a dictionary with counts of 172 | occurrences as value for each key.""" 173 | 174 | # Initialize an empty dictionary: langs_count 175 | langs_count = {} 176 | 177 | # Extract column from DataFrame: col 178 | col = df[col_name] 179 | 180 | # Iterate over lang column in DataFrame 181 | for entry in col: 182 | 183 | # If the language is in langs_count, add 1 184 | if entry in langs_count.keys(): 185 | langs_count[entry] +=1 186 | # Else add the language to langs_count, set the value to 1 187 | else: 188 | langs_count[entry] = 1 189 | 190 | # Return the langs_count dictionary 191 | return langs_count 192 | 193 | # Call count_entries(): result 194 | result = count_entries(tweets_df,"lang") 195 | 196 | # Print the result 197 | print(result) 198 | 199 | 200 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part 2/List comprehensions and generators: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Using the range of numbers from 0 to 9 as your iterable and i as your iterator variable, write a list comprehension that produces a list of numbers consisting of the squared values of i. 3 | 4 | Solution:- 5 | # Create list comprehension: squares 6 | squares = [i*i for i in range(0,10)] 7 | 8 | Q2:- 9 | In the inner list comprehension - that is, the output expression of the nested list comprehension - create a list of values from 0 to 4 using range(). Use col as the iterator variable. 10 | In the iterable part of your nested list comprehension, use range() to count 5 rows - that is, create a list of values from 0 to 4. 11 | Use row as the iterator variable; note that you won't be needing this to create values in the list of lists. 12 | 13 | Solution:- 14 | # Create a 5 x 5 matrix using a list of lists: matrix 15 | matrix = [[col for col in range(0,5)] for row in range(0,5)] 16 | 17 | # Print the matrix 18 | for row in matrix: 19 | print(row) 20 | 21 | Q3:- 22 | Use member as the iterator variable in the list comprehension. 
For the conditional, use len() to evaluate the iterator variable. 23 | Note that you only want strings with 7 characters or more. 24 | 25 | Solution:- 26 | # Create a list of strings: fellowship 27 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 28 | 29 | # Create list comprehension: new_fellowship 30 | new_fellowship = [member for member in fellowship if len(member) >= 7] 31 | 32 | # Print the new list 33 | print(new_fellowship) 34 | 35 | Q4:- 36 | In the output expression, keep the string as-is if the number of characters is >= 7, else replace it with an empty string - that is, '' or "". 37 | 38 | Solution:- 39 | # Create a list of strings: fellowship 40 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 41 | 42 | # Create list comprehension: new_fellowship 43 | new_fellowship = [member if len(member) >= 7 else "" for member in fellowship] 44 | 45 | # Print the new list 46 | print(new_fellowship) 47 | 48 | Q5:- 49 | Create a dict comprehension where the key is a string in fellowship and the value is the length of the string. 50 | Remember to use the syntax key:value in the output expression part of the comprehension to create the members of the dictionary. 51 | Use member as the iterator variable. 52 | 53 | Solution:- 54 | # Create a list of strings: fellowship 55 | fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 56 | 57 | # Create dict comprehension: new_fellowship 58 | new_fellowship = {member:len(member) for member in fellowship} 59 | 60 | # Print the new list 61 | print(new_fellowship) 62 | 63 | Q6:- 64 | Create a generator object that will produce values from 0 to 30. Assign the result to result and use num as the iterator variable in the generator expression. 65 | Print the first 5 values by using next() appropriately in print(). 66 | Print the rest of the values by using a for loop to iterate over the generator object. 67 | 68 | Solution:- 69 | # Create generator object: result 70 | result = (num for num in range(0,31)) 71 | 72 | # Print the first 5 values 73 | print(next(result)) 74 | print(next(result)) 75 | print(next(result)) 76 | print(next(result)) 77 | print(next(result)) 78 | 79 | # Print the rest of the values 80 | for value in result: 81 | print(value) 82 | 83 | Q7:- 84 | Write a generator expression that will generate the lengths of each string in lannister. Use person as the iterator variable. Assign the result to lengths. 85 | Supply the correct iterable in the for loop for printing the values in the generator object. 86 | 87 | Solution:- 88 | # Create a list of strings: lannister 89 | lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey'] 90 | 91 | # Create a generator object: lengths 92 | lengths = (len(person) for person in lannister ) 93 | 94 | # Iterate over and print the values in lengths 95 | for value in lengths: 96 | print(value) 97 | 98 | Q8:- 99 | Complete the function header for the function get_lengths() that has a single parameter, input_list. 100 | In the for loop in the function definition, yield the length of the strings in input_list. 101 | Complete the iterable part of the for loop for printing the values generated by the get_lengths() generator function. 102 | Supply the call to get_lengths(), passing in the list lannister. 
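Aside: the reason this exercise asks for a generator rather than a list is memory. A generator yields one value at a time instead of materialising every value up front, which is the same trade-off that separates generator expressions from list comprehensions. A quick, self-contained illustration (sizes are CPython-specific and machine-dependent):

import sys

nums_list = [n * n for n in range(100000)]   # built eagerly, all values stored at once
nums_gen = (n * n for n in range(100000))    # built lazily, values produced on demand

print(sys.getsizeof(nums_list))   # roughly 800 KB on CPython
print(sys.getsizeof(nums_gen))    # on the order of 100 bytes, independent of range size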
103 | 104 | Solution:- 105 | # Define generator function get_lengths 106 | def get_lengths(input_list): 107 | """Generator function that yields the 108 | length of the strings in input_list.""" 109 | 110 | # Yield the length of a string 111 | for person in input_list: 112 | yield len(person) 113 | 114 | # Print the values generated by get_lengths() 115 | for value in get_lengths(lannister): 116 | print(value) 117 | 118 | Q9:- 119 | Extract the column 'created_at' from df and assign the result to tweet_time. Fun fact: the extracted column in tweet_time here is a Series data structure! 120 | Create a list comprehension that extracts the time from each row in tweet_time. Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. 121 | Use entry as the iterator variable and assign the result to tweet_clock_time. Remember that Python uses 0-based indexing! 122 | 123 | Solution:- 124 | # Extract the created_at column from df: tweet_time 125 | tweet_time = df['created_at'] 126 | 127 | # Extract the clock time: tweet_clock_time 128 | tweet_clock_time = [entry[11:19] for entry in tweet_time] 129 | 130 | # Print the extracted times 131 | print(tweet_clock_time) 132 | 133 | Q10:- 134 | Extract the column 'created_at' from df and assign the result to tweet_time. 135 | Create a list comprehension that extracts the time from each row in tweet_time. 136 | Each row is a string that represents a timestamp, and you will access the 12th to 19th characters in the string to extract the time. 137 | Use entry as the iterator variable and assign the result to tweet_clock_time. 138 | 139 | Solution:- 140 | # Extract the created_at column from df: tweet_time 141 | tweet_time = df['created_at'] 142 | 143 | # Extract the clock time: tweet_clock_time 144 | tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19'] 145 | 146 | # Print the extracted times 147 | print(tweet_clock_time) 148 | 149 | 150 | Additionally, add a conditional expression that checks whether entry[17:19] is equal to '19'. 151 | 152 | -------------------------------------------------------------------------------- /Python/Python Data Science Toolbox -Part/Case Study: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a zip object by calling zip() and passing to it feature_names and row_vals. Assign the result to zipped_lists. 3 | Create a dictionary from the zipped_lists zip object by calling dict() with zipped_lists. Assign the resulting dictionary to rs_dict. 4 | 5 | Solution:- 6 | # Zip lists: zipped_lists 7 | zipped_lists = zip(feature_names,row_vals) 8 | 9 | # Create a dictionary: rs_dict 10 | rs_dict = dict(zipped_lists) 11 | 12 | # Print the dictionary 13 | print(rs_dict) 14 | 15 | Q2:- 16 | Define the function lists2dict() with two parameters: first is list1 and second is list2. 17 | Return the resulting dictionary rs_dict in lists2dict(). 18 | Call the lists2dict() function with the arguments feature_names and row_vals. Assign the result of the function call to rs_fxn. 
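Aside: in Python 3, zip() returns a lazy iterator that is exhausted after a single pass, so the zipped_lists object from Q1 can only be turned into a dictionary once. A quick illustration with made-up values (the real feature_names and row_vals come from the course's dataset):

feature_names = ['CountryName', 'CountryCode']   # illustrative values only
row_vals = ['Arab World', 'ARB']

zipped_lists = zip(feature_names, row_vals)
print(dict(zipped_lists))   # {'CountryName': 'Arab World', 'CountryCode': 'ARB'}
print(dict(zipped_lists))   # {} -- the iterator is already exhausted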
19 | 20 | Solution:- 21 | # Define lists2dict() 22 | def lists2dict(list1, list2): 23 | """Return a dictionary where list1 provides 24 | the keys and list2 provides the values.""" 25 | 26 | # Zip lists: zipped_lists 27 | zipped_lists = zip(list1, list2) 28 | 29 | # Create a dictionary: rs_dict 30 | rs_dict = dict(zipped_lists) 31 | 32 | # Return the dictionary 33 | return rs_dict 34 | 35 | # Call lists2dict: rs_fxn 36 | rs_fxn = lists2dict(feature_names,row_vals) 37 | 38 | # Print rs_fxn 39 | print(rs_fxn) 40 | 41 | Q3:- 42 | Inspect the contents of row_lists by printing the first two lists in row_lists. 43 | Create a list comprehension that generates a dictionary using lists2dict() for each sublist in row_lists. The keys are from the feature_names list and the values are the row entries in row_lists. Use sublist as your iterator variable and assign the resulting list of dictionaries to list_of_dicts. 44 | Look at the first two dictionaries in list_of_dicts by printing them out. 45 | 46 | Solution:- 47 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python (Part 2)/Hypothesis test examples: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Construct Boolean arrays, dems and reps that contain the votes of the respective parties; e.g., dems has 153 True entries and 91 False entries. 3 | Write a function, frac_yea_dems(dems, reps) that returns the fraction of Democrats that voted yea. The first input is an array of Booleans, Two inputs are required to use your draw_perm_reps() function, but the second is not used. 4 | Use your draw_perm_reps() function to draw 10,000 permutation replicates of the fraction of Democrat yea votes. 5 | Compute and print the p-value. 6 | 7 | Solution:- 8 | # Construct arrays of data: dems, reps 9 | dems = np.array([True] * 153 + [False] * 91) 10 | reps = np.array([True]* 136 + [False]*35) 11 | 12 | def frac_yea_dems(dems, reps): 13 | """Compute fraction of Democrat yea votes.""" 14 | frac = np.sum(dems) / len(dems) 15 | return frac 16 | 17 | # Acquire permutation samples: perm_replicates 18 | perm_replicates = draw_perm_reps(dems, reps, frac_yea_dems, size=10000) 19 | 20 | # Compute and print p-value: p 21 | p = np.sum(perm_replicates <= 153/244) / len(perm_replicates) 22 | print('p-value =', p) 23 | 24 | Q2:- 25 | Compute the observed difference in mean inter-nohitter time using diff_of_means(). 26 | Generate 10,000 permutation replicates of the difference of means using draw_perm_reps(). 27 | Compute and print the p-value. 28 | 29 | Solution:- 30 | # Compute the observed difference in mean inter-no-hitter times: nht_diff_obs 31 | nht_diff_obs = diff_of_means(nht_dead,nht_live) 32 | 33 | # Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates 34 | perm_replicates = draw_perm_reps(nht_dead,nht_live,diff_of_means,size=10000) 35 | 36 | 37 | # Compute and print the p-value: p 38 | p = np.sum(perm_replicates <= nht_diff_obs)/len(perm_replicates) 39 | print('p-val =', p) 40 | 41 | Q3:- 42 | Compute the observed Pearson correlation between illiteracy and fertility. 43 | Initialize an array to store your permutation replicates. 44 | Write a for loop to draw 10,000 replicates: 45 | Permute the illiteracy measurements using np.random.permutation(). 46 | Compute the Pearson correlation between the permuted illiteracy array, illiteracy_permuted, and fertility. 47 | Compute and print the p-value from the replicates. 
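Note: pearson_r(), diff_of_means(), draw_perm_reps(), ecdf() and draw_bs_reps() come from earlier chapters of the course and are only called in this file. Plausible sketches of these helpers, for context (illustrative, not the course's exact code):

import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    return np.mean(data_1) - np.mean(data_2)

def draw_perm_reps(data_1, data_2, func, size=1):
    """Draw `size` permutation replicates of func(perm_1, perm_2)."""
    perm_replicates = np.empty(size)
    for i in range(size):
        # Permute the pooled data and split it back into two samples
        permuted = np.random.permutation(np.concatenate((data_1, data_2)))
        perm_replicates[i] = func(permuted[:len(data_1)], permuted[len(data_1):])
    return perm_replicates

def ecdf(data):
    """x, y values of the empirical CDF of a 1-D array."""
    x = np.sort(data)
    y = np.arange(1, len(data) + 1) / len(data)
    return x, y

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of func applied to resampled data."""
    return np.array([func(np.random.choice(data, size=len(data)))
                     for _ in range(size)])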
48 | 49 | Solution:- 50 | # Compute observed correlation: r_obs 51 | r_obs = pearson_r(illiteracy,fertility) 52 | 53 | # Initialize permutation replicates: perm_replicates 54 | perm_replicates = np.empty(10000) 55 | 56 | # Draw replicates 57 | for i in range(10000): 58 | # Permute illiteracy measurments: illiteracy_permuted 59 | illiteracy_permuted = np.random.permutation(illiteracy) 60 | 61 | # Compute Pearson correlation 62 | perm_replicates[i] = pearson_r(illiteracy_permuted,fertility) 63 | 64 | # Compute p-value: p 65 | p = np.sum(perm_replicates >= 1)/len(perm_replicates) 66 | print('p-val =', p) 67 | 68 | Q4:- 69 | Use your ecdf() function to generate x,y values from the control and treated arrays for plotting the ECDFs. 70 | Plot the ECDFs on the same plot. 71 | The margins have been set for you, along with the legend and axis labels. Hit 'Submit Answer' to see the result! 72 | 73 | Solution:- 74 | # Compute x,y values for ECDFs 75 | x_control, y_control = ecdf(control) 76 | x_treated, y_treated = ecdf(treated) 77 | 78 | # Plot the ECDFs 79 | plt.plot(x_control, y_control, marker='.', linestyle='none') 80 | plt.plot(x_treated, y_treated, marker='.', linestyle='none') 81 | 82 | # Set the margins 83 | plt.margins(0.02) 84 | 85 | # Add a legend 86 | plt.legend(('control', 'treated'), loc='lower right') 87 | 88 | # Label axes and show plot 89 | plt.xlabel('millions of alive sperm per mL') 90 | plt.ylabel('ECDF') 91 | plt.show() 92 | 93 | Q5:- 94 | Compute the mean alive sperm count of control minus that of treated. 95 | Compute the mean of all alive sperm counts. To do this, first concatenate control and treated and take the mean of the concatenated array. 96 | Generate shifted data sets for both control and treated such that the shifted data sets have the same mean. This has already been done for you. 97 | Generate 10,000 bootstrap replicates of the mean each for the two shifted arrays. Use your draw_bs_reps() function. 98 | Compute the bootstrap replicates of the difference of means. 99 | The code to compute and print the p-value has been written for you. Hit 'Submit Answer' to see the result! 100 | 101 | Solution:- 102 | # Compute the difference in mean sperm count: diff_means 103 | diff_means = np.mean(control) - np.mean(treated) 104 | 105 | # Compute mean of pooled data: mean_count 106 | mean_count = np.mean(np.concatenate((control,treated))) 107 | 108 | # Generate shifted data sets 109 | control_shifted = control - np.mean(control) + mean_count 110 | treated_shifted = treated - np.mean(treated) + mean_count 111 | 112 | # Generate bootstrap replicates 113 | bs_reps_control = draw_bs_reps(control_shifted, 114 | np.mean, size=10000) 115 | bs_reps_treated = draw_bs_reps(treated_shifted, 116 | np.mean, size=10000) 117 | 118 | # Get replicates of difference of means: bs_replicates 119 | bs_replicates = bs_reps_control- bs_reps_treated 120 | 121 | # Compute and print p-value: p 122 | p = np.sum(bs_replicates >= np.mean(control) - np.mean(treated)) \ 123 | / len(bs_replicates) 124 | print('p-value =', p) 125 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python (Part 2)/Parameter estimation by optimization: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Seed the random number generator with 42. 3 | Compute the mean time (in units of number of games) between no-hitters. 
4 | Draw 100,000 samples from an Exponential distribution with the parameter you computed from the mean of the inter-no-hitter times. 5 | Plot the theoretical PDF using plt.hist(). Remember to use keyword arguments bins=50, normed=True, and histtype='step'. Be sure to label your axes. 6 | Show your plot. 7 | 8 | Solution:- 9 | # Seed random number generator 10 | np.random.seed(42) 11 | 12 | # Compute mean no-hitter time: tau 13 | tau = np.mean(nohitter_times) 14 | 15 | # Draw out of an exponential distribution with parameter tau: inter_nohitter_time 16 | inter_nohitter_time = np.random.exponential(tau, 100000) 17 | 18 | # Plot the PDF and label axes 19 | _ = plt.hist(inter_nohitter_time, 20 | bins=50, normed=True, histtype='step') 21 | _ = plt.xlabel('Games between no-hitters') 22 | _ = plt.ylabel('PDF') 23 | 24 | # Show the plot 25 | plt.show() 26 | 27 | Q2:- 28 | # Create an ECDF from real data: x, y 29 | x, y = ecdf(nohitter_times) 30 | 31 | # Create a CDF from theoretical samples: x_theor, y_theor 32 | x_theor, y_theor = ecdf(inter_nohitter_time) 33 | 34 | # Overlay the plots 35 | plt.plot(x_theor, y_theor) 36 | plt.plot(x, y, marker='.', linestyle='none') 37 | 38 | # Margins and axis labels 39 | plt.margins(0.02) 40 | plt.xlabel('Games between no-hitters') 41 | plt.ylabel('CDF') 42 | 43 | # Show the plot 44 | plt.show() 45 | 46 | Q3:- 47 | Take 10000 samples out of an Exponential distribution with parameter τ1/2 = tau/2. 48 | Take 10000 samples out of an Exponential distribution with parameter τ2 = 2*tau. 49 | Generate CDFs from these two sets of samples using your ecdf() function. 50 | Add these two CDFs as lines to your plot. This has been done for you, so hit 'Submit Answer' to view the plot! 51 | 52 | Solution:- 53 | # Plot the theoretical CDFs 54 | plt.plot(x_theor, y_theor) 55 | plt.plot(x, y, marker='.', linestyle='none') 56 | plt.margins(0.02) 57 | plt.xlabel('Games between no-hitters') 58 | plt.ylabel('CDF') 59 | 60 | # Take samples with half tau: samples_half 61 | samples_half = np.random.exponential(tau/2,10000) 62 | 63 | # Take samples with double tau: samples_double 64 | samples_double = np.random.exponential(2*tau,10000) 65 | 66 | # Generate CDFs from these samples 67 | x_half, y_half = ecdf(samples_half) 68 | x_double, y_double = ecdf(samples_double) 69 | 70 | # Plot these CDFs as lines 71 | _ = plt.plot(x_half, y_half) 72 | _ = plt.plot(x_double, y_double) 73 | 74 | # Show the plot 75 | plt.show() 76 | 77 | Q4:- 78 | Plot fertility (y-axis) versus illiteracy (x-axis) as a scatter plot. 79 | Set a 2% margin. 80 | Compute and print the Pearson correlation coefficient between illiteracy and fertility. 81 | 82 | Solution:- 83 | # Plot the illiteracy rate versus fertility 84 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none') 85 | 86 | # Set the margins and label axes 87 | plt.margins(0.02) 88 | _ = plt.xlabel('percent illiterate') 89 | _ = plt.ylabel('fertility') 90 | 91 | # Show the plot 92 | plt.show() 93 | 94 | # Show the Pearson correlation coefficient 95 | print(pearson_r(illiteracy, fertility)) 96 | 97 | Q5:- 98 | Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis. 99 | Print out the slope and intercept from the linear regression. 100 | To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters. I.e., y = a * x + b. 
101 | Plot the data and the regression line on the same plot. Be sure to label your axes. 102 | Hit 'Submit Answer' to display your plot. 103 | 104 | Solution:- 105 | # Plot the illiteracy rate versus fertility 106 | _ = plt.plot(illiteracy, fertility, marker='.', linestyle='none') 107 | plt.margins(0.02) 108 | _ = plt.xlabel('percent illiterate') 109 | _ = plt.ylabel('fertility') 110 | 111 | # Perform a linear regression using np.polyfit(): a, b 112 | a, b = np.polyfit(illiteracy,fertility,1) 113 | 114 | # Print the results to the screen 115 | print('slope =', a, 'children per woman / percent illiterate') 116 | print('intercept =', b, 'children per woman') 117 | 118 | # Make theoretical line to plot 119 | x = np.array([0,100]) 120 | y = a * x + b 121 | 122 | # Add regression line to your plot 123 | _ = plt.plot(x, y) 124 | 125 | # Draw the plot 126 | plt.show() 127 | 128 | Q6:- 129 | Specify the values of the slope to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100). 130 | Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals). 131 | Write a for loop to compute the sum of RSS of the slope. Hint: the RSS is given by np.sum((y_data - a * x_data - b)**2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data. 132 | Plot the RSS (rss) versus slope (a_vals). 133 | 134 | Solution:- 135 | # Specify slopes to consider: a_vals 136 | a_vals = np.linspace(0,0.1,200) 137 | 138 | # Initialize sum of square of residuals: rss 139 | rss = np.empty_like(a_vals) 140 | 141 | # Compute sum of square of residuals for each value of a_vals 142 | for i, a in enumerate(a_vals): 143 | rss[i] = np.sum((fertility - a*illiteracy - b)**2) 144 | 145 | # Plot the RSS 146 | plt.plot(a_vals, rss, '-') 147 | plt.xlabel('slope (children per woman / percent illiterate)') 148 | plt.ylabel('sum of square of residuals') 149 | 150 | plt.show() 151 | 152 | Q7:- 153 | Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y. 154 | Print the slope a and intercept b. 155 | Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept. 156 | Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when to plot the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line. 
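Background for the np.polyfit() exercises in this file: for a straight line, the slope and intercept that minimize the RSS defined above have a closed form (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)), and np.polyfit(x, y, 1) returns the same pair. A quick standalone check with made-up data:

import numpy as np

x = np.array([0., 1., 2., 3., 4.])          # made-up data for illustration
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Closed-form least-squares estimates (matching ddof so the n-1 factors cancel)
a_closed = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_closed = np.mean(y) - a_closed * np.mean(x)

# np.polyfit() should agree up to floating-point error
a_fit, b_fit = np.polyfit(x, y, 1)
print(a_closed, b_closed)
print(a_fit, b_fit)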
157 | 158 | Solution:- 159 | # Perform linear regression: a, b 160 | a, b = np.polyfit(x,y,1) 161 | 162 | # Print the slope and intercept 163 | print(a, b) 164 | 165 | # Generate theoretical x and y data: x_theor, y_theor 166 | x_theor = np.array([3, 15]) 167 | y_theor = a * x_theor + b 168 | 169 | # Plot the Anscombe data and theoretical line 170 | _ = plt.plot(x,y,marker='.',linestyle='none') 171 | _ = plt.plot(x_theor,y_theor) 172 | 173 | # Label the axes 174 | plt.xlabel('x') 175 | plt.ylabel('y') 176 | 177 | # Show the plot 178 | plt.show() 179 | 180 | Q7:- 181 | Write a for loop to do the following for each Anscombe data set. 182 | Compute the slope and intercept. 183 | Print the slope and intercept. 184 | 185 | Solution:- 186 | # Iterate through x,y pairs 187 | for x, y in zip(anscombe_x , anscombe_y ): 188 | # Compute the slope and intercept: a, b 189 | a, b = np.polyfit(x,y,1) 190 | 191 | # Print the result 192 | print('slope:', a, 'intercept:', b) 193 | 194 | 195 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Graphical exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns). 3 | Use seaborn to set the plotting defaults. 4 | Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length. 5 | Show the histogram using plt.show(). 6 | 7 | Solution:- 8 | # Import plotting modules 9 | import matplotlib.pyplot as plt 10 | import seaborn as sns 11 | 12 | 13 | # Set default Seaborn style 14 | sns.set() 15 | 16 | # Plot histogram of versicolor petal lengths 17 | plt.hist(versicolor_petal_length) 18 | 19 | # Show histogram 20 | plt.show() 21 | 22 | Q2:- 23 | Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential! 24 | Display the plot constructed in the above steps using plt.show(). 25 | 26 | Solution:- 27 | # Plot histogram of versicolor petal lengths 28 | _ = plt.hist(versicolor_petal_length) 29 | 30 | # Label axes 31 | plt.xlabel('petal length (cm)') 32 | plt.ylabel('count') 33 | 34 | # Show histogram 35 | plt.show() 36 | 37 | Q3:- 38 | Import numpy as np. This gives access to the square root function, np.sqrt(). 39 | Determine how many data points you have using len(). 40 | Compute the number of bins using the square root rule. 41 | Convert the number of bins to an integer using the built in int() function. 42 | Generate the histogram and make sure to use the bins keyword argument. 43 | Hit 'Submit Answer' to plot the figure and see the fruit of your labors! 44 | 45 | Solution:- 46 | # Import numpy 47 | import numpy as np 48 | 49 | # Compute number of data points: n_data 50 | n_data = len(versicolor_petal_length) 51 | 52 | # Number of bins is the square root of number of data points: n_bins 53 | n_bins = np.sqrt(n_data) 54 | 55 | # Convert number of bins to integer: n_bins 56 | n_bins = int(n_bins) 57 | 58 | # Plot the histogram 59 | plt.hist(versicolor_petal_length, bins= n_bins) 60 | 61 | # Label axes 62 | _ = plt.xlabel('petal length (cm)') 63 | _ = plt.ylabel('count') 64 | 65 | # Show histogram 66 | plt.show() 67 | 68 | Q4:- 69 | In the IPython Shell, inspect the DataFrame df using df.head(). 
This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot(). 70 | Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths. 71 | Label the axes. 72 | Show your plot. 73 | 74 | Solution:- 75 | # Create bee swarm plot with Seaborn's default settings 76 | df.head() 77 | 78 | # Label the axes 79 | sns.swarmplot(x = 'species', y = 'petal length (cm)' , data = df) 80 | _ = plt.xlabel('species') 81 | _ = plt.ylabel('petal length (cm)') 82 | # Show the plot 83 | plt.show() 84 | 85 | Q5:- 86 | Define a function with the signature ecdf(data). Within the function definition, 87 | Compute the number of data points, n, using the len() function. 88 | The x-values are the sorted data. Use the np.sort() function to perform the sorting. 89 | The y data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n. 90 | The function returns the values x and y. 91 | 92 | Solution:- 93 | def ecdf(data): 94 | """Compute ECDF for a one-dimensional array of measurements.""" 95 | # Number of data points: n 96 | n = len(data) 97 | 98 | # x-data for the ECDF: x 99 | x = np.sort(data) 100 | 101 | # y-data for the ECDF: y 102 | y = np.arange(1, n+1) / n 103 | 104 | return x, y 105 | 106 | Q6:- 107 | Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers. 108 | Plot the ECDF as dots. Remember to include marker = '.' and linestyle = 'none' in addition to x_vers and y_vers as arguments inside plt.plot(). 109 | Label the axes. You can label the y-axis 'ECDF'. 110 | Show your plot 111 | 112 | Solution:- 113 | # Compute ECDF for versicolor data: x_vers, y_vers 114 | x_vers, y_vers = ecdf(versicolor_petal_length) 115 | 116 | # Generate plot 117 | _ = plt.plot(x_vers, y_vers,marker='.',linestyle='none') 118 | 119 | # Label the axes 120 | _ = plt.xlabel('length') 121 | _ = plt.ylabel('ECDF') 122 | 123 | 124 | # Display the plot 125 | plt.show() 126 | 127 | Q7:- 128 | Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively. 129 | Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _. 130 | A legend and axis labels have been added for you, so hit 'Submit Answer' to see all the ECDFs! 
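A quick standalone check of the ecdf() helper defined in Q5, run on a tiny made-up array (assuming NumPy is imported as np and ecdf() is defined as above):

import numpy as np

data = np.array([3.0, 1.0, 2.0])   # made-up values
x, y = ecdf(data)
print(x)   # [1. 2. 3.]                 the sorted data
print(y)   # [0.333... 0.666... 1.0]    i.e. 1/n, 2/n, ..., n/n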
131 | 132 | Solution:- 133 | # Compute ECDFs 134 | # Compute ECDFs 135 | x_set, y_set = ecdf(setosa_petal_length) 136 | x_vers, y_vers = ecdf(versicolor_petal_length) 137 | x_virg, y_virg = ecdf(virginica_petal_length) 138 | 139 | # Plot all ECDFs on the same plot 140 | _ = plt.plot(x_set, y_set, marker = '.', linestyle = 'none') 141 | _ = plt.plot(x_vers, y_vers, marker = '.', linestyle = 'none') 142 | _ = plt.plot(x_virg, y_virg, marker = '.', linestyle = 'none') 143 | 144 | 145 | # Annotate the plot 146 | plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right') 147 | _ = plt.xlabel('petal length (cm)') 148 | _ = plt.ylabel('ECDF') 149 | 150 | # Display the plot 151 | plt.show() 152 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Quantitative exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Compute the mean petal length of Iris versicolor from Anderson's classic data set. The variable versicolor_petal_length is provided in your namespace. Assign the mean to mean_length_vers. 3 | 4 | Solution:- 5 | # Compute the mean: mean_length_vers 6 | 7 | mean_length_vers = versicolor_petal_length.mean() 8 | # Print the result with some nice formatting 9 | print('I. versicolor:', mean_length_vers, 'cm') 10 | 11 | Q2:- 12 | Create percentiles, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles. 13 | Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace. 14 | Print the percentiles. 15 | 16 | Solution:- 17 | # Specify array of percentiles: percentiles 18 | percentiles = np.array([2.5,25,50,75,97.5]) 19 | 20 | # Compute percentiles: ptiles_vers 21 | ptiles_vers = np.percentile(versicolor_petal_length,percentiles) 22 | 23 | # Print the result 24 | print(ptiles_vers) 25 | 26 | Q3:- 27 | Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates - ptiles_vers and percentiles/100 - as positional arguments and specify the marker='D', color='red' and linestyle='none' keyword arguments. The argument for the y-axis - percentiles/100 has been specified for you. 28 | 29 | Solution:- 30 | # Plot the ECDF 31 | _ = plt.plot(x_vers, y_vers, '.') 32 | _ = plt.xlabel('petal length (cm)') 33 | _ = plt.ylabel('ECDF') 34 | 35 | # Overlay percentiles as red diamonds. 36 | _ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red', 37 | linestyle='none') 38 | 39 | # Show the plot 40 | plt.show() 41 | 42 | Q4:- 43 | The set-up is exactly the same as for the bee swarm plot; you just call sns.boxplot() with the same keyword arguments as you would sns.swarmplot(). The x-axis is 'species' and y-axis is 'petal length (cm)'. 44 | Don't forget to label your axes! 45 | Display the figure using the normal call. 
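Aside for the box plot below: the box edges are the 25th and 75th percentiles and the centre line is the median, the same quantities computed with np.percentile() in the previous exercises. A quick numeric check on made-up data:

import numpy as np

data = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9.])   # made-up values
print(np.percentile(data, [25, 50, 75]))   # [3. 5. 7.]
print(np.median(data))                     # 5.0, identical to the 50th percentile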
46 | 47 | Solution:- 48 | # Create box plot with Seaborn's default settings 49 | _ = sns.boxplot(x='species', y='petal length (cm)', data=df) 50 | 51 | # Label the axes 52 | _ = plt.xlabel('species') 53 | 54 | _ = plt.ylabel('petal length (cm)') 55 | 56 | 57 | # Show the plot 58 | plt.show() 59 | 60 | Q5:- 61 | Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array so you can take advantage of NumPy's vectorized operations. 62 | Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq. 63 | Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit. 64 | Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np. 65 | Print both variance_explicit and variance_np in one print call to make sure they are consistent. 66 | 67 | Solution:- 68 | # Array of differences to mean: differences 69 | differences = np.array(versicolor_petal_length - np.mean(versicolor_petal_length)) 70 | 71 | # Square the differences: diff_sq 72 | diff_sq = differences **2 73 | 74 | # Compute the mean square difference: variance_explicit 75 | variance_explicit = np.mean(diff_sq) 76 | 77 | # Compute the variance using NumPy: variance_np 78 | variance_np = np.var(versicolor_petal_length) 79 | 80 | # Print the results 81 | print(variance_explicit,variance_np) 82 | 83 | Q6:- 84 | Compute the variance of the data in the versicolor_petal_length array using np.var() and store it in a variable called variance. 85 | 86 | Print the square root of this value. 87 | 88 | Print the standard deviation of the data in the versicolor_petal_length array using np.std(). 89 | 90 | Solution:- 91 | # Compute the variance: variance 92 | variance = np.var(versicolor_petal_length) 93 | 94 | # Print the square root of the variance 95 | print(np.sqrt(variance)) 96 | 97 | # Print the standard deviation 98 | print(np.std(versicolor_petal_length)) 99 | 100 | Q7:- 101 | Use plt.plot() with the appropriate keyword arguments to make a scatter plot of versicolor petal length (x-axis) versus petal width (y-axis). The variables versicolor_petal_length and versicolor_petal_width are already in your namespace. Do not forget to use the marker='.' and linestyle='none' keyword arguments. 102 | Label the axes. 103 | Display the plot. 104 | 105 | Solution:- 106 | # Make a scatter plot 107 | _ = plt.plot(versicolor_petal_length,versicolor_petal_width,marker='.',linestyle='none') 108 | 109 | 110 | # Label the axes 111 | _ = plt.xlabel('versicolor petal length') 112 | 113 | _ = plt.ylabel('versicolor petal width') 114 | 115 | 116 | 117 | # Show the result 118 | plt.show() 119 | 120 | Q8:- 121 | Use np.cov() to compute the covariance matrix for the petal length (versicolor_petal_length) and width (versicolor_petal_width) of I. versicolor. 122 | Print the covariance matrix. 123 | Extract the covariance from entry [0,1] of the covariance matrix. Note that by symmetry, entry [1,0] is the same as entry [0,1]. 124 | Print the covariance. 
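A reminder of what entry [0,1] of the covariance matrix holds: it is the covariance of the two inputs, i.e. the mean product of their deviations from their means (np.cov() uses an n-1 denominator by default). A small check with made-up arrays:

import numpy as np

x = np.array([1., 2., 3., 4.])   # made-up data
y = np.array([2., 1., 4., 3.])

manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(manual_cov)
print(np.cov(x, y)[0, 1])        # same value; entries [0,0] and [1,1] hold the variances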
125 | 126 | Solution:- 127 | # Compute the covariance matrix: covariance_matrix 128 | covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width) 129 | 130 | # Print covariance matrix 131 | print(covariance_matrix) 132 | 133 | # Extract covariance of length and width of petals: petal_cov 134 | petal_cov = covariance_matrix[0,1] 135 | 136 | # Print the length/width covariance 137 | print(petal_cov) 138 | 139 | Q9:- 140 | Define a function with signature pearson_r(x, y). 141 | Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order). 142 | The function returns entry [0,1] of the correlation matrix. 143 | Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r. 144 | Print the result. 145 | 146 | Solution:- 147 | def pearson_r(x, y): 148 | """Compute Pearson correlation coefficient between two arrays.""" 149 | # Compute correlation matrix: corr_mat 150 | corr_mat = np.corrcoef(x,y) 151 | 152 | 153 | # Return entry [0,1] 154 | return corr_mat[0,1] 155 | 156 | # Compute Pearson correlation coefficient for I. versicolor: r 157 | r = pearson_r(versicolor_petal_length,versicolor_petal_width) 158 | 159 | # Print the result 160 | print(r) 161 | 162 | -------------------------------------------------------------------------------- /Python/Statistical Thinking in Python -Part 1/Thinking probabilistically-- Continuous variables: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3 and samples_std10, respectively. 3 | Plot a histograms of each of the samples; for each, use 100 bins, also using the keyword arguments normed=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls. 4 | Hit 'Submit Answer' to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs. 5 | 6 | Solution:- 7 | # Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10 8 | samples_std1 = np.random.normal(20,1,100000) 9 | samples_std3 = np.random.normal(20,3,100000) 10 | samples_std10 = np.random.normal(20,10,100000) 11 | 12 | 13 | # Make histograms 14 | plt.hist(samples_std1,bins=100, normed=True,histtype='step') 15 | plt.hist(samples_std3,bins=100, normed=True,histtype='step') 16 | plt.hist(samples_std10,bins=100, normed=True,histtype='step') 17 | 18 | # Make a legend, set limits and show plot 19 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10')) 20 | plt.ylim(-0.01, 0.42) 21 | plt.show() 22 | 23 | Q2:- 24 | Use your ecdf() function to generate x and y values for CDFs: x_std1, y_std1, x_std3, y_std3 and x_std10, y_std10, respectively. 25 | Plot all three CDFs as dots (do not forget the marker and linestyle keyword arguments!). 26 | Hit submit to make a legend, showing which standard deviations you used, and to show your plot. 
There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of CDFs. 27 | 28 | Solution:- 29 | # Generate CDFs 30 | x_std1, y_std1 = ecdf(samples_std1) 31 | x_std3, y_std3 = ecdf(samples_std3) 32 | x_std10, y_std10 = ecdf(samples_std10) 33 | 34 | # Plot CDFs 35 | _ = plt.plot(x_std1, y_std1 , marker='.', linestyle='none') 36 | _ = plt.plot(x_std3, y_std3 , marker='.', linestyle='none') 37 | _ = plt.plot(x_std10, y_std10 , marker='.', linestyle='none') 38 | 39 | 40 | # Make a legend and show the plot 41 | _ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right') 42 | plt.show() 43 | 44 | Q3:- 45 | Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data. 46 | Take 10,000 samples out of a normal distribution with this mean and standard deviation using np.random.normal(). 47 | Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively. 48 | Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot. 49 | 50 | Solution:- 51 | # Compute mean and standard deviation: mu, sigma 52 | mu, sigma = np.mean(belmont_no_outliers), np.std(belmont_no_outliers) 53 | 54 | 55 | # Sample out of a normal distribution with this mu and sigma: samples 56 | samples = np.random.normal(mu,sigma,10000) 57 | 58 | # Get the CDF of the samples and of the data 59 | x_theor, y_theor = ecdf(samples) 60 | x,y = ecdf(belmont_no_outliers) 61 | 62 | 63 | # Plot the CDFs and show the plot 64 | _ = plt.plot(x_theor, y_theor) 65 | _ = plt.plot(x, y, marker='.', linestyle='none') 66 | _ = plt.xlabel('Belmont winning time (sec.)') 67 | _ = plt.ylabel('CDF') 68 | plt.show() 69 | 70 | Q4:- 71 | Take 1,000,000 samples from the normal distribution using the np.random.normal() function. The mean mu and standard deviation sigma are already loaded into the namespace of your IPython instance. 72 | Compute the fraction of samples that have a time less than or equal to Secretariat's time of 144 seconds. 73 | 74 | Solution:- 75 | # Take a million samples out of the Normal distribution: samples 76 | samples = np.random.normal(mu,sigma,1000000) 77 | 78 | # Compute the fraction that are faster than 144 seconds: prob 79 | prob = sum(samples <= 144)/1000000 80 | 81 | # Print the result 82 | print('Probability of besting Secretariat:', prob) 83 | 84 | Q5:- 85 | Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the waiting time for a no-hitter and a hit of the cycle. 86 | Draw waiting times tau1 (size number of samples) for the no-hitter out of an exponential distribution and assign to t1. 87 | Draw waiting times tau2 (size number of samples) for hitting the cycle out of an exponential distribution and assign to t2. 88 | The function returns the sum of the waiting times for the two events. 
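One detail the solution below relies on: np.random.exponential() is parameterized by the scale, i.e. the mean waiting time tau, not by the rate 1/tau. A quick sanity check with a made-up tau:

import numpy as np

np.random.seed(42)
tau = 764                                    # made-up mean waiting time
samples = np.random.exponential(tau, size=100000)
print(np.mean(samples))                      # close to 764, confirming scale = mean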
89 | 90 | Solution:- 91 | def successive_poisson(tau1, tau2, size=1): 92 | """Compute time for arrival of 2 successive Poisson processes.""" 93 | # Draw samples out of first exponential distribution: t1 94 | t1 = np.random.exponential(tau1, size) 95 | 96 | # Draw samples out of second exponential distribution: t2 97 | t2 = np.random.exponential(tau2, size) 98 | 99 | return t1 + t2 100 | 101 | Q6:- 102 | Use your successive_poisson() function to draw 100,000 out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle. 103 | Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, normed=True, and histtype='step'. 104 | Label the axes. 105 | Show your plot. 106 | 107 | Solution:- 108 | # Draw samples of waiting times: waiting_times 109 | waiting_times = waiting_times = np.array(successive_poisson(764, 715, 100000)) 110 | 111 | # Make the histogram 112 | plt.hist(waiting_times, bins=100,normed=True,histtype='step') 113 | 114 | 115 | # Label axes 116 | plt.xlabel('x') 117 | plt.ylabel('y') 118 | 119 | 120 | # Show the plot 121 | plt.show() 122 | 123 | -------------------------------------------------------------------------------- /Python/Supervised Learning with scikit-learn/Classification: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import KNeighborsClassifier from sklearn.neighbors. 3 | Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape. 4 | Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter. 5 | Fit the classifier to the data using the .fit() method. 6 | 7 | Solution:- 8 | # Import KNeighborsClassifier from sklearn.neighbors 9 | from sklearn.neighbors import KNeighborsClassifier 10 | 11 | # Create arrays for the features and the response variable 12 | y = df['party'].values 13 | X = df.drop('party', axis=1).values 14 | 15 | # Create a k-NN classifier with 6 neighbors 16 | knn = KNeighborsClassifier(n_neighbors=6) 17 | 18 | # Fit the classifier to the data 19 | knn.fit(X,y) 20 | 21 | Q2:- 22 | Create arrays for the features and the target variable from df. As a reminder, the target variable is 'party'. 23 | Instantiate a KNeighborsClassifier with 6 neighbors. 24 | Fit the classifier to the data. 25 | Predict the labels of the training data, X. 26 | Predict the label of the new data point X_new. 
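Note on the prediction step below: scikit-learn's .predict() expects a 2-D array of shape (n_samples, n_features), so X_new is assumed to already have that shape; a single observation stored as a 1-D array would need reshaping first. A minimal sketch with made-up data (the names here are illustrative, not from the exercise):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(20, 3)            # 20 made-up samples with 3 features
y = np.random.randint(0, 2, 20)      # made-up binary labels

knn = KNeighborsClassifier(n_neighbors=6).fit(X, y)

one_point = np.random.rand(3)                    # a single 1-D observation
print(knn.predict(one_point.reshape(1, -1)))     # reshape to (1, 3) before predicting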
27 | 28 | Solution:- 29 | # Import KNeighborsClassifier from sklearn.neighbors 30 | from sklearn.neighbors import KNeighborsClassifier 31 | 32 | # Create arrays for the features and the response variable 33 | y = df['party'].values 34 | X = df.drop('party',axis=1).values 35 | 36 | # Create a k-NN classifier with 6 neighbors: knn 37 | knn = KNeighborsClassifier(n_neighbors=6) 38 | 39 | # Fit the classifier to the data 40 | knn.fit(X,y) 41 | 42 | # Predict the labels for the training data X 43 | y_pred = knn.predict(X) 44 | 45 | # Predict and print the label for the new data point X_new 46 | new_prediction = knn.predict(X_new) 47 | print("Prediction: {}".format(new_prediction)) 48 | 49 | Q3:- 50 | Import datasets from sklearn and matplotlib.pyplot as plt. 51 | Load the digits dataset using the .load_digits() method on datasets. 52 | Print the keys and DESCR of digits. 53 | Print the shape of images and data keys using the . notation. 54 | Display the 1011th image using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be! 55 | 56 | Solution:- 57 | # Import necessary modules 58 | from sklearn import datasets 59 | import matplotlib.pyplot as plt 60 | 61 | # Load the digits dataset: digits 62 | digits = datasets.load_digits() 63 | 64 | # Print the keys and DESCR of the dataset 65 | print(digits.DESCR) 66 | print(digits.keys()) 67 | 68 | # Print the shape of the images and data keys 69 | print(digits.images.shape) 70 | print(digits.data.shape) 71 | 72 | # Display digit 1010 73 | plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest') 74 | plt.show() 75 | 76 | Q4:- 77 | Import KNeighborsClassifier from sklearn.neighbors and train_test_split from sklearn.model_selection. 78 | Create an array for the features using digits.data and an array for the target using digits.target. 79 | Create stratified training and test sets using 0.2 for the size of the test set. Use a random state of 42. Stratify the split according to the labels so that they are distributed in the training and test sets as they are in the original dataset. 80 | Create a k-NN classifier with 7 neighbors and fit it to the training data. 81 | Compute and print the accuracy of the classifier's predictions using the .score() method. 82 | 83 | Solution:- 84 | # Import necessary modules 85 | from sklearn.neighbors import KNeighborsClassifier 86 | from sklearn.model_selection import train_test_split 87 | digits = datasets.load_digits() 88 | 89 | # Create feature and target arrays 90 | X = digits.data 91 | y = digits.target 92 | 93 | # Split into training and test set 94 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y) 95 | 96 | # Create a k-NN classifier with 7 neighbors: knn 97 | knn = KNeighborsClassifier(n_neighbors=7) 98 | 99 | # Fit the classifier to the training data 100 | knn.fit(X_train,y_train) 101 | 102 | # Print the accuracy 103 | print(knn.score(X_test, y_test)) 104 | 105 | Q5:- 106 | Inside the for loop: 107 | Setup a k-NN classifier with the number of neighbors equal to k. 108 | Fit the classifier with k neighbors to the training data. 109 | Compute accuracy scores the training set and test set separately using the .score() method and assign the results to the train_accuracy and test_accuracy arrays respectively. 
110 | 111 | Solution:- 112 | # Setup arrays to store train and test accuracies 113 | neighbors = np.arange(1, 9) 114 | train_accuracy = np.empty(len(neighbors)) 115 | test_accuracy = np.empty(len(neighbors)) 116 | 117 | # Loop over different values of k 118 | for i, k in enumerate(neighbors): 119 | # Setup a k-NN Classifier with k neighbors: knn 120 | knn = KNeighborsClassifier(n_neighbors=k) 121 | 122 | # Fit the classifier to the training data 123 | knn.fit(X_train,y_train) 124 | 125 | #Compute accuracy on the training set 126 | train_accuracy[i] = knn.score(X_train, y_train) 127 | 128 | #Compute accuracy on the testing set 129 | test_accuracy[i] = knn.score(X_test, y_test) 130 | 131 | # Generate plot 132 | plt.title('k-NN: Varying Number of Neighbors') 133 | plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy') 134 | plt.plot(neighbors, train_accuracy, label = 'Training Accuracy') 135 | plt.legend() 136 | plt.xlabel('Number of Neighbors') 137 | plt.ylabel('Accuracy') 138 | plt.show() 139 | -------------------------------------------------------------------------------- /Python/Unsupervised Learning in Python/Discovering interpretable features: -------------------------------------------------------------------------------- 1 | Q1:- 2 | -------------------------------------------------------------------------------- /Python/Unsupervised Learning in Python/Visualization with hierarchical clustering and t-SNE: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import: 3 | linkage and dendrogram from scipy.cluster.hierarchy. 4 | matplotlib.pyplot as plt. 5 | Perform hierarchical clustering on samples using the linkage() function with the method='complete' keyword argument. Assign the result to mergings. 6 | Plot a dendrogram using the dendrogram() function on mergings. Specify the keyword arguments labels=varieties, leaf_rotation=90, and leaf_font_size=6. 7 | 8 | Solution:- 9 | # Perform the necessary imports 10 | from scipy.cluster.hierarchy import linkage, dendrogram 11 | import matplotlib.pyplot as plt 12 | 13 | # Calculate the linkage: mergings 14 | mergings = linkage(samples, method='complete') 15 | 16 | # Plot the dendrogram, using varieties as labels 17 | dendrogram(mergings, 18 | labels=varieties, 19 | leaf_rotation=90, 20 | leaf_font_size=6, 21 | ) 22 | plt.show() 23 | 24 | Q2:- 25 | Import normalize from sklearn.preprocessing. 26 | Rescale the price movements for each stock by using the normalize() function on movements. 27 | Apply the linkage() function to normalized_movements, using 'complete' linkage, to calculate the hierarchical clustering. Assign the result to mergings. 28 | Plot a dendrogram of the hierarchical clustering, using the list companies of company names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you did in the previous exercise. 29 | 30 | Solution:- 31 | # Import normalize 32 | from sklearn.preprocessing import normalize 33 | 34 | # Normalize the movements: normalized_movements 35 | normalized_movements = normalize(movements) 36 | 37 | # Calculate the linkage: mergings 38 | mergings = linkage(normalized_movements,method='complete') 39 | 40 | # Plot the dendrogram 41 | dendrogram(mergings,labels=companies,leaf_rotation=90,leaf_font_size=6) 42 | plt.show() 43 | 44 | Q3:- 45 | Import: 46 | linkage and dendrogram from scipy.cluster.hierarchy. 47 | matplotlib.pyplot as plt. 
48 | Perform hierarchical clustering on samples using the linkage() function with the method='single' keyword argument. Assign the result to mergings. 49 | Plot a dendrogram of the hierarchical clustering, using the list country_names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you have done earlier. 50 | 51 | Solution:- 52 | # Perform the necessary imports 53 | import matplotlib.pyplot as plt 54 | from scipy.cluster.hierarchy import linkage, dendrogram 55 | 56 | # Calculate the linkage: mergings 57 | mergings = linkage(samples,method='single') 58 | 59 | # Plot the dendrogram 60 | dendrogram(mergings,labels=country_names,leaf_rotation=90,leaf_font_size=6) 61 | plt.show() 62 | 63 | Q4:- 64 | Import: 65 | pandas as pd. 66 | fcluster from scipy.cluster.hierarchy. 67 | Perform a flat hierarchical clustering by using the fcluster() function on mergings. Specify a maximum height of 6 and the keyword argument criterion='distance'. 68 | Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you. 69 | Create a cross-tabulation ct between df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. 70 | 71 | Solution:- 72 | # Perform the necessary imports 73 | import pandas as pd 74 | from scipy.cluster.hierarchy import fcluster 75 | 76 | # Use fcluster to extract labels: labels 77 | labels = fcluster(mergings,6,criterion='distance') 78 | 79 | # Create a DataFrame with labels and varieties as columns: df 80 | df = pd.DataFrame({'labels': labels, 'varieties': varieties}) 81 | 82 | # Create crosstab: ct 83 | ct = pd.crosstab(df['labels'],df['varieties']) 84 | 85 | # Display ct 86 | print(ct) 87 | 88 | Q5:- 89 | Import TSNE from sklearn.manifold. 90 | Create a TSNE instance called model with learning_rate=200. 91 | Apply the .fit_transform() method of model to samples. Assign the result to tsne_features. 92 | Select the column 0 of tsne_features. Assign the result to xs. 93 | Select the column 1 of tsne_features. Assign the result to ys. 94 | Make a scatter plot of the t-SNE features xs and ys. To color the points by the grain variety, specify the additional keyword argument c=variety_numbers. 95 | 96 | Solution:- 97 | # Import TSNE 98 | from sklearn.manifold import TSNE 99 | 100 | # Create a TSNE instance: model 101 | model = TSNE(learning_rate=200) 102 | 103 | # Apply fit_transform to samples: tsne_features 104 | tsne_features = model.fit_transform(samples) 105 | 106 | # Select the 0th feature: xs 107 | xs = tsne_features[:,0] 108 | 109 | # Select the 1st feature: ys 110 | ys = tsne_features[:,1] 111 | 112 | # Scatter plot, coloring by variety_numbers 113 | plt.scatter(xs,ys,c=variety_numbers) 114 | plt.show() 115 | 116 | Q6:- 117 | Import TSNE from sklearn.manifold. 118 | Create a TSNE instance called model with learning_rate=50. 119 | Apply the .fit_transform() method of model to normalized_movements. Assign the result to tsne_features. 120 | Select column 0 and column 1 of tsne_features. 121 | Make a scatter plot of the t-SNE features xs and ys. Specify the additional keyword argument alpha=0.5. 122 | Code to label each point with its company name has been written for you using plt.annotate(), so just hit 'Submit Answer' to see the visualization! 
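For orientation before the solution: TSNE.fit_transform() returns one 2-D point per input sample, which is why columns 0 and 1 can be used directly as scatter-plot coordinates. A standalone sketch on made-up data (defaults such as learning_rate have changed in newer scikit-learn releases, so warnings and results may differ by version):

import numpy as np
from sklearn.manifold import TSNE

samples = np.random.rand(50, 5)          # made-up feature matrix
model = TSNE(learning_rate=200)
tsne_features = model.fit_transform(samples)
print(tsne_features.shape)               # (50, 2): one 2-D embedding per sample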
123 | 124 | Solution:- 125 | # Import TSNE 126 | from sklearn.manifold import TSNE 127 | 128 | # Create a TSNE instance: model 129 | model = TSNE(learning_rate=50) 130 | 131 | # Apply fit_transform to normalized_movements: tsne_features 132 | tsne_features = model.fit_transform(normalized_movements) 133 | 134 | # Select the 0th feature: xs 135 | xs = tsne_features[:,0] 136 | 137 | # Select the 1th feature: ys 138 | ys = tsne_features[:,1] 139 | 140 | # Scatter plot 141 | plt.scatter(xs,ys,alpha=0.5) 142 | 143 | # Annotate the points 144 | for x, y, company in zip(xs, ys, companies): 145 | plt.annotate(company, (x, y), fontsize=5, alpha=0.75) 146 | plt.show() 147 | -------------------------------------------------------------------------------- /Python/pandas Foundations/Data ingestion & inspection: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Import numpy using the standard alias np. 3 | Assign the numerical values in the DataFrame df to an array np_vals using the attribute values. 4 | Pass np_vals into the NumPy method log10() and store the results in np_vals_log10. 5 | Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10. 6 | Inspect the output of the print() code to see the type() of the variables that you created. 7 | 8 | Solution:- 9 | # Import numpy 10 | import numpy as np 11 | 12 | # Create array of DataFrame values: np_vals 13 | np_vals = df.values 14 | 15 | # Create new array of base 10 logarithm values: np_vals_log10 16 | np_vals_log10 = np.log10(np_vals) 17 | 18 | # Create array of new DataFrame by passing df to np.log10(): df_log10 19 | df_log10 = np.log10(df) 20 | 21 | # Print original and new data containers 22 | [print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'df', 'df_log10']] 23 | 24 | Q2:- 25 | Zip the 2 lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped. 26 | Inspect the contents of zipped using print(). This has been done for you. 27 | Construct a dictionary using zipped. Store the result as data. 28 | Construct a DataFrame using the dictionary. Store the result as df. 29 | 30 | Solution:- 31 | # Zip the 2 lists together into one list of (key,value) tuples: zipped 32 | zipped = list(zip(list_keys,list_values)) 33 | 34 | # Inspect the list using print() 35 | print(zipped) 36 | 37 | # Build a dictionary with the zipped list: data 38 | data = dict(zipped) 39 | 40 | # Build and inspect a DataFrame from the dictionary: df 41 | df = pd.DataFrame(data) 42 | print(df) 43 | 44 | Q3:- 45 | Create a list of new column labels with 'year', 'artist', 'song', 'chart weeks', and assign it to list_labels. 46 | Assign your list of labels to df.columns. 47 | 48 | Solution:- 49 | # Build a list of labels: list_labels 50 | list_labels = ['year','artist','song','chart weeks'] 51 | 52 | # Assign the list of labels to the columns attribute: df.columns 53 | df.columns = list_labels 54 | 55 | Q4:- 56 | Make a string object with the value 'PA' and assign it to state. 57 | Construct a dictionary with 2 key:value pairs: 'state':state and 'city':cities. 58 | Construct a pandas DataFrame from the dictionary you created and assign it to df. 
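Note for the solution below: when one dictionary value is a scalar (the string 'PA') and another is a list, pd.DataFrame() broadcasts the scalar down every row. A self-contained sketch with a made-up cities list (the real list is supplied by the exercise):

import pandas as pd

cities = ['Philadelphia', 'Pittsburgh', 'Erie']   # made-up stand-in
df = pd.DataFrame({'state': 'PA', 'city': cities})
print(df)
# 'PA' is repeated on every row alongside each city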
59 | 60 | Solution:- 61 | # Make a string with the value 'PA': state 62 | state = "PA" 63 | 64 | # Construct a dictionary: data 65 | data = {'state':state, 'city':cities} 66 | 67 | # Construct a DataFrame from dictionary data: df 68 | df = pd.DataFrame(data) 69 | 70 | # Print the DataFrame 71 | print(df) 72 | 73 | Q5:- 74 | Use pd.read_csv() with the string 'world_population.csv' to read the CSV file into a DataFrame and assign it to df1. 75 | Create a list of new column labels - 'year', 'population' - and assign it to the variable new_labels. 76 | Reread the same file, again using pd.read_csv(), but this time, add the keyword arguments header=0 and names=new_labels. Assign the resulting DataFrame to df2. 77 | Print both the df1 and df2 DataFrames to see the change in column names. This has already been done for you. 78 | 79 | Solution:- 80 | # Read in the file: df1 81 | df1 = pd.read_csv('world_population.csv') 82 | 83 | # Create a list of the new column labels: new_labels 84 | new_labels = ['year','population'] 85 | 86 | # Read in the file, specifying the header and names parameters: df2 87 | df2 = pd.read_csv('world_population.csv', header=0, names=new_labels) 88 | 89 | # Print both the DataFrames 90 | print(df1) 91 | print(df2) 92 | 93 | Q6:- 94 | Use pd.read_csv() without using any keyword arguments to read file_messy into a pandas DataFrame df1. 95 | Use .head() to print the first 5 rows of df1 and see how messy it is. Do this in the IPython Shell first so you can see how modifying read_csv() can clean up this mess. 96 | Using the keyword arguments delimiter=' ', header=3 and comment='#', use pd.read_csv() again to read file_messy into a new DataFrame df2. 97 | Print the output of df2.head() to verify the file was read correctly. 98 | Use the DataFrame method .to_csv() to save the DataFrame df2 to the variable file_clean. Be sure to specify index=False. 99 | Use the DataFrame method .to_excel() to save the DataFrame df2 to the file 'file_clean.xlsx'. Again, remember to specify index=False. 100 | 101 | Solution:- 102 | # Read the raw file as-is: df1 103 | df1 = pd.read_csv(file_messy) 104 | 105 | # Print the output of df1.head() 106 | print(df1.head()) 107 | 108 | # Read in the file with the correct parameters: df2 109 | df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#') 110 | 111 | # Print the output of df2.head() 112 | print(df2.head()) 113 | 114 | # Save the cleaned up DataFrame to a CSV file without the index 115 | df2.to_csv(file_clean, index=False) 116 | 117 | # Save the cleaned up DataFrame to an excel file without the index 118 | df2.to_excel('file_clean.xlsx', index=False) 119 | 120 | Q7:- 121 | Create the plot with the DataFrame method df.plot(). Specify a color of 'red'. 122 | Note: c and color are interchangeable as parameters here, but we ask you to be explicit and specify color. 123 | Use plt.title() to give the plot a title of 'Temperature in Austin'. 124 | Use plt.xlabel() to give the plot an x-axis label of 'Hours since midnight August 1, 2010'. 125 | Use plt.ylabel() to give the plot a y-axis label of 'Temperature (degrees F)'. 126 | Finally, display the plot using plt.show(). 
127 | 128 | Solution:- 129 | # Create a plot with color='red' 130 | df.plot(color='red') 131 | 132 | # Add a title 133 | plt.title('Temperature in Austin') 134 | 135 | # Specify the x-axis label 136 | plt.xlabel('Hours since midnight August 1, 2010') 137 | 138 | # Specify the y-axis label 139 | plt.ylabel('Temperature (degrees F)') 140 | 141 | # Display the plot 142 | plt.show() 143 | 144 | Q8:- 145 | Plot all columns together on one figure by calling df.plot(), and noting the vertical scaling problem. 146 | Plot all columns as subplots. To do so, you need to specify subplots=True inside .plot(). 147 | Plot a single column of dew point data. To do this, define a column list containing a single column name 'Dew Point (deg F)', and call df[column_list1].plot(). 148 | Plot two columns of data, 'Temperature (deg F)' and 'Dew Point (deg F)'. To do this, define a list containing those column names and pass it into df[], as df[column_list2].plot(). 149 | 150 | Solution:- 151 | # Plot all columns (default) 152 | df.plot() 153 | plt.show() 154 | 155 | # Plot all columns as subplots 156 | df.plot(subplots=True) 157 | plt.show() 158 | 159 | # Plot just the Dew Point data 160 | column_list1 = ['Dew Point (deg F)'] 161 | df[column_list1].plot() 162 | plt.show() 163 | 164 | # Plot the Dew Point and Temperature data, but not the Pressure data 165 | column_list2 = ['Temperature (deg F)','Dew Point (deg F)'] 166 | df[column_list2].plot() 167 | plt.show() 168 | 169 | -------------------------------------------------------------------------------- /Python/pandas Foundations/Exploratory data analysis: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Create a list of y-axis column names called y_columns consisting of 'AAPL' and 'IBM'. 3 | Generate a line plot with x='Month' and y=y_columns as inputs. 4 | Give the plot a title of 'Monthly stock prices'. 5 | Specify the y-axis label. 6 | Display the plot. 7 | 8 | Solution:- 9 | # Create a list of y-axis column names: y_columns 10 | y_columns = ['AAPL','IBM'] 11 | 12 | # Generate a line plot 13 | df.plot(x='Month', y=y_columns) 14 | 15 | # Add the title 16 | plt.title('Monthly stock prices') 17 | 18 | # Add the y-axis label 19 | plt.ylabel('Price ($US)') 20 | 21 | # Display the plot 22 | plt.show() 23 | 24 | Q2:- 25 | Generate a scatter plot with 'hp' on the x-axis and 'mpg' on the y-axis. Specify s=sizes. 26 | Add a title to the plot. 27 | Specify the x-axis and y-axis labels. 28 | 29 | Solution:- 30 | # Generate a scatter plot 31 | df.plot(kind='scatter', x='hp', y='mpg', s=sizes) 32 | 33 | # Add the title 34 | plt.title('Fuel efficiency vs Horse-power') 35 | 36 | # Add the x-axis label 37 | plt.xlabel('Horse-power') 38 | 39 | # Add the y-axis label 40 | plt.ylabel('Fuel efficiency (mpg)') 41 | 42 | # Display the plot 43 | plt.show() 44 | 45 | Q3:- 46 | Make a list called cols of the column names to be plotted: 'weight' and 'mpg'. You can then access it using df[cols]. 47 | Generate a box plot of the two columns in a single figure. To do this, specify subplots=True 48 | 49 | Solution:- 50 | # Make a list of the column names to be plotted: cols 51 | cols = ['weight','mpg'] 52 | 53 | # Generate the box plots 54 | df[cols].plot(kind='box',subplots=True) 55 | 56 | # Display the plot 57 | plt.show() 58 | 59 | Q4:- 60 | Plot a PDF for the values in fraction with 30 bins between 0 and 30%. 61 | The range has been taken care of for you. ax=axes[0] means that this plot will appear in the first row. 
62 | Plot a CDF for the values in fraction with 30 bins between 0 and 30%. 63 | Again, the range has been specified for you. To make the CDF appear on the second row, you need to specify ax=axes[1]. 64 | 65 | Solution:- 66 | # This formats the plots such that they appear on separate rows 67 | fig, axes = plt.subplots(nrows=2, ncols=1) 68 | 69 | # Plot the PDF 70 | df.fraction.plot(ax=axes[0], kind='hist', bins=30, normed=True, range=(0,.3)) 71 | plt.show() 72 | 73 | # Plot the CDF 74 | df.fraction.plot(kind='hist', bins=30, cumulative=True, normed=True, ax=axes[1], range=(0,.3)) 75 | plt.show() 76 | 77 | Q5:- 78 | Print the minimum value of the 'Engineering' column. 79 | Print the maximum value of the 'Engineering' column. 80 | Construct the mean percentage per year with .mean(axis='columns'). Assign the result to mean. 81 | Plot the average percentage per year. 82 | Since 'Year' is the index of df, it will appear on the x-axis of the plot. No keyword arguments are needed in your call to .plot(). 83 | 84 | Solution:- 85 | # Print the minimum value of the Engineering column 86 | print(df['Engineering'].min()) 87 | 88 | # Print the maximum value of the Engineering column 89 | print(df['Engineering'].max()) 90 | 91 | # Construct the mean percentage per year: mean 92 | mean = df.mean(axis='columns') 93 | 94 | # Plot the average percentage per year 95 | mean.plot() 96 | 97 | # Display the plot 98 | plt.show() 99 | 100 | Q6:- 101 | Print summary statistics of the 'fare' column of df with .describe() and print(). Note: df.fare and df['fare'] are equivalent. 102 | Generate a box plot of the 'fare' column. 103 | 104 | Solution:- 105 | # Print summary statistics of the fare column with .describe() 106 | print(df['fare'].describe()) 107 | 108 | # Generate a box plot of the fare column 109 | df.fare.plot(kind='box') 110 | 111 | # Show the plot 112 | plt.show() 113 | 114 | Q7:- 115 | Print the number of countries reported in 2015. To do this, use the .count() method on the '2015' column of df. 116 | Print the 5th and 95th percentiles of df. To do this, use the .quantile() method with the list [0.05, 0.95]. 117 | Generate a box plot using the list of columns provided in years. 118 | This has already been done for you, so click on 'Submit Answer' to view the result! 119 | 120 | Solution- 121 | # Print the number of countries reported in 2015 122 | print(df['2015'].count()) 123 | 124 | # Print the 5th and 95th percentiles 125 | print(df.quantile([0.05, 0.95])) 126 | 127 | # Generate a box plot 128 | years = ['1800','1850','1900','1950','2000'] 129 | df[years].plot(kind='box') 130 | plt.show() 131 | 132 | Q8:- 133 | Compute and print the means of the January and March data using the .mean() method. 134 | Compute and print the standard deviations of the January and March data using the .std() method. 135 | 136 | Solution:- 137 | # Print the mean of the January and March data 138 | print(january.mean(), march.mean()) 139 | 140 | # Print the standard deviation of the January and March data 141 | print(january.std(), march.std()) 142 | 143 | Q9:- 144 | Filtering and counting 145 | How many automobiles were manufactured in Asia in the automobile dataset? 146 | The DataFrame has been provided for you as df. Use filtering and the .count() member method to determine the number of rows where the 'origin' column has the value 'Asia'. 147 | As an example, you can extract the rows that contain 'US' as the country of origin using df[df['origin'] == 'US']. 
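On the counting step below: .count() returns the number of non-null entries per column, so filtering and then counting yields one count per column rather than a single number; len() or .shape[0] give the plain row count. A sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'origin': ['US', 'Asia', 'Asia', 'Europe'],
                   'mpg': [18.0, 24.0, None, 26.0]})

print(df[df['origin'] == 'Asia'].count())   # per-column non-null counts (mpg shows 1 here)
print(len(df[df['origin'] == 'Asia']))      # number of matching rows: 2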
148 | 149 | Solution:- 150 | df[df['origin'] == 'Asia'].count() 151 | 152 | Q10:- 153 | Compute the global mean and global standard deviations of df using the .mean() and .std() methods. 154 | Assign the results to global_mean and global_std. 155 | Filter the 'US' population from the 'origin' column and assign the result to us. 156 | Compute the US mean and US standard deviations of us using the .mean() and .std() methods. Assign the results to us_mean and us_std. 157 | Print the differences between us_mean and global_mean and us_std and global_std. This has already been done for you. 158 | 159 | Solution:- 160 | # Compute the global mean and global standard deviation: global_mean, global_std 161 | global_mean = df.mean() 162 | global_std = df.std() 163 | 164 | # Filter the US population from the origin column: us 165 | us = df[df['origin']=='US'] 166 | 167 | # Compute the US mean and US standard deviation: us_mean, us_std 168 | us_mean = us.mean() 169 | us_std = us.std() 170 | 171 | # Print the differences 172 | print(us_mean - global_mean) 173 | print(us_std - global_std) 174 | 175 | Q11:- 176 | Inside plt.subplots(), specify the nrows and ncols parameters so that there are 3 rows and 1 column. 177 | Filter the rows where the 'pclass' column has the values 1 and generate a box plot of the 'fare' column. 178 | Filter the rows where the 'pclass' column has the values 2 and generate a box plot of the 'fare' column. 179 | Filter the rows where the 'pclass' column has the values 3 and generate a box plot of the 'fare' column. 180 | 181 | Solution:- 182 | # Display the box plots on 3 separate rows and 1 column 183 | fig, axes = plt.subplots(nrows=3, ncols=1) 184 | 185 | # Generate a box plot of the fare prices for the First passenger class 186 | titanic.loc[titanic['pclass'] == 1].plot(ax=axes[0], y='fare', kind='box') 187 | 188 | # Generate a box plot of the fare prices for the Second passenger class 189 | titanic.loc[titanic['pclass'] == 2].plot(ax=axes[1], y='fare', kind='box') 190 | 191 | # Generate a box plot of the fare prices for the Third passenger class 192 | titanic.loc[titanic['pclass'] == 3].plot(ax=axes[2], y='fare', kind='box') 193 | 194 | # Display the plot 195 | plt.show() 196 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DataCamp 2 | This repository contains assignments on courses related to data science from Data camp 3 | -------------------------------------------------------------------------------- /SparkR/Introduction to Spark in R using sparklyr/Going Native: Use The Native Interface to Manipulate Spark DataFrames: -------------------------------------------------------------------------------- 1 | Q1:- 2 | Which of these statements is true? 3 | 4 | sparklyr's dplyr methods convert code into Scala code before running it on Spark. 5 | Converting R code into SQL code limits the number of supported computations. 6 | Most Spark MLlib modeling functions require DoubleType inputs and return DoubleType outputs. 7 | Most Spark MLlib modeling functions require IntegerType inputs and return BooleanType outputs 8 | 9 | Solution:- 10 | 2 and 3. 11 | 12 | Q2:- 13 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as track_metadata_tbl. 14 | 15 | Create a variable named hotttnesss from track_metadata_tbl. 16 | Select the artist_hotttnesss field. 
17 | Use ft_binarizer() to create a new field, is_hottt_or_nottt, which is true when artist_hotttnesss is greater than 0.5.
18 | Collect the result.
19 | Convert the is_hottt_or_nottt field to be logical.
20 | Draw a ggplot() bar plot of is_hottt_or_nottt.
21 | The first argument to ggplot() is the data argument, hotttnesss.
22 | The second argument to ggplot() is the aesthetic, is_hottt_or_nottt wrapped in aes().
23 | Add geom_bar() to draw the bars.
24 | 
25 | Solution:-
26 | # One possible solution (a sketch; assumes dplyr and ggplot2 are loaded)
27 | hotttnesss <- track_metadata_tbl %>%
28 |   # Select the artist_hotttnesss field
29 |   select(artist_hotttnesss) %>%
30 |   # Binarize artist_hotttnesss at the 0.5 threshold
31 |   ft_binarizer("artist_hotttnesss", "is_hottt_or_nottt", threshold = 0.5) %>%
32 |   # Collect the result back into R
33 |   collect() %>%
34 |   # Convert is_hottt_or_nottt to a logical
35 |   mutate(is_hottt_or_nottt = as.logical(is_hottt_or_nottt))
36 | 
37 | # Draw a bar plot of is_hottt_or_nottt
38 | ggplot(hotttnesss, aes(is_hottt_or_nottt)) +
39 |   geom_bar()
40 | 
--------------------------------------------------------------------------------
/SparkR/Introduction to Spark in R using sparklyr/Light My Fire: Starting To Use Spark With dplyr Syntax:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Load the sparklyr package with library(). Connect to Spark by calling spark_connect(), with argument master = "local". Assign the result to spark_conn.
3 | Get the Spark version using spark_version(), with argument sc = spark_conn. Disconnect from Spark using spark_disconnect(), with argument sc = spark_conn.
4 | 
5 | Solution:-
6 | # Load sparklyr
7 | library(sparklyr)
8 | 
9 | # Connect to your Spark cluster
10 | spark_conn <- spark_connect(master = "local")
11 | 
12 | # Print the version of Spark
13 | print(spark_version(sc = spark_conn))
14 | 
15 | # Disconnect from Spark
16 | spark_disconnect(sc = spark_conn)
17 | 
18 | Q2:-
19 | track_metadata, containing the song name, artist name, and other metadata for 1,000 tracks, has been pre-defined in your workspace.
20 | Use str() to explore the track_metadata dataset. Connect to your local Spark cluster, storing the connection in spark_conn.
21 | Copy track_metadata to the Spark cluster using copy_to(). See which data frames are available in Spark, using src_tbls().
22 | Disconnect from Spark.
23 | 
24 | Solution:-
25 | # Load dplyr
26 | library(dplyr)
27 | 
28 | # Explore track_metadata structure
29 | str(track_metadata)
30 | 
31 | # Connect to your Spark cluster
32 | spark_conn <- spark_connect(master = "local")
33 | 
34 | # Copy track_metadata to Spark
35 | track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)
36 | 
37 | # List the data frames available in Spark
38 | src_tbls(spark_conn)
39 | 
40 | # Disconnect from Spark
41 | spark_disconnect(spark_conn)
42 | 
43 | Q3:-
44 | A Spark connection has been created for you as spark_conn. The track metadata for 1,000 tracks is stored in the Spark cluster in the table "track_metadata".
45 | Link to the "track_metadata" table using tbl(). Assign the result to track_metadata_tbl. See how big the dataset is, using dim() on track_metadata_tbl.
46 | See how small the tibble is, using object_size() on track_metadata_tbl.
47 | 
48 | Solution:-
49 | # Link to the track_metadata table in Spark
50 | track_metadata_tbl <- tbl(spark_conn, "track_metadata")
51 | 
52 | # See how big the dataset is
53 | dim(track_metadata_tbl)
54 | 
55 | # See how small the tibble is
56 | object_size(track_metadata_tbl)
57 | 
58 | Q4:-
59 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined
60 | as track_metadata_tbl. Print the first 5 rows and all the columns of the track metadata. Examine the structure of the tibble using str().
61 | Examine the structure of the track metadata using glimpse().
62 | 
63 | Solution:-
64 | # Print 5 rows, all columns
65 | print(track_metadata_tbl, n = 5, width = Inf)
66 | 
67 | # Examine the structure of the tibble
68 | str(track_metadata_tbl)
69 | 
70 | # Examine the structure of the data
71 | glimpse(track_metadata_tbl)
72 | 
73 | Q5:-
74 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
75 | track_metadata_tbl. Select the artist_name, release, title, and year using select(). Try to do the same thing using square bracket
76 | indexing. Spoiler! This code throws an error, so it is wrapped in a call to tryCatch().
77 | 
78 | Solution:-
79 | # track_metadata_tbl has been pre-defined
80 | track_metadata_tbl
81 | 
82 | # Manipulate the track metadata
83 | track_metadata_tbl %>%
84 |   # Select columns
85 |   select('artist_name', 'release', 'title', 'year')
86 | 
87 | # Try to select columns using [ ]
88 | tryCatch({
89 |   # Selection code here
90 |   track_metadata_tbl[, c("artist_name", "release", "title", "year")]
91 |   },
92 |   error = print
93 | )
94 | 
95 | Q6:-
96 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
97 | track_metadata_tbl. As in the previous exercise, select the artist_name, release, title, and year using select(). Pipe the result of this to filter() to get the tracks
98 | from the 1960s.
99 | 
100 | Solution:-
101 | # track_metadata_tbl has been pre-defined
102 | glimpse(track_metadata_tbl)
103 | 
104 | # Manipulate the track metadata
105 | track_metadata_tbl %>%
106 |   # Select columns
107 |   select(artist_name, release, title, year) %>%
108 |   # Filter rows
109 |   filter(year >= 1960, year < 1970)
110 | 
111 | Q7:-
112 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
113 | track_metadata_tbl. Select the artist_name, release, title, and year fields. Pipe the result of this to filter() to keep only tracks from the 1960s.
114 | Pipe the result of this to arrange() to order by artist_name, then descending year, then title.
115 | 
116 | Solution:-
117 | # track_metadata_tbl has been pre-defined
118 | track_metadata_tbl
119 | # Manipulate the track metadata
120 | track_metadata_tbl %>%
121 |   # Select columns
122 |   select(artist_name, release, title, year) %>%
123 |   # Filter rows
124 |   filter(year >= 1960, year < 1970) %>%
125 |   # Arrange rows
126 |   arrange(artist_name, desc(year), title)
127 | 
128 | Q8:-
129 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined
130 | as track_metadata_tbl. Select the title and duration fields. Note that the durations are in seconds. Pipe the result of this to mutate() to create a new field, duration_minutes,
131 | that contains the track duration in minutes.
132 | 
133 | Solution:-
134 | # track_metadata_tbl has been pre-defined
135 | track_metadata_tbl
136 | 
137 | # Manipulate the track metadata
138 | track_metadata_tbl %>%
139 |   # Select columns
140 |   select(title, duration) %>%
141 |   # Mutate columns
142 |   mutate(
143 |     duration_minutes = duration / 60
144 |   )
145 | 
146 | Q9:-
147 | A Spark connection has been created for you as spark_conn. A tibble attached to the track metadata stored in Spark has been pre-defined as
148 | track_metadata_tbl. Select the title and duration fields. Pipe the result of this to create a new field, duration_minutes, that contains the track duration in minutes.
149 | Pipe the result of this to summarize() to calculate the mean duration in minutes, in a field named mean_duration_minutes.
150 | 
151 | Solution:-
152 | # track_metadata_tbl has been pre-defined
153 | track_metadata_tbl
154 | 
155 | # Manipulate the track metadata
156 | track_metadata_tbl %>%
157 |   # Select columns
158 |   select(title, duration) %>%
159 |   # Mutate columns
160 |   mutate(
161 |     duration_minutes = duration / 60
162 |   ) %>%
163 |   # Summarize columns
164 |   summarize(
165 |     mean_duration_minutes = mean(duration_minutes)
166 |   )
167 | 
168 | 
169 | 
--------------------------------------------------------------------------------
/Spoken Language Processing in Python/Introduction to Spoken Language Processing with Python:
--------------------------------------------------------------------------------
1 | Q1:-
2 | Import the Python wave library.
3 | Read in the good_morning.wav audio file and save it to good_morning.
4 | Create signal_gm by reading all the frames from good_morning using readframes().
5 | See what the first 10 frames of audio look like by slicing signal_gm.
6 | 
7 | Solution:-
8 | import wave
9 | 
10 | # Create an audio file wave object
11 | good_morning = wave.open("good_morning.wav", 'r')
12 | 
13 | # Read all frames from the wave object
14 | signal_gm = good_morning.readframes(-1)
15 | 
16 | # View the first 10 frames
17 | print(signal_gm[:10])
18 | 
--------------------------------------------------------------------------------
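A natural follow-up to the exercise above (not part of the original solutions): the frames returned by readframes() are raw bytes, so before any numeric analysis they are usually converted to integers. A minimal sketch, assuming NumPy is installed and good_morning.wav contains 16-bit PCM audio; the soundwave_gm name is just illustrative:

import wave
import numpy as np

# Re-open the audio file and read all of its raw frames
good_morning = wave.open("good_morning.wav", 'r')
signal_gm = good_morning.readframes(-1)

# Interpret the raw bytes as 16-bit signed integers (assumes 16-bit PCM audio)
soundwave_gm = np.frombuffer(signal_gm, dtype='int16')

# View the first 10 integer sound-wave values
print(soundwave_gm[:10])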