├── .travis.yml ├── Big Data Fundamentals with PySpark ├── 1. Introduction to Big Data analysis with Spark.md ├── 2. Programming in PySpark RDD’s.md ├── 3. PySpark SQL & DataFrames.md └── 4. Machine Learning with PySpark MLlib.md ├── Cleaning Data with PySpark ├── 1. DataFrame details.md ├── 2. Manipulating DataFrames in the real world.md ├── 3. Improving Performance.md └── 4. Complex processing and data pipelines.md ├── Data Processing in Shell ├── I Downloading Data on the Command Line.py ├── II Data Cleaning and Munging on the Command Line.py ├── III Database Operations on the Command Line.py └── IV Data Pipeline on the Command Line.py ├── Database Design ├── 1. Processing, Storing, and Organizing Data │ ├── 1. OLTP and OLAP.txt │ ├── 2. Which is better?.py │ ├── 3. Name that data type.py │ ├── 4. Ordering ETL Tasks.py │ ├── 5. Recommend a storage solution.py │ ├── 6. Database design.txt │ ├── 7. Classifying data models.py │ ├── 8. Deciding fact and dimension tables.py │ └── 9. Querying the dimensional model.py ├── 2. Database Schemas and Normalization.md ├── 3. Database Views.md └── 4. Database Management.md ├── Introduction to AWS Boto in Python ├── 1. Putting Files in the Cloud! │ ├── I. Intro to AWS and Boto3.py │ ├── II. Your first boto3 client.py │ ├── III. Multiple clients Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. You will help Sam initialize a boto3 client for S3, and another client for SNS. She will use the S3 client to list the buckets in S3. She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). Instructions 100 XP Generate the boto3 clients for interacting with S3 and SNS. Specify 'us-east-1' for the region_name for both clients. Use AWS_KEY_ID and AWS_SECRET to set up the credentials. List and print the SNS topics..py │ ├── IV. Removing repetitive work.py │ ├── V. Diving into buckets.txt │ ├── VI. creating a bucket.py │ ├── VII. Listing buckets.py │ ├── VIII. Deleting a bucket.py │ ├── VIV. Deleting multiple buckets.py │ ├── X. Uploading and retrieving files.txt │ └── XI. Spring cleaning.py ├── 2. Sharing Files Securely │ ├── I. Keeping objects secure.txt │ ├── II. Making an object public.py │ ├── III. Uploading a public report.py │ ├── IV. Making multiple files public.py │ ├── V. Accessing private objects in S3.txt │ ├── VI. Generating a presigned URL.py │ ├── VII. Opening a private file.py │ ├── VIII. Sharing files through a website.txt │ ├── VIV. Generate HTML table from Pandas.py │ ├── X. Upload an HTML file to S3.py │ ├── XI. Combine daily requests for February.py │ ├── XII. Upload aggregated reports for February.py │ ├── XIII. Update index to include February.py │ └── XIV. Upload the new index.py ├── 3. Reporting and Notifying! │ ├── 1. SNS Topics.txt │ ├── 10. Sending a single SMS message.py │ ├── 11. Different protocols per topic level.py │ ├── 12. Creating multi-level topics.py │ ├── 13. Sending an alert.py │ ├── 14. Sending multi-level alerts.py │ ├── 2. Creating a Topic.py │ ├── 3. Creating Multiple Topics.py │ ├── 4. Deleting multiple topics.py │ ├── 5. SNS Subscriptions.txt │ ├── 6. Subscribing to topics.py │ ├── 7. 
Creating multiple subscriptions.py │ ├── 8. Deleting multiple subscriptions.py │ └── 9. Sending messages.txt └── 4. Pattern Rekognition │ ├── 1. Rekognizing patterns.txt │ ├── 10. Scooter Community Sentiment.py │ ├── 2. Cat detector.py │ ├── 3. Multiple cat detector.py │ ├── 4. Parking sign reader.py │ ├── 5. Comprehending text.txt │ ├── 6. Detecting language.py │ ├── 7. Translating Get It Done requests.py │ ├── 8. Getting request sentiment.py │ └── 9. Scooter dispatch.py ├── Introduction to Airflow in Python ├── I Intro to Airflow.py ├── II Implementing Airflow DAGs.py ├── III Maintaining and monitoring Airflow workflows.py └── IV Building production pipelines in Airflow.py ├── Introduction to Bash Scripting ├── I From Command-Line to Bash Script.py ├── II Variables in Bash Scripting.py ├── III Control Statements in Bash Scripting.py └── IV Functions and Automation.py ├── Introduction to Data Engineering ├── I Introduction to Data Engineering.py ├── II Data engineering toolbox.py ├── III Extract, Transform and Load (ETL).py ├── IV Case Study: DataCamp Ratings.py └── README.me ├── Introduction to MongoDB in Python ├── 1. Flexibly Structured Data.md ├── 2. Working with Distinct Values and Sets.md ├── 3. Get Only What You Need, and Fast.md └── 4. Aggregation Pipelines: Let the Server Do It For You.md ├── Introduction to PySpark ├── I Getting to know PySpark.py ├── II. Manipulating data.py ├── III. Getting started with machine learning pipelines.py └── IV. Model tuning and selection │ ├── I. Create the modeler.py │ ├── II. Create the Evaluator.py │ ├── III. Make a grid.py │ ├── IV. Make the validator.py │ ├── V. Fit the model(s).py │ ├── VI. Evaluating binary classifiers.py │ └── VII. Evaluate the model.py ├── Introduction to Relational Databases in SQL ├── 1. Your first database │ ├── 1. Attributes of relational databases.py │ ├── 2. Query information_schema with SELECT.py │ ├── 3. CREATE your first few TABLEs.py │ ├── 4. ADD a COLUMN with ALTER TABLE.py │ ├── 5. RENAME and DROP COLUMNs in affiliations.py │ ├── 6. Migrate data with INSERT INTO SELECT DISTINCT.py │ └── 7. Delete tables with DROP TABLE.py ├── 2. Enforce data consistency with attribute constraints │ ├── 0. query table data types INFORMATION_SCHEMA.txt │ ├── 1.Better data quality with constraints-changing datatypes CAST.txt │ ├── 10. .py │ ├── 10. What happens if you try to enter NULLs?.py │ ├── 2. Types of database constraints.py │ ├── 3. Conforming with data types.py │ ├── 4. Type CASTs.py │ ├── 5. Working with data types.txt │ ├── 6. Change types with ALTER COLUMN.py │ ├── 7. Convert types USING a function.py │ ├── 8. The not-null and unique constraints.txt │ └── 9. Disallow NULL values with SET NOT NULL.py ├── 3. Uniquely identify records with key constraints │ ├── 1. Get to know SELECT COUNT DISTINCT.py │ ├── 2. Identify keys with SELECT COUNT DISTINCT.py │ ├── 3. Identify the primary key.py │ ├── 4. ADD key CONSTRAINTs to the tables.py │ ├── 5. Add a SERIAL surrogate key.py │ ├── 6. CONCATenate columns to a surrogate key.py │ └── 7. Test your knowledge before advancing.py └── 4. Glue together tables with foreign keys │ ├── 1. REFERENCE a table with a FOREIGN KEY.py │ ├── 10. Change the referential integrity behavior of a key.py │ ├── 11. Count affiliations per university.py │ ├── 12. .py │ ├── 2. Explore foreign key constraints.py │ ├── 3. JOIN tables linked by a foreign key.py │ ├── 4. Model more complex relationships.txt │ ├── 5. Add foreign keys to the "affiliations" table.py │ ├── 6. 
Populate the "professor_id" column.py │ ├── 7. Drop "firstname" and "lastname".py │ ├── 8. Referential Integrity ON DELETE.txt │ └── 9. Referential integrity violations.py ├── Introduction to Scala ├── 1. A scalable language.md ├── 2. Workflows, Functions, Collections.md └── 3. Type Systems, Control Structures, Style.md ├── Introduction to Shell ├── I Manipulating files and directories.py ├── II Manipulating data.py ├── III Combining tools.py ├── IV Batch processing.py └── v Creating new tools.py ├── OOP-python-DataCamp ├── I OOP Fundamentals.py ├── II Inheritance and Polymorphism.py ├── III Integrating with Standard Python.py ├── IV Best Practices of Class Design.py └── README.md ├── Python-Data-Science-Toolbox-Part-2--case-study-main ├── README.md └── bank-data-code-case.py ├── README.md ├── Streamlined Data Ingestion with pandas ├── I Importing Data from Flat Files.py ├── II Importing Data From Excel Files.py ├── III Importing Data from Databases.py ├── IV Importing JSON Data and Working with APIs.py └── README.md ├── Unit Testing for Data Science in Python ├── .travis.yml ├── I Unit testing basics.py ├── II Intermediate unit testing.py ├── III Test Organization and Execution.py ├── IV Testing Models, Plots and Much More.py └── readme.md ├── Writing Efficient Python Code ├── I Foundations for efficiencies.py ├── II Timing and profiling code.py ├── III Gaining efficiencies.py └── IV Intro to pandas DataFrame iteration.py └── Writing Functions in Python ├── I Best Practices.py ├── II Context Managers.py ├── III Decorators.py └── IV More on Decorators.py /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | install: 5 | - pip install -e . 6 | - pip install codecov pytest-cov 7 | script: 8 | - pytest --cov=src tests 9 | after_success: 10 | - codecov 11 | -------------------------------------------------------------------------------- /Cleaning Data with PySpark/1. DataFrame details.md: -------------------------------------------------------------------------------- 1 | ## # Intro to data cleaning with Apache Spark 2 | 3 | ### # data cleaning 4 | `data cleaning` Preparing raw data for use in data processing pipelines. 5 | 6 | `data cleaning tasks` 7 | - Reformatting or replacing text 8 | - Performing calculations 9 | - Removing garbage or incomplete data 10 | - 11 | ### # why cleaning w/ SPARK ? 12 | - Problems with typical data systems: 13 | - Performance 14 | - Organizing data flow 15 | - ***Advantages of Spark:*** 16 | - Scalable 17 | - Powerful framework for data handling 18 | ### # data cleaning example 19 | ![image](https://user-images.githubusercontent.com/51888893/203651989-f3cce7f4-ce84-4b07-8c30-b02b684a7639.png) 20 | 21 | ### # Spark Schemas 22 | `spark schemas` : 23 | - Define format of DataFrame 24 | - various data types: 25 | - Strings, dates, integers, arrays 26 | - filter garbage data during import 27 | - Improves read performance 28 | - 29 | ### # Example Spark Schema 30 | - Import schema 31 | ```py 32 | import pyspark.sql.types 33 | peopleSchema = StructType ([ 34 | # Define the name field 35 | StructField ( 'name', StringType(), True), 36 | #Add the age field 37 | StructField ('age' , IntegerType(), True), 38 | # Add the city field 39 | StructField ('city', StringType(), True) 40 | ]) 41 | ``` 42 | - Read CSV file containing data 43 | ```py 44 | people_df = spark. 
read.format('csv').load('rawdata.csv', schema=peopleSchema) 45 | ``` 46 | ## Data cleaning review 47 | > Which of the following is NOT a benefit? 48 | 49 | Answer the question 50 | - [ ] Spark offers high performance. 51 | - [x] Spark can only handle thousands of records. 52 | - [ ] Spark allows orderly data flows. 53 | - [ ] Spark can use strictly defined schemas while ingesting data. 54 | ## Defining a schema 55 | - [x] Import * from the pyspark.sql.types library. 56 | - [x] Define a new schema using the StructType method. 57 | - [x] Define a StructField for name, age, and city. Each field should correspond to the correct datatype and not be nullable. 58 | ```py 59 | # Import the pyspark.sql.types library 60 | from pyspark.sql.types import * 61 | 62 | # Define a new schema using the StructType method 63 | people_schema = StructType([ 64 | # Define a StructField for each field 65 | StructField('name', StringType(), False), # False = not nullable 66 | StructField('age', IntegerType(), False), 67 | StructField('city', StringType(), False) 68 | ]) 69 | ``` 70 | ## Immutability review 71 | > You’ve just seen that immutability and lazy processing are fundamental concepts in the way Spark handles data. But why would Spark use immutable data frames to begin with? 72 | 73 | Answer the question 74 | - [ ] To add complexity to your Spark tasks. 75 | - [x] To efficiently handle data throughout the cluster. 76 | - [ ] To easily modify variable values as needed. 77 | - [ ] To conserve storage space. 78 | 79 | ## Using lazy processing 80 | - [x] Load the DataFrame. 81 | - [x] Add the transformation for F.lower() to the Destination Airport column. 82 | - [x] Drop the Destination Airport column from the DataFrame aa_dfw_df. Note the time for these operations to complete. 83 | - [x] Show the DataFrame, noting the time difference for this action to complete. 84 | ```py 85 | # Load the CSV file 86 | aa_dfw_df = spark.read.format('csv').options(Header=True).load('AA_DFW_2018.csv.gz') 87 | 88 | # Add the airport column using the F.lower() method 89 | aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport'])) 90 | 91 | # Drop the Destination Airport column 92 | aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport']) 93 | 94 | # Show the DataFrame 95 | aa_dfw_df.show() 96 | ``` 97 | ## # Understanding Parquet 98 | ### # Reading Parquet files 99 | ```py 100 | df = spark.read.format('parquet').load('filename.parquet') 101 | 102 | df = spark.read.parquet('filename.parquet') 103 | ``` 104 | ### # Writing Parquet files 105 | ```py 106 | df.write.format('parquet').save('filename.parquet') 107 | 108 | df.write.parquet('filename.parquet') 109 | ``` 110 | ### # Parquet and SQL 111 | ```py 112 | flight_df = spark.read.parquet('flights.parquet') 113 | 114 | flight_df.createOrReplaceTempView('flights') 115 | 116 | short_flights_df = spark.sql('SELECT * FROM flights WHERE flight_duration < 100') 117 | ``` 118 | ## Saving a DataFrame in Parquet format 119 | - [x] View the row count of df1 and df2. 120 | - [x] Combine df1 and df2 in a new DataFrame named df3 with the union method. 121 | - [x] Save df3 to a parquet file named AA_DFW_ALL.parquet. 122 | - [x] Read the AA_DFW_ALL.parquet file and show the count.
123 | ```py 124 | # View the row count of df1 and df2 125 | print("df1 Count: %d" % df1.count()) 126 | print("df2 Count: %d" % df2.count()) 127 | 128 | # Combine the DataFrames into one 129 | df3 = df1.union(df2) 130 | 131 | # Save the df3 DataFrame in Parquet format 132 | df3.write.parquet('AA_DFW_ALL.parquet', mode='overwrite') 133 | 134 | # Read the Parquet file into a new DataFrame and run a count 135 | print(spark.read.parquet('AA_DFW_ALL.parquet').count()) 136 | 137 | ''' result : 138 | df1 Count: 139359 139 | df2 Count: 119911 140 | 259270 141 | ''' 142 | ``` 143 | ## SQL and Parquet 144 | - [x] Import the AA_DFW_ALL.parquet file into flights_df. 145 | - [x] Use the createOrReplaceTempView method to alias the flights table. 146 | - [x] Run a Spark SQL query against the flights table. 147 | ```py 148 | # Read the Parquet file into flights_df 149 | flights_df = spark.read.parquet('AA_DFW_ALL.parquet') 150 | 151 | # Register the temp table 152 | flights_df.createOrReplaceTempView('flights') 153 | 154 | # Run a SQL query of the average flight duration 155 | avg_duration = spark.sql( 156 | 'SELECT avg(flight_duration) from flights').collect()[0] 157 | print('The average flight time is: %d' % avg_duration) 158 | 159 | # The average flight time is: 151 160 | ``` 161 | -------------------------------------------------------------------------------- /Data Processing in Shell/I Downloading Data on the Command Line.py: -------------------------------------------------------------------------------- 1 | """ 2 | How to download data files from web servers via the command line. In the process, we also learn about documentation manuals, option flags, and multi-file processing. 3 | """ 4 | #==================================================== 5 | # THEORY & EXAMPLES 6 | #=================================================== 7 | ## Downloading data using curl : 8 | """ ** CURL : short for Client for URLs, 9 | Unix CommandLine Tool 10 | Transfers Data To and From Servers 11 | Used to download data from HTTPS sites and FTP Servers""" 12 | ## Checking CURL Installation : 13 | # >>>>>> man curl 14 | """ if not installed follow link steps: https://curl.se/download.html""" 15 | 16 | 17 | 18 | ## Curl Syntax : 19 | # curl [option flags] [URL] 20 | ## Curl Supports : 21 | # HTTP, HTTPS, FTP, SFTP 22 | 23 | 24 | 25 | ## Downloading single file : 26 | """ # Download keeping the same name : >>>>>> curl -O 27 | # eg : curl -O https://websitename/datafilename.txt 28 | # Download and rename the file : >>>>>> curl -o renamedfile https://websitename/datafilename.txt """ 29 | ## Downloading multiple files w/ WILDCARDS : 30 | """ # Globbing Parser : >>>>>> curl -O https://websitename/datafilename*.txt 31 | # indexing : >>>>>> curl -O https://websitename/datafilename[001-100:10].txt """ 32 | 33 | 34 | 35 | 36 | ## Flags in case of timeouts : 37 | """ # Follow a redirected HTTP URL (300-level response) : >>>>>> -L 38 | # Resume a previous file transfer that timed out before completion : >>>>>> -C """ 39 | #--------------- 40 | ## Using curl documentation 41 | """--- Based on the information in the curl manual, which of the following is NOT a supported file protocol ?
:""" 42 | # OFTP 43 | 44 | 45 | 46 | ## Downloading single file using curl 47 | # Use curl to download the file from the redirected URL 48 | # curl -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 49 | # # Download and rename the file in the same step 50 | # curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 51 | 52 | ## Downloading multiple files using curl 53 | # Download all 100 data files 54 | # curl -L https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile[001-100].txt 55 | # Print all downloaded files to directory 56 | # ls datafile*.txt 57 | 58 | 59 | 60 | ## Downloading data using Wget : 61 | """ ** Wget : Derives from World Wide Web and Get 62 | compatible w/ all OS 63 | Used to download data from HTTPS and FTP 64 | $$$ BETTER THAN USING CURL TO DOWNLOAD MULTIPLE FILES $$$ """ 65 | ## Check Installation : 66 | # shows wget's location : >>>>>> which wget 67 | """ if not installed follow link steps : https://www.gnu.org/software/wget/ """ 68 | # check manual when installed : >>>>>> man wget 69 | 70 | 71 | ## Basic Syntax Wget : >>>>>> wget [option flags] [URL] 72 | 73 | 74 | """ ** Option Flags wget : 75 | -b : go to background after startup 76 | -q : turn off wget output 77 | -c : resume broken download 78 | -------- 79 | eg. : wget -bqc [URL] 80 | ------ 81 | ** PID : unique ID for the data download process, in case of cancellation """ 82 | #==================================================== 83 | # EXERCISES 84 | #=================================================== 85 | ## Installing Wget 86 | """---Which of the following is NOT a way to install wget?""" 87 | # On MacOS, install using pip 88 | 89 | 90 | 91 | ## Downloading single file using wget 92 | # Fill in the two option flags 93 | """option flags: -c for resuming a partial download / -b for letting the download occur in the background.""" 94 | # wget -c -b https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 95 | # Verify that the Spotify file has been downloaded 96 | # ls 97 | # Preview the log file 98 | # cat wget-log 99 | 100 | 101 | 102 | ## THEO========== 103 | ## Advanced downloading using Wget 104 | ## ------------------------------------ 105 | """- save a list of file locations in a txt file""" 106 | """- Download from the URL locations listed in that file : 107 | >>>>>> -i """ 108 | ## Limit rate for larger files (--limit-rate) 109 | ## ------------------------------------ 110 | """ >>>>>> --limit-rate 111 | eg: wget --limit-rate={rate}k {file_location} 112 | eg2: wget --limit-rate=200k -i url_list.txt """ 113 | ## Wait between downloads of smaller files (--wait) 114 | ##------------------------------------- 115 | """ >>>>>> --wait 116 | eg: wget --wait={seconds} {file_location} 117 | eg2: wget --wait=2.5 -i url_list.txt """ 118 | 119 | 120 | 121 | ## EXERCISES============ 122 | 123 | ## Setting constraints for multiple file downloads 124 | """---What is the best way to set constraints for multiple file downloads using wget? """ 125 | # Store all URL locations in a text file (e.g. url_list.txt) and iteratively download using wget and option flag -i 126 | # because -i iterates through the listed files but does not by itself set any constraints for the downloads.
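"""
A minimal Python sketch of the same workflow (an added illustration, not part of the course
material): it simply shells out to wget via subprocess, so it assumes wget is installed and
that url_list.txt exists in the working directory. Kept commented out like the other
commands in this file.
"""
# import subprocess
# # Download every URL listed in url_list.txt, capped at 200 KB/s, resuming partial files (-c)
# subprocess.run(['wget', '-c', '--limit-rate=200k', '-i', 'url_list.txt'], check=True)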
127 | 128 | 129 | 130 | ## Creating wait time using Wget 131 | """ View url_list.txt to verify content """ 132 | # cat url_list.txt 133 | """ Create a mandatory 1 second pause between downloading all files in url_list.txt """ 134 | # wget --wait=1 -i url_list.txt 135 | """ Take a look at all files downloaded """ 136 | # ls 137 | 138 | 139 | 140 | ## Data downloading with Wget and curl 141 | """ Use curl, download and rename a single file from URL""" 142 | # curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 143 | """ Unzip, delete, then re-name to Spotify201812.csv """ 144 | # unzip Spotify201812.zip && rm Spotify201812.zip 145 | # mv 201812SpotifyData.csv Spotify201812.csv 146 | """ View url_list.txt to verify content """ 147 | # cat url_list.txt 148 | """ Use Wget, limit the download rate to 2500KB/s, download all files in url_list.txt """ 149 | # wget --limit-rate=2500k -i url_list.txt 150 | """ Take a look at all files downloaded """ 151 | # ls 152 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/1. OLTP and OLAP.txt: -------------------------------------------------------------------------------- 1 | | How should we organize data? | 2 | 3 | . Schemas : how data be logically organized 4 | . Normalization : may data have minimal or redundant dependency? 5 | . Views : what joins more often? 6 | . Access Control : who and which levels of accesses 7 | . DBMS : pick SQL or noSQL 8 | 9 | # Depends on the intended use of data 10 | 11 | | approaches to dataprocessing | 12 | 13 | -> OLTP : Online Transaction Processing 14 | - has data from past hour 15 | - data inserted and updated more often 16 | - Uses Operational database 17 | -> OLAP : Online Analytical Processing 18 | - helps business decision making 19 | - Queries larges amounts of data 20 | - Uses Data-warehouse 21 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/2. Which is better?.py: -------------------------------------------------------------------------------- 1 | """ 2 | Which is better? 3 | The city of Chicago receives many 311 service requests throughout the day. 311 service requests are non-urgent community requests, ranging from graffiti removal to street light outages. Chicago maintains a data repository of all these services organized by type of requests. In this exercise, Potholes has been loaded as an example of a table in this repository. It contains pothole reports made by Chicago residents from the past week. 4 | 5 | Explore the dataset. What data processing approach is this larger repository most likely using? 6 | 7 | Instructions 8 | 50 XP 9 | Possible Answers 10 | 11 | OLTP because this table could not be used for any analysis. 12 | OLAP because each record has a unique service request number. 13 | OK OLTP because this table's structure appears to require frequent updates. 
14 | OLAP because this table focuses on pothole requests only 15 | 16 | """ 17 | SELECT * 18 | FROM Potholes; 19 | 20 | ''' 21 | creation_date current_status completion_date service_request_id type_of_service most_recent_action street_address zip 22 | 2018-12-17T00:00:00.000 Open null 18-03380123 Pothole in Street null 10300 S WALLACE ST 60628.0 23 | 2018-12-18T00:00:00.000 Open null 18-03388180 Pothole in Street null 24 | ''' 25 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/3. Name that data type.py: -------------------------------------------------------------------------------- 1 | """ 2 | Name that data type! 3 | In the previous video, you learned about structured, semi-structured, and unstructured data. Structured data is the easiest to analyze because it is organized and cleaned. On the other hand, unstructured data is schemaless, but scales well. In the middle we have semi-structured data for everything in between. 4 | 5 | Instructions 6 | 100XP 7 | Each of these cards holds a type of data. Place them in the correct category. 8 | 9 | """ 10 | ''' 11 | -> Unstructured 12 | - Zip file of all text messages ever received 13 | - Images in your photo library 14 | - To-do notes in a text editor 15 | 16 | -> Semi-structured: 17 | - JSON object of tweets outputted in real-time by the Twitter API 18 | - CSVs of open data downloaded from your local government websites 19 | - HTML syntax 20 | 21 | -> Structured: 22 | - A relational database with latest withdrawals 23 | and deposits made by clients 24 | 25 | ''' 26 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/4. Ordering ETL Tasks.py: -------------------------------------------------------------------------------- 1 | """ 2 | Ordering ETL Tasks 3 | You have been hired to manage data at a small online clothing store. Their system is quite outdated because their only data repository is a traditional database to record transactions. 4 | 5 | You decide to upgrade their system to a data warehouse after hearing that different departments would like to run their own business analytics. You reason that an ELT approach is unnecessary because there is relatively little data (< 50 GB). 6 | 7 | Instructions 8 | 100XP 9 | - In the ETL flow you design, different steps will take place. Place the steps in the most appropriate order. 10 | 11 | """ 12 | ''' 13 | 1/ eCommerce API outputs real time data of transactions 14 | 2/ Python script drops null rows and cleans data into pre-determined columns 15 | 3/ Resulting dataframe is written into an AWS Redshift Warehouse 16 | ''' 17 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/5. Recommend a storage solution.py: -------------------------------------------------------------------------------- 1 | """ 2 | Recommend a storage solution 3 | ----When should you choose a data warehouse over a data lake? 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | - To train a machine learning model with 150 GB of raw image data. # data lake because the data is large and unstructured. 10 | - To store real-time social media posts that may be used for future analysis # Data warehouses, because it will be used for analysis.
11 | - To store customer data that needs to be updated regularly # Data warehouses -> read-only 12 | OK - To create accessible and isolated data repositories for other analysts # Analysts appreciate working in a data warehouse more because of its ..> 13 | organization of structured data that makes analysis easier. 14 | 15 | """ 16 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/6. Database design.txt: -------------------------------------------------------------------------------- 1 | | database design | 2 | 3 | -> Database design : . determines how data is logically stored 4 | . how it would be read and updated 5 | 6 | 7 | uses 8 | -> Database model : high level specifications for db structure 9 | - most popular : relational db 10 | - other : noSQL model, object-oriented model, network model 11 | 12 | uses 13 | -> Database schemas : blueprint of the db 14 | . defines tables, fields, indexes, views 15 | . when inserting data in relational db, schemas must be respected 16 | 17 | | levels of data models | 18 | 19 | - conceptual data model : describes entities, relationships and attributes 20 | / tools / : data structure diagrams, 21 | / eg / : entity-relationship diagrams, UML diagrams 22 | 23 | - logical data model : defines tables, columns, relationships 24 | / tools / : db models and schemas 25 | / eg / : relational model and star schema 26 | 27 | - physical data model : describes physical storage 28 | / tools / : partitions, CPUs, indexes, backup sys, tablespaces 29 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/7. Classifying data models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Classifying data models 3 | In the previous video, we learned about three different levels of data models: conceptual, logical, and physical. 4 | 5 | Instructions 6 | 100XP 7 | - Each of these cards holds a tool or concept that fits into a certain type of data model. Place the cards in the correct category. 8 | 9 | """ 10 | ''' 11 | @ Conceptual Data Model 12 | . Gathers business requirements 13 | . Entities, attributes, and relationships 14 | 15 | @ Logical Data Model 16 | . Relational model 17 | . Determining tables and columns 18 | 19 | @ Physical Data Model 20 | . File structure of data storage 21 | ''' 22 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/8. Deciding fact and dimension tables.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deciding fact and dimension tables 3 | Imagine that you love running and data. It's only natural that you begin collecting data on your weekly running routine. 4 | You're most concerned with tracking how long you are running each week. You also record the route and the distances of your runs. 5 | You gather this data and put it into one table called Runs with the following schema: 6 | 7 | runs 8 | duration_mins - float 9 | week - int 10 | month - varchar(160) 11 | year - int 12 | park_name - varchar(160) 13 | city_name - varchar(160) 14 | distance_km - float 15 | route_name - varchar(160) 16 | . . . . .. .. . . . 17 | Instructions 1/2 18 | - Question 19 | ----Out of these possible answers, what would be the best way to organize the fact table and dimensional tables?
20 | 21 | Possible Answers 22 | 23 | OK A fact table holding duration_mins and foreign keys to dimension tables holding route details and week details, respectively. 24 | A fact table holding week,month, year and foreign keys to dimension tables holding route details and duration details, respectively. 25 | A fact table holding route_name,park_name, distance_km,city_name, and foreign keys to dimension tables holding week details and duration details, respectively. 26 | 27 | """ 28 | """ 29 | Instructions 2/2 30 | - Create a dimension table called route that will hold the route information. 31 | - Create a dimension table called week that will hold the week information. 32 | 33 | """ 34 | #-- Create a route dimension table 35 | CREATE TABLE route( 36 | route_id INTEGER PRIMARY KEY, 37 | park_name VARCHAR(160) NOT NULL, 38 | city_name VARCHAR(160) NOT NULL, 39 | distance_km float NOT NULL, 40 | route_name VARCHAR(160) NOT NULL 41 | ); 42 | #-- Create a week dimension table 43 | CREATE TABLE week( 44 | week_id INTEGER PRIMARY KEY, 45 | week integer NOT NULL, 46 | month VARCHAR(160) NOT NULL, 47 | year integer NOT NULL 48 | ); 49 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/9. Querying the dimensional model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Querying the dimensional model 3 | Here it is! The schema reorganized using the dimensional model: 4 | 5 | Let's try to run a query based on this schema. How about we try to find the number of minutes we ran in July, 2019? We'll break this up in two steps. First, we'll get the total number of minutes recorded in the database. Second, we'll narrow down that query to week_id's from July, 2019. 6 | 7 | Instructions 2/2 8 | 9 | 1- Calculate the sum of the duration_mins column. 10 | 11 | 2- Join week_dim and runs_fact. 12 | 2- Get all the week_id's from July, 2019. 13 | 14 | """ 15 | #1 16 | #-- Select the sum of the duration of all runs 17 | SELECT SUM(duration_mins) 18 | FROM runs_fact; 19 | 20 | #2 21 | #-- Get all the week_id's that are from July, 2019 22 | INNER JOIN week_dim ON runs_fact.week_id = week_dim.week_id 23 | WHERE month = 'July' and year = '2019'; 24 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/I. Intro to AWS and Boto3.py: -------------------------------------------------------------------------------- 1 | """ 2 | ||||| 3 | From learning how AWS works to creating S3 buckets and uploading files to them. You will master the basics of setting up AWS and uploading files to the cloud! 4 | ||||| 5 | 6 | | INITIALIZE BOTO3 CLIENT | 7 | 8 | -> Boto 3 : 9 | import boto3 10 | 11 | s3 = boto3.client('s3', 12 | region_name='us-east-1', 13 | aws_access_key_id = AWS_KEY_ID, 14 | aws_secret_access_key = AWS_SECRET) 15 | 16 | response = s3.list_buckets() 17 | --------> 18 | # in AWS Console creation of USER 19 | - Create KEYs w/ IAM 20 | - add user 21 | - programatic user 22 | - select policies: 23 | . s3 full access 24 | . amazon sns full access 25 | . amazon rekognition full access 26 | . 
comprehend full access 27 | - Grab KEY and SECRET and store smw safe 28 | 29 | | AWS services | 30 | 31 | IAM : manage permissions 32 | S3 : store files in cloud 33 | SNS : emails / texts based on conditions and events in pipelines 34 | COMPREHEND : sentiment analysis on blocks of text 35 | REKOGNITION : extract text from images an dlooks 4 cats in pictures 36 | 37 | """ 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/II. Your first boto3 client.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Your first boto3 client / 4 | 5 | Sam wants to cast off the shackles of only being able to use her computer for storage and compute. She is learning how to use the awesome power of the cloud 6 | to create data pipelines and automatically generate reports. 7 | 8 | Before she can do all that, she needs to create her first boto3 client and check out what buckets already exist in S3. 9 | 10 | Her AWS key and AWS secret key have been stored in AWS_KEY_ID and AWS_SECRET respectively. 11 | In this exercise, you will help Sam by creating your first boto3 client to AWS! 12 | 13 | Instructions 14 | 100 XP 15 | - Generate a boto3 client for interacting with s3. 16 | - Specify 'us-east-1' for the region_name. 17 | - Use AWS_KEY_ID and AWS_SECRET to set up the credentials. 18 | - Print the buckets. 19 | 20 | """ 21 | # Generate the boto3 client for interacting with S3 22 | s3 = boto3.client('s3', region_name='us-east-1', 23 | # Set up AWS credentials 24 | aws_access_key_id=AWS_KEY_ID, 25 | aws_secret_access_key=AWS_SECRET) 26 | # List the buckets 27 | buckets = s3.list_buckets() 28 | 29 | # Print the buckets 30 | print(buckets) 31 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/III. Multiple clients Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. You will help Sam initialize a boto3 client for S3, and another client for SNS. She will use the S3 client to list the buckets in S3. She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). Instructions 100 XP Generate the boto3 clients for interacting with S3 and SNS. Specify 'us-east-1' for the region_name for both clients. Use AWS_KEY_ID and AWS_SECRET to set up the credentials. List and print the SNS topics..py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Multiple clients / 4 | 5 | Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. 6 | 7 | When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. 8 | 9 | Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. 10 | 11 | '''You will help Sam initialize a boto3 client for S3, and another client for SNS.''' 12 | 13 | She will use the S3 client to list the buckets in S3. 
She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). 14 | 15 | Instructions 16 | 100 XP 17 | - Generate the boto3 clients for interacting with S3 and SNS. 18 | - Specify 'us-east-1' for the region_name for both clients. 19 | - Use AWS_KEY_ID and AWS_SECRET to set up the credentials. 20 | - List and print the SNS topics. 21 | 22 | """ 23 | # Generate the boto3 client for interacting with S3 and SNS 24 | s3 = boto3.client('s3', region_name='us-east-1', 25 | aws_access_key_id=AWS_KEY_ID, 26 | aws_secret_access_key=AWS_SECRET) 27 | 28 | sns = boto3.client('sns', region_name='us-east-1', 29 | aws_access_key_id=AWS_KEY_ID, 30 | aws_secret_access_key=AWS_SECRET) 31 | 32 | # List S3 buckets and SNS topics 33 | buckets = s3.list_buckets() 34 | topics = sns.list_topics() 35 | 36 | # Print out the list of SNS topics 37 | print(topics) 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/IV. Removing repetitive work.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Removing repetitive work / 4 | 5 | Automation is at the heart of data engineering. Sam wants to eliminate boring and repetitive tasks from her and her coworkers plate. 6 | She wants to let machines do what they do best, while empowering humans to do what they do best. 7 | 8 | Sam has a coworker - John - whose job is to build the department budget. He also got stuck with the job of looking at images coming from GetItDone user reports. 9 | If he sees a cat in the image, he has to send a text message to the Animal services manager with the image and its location. 10 | 11 | Sam is thinking of building a system that would take this repetitive work off John's plate. 12 | 13 | Such a system will: 14 | 15 | Get images taken by city trash trucks and store them; 16 | If the image contains a cat, the system will send an SMS to the Animal Services manager. 17 | What AWS services would Sam need to implement this system? 18 | --- 19 | 20 | IAM 21 | S3 22 | SNS 23 | Rekognition 24 | ANSW: All of the Above 25 | 26 | """ 27 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/V. Diving into buckets.txt: -------------------------------------------------------------------------------- 1 | | BUCKETS | 2 | 3 | . destok folders 4 | . own permission policies 5 | . website storage 6 | . generate logs ab activity 7 | 8 | | S3 components-Objects | 9 | 10 | contain ANYTTHING : music, csv, image, video, log file 11 | 12 | | What can we do w/ buckets ? 
| 12 | 13 | create, list, delete buckets 14 | 15 | | CREATE A BUCKET | 16 | 17 | # create boto3 client 18 | import boto3 19 | 20 | s3 = boto3.client('s3', 21 | region_name='us-east-1', 22 | aws_access_key_id = AWS_KEY_ID, 23 | aws_secret_access_key = AWS_SECRET) 24 | 25 | # create a bucket and give it a name 26 | # names must be unique across S3 27 | bucket = s3.create_bucket(Bucket='grid-requests') 28 | 29 | | BUCKET ~ METHODS | 30 | 31 | > create bucket 32 | bucket = s3.create_bucket(Bucket='grid-requests') 33 | 34 | > list buckets 35 | bucket_response = s3.list_buckets() 36 | 37 | # get buckets dictionary 38 | buckets = bucket_response['Buckets'] 39 | print(buckets) 40 | 41 | > delete buckets 42 | bucket_response = s3.delete_bucket(Bucket='grid-requests') 43 | 44 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VI. creating a bucket.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Creating a bucket 4 | Sam is dipping her toes in the water, getting ready to build her first pipeline. 5 | 6 | Get It Done is the app the City released for residents to report problems. There are lots of problems to report, and lots of data gets generated. 7 | 8 | Get It Done services poster 9 | 10 | She will be picking up daily reports generated by Get It Done and placing them in the 'gim-staging' bucket. Then, she will clean the data and place the new dataset in the 'gim-processed' bucket. 11 | 12 | She also wants to create a 'gim-test' bucket to experiment with. 13 | 14 | Help Sam take the first step to her pipeline dreams. Help her create her first bucket, 'gim-staging'! 15 | 16 | Instructions 17 | 100 XP 18 | - Create a boto3 client to S3. 19 | - Create 'gim-staging', 'gim-processed' and 'gim-test' buckets. 20 | - Print the response from creating the 'gim-staging' bucket. 21 | 22 | """ 23 | import boto3 24 | 25 | # Create boto3 client to S3 26 | s3 = boto3.client('s3', region_name='us-east-1', 27 | aws_access_key_id=AWS_KEY_ID, 28 | aws_secret_access_key=AWS_SECRET) 29 | 30 | # Create the buckets 31 | response_staging = s3.create_bucket(Bucket='gim-staging') 32 | response_processed = s3.create_bucket(Bucket='gim-processed') 33 | response_test = s3.create_bucket(Bucket='gim-test') 34 | 35 | # Print out the response 36 | print(response_staging) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VII. Listing buckets.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Listing buckets 4 | Sam has successfully created the buckets for her pipeline. Often, data engineers build checks into the pipeline to make sure their previous operation succeeded. Sam wants to build in a check to make sure her buckets actually got created. 5 | 6 | She also wants to practice listing buckets. Listing buckets will let her perform operations on multiple buckets using a for loop. 7 | 8 | She has already created the boto3 client for S3, and assigned it to the s3 variable. 9 | 10 | Help Sam get a list of all the buckets in her S3 account and print their names! 11 | 12 | Instructions 13 | 100 XP 14 | - Get the buckets from S3. 15 | - Iterate over the bucket key from response to access the list of buckets. 16 | - Print the name of each bucket.
17 | 18 | """ 19 | # Get the list_buckets response 20 | response = s3.list_buckets() 21 | 22 | # Iterate over Buckets from .list_buckets() response 23 | for bucket in response['Buckets']: 24 | 25 | # Print the Name for each bucket 26 | print(bucket['Name']) 27 | 28 | ''' out: 29 | gim-staging 30 | gim-processed 31 | gim-test 32 | ''' 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VIII. Deleting a bucket.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Deleting a bucket 4 | Sam is feeling more and more confident in her AWS and S3 skills. After playing around for a bit, she decides that the gim-test bucket no longer fits her pipeline and wants to delete it. It's starting to feel like dead weight, and Sam doesn't want it littering her beautiful bucket list. 5 | 6 | She has already created the boto3 client for S3, and assigned it to the s3 variable. 7 | 8 | Help Sam do some clean up, and delete the gim-test bucket. 9 | 10 | Instructions 11 | 100 XP 12 | - Delete the 'gim-test' bucket. 13 | - Get the list of buckets from S3. 14 | - Print each 'Buckets' 'Name'. 15 | 16 | """ 17 | 18 | # Delete the gim-test bucket 19 | s3.delete_bucket(Bucket='gim-test') 20 | 21 | # Get the list_buckets response 22 | response = s3.list_buckets() 23 | 24 | # Print each Buckets Name 25 | for bucket in response['Buckets']: 26 | print(bucket['Name']) 27 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VIV. Deleting multiple buckets.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Deleting multiple buckets 4 | The Get It Done app used to be called Get It Made. Sam always thought it was a terrible name, but it got stuck in her head nonetheless. 5 | 6 | When she was making the pipeline buckets, she used the gim- abbreviation for the old name. She decides to switch her abbreviation to gid- to accurately reflect the app's real (and better) name. 7 | 8 | She has already set up the boto3 S3 client and assigned it to the s3 variable. 9 | 10 | Help Sam delete all the buckets in her account that start with the gim- prefix. Then, help her make a 'gid-staging' and a 'gid-processed' bucket. 11 | 12 | Instructions 13 | 100 XP 14 | - Get the buckets from S3. 15 | - Delete the buckets that contain 'gim' and create the 'gid-staging' and 'gid-processed' buckets. 16 | - Print the new bucket names. 17 | 18 | """ 19 | # Get the list_buckets response 20 | response = s3.list_buckets() 21 | 22 | # Delete all the buckets with 'gim', create replacements. 23 | for bucket in response['Buckets']: 24 | if 'gim' in bucket['Name']: 25 | s3.delete_bucket(Bucket=bucket['Name']) 26 | 27 | s3.create_bucket(Bucket='gid-staging') 28 | s3.create_bucket(Bucket='gid-processed') 29 | 30 | # Print bucket listing after deletion 31 | response = s3.list_buckets() 32 | for bucket in response['Buckets']: 33 | print(bucket['Name']) 34 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/X. 
Uploading and retrieving files.txt: -------------------------------------------------------------------------------- 1 | | differences Bucket/Object | 2 | 3 | Bucket: 4 | - name 5 | - name is STRING 6 | - unique name in all S3 7 | - contains many objects 8 | 9 | Object: 10 | - has a KEY 11 | - Name is full path from bucket root 12 | - Unique KEY in bucket 13 | - can only be in one parent bucket 14 | 15 | 1 > create boto3 client 16 | 2 > upload files to a bucket: 17 | s3.upload_file( 18 | Filename='grid_request_2020_15_11.csv', 19 | Bucket='grid-requests', 20 | Key='grid_request_2020_15_11.csv') # what we name obj. in S3 21 | 3 > List objects in a bucket 22 | response = s3.list_objects( 23 | Bucket='grid-requests', 24 | MaxKeys=2, # limit response 25 | Prefix='grid_requests_2019') 26 | 27 | # Get object metadata and print it 28 | response = s3.head_object(Bucket='gid-staging', 29 | Key='2019/final_report_01_01.csv') 30 | 31 | # Print the size of the uploaded object 32 | print(response['ContentLength']) 33 | 34 | 35 | 4 > delete object in bucket: 36 | s3.delete_object( 37 | Bucket='grid-requests', 38 | Key='grid_requests_2018') 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/XI. Spring cleaning.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Spring cleaning 4 | Sam's pipeline has been running for a long time now. Since the beginning of 2018, her automated system has been diligently uploading her report to the gid-staging bucket. 5 | 6 | In City governments, record retention is a huge issue, and many government officials prefer not to keep records in existence past the mandated retention dates. 7 | 8 | People purging records 9 | 10 | As time has passed, the City Council asked Sam to clean out old CSV files from previous years that have passed the retention period. 2018 is safe to delete. 11 | 12 | Sam has initialized the client and assigned it to the s3 variable. Help her clean out all records for 2018 from S3! 13 | 14 | Instructions 15 | 100 XP 16 | - List only objects that start with '2018/final_' in 'gid-staging' bucket. 17 | - Iterate over the objects, deleting each one. 18 | - Print the keys of remaining objects in the bucket. 19 | 20 | """ 21 | # List only objects that start with '2018/final_' 22 | response = s3.list_objects(Bucket='gid-staging', 23 | Prefix='2018/final_') # list ones w/ this Prefix 24 | 25 | # Iterate over the objects 26 | if 'Contents' in response: 27 | for obj in response['Contents']: # delete obj in response['Contents'] 28 | # Delete the object 29 | s3.delete_object(Bucket='gid-staging', Key=obj['Key']) 30 | 31 | # Print the keys of remaining objects in the bucket 32 | response = s3.list_objects(Bucket='gid-staging') 33 | 34 | for obj in response['Contents']: 35 | print(obj['Key']) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/I. Keeping objects secure.txt: -------------------------------------------------------------------------------- 1 | You will learn how to set files to be public or private, and cap off what you learned by generating web-based reports!
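A quick added sketch (an illustration, not part of the course notes; it assumes the boto3 S3 client from Chapter 1 is stored in s3 and reuses the 'gid-staging' bucket and 'final_report.csv' key from later exercises) contrasting the two sharing styles covered in this chapter:

# Public: attach a permanent public-read ACL; anyone with the URL can read the object
s3.put_object_acl(Bucket='gid-staging', Key='final_report.csv', ACL='public-read')
public_url = 'https://gid-staging.s3.amazonaws.com/final_report.csv'

# Private: keep the object private and grant temporary access with a presigned URL (1 hour)
share_url = s3.generate_presigned_url(
    ClientMethod='get_object',
    ExpiresIn=3600,
    Params={'Bucket': 'gid-staging', 'Key': 'final_report.csv'})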
2 | 3 | | AWS PERMISSION SYS | 4 | 5 | # Multi-user environments 6 | -> IAM : attaching policies to users 7 | -> Bucket Policies: gives control on bucket and Object within it 8 | # 9 | -> ACL : permission on specific obj within a bucket 10 | (entity attached) 11 | 12 | - public 13 | 14 | > put ACL to uploaded obj: 15 | s3.put_object_acl(Bucket='', Key='puthole.csv', 16 | ACL='public-read') 17 | > ACL but w/ ExtraArgs: 18 | s3.upload_file(Bucket='x',Filename='asd.csv',Key='asd.csv', 19 | ExtraArgs={'ACL':'public-read'}) 20 | > accessing public objects: 21 | # https://{bucket}.{key} 22 | https://grid-requests.2019/potholes.csv 23 | > generate public obj. URL 24 | url='https://{}.{}'.format( 25 | "grid-requests", 26 | "2019/putholes.csv") 27 | > read URL into pandas 28 | df = pd.read_csv(url) 29 | - private 30 | 31 | -> Presigned URL: temporary access to obj 32 | 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/II. Making an object public.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Making an object public 4 | Sam is trying to decide how to upload a public CSV and share it with the world as Open Data. 5 | 6 | She only wants to make that specific CSV public, and not the bucket. 7 | 8 | She also doesn't want to make any of the other objects in the bucket public. 9 | 10 | Help her decide the best way for her to do this. 11 | 12 | Answer the question 13 | 50XP 14 | 15 | A presigned URL. # access to a private object temporarily 16 | OK A public-read ACL. 17 | A bucket policy. # not multi user environment. 18 | An IAM Policy. # not multi user environment. 19 | 20 | """ 21 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/III. Uploading a public report.py: -------------------------------------------------------------------------------- 1 | """ 2 | Uploading a public report 3 | As you saw in Chapter 1, Get It Done is an App that lets residents report problems like potholes and broken sidewalks. 4 | 5 | 6 | 7 | The data from the app is a very hot topic political issue. Residents keep saying that the City does not distribute work evenly across neighborhoods when issues are reported. The City Council wants to be transparent with the public and has asked Sam to publish the aggregated Get It Done reports and make them publicly available. 8 | 9 | Sam has initialized the boto3 S3 client and assigned it to the s3 variable. 10 | 11 | In this exercise, you will help her increase government transparency by uploading public reports to the gid-staging bucket. 12 | 13 | Instructions 14 | 100 XP 15 | - Upload 'final_report.csv' to the 'gid-staging' bucket. 16 | - Set the key to '2019/final_report_2019_02_20.csv'. 17 | - Set the ACL to 'public-read'. 18 | 19 | """ 20 | # Upload the final_report.csv to gid-staging bucket 21 | s3.upload_file( 22 | # Complete the filename 23 | Filename='./final_report.csv', 24 | # Set the key and bucket 25 | Key='2019/final_report_2019_02_20.csv', 26 | Bucket='gid-staging', 27 | # During upload, set ACL to public-read 28 | ExtraArgs = { 29 | 'ACL': 'public-read'}) 30 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/IV. 
Making multiple files public.py: -------------------------------------------------------------------------------- 1 | """ 2 | Making multiple files public 3 | Transparency is important to City Council. They want to empower residents to analyze Get It Done requests and how they get prioritized. 4 | 5 | They asked Sam to make all previous Get It Done aggregated reports since the beginning of 2019 public as well. 6 | Sam has initialized the boto3 S3 client and assigned it to the s3 variable. 7 | 8 | In this exercise, you will help Sam open up the data by setting the ACL of every object in the gid-staging bucket to public-read, opening up the objects to the world! 9 | 10 | Instructions 11 | 100 XP 12 | - List the objects in 'gid-staging' bucket starting with '2019/final_'. 13 | - For each file in the response, give it an ACL of 'public-read'. 14 | - Print the Public Object URL of each object. 15 | 16 | """ 17 | # List only objects that start with '2019/final_' 18 | response = s3.list_objects( 19 | Bucket='gid-staging', Prefix='2019/final_') 20 | 21 | # Iterate over the objects 22 | for obj in response['Contents']: 23 | 24 | # Give each object ACL of public-read 25 | s3.put_object_acl(Bucket='gid-staging', 26 | Key=obj['Key'], 27 | ACL='public-read') 28 | 29 | # Print the Public Object URL for each object 30 | print("https://{}.s3.amazonaws.com/{}".format( 'gid-staging', obj['Key'])) 31 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/V. Accessing private objects in S3.txt: -------------------------------------------------------------------------------- 1 | > accessing private files 2 | obj = s3.get_object(Bucket='x', Key='2019/asdads.csv') 3 | print(obj) 4 | > Read 'StreamingBody' into Pandas 5 | pd.read_csv(obj['Body']) 6 | 7 | > Generate pre-signed URL 8 | share_url = s3.generate_presigned_url( 9 | ClientMethod='get_object', 10 | ExpiresIn=3600, 11 | Params={'Bucket':'grid-requests','Key':'potholes.csv'} 12 | > open into Pandas 13 | pd.read_csv(share_url) 14 | 15 | > Load multiple files into 1 df 16 | df_list=[] 17 | 18 | # request the list of csv's from S3 w/ Prefix 19 | response = s3.lis_objects(Bucket='x', Prefix='2019/') 20 | 21 | # get response contents 22 | request_files= response["Contents"] 23 | 24 | # iterate over each obj 25 | for file in request_files: 26 | obj = s3.get_object(Bucket='x', Key=file['Key']) 27 | 28 | # Read it as df 29 | obj_df = pd.read_csv(obj['Body']) 30 | 31 | # Append df to a list 32 | df_list.append(obj_df) 33 | 34 | # Concatenate all df in a list 35 | df = pd.concat(df_list) 36 | 37 | # preview df 38 | df.head() 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VI. Generating a presigned URL.py: -------------------------------------------------------------------------------- 1 | """ 2 | Generating a presigned URL 3 | Sam got a special request from City Council to analyze whether the City is prioritizing requests in District 11, while de-prioritizing requests in the less affluent district 12. They asked her to keep this report confidential, as they would like to see it before it goes public to the media. 4 | 5 | Sam has generated the report and is ready to share it with the City council, but making it public makes her too paranoid. She decided to provide the Council with a presigned URL so they can temporarily access the report for 1 hour. 
6 | 7 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. 8 | 9 | Help her generate a presigned URL valid for 1 hour to 'final_report.csv' in the 'gid-staging' bucket. Then, print it out for the City Council! 10 | 11 | Instructions 12 | 100 XP 13 | - Generate a presigned URL for final_report.csv that lasts 1 hour and allows the user to get the object. 14 | - Print out the generated presigned URL. 15 | 16 | """ 17 | # Generate presigned_url for the uploaded object 18 | share_url = s3.generate_presigned_url( 19 | # Specify allowable operations 20 | ClientMethod='get_object', 21 | # Set the expiration time 22 | ExpiresIn=3600, 23 | # Set bucket and shareable object's name 24 | Params={'Bucket': 'gid-staging','Key': 'final_report.csv'} 25 | ) 26 | 27 | # Print out the presigned URL 28 | print(share_url) 29 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VII. Opening a private file.py: -------------------------------------------------------------------------------- 1 | """ 2 | Opening a private file 3 | The City Council wants to see the bigger trend and have asked Sam to total up all the requests since the beginning of 2019. In order to do this, Sam has to read daily CSVs from the 'gid-requests' bucket and concatenate them. However, the gid-requests files are private. She has access to them via her key, but the world cannot access them. 4 | 5 | In this exercise, you will help Sam see the bigger picture by reading these private files into pandas and concatenating them into one DataFrame! 6 | 7 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. She has listed all the objects in gid-requests in the response variable. 8 | 9 | Instructions 10 | 100 XP 11 | - For each file in response, load the object from S3. 12 | - Load the object's StreamingBody into pandas, and append to df_list. 13 | - Concatenate all the DataFrames with pandas. 14 | - Preview the resulting DataFrame 15 | 16 | """ 17 | df_list = [ ] 18 | 19 | for file in response['Contents']: 20 | # For each file in response load the object from S3 21 | obj = s3.get_object(Bucket='gid-requests', Key=file['Key']) 22 | # Load the object's StreamingBody with pandas 23 | obj_df = pd.read_csv(obj['Body']) 24 | # Append the resulting DataFrame to list 25 | df_list.append(obj_df) 26 | 27 | # Concat all the DataFrames with pandas 28 | df = pd.concat(df_list) 29 | 30 | # Preview the resulting DataFrame 31 | df.head() 32 | 33 | ''' service_name request_count 34 | 0 72 Hour Violation 8 35 | 1 Graffiti Removal 2 36 | 2 Missed Collection 12 37 | 3 Street Light Out 21 38 | 4 Pothole 33''' 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VIII. 
Sharing files through a website.txt: -------------------------------------------------------------------------------- 1 | | HTML tables in pandas | 2 | 3 | > Convert df to HTML 4 | df.to_html('table_agg.html') 5 | > Just certain columns to HTML 6 | df.to_html('table_agg.html', 7 | render_links=True, 8 | columns=['service_name', 'request_count', 'info_link'], 9 | border=0) 10 | 11 | | upload HTML file to S3 | 12 | 13 | s3.upload_file( 14 | Filename='./table_agg.html', 15 | Bucket='x', 16 | Key='table.html', 17 | ExtraArgs={ 18 | 'ContentType': 'text/html', # image/png 19 | 'ACL': 'public-read' 20 | }) 21 | 22 | | IANA MEDIA TYPES | 23 | 24 | text/html 25 | image/png 26 | application/pdf 27 | text/csv 28 | 29 | | Generating index page | 30 | 31 | # create column 'Link' that contains website URL + key 32 | base_url="https://datacamp-website/" 33 | objects_df['Link'] = base_url + objects_df['Key'] 34 | 35 | # Write df to html 36 | objects_df.to_html('report_listing.html', 37 | columns=['Link','LastModified','Size'], 38 | render_links=True) 39 | 40 | | Uploading index page | 41 | 42 | s3.upload_file( 43 | Filename='./report_listing.html', 44 | Bucket='x', 45 | Key='index.html', 46 | ExtraArgs={ 47 | 'ContentType':'text/html', 48 | 'ACL': 'public-read'} 49 | ) 50 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VIV. Generate HTML table from Pandas.py: -------------------------------------------------------------------------------- 1 | """ 2 | Generate HTML table from Pandas 3 | Residents often complain that they don't know the full range of services offered through the Get It Done application. 4 | 5 | In an effort to streamline the communication between the City government and the public, the City Council asked Sam to generate a table of all the services in the Get It Done system. 6 | 7 | The system is dynamic and grows on a weekly basis, adding additional services. Sam didn't want to waste time updating the file manually, and felt like she could automate the process. 8 | 9 | She loaded the DataFrame of available services into the services_df variable: 10 | 11 | services_df dataframe 12 | 13 | Instructions 14 | 100 XP 15 | - Generate an HTML table with no border and only the 'service_name' and 'link' columns. 16 | - Generate an HTML table with borders and all columns. 17 | - Make sure to set all URLs to be clickable. 18 | 19 | """ 20 | # Generate an HTML table with no border and selected columns 21 | services_df.to_html('./services_no_border.html', 22 | # Keep specific columns only 23 | columns=['service_name', 'link'], 24 | # Make URLs clickable, set border 25 | render_links=True, 26 | border=0) 27 | 28 | # Generate an HTML table with border and all columns 29 | services_df.to_html('./services_border_all_columns.html', 30 | render_links=True, 31 | border=1) 32 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/X. Upload an HTML file to S3.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload an HTML file to S3 3 | When the Streets Operations manager heard of what Sam has been working on, he asked her to build him a dashboard of Get It Done requests. 4 | 5 | He wants to use the dashboard to staff and schedule his team accordingly.
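(Aside, not part of the exercise: the notes above upload index.html as a public object; if the whole bucket should behave like a small website, S3 static website hosting can be enabled. A minimal sketch, assuming the same s3 client and a hypothetical bucket name.)

# Sketch: turn on static website hosting so index.html is served at the bucket's website endpoint
s3.put_bucket_website(
    Bucket='datacamp-public',
    WebsiteConfiguration={'IndexDocument': {'Suffix': 'index.html'}})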
6 | 7 | Sam generated a nice dashboard html file with the Python bokeh charting library: 8 | 9 | Bokeh Plot 10 | 11 | She wants to serve it as a website, providing an interactive dashboard to members of Streets Operations. 12 | 13 | Letting S3 serve the dashboard as a site lets her write a script that continuously updates the generated HTML file and keeps the Streets Operations team updated on the latest requests. 14 | 15 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. 16 | 17 | Instructions 18 | 100 XP 19 | _ Upload the 'lines.html' file to 'datacamp-public' bucket. 20 | _ Specify the proper content type for the uploaded file. 21 | _ Specify that the file should be public. 22 | _ Print the Public Object URL for the new file. 23 | 24 | """ 25 | # Upload the lines.html file to S3 26 | s3.upload_file(Filename='lines.html', 27 | # Set the bucket name 28 | Bucket='datacamp-public', Key='index.html', 29 | # Configure uploaded file 30 | ExtraArgs = { 31 | # Set proper content type 32 | 'ContentType': 'text/html', 33 | # Set proper ACL 34 | 'ACL': 'public-read'}) 35 | 36 | # Print the S3 Public Object URL for the new file. 37 | print("http://{}.s3.amazonaws.com/{}".format('datacamp-public', 'index.html')) 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XI. Combine daily requests for February.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combine daily requests for February 3 | It's been a month since Sam last ran the report script, and it's time for her to make a new report for February. 4 | 5 | She wants to upload new reports for February and update the file listing, expanding on the work she completed during the last video lesson: 6 | 7 | Directory listing screenshot 8 | 9 | She has already created the boto3 S3 client and stored in the s3 variable. She stored the contents of her objects in request_files. 10 | 11 | You will help Sam aggregate the requests from February by downloading files from the gid-requests bucket and concatenating them into one DataFrame! 12 | 13 | Instructions 14 | 100 XP 15 | _ Load each object from s3. 16 | _ Read it into pandas and append it to df_list. 17 | _ Concatenate all DataFrames in df_list. 18 | _ Preview the DataFrame. 19 | 20 | """ 21 | df_list = [] 22 | 23 | # Load each object from s3 24 | for file in request_files: 25 | s3_day_reqs = s3.get_object(Bucket='gid-requests', 26 | Key=file['Key']) 27 | # Read the DataFrame into pandas, append it to the list 28 | day_reqs = pd.read_csv(s3_day_reqs['Body']) 29 | df_list.append(day_reqs) 30 | 31 | # Concatenate all the DataFrames in the list 32 | all_reqs = pd.concat(df_list) 33 | 34 | # Preview the DataFrame 35 | all_reqs.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XII. Upload aggregated reports for February.py: -------------------------------------------------------------------------------- 1 | """ 2 | In the last exercise, Sam downloaded the files for the month from the raw data bucket. 3 | 4 | Then she combined them all into one DataFrame that showcases all of the month's requests and requests type. 
5 | 6 | She stored this DataFrame in the variable all_reqs and used pandas' groupby functionality to count requests by service name, generating a new DataFrame agg_df: 7 | 8 | service_name count 9 | 0 72 Hour Violation 2910 10 | 1 Chain Link Fence Repair 90 11 | 2 Collections Truck Spill 30 12 | 3 Container Left Out 120 13 | 4 Dead Animal 360 14 | She has already created the boto3 S3 client in the s3 variable. 15 | 16 | Help her publish this month's request statistics. 17 | 18 | Write agg_df to CSV and HTML files, and upload them to S3 as public files. 19 | 20 | Instructions 21 | 100 XP 22 | _ Write CSV and HTML versions of agg_df and name them 'feb_final_report.csv' and 'feb_final_report.html' respectively. 23 | _ Upload both versions of agg_df to the gid-reports bucket and set them to public read. 24 | 25 | """ 26 | # Write agg_df to a CSV and HTML file with no border 27 | agg_df.to_csv('./feb_final_report.csv') 28 | agg_df.to_html('./feb_final_report.html', border=0) 29 | 30 | # Upload the generated CSV to the gid-reports bucket 31 | s3.upload_file(Filename='./feb_final_report.csv', 32 | Key='2019/feb/final_report.csv', Bucket='gid-reports', 33 | ExtraArgs = {'ACL': 'public-read'}) 34 | 35 | # Upload the generated HTML to the gid-reports bucket 36 | s3.upload_file(Filename='./feb_final_report.html', 37 | Key='2019/feb/final_report.html', Bucket='gid-reports', 38 | ExtraArgs = {'ContentType': 'text/html', 39 | 'ACL': 'public-read'}) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XIII. Update index to include February.py: -------------------------------------------------------------------------------- 1 | """ 2 | Update index to include February 3 | In the previous two exercises, Sam has: 4 | 5 | Read the daily Get It Done request logs for February. 6 | Combined them into a single DataFrame. 7 | Generated a DataFrame with aggregated metrics (request counts by type). 8 | Wrote that DataFrame to CSV and HTML final report files. 9 | Uploaded these files to S3. 10 | Now, she wants these files to be accessible through the directory listing. Currently, it only shows links for January reports: Screenshot of Get It Done reports listing 11 | 12 | She has created the boto3 S3 client and stored it in the s3 variable. 13 | 14 | Help Sam generate a new directory listing with February's uploaded reports and store it in a DataFrame. 15 | 16 | Instructions 17 | 100 XP 18 | _ List the 'gid-reports' bucket objects starting with '2019/'. 19 | _ Convert the content of the objects list to a DataFrame. 20 | _ Create a column 'Link' that contains Public Object URL + key. 21 | _ Preview the DataFrame. 22 | 23 | """ 24 | # List the gid-reports bucket objects starting with 2019/ 25 | objects_list = s3.list_objects(Bucket='gid-reports', Prefix='2019/') 26 | 27 | # Convert the response contents to DataFrame 28 | objects_df = pd.DataFrame(objects_list['Contents']) 29 | 30 | # Create a column "Link" that contains Public Object URL 31 | base_url = "http://gid-reports.s3.amazonaws.com/" 32 | objects_df['Link'] = base_url + objects_df['Key'] 33 | 34 | # Preview the resulting DataFrame 35 | objects_df.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XIV.
Upload the new index.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload the new index 3 | Sam is almost done! In the last exercise, she generated a new directory listing, storing it in the objects_df variable: 4 | 5 | Screenshot of objects_df 6 | 7 | Sam has created the boto3 S3 client in the s3 variable. objects_df is populated with the new directory listing from the previous exercise. 8 | 9 | The next step is to write objects_df to an HTML file, and upload it to S3 replacing the current 'index.html' file. 10 | 11 | Help Sam update the directory listing, letting the public access reports for February as well as January! 12 | 13 | Instructions 14 | 100 XP 15 | _ Write objects_df to an HTML file 'report_listing.html' with clickable links. 16 | _ The HTML file should only contain 'Link', 'LastModified', and 'Size' columns. 17 | _ Overwrite the 'index.html' on S3 by uploading the new version of the file. 18 | 19 | """ 20 | # Write objects_df to an HTML file 21 | objects_df.to_html('report_listing.html', 22 | # Set clickable links 23 | render_links=True, 24 | # Isolate the columns 25 | columns=['Link', 'LastModified', 'Size']) 26 | 27 | # Overwrite index.html key by uploading the new file 28 | s3.upload_file( 29 | Filename='./report_listing.html', Key='index.html', 30 | Bucket='gid-reports', 31 | ExtraArgs = { 32 | 'ContentType': 'text/html', 33 | 'ACL': 'public-read' 34 | }) 35 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/1. SNS Topics.txt: -------------------------------------------------------------------------------- 1 | every topic has 2 | -> ARN: (amazon resource name) 3 | 4 | | create SNS Topic | 5 | 6 | 1. > INITIALIZE CLIENT: 7 | sns = boto3.client('sns', 8 | region_name='us-east-1', 9 | aws_access_key_id=AWS_KEY_ID, 10 | aws_secret_access_key=AWS_SECRET) 11 | 12 | 2. > CREATE TOPIC: 13 | response = sns.create_topic(Name='city_alerts') 14 | 15 | # store ARN key in variable 16 | topic_arn= response['TopicArn'] 17 | 18 | 2. > SHORTCUT. CREATE TOPIC & GRAB TOPICARN 19 | sns.create_topic(Name='city_alerts')['TopicArn'] 20 | 21 | | list topics | 22 | 23 | response= sns.list_topics() 24 | 25 | | delete topic | 26 | 27 | # pass topic Arn 28 | sns.delete_topic(TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts') 29 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/10. Sending a single SMS message.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending a single SMS message 3 | Elena asks Sam outside of work (per regulation) to send some thank you SMS messages to her largest donors. 4 | 5 | Sam believes in Elena and her goals, so she decides to help. 6 | 7 | She decides writes a quick script that will run through Elena's contact list and send a thank you text. 8 | 9 | Since this is a one-off run and Sam is not expecting to alert these people regularly, there's no need to create a topic and subscribe them. 10 | 11 | Sam has created the boto3 SNS client and stored it in the sns variable. The contacts variable contains Elena's contacts as a DataFrame. 12 | 13 | Help Sam put together a quick hello to Elena's largest supporters! 14 | 15 | Instructions 16 | 100 XP 17 | _ For every contact, send an ad-hoc SMS to the contact's phone number. 18 | _ The message sent should include the contact's name. 
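(For context, a guess at the shape of contacts, which the exercise does not show: the loop below assumes at least 'Name' and 'Phone' columns, e.g.)

# Hypothetical structure of the contacts DataFrame assumed by the loop
import pandas as pd
contacts = pd.DataFrame({'Name': ['Elena'], 'Phone': ['+16196777733']})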
19 | 20 | """ 21 | # Loop through every row in contacts 22 | for idx, row in contacts.iterrows(): 23 | 24 | # Publish an ad-hoc sms to the user's phone number 25 | response = sns.publish( 26 | # Set the phone number 27 | PhoneNumber = str(row['Phone']), 28 | # The message should include the user's name 29 | Message = 'Hello {}'.format(row['Name']) 30 | ) 31 | 32 | print(response) 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/11. Different protocols per topic level.py: -------------------------------------------------------------------------------- 1 | """ 2 | Different protocols per topic level 3 | Now that Sam has created the critical and extreme topics, she needs to subscribe the staff from her contact list into these topics. 4 | 5 | Sam decided that the people subscribed to 'critical' topics will only receive emails. On the other hand, people subscribed to 'extreme' topics will receive SMS - because those are pretty urgent. 6 | 7 | She has already created the boto3 SNS client in the sns variable. 8 | 9 | Help Sam subscribe the users in the contacts DataFrame to email or SMS notifications based on their department. This will help get the right alerts to the right people, making the City of San Diego run better and faster! 10 | 11 | Instructions 12 | 100 XP 13 | _ Get the topic name by using the 'Department' field in the contacts DataFrame. 14 | _ Use the topic name to create the critical and extreme TopicArns for a user's department. 15 | _ Subscribe the user's email address to the critical topic. 16 | _ Subscribe the user's phone number to the extreme topic. 17 | 18 | """ 19 | for index, user_row in contacts.iterrows(): 20 | # Get topic names for the users's dept 21 | critical_tname = '{}_critical'.format(user_row['Department']) 22 | extreme_tname = '{}_extreme'.format(user_row['Department']) 23 | 24 | # Get or create the TopicArns for a user's department. 25 | critical_arn = sns.create_topic(Name=critical_tname)['TopicArn'] 26 | extreme_arn = sns.create_topic(Name=extreme_tname)['TopicArn'] 27 | 28 | # Subscribe each users email to the critical Topic 29 | sns.subscribe(TopicArn = critical_arn, 30 | Protocol='email', Endpoint=user_row['Email']) 31 | # Subscribe each users phone number for the extreme Topic 32 | sns.subscribe(TopicArn = extreme_arn, 33 | Protocol='sms', Endpoint=str(user_row['Phone'])) 34 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/12. Creating multi-level topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multi-level topics 3 | The City Council asked Sam to create a critical and extreme topic for each department. 4 | 5 | The critical topic will trigger alerts to staff and managers. 6 | 7 | The extreme topics will trigger alerts to politicians and directors - it means the thresholds are completely exceeded. 8 | 9 | For example, the trash department will have a trash_critical and trash_extreme topic. 10 | 11 | She has already created the boto3 SNS client in the sns variable. She created a variable departments that contains a unique list of departments. 12 | 13 | In this lesson, you will help Sam make the City run faster! 14 | 15 | You will create multi-level topics with a different set of subscribers that trigger based on different thresholds. 
16 | 17 | You will effectively be building a smart alerting system! 18 | 19 | Instructions 20 | 100 XP 21 | _ For each department create a critical topic and store it in critical. 22 | _ For each department, create an extreme topic and store it in extreme. 23 | _ Place the created TopicArns into dept_arns. 24 | _ Print the dictionary. 25 | 26 | """ 27 | dept_arns = {} 28 | 29 | for dept in departments: 30 | # For each deparment, create a critical topic 31 | critical = sns.create_topic(Name="{}_critical".format(dept)) 32 | # For each department, create an extreme topic 33 | extreme = sns.create_topic(Name="{}_extreme".format(dept)) 34 | # Place the created TopicARNs into a dictionary 35 | dept_arns['{}_critical'.format(dept)] = critical['TopicArn'] 36 | dept_arns['{}_extreme'.format(dept)] = extreme['TopicArn'] 37 | 38 | # Print the filled dictionary. 39 | print(dept_arns) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/13. Sending an alert.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending an alert 3 | Elena Block, Sam's old friend and council member is running for office and potholes are a big deal in her district. She wants to help fix it. 4 | 5 | Pothole 6 | Elena asked Sam to adjust the streets_critical topic to send an alert if there are over 100 unfixed potholes in the backlog. 7 | 8 | Sam has created the boto3 SNS client in the sns variable. She stored the streets_critical topic ARN in the str_critical_arn variable. 9 | 10 | Help Sam take the next step. 11 | 12 | She needs to check the current backlog count and send a message only if it exceeds 100. 13 | 14 | The fate of District 12, and the results of Elena's election rest on your and Sam's shoulders. 15 | 16 | Instructions 17 | 100 XP 18 | _ If there are over 100 potholes, send a message with the current backlog count. 19 | _ Create the email subject to also include the current backlog counit. 20 | _ Publish message to the streets_critical Topic ARN. 21 | 22 | """ 23 | # If there are over 100 potholes, create a message 24 | if streets_v_count > 100: 25 | # The message should contain the number of potholes. 26 | message = "There are {} potholes!".format(streets_v_count) 27 | # The email subject should also contain number of potholes 28 | subject = "Latest pothole count is {}".format(streets_v_count) 29 | 30 | # Publish the email to the streets_critical topic 31 | sns.publish( 32 | TopicArn = str_critical_arn, 33 | # Set subject and message 34 | Message = message, 35 | Subject = subject 36 | ) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/14. Sending multi-level alerts.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending multi-level alerts 3 | Sam is going to prototype her alerting system with the water data and the water department. 4 | 5 | According to the Director, when there are over 100 alerts outstanding, that's considered critical. If there are over 300, that's extreme. 6 | 7 | She has done some calculations and came up with a vcounts dictionary, that contains current requests for 'water', 'streets' and 'trash'. 8 | 9 | She has also already created the boto3 SNS client and stored it in the sns variable. 10 | 11 | In this exercise, you will help Sam publish a critical and an extreme alert based on the thresholds! 
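(Aside, not part of the exercise: the same threshold logic can be generalized over every department rather than hard-coding one. A minimal sketch, assuming the vcounts dictionary described above and the dept_arns dictionary built in the multi-level topics exercise.)

# Sketch: apply the critical/extreme thresholds to every department in vcounts
for dept, count in vcounts.items():
    if count > 100:
        sns.publish(TopicArn=dept_arns['{}_critical'.format(dept)],
                    Message='{} {} issues'.format(count, dept),
                    Subject='{} backlog is critical'.format(dept))
    if count > 300:
        sns.publish(TopicArn=dept_arns['{}_extreme'.format(dept)],
                    Message='{} {} issues'.format(count, dept),
                    Subject='{} backlog is extreme'.format(dept))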
12 | 13 | Instructions 14 | 100 XP 15 | _ If there are over 100 water violations, publish to 'water_critical' topic. 16 | _ If there are over 300 water violations, publish to 'water_extreme' topic. 17 | 18 | """ 19 | if vcounts['water'] > 100: 20 | # If over 100 water violations, publish to water_critical 21 | sns.publish( 22 | TopicArn = dept_arns['water_critical'], 23 | Message = "{} water issues".format(vcounts['water']), 24 | Subject = "Help fix water violations NOW!") 25 | 26 | if vcounts['water'] > 300: 27 | # If over 300 violations, publish to water_extreme 28 | sns.publish( 29 | TopicArn = dept_arns['water_extreme'], 30 | Message = "{} violations! RUN!".format(vcounts['water']), 31 | Subject = "THIS IS BAD. WE ARE FLOODING!") 32 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/2. Creating a Topic.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating a Topic 3 | Sam has been doing such a great job with her new skills, she got a promotion. 4 | 5 | With her new title of Information Systems Analyst 3 (as opposed to 2), she has gotten a tiny pay bump in exchange for a lot more work. 6 | 7 | Knowing what she can do, the City Council asked Sam to prototype an alerting system that will alert them when any department has more than 100 Get It Done requests outstanding. 8 | 9 | They would like for council members and department directors to receive the alert. 10 | 11 | Help Sam use her new knowledge of Amazon SNS to create an alerting system for Council! 12 | 13 | Instructions 14 | 100 XP 15 | _ Initialize the boto3 client for SNS. 16 | _ Create the 'city_alerts' topic and extract its topic ARN. 17 | _ Re-create the 'city_alerts' topic and extract its topic ARN with a one-liner. 18 | _ Verify the two topic ARNs match. 19 | 20 | """ 21 | # Initialize boto3 client for SNS 22 | sns = boto3.client('sns', 23 | region_name='us-east-1', 24 | aws_access_key_id=AWS_KEY_ID, 25 | aws_secret_access_key=AWS_SECRET) 26 | 27 | # Create the city_alerts topic 28 | response = sns.create_topic(Name="city_alerts") 29 | c_alerts_arn = response['TopicArn'] 30 | 31 | # Re-create the city_alerts topic using a oneliner 32 | c_alerts_arn_1 = sns.create_topic(Name='city_alerts')['TopicArn'] 33 | 34 | # Compare the two to make sure they match 35 | print(c_alerts_arn == c_alerts_arn_1) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/3. Creating Multiple Topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multiple topics 3 | Sam suddenly became a black sheep because she is responsible for an onslaught of text messages and notifications to department directors. 4 | 5 | No one will go to lunch with her anymore! 6 | 7 | To fix this, she decided to create a general topic per department for routine notifications, and a critical topic for urgent notifications. 8 | 9 | Managers will subscribe only to critical notifications, while supervisors can monitor general notifications. 10 | 11 | For example, the streets department would have 'streets_general' and 'streets_critical' as topics. 12 | 13 | She has initialized the SNS client and stored it in the sns variable. 14 | 15 | Help Sam create a tiered topic structure… and have friends again! 16 | 17 | Instructions 18 | 100 XP 19 | _ For every department, create a general topic. 
20 | _ For every department, create a critical topic. 21 | _ Print all the topics created in SNS. 22 | 23 | """ 24 | # Create list of departments 25 | departments = ['trash', 'streets', 'water'] 26 | 27 | for dept in departments: 28 | # For every department, create a general topic 29 | sns.create_topic(Name="{}_general".format(dept)) 30 | 31 | # For every department, create a critical topic 32 | sns.create_topic(Name="{}_critical".format(dept)) 33 | 34 | # Print all the topics in SNS 35 | response = sns.list_topics() 36 | print(response['Topics']) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/4. Deleting multiple topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deleting multiple topics 3 | It's hard to get things done in City government without good relationships. Sam is burning bridges with the general topics she created in the last exercise. 4 | 5 | People are shunning her because she is blowing up their phones with notifications. 6 | 7 | She decides to get rid of the general topics per department completely, and keep only critical topics. 8 | 9 | Sam has created the boto3 client for SNS and stored it in the sns variable. 10 | 11 | Help Sam regain her status in the bureaucratic social hierarchy by removing any topics that do not have the word critical in them. 12 | 13 | Instructions 14 | 100 XP 15 | _ Get the current list of topics. 16 | _ For every topic ARN, if it doesn't have the word 'critical' in it, delete it. 17 | _ Print the list of remaining critical topics. 18 | 19 | """ 20 | # Get the current list of topics 21 | topics = sns.list_topics()['Topics'] 22 | 23 | for topic in topics: 24 | # For each topic, if it is not marked critical, delete it 25 | if "critical" not in topic['TopicArn']: 26 | sns.delete_topic(TopicArn=topic['TopicArn']) 27 | 28 | # Print the list of remaining critical topics 29 | print(sns.list_topics()['Topics']) 30 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/5. SNS Subscriptions.txt: -------------------------------------------------------------------------------- 1 | | CREATE SMS SUBSCRIPTION | 2 | 3 | 1. > INITIALIZE boto3 SNS CLIENT 4 | sns = boto3.client('sns', 5 | region_name='us-east-1', 6 | aws_access_key_id=AWS_KEY_ID, 7 | aws_secret_access_key=AWS_SECRET) 8 | 9 | 2. > PASS SNS SUBSCRIBE WITH TOPICARN 10 | response = sns.subscribe( 11 | TopicArn = 'arn:aws:sns:us-east-1:3121333277766:city_alerts', 12 | Protocol='sms', 13 | Endpoint='+123112354') # phone number 14 | 15 | | CREATE EMAIL SUBSCRIPTION | 16 | 17 | 2.
> PASS SNS SUBSCRIBE WITH TOPICARN 18 | # will show pending confirmation until the recipient clicks the confirmation link 19 | response = sns.subscribe( 20 | TopicArn = 'arn:aws:sns:us-east-1:3121333277766:city_alerts', 21 | Protocol='email', 22 | Endpoint='max@maximize.com') # email 23 | 24 | | CREATE MULTIPLE EMAIL SUBSCRIPTIONS | 25 | 26 | # For each email in contacts, create subscription to street_critical 27 | for email in contacts['Email']: 28 | sns.subscribe(TopicArn = str_critical_arn, 29 | # Set channel and recipient 30 | Protocol = 'email', #channel 31 | Endpoint = email) # recipient 32 | 33 | # List subscriptions for streets_critical topic, convert to DataFrame 34 | response = sns.list_subscriptions_by_topic( 35 | TopicArn = str_critical_arn) 36 | subs = pd.DataFrame(response['Subscriptions']) 37 | 38 | # Preview the DataFrame 39 | subs.head() 40 | 41 | | LIST SUBSCRIPTIONS BY TOPIC | 42 | 43 | sns.list_subscriptions_by_topic( 44 | TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts' 45 | ) 46 | 47 | | LIST ALL SUBSCRIPTIONS | 48 | 49 | 50 | sns.list_subscriptions()['Subscriptions'] 51 | 52 | | DELETE SUBSCRIPTION | 53 | 54 | sns.unsubscribe( 55 | SubscriptionArn='arn:aws:sns:us-east-1:3121333277766:city_alerts:as23sdf:234sddf32:4444' 56 | ) 57 | 58 | | DELETE MULTIPLE SUBSCRIPTIONS | 59 | 60 | # get list of subscriptions 61 | response = sns.list_subscriptions_by_topic( 62 | TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts') 63 | subs= response['Subscriptions'] 64 | 65 | # Unsubscribe SMS Subscriptions 66 | for sub in subs: 67 | if sub['Protocol'] == 'sms': 68 | sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn']) 69 | 70 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/6. Subscribing to topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Subscribing to topics 3 | Many department directors are already receiving critical notifications. 4 | 5 | Now Sam is ready to start subscribing City Council members. 6 | 7 | She knows that they can be finicky, and elected officials are not known for their attention to detail or tolerance for failure. 8 | 9 | She is nervous, but decides to start by subscribing the friendliest Council Member she knows. She got Elena Block's email and phone number. 10 | 11 | Sam has initialized the boto3 SNS client and stored it in the sns variable. 12 | 13 | She has also stored the topic ARN for streets_critical in the str_critical_arn variable. 14 | 15 | Help Sam subscribe her first Council member to the streets_critical topic! 16 | 17 | Instructions 18 | 100 XP 19 | _ Subscribe Elena's phone number to the 'streets_critical' topic. 20 | _ Print the SMS subscription ARN. 21 | _ Subscribe Elena's email to the 'streets_critical' topic. 22 | _ Print the email subscription ARN. 23 | 24 | """ 25 | # Subscribe Elena's phone number to streets_critical topic 26 | resp_sms = sns.subscribe( 27 | TopicArn = str_critical_arn, 28 | Protocol='sms', Endpoint="+16196777733") 29 | 30 | # Print the SubscriptionArn 31 | print(resp_sms['SubscriptionArn']) 32 | 33 | # Subscribe Elena's email to streets_critical topic. 34 | resp_email = sns.subscribe( 35 | TopicArn = str_critical_arn, 36 | Protocol='email', Endpoint="eblock@sandiegocity.gov") 37 | 38 | # Print the SubscriptionArn 39 | print(resp_email['SubscriptionArn']) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3.
Reporting and Notifying!/7. Creating multiple subscriptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multiple subscriptions 3 | After the successful pilot with Councilwoman Elena Block, other City Council members have been asking to be signed up for alerts too. 4 | 5 | Sam decides that she should manage subscribers in a CSV file, otherwise she would lose track of who needs to be subscribed to what. 6 | 7 | She creates a CSV named contacts and decides to subscribe everyone in the CSV to the streets_critical topic. 8 | 9 | She has created the boto3 SNS client in the sns variable, and the streets_critical topic ARN is in the str_critical_arn variable. 10 | 11 | Sam is going from being a social pariah to being courted by multiple council offices. 12 | 13 | Help her solidify her position as master of all information by adding all the users in her CSV to the streets_critical topic! 14 | 15 | Instructions 16 | 100 XP 17 | _ For each element in the Email column of contacts, create a subscription to the 'streets_critical' Topic. 18 | _ List subscriptions for the 'streets_critical' Topic and convert them to a DataFrame. 19 | _ Preview the DataFrame. 20 | 21 | """ 22 | # For each email in contacts, create subscription to street_critical 23 | for email in contacts['Email']: 24 | sns.subscribe(TopicArn = str_critical_arn, 25 | # Set channel and recipient 26 | Protocol = 'email', #channel 27 | Endpoint = email) # recipient 28 | 29 | # List subscriptions for streets_critical topic, convert to DataFrame 30 | response = sns.list_subscriptions_by_topic( 31 | TopicArn = str_critical_arn) 32 | subs = pd.DataFrame(response['Subscriptions']) 33 | 34 | # Preview the DataFrame 35 | subs.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/8. Deleting multiple subscriptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deleting multiple subscriptions 3 | Now that Sam has a maturing notification system, she is learning that the types of alerts she sends do not bode well for text messaging. 4 | 5 | SMS alerts are great if the user can react that minute, but "We are 500 potholes behind" is not something that a Council Member can jump up and fix. 6 | 7 | She decides to remove all SMS subscribers from the streets_critical topic, but keep all email subscriptions. 8 | 9 | She created the boto3 SNS client in the sns variable, and the streets_critical topic ARN is in the str_critical_arn variable. 10 | 11 | In this exercise, you will help Sam remove all SMS subscribers and make this an email only alerting system. 12 | 13 | Instructions 14 | 100 XP 15 | _ List subscriptions for 'streets_critical' topic. 16 | _ For each subscription, if the protocol is 'sms', unsubscribe. 17 | _ List subscriptions for 'streets_critical' topic in one line. 18 | _ Print the subscriptions 19 | 20 | """ 21 | # List subscriptions for streets_critical topic. 
22 | response = sns.list_subscriptions_by_topic( 23 | TopicArn=str_critical_arn) 24 | 25 | # For each subscription, if the protocol is SMS, unsubscribe 26 | for sub in response['Subscriptions']: 27 | if sub['Protocol'] == 'sms': 28 | sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn']) 29 | 30 | # List subscriptions for streets_critical topic in one line 31 | subs = sns.list_subscriptions_by_topic( 32 | TopicArn=str_critical_arn)['Subscriptions'] 33 | 34 | # Print the subscriptions 35 | print(subs) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/9. Sending messages.txt: -------------------------------------------------------------------------------- 1 | | PUBLISHING TO A TOPIC | 2 | 3 | response = sns.publish( 4 | TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts', 5 | Message= 'Body text of SMS or e-mail', # not visible for text msg 6 | Subject = 'Subject Line for Email') 7 | 8 | | SENDING CUSTOM MESSAGES | 9 | 10 | num_of_reports = 157 11 | response = sns.publish( 12 | TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts', 13 | Message = 'There are {} reports outstanding'.format(num_of_reports), 14 | Subject= 'Subject Line for Email') 15 | 16 | | SINGLE SMS | 17 | 18 | response = sns.publish( 19 | PhoneNumber = '+13121233211', 20 | Message = 'Body text of SMS or e-mail') 21 | 22 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/1. Rekognizing patterns.txt: -------------------------------------------------------------------------------- 1 | - detects objects in images 2 | - extracts text from images 3 | 4 | | why not build our own? | 5 | 6 | - Rekognition : 7 | . quick but good 8 | . keep code simple 9 | . rekognize many things 10 | - Build a model : 11 | . custom requirements 12 | . security implications 13 | . large volumes 14 | 15 | 16 | 1. > CREATE CLIENT: 17 | client = boto3.client('rekognition') 18 | 19 | 2. > INITIALIZE S3 20 | s3= boto3.client( 21 | 's3', region_name='us-east-1', 22 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 23 | 24 | 3. > UPLOAD FILE 25 | s3.upload_file( 26 | Filename='report.jpg', Key='report.jpg', 27 | Bucket='datacamp-img') 28 | 29 | 4. > INITIALIZE REKOGNITION CLIENT: 30 | rekog = boto3.client( 31 | 'rekognition', 32 | region_name='us-east-1', 33 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 34 | 35 | | OBJECT DETECTION | 36 | 37 | 5. > DETECT: 38 | response = rekog.detect_labels( 39 | Image={'S3Object': { 40 | 'Bucket': 'datacamp-img', 41 | 'Name': 'report.jpg' 42 | }}, 43 | MaxLabels=10, 44 | MinConfidence=95 45 | ) 46 | 47 | | TEXT DETECTION | 48 | 49 | 5. > DETECT: 50 | response = rekog.detect_text( 51 | Image={'S3Object': 52 | { 53 | 'Bucket': 'datacamp-img', 54 | 'Name': 'report.jpg' 55 | } 56 | } 57 | ) 58 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/10. Scooter Community Sentiment.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scooter community sentiment 3 | The City Council is curious about how different communities in the City are reacting to the Scooters. The dataset has expanded since Sam's initial analysis, and now contains Vietnamese, Tagalog, Spanish and English reports. 4 | 5 | They ask Sam to see if she can figure it out.
She decides that the best way to proxy for a community is through language (at least with the data she immediately has access to). 6 | 7 | She has already loaded the CSV into the scooter_df variable: 8 | 9 | Scooter dataframe contents 10 | 11 | In this exercise, you will help Sam understand sentiment across many different languages. This will help the City understand how different communities are relating to scooters, something that will affect the votes of City Council members. 12 | 13 | Instructions 14 | 100 XP 15 | For every DataFrame row, detect the dominant language. 16 | Use the detected language to determine the sentiment of the description. 17 | Group the DataFrame by the 'sentiment' and 'lang' columns in that order 18 | 19 | """ 20 | for index, row in scooter_requests.iterrows(): 21 | # For every DataFrame row 22 | desc = scooter_requests.loc[index, 'public_description'] 23 | if desc != '': 24 | # Detect the dominant language 25 | resp = comprehend.detect_dominant_language(Text=desc) 26 | lang_code = resp['Languages'][0]['LanguageCode'] 27 | scooter_requests.loc[index, 'lang'] = lang_code 28 | # Use the detected language to determine sentiment 29 | scooter_requests.loc[index, 'sentiment'] = comprehend.detect_sentiment( 30 | Text=desc, 31 | LanguageCode=lang_code)['Sentiment'] 32 | # Perform a count of sentiment by group. 33 | counts = scooter_requests.groupby(['sentiment', 'lang']).count() 34 | counts.head() 35 | ''' 36 | service_request_id service_name public_description 37 | sentiment lang 38 | MIXED en 4 4 4 39 | tl 12 12 12 40 | vi 3 3 3 41 | NEGATIVE tl 5 5 5 42 | vi 10 10 10 ''' 43 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/2. Cat detector.py: -------------------------------------------------------------------------------- 1 | """ 2 | Cat detector 3 | Sam has been getting more and more challenging projects as a result of her popularity and success. 4 | 5 | The newest request is from the animal control team. They want to receive notifications when an image comes in from the Get It Done application contains a cat. They can find feral cats and go rescue them. They provided her with two images that she can test her system with. One contains a cat, one does not. Both images are referenced in variables image1 and image2 respectively. 6 | 7 | Sam has also created the boto3 Rekognition client in the rekog variable. 8 | 9 | Help Sam use Rekognition to enable the animal control team to rescue stray cats! 10 | 11 | Instructions 3/3 12 | Use the Rekognition client to detect the labels for image1. Return a maximum of 1 label. 13 | Detect the labels for image2 and print the response's labels.. 
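(For context, a guess at how image1 and image2 might be defined, following the S3Object form shown in the Rekognition notes; the bucket and key names are hypothetical.)

# Hypothetical shape of the image variables passed to detect_labels
image1 = {'S3Object': {'Bucket': 'datacamp-img', 'Name': 'walkway.jpg'}}
image2 = {'S3Object': {'Bucket': 'datacamp-img', 'Name': 'cat.jpg'}}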
14 | 15 | 16 | """ 17 | # Use Rekognition client to detect labels 18 | image1_response = rekog.detect_labels( 19 | # Specify the image as an S3Object; Return one label 20 | Image=image1, MaxLabels=1) 21 | 22 | # Print the labels 23 | print(image1_response['Labels']) 24 | ''' 25 | [{'Confidence': 99.85968017578125, 'Instances': [], 'Name': 'Walkway', 'Parents': [{'Name': 'Path'}]}] 26 | ''' 27 | 28 | # Use Rekognition client to detect labels 29 | image2_response = rekog.detect_labels( 30 | Image=image2, MaxLabels=1 31 | ) 32 | 33 | # Print the labels 34 | print(image2_response['Labels']) 35 | ''' 36 | [{'Confidence': 96.95977020263672, 'Instances': [{'BoundingBox': {'Height': 0.3252439796924591, 'Left': 0.668968915939331, 37 | 'Top': 0.14526571333408356, 'Width': 0.1563364714384079}, 38 | 'Confidence': 96.95977020263672}], 'Name': 'Cat', 'Parents': [{'Name': 'Pet'}, {'Name': 'Mammal'}, {'Name': 'Animal'}]}] 39 | ''' 40 | 41 | """Question 42 | _ Which image contained cats? 43 | Possible Answers: 44 | 45 | image1. 46 | OK image2.""" 47 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/3. Multiple cat detector.py: -------------------------------------------------------------------------------- 1 | """ 2 | Multiple cat detector 3 | After using the Cat Detector for a bit, the Animal Control team found that it was inefficient for them to pursue one cat at a time. It would be better if they could find clusters of cats. 4 | 5 | They asked if Sam could add the count of cats detected to the message in the alerts they receive. They also asked her to lower the confidence floor, allowing the system to have more false positives. 6 | 7 | Cats detected by Rekognition 8 | 9 | Sam has already: 10 | 11 | Created the Rekognition client. 12 | Called .detect_labels() with the Bucket and Key of the image on S3. 13 | Stored the result in the response variable. 14 | Help Sam save cat lives! Help her count the cats in each image and include that in the alert to Animal Control! 15 | 16 | Instructions 17 | 100 XP 18 | _ Iterate over each element of the 'Labels' key in response. 19 | _ Once you encounter a label with the name 'Cat', iterate over the label's instances. 20 | _ If an instance's confidence level exceeds 85, increment cats_count by 1. 21 | _ Print the final cat count. 22 | """ 23 | # Create an empty counter variable 24 | cats_count = 0 25 | # Iterate over the labels in the response 26 | for label in response['Labels']: 27 | # Find the cat label, look over the detected instances 28 | if label['Name'] == 'Cat': 29 | for instance in label['Instances']: 30 | # Only count instances with confidence > 85 31 | if (instance['Confidence'] > 85): 32 | cats_count += 1 33 | # Print count of cats 34 | print(cats_count) 35 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/4. Parking sign reader.py: -------------------------------------------------------------------------------- 1 | """ 2 | Parking sign reader 3 | City planners have millions of images from truck cameras. The City does not keep a record of parking regulations, so extracting them from these images would be very valuable for planners. 4 | 5 | Sam found detect_text() in the boto3 Rekognition documentation. She is going to use it to extract text out of the images City planners provided. 6 | 7 | 8 | 9 | Sam has: 10 | 11 | Created the Rekognition client.
12 | Called .detect_text() and stored the result in response. 13 | Help Sam create a data source for parking regulations in the City. Use Rekognition's .detect_text() method to extract lines and words from images of parking signs. 14 | 15 | Instructions 2/2 16 | _ Iterate over each detected text in response, and append each detected text to words if the text's type is 'WORD'. 17 | _ Iterate over each detected text in response, and append each detected text to lines if the text's type is 'LINE'. 18 | 19 | """ 20 | # Create empty list of words 21 | words = [] 22 | # Iterate over the TextDetections in the response dictionary 23 | for text_detection in response['TextDetections']: 24 | # If TextDetection type is WORD, append it to words list 25 | if text_detection['Type'] == 'WORD': 26 | # Append the detected text 27 | words.append(text_detection['DetectedText']) 28 | # Print out the words list 29 | print(words) 30 | ''' 31 | ['NO', 'PARKING', '7', 'AM', 'TO', '12', 'NOON', 'MONDAY'] 32 | ''' 33 | #------- 34 | 35 | # Create empty list of lines 36 | lines = [] 37 | # Iterate over the TextDetections in the response dictionary 38 | for text_detection in response['TextDetections']: 39 | # If TextDetection type is Line, append it to lines list 40 | if text_detection['Type'] == 'LINE': 41 | # Append the detected text 42 | lines.append(text_detection['DetectedText']) 43 | # Print out the words list 44 | print(lines) 45 | ''' 46 | ['NO PARKING', '7 AM', 'TO', '12 NOON', 'MONDAY'] 47 | ''' 48 | 49 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/5. Comprehending text.txt: -------------------------------------------------------------------------------- 1 | | Translating textx | 2 | 3 | 1. > INITIALIZE BOTO3 CLIENT 'translate': 4 | translate = boto3.client( 5 | 'translate', 6 | region_name='us-east-1', 7 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 8 | 9 | 2. > TRANSLATE TEXT: 10 | response = translate.translate_text( 11 | Text='Hello, how are you ?', 12 | SourceLanguageCode='auto', 13 | TargetLanguageCode='es') 14 | 15 | 2. > TRANSLATE TEXT ONELINER: (directly from response) 16 | translated_text = translate.translate_text( 17 | Text='Hello, how are you?', 18 | SourceLanguageCode='auto', 19 | TargetLanguageCode='es')['TranslatedText'] 20 | 21 | | Detect Language | 22 | 23 | 1. > INITIALIZE BOTO3 CLIENT 'comprehend' : 24 | comprehend = boto3.client( 25 | 'comprehend', 26 | region_name='us-east-1', 27 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 28 | 29 | 2. > DETECT DOMINANT LANGUAGE: 30 | response = comprehend.detect_dominant_language( 31 | Text='Hay basura por todas partes a lo largo de la carretera') 32 | 33 | | Sentiment Results | 34 | 35 | . Neutral --- Confidence 0.9 36 | . Positive 37 | . Negative 38 | . Mixed 39 | 40 | | Detect text Sentiment | 41 | 42 | 2. > DETECT SENTIMENT: 43 | response = comprehend.detect_sentiment( 44 | Text='Datacamp students are Amazing.', 45 | LanguageCode='en') 46 | 47 | 2. > DETECT SENTIMENT ONELINER (from response): 48 | sentiment = comprehend.detect_sentiment( 49 | Text='Franco is Amazing.', 50 | LanguageCode='en')['Sentiment'] 51 | 52 | 53 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/6. 
Detecting language.py: -------------------------------------------------------------------------------- 1 | """ 2 | Detecting language 3 | The City Council is wondering whether it's worth it to build a Spanish version of the Get It Done application. There is a large Spanish-speaking constituency, but they are not sure if they will engage. Building in multi-lingual translation complicates the system and needs to be justified. 4 | 5 | They ask Sam to figure out how many people are posting requests in Spanish. 6 | 7 | She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 8 | 9 | Get It Done requests in many languages 10 | 11 | Help Sam quantify the demand for a Spanish version of the Get It Done application. Figure out how many requesters use Spanish and print the final result! 12 | 13 | Instructions 14 | 100 XP 15 | For each row in the DataFrame, detect the dominant language. 16 | Assign the first selected language to the 'lang' column. 17 | Count the total number of posts in Spanish. 18 | 19 | """ 20 | # For each dataframe row 21 | for index, row in dumping_df.iterrows(): 22 | # Get the public description field 23 | description = dumping_df.loc[index, 'public_description'] 24 | if description != '': 25 | # Detect language in the field content 26 | resp = comprehend.detect_dominant_language(Text=description) 27 | # Assign the top choice language to the lang column. 28 | dumping_df.loc[index, 'lang'] = resp['Languages'][0]['LanguageCode'] 29 | 30 | # Count the total number of Spanish posts 31 | spanish_post_ct = len(dumping_df[dumping_df.lang == 'es']) 32 | # Print the result 33 | print("{} posts in Spanish".format(spanish_post_ct)) 34 | ''' 35 | 9 posts in Spanish 36 | ''' 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/7. Translating Get It Done requests.py: -------------------------------------------------------------------------------- 1 | """ 2 | Translating Get It Done requests 3 | Often, Get It Done requests come in with multiple languages in the description. This is a challenge for many City teams. In order to review the requests, many city teams need to have a translator on staff, or hope they know someone who speaks the language. 4 | 5 | The Streets director asked Sam to help. He wanted her to translate the Get It Done requests by running a job at the end of every day. 6 | 7 | Sam decides to run the requests through the AWS Translate service. She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 8 | 9 | Get It Done requests in many languages 10 | 11 | Help Sam translate the requests to English by running them through the AWS Translate service! 12 | 13 | Instructions 14 | 100 XP 15 | For each row in the DataFrame, translate it to English. 16 | Store the original language in the original_lang column. 17 | Store the new translation in the translated_desc column.
18 | 19 | """ 20 | for index, row in dumping_df.iterrows(): 21 | # Get the public_description into a variable 22 | description = dumping_df.loc[index, 'public_description'] 23 | if description != '': 24 | # Translate the public description 25 | resp = translate.translate_text( 26 | Text=description, 27 | SourceLanguageCode='auto', TargetLanguageCode='en') 28 | # Store original language in original_lang column 29 | dumping_df.loc[index, 'original_lang'] = resp['SourceLanguageCode'] 30 | # Store the translation in the translated_desc column 31 | dumping_df.loc[index, 'translated_desc'] = resp['TranslatedText'] # output 32 | # Preview the resulting DataFrame 33 | dumping_df = dumping_df[['service_request_id', 'original_lang', 'translated_desc']] 34 | dumping_df.head() 35 | ''' 36 | service_request_id original_lang translated_desc 37 | 0 93494 es The residents keep throwing stuff away 38 | 1 101502 en Couch, 4 chairs, mattress, carpet padding. thi... 39 | 2 101520 NaN NaN 40 | 3 101576 en On the South Side of Paradise Valley Road near... 41 | 4 101616 es There is a fridge on the street 42 | ''' 43 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/8. Getting request sentiment.py: -------------------------------------------------------------------------------- 1 | """ 2 | Getting request sentiment 3 | After successfully translating Get It Done cases for the Streets Director, he asked for one more thing. He really wants to understand how people in the City feel about his department's work. She believes she can answer that question via sentiment analysis of Get It Done requests. She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 4 | 5 | Get It Done requests in many languages 6 | 7 | In this exercise, you will help Sam better understand the moods of the voices of the people that submit Get It Done cases, and whether they are coming into the interaction with the City in a positive mood or a negative one. 8 | 9 | Instructions 10 | 100 XP 11 | Detect the sentiment of 'public_description' for every row. 12 | Store the result in the 'sentiment' column. 13 | 14 | """ 15 | for index, row in dumping_df.iterrows(): 16 | # Get the translated_desc into a variable 17 | description = dumping_df.loc[index, 'public_description'] 18 | if description != '': 19 | # Get the detect_sentiment response 20 | response = comprehend.detect_sentiment( 21 | Text=description, 22 | LanguageCode='en') 23 | # Get the sentiment key value into sentiment column 24 | dumping_df.loc[index, 'sentiment'] = response['Sentiment'] 25 | # Preview the dataframe 26 | dumping_df.head() 27 | ''' 28 | service_request_id original_lang public_description sentiment 29 | 0 93494 es There is a lot of trash NEGATIVE 30 | 1 101502 en Couch, 4 chairs, mattress, carpet padding. this is a on going problem NEGATIVE 31 | 2 101520 NaN NaN NEUTRAL 32 | 3 101576 en On the South Side of Paradise Valley Road near the intersection with Jester St. 33 | Stuff in trash bags, rolling suitcases, and shopping carts. I suspect possessions 34 | of folk camping in the canyon. MIXED 35 | 4 101616 es The residents keep throwing stuff away''' 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/9. 
Scooter dispatch.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scooter dispatch 3 | The City Council were huge fans of Sam's prediction about whether scooter was blocking a sidewalk or not. So much so, they asked her to build a notification system to dispatch crews to impound scooters from sidewalks. 4 | 5 | With the dataset she created, Sam can dispatch crews to the case's coordinates when a request has negative sentiment. 6 | 7 | Scooter Dataframe 8 | 9 | In this exercise, you will help Sam implement a system that dispatches crews based on sentiment and image recognition. You will help Sam pair human and machine for effective City management! 10 | 11 | Instructions 12 | 100 XP 13 | Get the SNS topic ARN for 'scooter_notifications'. 14 | For every row, if sentiment is 'NEGATIVE' and there is an image of a scooter, construct a message to send. 15 | Publish the notification to the SNS topic. 16 | 17 | """ 18 | # Get topic ARN for scooter notifications 19 | topic_arn = sns.create_topic(Name='scooter_notifications')['TopicArn'] 20 | 21 | for index, row in scooter_requests.iterrows(): 22 | # Check if notification should be sent 23 | if (row['sentiment'] == 'NEGATIVE') & (row['img_scooter'] == 1): 24 | # Construct a message to publish to the scooter team. 25 | message = "Please remove scooter at {}, {}. Description: {}".format( 26 | row['long'], row['lat'], row['public_description']) 27 | 28 | # Publish the message to the topic! 29 | sns.publish(TopicArn = topic_arn, 30 | Message = message, 31 | Subject = "Scooter Alert") 32 | -------------------------------------------------------------------------------- /Introduction to Airflow in Python/I Intro to Airflow.py: -------------------------------------------------------------------------------- 1 | """ 2 | complete introduction to the components of Apache Airflow and learn how and why you should use them. 3 | 4 | \ workflow / 5 | 6 | > set of steps to accomplish a d.e task 7 | ..such as -downloading files -copying data -filtering info -writing to a database 8 | > of varying leves of complexity 9 | 10 | \ airflow / 11 | 12 | > platform to program workflows 13 | - creating 14 | - scheduling 15 | - monitoring 16 | 17 | > can implement any language, but workflows written : in Python 18 | > implement workflows as DAGs : Direct Acyclic Graphs 19 | > access via: code, command-line, web interface 20 | 21 | | DAG simple definition eg: | 22 | 23 | etl_dag = DAG( 24 | dag_id = 'etl_pipeline', 25 | default_args = {"start_data":"2020-01-08"} 26 | ) 27 | ------ 28 | 29 | | Run a workflow in Airflow eg: | 30 | 31 | airflow run 32 | airflow run example_etl download-file 2020-01-10 33 | ------- 34 | """ 35 | #| 36 | #| 37 | ### Running a task in Airflow 38 | """-Which command would you enter in the console to run the desired task?""" 39 | airflow run etl_pipeline download_file 2020-01-08 40 | #| 41 | #| 42 | ### Examining Airflow commands 43 | """-Which of the following is NOT an Airflow sub-command? 
44 | 45 | | get airflow documentation ariflow description | 46 | 47 | airflow -h 48 | 49 | -list_dags 50 | -edit_dag 51 | -test 52 | -scheduler 53 | """ 54 | # Answ: edit_dag 55 | #| 56 | #| 57 | """ 58 | \ What's a DAG / 59 | 60 | > Direct : flow representing dependencies 61 | > Acyclic : do not loop 62 | > Graph : actual set of components 63 | 64 | # written in python, components in any language 65 | # made up of components : tasks to be executed (operators, sensors) 66 | # dependencies defined (explicit or implicit) 67 | 68 | | Define a DAG eg:| 69 | 70 | from airflow.models import DAG 71 | from datetime import datetime 72 | 73 | default_arguments = { 74 | 'owner' : 'jdoe', 75 | 'emial' : 'jdoe@datacamp.com', 76 | 'start_date' : datetime(2020, 1, 20) 77 | } 78 | 79 | etl_dag = DAG( 'etl_workflow', default_args=default_arguments ) 80 | --------- 81 | """ 82 | #| 83 | #| 84 | ### Defining a simple DAG 85 | """1/3""" 86 | # Import the DAG object 87 | from airflow.models import DAG 88 | #| 89 | """2/3""" 90 | # Import the DAG object 91 | from airflow.models import DAG 92 | 93 | # Define the default_args dictionary 94 | default_args = { 95 | 'owner': 'dsmith', 96 | 'start_date': datetime(2020, 1, 14), 97 | 'retries' : 2 98 | } 99 | #| 100 | """3/3""" 101 | # Import the DAG object 102 | from airflow.models import DAG 103 | 104 | # Define the default_args dictionary 105 | default_args = { 106 | 'owner': 'dsmith', 107 | 'start_date': datetime(2020, 1, 14), 108 | 'retries': 2 109 | } 110 | 111 | # Instantiate the DAG object 112 | etl_dag = DAG( 'example_etl', default_args= default_args) 113 | #| 114 | #| 115 | ### Working with DAGs and the Airflow shell 116 | """-How many DAGs are present in the Airflow system from the command-line? 117 | 118 | | list DAGs inshell | 119 | 120 | airflow list_dags 121 | """ 122 | airflow list_dags 123 | # Answ: 2 124 | #| 125 | #| 126 | ### Troubleshooting DAG creation 127 | """Instructions 128 | 129 | Use the airflow shell command to determine which DAG is not being recognized correctly. 130 | After you determine the broken DAG, open the file and fix any Python errors. 131 | Once modified, verify that the DAG now appears within Airflow's output.""" 132 | airflow list_dags 133 | ---------- 134 | from airflow.models import DAG # fixed part 135 | default_args = { 136 | 'owner': 'jdoe', 137 | 'email': 'jdoe@datacamp.com' 138 | } 139 | dag = DAG( 'refresh_data', default_args=default_args ) 140 | #| 141 | #| 142 | """ 143 | \ Airflow web interface / 144 | . 145 | """ 146 | #| 147 | #| 148 | ### Airflow web interface 149 | """Which airflow command would you use to start the webserver on port 9090?""" 150 | airflow webserver -h # see documentation 151 | ------- 152 | airflow webserver -p 9090 153 | #| 154 | #| 155 | ### Navigating the Airflow UI 156 | """Which of the following events have NOT run on your Airflow instance? 157 | 158 | -cli_scheduler 159 | -cli_webserver 160 | -cli_worker""" 161 | # -browse -logs 162 | # Answ: cli_worker 163 | #| 164 | #| 165 | ### Examining DAGs with the Airflow UI 166 | """she would like to know which operator is NOT in use with the DAG called update_state, as your team is trying to verify the 167 | components used in production workflows. 168 | -Remember to select the operator NOT used in this DAG. 
169 | 170 | 171 | BashOperator 172 | PythonOperator 173 | JdbcOperator 174 | SimpleHttpOperator""" 175 | # -go inside update_state name 176 | # -graph view 177 | # ANSW: JdbcOperator 178 | 179 | 180 | -------------------------------------------------------------------------------- /Introduction to Bash Scripting/I From Command-Line to Bash Script.py: -------------------------------------------------------------------------------- 1 | """~~~ 2 | Save yourself time when performing complex repetitive tasks. You’ll begin this course by refreshing your knowledge of common command-line programs and arguments 3 | before quickly moving into the world of Bash scripting. You’ll create simple command-line pipelines and then transform these into Bash scripts. You’ll then boost 4 | your skills and learn about standard streams and feeding arguments to your Bash scripts. 5 | """ 6 | 7 | """ 8 | *** Shell command refreshers : 9 | (e)grep : filter input based on regex pattern matching 10 | cat : concatenates file contents line-by-line 11 | tail/head : give only the last/first -n (flag) lines 12 | wc : word or line count (flags -w -l) 13 | sed : pattern matched string replacement 14 | 15 | *** shell practice : 16 | - run : grep 'p' fruits.txt 17 | apple 18 | - run : grep '[pc]' fruits.txt 19 | apple, carrot 20 | - run : sort | uniq -c ----to count 21 | 22 | *** Word separation shell syntax : 23 | eg : repl:~$ cat two_cities.txt | egrep 'Sydney Carton|Charles Darnay' 24 | 25 | ** REGEX : regular expressions, a vital skill for bash scripting 26 | 27 | """ 28 | 29 | """ 30 | ### Extracting scores with shell 31 | 32 | There is a file in either the start_dir/first_dir, start_dir/second_dir or start_dir/third_dir directory called soccer_scores.csv. 33 | It has columns Year,Winner,Winner Goals for outcomes of a soccer league. 34 | 35 | -cd into the correct directory and use cat and grep to find who was the winner in 1959. You could also just ls from the top directory if you like! 36 | 37 | terminal: repl:~$ cd start_dir/second_dir/ 38 | repl:~/start_dir/second_dir$ cat soccer_scores.csv | grep 1959 39 | 1959,Dunav,2 40 | 41 | Answer : Dunav 42 | 43 | """ 44 | 45 | """ 46 | ### Searching a book with shell 47 | 48 | There is a copy of Charles Dickens's infamous 'Tale of Two Cities' in your home directory called two_cities.txt. 49 | 50 | -Use command line arguments such as cat, grep and wc with the right flag to count the number of lines in the book that contain either the 51 | character 'Sydney Carton' or 'Charles Darnay'. Use exactly these spellings and capitalizations. 52 | 53 | Terminal : repl:~$ cat two_cities.txt | egrep 'Sydney Carton|Charles Darnay' | wc -l 54 | 55 | Answer : 77 56 | 57 | """ 58 | 59 | """ 60 | Your first Bash script 61 | 62 | >>>>>> which bash : to check the location of the bash you're running 63 | >>>>>> bash script_name.sh or ./script_name.sh: run bash script 64 | 65 | *** bash Anatomy : .sh 66 | 67 | """ 68 | 69 | """ 70 | ### A simple Bash script 71 | 72 | For this environment bash is not located at /usr/bash but at /bin/bash. You can confirm this with the command which bash. 73 | 74 | There is a file in your working directory called server_log_with_todays_date.txt. 75 | -Your task is to write a simple Bash script that concatenates this file out to the terminal so you can see what is inside. 76 | -Create a single-line script that concatenates the mentioned file. 77 | -Save your script and run from the console.
78 | 79 | 80 | Pane : #!/bin/bash 81 | 82 | # Concatenate the file 83 | cat server_log_with_todays_date.txt 84 | 85 | 86 | # Now save and run! 87 | """ 88 | 89 | """ 90 | ### Shell pipelines to Bash scripts 91 | 92 | Your job is to create a Bash script from a shell piped command which will aggregate to see how many times each team has won. 93 | 94 | -Create a single-line pipe to cat the file, cut out the relevant field and aggregate (sort & uniq -c will help!) based on winning team. 95 | -Save your script and run from the console. 96 | 97 | Pane: #!/bin/bash 98 | 99 | # Create a single-line pipe 100 | cat soccer_scores.csv | cut -d "," -f 2 | tail -n +2 | sort | uniq -c 101 | 102 | # Now save and run! 103 | out: 13 Arda 104 | 8 Beroe 105 | 9 Botev 106 | 8 Cherno 107 | 17 Dunav 108 | 15 Etar 109 | 4 Levski 110 | 1 Lokomotiv 111 | """ 112 | 113 | """ 114 | ### Extract and edit using Bash scripts 115 | 116 | -You will need to create a Bash script that makes use of sed to change the required team names. 117 | 118 | -Create a pipe using sed twice to change the team Cherno to Cherno City first, and then Arda to Arda United. 119 | -Pipe the output to a file called soccer_scores_edited.csv. 120 | -Save your script and run from the console. Try opening soccer_scores_edited.csv using shell commands to confirm it worked (the first line should be changed)! 121 | 122 | Pane : #!/bin/bash 123 | 124 | # Create a sed pipe to a new file 125 | cat soccer_scores.csv | sed 's/Cherno/Cherno City/g' | sed 's/Arda/Arda United/g' > soccer_scores_edited.csv 126 | 127 | # Now save and run! 128 | 129 | """ 130 | 131 | """ 132 | Standard streams & arguments 133 | 134 | *** arguments : 135 | >>>>>> $1 $2 ... : access bash arguments by position 136 | >>>>>> $@ or $* : access all arguments 137 | >>>>>> $# : gives the number of arguments 138 | """ 139 | 140 | ### Using arguments in Bash scripts 141 | 142 | # Echo the first and second ARGV arguments 143 | echo $1 144 | echo $2 145 | 146 | # Echo out the entire ARGV array 147 | echo $@ 148 | 149 | # Echo out the size of ARGV 150 | echo $# 151 | 152 | 153 | repl:~/workspace$ bash script.sh Bird Fish Rabbit 154 | 155 | """ 156 | ### Echo the first ARGV argument 157 | 158 | -Echo the first ARGV argument so you can confirm it is being read in. 159 | -cat all the files in the directory /hire_data and pipe to grep to filter using the city name (your first ARGV argument). 160 | -On the same line, pipe out the filtered data to a new CSV called cityname.csv where cityname is taken from the first ARGV argument. 161 | -Save your script and run from the console twice (do not use the ./script.sh method). Once with the argument Seoul. Then once with the argument Tallinn.
162 | """ 163 | # Echo the first ARGV argument 164 | echo $1 165 | 166 | # Cat all the files 167 | # Then pipe to grep using the first ARGV argument 168 | # Then write out to a named csv using the first ARGV argument 169 | cat hire_data/* | grep "$1" > "$1".csv 170 | 171 | 172 | repl:~/workspace$ bash script.sh Seoul 173 | Seoul 174 | repl:~/workspace$ bash script.sh Tallinn 175 | Tallinn 176 | -------------------------------------------------------------------------------- /Introduction to Data Engineering/I Introduction to Data Engineering.py: -------------------------------------------------------------------------------- 1 | """"****************************************************************************** 2 | Explore the differences between a data engineer 3 | and a data scientist, get an overview of the various tools data engineers use and 4 | expand your understanding of how cloud technology plays a role in data engineering.... 5 | **************************************************************************************""" 6 | #///WHAT IS DATA ENGINEERING/// 7 | #Tasks of the data engineer 8 | """find best fit 9 | Possible Answers 10 | 11 | 1 Apply a statistical model to a large dataset to find outliers. 12 | OK 2 Set up scheduled ingestion of data from the application databases to an analytical database. 13 | 3 Come up with a database schema for an application. """ 14 | #--- 15 | #Data engineer or data scientist? 16 | """drag items into correct bucket 17 | 18 | {Data engineer} 19 | clean corrupt data 20 | develop scalable data architecture 21 | set up processes to bring data together 22 | streamline data acquisition 23 | cloud tech """ 24 | #--- 25 | #Data engineering problems 26 | """decide where you're best suited to be of help 27 | 28 | ok 1 Data scientists are querying the online store databases directly and slowing down the functioning of the application since it's using the same database. 29 | 2 Harmful product recommendations are affecting the sales numbers of the online store. 30 | 3 The online store is slow because the application's database server doesn't have enough memory. 31 | 32 | (data engineer should make sure there's a separate database for analytics.) """ 33 | #--- 34 | #///TOOLS/// 35 | #Kinds of databases 36 | """identify the database in the schematics 37 | 38 | 1 All database nodes are on the left. 39 | 2 All nodes on the left and the analytics node on the right are databases. 40 | ok 3 Accounting, Online Store, Product Catalog, and Analytics are databases. """ 41 | #--- 42 | #Processing tasks 43 | """select the most correct statement 44 | 45 | 1 Data processing is often done on a single, very powerful machine. 46 | ok 2 Data processing is distributed over clusters of virtual machines. 47 | 3 Data processing is often very complicated because you have to manually distribute workload over several computers. 48 | 49 | ( join, clean, or organize data is done in the data processing) """ 50 | #--- 51 | #Scheduling tools 52 | """ which one is not a responsibility of the scheduler? 53 | 54 | 1 Make sure jobs run in a specific order and all dependencies are resolved correctly. 55 | 2 Make sure the jobs run at midnight UTC each day. 56 | ok 3 Scale up the number of nodes when there's lots of data to be processed. """ 57 | #--- 58 | #///CLOUD PROVIDERS/// 59 | #Why cloud computing? 60 | """benefits of using cloud computing as opposed to self-hosting data centers. Can you select the most correct statement about cloud computing? 61 | 62 | 1 Cloud computing is always cheaper. 
63 | ok 2 The cloud can provide you with the resources you need, when you need them. 64 | 3 On premise machines give me full control over the situation when things break. 65 | 66 | (cloud elasticity) """ 67 | #--- 68 | #Big players in cloud computing 69 | """Can you order the big three correctly? 70 | 71 | 1 amazon 72 | 2 azure 73 | 3 G cloud """ 74 | #--- 75 | #Cloud services 76 | """Classify the services into the correct bucket. 77 | - storage 78 | azure blob 79 | amazon s3 80 | cloud storage 81 | - compute 82 | azure virtual machine 83 | AWS EC2 84 | Google Compute Engine 85 | - DB 86 | Amazon RDS 87 | Azure SQL DB 88 | G Cloud SQL """ 89 | -------------------------------------------------------------------------------- /Introduction to Data Engineering/README.me: -------------------------------------------------------------------------------- 1 | # It touches 2 | 3 | upon all things you need to know to streamline your data processing. This introductory course will give you enough context to start exploring the world of data engineering. It's perfect for people who work at a company with several data sources and don't have a clear idea of how to use all those data sources in a scalable way. Be the first one to introduce these techniques to your company and become the company star employee. 4 | -------------------------------------------------------------------------------- /Introduction to PySpark/I Getting to know PySpark.py: -------------------------------------------------------------------------------- 1 | """ 2 | how Spark manages data and how you can read and write tables from Python. 3 | 4 | \ What is Spark, anyway? / 5 | 6 | | parallel computation | ( split data in clusters ) 7 | 8 | Spark is a platform for cluster computing. It lets you spread data and computations over clusters with multiple nodes. 9 | Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data. 10 | parallel computation can make certain types of programming tasks much faster. 11 | 12 | | Using Spark in Python | 13 | 14 | - cluster: 15 | - master : ( main pc ) 16 | - worker : ( rest of computers in cluster ) 17 | 18 | . create connection : by creating an instance of the SparkContext class 19 | . attributes : eg sc.version 20 | 21 | | create SparkSession | 22 | 23 | # Import SparkSession 24 | > from pyspark.sql import SparkSession 25 | 26 | # create SparkSession builder 27 | > my_spark = SparkSession.builder.getOrCreate() 28 | 29 | # print spark tables 30 | > print(spark.catalog.listTables()) 31 | 32 | | SparkSession attributes | 33 | 34 | # always . 35 | - catalog : ( lists the tables/metadata in the cluster ) 36 | - .listTables() 37 | # returns the names of all tables in the cluster as a list 38 | > spark.catalog.listTables() 39 | 40 | - .read # read different data sources into Spark DataFrames 41 | > spark.read.csv(file_path, header=True) 42 | 43 | | SparkSession / DataFrame methods | 44 | 45 | # always . 46 | - .show() -> print 47 | - .sql() -> run a query ( takes the query as a string, returns a DataFrame ) 48 | - .toPandas() -> returns corresponding 'pandas' DataFrame 49 | - .createDataFrame() -> turn a pandas DataFrame into a Spark DataFrame (local/temporary) 50 | - .createTempView() -> add to the catalog, but as temporary (in session) 51 | - .createOrReplaceTempView("temp") -> creates the temp view if it doesn't exist yet, otherwise updates it 52 | """ 53 | #| 54 | #| 55 | ### How do you connect to a Spark cluster from PySpark? 56 | # ANSW: Create an instance of the SparkContext class.
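The notes above list the SparkSession and DataFrame helpers separately; the short sketch below (an added illustration, not one of the course exercises) chains them together. It assumes a local PySpark installation, a made-up CSV path "/tmp/flights_sample.csv", and a likewise hypothetical view name "flights_sample".

# A minimal sketch tying together the calls summarized above
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()                       # reuse an existing session or start one

df = spark.read.csv("/tmp/flights_sample.csv", header=True)      # hypothetical file path
df.createOrReplaceTempView("flights_sample")                     # register the DataFrame in the catalog
print(spark.catalog.listTables())                                # the temp view should now be listed
print(spark.sql("SELECT * FROM flights_sample LIMIT 5").toPandas())  # query it, then convert to pandas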
57 | #| 58 | #| 59 | ### Examining The SparkContext 60 | # Verify SparkContext in environment 61 | print(sc) 62 | 63 | # Print Spark version 64 | print(sc.version) 65 | #- 66 | #- 3.2.0 67 | #| 68 | #| 69 | """ 70 | \ Using Spark DataFrames / 71 | 72 | . Spark structure : Resilient Distributed Dataset (RDD) 73 | . behaves : like a SQL table 74 | # DataFrames are more optimized for complicated operations than RDDs. 75 | 76 | - create Spark Dataframe: create a object from your 77 | #interface <'spark'> 78 | #connection <'sc'> 79 | """ 80 | #| 81 | #| 82 | ### Which of the following is an advantage of Spark DataFrames over RDDs? 83 | # ANSW: Operations using DataFrames are automatically optimized. 84 | #| 85 | #| 86 | ### Creating a SparkSession 87 | # Import SparkSession from pyspark.sql 88 | from pyspark.sql import SparkSession 89 | 90 | # Create my_spark 91 | my_spark = SparkSession.builder.getOrCreate() 92 | 93 | # Print my_spark 94 | print(my_spark) 95 | #| 96 | #| 97 | ### Viewing tables 98 | # Print the tables in the catalog 99 | print(spark.catalog.listTables()) 100 | #| 101 | #| 102 | ### Are you query-ious? 103 | # Don't change this query 104 | query = "FROM flights SELECT * LIMIT 10" 105 | 106 | # Get the first 10 rows of flights 107 | flights10 = spark.sql(query) 108 | 109 | # Show the results 110 | flights10.show() 111 | #| 112 | #| 113 | ### Pandafy a Spark DataFrame 114 | # Don't change this query 115 | query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest" 116 | 117 | # Run the query 118 | flight_counts = spark.sql(query) 119 | 120 | # Convert the results to a pandas DataFrame 121 | pd_counts = flight_counts.toPandas() 122 | 123 | # Print the head of pd_counts 124 | print(pd_counts.head()) 125 | #| 126 | #| 127 | ### Put some Spark in your data 128 | # Create pd_temp 129 | pd_temp = pd.DataFrame(np.random.random(10)) 130 | 131 | # Create spark_temp from pd_temp 132 | spark_temp = spark.createDataFrame(pd_temp) 133 | 134 | # Examine the tables in the catalog 135 | print(spark.catalog.listTables()) 136 | 137 | # Add spark_temp to the catalog 138 | spark_temp.createOrReplaceTempView("temp") 139 | 140 | # Examine the tables in the catalog again 141 | print(spark_temp) 142 | #| 143 | #| 144 | ### Dropping the middle man 145 | # Don't change this file path 146 | file_path = "/usr/local/share/datasets/airports.csv" 147 | 148 | # Read in the airports data 149 | airports = spark.read.csv(file_path, header=True) # to view column names 150 | 151 | # Show the data 152 | airports.show() 153 | -------------------------------------------------------------------------------- /Introduction to PySpark/II. Manipulating data.py: -------------------------------------------------------------------------------- 1 | """ 2 | pyspark.sql module, which provides optimized data queries to your Spark session. 3 | 4 | | Pyspark attributes : manipulation | 5 | 6 | - .withColumn() -> create new column . 
('new_column', old_col + 1) 7 | - df.colName() -> extract column name 8 | 9 | | pyspark methods : manipulation | 10 | 11 | - spark.table() -> create df containing values of table in the .catalog 12 | - .filter() -> like a cut for SQL 13 | > flights.filter("air_time > 120").show() # return values cut #(SQL string) 14 | > flights.filter(flights.air_time > 120).show() # return bool 15 | 16 | - .select() -> returns only the columns you specify 17 | | > selectNoStr = flights.select(flights.origin, flights.dest, flights.carrier) 18 | | > selectStr = flights.select("tailnum", "origin", "dest") 19 | | > flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed") 20 | - .withColumn() -> returns all columns in addition to the defined. 21 | -> create new df column 22 | - .withColumnRenamed 23 | -> rename columns 24 | 25 | - GroupedData 26 | | 27 | - .agg() -> pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule. 28 | # Import pyspark.sql.functions as F 29 | > import pyspark.sql.functions as F 30 | - F.stddev -> estandard deviation 31 | 32 | - .join() -> takes three arguments. 1.the second DataFrame to join, 2. on == key column(s) as a string, 3. how == specifies kind of join how="leftouter" 33 | """ 34 | #| 35 | #| 36 | ### Creating columns 37 | # Create the DataFrame flights 38 | flights = spark.table('flights') # create table 'name' 39 | 40 | # Show the head 41 | flights.show() # head() is default 42 | 43 | # Add duration_hrs from air_time 44 | flights = flights.withColumn('duration_hrs', flights.air_time /60) 45 | #| 46 | #| 47 | ### SQL in a nutshell 48 | """Which of the following queries returns a table of tail numbers and destinations for flights that lasted more than 10 hours?""" 49 | # ANSW: SELECT dest, tail_num FROM flights WHERE air_time > 600; 50 | #| 51 | #| 52 | ### SQL in a nutshell (2) 53 | """What information would this query get?""" 54 | """ SELECT AVG(air_time) / 60 FROM flights 55 | GROUP BY origin, carrier; """ 56 | # ANSW: The average length of each airline's flights from SEA and from PDX in hours. 
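To connect the SQL in this quiz back to the DataFrame methods listed at the top of the chapter, here is a small added sketch (not a course exercise) of how roughly the same aggregation could be written with .groupBy() and .agg(); it assumes the flights DataFrame and running spark session from the surrounding exercises.

# Average air time per origin and carrier, converted from minutes to hours
import pyspark.sql.functions as F

avg_hrs = (flights
           .groupBy("origin", "carrier")
           .agg((F.avg("air_time") / 60).alias("avg_air_time_hrs")))
avg_hrs.show()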
57 | #| 58 | #| 59 | ### Filtering Data 60 | # Filter flights by passing a string 61 | long_flights1 = flights.filter("distance > 1000") # sql string 62 | 63 | # Filter flights by passing a column of boolean values 64 | long_flights2 = flights.filter(flights.distance > 1000) 65 | 66 | # Print the data to check they're equal 67 | long_flights1.show() 68 | long_flights2.show() 69 | #- same result 70 | #| 71 | #| 72 | ### Selecting 73 | # Select the first set of columns 74 | selected1 = flights.select("tailnum", "origin", "dest") 75 | 76 | # Select the second set of columns 77 | temp = flights.select(flights.origin, flights.dest, flights.carrier) 78 | 79 | # Define first filter 80 | filterA = flights.origin == "SEA" 81 | 82 | # Define second filter 83 | filterB = flights.dest == "PDX" 84 | 85 | # Filter the data, first by filterA then by filterB 86 | selected2 = temp.filter(filterA).filter(filterB) 87 | #| 88 | #| 89 | ### Selecting II 90 | # Define avg_speed 91 | avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed") 92 | 93 | # Select the correct columns 94 | speed1 = flights.select("origin", "dest", "tailnum", avg_speed) 95 | 96 | # Create the same table using a SQL expression 97 | speed2 = flights.selectExpr( 98 | "origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed") 99 | #| 100 | #| 101 | ### Aggregating 102 | # Find the shortest flight from PDX in terms of distance 103 | # Perform the filtering by referencing the column directly, not passing a SQL string. 104 | flights.filter(flights.origin == "PDX").groupBy().min('distance').show() 105 | 106 | # Find the longest flight from SEA in terms of air time 107 | flights.filter(flights.origin == 'SEA').groupBy().max('air_time').show() 108 | #| 109 | #| 110 | ### Aggregating II 111 | # Average duration of Delta flights 112 | flights.filter(flights.carrier == "DL").filter(flights.origin == "SEA").groupBy().avg("air_time").show() 113 | 114 | # Total hours in the air 115 | flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show() 116 | #| 117 | #| 118 | ### Grouping and Aggregating I 119 | # Group by tailnum 120 | by_plane = flights.groupBy("tailnum") 121 | 122 | # Number of flights each plane made 123 | by_plane.count().show() 124 | 125 | # Group by origin 126 | by_origin = flights.groupBy("origin") 127 | 128 | # Average duration of flights from PDX and SEA 129 | by_origin.avg("air_time").show() 130 | #| 131 | #| 132 | ### Grouping and Aggregating II 133 | # Import pyspark.sql.functions as F 134 | import pyspark.sql.functions as F 135 | 136 | # Group by month and dest 137 | by_month_dest = flights.groupBy('month','dest') 138 | 139 | # Average departure delay by month and destination 140 | by_month_dest.avg('dep_delay').show() 141 | 142 | # Standard deviation of departure delay 143 | by_month_dest.agg(F.stddev('dep_delay')).show() # stddev = standard deviation 144 | #| 145 | #| 146 | ### joining 147 | """Which of the following is not true? 148 | 149 | Joins combine tables. 150 | Joins add information to a table. 151 | Storing information in separate tables can reduce repetition. 152 | There is only one kind of join.""" 153 | # ANSW: There is only one kind of join. 
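Since the point of this quiz is that there is more than one kind of join, the short added sketch below (not a course exercise) shows how the how= argument of .join() picks the join type. It assumes the flights and airports DataFrames as loaded in the neighbouring exercises, with airports still keyed by its faa column.

# Same two tables, two different join types selected via how=
inner_joined = flights.join(airports, flights.dest == airports.faa, how="inner")      # keep only matching rows
left_joined = flights.join(airports, flights.dest == airports.faa, how="leftouter")   # keep every flights row
print(inner_joined.count(), left_joined.count())                                      # the counts can differ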
154 | #| 155 | #| 156 | ### Joining II 157 | # Examine the data 158 | print(airports.show()) 159 | 160 | # Rename the faa column 161 | airports = airports.withColumnRenamed('faa', 'dest') 162 | 163 | # Join the DataFrames 164 | flights_with_airports = flights.join(airports, on='dest', how='leftouter') 165 | 166 | # Examine the new DataFrame 167 | print(flights_with_airports.show()) 168 | -------------------------------------------------------------------------------- /Introduction to PySpark/III. Getting started with machine learning pipelines.py: -------------------------------------------------------------------------------- 1 | """ 2 | PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter. 3 | 4 | \ Machine Learning Pipelines / 5 | 6 | > summodule: pyspark.ml 7 | - Transformer class 8 | - Estimator class 9 | 10 | | methods for ML | 11 | 12 | # Transformer class 13 | - .transform() -> takes a DataFrame and returns a new DataFrame +1 column 14 | - Bucketizer -> create discrete bins from a continuous feature 15 | 16 | # Estimator class 17 | - .fit() -> they return a model object 18 | """ 19 | #| 20 | #| 21 | ### Machine Learning Pipelines 22 | """ Which of the following is not true about machine learning in Spark?""" 23 | # ANSW: Spark's algorithms give better results than other algorithms. 24 | #| 25 | #| 26 | ### Join the DataFrames 27 | # Rename year column to avoid duplicates 28 | planes = planes.withColumnRenamed('year', 'plane_year') 29 | 30 | # Join the DataFrames 31 | model_data = flights.join(planes, on='tailnum', how="leftouter") 32 | """ 33 | | Data types for ML | 34 | 35 | > change types 36 | .cast() 'integer','double' 37 | """ 38 | #| 39 | #| 40 | ### Data Types 41 | """What kind of data does Spark need for modeling?""" 42 | # ANSW: Numeric 43 | #| 44 | #| 45 | ### String to integer 46 | # Cast the columns to integers # df.col notation 47 | model_data = model_data.withColumn("arr_delay", model_data.arr_delay.cast('integer')) 48 | model_data = model_data.withColumn("air_time", model_data.air_time.cast('integer')) 49 | model_data = model_data.withColumn("month", model_data.month.cast('integer')) 50 | model_data = model_data.withColumn("plane_year", model_data.plane_year.cast('integer')) 51 | #| 52 | #| 53 | ### Create a new column 54 | # Create the column plane_age 55 | # Create the column plane_age 56 | model_data = model_data.withColumn( 57 | "plane_age", model_data.year - model_data.plane_year) 58 | #| 59 | #| 60 | ### Making a Boolean 61 | # Create is_late bool 62 | model_data = model_data.withColumn("is_late", model_data.arr_delay >0) 63 | 64 | # Convert to an integer w/ .cast() 65 | model_data = model_data.withColumn("label", model_data.is_late.cast('integer')) 66 | 67 | # Remove missing values w/ SQL string 68 | model_data = model_data.filter("arr_delay is not NULL and dep_delay is not NULL and air_time is not NULL and plane_year is not NULL") 69 | #| 70 | #| 71 | """ 72 | \ Strings and Factors / 73 | 74 | > pyspark.ml.features submodule 75 | 'one-hot vectors' -> all elements are zero except for at most one element, which has a value of one (1). 76 | 77 | 1 > create a 'StringIndexer'. 78 | carr_indexer = StringIndexer(inputCol='carrier',outputCol='carrier_index') 79 | 2 > encode w/ 'OneHotEncoder'. 80 | carr_encoder = OneHotEncoder(inputCol='carrier_index',outputCol='carrier_fact') 81 | 3 > 'Pipeline' will take care of the rest. 
82 | # Fit and transform the data 83 | piped_data = flights_pipe.fit(model_data).transform(model_data) 84 | ----------------------- 85 | > 'VectorAssembler' -> combine all of the columns containing our features into a single column 86 | inputCols = ['column_name1','c2','c3'] 87 | outputCol = 'features' 88 | """ 89 | #| 90 | #| 91 | ### One-hot vectors 92 | """Why do you have to encode a categorical feature as a one-hot vector?""" 93 | # ANSW: Spark can only model numeric features. 94 | #| 95 | #| 96 | ### Carrier 97 | # Create a StringIndexer 98 | carr_indexer = StringIndexer(inputCol='carrier',outputCol='carrier_index') 99 | 100 | # Create a OneHotEncoder 101 | carr_encoder = OneHotEncoder(inputCol='carrier_index',outputCol='carrier_fact') 102 | #| 103 | #| 104 | ### Destination 105 | # Create a StringIndexer 106 | dest_indexer = StringIndexer(inputCol='dest',outputCol='dest_index') 107 | 108 | # Create a OneHotEncoder 109 | dest_encoder = OneHotEncoder(inputCol='dest_index',outputCol='dest_fact') 110 | #| 111 | #| 112 | ### Assemble a vector 113 | # Make a VectorAssembler 114 | vec_assembler = VectorAssembler(inputCols=["month", "air_time", "carrier_fact", "dest_fact", "plane_age"], outputCol='features') 115 | #| 116 | #| 117 | """ \ create pipelines / 118 | 119 | > Import Pipeline 120 | from pyspark.ml import Pipeline 121 | > stages 122 | 123 | """ 124 | #| 125 | #| 126 | ### Create the pipeline 127 | # Import Pipeline 128 | from pyspark.ml import Pipeline 129 | 130 | # Make the pipeline 131 | flights_pipe = Pipeline(stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]) 132 | #| 133 | #| 134 | ### Test vs. Train 135 | """ Why is it important to use a separate test set for model evaluation?""" 136 | # ANSW: By evaluating your model with a test set you can get a good idea of performance on new data. 137 | #| 138 | #| 139 | ### Transform the data 140 | # Fit and transform the data 141 | piped_data = flights_pipe.fit(model_data).transform(model_data) 142 | #| 143 | #| 144 | ### Split the data 145 | # Split the data into training and test sets 146 | # training with 60% of the data, and test with 40% 147 | training, test = piped_data.randomSplit([.6, .4]) 148 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/I. Create the modeler.py: -------------------------------------------------------------------------------- 1 | """ 2 | ||||| 3 | create a model that predicts which flights will be delayed. 4 | ||||| 5 | 6 | \ Logistic regression model / 7 | 8 | similar to a linear regression, but instead of predicting a numeric variable, 9 | ->it 'predicts the probability (between 0 and 1) of an event.' 10 | 11 | - set 'cutoff' point 12 | - if prediction above cutoff == 'yes' 13 | - if it's below, you classify it as a 'no'! 14 | 15 | -> hyperparameter : just a value in the model that is supplied by the user to maximize performance.""" 16 | #| 17 | #| 18 | """ Why do you supply hyperparameters?""" 19 | # ANSW: They improve model performance. 20 | #| 21 | #| 22 | ### Create the modeler 23 | 24 | # Import LogisticRegression # Estimator 25 | from pyspark.ml.classification import LogisticRegression 26 | 27 | # Create a LogisticRegression Estimator 28 | lr = LogisticRegression() 29 | 30 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/II.
Create the Evaluator.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ Cross validation / 3 | 4 | estimates model's error on held out data 5 | 6 | -> k-fold : method of estimating the model's performance on unseen data 7 | 8 | - split data in few partitions (3) 9 | - partitions is set aside, and the model is fit to the others 10 | - error is measured against the held out partition 11 | - This is repeated for each of the partitions 12 | - Then the error is averaged.""" 13 | #| 14 | #| 15 | ### 16 | """ What does cross validation allow you to estimate?""" 17 | # ANSW: The model's error on held out data. 18 | #| 19 | #| 20 | ### Create the evaluator 21 | # Import the evaluation submodule 22 | import pyspark.ml.evaluation as evals 23 | 24 | # Create a BinaryClassificationEvaluator 25 | evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC") 26 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/III. Make a grid.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ make a grid / 3 | 4 | - Import the submodule pyspark.ml.tuning under the alias tune. 5 | - Call the class constructor ParamGridBuilder() with no arguments. Save this as grid. 6 | - Call the .addGrid() method on grid with lr.regParam as the first argument and np.arange(0, .1, .01) as the second argument. This second call is a function from the numpy module (imported as np) that creates a list of numbers from 0 to .1, incrementing by .01. Overwrite grid with the result. 7 | - Update grid again by calling the .addGrid() method a second time create a grid for lr.elasticNetParam that includes only the values [0, 1]. 8 | - Call the .build() method on grid and overwrite it with the output. 9 | """ 10 | 11 | # Import the tuning submodule 12 | import pyspark.ml.tuning as tune 13 | 14 | # Create the parameter grid 15 | grid = tune.ParamGridBuilder() 16 | 17 | # Add the hyperparameter 18 | grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01)) 19 | grid = grid.addGrid(lr.elasticNetParam,[0,1]) 20 | 21 | # Build the grid 22 | grid = grid.build() 23 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/IV. Make the validator.py: -------------------------------------------------------------------------------- 1 | # Create the CrossValidator 2 | cv = tune.CrossValidator(estimator=lr, 3 | estimatorParamMaps=grid, 4 | evaluator=evaluator 5 | ) 6 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/V. Fit the model(s).py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Fit the model(s) / 4 | 5 | You're finally ready to fit the models and select the best one! 6 | 7 | Unfortunately, cross validation is a very computationally intensive procedure. Fitting all the models would take too long on DataCamp. 8 | 9 | To do this locally you would use the code: 10 | 11 | # Fit cross validation models 12 | models = cv.fit(training) 13 | 14 | # Extract the best model 15 | best_lr = models.bestModel 16 | Remember, the training data is called training and you're using lr to fit a logistic regression model. Cross validation selected the parameter values regParam=0 and elasticNetParam=0 as being the best. 
These are the default values, so you don't need to do anything else with lr before fitting the model. 17 | 18 | Instructions 19 | 100 XP 20 | 21 | - Create best_lr by calling lr.fit() on the training data. 22 | - Print best_lr to verify that it's an object of the LogisticRegressionModel class. 23 | 24 | """ 25 | # Call lr.fit() 26 | best_lr = lr.fit(training) 27 | 28 | # Print best_lr 29 | print(best_lr) 30 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/VI. Evaluating binary classifiers.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Evaluating binary classifiers / 4 | 5 | For this course we'll be using a common metric for binary classification algorithms call the 6 | 7 | -> AUC (or area under the curve) : the closer the AUC is to one (1), the better the model is! 8 | 9 | """ 10 | 11 | '''If you've created a perfect binary classification model, what would the AUC be?''' 12 | # ANSW: 1 13 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/VII. Evaluate the model.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ Evaluate the model / 3 | Remember the test data that you set aside waaaaaay back in chapter 3? It's finally time to test your model on it! You can use the same evaluator you made to fit the 4 | model. 5 | 6 | Instructions 7 | 100 XP 8 | - Use your model to generate predictions by applying best_lr.transform() to the test data. Save this as test_results. 9 | - Call evaluator.evaluate() on test_results to compute the AUC. Print the output. 10 | 11 | """ 12 | # Use the model to predict the test set 13 | test_results = best_lr.transform(test) 14 | 15 | # Evaluate the predictions 16 | print(evaluator.evaluate(test_results)) 17 | 18 | # 0.7 19 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/1. Attributes of relational databases.py: -------------------------------------------------------------------------------- 1 | """ 2 | Attributes of relational databases 3 | In the video, we talked about some basic facts about relational databases. Which of the following statements does not hold true for databases? Relational databases … 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | … store different real-world entities in different tables. 10 | … allow to establish relationships between entities. 11 | ok … are called "relational" because they store data only about people. 12 | … use constraints, keys and referential integrity in order to assure data quality. 13 | 14 | """ 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/2. Query information_schema with SELECT.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Query information_schema with SELECT 3 | information_schema is a meta-database that holds information about your current database. 
information_schema has multiple tables you can query with the known SELECT * FROM syntax: 4 | 5 | tables: information about all tables in your current database 6 | columns: information about all columns in all of the tables in your current database 7 | … 8 | In this exercise, you'll only need information from the 'public' schema, which is specified as the column table_schema of the tables and columns tables. The 'public' schema holds information about user-defined tables and databases. The other types of table_schema hold system information – for this course, you're only interested in user-defined stuff. 9 | 10 | Instructions 4/4 11 | 25 XP 12 | - Get information on all table names in the current database, while limiting your query to the 'public' table_schema. 13 | - Now have a look at the columns in university_professors by selecting all entries in information_schema.columns that correspond to that table. 14 | - How many columns does the table university_professors have? 15 | - Finally, print the first five rows of the university_professors table. 16 | ''' 17 | #1 18 | #-- Query the right table in information_schema 19 | SELECT table_name 20 | FROM information_schema.tables 21 | #-- Specify the correct table_schema value 22 | WHERE table_schema = 'public'; 23 | 24 | #2 25 | #-- Query the right table in information_schema to get columns 26 | SELECT column_name, data_type 27 | FROM information_schema.columns 28 | WHERE table_name = 'university_professors' AND table_schema = 'public'; 29 | 30 | #3 31 | #How many columns does the table university_professors have? 32 | #8 33 | 34 | # 4 35 | -- Query the first five rows of our table 36 | SELECT * 37 | FROM university_professors 38 | LIMIT 5; 39 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/3. CREATE your first few TABLEs.py: -------------------------------------------------------------------------------- 1 | """ 2 | CREATE your first few TABLEs 3 | You'll now start implementing a better database model. For this, you'll create tables for the professors and universities entity types. The other tables will be created for you. 4 | 5 | The syntax for creating simple tables is as follows: 6 | 7 | CREATE TABLE table_name ( 8 | column_a data_type, 9 | column_b data_type, 10 | column_c data_type 11 | ); 12 | Attention: Table and columns names, as well as data types, don't need to be surrounded by quotation marks. 13 | 14 | Instructions 1/2 15 | - Create a table professors with two text columns: firstname and lastname. 16 | - Create a table universities with three text columns: university_shortname, university, and university_city. 17 | 18 | """ 19 | -- Create a table for the professors entity type 20 | CREATE TABLE professors ( 21 | firstname text, 22 | lastname text 23 | ); 24 | 25 | -- Print the contents of this table 26 | SELECT * 27 | FROM professors 28 | 29 | #2 30 | -- Create a table for the universities entity type 31 | CREATE TABLE universities ( 32 | university_shortname text, 33 | university text, 34 | university_city text 35 | ); 36 | -- Print the contents of this table 37 | SELECT * 38 | FROM universities 39 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/4. ADD a COLUMN with ALTER TABLE.py: -------------------------------------------------------------------------------- 1 | """ 2 | ADD a COLUMN with ALTER TABLE 3 | Oops! 
We forgot to add the university_shortname column to the professors table. You've probably already noticed: 4 | 5 | In chapter 4 of this course, you'll need this column for connecting the professors table with the universities table. 6 | 7 | However, adding columns to existing tables is easy, especially if they're still empty. 8 | 9 | To add columns you can use the following SQL query: 10 | 11 | ALTER TABLE table_name 12 | ADD COLUMN column_name data_type; 13 | Instructions 14 | 100 XP 15 | - Alter professors to add the text column university_shortname. 16 | 17 | """ 18 | -- Add the university_shortname column 19 | ALTER TABLE professors 20 | ADD COLUMN university_shortname text; 21 | 22 | -- Print the contents of this table 23 | SELECT * 24 | FROM professors 25 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/5. RENAME and DROP COLUMNs in affiliations.py: -------------------------------------------------------------------------------- 1 | """ 2 | RENAME and DROP COLUMNs in affiliations 3 | As mentioned in the video, the still empty affiliations table has some flaws. In this exercise, you'll correct them as outlined in the video. 4 | 5 | You'll use the following queries: 6 | 7 | To rename columns: 8 | ALTER TABLE table_name 9 | RENAME COLUMN old_name TO new_name; 10 | To delete columns: 11 | ALTER TABLE table_name 12 | DROP COLUMN column_name; 13 | 14 | Instructions 2/2 15 | - Rename the organisation column to organization in affiliations 16 | - Delete the university_shortname column in affiliations. 17 | 18 | """ 19 | #-- Rename the organisation column 20 | ALTER TABLE affiliations 21 | RENAME COLUMN organisation TO organization; 22 | 23 | #2 24 | #-- Delete the university_shortname column 25 | ALTER TABLE affiliations 26 | DROP COLUMN university_shortname; 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/6. Migrate data with INSERT INTO SELECT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Migrate data with INSERT INTO SELECT DISTINCT 3 | Now it's finally time to migrate the data into the new tables. You'll use the following pattern: 4 | 5 | INSERT INTO ... 6 | SELECT DISTINCT ... 7 | FROM ...; 8 | It can be broken up into two parts: 9 | 10 | First part: 11 | 12 | SELECT DISTINCT column_name1, column_name2, ... 13 | FROM table_a; 14 | This selects all distinct values in table table_a – nothing new for you. 15 | 16 | Second part: 17 | 18 | INSERT INTO table_b ...; 19 | Take this part and append it to the first, so it inserts all distinct rows from table_a into table_b. 20 | 21 | One last thing: It is important that you run all of the code at the same time once you have filled out the blanks. 22 | 23 | Instructions 2/2 24 | 25 | - Insert all DISTINCT professors from university_professors into professors. 26 | - Print all the rows in professors. 27 | 28 | - Insert all DISTINCT affiliations into affiliations from university_professors. 
29 | """ 30 | #-- Insert unique professors into the new table 31 | INSERT INTO professors 32 | SELECT DISTINCT firstname, lastname, university_shortname 33 | FROM university_professors; 34 | 35 | #-- Doublecheck the contents of professors 36 | SELECT * 37 | FROM professors; 38 | 39 | #2 40 | #-- Insert unique affiliations into the new table 41 | INSERT INTO affiliations 42 | SELECT DISTINCT firstname, lastname, function, organization 43 | FROM university_professors; 44 | 45 | #-- Doublecheck the contents of affiliations 46 | SELECT * 47 | FROM affiliations; 48 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/7. Delete tables with DROP TABLE.py: -------------------------------------------------------------------------------- 1 | """ 2 | Delete tables with DROP TABLE 3 | Obviously, the university_professors table is now no longer needed and can safely be deleted. 4 | 5 | For table deletion, you can use the simple command: 6 | 7 | DROP TABLE table_name; 8 | Instructions 9 | 100 XP 10 | Delete the university_professors table. 11 | 12 | """ 13 | #-- Delete the university_professors table 14 | DROP TABLE university_professors; 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/0. query table data types INFORMATION_SCHEMA.txt: -------------------------------------------------------------------------------- 1 | > INFORMATION_SCHEMA 2 | 3 | | DATATYPES from one table: | 4 | 5 | SELECT * FROM INFORMATION_SCHEMA.COLUMNS 6 | WHERE TABLE_NAME = 'Address'; 7 | 8 | | DATATYPES from all tables | 9 | 10 | SELECT * FROM INFORMATION_SCHEMA.COLUMNS; 11 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/1.Better data quality with constraints-changing datatypes CAST.txt: -------------------------------------------------------------------------------- 1 | 1. attribute contraints : datatypes on columns 2 | 2. key contraints : primary keys 3 | 3. Referential integrity contraints : foreign keys 4 | 5 | WHY contraints ? 6 | ----------------- 7 | 8 | . give data struture 9 | . help w/ consistency and data quality 10 | . data quality : business advantage/ data science prerequisite 11 | . PostgreSQL helps w/ enforcing 12 | 13 | | changing dataTypes | 14 | 15 | # change wind_speed from 'text' to 'integer' 16 | SELECT temperature * CAST(wind_speed AS integer) 17 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/10. .py: -------------------------------------------------------------------------------- 1 | """ 2 | Make your columns UNIQUE with ADD CONSTRAINT 3 | As seen in the video, you add the UNIQUE keyword after the column_name that should be unique. This, of course, only works for new tables: 4 | 5 | CREATE TABLE table_name ( 6 | column_name UNIQUE 7 | ); 8 | If you want to add a unique constraint to an existing table, you do it like that: 9 | 10 | ALTER TABLE table_name 11 | ADD CONSTRAINT some_name UNIQUE(column_name); 12 | Note that this is different from the ALTER COLUMN syntax for the not-null constraint. Also, you have to give the constraint a name some_name. 
13 | 14 | Instructions 2/2 15 | - Add a unique constraint to the university_shortname column in universities. Give it the name university_shortname_unq. 16 | - Add a unique constraint to the organization column in organizations. Give it the name organization_unq. 17 | 18 | """ 19 | #-- Make universities.university_shortname unique 20 | ALTER TABLE universities 21 | ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname); 22 | 23 | #-- Make organizations.organization unique 24 | ALTER TABLE organizations 25 | ADD CONSTRAINT organization_unq UNIQUE(organization); 26 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/10. What happens if you try to enter NULLs?.py: -------------------------------------------------------------------------------- 1 | """ 2 | What happens if you try to enter NULLs? 3 | Execute the following statement: 4 | 5 | INSERT INTO professors (firstname, lastname, university_shortname) 6 | VALUES (NULL, 'Miller', 'ETH'); 7 | - Why does this throw an error? 8 | . Possible Answers 9 | 10 | Professors without first names do not exist 11 | OK Because a database constraint is violated. 12 | Error? This works just fine. 13 | NULL is not put in quotes. 14 | 15 | """ 16 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/2. Types of database constraints.py: -------------------------------------------------------------------------------- 1 | """ 2 | Types of database constraints 3 | Which of the following is not used to enforce a database constraint? 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | Foreign keys 10 | OK SQL aggregate functions # only does calculations 11 | The BIGINT data type 12 | Primary keys 13 | 14 | """ 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/3. Conforming with data types.py: -------------------------------------------------------------------------------- 1 | """ 2 | Conforming with data types 3 | For demonstration purposes, I created a fictional database table that only holds three records. The columns have the data types date, integer, and text, respectively. 4 | 5 | CREATE TABLE transactions ( 6 | transaction_date date, 7 | amount integer, 8 | fee text 9 | ); 10 | Have a look at the contents of the transactions table. 11 | 12 | The transaction_date accepts date values. According to the PostgreSQL documentation, it accepts values in the form of YYYY-MM-DD, DD/MM/YY, and so forth. 13 | 14 | Both columns amount and fee appear to be numeric, however, the latter is modeled as text – which you will account for in the next exercise. 15 | 16 | Instructions 17 | 100 XP 18 | Execute the given sample code. 19 | As it doesn't work, have a look at the error message and correct the statement accordingly – then execute it again. 20 | 21 | """ 22 | #-- Let's add a record to the table 23 | INSERT INTO transactions (transaction_date, amount, fee) 24 | VALUES ('2018-09-24', 5454, '30'); 25 | 26 | #-- Doublecheck the contents 27 | SELECT * 28 | FROM transactions; 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/4. 
Type CASTs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Type CASTs 3 | In the video, you saw that type casts are a possible solution for data type issues. If you know that a certain column stores numbers as text, you can cast the column to a numeric form, i.e. to integer. 4 | 5 | SELECT CAST(some_column AS integer) 6 | FROM table; 7 | Now, the some_column column is temporarily represented as integer instead of text, meaning that you can perform numeric calculations on the column. 8 | 9 | Instructions 10 | 100 XP 11 | - Execute the given sample code. 12 | - As it doesn't work, add an integer type cast at the right place and execute it again. 13 | 14 | """ 15 | #-- Calculate the net amount as amount + fee 16 | #-- fee wasn't integer 17 | #-- saw that w/ SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'transactions'; 18 | SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 19 | FROM transactions; 20 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/5. Working with data types.txt: -------------------------------------------------------------------------------- 1 | | most common datatypes | 2 | 3 | . text 4 | . varchar(x) : variable-length string of up to x characters 5 | . char(x) : fixed-length string of exactly x characters 6 | . boolean : TRUE or 1, FALSE or 0, NULL 7 | 8 | - date, time, timestamp: date-time calculations 9 | - numeric : 3.134 10 | - integer: 3 -3 11 | 12 | | specifying numeric eg: | 13 | 14 | grades numeric(3, 2) # 3 digits in total, 2 of those after the decimal point # eg: 5.43 15 | 16 | | ALTER table and column after creation eg: | 17 | 18 | ALTER TABLE students 19 | ALTER COLUMN name 20 | TYPE varchar(50); 21 | 22 | | ALTER turning into INT and rounding eg: | 23 | 24 | ALTER TABLE students 25 | ALTER COLUMN average_grade 26 | TYPE integer 27 | USING ROUND(average_grade) 28 | # from 5.54 to 6 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/6. Change types with ALTER COLUMN.py: -------------------------------------------------------------------------------- 1 | """ 2 | Change types with ALTER COLUMN 3 | The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name column in table_name to varchar(10): 4 | 5 | ALTER TABLE table_name 6 | ALTER COLUMN column_name 7 | TYPE varchar(10) 8 | Now it's time to start adding constraints to your database. 9 | 10 | Instructions 3/3 11 | - Have a look at the distinct university_shortname values in the professors table and take note of the length of the strings. 12 | - Now specify a fixed-length character type with the correct length for university_shortname. 13 | - Change the type of the firstname column to varchar(64). 14 | 15 | """ 16 | #-- Select the university_shortname column 17 | SELECT DISTINCT(university_shortname) 18 | FROM professors; 19 | 20 | #-- Specify the correct fixed-length character type 21 | ALTER TABLE professors 22 | ALTER COLUMN university_shortname 23 | TYPE char(3); 24 | 25 | #-- Change the type of firstname 26 | ALTER TABLE professors 27 | ALTER COLUMN firstname 28 | TYPE varchar(64); 29 | 30 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2.
Enforce data consistency with attribute constraints/7. Convert types USING a function.py: -------------------------------------------------------------------------------- 1 | """ 2 | Convert types USING a function 3 | If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type. 4 | 5 | For this, you can use the following syntax: 6 | 7 | ALTER TABLE table_name 8 | ALTER COLUMN column_name 9 | TYPE varchar(x) 10 | USING SUBSTRING(column_name FROM 1 FOR x) 11 | You should read it like this: Because you want to reserve only x characters for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters of it, and throw away the rest. This way, the values will fit the varchar(x) requirement. 12 | 13 | Instructions 14 | 100 XP 15 | - Run the sample code as is and take note of the error. 16 | - Now use SUBSTRING() to reduce firstname to 16 characters so its type can be altered to varchar(16). 17 | 18 | """ 19 | #-- Convert the values in firstname to a max. of 16 characters 20 | ALTER TABLE professors 21 | ALTER COLUMN firstname 22 | TYPE varchar(16) 23 | USING SUBSTRING(firstname FROM 1 FOR 16) #--max of 16 chars 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/8. The not-null and unique constraints.txt: -------------------------------------------------------------------------------- 1 | -> NOT NULL : not empty (obligatory) 2 | 3 | | set not null after creation | 4 | 5 | ALTER TABLE students 6 | ALTER COLUMN phone 7 | SET NOT NULL; 8 | 9 | | ELiminate NOT NULL after creation | 10 | 11 | ALTER TABLE students 12 | ALTER COLUMN ssn 13 | DROP NOT NULL; 14 | 15 | -> UNIQUE 16 | 17 | | SET unique when creating | 18 | 19 | CREATE TABLE table_name( 20 | column_name UNIQUE); 21 | 22 | | SET UNIQUE AFTER CREATION | 23 | 24 | ALTER TABLE table_name 25 | ADD CONSTRAINT some_name UNIQUE(column_name); 26 | 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/9. Disallow NULL values with SET NOT NULL.py: -------------------------------------------------------------------------------- 1 | """ 2 | Disallow NULL values with SET NOT NULL 3 | The professors table is almost ready now. However, it still allows for NULLs to be entered. Although some information might be missing about some professors, there's certainly columns that always need to be specified. 4 | 5 | Instructions 2/2 6 | 7 | - Add a not-null constraint for the firstname column. 8 | - Add a not-null constraint for the lastname column. 9 | 10 | """ 11 | #-- Disallow NULL values in firstname 12 | ALTER TABLE professors 13 | ALTER COLUMN firstname SET NOT NULL; 14 | 15 | #-- Disallow NULL values in lastname 16 | ALTER TABLE professors 17 | ALTER COLUMN lastname SET NOT NULL 18 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/1. Get to know SELECT COUNT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Get to know SELECT COUNT DISTINCT 3 | Your database doesn't have any defined keys so far, and you don't know which columns or combinations of columns are suited as keys. 
4 | 5 | There's a simple way of finding out whether a certain column (or a combination) contains only unique values – and thus identifies the records in the table. 6 | 7 | You already know the SELECT DISTINCT query from the first chapter. Now you just have to wrap everything within the COUNT() function and PostgreSQL will return the number of unique rows for the given columns: 8 | 9 | SELECT COUNT(DISTINCT(column_a, column_b, ...)) 10 | FROM table; 11 | Instructions 2/2 12 | - First, find out the number of rows in universities. 13 | - Then, find out how many unique values there are in the university_city column. 14 | 15 | """ 16 | #-- Count the number of rows in universities 17 | SELECT COUNT(*) 18 | FROM universities; 19 | 20 | #-- Count the number of distinct values in the university_city column 21 | SELECT COUNT(DISTINCT(university_city)) 22 | FROM universities; 23 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/2. Identify keys with SELECT COUNT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Identify keys with SELECT COUNT DISTINCT 3 | There's a very basic way of finding out what qualifies for a key in an existing, populated table: 4 | 5 | Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey. 6 | 7 | Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key. 8 | 9 | The table professors has 551 rows. It has only one possible candidate key, which is a combination of two attributes. You might want to try different combinations using the "Run code" button. Once you have found the solution, you can submit your answer. 10 | 11 | Instructions 12 | 100 XP 13 | - Using the above steps, identify the candidate key by trying out different combination of columns. 14 | 15 | """ 16 | #-- Try out different combinations 17 | #-- Identify keys w/ SELECT COUNT DISTINCT 18 | SELECT COUNT(DISTINCT(firstname, lastname)) # SUPER-KEY 19 | FROM professors; 20 | ''' 21 | count 22 | 551 23 | ''' 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/3. Identify the primary key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Identify the primary key 3 | Have a look at the example table from the previous video. As the database designer, you have to make a wise choice as to which column should be the primary key. 4 | 5 | license_no | serial_no | make | model | year 6 | --------------------+-----------+------------+---------+------ 7 | Texas ABC-739 | A69352 | Ford | Mustang | 2 8 | Florida TVP-347 | B43696 | Oldsmobile | Cutlass | 5 9 | New York MPO-22 | X83554 | Oldsmobile | Delta | 1 10 | California 432-TFY | C43742 | Mercedes | 190-D | 99 11 | California RSK-629 | Y82935 | Toyota | Camry | 4 12 | Texas RSK-629 | U028365 | Jaguar | XJS | 4 13 | - Which of the following column or column combinations could best serve as primary key? 
14 | 15 | Answer the question 16 | 50XP 17 | Possible Answers 18 | 19 | PK = {make} 20 | PK = {model, year} 21 | OK PK = {license_no} 22 | PK = {year, make} 23 | 24 | """ 25 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/4. ADD key CONSTRAINTs to the tables.py: -------------------------------------------------------------------------------- 1 | """ 2 | ADD key CONSTRAINTs to the tables 3 | Two of the tables in your database already have well-suited candidate keys consisting of one column each: organizations and universities with the organization and university_shortname columns, respectively. 4 | 5 | In this exercise, you'll rename these columns to id using the RENAME COLUMN command and then specify primary key constraints for them. This is as straightforward as adding unique constraints (see the last exercise of Chapter 2): 6 | 7 | ALTER TABLE table_name 8 | ADD CONSTRAINT some_name PRIMARY KEY (column_name) 9 | Note that you can also specify more than one column in the brackets. 10 | 11 | Instructions 1/2 12 | 13 | 1 14 | - Rename the organization column to id in organizations. 15 | - Make id a primary key and name it organization_pk. 16 | 2 17 | - Rename the university_shortname column to id in universities. 18 | - Make id a primary key and name it university_pk. 19 | 20 | """ 21 | #-- Rename the organization column to id 22 | ALTER TABLE organizations 23 | RENAME COLUMN organization TO id; 24 | 25 | #-- Make id a primary key 26 | ALTER TABLE organizations 27 | ADD CONSTRAINT organization_pk PRIMARY KEY (id); 28 | 29 | #2 30 | #-- Rename the university_shortname column to id 31 | ALTER TABLE universities 32 | RENAME COLUMN university_shortname TO id; 33 | 34 | #-- Make id a primary key 35 | ALTER TABLE universities 36 | ADD CONSTRAINT university_pk PRIMARY KEY (id); 37 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/5. Add a SERIAL surrogate key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Add a SERIAL surrogate key 3 | Since there's no single-column candidate key in professors (only a composite key candidate consisting of firstname, lastname), you'll add a new column id to that table. 4 | 5 | This column has a special data type serial, which turns the column into an auto-incrementing number. This means that, whenever you add a new professor to the table, it will automatically get an id that does not exist yet in the table: a perfect primary key! 6 | 7 | Instructions 3/3 8 | 9 | 1- Add a new column id with data type serial to the professors table. 10 | 2- Make id a primary key and name it professors_pkey. 11 | 3- Have a look at the first 10 rows of professors. 12 | 13 | """ 14 | #1 15 | #-- Add the new column to the table 16 | ALTER TABLE professors 17 | ADD COLUMN id serial; #-- serial auto-increments, so each new row gets a fresh id 18 | 19 | #2 20 | #-- Make id a primary key 21 | ALTER TABLE professors 22 | ADD CONSTRAINT professors_pkey PRIMARY KEY (id); 23 | 24 | #3 25 | #-- Have a look at the first 10 rows of professors 26 | SELECT * 27 | FROM professors 28 | LIMIT 10; 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/6.
CONCATenate columns to a surrogate key.py: -------------------------------------------------------------------------------- 1 | """ 2 | CONCATenate columns to a surrogate key 3 | Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT() function. 4 | 5 | Let's think of the following example table: 6 | 7 | CREATE TABLE cars ( 8 | make varchar(64) NOT NULL, 9 | model varchar(64) NOT NULL, 10 | mpg integer NOT NULL 11 | ) 12 | The table is populated with 10 rows of completely fictional data. 13 | 14 | Unfortunately, the table doesn't have a primary key yet. None of the columns consists of only unique values, so some columns can be combined to form a key. 15 | 16 | In the course of the following exercises, you will combine make and model into such a surrogate key. 17 | 18 | Instructions 4/4 19 | 20 | - Count the number of distinct rows with a combination of the make and model columns. 21 | - Add a new column id with the data type varchar(128). 22 | - Concatenate make and model into id using an UPDATE table_name SET column_name = ... query and the CONCAT() function. 23 | - Make id a primary key and name it id_pk. 24 | 25 | """ 26 | #1 27 | #-- Count the number of distinct rows with columns make, model 28 | SELECT COUNT(DISTINCT(make, model)) 29 | FROM cars; 30 | 31 | #2 32 | #-- Add the id column 33 | ALTER TABLE cars 34 | ADD COLUMN id varchar(128); 35 | 36 | #3 37 | #-- Update id with make + model 38 | -- UPDATE, SET x = 39 | UPDATE cars 40 | SET id = CONCAT(make, model); 41 | 42 | #4 43 | #-- Make id a primary key 44 | ALTER TABLE cars 45 | ADD CONSTRAINT id_pk PRIMARY KEY(id); 46 | 47 | #-- Have a look at the table 48 | SELECT * FROM cars; 49 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/7. Test your knowledge before advancing.py: -------------------------------------------------------------------------------- 1 | """ 2 | Test your knowledge before advancing 3 | Before you move on to the next chapter, let's quickly review what you've learned so far about attributes and key constraints. If you're unsure about the answer, please quickly review chapters 2 and 3, respectively. 4 | 5 | Let's think of an entity type "student". A student has: 6 | 7 | . a last name consisting of up to 128 characters (required), 8 | . a unique social security number, consisting only of integers, that should serve as a key, 9 | . a phone number of fixed length 12, consisting of numbers and characters (but some students don't have one). 10 | 11 | Instructions 12 | 100 XP 13 | - Given the above description of a student entity, create a table students with the correct column types. 14 | - Add a PRIMARY KEY for the social security number ssn. 15 | - Note that there is no formal length requirement for the integer column. The application would have to make sure it's a correct SSN! 16 | 17 | """ 18 | #-- Create the table 19 | CREATE TABLE students ( 20 | last_name varchar(128) NOT NULL, 21 | ssn integer PRIMARY KEY, 22 | phone_no char(12) 23 | ); 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/1. 
REFERENCE a table with a FOREIGN KEY.py: -------------------------------------------------------------------------------- 1 | """ 2 | REFERENCE a table with a FOREIGN KEY 3 | In your database, you want the professors table to reference the universities table. You can do that by specifying a column in professors table that references a column in the universities table. 4 | 5 | As just shown in the video, the syntax for that looks like this: 6 | 7 | ALTER TABLE a 8 | ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id); 9 | Table a should now refer to table b, via b_id, which points to id. a_fkey is, as usual, a constraint name you can choose on your own. 10 | 11 | Pay attention to the naming convention employed here: Usually, a foreign key referencing another primary key with name id is named x_id, where x is the name of the referencing table in the singular form. 12 | 13 | Instructions 2/2 14 | 1- Rename the university_shortname column to university_id in professors. 15 | 2- Add a foreign key on university_id column in professors that references the id column in universities. 16 | 2- Name this foreign key professors_fkey. 17 | 18 | """ 19 | #1 20 | #-- Rename the university_shortname column 21 | ALTER TABLE professors 22 | RENAME COLUMN university_shortname TO university_id; 23 | 24 | #2 25 | #-- Add a foreign key on professors referencing universities 26 | ALTER TABLE professors 27 | ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id); 28 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/10. Change the referential integrity behavior of a key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Change the referential integrity behavior of a key 3 | So far, you implemented three foreign key constraints: 4 | 5 | professors.university_id to universities.id 6 | affiliations.organization_id to organizations.id 7 | affiliations.professor_id to professors.id 8 | These foreign keys currently have the behavior ON DELETE NO ACTION. Here, you're going to change that behavior for the column referencing organizations from affiliations. If an organization is deleted, all its affiliations (by any professor) should also be deleted. 9 | 10 | Altering a key constraint doesn't work with ALTER COLUMN. Instead, you have to DROP the key constraint and then ADD a new one with a different ON DELETE behavior. 11 | 12 | For deleting constraints, though, you need to know their name. This information is also stored in information_schema. 13 | 14 | Instructions 4/4 15 | - Have a look at the existing foreign key constraints by querying table_constraints in information_schema. 16 | - Delete the affiliations_organization_id_fkey foreign key constraint in affiliations. 17 | - Add a new foreign key to affiliations that CASCADEs deletion if a referenced record is deleted from organizations. Name it affiliations_organization_id_fkey. 18 | - Run the DELETE and SELECT queries to double check that the deletion cascade actually works. 
19 | 20 | """ 21 | #--1 22 | #-- Identify the correct constraint name 23 | SELECT constraint_name, table_name, constraint_type 24 | FROM information_schema.table_constraints 25 | WHERE constraint_type = 'FOREIGN KEY'; 26 | 27 | #--2 28 | #-- Drop the right foreign key constraint 29 | ALTER TABLE affiliations 30 | DROP CONSTRAINT affiliations_organization_id_fkey; 31 | 32 | #-- Add a new foreign key constraint from affiliations to organizations which cascades deletion 33 | ALTER TABLE affiliations 34 | ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE; 35 | #--. CASCADE : delete all refencing records 36 | 37 | #--3 38 | #-- Delete an organization 39 | DELETE FROM organizations 40 | WHERE id = 'CUREM'; 41 | 42 | #--4 43 | #-- Check that no more affiliations with this organization exist 44 | SELECT * FROM affiliations 45 | WHERE organization_id = 'CUREM'; 46 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/11. Count affiliations per university.py: -------------------------------------------------------------------------------- 1 | """ 2 | Count affiliations per university 3 | Now that your data is ready for analysis, let's run some exemplary SQL queries on the database. You'll now use already known concepts such as grouping by columns and joining tables. 4 | 5 | In this exercise, you will find out which university has the most affiliations (through its professors). For that, you need both affiliations and professors tables, as the latter also holds the university_id. 6 | 7 | As a quick repetition, remember that joins have the following structure: 8 | 9 | SELECT table_a.column1, table_a.column2, table_b.column1, ... 10 | FROM table_a 11 | JOIN table_b 12 | ON table_a.column = table_b.column 13 | This results in a combination of table_a and table_b, but only with rows where table_a.column is equal to table_b.column. 14 | 15 | Instructions 16 | 100 XP 17 | - Count the number of total affiliations by university. 18 | - Sort the result by that count, in descending order. 19 | 20 | """ 21 | #-- Count the total number of affiliations per university 22 | SELECT COUNT(*), professors.university_id 23 | FROM affiliations 24 | JOIN professors 25 | ON affiliations.professor_id = professors.id 26 | #-- Group by the ids of professors 27 | GROUP BY professors.university_id 28 | ORDER BY count DESC 29 | 30 | '''count university_id 31 | 579 EPF 32 | 273 USG...''' 33 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/12. .py: -------------------------------------------------------------------------------- 1 | """ 2 | Join all the tables together 3 | In this last exercise, your task is to find the university city of the professor with the most affiliations in the sector "Media & communication". 4 | 5 | For this, 6 | 7 | you need to join all the tables, 8 | group by some column, 9 | and then use selection criteria to get only the rows in the correct sector. 10 | Let's do this in three steps! 11 | 12 | Instructions 3/3 13 | 14 | - Join all tables in the database (starting with affiliations, professors, organizations, and universities) and look at the result. 15 | - Now group the result by organization sector, professor, and university city. 16 | Count the resulting number of rows. 
17 | - Only retain rows with "Media & communication" as organization sector, and sort the table by count, in descending order. 18 | 19 | """ 20 | #1 21 | #-- Join all tables 22 | SELECT * 23 | FROM affiliations 24 | JOIN professors 25 | ON affiliations.professor_id = professors.id 26 | JOIN organizations 27 | ON affiliations.organization_id = organizations.id 28 | JOIN universities 29 | ON professors.university_id = universities.id; 30 | 31 | #2 32 | #-- Group the table by organization sector, professor ID and university city 33 | SELECT COUNT(*), organizations.organization_sector, 34 | professors.id, universities.university_city 35 | FROM affiliations 36 | JOIN professors 37 | ON affiliations.professor_id = professors.id 38 | JOIN organizations 39 | ON affiliations.organization_id = organizations.id 40 | JOIN universities 41 | ON professors.university_id = universities.id 42 | GROUP BY organizations.organization_sector, 43 | professors.id, universities.university_city; 44 | 45 | #3 46 | #-- Filter the table and sort it 47 | SELECT COUNT(*), organizations.organization_sector, 48 | professors.id, universities.university_city 49 | FROM affiliations 50 | JOIN professors 51 | ON affiliations.professor_id = professors.id 52 | JOIN organizations 53 | ON affiliations.organization_id = organizations.id 54 | JOIN universities 55 | ON professors.university_id = universities.id 56 | WHERE organizations.organization_sector = 'Media & communication' 57 | GROUP BY organizations.organization_sector, 58 | professors.id, universities.university_city 59 | ORDER BY count DESC; 60 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/2. Explore foreign key constraints.py: -------------------------------------------------------------------------------- 1 | """ 2 | Explore foreign key constraints 3 | Foreign key constraints help you to keep order in your database mini-world. In your database, for instance, only professors belonging to Swiss universities should 4 | be allowed, as only Swiss universities are part of the universities table. 5 | 6 | The foreign key on professors referencing universities you just created thus makes sure that only existing universities can be specified when inserting new data. 7 | Let's test this! 8 | 9 | Instructions 10 | 100 XP 11 | - Run the sample code and have a look at the error message. 12 | - What's wrong? Correct the university_id so that it actually reflects where Albert Einstein wrote his dissertation and became a professor 13 | – at the University of Zurich (UZH)! 14 | 15 | """ 16 | #-- Try to insert a new professor 17 | INSERT INTO professors (firstname, lastname, university_id) 18 | VALUES ('Albert', 'Einstein', 'UZH'); 19 | # inserting a professor with non-existing university IDs violates the foreign key constraint 20 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/3. JOIN tables linked by a foreign key.py: -------------------------------------------------------------------------------- 1 | """ 2 | JOIN tables linked by a foreign key 3 | Let's join these two tables to analyze the data further! 4 | 5 | You might already know how SQL joins work from the Intro to SQL for Data Science course (last exercise) or from Joining Data in PostgreSQL. 6 | 7 | Here's a quick recap on how joins generally work: 8 | 9 | SELECT ... 
10 | FROM table_a 11 | JOIN table_b 12 | ON ... 13 | WHERE ... 14 | While foreign keys and primary keys are not strictly necessary for join queries, they greatly help by telling you what to expect. For instance, you can be sure that records referenced from table A will always be present in table B – so a join from table A will always find something in table B. If not, the foreign key constraint would be violated. 15 | 16 | Instructions 17 | 100 XP 18 | - JOIN professors with universities on professors.university_id = universities.id, i.e., retain all records where the foreign key of professors is equal to the primary key of universities. 19 | - Filter for university_city = 'Zurich'. 20 | 21 | """ 22 | #-- Select all professors working for universities in the city of Zurich 23 | SELECT professors.lastname, universities.id, universities.university_city 24 | FROM universities 25 | INNER JOIN professors 26 | ON professors.university_id = universities.id 27 | WHERE universities.university_city = 'Zurich'; 28 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/4. Model more complex relationships.txt: -------------------------------------------------------------------------------- 1 | | N:M relationship | 2 | 3 | . create a table 4 | . add foreign keys for every connected table 5 | . add additional attributes 6 | 7 | > CREATE TABLE affiliations ( 8 | professor_id integer REFERENCES professors (id), 9 | organization_id varchar(256) REFERENCES organizations (id), 10 | function varchar(256) 11 | ); 12 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/5. Add foreign keys to the "affiliations" table.py: -------------------------------------------------------------------------------- 1 | """ 2 | Add foreign keys to the "affiliations" table 3 | At the moment, the affiliations table has the structure {firstname, lastname, function, organization}, as you can see in the preview at the bottom right. In the next three exercises, you're going to turn this table into the form {professor_id, organization_id, function}, with professor_id and organization_id being foreign keys that point to the respective tables. 4 | 5 | You're going to transform the affiliations table in-place, i.e., without creating a temporary table to cache your intermediate results. 6 | 7 | Instructions 3/3 8 | - Add a professor_id column with integer data type to affiliations, and declare it to be a foreign key that references the id column in professors. 9 | - Rename the organization column in affiliations to organization_id. 10 | - Add a foreign key constraint on organization_id so that it references the id column in organizations. 11 | 12 | """ 13 | #1 14 | #-- Add a professor_id column to affiliations 15 | ALTER TABLE affiliations 16 | ADD COLUMN professor_id integer REFERENCES professors (id); 17 | 18 | #2 19 | #-- Rename the organization column to organization_id 20 | ALTER TABLE affiliations 21 | RENAME organization TO organization_id; 22 | 23 | #3 24 | #-- Add a foreign key on organization_id 25 | ALTER TABLE affiliations 26 | ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id); 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4.
Glue together tables with foreign keys/6. Populate the "professor_id" column.py: -------------------------------------------------------------------------------- 1 | """ 2 | Populate the "professor_id" column 3 | Now it's time to also populate professors_id. You'll take the ID directly from professors. 4 | 5 | Here's a way to update columns of a table based on values in another table: 6 | 7 | UPDATE table_a 8 | SET column_to_update = table_b.column_to_update_from 9 | FROM table_b 10 | WHERE condition1 AND condition2 AND ...; 11 | This query does the following: 12 | 13 | For each row in table_a, find the corresponding row in table_b where condition1, condition2, etc., are met. 14 | Set the value of column_to_update to the value of column_to_update_from (from that corresponding row). 15 | The conditions usually compare other columns of both tables, e.g. table_a.some_column = table_b.some_column. Of course, this query only makes sense if there is only one matching row in table_b. 16 | 17 | Instructions 3/3 18 | - First, have a look at the current state of affiliations by fetching 10 rows and all columns. 19 | - Update the professor_id column with the corresponding value of the id column in professors. 20 | "Corresponding" means rows in professors where the firstname and lastname are identical to the ones in affiliations. 21 | - Check out the first 10 rows and all columns of affiliations again. Have the professor_ids been correctly matched? 22 | 23 | """ 24 | #1 25 | #-- Have a look at the 10 first rows of affiliations 26 | SELECT * 27 | FROM affiliations 28 | LIMIT 10; 29 | 30 | #2 31 | #-- Set professor_id to professors.id where firstname, lastname correspond to rows in professors 32 | UPDATE affiliations 33 | SET professor_id = professors.id 34 | FROM professors 35 | WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname; 36 | 37 | #3 38 | #-- Have a look at the 10 first rows of affiliations again 39 | SELECT * 40 | FROM affiliations 41 | LIMIT 10; 42 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/7. Drop "firstname" and "lastname".py: -------------------------------------------------------------------------------- 1 | """ 2 | Drop "firstname" and "lastname" 3 | The firstname and lastname columns of affiliations were used to establish a link to the professors table in the last exercise – so the appropriate professor IDs could be copied over. This only worked because there is exactly one corresponding professor for each row in affiliations. In other words: {firstname, lastname} is a candidate key of professors – a unique combination of columns. 4 | 5 | It isn't one in affiliations though, because, as said in the video, professors can have more than one affiliation. 6 | 7 | Because professors are referenced by professor_id now, the firstname and lastname columns are no longer needed, so it's time to drop them. After all, one of the goals of a database is to reduce redundancy where possible. 8 | 9 | Instructions 10 | 100 XP 11 | - Drop the firstname and lastname columns from the affiliations table. 12 | 13 | """ 14 | #-- Drop the firstname column 15 | ALTER TABLE affiliations 16 | DROP COLUMN firstname; 17 | 18 | #-- Drop the lastname column 19 | ALTER TABLE affiliations 20 | DROP COLUMN lastname; 21 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. 
Glue together tables with foreign keys/8. Referential Integrity ON DELETE.txt: -------------------------------------------------------------------------------- 1 | | Dealing w/ violations | 2 | 3 | > ON DELETE 4 | 5 | . NO ACTION throw an error 6 | . CASCADE : delete all refencing records 7 | . RESTRICT : Throw an error 8 | . SET NULL : set referencing column to null 9 | . SET DEFAULT : set referencing column to default value 10 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/9. Referential integrity violations.py: -------------------------------------------------------------------------------- 1 | """ 2 | Referential integrity violations 3 | Given the current state of your database, what happens if you execute the following SQL statement? 4 | 5 | DELETE FROM universities WHERE id = 'EPF'; 6 | 7 | Instructions 8 | 50 XP 9 | Possible Answers 10 | 11 | It throws an error because the university with ID "EPF" does not exist. 12 | The university with ID "EPF" is deleted. 13 | It fails because referential integrity from universities to professors is violated. 14 | OK It fails because referential integrity from professors to universities is violated. 15 | 16 | """ 17 | SELECT * 18 | FROM universities 19 | WHERE id = 'EPF'; 20 | 21 | DELETE FROM universities WHERE id = 'EPF'; 22 | 23 | # ERROR : DETAIL: Key (id)=(EPF) is still referenced from table "professors". 24 | -------------------------------------------------------------------------------- /Introduction to Shell/I Manipulating files and directories.py: -------------------------------------------------------------------------------- 1 | """ brief introduction to the Unix shell. You'll learn why it is still in use after almost 50 years, how it compares to the graphical tools 2 | you may be more familiar with, how to move around in the shell, and how to create, modify, and delete files and folders. 3 | """ 4 | 5 | 6 | ## How does the shell compare to a desktop interface? 7 | """---What is the relationship between the graphical file explorer that most people use and the command-line shell?""" 8 | # They are both interfaces for issuing commands to the operating system. 9 | 10 | 11 | ## Where am I? 12 | """ >>>>>> $ pwd === print working directory.""" 13 | # /home/repl 14 | 15 | 16 | ## How can I identify files and directories? 17 | """ >>>>>> $ cd === go inside directory.""" 18 | """ >>>>>> $ ls === list the files in the directory.""" 19 | 20 | """---Which of these files is not in that directory?.""" 21 | # $ cd seasonal 22 | # $ ls 23 | # fall.csv 24 | 25 | 26 | ## How else can I identify files and directories? 27 | """ **PATH : If it begins with /, it is absolute. If it does not begin with /, it is relative.""" 28 | # ls course.txt 29 | # $ ls seasonal/summer.csv 30 | 31 | 32 | ## How can I move to another directory? 33 | # cd seasonal 34 | # pwd 35 | # ls 36 | 37 | 38 | ## How can I move up a directory? 39 | """---If you are in /home/repl/seasonal, where does cd ~/../. take you?""" 40 | # /home 41 | 42 | 43 | ## How can I copy files? 
44 | """ >>>>>> cp original.txt duplicate.txt === creates a copy and renames.""" 45 | """copy summer.csv in seasonal, to backup file, as summer.bck""" 46 | # cp original.txt duplicate.txt 47 | 48 | """Copy spring.csv and summer.csv from the seasonal directory into the backup directory without changing your current working directory (/home/repl).""" 49 | # cp seasonal/spring.csv seasonal/summer.csv backup 50 | 51 | 52 | ## How can I move a file? 53 | """ >>>>>> mv === move a file,rename a file. same syntax as cp.""" 54 | """You are in /home/repl, which has sub-directories seasonal and backup. Using a single command, move spring.csv and summer.csv from seasonal to backup.""" 55 | # mv seasonal/spring.csv seasonal/summer.csv backup 56 | 57 | 58 | ## How can I rename files? 59 | """Rename the file winter.csv to be winter.csv.bck.""" 60 | # $ cd seasonal 61 | # $ ls 62 | # autumn.csv spring.csv summer.csv winter.csv 63 | # $ mv winter.csv winter.csv.bck 64 | # $ ls 65 | # autumn.csv spring.csv summer.csv winter.csv.bck 66 | 67 | 68 | ## How can I delete files? 69 | """ >>>>>> rm === remove file through terminal.""" 70 | """Remove autumn.csv from seasonal, then go back and remove autumn.csv from seasonal.""" 71 | # $ cd seasonal 72 | # $ rm autumn.csv 73 | # $ cd .. 74 | # $ rm seasonal/summer.csv 75 | 76 | 77 | ## How can I create and delete directories? 78 | """Without changing directories, delete the file agarwal.txt in the people directory.""" 79 | # rm -r people/agarwal.txt 80 | """Now that the people directory is empty, use a single command to delete it.""" 81 | # rmdir people 82 | """ >>>>>> rmdir === remove empty folder.""" 83 | """ >>>>>> mkdir === create a directory.""" 84 | """create a directory/folder""" 85 | # mkdir yearly 86 | """early exists, create another directory called 2017 inside it without leaving your home directory.""" 87 | # $ mkdir yearly/2017 88 | 89 | 90 | ## Wrapping up 91 | """ ** create intermediate files when analyzing data. Rather than storing them in your home directory, you can put them in /tmp,/tmp is 92 | ** immediately below the root directory / """ 93 | """ >>>>>> /tmp ===temp folder""" 94 | """ >>>>>> ~ ===shortcut for '/home/repl' """ 95 | """|| Move /home/repl/people/agarwal.txt into /tmp/scratch""" 96 | # $ cd /tmp 97 | # $ ls 98 | # $ mkdir scratch 99 | # $ mv ~/people/agarwal.txt /tmp/scratch 100 | -------------------------------------------------------------------------------- /Introduction to Shell/II Manipulating data.py: -------------------------------------------------------------------------------- 1 | """ 2 | This chapter will show you how to work with the data in those files. The tools we’ll use are fairly simple, but are solid building blocks. 3 | """ 4 | 5 | ## How can I view a file's contents? 6 | """ >>>>>> cat === quick view of contents.""" 7 | # $ cat course.txt 8 | """Introduction to the Unix Shell for Data Science......""" 9 | 10 | 11 | ## How can I view a file's contents piece by piece? 12 | """ >>>>>> less === one page at a time, passing with [SPACE], leaving with [q] 13 | >>>>>> :n , :p === next and previous file when using less. 14 | >>>>>> :q === quit.""" 15 | # $ less seasonal/spring.csv seasonal/summer.csv 16 | # :n 17 | # :p 18 | # :q 19 | 20 | 21 | ## How can I look at the start of a file? 22 | """ >>>>>> head === look at the start of a file.""" 23 | # $ head people/agarwal.txt 24 | # | Display as many lines as there are. 25 | 26 | 27 | ## How can I type less? 
28 | """ >>>>>> [TAB] === autocompletition""" 29 | """Run head seasonal/autumn.csv without typing the full filename.""" 30 | # $ head seasonal/autumn.csv 31 | # $ head seasonal/spring.csv 32 | 33 | 34 | ## How can I control what commands do? 35 | """FLAGS 36 | ---------""" 37 | """ >>>>>> head -n 10 === first 10 results.""" 38 | # $ head -n 5 seasonal/winter.csv 39 | 40 | 41 | ## How can I list everything below a directory? 42 | """ >>>>>> ls -R === see every file and directory in the current level, then everything in each sub-directory. 43 | >>>>>> ls -F === puts '/' after a directory. puts '*' after runnable program.""" 44 | # $ ls -R -F.: 45 | """backup/ bin/ course.txt people/ seasonal/ 46 | ./backup: 47 | 48 | ./bin: 49 | 50 | ./people: 51 | agarwal.txt 52 | 53 | ./seasonal: 54 | autumn.csv spring.csv summer.csv winter.csv""" 55 | 56 | 57 | ## How can I get help for a command? 58 | """ >>>>>> man === (maual) find out what commands DO.---invokes 'less', use [SPACE] to see more [q] to quit""" 59 | # $ man tail [SPACE][q] 60 | # $ tail -n +7 seasonal/spring.csv 61 | 62 | 63 | ## How can I select COLUMNS from a file? 64 | """ >>>>>> cut === select columns.""" 65 | # cut -f 2-5,8 -d , values.csv 66 | #-f(FIELDS, specify columns), -d(delimiter to specify the separator)--because some files may use spaces, tabs, or colons 67 | """---What command will select the first column (containing dates) from the file spring.csv?""" 68 | """cut -d , -f 1 seasonal/spring.csv 69 | cut -d, -f1 seasonal/spring.csv""" 70 | # Either of the above. 71 | 72 | 73 | ## What can't cut do? 74 | """---What is the output of cut -d : -f 2-4 on the line:""" 75 | # second:third: 76 | # | The trailing colon creates an empty fourth field. 77 | 78 | 79 | ## How can I repeat commands? 80 | """ >>>>>> history === see latest command history.""" 81 | # $ head summer.csv 82 | # $ cd seasonal/ 83 | # $ !head 84 | """head summer.csv 85 | Date,Tooth 86 | 2017-01-11,canine 87 | 2017-01-18,wisdom 88 | 2017-01-21,bicuspid 89 | 2017-02-02,molar.....""" 90 | # $ history 91 | """ 1 head summer.csv 92 | 2 cd seasonal/ 93 | 3 head summer.csv 94 | 4 history""" 95 | 96 | ## How can I select lines containing specific values? 97 | """ >>>>>> grep === selects lines according to what they contain""" 98 | # ...... grep bicuspid seasonal/winter.csv ---------------prints lines from winter.csv that contain "bicuspid". 99 | """grep's more common flags:----------""" 100 | # -c: print a count of matching lines rather than the lines themselves 101 | # -h: do not print the names of files when searching multiple files 102 | # -i: ignore case (e.g., treat "Regression" and "regression" as matches) 103 | # -l: print the names of files that contain matches, not the matches 104 | # -n: print line numbers for matching lines 105 | # -v: invert the match, i.e., only show lines that don't match 106 | """Print the contents of all of the lines containing the word molar in seasonal/autumn.csv by running a single command while in your home directory. Don't use any flags.""" 107 | # $ grep molar seasonal/autumn.csv 108 | """Invert the match to find all of the lines that don't contain the word molar in seasonal/spring.csv. and show their line numbers.""" 109 | # $ grep -n -v molar seasonal/spring.csv 110 | """out |1:Date,Tooth 111 | 2:2017-01-25,wisdom 112 | 3:2017-02-19,canine 113 | 4:2017-02-24,canine 114 | 5:2017-02-28,wisdom""" 115 | """Count how many lines contain the word incisor in autumn.csv and winter.csv combined. 
(Again, run a single command from your home directory.)""" 116 | # $ grep incisor -c seasonal/autumn.csv seasonal/winter.csv 117 | """ out | seasonal/autumn.csv:3 118 | seasonal/winter.csv:6""" 119 | 120 | 121 | ## Why isn't it always safe to treat data as text? 122 | # $ man paste 123 | # $ paste -d -n seasonal/autumn.csv seasonal/winter.csv 124 | # | The last few rows have the wrong number of columns. 125 | -------------------------------------------------------------------------------- /Introduction to Shell/III Combining tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | This chapter will show you how to use this power to select the data you want, and introduce commands for sorting values and removing duplicates. 3 | """ 4 | 5 | # How can I store a command's output in a file? 6 | """Combine tail with redirection to save the last 5 lines of seasonal/winter.csv in a file called last.csv.""" 7 | # $ tail -n 5 seasonal/winter.csv > last.csv 8 | 9 | 10 | ## How can I use a command's output as an input? 11 | """Select the last two lines from seasonal/winter.csv and save them in a file called bottom.csv.""" 12 | # $ tail -n 2 seasonal/winter.csv > bottom.csv 13 | """Select the first line from bottom.csv in order to get the second-to-last line of the original file.""" 14 | # $ head -n 1 bottom.csv 15 | 16 | 17 | ## What's a better way to combine commands? 18 | """ ** output | use the output -----> '|'== a pipe 19 | """ 20 | """Use cut to select all of the tooth names from column 2 of the comma delimited file seasonal/summer.csv, then pipe the result to grep, 21 | with an inverted match, to exclude the header line containing the word "Tooth""" 22 | # $ cut -d , -f 2 seasonal/summer.csv | grep -v Tooth 23 | 24 | 25 | ## How can I combine many commands? 26 | """Extend this pipeline with a head command to only select the very first tooth name.""" 27 | # $ cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1 28 | 29 | 30 | ## How can I count the records in a file? 31 | """Count how many records in seasonal/spring.csv have dates in July 2017 (2017-07).""" 32 | """use grep with a partial date to select the lines and pipe this result into wc(short for "word count") with an appropriate flag to count the lines.""" 33 | # $ grep 2017-07 seasonal/spring.csv | wc -l 34 | # out| 3 35 | 36 | 37 | ## How can I specify many files at once? 38 | """ ** wildcards === to specify a list of files with a single expression ---- '*' 39 | ** eg.---> cut -d , -f 1 seasonal/* 40 | ** eg---> cut -d , -f 1 seasonal/*.csv 41 | """ 42 | """head to get the first three lines from both seasonal/spring.csv and seasonal/summer.csv""" 43 | # head -n 3 seasonal/spring.csv seasonal/summer.csv 44 | """ out | ==> seasonal/spring.csv <== 45 | Date,Tooth 46 | 2017-01-25,wisdom 47 | 2017-02-19,canine 48 | 49 | ==> seasonal/summer.csv <== 50 | Date,Tooth 51 | 2017-01-11,canine 52 | 2017-01-18,wisdom""" 53 | 54 | 55 | ## What other wildcards can I use? 56 | """ ** wildcards less commonly used 57 | ** ? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt. 58 | ** [...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt. 59 | ** {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt, *.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf. 
60 | 61 | --Which expression would match 62 | singh.pdf and johel.txt but not sandhu.pdf or sandhu.txt?""" 63 | # {singh.pdf, j*.txt} 64 | 65 | 66 | ## How can I sort lines of text? 67 | """ >>>>>> sort === puts data in order; by default it does this in ascending alphabetical order 68 | >>>>>> flags --- -n(sort numerically), -r(reverse), -b(ignore leading blanks), -f(fold case) 69 | ** Pipelines often use 'grep' to get rid of unwanted records and then 'sort' to put the remaining records in order.""" 70 | """eg. cut -d , -f 2 seasonal/summer.csv | grep -v Tooth 71 | sort the names of the teeth in seasonal/winter.csv (not summer.csv) in descending alphabetical order. To do this, extend the pipeline with a sort step.""" 72 | # $ cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort -r 73 | 74 | 75 | ## How can I remove duplicate lines? 76 | """ >>>>>> sort | uniq === eliminate duplicates (uniq only removes adjacent duplicates, so sort first).""" 77 | """ The start of your pipeline : cut -d , -f 2 seasonal/winter.csv | grep -v Tooth 78 | Extend it with a sort command, and use uniq -c to display unique lines with a count of how often each occurs """ 79 | # $ cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c 80 | 81 | 82 | ## How can I save the output of a pipe? 83 | """ >>>>>> > teeth-only.txt === save output in teeth-only.txt 84 | ** must appear at the end of the pipeline""" 85 | """---What happens if we put redirection at the front of a pipeline as in: 86 | > result.txt head -n 3 seasonal/winter.csv""" 87 | # The command's output is redirected to the file as usual. 88 | 89 | 90 | ## How can I stop a running program? 91 | """ >>>>>> Ctrl + C or ^C === to end it.""" 92 | 93 | 94 | ## Wrapping up 95 | """build a pipeline to find out how many records are in the shortest of the seasonal data files.""" 96 | """Use wc with appropriate parameters to list the number of lines in all of the seasonal data files. (Use a wildcard for the filenames instead of typing them all in by hand.)""" 97 | # $ wc -l seasonal/* 98 | # out | 21 seasonal/autumn.csv 99 | # 24 seasonal/spring.csv 100 | # 25 seasonal/summer.csv 101 | # 26 seasonal/winter.csv 102 | # 96 total 103 | """Add another command to the previous one using a pipe to remove the line containing the word "total".""" 104 | # $ wc -l seasonal/* | grep -v total 105 | # out | 21 seasonal/autumn.csv 106 | # 24 seasonal/spring.csv 107 | # 25 seasonal/summer.csv 108 | # 26 seasonal/winter.csv 109 | """Add two more stages to the pipeline that use sort -n and head -n 1 to find the file containing the fewest lines.""" 110 | # $ wc -l seasonal/* | grep -v total | sort -n | head -n 1 111 | # out | 21 seasonal/autumn.csv 112 | -------------------------------------------------------------------------------- /Introduction to Shell/IV Batch processing.py: -------------------------------------------------------------------------------- 1 | """ 2 | Shell commands will process many files at once. This chapter shows you how to make your own pipelines do that. Along the way, you will 3 | see how the shell uses variables to store information. 4 | """ 5 | 6 | ## How does the shell store information? 7 | """ ** environment variables === 8 | HOME User's home directory /home/repl 9 | PWD Present working directory Same as pwd command 10 | SHELL Which shell program is being used /bin/bash 11 | USER User's ID repl 12 | set To get a complete list""" 13 | """Use set and grep with a pipe to display the value of HISTFILESIZE, which determines how many old commands are stored in your command history.
What is its value?""" 14 | # set | grep HISTFILESIZE 15 | 16 | 17 | 18 | ## How can I print a variable's value? 19 | """ >>>>>> echo || echo $USER find a variable's value.""" 20 | 21 | """The variable OSTYPE holds the name of the kind of operating system you are using. Display its value using echo.""" 22 | # $ echo $OSTYPE 23 | # out | linux-gnu 24 | 25 | 26 | 27 | ## How else does the shell store information? 28 | """ ** shell variable == local variable in a programming language. 29 | >>>>>> training=seasonal/summer.csv === To create a shell variable (simply assign a value to a name:) without any spaces """ 30 | """Define a variable called testing with the value seasonal/winter.csv.""" 31 | # $ testing=seasonal/winter.csv 32 | """Use head -n 1 SOMETHING to get the first line from seasonal/winter.csv using the value of the variable testing instead of the name of the file.""" 33 | # $ head -n 1 $testing 34 | # out | Date,Tooth 35 | 36 | 37 | 38 | ## How can I repeat a command many times? 39 | """ ** loops 40 | eg.: for filetype in gif jpg png; do echo $filetype; done 41 | output: gif 42 | jpg 43 | png 44 | """ 45 | """Modify the loop so that it prints: docx ,odt, pdf""" 46 | # $ for filetype in docx odt pdf; do echo $filetype; done 47 | 48 | 49 | 50 | ## How can I repeat a command once for each file? 51 | """Modify the wildcard expression to people/* so that the loop prints the names of the files in the people directory 52 | regardless of what suffix they do or don't have. Please use filename as the name of your loop variable. 53 | """ 54 | # $ for filename in people/*; do echo $filename; done 55 | 56 | 57 | 58 | ## How can I record the names of a set of files? 59 | """ 1| asign a var : datasets=seasonal/*.csv 60 | 2| display the files' names later: for filename in $datasets; do echo $filename; done""" 61 | """---If you run these two commands in your home directory, how many lines of output will they print? 62 | files=seasonal/*.csv 63 | for f in $files; do echo $f; done""" 64 | # Four: the names of all four seasonal data files. 65 | 66 | 67 | 68 | ## A variable's name versus its value 69 | """---If you were to run these two commands in your home directory, what output would be printed? 70 | files=seasonal/*.csv 71 | for f in files; do echo $f; done.""" 72 | # One line: the word "files". 73 | # correct : the loop uses files instead of $files, so the list consists of the word "files". 74 | 75 | 76 | 77 | ## How can I run many commands in a single loop? 78 | """ eg: for file in seasonal/*.csv; do head -n 2 $file | tail -n 1; done 79 | Write a loop that prints the last entry from July 2017 (2017-07) in every seasonal file. It should produce a similar output to: 80 | similar: grep 2017-07 seasonal/winter.csv | tail -n 1 81 | but for each seasonal file separately. Please use file as the name of the loop variable, and remember to loop through the list of files seasonal/*.csv""" 82 | # $ for file in seasonal/*.csv; do grep 2017-07 $file | tail -n 1; done 83 | 84 | 85 | 86 | ## Why shouldn't I use spaces in filenames? 87 | # Both of the above. 88 | # Use single quotes, ', or double quotes, ", around the file names. 89 | 90 | 91 | 92 | # 93 | """Suppose you forget the semi-colon between the echo and head commands in the previous loop, so that you ask the shell to run: 94 | eg: for f in seasonal/*.csv; do echo $f head -n 2 $f | tail -n 1; done 95 | ---What will the shell do?.""" 96 | # Print one line for each of the four files. 
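As a wrap-up for this chapter, here is a minimal sketch (not one of the course solutions) that combines a shell variable, a wildcard, and a loop running two commands per file; it assumes the seasonal/*.csv files used above are present.
# $ files=seasonal/*.csv
# $ for f in $files; do echo $f; head -n 1 $f | cut -d , -f 1; done
# out | seasonal/autumn.csv
#       Date
#       ... (one filename/header pair per seasonal file)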
97 | 98 | 99 | -------------------------------------------------------------------------------- /Introduction to Shell/v Creating new tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | History lets you repeat things with just a few keystrokes, and pipes let you combine existing commands to create new ones. In this chapter, 3 | you will see how to go one step further and create new commands of your own. 4 | """ 5 | 6 | ## How can I edit a file? 7 | """text editor Nano. If you type 8 | >>>>>> nano filename === it will open/create/edit filename for editing 9 | ** operations with control-key combinations: 10 | Ctrl + K: delete a line. 11 | Ctrl + U: un-delete a line. 12 | Ctrl + O: save the file ('O' stands for 'output'). You will also need to press Enter to confirm the filename! 13 | Ctrl + X: exit the editor.""" 14 | """---Run nano names.txt to edit a new file in your home directory and enter the following four lines:""" 15 | # $ nano names.txt 16 | # paste| Lovelace 17 | # Hopper 18 | # Johnson 19 | # Wilson 20 | # Ctrl + O : save, then press ENTER 21 | # Ctrl + X : leave the file 22 | 23 | 24 | 25 | ## How can I record what I just did? 26 | """Copy the files seasonal/spring.csv and seasonal/summer.csv to your home directory.""" 27 | """ >>>>>> cp === copy.""" 28 | # $ cp seasonal/s* ~ 29 | """Use grep with the -h flag (to stop it from printing filenames) and -v Tooth (to select lines that don't match the header line) 30 | to select the data records from spring.csv and summer.csv in that order and redirect the output to temp.csv.""" 31 | """ >>>>>> grep -h === don't print filenames. 32 | >>>>>> > temp.csv === redirect output to temp.csv.""" 33 | # $ grep -h -v Tooth seasonal/s* > temp.csv 34 | """Pipe history into tail -n 3 and redirect the output to steps.txt to save the last three commands in a file. 35 | (You need to save three instead of just two because the history command itself will be in the list.)""" 36 | """ >>>>>> history | tail -n 3 === show the last 3 commands in the history""" 37 | # $ history | tail -n 3 > steps.txt 38 | 39 | 40 | 41 | ## How can I save commands to re-run later? 42 | """Use nano dates.sh to create a file called dates.sh that contains this command: 43 | cut -d , -f 1 seasonal/*.csv 44 | to extract the first column from all of the CSV files in seasonal.""" 45 | """ >>>>>>> nano filename.sh === create a saved command in a bash program 46 | >>>>>>> bash filename.sh === run the saved bash program""" 47 | # $ nano dates.sh 48 | # cut -d , -f 1 seasonal/*.csv 49 | # (save and exit) 50 | # $ bash dates.sh 51 | 52 | 53 | 54 | ## How can I re-use pipes? 55 | """ ** a file full of shell commands is called a shell script, or sometimes just a 'script' """ 56 | """A file teeth.sh in your home directory has been prepared for you, but contains some blanks. Use Nano to edit the file and replace the 57 | two ____ placeholders with seasonal/*.csv and -c so that this script prints a count of the number of times each tooth name appears in the CSV 58 | files in the seasonal directory.""" 59 | # $ nano teeth.sh 60 | # $ bash teeth.sh > teeth.out 61 | # $ cat teeth.out 62 | # out | 15 bicuspid 63 | # 31 canine 64 | # 18 incisor 65 | # 11 molar 66 | # 17 wisdom 67 | 68 | 69 | 70 | ## How can I pass filenames to scripts?
71 | """Edit the script count-records.sh with Nano and fill in the two ____ placeholders with $@ and -l (the letter) respectively so that it counts the number of lines 72 | in one or more files, excluding the first line of each.""" 73 | """ >>>>>> $@ === pass filenames to scripts.""" 74 | # $ nano cout-records.sh 75 | # tail -q -n +2 $@ | wc -l 76 | # (save and close) 77 | # $ bash count-records.sh seasonal/*.csv > num-records.out 78 | 79 | 80 | 81 | ## How can I process a single argument? 82 | """ >>>>>> $1, $2, and so on === to refer to specific command-line parameters.""" 83 | """--- 84 | bash get-field.sh seasonal/summer.csv 4 2 85 | should select the second field from line 4 of seasonal/summer.csv. Which of the following commands should be put in get-field.sh to do that?""" 86 | # head -n $2 $1 | tail -n 1 | cut -d , -f $3 87 | 88 | 89 | 90 | ## How can one shell script do many things? 91 | """edit the script range.sh and replace the two ____ placeholders with $@ and -v so that it lists the names and number of lines in all of the files given on 92 | the command line without showing the total number of lines in all files. (Do not try to subtract the column header lines from the files.)""" 93 | # $ nano range.sh 94 | # wc -l $@ | grep -v total 95 | """add sort -n and head -n 1 in that order to the pipeline in range.sh to display the name and line count of the shortest file given to it.""" 96 | # wc -l $@ | grep -v total | sort -n | head -n 1 97 | """ add a second line to range.sh to print the name and record count of the longest file in the directory as well as the shortest. 98 | This line should be a duplicate of the one you have already written, but with sort -n -r rather than sort -n.""" 99 | # wc -l $@ | grep -v total | sort -n | head -n 1 100 | # wc -l $@ | grep -v total | sort -n -r | head -n 1 101 | """Run the script on the files in the seasonal directory using seasonal/*.csv to match all of the files and redirect the output using > to a file called range.out""" 102 | # $ bash range.sh seasonal/*.csv > range.out 103 | 104 | 105 | 106 | ## How can I write loops in a shell script? 107 | """ eg : 108 | # Print the first and last data records of each file. 109 | for filename in $@ 110 | do 111 | head -n 2 $filename | tail -n 1 112 | tail -n 1 $filename 113 | done""" 114 | """Fill in the placeholders in the script date-range.sh with $filename (twice), head, and tail so that it prints the first and last date from one or more files.""" 115 | # # Print the first and last date from each data file. 116 | # for filename in $@ 117 | # do 118 | # cut -d , -f 1 $filename | grep -v Date | sort | head -n 1 119 | # cut -d , -f 1 $filename | grep -v Date | sort | tail -n 1 120 | # done 121 | """Run date-range.sh on all four of the seasonal data files using seasonal/*.csv to match their names, and pipe its output to sort""" 122 | # $ bash date-range.sh seasonal/*.csv | sort 123 | 124 | 125 | 126 | ## What happens when I don't provide filenames? 127 | """--- 128 | Suppose you do accidentally type: 129 | head -n 5 | tail -n 3 somefile.txt 130 | What should you do next?""" 131 | #| Use Ctrl + C to stop the running head program. 
132 | 133 | -------------------------------------------------------------------------------- /OOP-python-DataCamp/README.md: -------------------------------------------------------------------------------- 1 | # Object-Oriented-Programming-in-Python-DataCamp 2 | all answers of oop object oriented programming from datacamp course 3 | -------------------------------------------------------------------------------- /Python-Data-Science-Toolbox-Part-2--case-study-main/README.md: -------------------------------------------------------------------------------- 1 | # Python-Data-Science-Toolbox-Part-2--case-study 2 | Python Data Science Toolbox (Part 2)-case study datacamp `World bank data` 3 | 4 | - Data on world economies for over half a century 5 | 6 | - Indicators 7 | 8 | -- Population 9 | 10 | -- Electricity consumption 11 | 12 | 13 | -- CO2 emissions 14 | 15 | -- Literacy rates 16 | 17 | -- Unemployment 18 | 19 | -- Mortality rates 20 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # datacamp-Data-Engineer-with-Python-course 2 | - CAREER TRACK 3 | 4 | **Data Engineer with Python** 5 | --- 6 | ![logo](https://user-images.githubusercontent.com/51888893/194946662-fc6eea27-e064-481f-855e-7a78a5147f43.png) 7 | 8 | 9 | In this track, you’ll discover how to build an effective data architecture, streamline data processing, and maintain large-scale data systems. In addition to working 10 | 11 | with Python, you’ll also grow your language skills as you work with Shell, SQL, and Scala, to create data engineering pipelines, automate common file system tasks, and 12 | 13 | build a high-performance database. 14 | 15 | 16 | 17 | Through hands-on exercises, you’ll add cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB, to your data engineering toolkit to help you create 18 | 19 | and query databases, wrangle data, and configure schedules to run your pipelines. By the end of this track, you’ll have mastered the critical database, scripting, and 20 | 21 | process skills you need to progress your career. 22 | 23 | 24 | Please note this track assumes a fundamental knowledge of Python and SQL. 25 | -------------------------------------------------------------------------------- /Streamlined Data Ingestion with pandas/README.md: -------------------------------------------------------------------------------- 1 | 2 | Instructor : Amany Mahfouz 3 | # This course teaches you 4 | 5 | how to build pipelines to import data kept in common storage formats. 6 | 7 | You’ll use pandas, a major Python library for analytics, to get data from a variety of sources, 8 | 9 | from spreadsheets of survey responses, to a database of public service requests, to an API for a popular review site. 10 | 11 | Along the way, you’ll learn how to fine-tune imports to get only what you need and to address issues like incorrect data types. 12 | 13 | Finally, you’ll assemble a custom dataset from a mix of sources. 14 | -------------------------------------------------------------------------------- /Unit Testing for Data Science in Python/.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | install: 5 | - pip install -e . 
6 | - pip install codecov pytest-cov 7 | script: 8 | - pytest --cov=src tests 9 | after_success: 10 | - codecov 11 | -------------------------------------------------------------------------------- /Unit Testing for Data Science in Python/readme.md: -------------------------------------------------------------------------------- 1 | [![Build Status](https://travis-ci.com/gutfeeling/univariate-linear-regression.svg?branch=master)](https://travis-ci.com/gutfeeling/univariate-linear-regression) 2 | [![codecov](https://codecov.io/gh/gutfeeling/univariate-linear-regression/branch/master/graph/badge.svg)](https://codecov.io/gh/gutfeeling/univariate-linear-regression) 3 | --------------------------------------------------------------------------------
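The Travis configuration above installs the package in editable mode and measures test coverage before uploading it to Codecov. Assuming the same src/ and tests/ layout the config implies, the CI step can be reproduced locally with something like:
# $ pip install -e .
# $ pip install codecov pytest-cov
# $ pytest --cov=src tests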