├── .travis.yml ├── Big Data Fundamentals with PySpark ├── 1. Introduction to Big Data analysis with Spark.md ├── 2. Programming in PySpark RDD’s.md ├── 3. PySpark SQL & DataFrames.md └── 4. Machine Learning with PySpark MLlib.md ├── Cleaning Data with PySpark ├── 1. DataFrame details.md ├── 2. Manipulating DataFrames in the real world.md ├── 3. Improving Performance.md └── 4. Complex processing and data pipelines.md ├── Data Processing in Shell ├── I Downloading Data on the Command Line.py ├── II Data Cleaning and Munging on the Command Line.py ├── III Database Operations on the Command Line.py └── IV Data Pipeline on the Command Line.py ├── Database Design ├── 1. Processing, Storing, and Organizing Data │ ├── 1. OLTP and OLAP.txt │ ├── 2. Which is better?.py │ ├── 3. Name that data type.py │ ├── 4. Ordering ETL Tasks.py │ ├── 5. Recommend a storage solution.py │ ├── 6. Database design.txt │ ├── 7. Classifying data models.py │ ├── 8. Deciding fact and dimension tables.py │ └── 9. Querying the dimensional model.py ├── 2. Database Schemas and Normalization.md ├── 3. Database Views.md └── 4. Database Management.md ├── Introduction to AWS Boto in Python ├── 1. Putting Files in the Cloud! │ ├── I. Intro to AWS and Boto3.py │ ├── II. Your first boto3 client.py │ ├── III. Multiple clients Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. You will help Sam initialize a boto3 client for S3, and another client for SNS. She will use the S3 client to list the buckets in S3. She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). Instructions 100 XP Generate the boto3 clients for interacting with S3 and SNS. Specify 'us-east-1' for the region_name for both clients. Use AWS_KEY_ID and AWS_SECRET to set up the credentials. List and print the SNS topics..py │ ├── IV. Removing repetitive work.py │ ├── V. Diving into buckets.txt │ ├── VI. creating a bucket.py │ ├── VII. Listing buckets.py │ ├── VIII. Deleting a bucket.py │ ├── VIV. Deleting multiple buckets.py │ ├── X. Uploading and retrieving files.txt │ └── XI. Spring cleaning.py ├── 2. Sharing Files Securely │ ├── I. Keeping objects secure.txt │ ├── II. Making an object public.py │ ├── III. Uploading a public report.py │ ├── IV. Making multiple files public.py │ ├── V. Accessing private objects in S3.txt │ ├── VI. Generating a presigned URL.py │ ├── VII. Opening a private file.py │ ├── VIII. Sharing files through a website.txt │ ├── VIV. Generate HTML table from Pandas.py │ ├── X. Upload an HTML file to S3.py │ ├── XI. Combine daily requests for February.py │ ├── XII. Upload aggregated reports for February.py │ ├── XIII. Update index to include February.py │ └── XIV. Upload the new index.py ├── 3. Reporting and Notifying! │ ├── 1. SNS Topics.txt │ ├── 10. Sending a single SMS message.py │ ├── 11. Different protocols per topic level.py │ ├── 12. Creating multi-level topics.py │ ├── 13. Sending an alert.py │ ├── 14. Sending multi-level alerts.py │ ├── 2. Creating a Topic.py │ ├── 3. Creating Multiple Topics.py │ ├── 4. Deleting multiple topics.py │ ├── 5. SNS Subscriptions.txt │ ├── 6. Subscribing to topics.py │ ├── 7. 
Creating multiple subscriptions.py │ ├── 8. Deleting multiple subscriptions.py │ └── 9. Sending messages.txt └── 4. Pattern Rekognition │ ├── 1. Rekognizing patterns.txt │ ├── 10. Scooter Community Sentiment.py │ ├── 2. Cat detector.py │ ├── 3. Multiple cat detector.py │ ├── 4. Parking sign reader.py │ ├── 5. Comprehending text.txt │ ├── 6. Detecting language.py │ ├── 7. Translating Get It Done requests.py │ ├── 8. Getting request sentiment.py │ └── 9. Scooter dispatch.py ├── Introduction to Airflow in Python ├── I Intro to Airflow.py ├── II Implementing Airflow DAGs.py ├── III Maintaining and monitoring Airflow workflows.py └── IV Building production pipelines in Airflow.py ├── Introduction to Bash Scripting ├── I From Command-Line to Bash Script.py ├── II Variables in Bash Scripting.py ├── III Control Statements in Bash Scripting.py └── IV Functions and Automation.py ├── Introduction to Data Engineering ├── I Introduction to Data Engineering.py ├── II Data engineering toolbox.py ├── III Extract, Transform and Load (ETL).py ├── IV Case Study: DataCamp Ratings.py └── README.me ├── Introduction to MongoDB in Python ├── 1. Flexibly Structured Data.md ├── 2. Working with Distinct Values and Sets.md ├── 3. Get Only What You Need, and Fast.md └── 4. Aggregation Pipelines: Let the Server Do It For You.md ├── Introduction to PySpark ├── I Getting to know PySpark.py ├── II. Manipulating data.py ├── III. Getting started with machine learning pipelines.py └── IV. Model tuning and selection │ ├── I. Create the modeler.py │ ├── II. Create the Evaluator.py │ ├── III. Make a grid.py │ ├── IV. Make the validator.py │ ├── V. Fit the model(s).py │ ├── VI. Evaluating binary classifiers.py │ └── VII. Evaluate the model.py ├── Introduction to Relational Databases in SQL ├── 1. Your first database │ ├── 1. Attributes of relational databases.py │ ├── 2. Query information_schema with SELECT.py │ ├── 3. CREATE your first few TABLEs.py │ ├── 4. ADD a COLUMN with ALTER TABLE.py │ ├── 5. RENAME and DROP COLUMNs in affiliations.py │ ├── 6. Migrate data with INSERT INTO SELECT DISTINCT.py │ └── 7. Delete tables with DROP TABLE.py ├── 2. Enforce data consistency with attribute constraints │ ├── 0. query table data types INFORMATION_SCHEMA.txt │ ├── 1.Better data quality with constraints-changing datatypes CAST.txt │ ├── 10. .py │ ├── 10. What happens if you try to enter NULLs?.py │ ├── 2. Types of database constraints.py │ ├── 3. Conforming with data types.py │ ├── 4. Type CASTs.py │ ├── 5. Working with data types.txt │ ├── 6. Change types with ALTER COLUMN.py │ ├── 7. Convert types USING a function.py │ ├── 8. The not-null and unique constraints.txt │ └── 9. Disallow NULL values with SET NOT NULL.py ├── 3. Uniquely identify records with key constraints │ ├── 1. Get to know SELECT COUNT DISTINCT.py │ ├── 2. Identify keys with SELECT COUNT DISTINCT.py │ ├── 3. Identify the primary key.py │ ├── 4. ADD key CONSTRAINTs to the tables.py │ ├── 5. Add a SERIAL surrogate key.py │ ├── 6. CONCATenate columns to a surrogate key.py │ └── 7. Test your knowledge before advancing.py └── 4. Glue together tables with foreign keys │ ├── 1. REFERENCE a table with a FOREIGN KEY.py │ ├── 10. Change the referential integrity behavior of a key.py │ ├── 11. Count affiliations per university.py │ ├── 12. .py │ ├── 2. Explore foreign key constraints.py │ ├── 3. JOIN tables linked by a foreign key.py │ ├── 4. Model more complex relationships.txt │ ├── 5. Add foreign keys to the "affiliations" table.py │ ├── 6. 
Populate the "professor_id" column.py │ ├── 7. Drop "firstname" and "lastname".py │ ├── 8. Referential Integrity ON DELETE.txt │ └── 9. Referential integrity violations.py ├── Introduction to Scala ├── 1. A scalable language.md ├── 2. Workflows, Functions, Collections.md └── 3. Type Systems, Control Structures, Style.md ├── Introduction to Shell ├── I Manipulating files and directories.py ├── II Manipulating data.py ├── III Combining tools.py ├── IV Batch processing.py └── v Creating new tools.py ├── OOP-python-DataCamp ├── I OOP Fundamentals.py ├── II Inheritance and Polymorphism.py ├── III Integrating with Standard Python.py ├── IV Best Practices of Class Design.py └── README.md ├── Python-Data-Science-Toolbox-Part-2--case-study-main ├── README.md └── bank-data-code-case.py ├── README.md ├── Streamlined Data Ingestion with pandas ├── I Importing Data from Flat Files.py ├── II Importing Data From Excel Files.py ├── III Importing Data from Databases.py ├── IV Importing JSON Data and Working with APIs.py └── README.md ├── Unit Testing for Data Science in Python ├── .travis.yml ├── I Unit testing basics.py ├── II Intermediate unit testing.py ├── III Test Organization and Execution.py ├── IV Testing Models, Plots and Much More.py └── readme.md ├── Writing Efficient Python Code ├── I Foundations for efficiencies.py ├── II Timing and profiling code.py ├── III Gaining efficiencies.py └── IV Intro to pandas DataFrame iteration.py └── Writing Functions in Python ├── I Best Practices.py ├── II Context Managers.py ├── III Decorators.py └── IV More on Decorators.py /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | install: 5 | - pip install -e . 6 | - pip install codecov pytest-cov 7 | script: 8 | - pytest --cov=src tests 9 | after_success: 10 | - codecov 11 | -------------------------------------------------------------------------------- /Cleaning Data with PySpark/1. DataFrame details.md: -------------------------------------------------------------------------------- 1 | ## # Intro to data cleaning with Apache Spark 2 | 3 | ### # data cleaning 4 | `data cleaning` Preparing raw data for use in data processing pipelines. 5 | 6 | `data cleaning tasks` 7 | - Reformatting or replacing text 8 | - Performing calculations 9 | - Removing garbage or incomplete data 10 | - 11 | ### # why cleaning w/ SPARK ? 12 | - Problems with typical data systems: 13 | - Performance 14 | - Organizing data flow 15 | - ***Advantages of Spark:*** 16 | - Scalable 17 | - Powerful framework for data handling 18 | ### # data cleaning example 19 | ![image](https://user-images.githubusercontent.com/51888893/203651989-f3cce7f4-ce84-4b07-8c30-b02b684a7639.png) 20 | 21 | ### # Spark Schemas 22 | `spark schemas` : 23 | - Define format of DataFrame 24 | - various data types: 25 | - Strings, dates, integers, arrays 26 | - filter garbage data during import 27 | - Improves read performance 28 | - 29 | ### # Example Spark Schema 30 | - Import schema 31 | ```py 32 | import pyspark.sql.types 33 | peopleSchema = StructType ([ 34 | # Define the name field 35 | StructField ( 'name', StringType(), True), 36 | #Add the age field 37 | StructField ('age' , IntegerType(), True), 38 | # Add the city field 39 | StructField ('city', StringType(), True) 40 | ]) 41 | ``` 42 | - Read CSV file containing data 43 | ```py 44 | people_df = spark. 
read.format('csv').load('rawdata.csv', schema=peopleSchema) 45 | ``` 46 | ## Data cleaning review 47 | > Which of the following is NOT a benefit? 48 | 49 | Answer the question 50 | - [ ] Spark offers high performance. 51 | - [x] Spark can only handle thousands of records. 52 | - [ ] Spark allows orderly data flows. 53 | - [ ] Spark can use strictly defined schemas while ingesting data. 54 | ## Defining a schema 55 | - [x] Import * from the pyspark.sql.types library. 56 | - [x] Define a new schema using the StructType method. 57 | - [x] Define a StructField for name, age, and city. Each field should correspond to the correct datatype and not be nullable. 58 | ```py 59 | # Import the pyspark.sql.types library 60 | from pyspark.sql.types import * 61 | 62 | # Define a new schema using the StructType method 63 | people_schema = StructType([ 64 | # Define a StructField for each field 65 | StructField('name', StringType(), False), # False = not nullable 66 | StructField('age', IntegerType(), False), 67 | StructField('city', StringType(), False) 68 | ]) 69 | ``` 70 | ## Immutability review 71 | > You’ve just seen that immutability and lazy processing are fundamental concepts in the way Spark handles data. But why would Spark use immutable data frames to begin with? 72 | 73 | Answer the question 74 | - [ ] To add complexity to your Spark tasks. 75 | - [x] To efficiently handle data throughout the cluster. 76 | - [ ] To easily modify variable values as needed. 77 | - [ ] To conserve storage space. 78 | 79 | ## Using lazy processing 80 | - [x] Load the DataFrame. 81 | - [x] Add the transformation for F.lower() to the Destination Airport column. 82 | - [x] Drop the Destination Airport column from the DataFrame aa_dfw_df. Note the time for these operations to complete. 83 | - [x] Show the DataFrame, noting the time difference for this action to complete. 84 | ```py 85 | # Load the CSV file 86 | aa_dfw_df = spark.read.format('csv').options(Header=True).load('AA_DFW_2018.csv.gz') 87 | 88 | # Add the airport column using the F.lower() method 89 | aa_dfw_df = aa_dfw_df.withColumn('airport', F.lower(aa_dfw_df['Destination Airport'])) 90 | 91 | # Drop the Destination Airport column 92 | aa_dfw_df = aa_dfw_df.drop(aa_dfw_df['Destination Airport']) 93 | 94 | # Show the DataFrame 95 | aa_dfw_df.show() 96 | ``` 97 | ## # Understanding Parquet 98 | ### # Reading Parquet files 99 | ```py 100 | df = spark.read.format('parquet').load('filename.parquet') 101 | 102 | df = spark.read.parquet('filename.parquet') 103 | ``` 104 | ### # Writing Parquet files 105 | ```py 106 | df.write.format('parquet').save('filename.parquet') 107 | 108 | df.write.parquet('filename.parquet') 109 | ``` 110 | ### # Parquet and SQL 111 | ```py 112 | flight_df = spark.read.parquet('flights.parquet') 113 | 114 | flight_df.createOrReplaceTempView('flights') 115 | 116 | short_flights_df = spark.sql('SELECT * FROM flights WHERE flight_duration < 100') 117 | ``` 118 | ## Saving a DataFrame in Parquet format 119 | - [x] View the row count of df1 and df2. 120 | - [x] Combine df1 and df2 in a new DataFrame named df3 with the union method. 121 | - [x] Save df3 to a parquet file named AA_DFW_ALL.parquet. 122 | - [x] Read the AA_DFW_ALL.parquet file and show the count.
123 | ```py 124 | # View the row count of df1 and df2 125 | print("df1 Count: %d" % df1.count()) 126 | print("df2 Count: %d" % df2.count()) 127 | 128 | # Combine the DataFrames into one 129 | df3 = df1.union(df2) 130 | 131 | # Save the df3 DataFrame in Parquet format 132 | df3.write.parquet('AA_DFW_ALL.parquet', mode='overwrite') 133 | 134 | # Read the Parquet file into a new DataFrame and run a count 135 | print(spark.read.parquet('AA_DFW_ALL.parquet').count()) 136 | 137 | ''' result : 138 | df1 Count: 139359 139 | df2 Count: 119911 140 | 259270 141 | ''' 142 | ``` 143 | ## SQL and Parquet 144 | - [x] Import the AA_DFW_ALL.parquet file into flights_df. 145 | - [x] Use the createOrReplaceTempView method to alias the flights table. 146 | - [x] Run a Spark SQL query against the flights table. 147 | ```py 148 | # Read the Parquet file into flights_df 149 | flights_df = spark.read.parquet('AA_DFW_ALL.parquet') 150 | 151 | # Register the temp table 152 | flights_df.createOrReplaceTempView('flights') 153 | 154 | # Run a SQL query of the average flight duration 155 | avg_duration = spark.sql( 156 | 'SELECT avg(flight_duration) from flights').collect()[0] 157 | print('The average flight time is: %d' % avg_duration) 158 | 159 | # The average flight time is: 151 160 | ``` 161 | -------------------------------------------------------------------------------- /Data Processing in Shell/I Downloading Data on the Command Line.py: -------------------------------------------------------------------------------- 1 | """ 2 | How to download data files from web servers via the command line. In the process, we also learn about documentation manuals, option flags, and multi-file processing. 3 | """ 4 | #==================================================== 5 | # THEORY & EXAMPLES 6 | #=================================================== 7 | ## Downloading data using curl : 8 | """ ** CURL : short for Client for URLs, 9 | Unix CommandLine Tool 10 | Transfers Data To and From Servers 11 | Used to download data from HTTPS sites and FTP Servers""" 12 | ## Checking CURL Installation : 13 | # >>>>>> man curl 14 | """ if not installed follow link steps: https://curl.se/download.html""" 15 | 16 | 17 | 18 | ## Curl Syntax : 19 | # curl [option flags] [URL] 20 | ## Curl Supports : 21 | # HTTP, HTTPS, FTP, SFTP 22 | 23 | 24 | 25 | ## Downloading single file : 26 | """ # Download keeping the same name : >>>>>> curl -O 27 | # eg : curl -O https://websitename/datafilename.txt 28 | # Download and rename the file : >>>>>> curl -o renamedfile https://websitename/datafilename.txt """ 29 | ## Downloading multiple files w/ WILDCARDS : 30 | """ # Globbing Parser : >>>>>> curl -O https://websitename/datafilename*.txt 31 | # indexing : >>>>>> curl -O https://websitename/datafilename[001-100:10].txt """ 32 | 33 | 34 | 35 | 36 | ## Flags in case of timeouts : 37 | """ # Follow a redirected HTTP URL (300-level response) : >>>>>> -L 38 | # Resume a previous file transfer that timed out before completion : >>>>>> -C """ 39 | #--------------- 40 | ## Using curl documentation 41 | """--- Based on the information in the curl manual, which of the following is NOT a supported file protocol ?
:""" 42 | # OFTP 43 | 44 | 45 | 46 | ## Downloading single file using curl 47 | # Use curl to download the file from the redirected URL 48 | # curl -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 49 | # # Download and rename the file in the same step 50 | # curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 51 | 52 | ## Downloading multiple files using curl 53 | # Download all 100 data files 54 | # curl -L https://s3.amazonaws.com/assets.datacamp.com/production/repositories/4180/datasets/files/datafile[001-100].txt 55 | # Print all downloaded files to directory 56 | # ls datafile*.txt 57 | 58 | 59 | 60 | ## Downloading data using Wget : 61 | """ ** Wget : Derives from World Wide Web and Get 62 | compatible w/ all OS 63 | Used to download data from HTTPS and FTP 64 | $$$ BETTER THAN USING CURL TO DOWNLOAD MULTIPLE FILES $$$ """ 65 | ## Check Installation : 66 | # shows wget's location : >>>>>> which wget 67 | """ if not installed follow link steps : https://www.gnu.org/software/wget/ """ 68 | # check manual when installed : >>>>>> man wget 69 | 70 | 71 | ## Basic Syntax Wget : >>>>>> wget [option flags] [URL] 72 | 73 | 74 | """ ** Option Flags wget : 75 | -b : go to background after startup 76 | -q : turn off wget output 77 | -c : resume broken download 78 | -------- 79 | eg. : wget -bqc [URL] 80 | ------ 81 | ** PID : unique ID for the data download process, in case of cancellation """ 82 | #==================================================== 83 | # EXERCISES 84 | #=================================================== 85 | ## Installing Wget 86 | """---Which of the following is NOT a way to install wget?""" 87 | # On MacOS, install using pip 88 | 89 | 90 | 91 | ## Downloading single file using wget 92 | # Fill in the two option flags 93 | """option flags: -c for resuming a partial download / -b for letting the download occur in the background.""" 94 | # wget -c -b https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 95 | # Verify that the Spotify file has been downloaded 96 | # ls 97 | # Preview the log file 98 | # cat wget-log 99 | 100 | 101 | 102 | ## THEO========== 103 | ## Advanced downloading using Wget 104 | ## ------------------------------------ 105 | """- save a list of file locations in a txt file""" 106 | """- Download from the URL locations listed in that file : 107 | >>>>>> -i """ 108 | ## Limit rate for larger files (--limit-rate) 109 | ## ------------------------------------ 110 | """ >>>>>> --limit-rate 111 | eg: wget --limit-rate={rate}k {file_location} 112 | eg2: wget --limit-rate=200k -i url_list.txt """ 113 | ## Wait between downloads of smaller files (--wait) 114 | ##------------------------------------- 115 | """ >>>>>> --wait 116 | eg: wget --wait={seconds} {file_location} 117 | eg2: wget --wait=2.5 -i url_list.txt """ 118 | 119 | 120 | 121 | ## EXERCISES============ 122 | 123 | ## Setting constraints for multiple file downloads 124 | """---What is the best way to set constraints for multiple file downloads using wget? """ 125 | # Store all URL locations in a text file (e.g. url_list.txt) and iteratively download using wget and option flag -i 126 | # because -i iterates through the listed files but does not by itself set any constraints for the downloads.
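"""
A minimal Python sketch of the same workflow (an added illustration, not part of the course
material): it simply shells out to wget via subprocess, so it assumes wget is installed and
that url_list.txt exists in the working directory. Kept commented out like the other
commands in this file.
"""
# import subprocess
# # Download every URL listed in url_list.txt, capped at 200 KB/s, resuming partial files (-c)
# subprocess.run(['wget', '-c', '--limit-rate=200k', '-i', 'url_list.txt'], check=True)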
127 | 128 | 129 | 130 | ## Creating wait time using Wget 131 | """ View url_list.txt to verify content """ 132 | # cat url_list.txt 133 | """ Create a mandatory 1 second pause between downloading all files in url_list.txt """ 134 | # wget --wait=1 -i url_list.txt 135 | """ Take a look at all files downloaded """ 136 | # ls 137 | 138 | 139 | 140 | ## Data downloading with Wget and curl 141 | """ Use curl, download and rename a single file from URL""" 142 | # curl -o Spotify201812.zip -L https://assets.datacamp.com/production/repositories/4180/datasets/eb1d6a36fa3039e4e00064797e1a1600d267b135/201812SpotifyData.zip 143 | """ Unzip, delete, then re-name to Spotify201812.csv """ 144 | # unzip Spotify201812.zip && rm Spotify201812.zip 145 | # mv 201812SpotifyData.csv Spotify201812.csv 146 | """ View url_list.txt to verify content """ 147 | # cat url_list.txt 148 | """ Use Wget, limit the download rate to 2500KB/s, download all files in url_list.txt """ 149 | # wget --limit-rate=2500k -i url_list.txt 150 | """ Take a look at all files downloaded """ 151 | # ls 152 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/1. OLTP and OLAP.txt: -------------------------------------------------------------------------------- 1 | | How should we organize data? | 2 | 3 | . Schemas : how data be logically organized 4 | . Normalization : may data have minimal or redundant dependency? 5 | . Views : what joins more often? 6 | . Access Control : who and which levels of accesses 7 | . DBMS : pick SQL or noSQL 8 | 9 | # Depends on the intended use of data 10 | 11 | | approaches to dataprocessing | 12 | 13 | -> OLTP : Online Transaction Processing 14 | - has data from past hour 15 | - data inserted and updated more often 16 | - Uses Operational database 17 | -> OLAP : Online Analytical Processing 18 | - helps business decision making 19 | - Queries larges amounts of data 20 | - Uses Data-warehouse 21 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/2. Which is better?.py: -------------------------------------------------------------------------------- 1 | """ 2 | Which is better? 3 | The city of Chicago receives many 311 service requests throughout the day. 311 service requests are non-urgent community requests, ranging from graffiti removal to street light outages. Chicago maintains a data repository of all these services organized by type of requests. In this exercise, Potholes has been loaded as an example of a table in this repository. It contains pothole reports made by Chicago residents from the past week. 4 | 5 | Explore the dataset. What data processing approach is this larger repository most likely using? 6 | 7 | Instructions 8 | 50 XP 9 | Possible Answers 10 | 11 | OLTP because this table could not be used for any analysis. 12 | OLAP because each record has a unique service request number. 13 | OK OLTP because this table's structure appears to require frequent updates. 
14 | OLAP because this table focuses on pothole requests only 15 | 16 | """ 17 | SELECT * 18 | FROM Potholes; 19 | 20 | ''' 21 | creation_date current_status completion_date service_request_id type_of_service most_recent_action street_address zip 22 | 2018-12-17T00:00:00.000 Open null 18-03380123 Pothole in Street null 10300 S WALLACE ST 60628.0 23 | 2018-12-18T00:00:00.000 Open null 18-03388180 Pothole in Street null 24 | ''' 25 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/3. Name that data type.py: -------------------------------------------------------------------------------- 1 | """ 2 | Name that data type! 3 | In the previous video, you learned about structured, semi-structured, and unstructured data. Structured data is the easiest to analyze because it is organized and cleaned. On the other hand, unstructured data is schemaless, but scales well. In the middle we have semi-structured data for everything in between. 4 | 5 | Instructions 6 | 100XP 7 | Each of these cards holds a type of data. Place them in the correct category. 8 | 9 | """ 10 | ''' 11 | -> Unstructured 12 | - Zip file of all text messages ever received 13 | - Images in your photo library 14 | - To-do notes in a text editor 15 | 16 | -> Semi-structured: 17 | - JSON object of tweets outputted in real-time by the Twitter API 18 | - CSVs of open data downloaded from your local government websites 19 | - HTML syntax 20 | 21 | -> Structured: 22 | - A relational database with latest withdrawals 23 | and deposits made by clients 24 | 25 | ''' 26 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/4. Ordering ETL Tasks.py: -------------------------------------------------------------------------------- 1 | """ 2 | Ordering ETL Tasks 3 | You have been hired to manage data at a small online clothing store. Their system is quite outdated because their only data repository is a traditional database to record transactions. 4 | 5 | You decide to upgrade their system to a data warehouse after hearing that different departments would like to run their own business analytics. You reason that an ELT approach is unnecessary because there is relatively little data (< 50 GB). 6 | 7 | Instructions 8 | 100XP 9 | - In the ETL flow you design, different steps will take place. Place the steps in the most appropriate order. 10 | 11 | """ 12 | ''' 13 | 1/ eCommerce API outputs real time data of transactions 14 | 2/ Python script drops null rows and cleans data into pre-determined columns 15 | 3/ Resulting dataframe is written into an AWS Redshift Warehouse 16 | ''' 17 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/5. Recommend a storage solution.py: -------------------------------------------------------------------------------- 1 | """ 2 | Recommend a storage solution 3 | ----When should you choose a data warehouse over a data lake? 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | - To train a machine learning model with 150 GB of raw image data. # data lake because the data is large and unstructured. 10 | - To store real-time social media posts that may be used for future analysis # Data warehouses, because it will be used for analysis.
11 | - To store customer data that needs to be updated regularly # Data warehouses -> read-only 12 | OK - To create accessible and isolated data repositories for other analysts # Analysts appreciate working in a data warehouse more because of its ..> 13 | organization of structured data that makes analysis easier. 14 | 15 | """ 16 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/6. Database design.txt: -------------------------------------------------------------------------------- 1 | | database design | 2 | 3 | -> Database design : . determines how data is logically stored 4 | . how it would be read and updated 5 | 6 | 7 | uses 8 | -> Database model : high level specifications for db structure 9 | - most popular : relational db 10 | - other : noSQL model, object-oriented model, network model 11 | 12 | uses 13 | -> Database schemas : blueprint of the db 14 | . defines tables, fields, indexes, views 15 | . when inserting data in relational db, schemas must be respected 16 | 17 | | levels of data models | 18 | 19 | - conceptual data model : describes entities, relationships and attributes 20 | / tools / : data structure diagrams, 21 | / eg / : entity-relationship diagrams, UML diagrams 22 | 23 | - logical data model : defines tables, columns, relationships 24 | / tools / : db models and schemas 25 | / eg / : relational model and star schema 26 | 27 | - physical data model : describes physical storage 28 | / tools / : partitions, CPUs, indexes, backup sys, tablespaces 29 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/7. Classifying data models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Classifying data models 3 | In the previous video, we learned about three different levels of data models: conceptual, logical, and physical. 4 | 5 | Instructions 6 | 100XP 7 | - Each of these cards holds a tool or concept that fits into a certain type of data model. Place the cards in the correct category. 8 | 9 | """ 10 | ''' 11 | @ Conceptual Data Model 12 | . Gathers business requirements 13 | . Entities, attributes, and relationships 14 | 15 | @ Logical Data Model 16 | . Relational model 17 | . Determining tables and columns 18 | 19 | @ Physical Data Model 20 | . File structure of data storage 21 | ''' 22 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/8. Deciding fact and dimension tables.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deciding fact and dimension tables 3 | Imagine that you love running and data. It's only natural that you begin collecting data on your weekly running routine. 4 | You're most concerned with tracking how long you are running each week. You also record the route and the distances of your runs. 5 | You gather this data and put it into one table called Runs with the following schema: 6 | 7 | runs 8 | duration_mins - float 9 | week - int 10 | month - varchar(160) 11 | year - int 12 | park_name - varchar(160) 13 | city_name - varchar(160) 14 | distance_km - float 15 | route_name - varchar(160) 16 | . . . . .. .. . . . 17 | Instructions 1/2 18 | - Question 19 | ----Out of these possible answers, what would be the best way to organize the fact table and dimensional tables?
20 | 21 | Possible Answers 22 | 23 | OK A fact table holding duration_mins and foreign keys to dimension tables holding route details and week details, respectively. 24 | A fact table holding week,month, year and foreign keys to dimension tables holding route details and duration details, respectively. 25 | A fact table holding route_name,park_name, distance_km,city_name, and foreign keys to dimension tables holding week details and duration details, respectively. 26 | 27 | """ 28 | """ 29 | Instructions 2/2 30 | - Create a dimension table called route that will hold the route information. 31 | - Create a dimension table called week that will hold the week information. 32 | 33 | """ 34 | #-- Create a route dimension table 35 | CREATE TABLE route( 36 | route_id INTEGER PRIMARY KEY, 37 | park_name VARCHAR(160) NOT NULL, 38 | city_name VARCHAR(160) NOT NULL, 39 | distance_km float NOT NULL, 40 | route_name VARCHAR(160) NOT NULL 41 | ); 42 | #-- Create a week dimension table 43 | CREATE TABLE week( 44 | week_id INTEGER PRIMARY KEY, 45 | week integer NOT NULL, 46 | month VARCHAR(160) NOT NULL, 47 | year integer NOT NULL 48 | ); 49 | -------------------------------------------------------------------------------- /Database Design/1. Processing, Storing, and Organizing Data/9. Querying the dimensional model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Querying the dimensional model 3 | Here it is! The schema reorganized using the dimensional model: 4 | 5 | Let's try to run a query based on this schema. How about we try to find the number of minutes we ran in July, 2019? We'll break this up in two steps. First, we'll get the total number of minutes recorded in the database. Second, we'll narrow down that query to week_id's from July, 2019. 6 | 7 | Instructions 2/2 8 | 9 | 1- Calculate the sum of the duration_mins column. 10 | 11 | 2- Join week_dim and runs_fact. 12 | 2- Get all the week_id's from July, 2019. 13 | 14 | """ 15 | #1 16 | #-- Select the sum of the duration of all runs 17 | SELECT SUM(duration_mins) 18 | FROM runs_fact; 19 | 20 | #2 21 | #-- Get all the week_id's that are from July, 2019 22 | INNER JOIN week_dim ON runs_fact.week_id = week_dim.week_id 23 | WHERE month = 'July' and year = '2019'; 24 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/I. Intro to AWS and Boto3.py: -------------------------------------------------------------------------------- 1 | """ 2 | ||||| 3 | From learning how AWS works to creating S3 buckets and uploading files to them. You will master the basics of setting up AWS and uploading files to the cloud! 4 | ||||| 5 | 6 | | INITIALIZE BOTO3 CLIENT | 7 | 8 | -> Boto 3 : 9 | import boto3 10 | 11 | s3 = boto3.client('s3', 12 | region_name='us-east-1', 13 | aws_access_key_id = AWS_KEY_ID, 14 | aws_secret_access_key = AWS_SECRET) 15 | 16 | response = s3.list_buckets() 17 | --------> 18 | # in AWS Console creation of USER 19 | - Create KEYs w/ IAM 20 | - add user 21 | - programatic user 22 | - select policies: 23 | . s3 full access 24 | . amazon sns full access 25 | . amazon rekognition full access 26 | . 
comprehend full access 27 | - Grab KEY and SECRET and store smw safe 28 | 29 | | AWS services | 30 | 31 | IAM : manage permissions 32 | S3 : store files in cloud 33 | SNS : emails / texts based on conditions and events in pipelines 34 | COMPREHEND : sentiment analysis on blocks of text 35 | REKOGNITION : extract text from images an dlooks 4 cats in pictures 36 | 37 | """ 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/II. Your first boto3 client.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Your first boto3 client / 4 | 5 | Sam wants to cast off the shackles of only being able to use her computer for storage and compute. She is learning how to use the awesome power of the cloud 6 | to create data pipelines and automatically generate reports. 7 | 8 | Before she can do all that, she needs to create her first boto3 client and check out what buckets already exist in S3. 9 | 10 | Her AWS key and AWS secret key have been stored in AWS_KEY_ID and AWS_SECRET respectively. 11 | In this exercise, you will help Sam by creating your first boto3 client to AWS! 12 | 13 | Instructions 14 | 100 XP 15 | - Generate a boto3 client for interacting with s3. 16 | - Specify 'us-east-1' for the region_name. 17 | - Use AWS_KEY_ID and AWS_SECRET to set up the credentials. 18 | - Print the buckets. 19 | 20 | """ 21 | # Generate the boto3 client for interacting with S3 22 | s3 = boto3.client('s3', region_name='us-east-1', 23 | # Set up AWS credentials 24 | aws_access_key_id=AWS_KEY_ID, 25 | aws_secret_access_key=AWS_SECRET) 26 | # List the buckets 27 | buckets = s3.list_buckets() 28 | 29 | # Print the buckets 30 | print(buckets) 31 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/III. Multiple clients Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. You will help Sam initialize a boto3 client for S3, and another client for SNS. She will use the S3 client to list the buckets in S3. She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). Instructions 100 XP Generate the boto3 clients for interacting with S3 and SNS. Specify 'us-east-1' for the region_name for both clients. Use AWS_KEY_ID and AWS_SECRET to set up the credentials. List and print the SNS topics..py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Multiple clients / 4 | 5 | Sam knows that she will often have to work with more than one service at once. She wants to practice creating two separate clients for two different services in boto3. 6 | 7 | When she is building her workflows, she will make multiple Amazon Web Services interact with each other, with a script executed on her computer. 8 | 9 | Her AWS key id and AWS secret have been stored in AWS_KEY_ID and AWS_SECRET respectively. 10 | 11 | '''You will help Sam initialize a boto3 client for S3, and another client for SNS.''' 12 | 13 | She will use the S3 client to list the buckets in S3. 
She will use the SNS client to list topics she can publish to (you will learn about SNS topics in Chapter 3). 14 | 15 | Instructions 16 | 100 XP 17 | - Generate the boto3 clients for interacting with S3 and SNS. 18 | - Specify 'us-east-1' for the region_name for both clients. 19 | - Use AWS_KEY_ID and AWS_SECRET to set up the credentials. 20 | - List and print the SNS topics. 21 | 22 | """ 23 | # Generate the boto3 client for interacting with S3 and SNS 24 | s3 = boto3.client('s3', region_name='us-east-1', 25 | aws_access_key_id=AWS_KEY_ID, 26 | aws_secret_access_key=AWS_SECRET) 27 | 28 | sns = boto3.client('sns', region_name='us-east-1', 29 | aws_access_key_id=AWS_KEY_ID, 30 | aws_secret_access_key=AWS_SECRET) 31 | 32 | # List S3 buckets and SNS topics 33 | buckets = s3.list_buckets() 34 | topics = sns.list_topics() 35 | 36 | # Print out the list of SNS topics 37 | print(topics) 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/IV. Removing repetitive work.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Removing repetitive work / 4 | 5 | Automation is at the heart of data engineering. Sam wants to eliminate boring and repetitive tasks from her and her coworkers plate. 6 | She wants to let machines do what they do best, while empowering humans to do what they do best. 7 | 8 | Sam has a coworker - John - whose job is to build the department budget. He also got stuck with the job of looking at images coming from GetItDone user reports. 9 | If he sees a cat in the image, he has to send a text message to the Animal services manager with the image and its location. 10 | 11 | Sam is thinking of building a system that would take this repetitive work off John's plate. 12 | 13 | Such a system will: 14 | 15 | Get images taken by city trash trucks and store them; 16 | If the image contains a cat, the system will send an SMS to the Animal Services manager. 17 | What AWS services would Sam need to implement this system? 18 | --- 19 | 20 | IAM 21 | S3 22 | SNS 23 | Rekognition 24 | ANSW: All of the Above 25 | 26 | """ 27 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/V. Diving into buckets.txt: -------------------------------------------------------------------------------- 1 | | BUCKETS | 2 | 3 | . destok folders 4 | . own permission policies 5 | . website storage 6 | . generate logs ab activity 7 | 8 | | S3 components-Objects | 9 | 10 | contain ANYTTHING : music, csv, image, video, log file 11 | 12 | | What can we do w/ buckets ? 
| 12 | 13 | create, list, delete buckets 14 | 15 | | CREATE A BUCKET | 16 | 17 | # create boto3 client 18 | import boto3 19 | 20 | s3 = boto3.client('s3', 21 | region_name='us-east-1', 22 | aws_access_key_id = AWS_KEY_ID, 23 | aws_secret_access_key = AWS_SECRET) 24 | 25 | # create a bucket and give it a name 26 | # names must be unique across S3 27 | bucket = s3.create_bucket(Bucket='grid-requests') 28 | 29 | | BUCKET ~ METHODS | 30 | 31 | > create bucket 32 | bucket = s3.create_bucket(Bucket='grid-requests') 33 | 34 | > list buckets 35 | bucket_response = s3.list_buckets() 36 | 37 | # get buckets dictionary 38 | buckets = bucket_response['Buckets'] 39 | print(buckets) 40 | 41 | > delete buckets 42 | bucket_response = s3.delete_bucket(Bucket='grid-requests') 43 | 44 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VI. creating a bucket.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Creating a bucket 4 | Sam is dipping her toes in the water, getting ready to build her first pipeline. 5 | 6 | Get It Done is the app the City released for residents to report problems. There are lots of problems to report, and lots of data gets generated. 7 | 8 | Get It Done services poster 9 | 10 | She will be picking up daily reports generated by Get It Done and placing them in the 'gim-staging' bucket. Then, she will clean the data and place the new dataset in the 'gim-processed' bucket. 11 | 12 | She also wants to create a 'gim-test' bucket to experiment with. 13 | 14 | Help Sam take the first step to her pipeline dreams. Help her create her first bucket, 'gim-staging'! 15 | 16 | Instructions 17 | 100 XP 18 | - Create a boto3 client to S3. 19 | - Create 'gim-staging', 'gim-processed' and 'gim-test' buckets. 20 | - Print the response from creating the 'gim-staging' bucket. 21 | 22 | """ 23 | import boto3 24 | 25 | # Create boto3 client to S3 26 | s3 = boto3.client('s3', region_name='us-east-1', 27 | aws_access_key_id=AWS_KEY_ID, 28 | aws_secret_access_key=AWS_SECRET) 29 | 30 | # Create the buckets 31 | response_staging = s3.create_bucket(Bucket='gim-staging') 32 | response_processed = s3.create_bucket(Bucket='gim-processed') 33 | response_test = s3.create_bucket(Bucket='gim-test') 34 | 35 | # Print out the response 36 | print(response_staging) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VII. Listing buckets.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Listing buckets 4 | Sam has successfully created the buckets for her pipeline. Often, data engineers build checks into the pipeline to make sure their previous operation succeeded. Sam wants to build in a check to make sure her buckets actually got created. 5 | 6 | She also wants to practice listing buckets. Listing buckets will let her perform operations on multiple buckets using a for loop. 7 | 8 | She has already created the boto3 client for S3, and assigned it to the s3 variable. 9 | 10 | Help Sam get a list of all the buckets in her S3 account and print their names! 11 | 12 | Instructions 13 | 100 XP 14 | - Get the buckets from S3. 15 | - Iterate over the bucket key from response to access the list of buckets. 16 | - Print the name of each bucket.
17 | 18 | """ 19 | # Get the list_buckets response 20 | response = s3.list_buckets() 21 | 22 | # Iterate over Buckets from .list_buckets() response 23 | for bucket in response['Buckets']: 24 | 25 | # Print the Name for each bucket 26 | print(bucket['Name']) 27 | 28 | ''' out: 29 | gim-staging 30 | gim-processed 31 | gim-test 32 | ''' 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VIII. Deleting a bucket.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Deleting a bucket 4 | Sam is feeling more and more confident in her AWS and S3 skills. After playing around for a bit, she decides that the gim-test bucket no longer fits her pipeline and wants to delete it. It's starting to feel like dead weight, and Sam doesn't want it littering her beautiful bucket list. 5 | 6 | She has already created the boto3 client for S3, and assigned it to the s3 variable. 7 | 8 | Help Sam do some clean up, and delete the gim-test bucket. 9 | 10 | Instructions 11 | 100 XP 12 | - Delete the 'gim-test' bucket. 13 | - Get the list of buckets from S3. 14 | - Print each 'Buckets' 'Name'. 15 | 16 | """ 17 | 18 | # Delete the gim-test bucket 19 | s3.delete_bucket(Bucket='gim-test') 20 | 21 | # Get the list_buckets response 22 | response = s3.list_buckets() 23 | 24 | # Print each Buckets Name 25 | for bucket in response['Buckets']: 26 | print(bucket['Name']) 27 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/VIV. Deleting multiple buckets.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Deleting multiple buckets 4 | The Get It Done app used to be called Get It Made. Sam always thought it was a terrible name, but it got stuck in her head nonetheless. 5 | 6 | When she was making the pipeline buckets, she used the gim- abbreviation for the old name. She decides to switch her abbreviation to gid- to accurately reflect the app's real (and better) name. 7 | 8 | She has already set up the boto3 S3 client and assigned it to the s3 variable. 9 | 10 | Help Sam delete all the buckets in her account that start with the gim- prefix. Then, help her make a 'gid-staging' and a 'gid-processed' bucket. 11 | 12 | Instructions 13 | 100 XP 14 | - Get the buckets from S3. 15 | - Delete the buckets that contain 'gim' and create the 'gid-staging' and 'gid-processed' buckets. 16 | - Print the new bucket names. 17 | 18 | """ 19 | # Get the list_buckets response 20 | response = s3.list_buckets() 21 | 22 | # Delete all the buckets with 'gim', create replacements. 23 | for bucket in response['Buckets']: 24 | if 'gim' in bucket['Name']: 25 | s3.delete_bucket(Bucket=bucket['Name']) 26 | 27 | s3.create_bucket(Bucket='gid-staging') 28 | s3.create_bucket(Bucket='gid-processed') 29 | 30 | # Print bucket listing after deletion 31 | response = s3.list_buckets() 32 | for bucket in response['Buckets']: 33 | print(bucket['Name']) 34 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/X. 
Uploading and retrieving files.txt: -------------------------------------------------------------------------------- 1 | | differences Bucket/Object | 2 | 3 | Bucket: 4 | - name 5 | - name is STRING 6 | - unique name in all S3 7 | - contains many objects 8 | 9 | Object: 10 | - has a KEY 11 | - Name is full path from bucket root 12 | - Unique KEY in bucket 13 | - can only be in one parent bucket 14 | 15 | 1 > create boto3 client 16 | 2 > upload files to a bucket: 17 | s3.upload_file( 18 | Filename='grid_request_2020_15_11.csv', 19 | Bucket='grid-requests', 20 | Key='grid_request_2020_15_11.csv') # what we name obj. in S3 21 | 3 > List objects in a bucket 22 | response = s3.list_objects( 23 | Bucket='grid-requests', 24 | MaxKeys=2, # limit response 25 | Prefix='grid_requests_2019') 26 | 27 | # Get object metadata and print it 28 | response = s3.head_object(Bucket='gid-staging', 29 | Key='2019/final_report_01_01.csv') 30 | 31 | # Print the size of the uploaded object 32 | print(response['ContentLength']) 33 | 34 | 35 | 4 > delete object in bucket: 36 | s3.delete_object( 37 | Bucket='grid-requests', 38 | Key='grid_requests_2018') 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/1. Putting Files in the Cloud!/XI. Spring cleaning.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Spring cleaning 4 | Sam's pipeline has been running for a long time now. Since the beginning of 2018, her automated system has been diligently uploading her report to the gid-staging bucket. 5 | 6 | In City governments, record retention is a huge issue, and many government officials prefer not to keep records in existence past the mandated retention dates. 7 | 8 | People purging records 9 | 10 | As time has passed, the City Council asked Sam to clean out old CSV files from previous years that have passed the retention period. 2018 is safe to delete. 11 | 12 | Sam has initialized the client and assigned it to the s3 variable. Help her clean out all records for 2018 from S3! 13 | 14 | Instructions 15 | 100 XP 16 | - List only objects that start with '2018/final_' in 'gid-staging' bucket. 17 | - Iterate over the objects, deleting each one. 18 | - Print the keys of remaining objects in the bucket. 19 | 20 | """ 21 | # List only objects that start with '2018/final_' 22 | response = s3.list_objects(Bucket='gid-staging', 23 | Prefix='2018/final_') # list ones w/ this Prefix 24 | 25 | # Iterate over the objects 26 | if 'Contents' in response: 27 | for obj in response['Contents']: # delete obj in response['Contents'] 28 | # Delete the object 29 | s3.delete_object(Bucket='gid-staging', Key=obj['Key']) 30 | 31 | # Print the keys of remaining objects in the bucket 32 | response = s3.list_objects(Bucket='gid-staging') 33 | 34 | for obj in response['Contents']: 35 | print(obj['Key']) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/I. Keeping objects secure.txt: -------------------------------------------------------------------------------- 1 | You will learn how to set files to be public or private, and cap off what you learned by generating web-based reports!
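A quick added sketch (an illustration, not part of the course notes; it assumes the boto3 S3 client from Chapter 1 is stored in s3 and reuses the 'gid-staging' bucket and 'final_report.csv' key from later exercises) contrasting the two sharing styles covered in this chapter:

# Public: attach a permanent public-read ACL; anyone with the URL can read the object
s3.put_object_acl(Bucket='gid-staging', Key='final_report.csv', ACL='public-read')
public_url = 'https://gid-staging.s3.amazonaws.com/final_report.csv'

# Private: keep the object private and grant temporary access with a presigned URL (1 hour)
share_url = s3.generate_presigned_url(
    ClientMethod='get_object',
    ExpiresIn=3600,
    Params={'Bucket': 'gid-staging', 'Key': 'final_report.csv'})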
2 | 3 | | AWS PERMISSION SYS | 4 | 5 | # Multi-user environments 6 | -> IAM : attaching policies to users 7 | -> Bucket Policies: gives control on bucket and Object within it 8 | # 9 | -> ACL : permission on specific obj within a bucket 10 | (entity attached) 11 | 12 | - public 13 | 14 | > put ACL to uploaded obj: 15 | s3.put_object_acl(Bucket='', Key='puthole.csv', 16 | ACL='public-read') 17 | > ACL but w/ ExtraArgs: 18 | s3.upload_file(Bucket='x',Filename='asd.csv',Key='asd.csv', 19 | ExtraArgs={'ACL':'public-read'}) 20 | > accessing public objects: 21 | # https://{bucket}.{key} 22 | https://grid-requests.2019/potholes.csv 23 | > generate public obj. URL 24 | url='https://{}.{}'.format( 25 | "grid-requests", 26 | "2019/putholes.csv") 27 | > read URL into pandas 28 | df = pd.read_csv(url) 29 | - private 30 | 31 | -> Presigned URL: temporary access to obj 32 | 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/II. Making an object public.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Making an object public 4 | Sam is trying to decide how to upload a public CSV and share it with the world as Open Data. 5 | 6 | She only wants to make that specific CSV public, and not the bucket. 7 | 8 | She also doesn't want to make any of the other objects in the bucket public. 9 | 10 | Help her decide the best way for her to do this. 11 | 12 | Answer the question 13 | 50XP 14 | 15 | A presigned URL. # access to a private object temporarily 16 | OK A public-read ACL. 17 | A bucket policy. # not multi user environment. 18 | An IAM Policy. # not multi user environment. 19 | 20 | """ 21 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/III. Uploading a public report.py: -------------------------------------------------------------------------------- 1 | """ 2 | Uploading a public report 3 | As you saw in Chapter 1, Get It Done is an App that lets residents report problems like potholes and broken sidewalks. 4 | 5 | 6 | 7 | The data from the app is a very hot topic political issue. Residents keep saying that the City does not distribute work evenly across neighborhoods when issues are reported. The City Council wants to be transparent with the public and has asked Sam to publish the aggregated Get It Done reports and make them publicly available. 8 | 9 | Sam has initialized the boto3 S3 client and assigned it to the s3 variable. 10 | 11 | In this exercise, you will help her increase government transparency by uploading public reports to the gid-staging bucket. 12 | 13 | Instructions 14 | 100 XP 15 | - Upload 'final_report.csv' to the 'gid-staging' bucket. 16 | - Set the key to '2019/final_report_2019_02_20.csv'. 17 | - Set the ACL to 'public-read'. 18 | 19 | """ 20 | # Upload the final_report.csv to gid-staging bucket 21 | s3.upload_file( 22 | # Complete the filename 23 | Filename='./final_report.csv', 24 | # Set the key and bucket 25 | Key='2019/final_report_2019_02_20.csv', 26 | Bucket='gid-staging', 27 | # During upload, set ACL to public-read 28 | ExtraArgs = { 29 | 'ACL': 'public-read'}) 30 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/IV. 
Making multiple files public.py: -------------------------------------------------------------------------------- 1 | """ 2 | Making multiple files public 3 | Transparency is important to City Council. They want to empower residents to analyze Get It Done requests and how they get prioritized. 4 | 5 | They asked Sam to make all previous Get It Done aggregated reports since the beginning of 2019 public as well. 6 | Sam has initialized the boto3 S3 client and assigned it to the s3 variable. 7 | 8 | In this exercise, you will help Sam open up the data by setting the ACL of every object in the gid-staging bucket to public-read, opening up the objects to the world! 9 | 10 | Instructions 11 | 100 XP 12 | - List the objects in 'gid-staging' bucket starting with '2019/final_'. 13 | - For each file in the response, give it an ACL of 'public-read'. 14 | - Print the Public Object URL of each object. 15 | 16 | """ 17 | # List only objects that start with '2019/final_' 18 | response = s3.list_objects( 19 | Bucket='gid-staging', Prefix='2019/final_') 20 | 21 | # Iterate over the objects 22 | for obj in response['Contents']: 23 | 24 | # Give each object ACL of public-read 25 | s3.put_object_acl(Bucket='gid-staging', 26 | Key=obj['Key'], 27 | ACL='public-read') 28 | 29 | # Print the Public Object URL for each object 30 | print("https://{}.s3.amazonaws.com/{}".format( 'gid-staging', obj['Key'])) 31 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/V. Accessing private objects in S3.txt: -------------------------------------------------------------------------------- 1 | > accessing private files 2 | obj = s3.get_object(Bucket='x', Key='2019/asdads.csv') 3 | print(obj) 4 | > Read 'StreamingBody' into Pandas 5 | pd.read_csv(obj['Body']) 6 | 7 | > Generate pre-signed URL 8 | share_url = s3.generate_presigned_url( 9 | ClientMethod='get_object', 10 | ExpiresIn=3600, 11 | Params={'Bucket':'grid-requests','Key':'potholes.csv'} 12 | > open into Pandas 13 | pd.read_csv(share_url) 14 | 15 | > Load multiple files into 1 df 16 | df_list=[] 17 | 18 | # request the list of csv's from S3 w/ Prefix 19 | response = s3.lis_objects(Bucket='x', Prefix='2019/') 20 | 21 | # get response contents 22 | request_files= response["Contents"] 23 | 24 | # iterate over each obj 25 | for file in request_files: 26 | obj = s3.get_object(Bucket='x', Key=file['Key']) 27 | 28 | # Read it as df 29 | obj_df = pd.read_csv(obj['Body']) 30 | 31 | # Append df to a list 32 | df_list.append(obj_df) 33 | 34 | # Concatenate all df in a list 35 | df = pd.concat(df_list) 36 | 37 | # preview df 38 | df.head() 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VI. Generating a presigned URL.py: -------------------------------------------------------------------------------- 1 | """ 2 | Generating a presigned URL 3 | Sam got a special request from City Council to analyze whether the City is prioritizing requests in District 11, while de-prioritizing requests in the less affluent district 12. They asked her to keep this report confidential, as they would like to see it before it goes public to the media. 4 | 5 | Sam has generated the report and is ready to share it with the City council, but making it public makes her too paranoid. She decided to provide the Council with a presigned URL so they can temporarily access the report for 1 hour. 
6 | 7 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. 8 | 9 | Help her generate a presigned URL valid for 1 hour to 'final_report.csv' in the 'gid-staging' bucket. Then, print it out for the City Council! 10 | 11 | Instructions 12 | 100 XP 13 | - Generate a presigned URL for final_report.csv that lasts 1 hour and allows the user to get the object. 14 | - Print out the generated presigned URL. 15 | 16 | """ 17 | # Generate presigned_url for the uploaded object 18 | share_url = s3.generate_presigned_url( 19 | # Specify allowable operations 20 | ClientMethod='get_object', 21 | # Set the expiration time 22 | ExpiresIn=3600, 23 | # Set bucket and shareable object's name 24 | Params={'Bucket': 'gid-staging','Key': 'final_report.csv'} 25 | ) 26 | 27 | # Print out the presigned URL 28 | print(share_url) 29 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VII. Opening a private file.py: -------------------------------------------------------------------------------- 1 | """ 2 | Opening a private file 3 | The City Council wants to see the bigger trend and have asked Sam to total up all the requests since the beginning of 2019. In order to do this, Sam has to read daily CSVs from the 'gid-requests' bucket and concatenate them. However, the gid-requests files are private. She has access to them via her key, but the world cannot access them. 4 | 5 | In this exercise, you will help Sam see the bigger picture by reading these private files into pandas and concatenating them into one DataFrame! 6 | 7 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. She has listed all the objects in gid-requests in the response variable. 8 | 9 | Instructions 10 | 100 XP 11 | - For each file in response, load the object from S3. 12 | - Load the object's StreamingBody into pandas, and append to df_list. 13 | - Concatenate all the DataFrames with pandas. 14 | - Preview the resulting DataFrame 15 | 16 | """ 17 | df_list = [ ] 18 | 19 | for file in response['Contents']: 20 | # For each file in response load the object from S3 21 | obj = s3.get_object(Bucket='gid-requests', Key=file['Key']) 22 | # Load the object's StreamingBody with pandas 23 | obj_df = pd.read_csv(obj['Body']) 24 | # Append the resulting DataFrame to list 25 | df_list.append(obj_df) 26 | 27 | # Concat all the DataFrames with pandas 28 | df = pd.concat(df_list) 29 | 30 | # Preview the resulting DataFrame 31 | df.head() 32 | 33 | ''' service_name request_count 34 | 0 72 Hour Violation 8 35 | 1 Graffiti Removal 2 36 | 2 Missed Collection 12 37 | 3 Street Light Out 21 38 | 4 Pothole 33''' 39 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VIII. 
Sharing files through a website.txt: -------------------------------------------------------------------------------- 1 | | HTML tables in pandas | 2 | 3 | > Convert df to HTML 4 | df.to_html('table_agg.html') 5 | > Just certain columns to HTML 6 | df.to_html('table_agg.html', 7 | render_links=True, 8 | columns=['service_name', 'request_count', 'info_link'], 9 | border=0) 10 | 11 | | upload HTML file to S3 | 12 | 13 | s3.upload_file( 14 | Filename='./table_agg.html', 15 | Bucket='x', 16 | Key='table.html', 17 | ExtraArgs={ 18 | 'ContentType': 'text/html', # image/png 19 | 'ACL': 'public-read' 20 | }) 21 | 22 | | IANA MEDIA TYPES | 23 | 24 | text/html 25 | image/png 26 | application/pdf 27 | text/csv 28 | 29 | | Generating index page | 30 | 31 | # create column 'Link' that contains website URL + key 32 | base_url="https://datacamp-website/" 33 | objects_df['Link'] = base_url + objects_df['Key'] 34 | 35 | # Write df to html 36 | objects_df.to_html('report_listing.html', 37 | columns=['Link','LastModified','Size'], 38 | render_links=True) 39 | 40 | | Uploading index page | 41 | 42 | s3.upload_file( 43 | Filename='./report_listing.html', 44 | Bucket='x', 45 | Key='index.html', 46 | ExtraArgs={ 47 | 'ContentType':'text/html', 48 | 'ACL': 'public-read'} 49 | ) 50 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/VIV. Generate HTML table from Pandas.py: -------------------------------------------------------------------------------- 1 | """ 2 | Generate HTML table from Pandas 3 | Residents often complain that they don't know the full range of services offered through the Get It Done application. 4 | 5 | In an effort to streamline the communication between the City government and the public, the City Council asked Sam to generate a table of all the services in the Get It Done system. 6 | 7 | The system is dynamic and grows on a weekly basis, adding additional services. Sam didn't want to waste time updating the file manually, and felt like she could automate the process. 8 | 9 | She loaded the DataFrame of available services into the services_df variable: 10 | 11 | services_df dataframe 12 | 13 | Instructions 14 | 100 XP 15 | - Generate an HTML table with no border and only the 'service_name' and 'link' columns. 16 | - Generate an HTML table with borders and all columns. 17 | - Make sure to set all URLs to be clickable. 18 | 19 | """ 20 | # Generate an HTML table with no border and selected columns 21 | services_df.to_html('./services_no_border.html', 22 | # Keep specific columns only 23 | columns=['service_name', 'link'], 24 | # Make URLs clickable, set border 25 | render_links=True, 26 | border=0) 27 | 28 | # Generate an HTML table with border and all columns 29 | services_df.to_html('./services_border_all_columns.html', 30 | render_links=True, 31 | border=1) 32 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/X. Upload an HTML file to S3.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload an HTML file to S3 3 | When the Streets Operations manager heard of what Sam has been working on, he asked her to build him a dashboard of Get It Done requests. 4 | 5 | He wants to use the dashboard to staff and schedule his team accordingly.
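(Aside, not part of the exercise: the notes above upload index.html as a public object; if the whole bucket should behave like a small website, S3 static website hosting can be enabled. A minimal sketch, assuming the same s3 client and a hypothetical bucket name.)

# Sketch: turn on static website hosting so index.html is served at the bucket's website endpoint
s3.put_bucket_website(
    Bucket='datacamp-public',
    WebsiteConfiguration={'IndexDocument': {'Suffix': 'index.html'}})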
6 | 7 | Sam generated a nice dashboard html file with the Python bokeh charting library: 8 | 9 | Bokeh Plot 10 | 11 | She wants to serve it as a website, providing an interactive dashboard to members of Streets Operations. 12 | 13 | Letting S3 serve the dashboard as a site lets her write a script that continuously updates the generated HTML file and keeps the Streets Operations team updated on the latest requests. 14 | 15 | She has already initialized the boto3 S3 client and assigned it to the s3 variable. 16 | 17 | Instructions 18 | 100 XP 19 | _ Upload the 'lines.html' file to 'datacamp-public' bucket. 20 | _ Specify the proper content type for the uploaded file. 21 | _ Specify that the file should be public. 22 | _ Print the Public Object URL for the new file. 23 | 24 | """ 25 | # Upload the lines.html file to S3 26 | s3.upload_file(Filename='lines.html', 27 | # Set the bucket name 28 | Bucket='datacamp-public', Key='index.html', 29 | # Configure uploaded file 30 | ExtraArgs = { 31 | # Set proper content type 32 | 'ContentType': 'text/html', 33 | # Set proper ACL 34 | 'ACL': 'public-read'}) 35 | 36 | # Print the S3 Public Object URL for the new file. 37 | print("http://{}.s3.amazonaws.com/{}".format('datacamp-public', 'index.html')) 38 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XI. Combine daily requests for February.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combine daily requests for February 3 | It's been a month since Sam last ran the report script, and it's time for her to make a new report for February. 4 | 5 | She wants to upload new reports for February and update the file listing, expanding on the work she completed during the last video lesson: 6 | 7 | Directory listing screenshot 8 | 9 | She has already created the boto3 S3 client and stored in the s3 variable. She stored the contents of her objects in request_files. 10 | 11 | You will help Sam aggregate the requests from February by downloading files from the gid-requests bucket and concatenating them into one DataFrame! 12 | 13 | Instructions 14 | 100 XP 15 | _ Load each object from s3. 16 | _ Read it into pandas and append it to df_list. 17 | _ Concatenate all DataFrames in df_list. 18 | _ Preview the DataFrame. 19 | 20 | """ 21 | df_list = [] 22 | 23 | # Load each object from s3 24 | for file in request_files: 25 | s3_day_reqs = s3.get_object(Bucket='gid-requests', 26 | Key=file['Key']) 27 | # Read the DataFrame into pandas, append it to the list 28 | day_reqs = pd.read_csv(s3_day_reqs['Body']) 29 | df_list.append(day_reqs) 30 | 31 | # Concatenate all the DataFrames in the list 32 | all_reqs = pd.concat(df_list) 33 | 34 | # Preview the DataFrame 35 | all_reqs.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XII. Upload aggregated reports for February.py: -------------------------------------------------------------------------------- 1 | """ 2 | In the last exercise, Sam downloaded the files for the month from the raw data bucket. 3 | 4 | Then she combined them all into one DataFrame that showcases all of the month's requests and requests type. 
5 | 6 | She stored this DataFrame in the variable all_reqs and used pandas' groupby functionality to count requests by service name, generating a new DataFrame agg_df: 7 | 8 | service_name count 9 | 0 72 Hour Violation 2910 10 | 1 Chain Link Fence Repair 90 11 | 2 Collections Truck Spill 30 12 | 3 Container Left Out 120 13 | 4 Dead Animal 360 14 | She has already created the boto3 S3 client in the s3 variable. 15 | 16 | Help her publish this month's request statistics. 17 | 18 | Write agg_df to CSV and HTML files, and upload them to S3 as public files. 19 | 20 | Instructions 21 | 100 XP 22 | _ Write CSV and HTML versions of agg_df and name them 'feb_final_report.csv' and 'feb_final_report.html' respectively. 23 | _ Upload both versions of agg_df to the gid-reports bucket and set them to public read. 24 | 25 | """ 26 | # Write agg_df to a CSV and HTML file with no border 27 | agg_df.to_csv('./feb_final_report.csv') 28 | agg_df.to_html('./feb_final_report.html', border=0) 29 | 30 | # Upload the generated CSV to the gid-reports bucket 31 | s3.upload_file(Filename='./feb_final_report.csv', 32 | Key='2019/feb/final_report.csv', Bucket='gid-reports', 33 | ExtraArgs = {'ACL': 'public-read'}) 34 | 35 | # Upload the generated HTML to the gid-reports bucket 36 | s3.upload_file(Filename='./feb_final_report.html', 37 | Key='2019/feb/final_report.html', Bucket='gid-reports', 38 | ExtraArgs = {'ContentType': 'text/html', 39 | 'ACL': 'public-read'}) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XIII. Update index to include February.py: -------------------------------------------------------------------------------- 1 | """ 2 | Update index to include February 3 | In the previous two exercises, Sam has: 4 | 5 | Read the daily Get It Done request logs for February. 6 | Combined them into a single DataFrame. 7 | Generated a DataFrame with aggregated metrics (request counts by type). 8 | Wrote that DataFrame to CSV and HTML final report files. 9 | Uploaded these files to S3. 10 | Now, she wants these files to be accessible through the directory listing. Currently, it only shows links for January reports: Screenshot of Get It Done reports listing 11 | 12 | She has created the boto3 S3 client and stored it in the s3 variable. 13 | 14 | Help Sam generate a new directory listing with February's uploaded reports and store it in a DataFrame. 15 | 16 | Instructions 17 | 100 XP 18 | _ List the 'gid-reports' bucket objects starting with '2019/'. 19 | _ Convert the content of the objects list to a DataFrame. 20 | _ Create a column 'Link' that contains Public Object URL + key. 21 | _ Preview the DataFrame. 22 | 23 | """ 24 | # List the gid-reports bucket objects starting with 2019/ 25 | objects_list = s3.list_objects(Bucket='gid-reports', Prefix='2019/') 26 | 27 | # Convert the response contents to DataFrame 28 | objects_df = pd.DataFrame(objects_list['Contents']) 29 | 30 | # Create a column "Link" that contains Public Object URL 31 | base_url = "http://gid-reports.s3.amazonaws.com/" 32 | objects_df['Link'] = base_url + objects_df['Key'] 33 | 34 | # Preview the resulting DataFrame 35 | objects_df.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/2. Sharing Files Securely/XIV.
Upload the new index.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload the new index 3 | Sam is almost done! In the last exercise, she generated a new directory listing, storing it in the objects_df variable: 4 | 5 | Screenshot of objects_df 6 | 7 | Sam has created the boto3 S3 client in the s3 variable. objects_df is populated with the new directory listing from the previous exercise. 8 | 9 | The next step is to write objects_df to an HTML file, and upload it to S3 replacing the current 'index.html' file. 10 | 11 | Help Sam update the directory listing, letting the public access reports for February as well as January! 12 | 13 | Instructions 14 | 100 XP 15 | _ Write objects_df to an HTML file 'report_listing.html' with clickable links. 16 | _ The HTML file should only contain 'Link', 'LastModified', and 'Size' columns. 17 | _ Overwrite the 'index.html' on S3 by uploading the new version of the file. 18 | 19 | """ 20 | # Write objects_df to an HTML file 21 | objects_df.to_html('report_listing.html', 22 | # Set clickable links 23 | render_links=True, 24 | # Isolate the columns 25 | columns=['Link', 'LastModified', 'Size']) 26 | 27 | # Overwrite index.html key by uploading the new file 28 | s3.upload_file( 29 | Filename='./report_listing.html', Key='index.html', 30 | Bucket='gid-reports', 31 | ExtraArgs = { 32 | 'ContentType': 'text/html', 33 | 'ACL': 'public-read' 34 | }) 35 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/1. SNS Topics.txt: -------------------------------------------------------------------------------- 1 | every topic has 2 | -> ARN: (amazon resource name) 3 | 4 | | create SNS Topic | 5 | 6 | 1. > INITIALIZE CLIENT: 7 | sns = boto3.client('sns', 8 | region_name='us-east-1', 9 | aws_access_key_id=AWS_KEY_ID, 10 | aws_secret_access_key=AWS_SECRET) 11 | 12 | 2. > CREATE TOPIC: 13 | response = sns.create_topic(Name='city_alerts') 14 | 15 | # store ARN key in variable 16 | topic_arn= response['TopicArn'] 17 | 18 | 2. > SHORTCUT. CREATE TOPIC & GRAB TOPICARN 19 | sns.create_topic(Name='city_alerts')['TopicArn'] 20 | 21 | | list topics | 22 | 23 | response= sns.list_topics() 24 | 25 | | delete topic | 26 | 27 | # pass topic Arn 28 | sns.delete_topic(TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts') 29 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/10. Sending a single SMS message.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending a single SMS message 3 | Elena asks Sam outside of work (per regulation) to send some thank you SMS messages to her largest donors. 4 | 5 | Sam believes in Elena and her goals, so she decides to help. 6 | 7 | She decides writes a quick script that will run through Elena's contact list and send a thank you text. 8 | 9 | Since this is a one-off run and Sam is not expecting to alert these people regularly, there's no need to create a topic and subscribe them. 10 | 11 | Sam has created the boto3 SNS client and stored it in the sns variable. The contacts variable contains Elena's contacts as a DataFrame. 12 | 13 | Help Sam put together a quick hello to Elena's largest supporters! 14 | 15 | Instructions 16 | 100 XP 17 | _ For every contact, send an ad-hoc SMS to the contact's phone number. 18 | _ The message sent should include the contact's name. 
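(For context, a guess at the shape of contacts, which the exercise does not show: the loop below assumes at least 'Name' and 'Phone' columns, e.g.)

# Hypothetical structure of the contacts DataFrame assumed by the loop
import pandas as pd
contacts = pd.DataFrame({'Name': ['Elena'], 'Phone': ['+16196777733']})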
19 | 20 | """ 21 | # Loop through every row in contacts 22 | for idx, row in contacts.iterrows(): 23 | 24 | # Publish an ad-hoc sms to the user's phone number 25 | response = sns.publish( 26 | # Set the phone number 27 | PhoneNumber = str(row['Phone']), 28 | # The message should include the user's name 29 | Message = 'Hello {}'.format(row['Name']) 30 | ) 31 | 32 | print(response) 33 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/11. Different protocols per topic level.py: -------------------------------------------------------------------------------- 1 | """ 2 | Different protocols per topic level 3 | Now that Sam has created the critical and extreme topics, she needs to subscribe the staff from her contact list into these topics. 4 | 5 | Sam decided that the people subscribed to 'critical' topics will only receive emails. On the other hand, people subscribed to 'extreme' topics will receive SMS - because those are pretty urgent. 6 | 7 | She has already created the boto3 SNS client in the sns variable. 8 | 9 | Help Sam subscribe the users in the contacts DataFrame to email or SMS notifications based on their department. This will help get the right alerts to the right people, making the City of San Diego run better and faster! 10 | 11 | Instructions 12 | 100 XP 13 | _ Get the topic name by using the 'Department' field in the contacts DataFrame. 14 | _ Use the topic name to create the critical and extreme TopicArns for a user's department. 15 | _ Subscribe the user's email address to the critical topic. 16 | _ Subscribe the user's phone number to the extreme topic. 17 | 18 | """ 19 | for index, user_row in contacts.iterrows(): 20 | # Get topic names for the users's dept 21 | critical_tname = '{}_critical'.format(user_row['Department']) 22 | extreme_tname = '{}_extreme'.format(user_row['Department']) 23 | 24 | # Get or create the TopicArns for a user's department. 25 | critical_arn = sns.create_topic(Name=critical_tname)['TopicArn'] 26 | extreme_arn = sns.create_topic(Name=extreme_tname)['TopicArn'] 27 | 28 | # Subscribe each users email to the critical Topic 29 | sns.subscribe(TopicArn = critical_arn, 30 | Protocol='email', Endpoint=user_row['Email']) 31 | # Subscribe each users phone number for the extreme Topic 32 | sns.subscribe(TopicArn = extreme_arn, 33 | Protocol='sms', Endpoint=str(user_row['Phone'])) 34 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/12. Creating multi-level topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multi-level topics 3 | The City Council asked Sam to create a critical and extreme topic for each department. 4 | 5 | The critical topic will trigger alerts to staff and managers. 6 | 7 | The extreme topics will trigger alerts to politicians and directors - it means the thresholds are completely exceeded. 8 | 9 | For example, the trash department will have a trash_critical and trash_extreme topic. 10 | 11 | She has already created the boto3 SNS client in the sns variable. She created a variable departments that contains a unique list of departments. 12 | 13 | In this lesson, you will help Sam make the City run faster! 14 | 15 | You will create multi-level topics with a different set of subscribers that trigger based on different thresholds. 
16 | 17 | You will effectively be building a smart alerting system! 18 | 19 | Instructions 20 | 100 XP 21 | _ For each department create a critical topic and store it in critical. 22 | _ For each department, create an extreme topic and store it in extreme. 23 | _ Place the created TopicArns into dept_arns. 24 | _ Print the dictionary. 25 | 26 | """ 27 | dept_arns = {} 28 | 29 | for dept in departments: 30 | # For each deparment, create a critical topic 31 | critical = sns.create_topic(Name="{}_critical".format(dept)) 32 | # For each department, create an extreme topic 33 | extreme = sns.create_topic(Name="{}_extreme".format(dept)) 34 | # Place the created TopicARNs into a dictionary 35 | dept_arns['{}_critical'.format(dept)] = critical['TopicArn'] 36 | dept_arns['{}_extreme'.format(dept)] = extreme['TopicArn'] 37 | 38 | # Print the filled dictionary. 39 | print(dept_arns) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/13. Sending an alert.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending an alert 3 | Elena Block, Sam's old friend and council member is running for office and potholes are a big deal in her district. She wants to help fix it. 4 | 5 | Pothole 6 | Elena asked Sam to adjust the streets_critical topic to send an alert if there are over 100 unfixed potholes in the backlog. 7 | 8 | Sam has created the boto3 SNS client in the sns variable. She stored the streets_critical topic ARN in the str_critical_arn variable. 9 | 10 | Help Sam take the next step. 11 | 12 | She needs to check the current backlog count and send a message only if it exceeds 100. 13 | 14 | The fate of District 12, and the results of Elena's election rest on your and Sam's shoulders. 15 | 16 | Instructions 17 | 100 XP 18 | _ If there are over 100 potholes, send a message with the current backlog count. 19 | _ Create the email subject to also include the current backlog counit. 20 | _ Publish message to the streets_critical Topic ARN. 21 | 22 | """ 23 | # If there are over 100 potholes, create a message 24 | if streets_v_count > 100: 25 | # The message should contain the number of potholes. 26 | message = "There are {} potholes!".format(streets_v_count) 27 | # The email subject should also contain number of potholes 28 | subject = "Latest pothole count is {}".format(streets_v_count) 29 | 30 | # Publish the email to the streets_critical topic 31 | sns.publish( 32 | TopicArn = str_critical_arn, 33 | # Set subject and message 34 | Message = message, 35 | Subject = subject 36 | ) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/14. Sending multi-level alerts.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sending multi-level alerts 3 | Sam is going to prototype her alerting system with the water data and the water department. 4 | 5 | According to the Director, when there are over 100 alerts outstanding, that's considered critical. If there are over 300, that's extreme. 6 | 7 | She has done some calculations and came up with a vcounts dictionary, that contains current requests for 'water', 'streets' and 'trash'. 8 | 9 | She has also already created the boto3 SNS client and stored it in the sns variable. 10 | 11 | In this exercise, you will help Sam publish a critical and an extreme alert based on the thresholds! 
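(Aside, not part of the exercise: the same threshold logic can be generalized over every department rather than hard-coding one. A minimal sketch, assuming the vcounts dictionary described above and the dept_arns dictionary built in the multi-level topics exercise.)

# Sketch: apply the critical/extreme thresholds to every department in vcounts
for dept, count in vcounts.items():
    if count > 100:
        sns.publish(TopicArn=dept_arns['{}_critical'.format(dept)],
                    Message='{} {} issues'.format(count, dept),
                    Subject='{} backlog is critical'.format(dept))
    if count > 300:
        sns.publish(TopicArn=dept_arns['{}_extreme'.format(dept)],
                    Message='{} {} issues'.format(count, dept),
                    Subject='{} backlog is extreme'.format(dept))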
12 | 13 | Instructions 14 | 100 XP 15 | _ If there are over 100 water violations, publish to 'water_critical' topic. 16 | _ If there are over 300 water violations, publish to 'water_extreme' topic. 17 | 18 | """ 19 | if vcounts['water'] > 100: 20 | # If over 100 water violations, publish to water_critical 21 | sns.publish( 22 | TopicArn = dept_arns['water_critical'], 23 | Message = "{} water issues".format(vcounts['water']), 24 | Subject = "Help fix water violations NOW!") 25 | 26 | if vcounts['water'] > 300: 27 | # If over 300 violations, publish to water_extreme 28 | sns.publish( 29 | TopicArn = dept_arns['water_extreme'], 30 | Message = "{} violations! RUN!".format(vcounts['water']), 31 | Subject = "THIS IS BAD. WE ARE FLOODING!") 32 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/2. Creating a Topic.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating a Topic 3 | Sam has been doing such a great job with her new skills, she got a promotion. 4 | 5 | With her new title of Information Systems Analyst 3 (as opposed to 2), she has gotten a tiny pay bump in exchange for a lot more work. 6 | 7 | Knowing what she can do, the City Council asked Sam to prototype an alerting system that will alert them when any department has more than 100 Get It Done requests outstanding. 8 | 9 | They would like for council members and department directors to receive the alert. 10 | 11 | Help Sam use her new knowledge of Amazon SNS to create an alerting system for Council! 12 | 13 | Instructions 14 | 100 XP 15 | _ Initialize the boto3 client for SNS. 16 | _ Create the 'city_alerts' topic and extract its topic ARN. 17 | _ Re-create the 'city_alerts' topic and extract its topic ARN with a one-liner. 18 | _ Verify the two topic ARNs match. 19 | 20 | """ 21 | # Initialize boto3 client for SNS 22 | sns = boto3.client('sns', 23 | region_name='us-east-1', 24 | aws_access_key_id=AWS_KEY_ID, 25 | aws_secret_access_key=AWS_SECRET) 26 | 27 | # Create the city_alerts topic 28 | response = sns.create_topic(Name="city_alerts") 29 | c_alerts_arn = response['TopicArn'] 30 | 31 | # Re-create the city_alerts topic using a oneliner 32 | c_alerts_arn_1 = sns.create_topic(Name='city_alerts')['TopicArn'] 33 | 34 | # Compare the two to make sure they match 35 | print(c_alerts_arn == c_alerts_arn_1) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/3. Creating Multiple Topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multiple topics 3 | Sam suddenly became a black sheep because she is responsible for an onslaught of text messages and notifications to department directors. 4 | 5 | No one will go to lunch with her anymore! 6 | 7 | To fix this, she decided to create a general topic per department for routine notifications, and a critical topic for urgent notifications. 8 | 9 | Managers will subscribe only to critical notifications, while supervisors can monitor general notifications. 10 | 11 | For example, the streets department would have 'streets_general' and 'streets_critical' as topics. 12 | 13 | She has initialized the SNS client and stored it in the sns variable. 14 | 15 | Help Sam create a tiered topic structure… and have friends again! 16 | 17 | Instructions 18 | 100 XP 19 | _ For every department, create a general topic. 
20 | _ For every department, create a critical topic. 21 | _ Print all the topics created in SNS. 22 | 23 | """ 24 | # Create list of departments 25 | departments = ['trash', 'streets', 'water'] 26 | 27 | for dept in departments: 28 | # For every department, create a general topic 29 | sns.create_topic(Name="{}_general".format(dept)) 30 | 31 | # For every department, create a critical topic 32 | sns.create_topic(Name="{}_critical".format(dept)) 33 | 34 | # Print all the topics in SNS 35 | response = sns.list_topics() 36 | print(response['Topics']) 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/4. Deleting multiple topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deleting multiple topics 3 | It's hard to get things done in City government without good relationships. Sam is burning bridges with the general topics she created in the last exercise. 4 | 5 | People are shunning her because she is blowing up their phones with notifications. 6 | 7 | She decides to get rid of the general topics per department completely, and keep only critical topics. 8 | 9 | Sam has created the boto3 client for SNS and stored it in the sns variable. 10 | 11 | Help Sam regain her status in the bureaucratic social hierarchy by removing any topics that do not have the word critical in them. 12 | 13 | Instructions 14 | 100 XP 15 | _ Get the current list of topics. 16 | _ For every topic ARN, if it doesn't have the word 'critical' in it, delete it. 17 | _ Print the list of remaining critical topics. 18 | 19 | """ 20 | # Get the current list of topics 21 | topics = sns.list_topics()['Topics'] 22 | 23 | for topic in topics: 24 | # For each topic, if it is not marked critical, delete it 25 | if "critical" not in topic['TopicArn']: 26 | sns.delete_topic(TopicArn=topic['TopicArn']) 27 | 28 | # Print the list of remaining critical topics 29 | print(sns.list_topics()['Topics']) 30 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/5. SNS Subscriptions.txt: -------------------------------------------------------------------------------- 1 | | CREATE SMS SUBSCRIPTION | 2 | 3 | 1. > INITIALIZE boto3 SNS CLIENT 4 | sns = boto3.client('sns', 5 | region_name='us-east-1', 6 | aws_access_key_id=AWS_KEY_ID, 7 | aws_secret_access_key=AWS_SECRET) 8 | 9 | 2. > PASS SNS SUBSCRIBE WITH TOPICARN 10 | response = sns.subscribe( 11 | TopicArn = 'arn:aws:sns:us-east-1:3121333277766:city_alerts', 12 | Protocol='sms', 13 | Endpoint='+123112354') # phone number 14 | 15 | | CREATE EMAIL SUBSCRIPTION | 16 | 17 | 2.
> PASS SNS SUBSCRIBE WITH TOPICARN 18 | # will show pending confirmation until the recipient clicks the confirmation link 19 | response = sns.subscribe( 20 | TopicArn = 'arn:aws:sns:us-east-1:3121333277766:city_alerts', 21 | Protocol='email', 22 | Endpoint='max@maximize.com') # email 23 | 24 | | CREATE MULTIPLE EMAIL SUBSCRIPTIONS | 25 | 26 | # For each email in contacts, create subscription to street_critical 27 | for email in contacts['Email']: 28 | sns.subscribe(TopicArn = str_critical_arn, 29 | # Set channel and recipient 30 | Protocol = 'email', #channel 31 | Endpoint = email) # recipient 32 | 33 | # List subscriptions for streets_critical topic, convert to DataFrame 34 | response = sns.list_subscriptions_by_topic( 35 | TopicArn = str_critical_arn) 36 | subs = pd.DataFrame(response['Subscriptions']) 37 | 38 | # Preview the DataFrame 39 | subs.head() 40 | 41 | | LIST SUBSCRIPTIONS BY TOPIC | 42 | 43 | sns.list_subscriptions_by_topic( 44 | TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts' 45 | ) 46 | 47 | | LIST ALL SUBSCRIPTIONS | 48 | 49 | 50 | sns.list_subscriptions()['Subscriptions'] 51 | 52 | | DELETE SUBSCRIPTION | 53 | 54 | sns.unsubscribe( 55 | SubscriptionArn='arn:aws:sns:us-east-1:3121333277766:city_alerts:as23sdf:234sddf32:4444' 56 | ) 57 | 58 | | DELETE MULTIPLE SUBSCRIPTIONS | 59 | 60 | # get list of subscriptions 61 | response = sns.list_subscriptions_by_topic( 62 | TopicArn='arn:aws:sns:us-east-1:3121333277766:city_alerts') 63 | subs= response['Subscriptions'] 64 | 65 | # Unsubscribe SMS Subscriptions 66 | for sub in subs: 67 | if sub['Protocol'] == 'sms': 68 | sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn']) 69 | 70 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/6. Subscribing to topics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Subscribing to topics 3 | Many department directors are already receiving critical notifications. 4 | 5 | Now Sam is ready to start subscribing City Council members. 6 | 7 | She knows that they can be finicky, and elected officials are not known for their attention to detail or tolerance for failure. 8 | 9 | She is nervous, but decides to start by subscribing the friendliest Council Member she knows. She got Elena Block's email and phone number. 10 | 11 | Sam has initialized the boto3 SNS client and stored it in the sns variable. 12 | 13 | She has also stored the topic ARN for streets_critical in the str_critical_arn variable. 14 | 15 | Help Sam subscribe her first Council member to the streets_critical topic! 16 | 17 | Instructions 18 | 100 XP 19 | _ Subscribe Elena's phone number to the 'streets_critical' topic. 20 | _ Print the SMS subscription ARN. 21 | _ Subscribe Elena's email to the 'streets_critical' topic. 22 | _ Print the email subscription ARN. 23 | 24 | """ 25 | # Subscribe Elena's phone number to streets_critical topic 26 | resp_sms = sns.subscribe( 27 | TopicArn = str_critical_arn, 28 | Protocol='sms', Endpoint="+16196777733") 29 | 30 | # Print the SubscriptionArn 31 | print(resp_sms['SubscriptionArn']) 32 | 33 | # Subscribe Elena's email to streets_critical topic. 34 | resp_email = sns.subscribe( 35 | TopicArn = str_critical_arn, 36 | Protocol='email', Endpoint="eblock@sandiegocity.gov") 37 | 38 | # Print the SubscriptionArn 39 | print(resp_email['SubscriptionArn']) 40 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3.
Reporting and Notifying!/7. Creating multiple subscriptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Creating multiple subscriptions 3 | After the successful pilot with Councilwoman Elena Block, other City Council members have been asking to be signed up for alerts too. 4 | 5 | Sam decides that she should manage subscribers in a CSV file, otherwise she would lose track of who needs to be subscribed to what. 6 | 7 | She creates a CSV named contacts and decides to subscribe everyone in the CSV to the streets_critical topic. 8 | 9 | She has created the boto3 SNS client in the sns variable, and the streets_critical topic ARN is in the str_critical_arn variable. 10 | 11 | Sam is going from being a social pariah to being courted by multiple council offices. 12 | 13 | Help her solidify her position as master of all information by adding all the users in her CSV to the streets_critical topic! 14 | 15 | Instructions 16 | 100 XP 17 | _ For each element in the Email column of contacts, create a subscription to the 'streets_critical' Topic. 18 | _ List subscriptions for the 'streets_critical' Topic and convert them to a DataFrame. 19 | _ Preview the DataFrame. 20 | 21 | """ 22 | # For each email in contacts, create subscription to street_critical 23 | for email in contacts['Email']: 24 | sns.subscribe(TopicArn = str_critical_arn, 25 | # Set channel and recipient 26 | Protocol = 'email', #channel 27 | Endpoint = email) # recipient 28 | 29 | # List subscriptions for streets_critical topic, convert to DataFrame 30 | response = sns.list_subscriptions_by_topic( 31 | TopicArn = str_critical_arn) 32 | subs = pd.DataFrame(response['Subscriptions']) 33 | 34 | # Preview the DataFrame 35 | subs.head() 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/8. Deleting multiple subscriptions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Deleting multiple subscriptions 3 | Now that Sam has a maturing notification system, she is learning that the types of alerts she sends do not bode well for text messaging. 4 | 5 | SMS alerts are great if the user can react that minute, but "We are 500 potholes behind" is not something that a Council Member can jump up and fix. 6 | 7 | She decides to remove all SMS subscribers from the streets_critical topic, but keep all email subscriptions. 8 | 9 | She created the boto3 SNS client in the sns variable, and the streets_critical topic ARN is in the str_critical_arn variable. 10 | 11 | In this exercise, you will help Sam remove all SMS subscribers and make this an email only alerting system. 12 | 13 | Instructions 14 | 100 XP 15 | _ List subscriptions for 'streets_critical' topic. 16 | _ For each subscription, if the protocol is 'sms', unsubscribe. 17 | _ List subscriptions for 'streets_critical' topic in one line. 18 | _ Print the subscriptions 19 | 20 | """ 21 | # List subscriptions for streets_critical topic. 
22 | response = sns.list_subscriptions_by_topic( 23 | TopicArn=str_critical_arn) 24 | 25 | # For each subscription, if the protocol is SMS, unsubscribe 26 | for sub in response['Subscriptions']: 27 | if sub['Protocol'] == 'sms': 28 | sns.unsubscribe(SubscriptionArn=sub['SubscriptionArn']) 29 | 30 | # List subscriptions for streets_critical topic in one line 31 | subs = sns.list_subscriptions_by_topic( 32 | TopicArn=str_critical_arn)['Subscriptions'] 33 | 34 | # Print the subscriptions 35 | print(subs) 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/3. Reporting and Notifying!/9. Sending messages.txt: -------------------------------------------------------------------------------- 1 | | PUBLISHING TO A TOPIC | 2 | 3 | response = sns.publish( 4 | TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts', 5 | Message= 'Body text of SMS or e-mail', # not visible for text msg 6 | Subject = 'Subject Line for Email') 7 | 8 | | SENDING CUSTOM MESSAGES | 9 | 10 | num_of_reports = 157 11 | response = sns.publish( 12 | TopicArn = 'arn:aws:sns:us-east-1:320333787981:city_alerts', 13 | Message = 'There are {} reports outstanding'.format(num_of_reports), 14 | Subject= 'Subject Line for Email') 15 | 16 | | SINGLE SMS | 17 | 18 | response = sns.publish( 19 | PhoneNumber = '+13121233211', 20 | Message = 'Body text of SMS or e-mail') 21 | 22 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/1. Rekognizing patterns.txt: -------------------------------------------------------------------------------- 1 | - detects objects in images 2 | - extracts text from images 3 | 4 | | why not build our own? | 5 | 6 | - Rekognition : 7 | . quick but good 8 | . keep code simple 9 | . rekognize many things 10 | - Build a model : 11 | . custom requirements 12 | . security implications 13 | . large volumes 14 | 15 | 16 | 1. > CREATE CLIENT: 17 | client = boto3.client('rekognition') 18 | 19 | 2. > INITIALIZE S3 20 | s3= boto3.client( 21 | 's3', region_name='us-east-1', 22 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 23 | 24 | 3. > UPLOAD FILE 25 | s3.upload_file( 26 | Filename='report.jpg', Key='report.jpg', 27 | Bucket='datacamp-img') 28 | 29 | 4. > INITIALIZE REKOGNITION CLIENT: 30 | rekog = boto3.client( 31 | 'rekognition', 32 | region_name='us-east-1', 33 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 34 | 35 | | OBJECT DETECTION | 36 | 37 | 5. > DETECT: 38 | response = rekog.detect_labels( 39 | Image={'S3Object': { 40 | 'Bucket': 'datacamp-img', 41 | 'Name': 'report.jpg' 42 | }}, 43 | MaxLabels=10, 44 | MinConfidence=95 45 | ) 46 | 47 | | TEXT DETECTION | 48 | 49 | 5. > DETECT: 50 | response = rekog.detect_text( 51 | Image={'S3Object': 52 | { 53 | 'Bucket': 'datacamp-img', 54 | 'Name': 'report.jpg' 55 | } 56 | } 57 | ) 58 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/10. Scooter Community Sentiment.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scooter community sentiment 3 | The City Council is curious about how different communities in the City are reacting to the Scooters. The dataset has expanded since Sam's initial analysis, and now contains Vietnamese, Tagalog, Spanish and English reports. 4 | 5 | They ask Sam to see if she can figure it out.
She decides that the best way to proxy for a community is through language (at least with the data she immediately has access to). 6 | 7 | She has already loaded the CSV into the scooter_df variable: 8 | 9 | Scooter dataframe contents 10 | 11 | In this exercise, you will help Sam understand sentiment across many different languages. This will help the City understand how different communities are relating to scooters, something that will affect the votes of City Council members. 12 | 13 | Instructions 14 | 100 XP 15 | For every DataFrame row, detect the dominant language. 16 | Use the detected language to determine the sentiment of the description. 17 | Group the DataFrame by the 'sentiment' and 'lang' columns in that order 18 | 19 | """ 20 | for index, row in scooter_requests.iterrows(): 21 | # For every DataFrame row 22 | desc = scooter_requests.loc[index, 'public_description'] 23 | if desc != '': 24 | # Detect the dominant language 25 | resp = comprehend.detect_dominant_language(Text=desc) 26 | lang_code = resp['Languages'][0]['LanguageCode'] 27 | scooter_requests.loc[index, 'lang'] = lang_code 28 | # Use the detected language to determine sentiment 29 | scooter_requests.loc[index, 'sentiment'] = comprehend.detect_sentiment( 30 | Text=desc, 31 | LanguageCode=lang_code)['Sentiment'] 32 | # Perform a count of sentiment by group. 33 | counts = scooter_requests.groupby(['sentiment', 'lang']).count() 34 | counts.head() 35 | ''' 36 | service_request_id service_name public_description 37 | sentiment lang 38 | MIXED en 4 4 4 39 | tl 12 12 12 40 | vi 3 3 3 41 | NEGATIVE tl 5 5 5 42 | vi 10 10 10 ''' 43 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/2. Cat detector.py: -------------------------------------------------------------------------------- 1 | """ 2 | Cat detector 3 | Sam has been getting more and more challenging projects as a result of her popularity and success. 4 | 5 | The newest request is from the animal control team. They want to receive notifications when an image comes in from the Get It Done application contains a cat. They can find feral cats and go rescue them. They provided her with two images that she can test her system with. One contains a cat, one does not. Both images are referenced in variables image1 and image2 respectively. 6 | 7 | Sam has also created the boto3 Rekognition client in the rekog variable. 8 | 9 | Help Sam use Rekognition to enable the animal control team to rescue stray cats! 10 | 11 | Instructions 3/3 12 | Use the Rekognition client to detect the labels for image1. Return a maximum of 1 label. 13 | Detect the labels for image2 and print the response's labels.. 
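(For context, a guess at how image1 and image2 might be defined, following the S3Object form shown in the Rekognition notes; the bucket and key names are hypothetical.)

# Hypothetical shape of the image variables passed to detect_labels
image1 = {'S3Object': {'Bucket': 'datacamp-img', 'Name': 'walkway.jpg'}}
image2 = {'S3Object': {'Bucket': 'datacamp-img', 'Name': 'cat.jpg'}}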
14 | 15 | 16 | """ 17 | # Use Rekognition client to detect labels 18 | image1_response = rekog.detect_labels( 19 | # Specify the image as an S3Object; Return one label 20 | Image=image1, MaxLabels=1) 21 | 22 | # Print the labels 23 | print(image1_response['Labels']) 24 | ''' 25 | [{'Confidence': 99.85968017578125, 'Instances': [], 'Name': 'Walkway', 'Parents': [{'Name': 'Path'}]}] 26 | ''' 27 | 28 | # Use Rekognition client to detect labels 29 | image2_response = rekog.detect_labels( 30 | Image=image2, MaxLabels=1 31 | ) 32 | 33 | # Print the labels 34 | print(image2_response['Labels']) 35 | ''' 36 | [{'Confidence': 96.95977020263672, 'Instances': [{'BoundingBox': {'Height': 0.3252439796924591, 'Left': 0.668968915939331, 37 | 'Top': 0.14526571333408356, 'Width': 0.1563364714384079}, 38 | 'Confidence': 96.95977020263672}], 'Name': 'Cat', 'Parents': [{'Name': 'Pet'}, {'Name': 'Mammal'}, {'Name': 'Animal'}]}] 39 | ''' 40 | 41 | """Question 42 | _ Which image contained cats? 43 | Possible Answers: 44 | 45 | image1. 46 | OK image2.""" 47 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/3. Multiple cat detector.py: -------------------------------------------------------------------------------- 1 | """ 2 | Multiple cat detector 3 | After using the Cat Detector for a bit, the Animal Control team found that it was inefficient for them to pursue one cat at a time. It would be better if they could find clusters of cats. 4 | 5 | They asked if Sam could add the count of cats detected to the message in the alerts they receive. They also asked her to lower the confidence floor, allowing the system to have more false positives. 6 | 7 | Cats detected by Rekognition 8 | 9 | Sam has already: 10 | 11 | Created the Rekognition client. 12 | Called .detect_labels() with the Bucket and Key of the image on S3. 13 | Stored the result in the response variable. 14 | Help Sam save cat lives! Help her count the cats in each image and include that in the alert to Animal Control! 15 | 16 | Instructions 17 | 100 XP 18 | _ Iterate over each element of the 'Labels' key in response. 19 | _ Once you encounter a label with the name 'Cat', iterate over the label's instances. 20 | _ If an instance's confidence level exceeds 85, increment cats_count by 1. 21 | _ Print the final cat count. 22 | """ 23 | # Create an empty counter variable 24 | cats_count = 0 25 | # Iterate over the labels in the response 26 | for label in response['Labels']: 27 | # Find the cat label, look over the detected instances 28 | if label['Name'] == 'Cat': 29 | for instance in label['Instances']: 30 | # Only count instances with confidence > 85 31 | if (instance['Confidence'] > 85): 32 | cats_count += 1 33 | # Print count of cats 34 | print(cats_count) 35 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/4. Parking sign reader.py: -------------------------------------------------------------------------------- 1 | """ 2 | Parking sign reader 3 | City planners have millions of images from truck cameras. The City does not keep a record of parking regulations, so extracting them from these images would be very valuable for planners. 4 | 5 | Sam found detect_text() in the boto3 Rekognition documentation. She is going to use it to extract text out of the images City planners provided. 6 | 7 | 8 | 9 | Sam has: 10 | 11 | Created the Rekognition client.
12 | Called .detect_text() and stored the result in response. 13 | Help Sam create a data source for parking regulations in the City. Use Rekognition's .detect_text() method to extract lines and words from images of parking signs. 14 | 15 | Instructions 2/2 16 | _ Iterate over each detected text in response, and append each detected text to words if the text's type is 'WORD'. 17 | _ Iterate over each detected text in response, and append each detected text to lines if the text's type is 'LINE'. 18 | 19 | """ 20 | # Create empty list of words 21 | words = [] 22 | # Iterate over the TextDetections in the response dictionary 23 | for text_detection in response['TextDetections']: 24 | # If TextDetection type is WORD, append it to words list 25 | if text_detection['Type'] == 'WORD': 26 | # Append the detected text 27 | words.append(text_detection['DetectedText']) 28 | # Print out the words list 29 | print(words) 30 | ''' 31 | ['NO', 'PARKING', '7', 'AM', 'TO', '12', 'NOON', 'MONDAY'] 32 | ''' 33 | #------- 34 | 35 | # Create empty list of lines 36 | lines = [] 37 | # Iterate over the TextDetections in the response dictionary 38 | for text_detection in response['TextDetections']: 39 | # If TextDetection type is Line, append it to lines list 40 | if text_detection['Type'] == 'LINE': 41 | # Append the detected text 42 | lines.append(text_detection['DetectedText']) 43 | # Print out the words list 44 | print(lines) 45 | ''' 46 | ['NO PARKING', '7 AM', 'TO', '12 NOON', 'MONDAY'] 47 | ''' 48 | 49 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/5. Comprehending text.txt: -------------------------------------------------------------------------------- 1 | | Translating textx | 2 | 3 | 1. > INITIALIZE BOTO3 CLIENT 'translate': 4 | translate = boto3.client( 5 | 'translate', 6 | region_name='us-east-1', 7 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 8 | 9 | 2. > TRANSLATE TEXT: 10 | response = translate.translate_text( 11 | Text='Hello, how are you ?', 12 | SourceLanguageCode='auto', 13 | TargetLanguageCode='es') 14 | 15 | 2. > TRANSLATE TEXT ONELINER: (directly from response) 16 | translated_text = translate.translate_text( 17 | Text='Hello, how are you?', 18 | SourceLanguageCode='auto', 19 | TargetLanguageCode='es')['TranslatedText'] 20 | 21 | | Detect Language | 22 | 23 | 1. > INITIALIZE BOTO3 CLIENT 'comprehend' : 24 | comprehend = boto3.client( 25 | 'comprehend', 26 | region_name='us-east-1', 27 | aws_access_key_id=AWS_KEY_ID, aws_secret_access_key=AWS_SECRET) 28 | 29 | 2. > DETECT DOMINANT LANGUAGE: 30 | response = comprehend.detect_dominant_language( 31 | Text='Hay basura por todas partes a lo largo de la carretera') 32 | 33 | | Sentiment Results | 34 | 35 | . Neutral --- Confidence 0.9 36 | . Positive 37 | . Negative 38 | . Mixed 39 | 40 | | Detect text Sentiment | 41 | 42 | 2. > DETECT SENTIMENT: 43 | response = comprehend.detect_sentiment( 44 | Text='Datacamp students are Amazing.', 45 | LanguageCode='en') 46 | 47 | 2. > DETECT SENTIMENT ONELINER (from response): 48 | sentiment = comprehend.detect_sentiment( 49 | Text='Franco is Amazing.', 50 | LanguageCode='en')['Sentiment'] 51 | 52 | 53 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/6. 
Detecting language.py: -------------------------------------------------------------------------------- 1 | """ 2 | Detecting language 3 | The City Council is wondering whether it's worth it to build a Spanish version of the Get It Done application. There is a large Spanish-speaking constituency, but they are not sure if they will engage. Building in multi-lingual translation complicates the system and needs to be justified. 4 | 5 | They ask Sam to figure out how many people are posting requests in Spanish. 6 | 7 | She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 8 | 9 | Get It Done requests in many languages 10 | 11 | Help Sam quantify the demand for a Spanish version of the Get It Done application. Figure out how many requesters use Spanish and print the final result! 12 | 13 | Instructions 14 | 100 XP 15 | For each row in the DataFrame, detect the dominant language. 16 | Assign the first selected language to the 'lang' column. 17 | Count the total number of posts in Spanish. 18 | 19 | """ 20 | # For each dataframe row 21 | for index, row in dumping_df.iterrows(): 22 | # Get the public description field 23 | description = dumping_df.loc[index, 'public_description'] 24 | if description != '': 25 | # Detect language in the field content 26 | resp = comprehend.detect_dominant_language(Text=description) 27 | # Assign the top choice language to the lang column. 28 | dumping_df.loc[index, 'lang'] = resp['Languages'][0]['LanguageCode'] 29 | 30 | # Count the total number of Spanish posts 31 | spanish_post_ct = len(dumping_df[dumping_df.lang == 'es']) 32 | # Print the result 33 | print("{} posts in Spanish".format(spanish_post_ct)) 34 | ''' 35 | 9 posts in Spanish 36 | ''' 37 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/7. Translating Get It Done requests.py: -------------------------------------------------------------------------------- 1 | """ 2 | Translating Get It Done requests 3 | Often, Get It Done requests come in with multiple languages in the description. This is a challenge for many City teams. In order to review the requests, many city teams need to have a translator on staff, or hope they know someone who speaks the language. 4 | 5 | The Streets director asked Sam to help. He wanted her to translate the Get It Done requests by running a job at the end of every day. 6 | 7 | Sam decides to run the requests through the AWS Translate service. She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 8 | 9 | Get It Done requests in many languages 10 | 11 | Help Sam translate the requests to English by running them through the AWS Translate service! 12 | 13 | Instructions 14 | 100 XP 15 | For each row in the DataFrame, translate it to English. 16 | Store the original language in the original_lang column. 17 | Store the new translation in the translated_desc column.
18 | 19 | """ 20 | for index, row in dumping_df.iterrows(): 21 | # Get the public_description into a variable 22 | description = dumping_df.loc[index, 'public_description'] 23 | if description != '': 24 | # Translate the public description 25 | resp = translate.translate_text( 26 | Text=description, 27 | SourceLanguageCode='auto', TargetLanguageCode='en') 28 | # Store original language in original_lang column 29 | dumping_df.loc[index, 'original_lang'] = resp['SourceLanguageCode'] 30 | # Store the translation in the translated_desc column 31 | dumping_df.loc[index, 'translated_desc'] = resp['TranslatedText'] # output 32 | # Preview the resulting DataFrame 33 | dumping_df = dumping_df[['service_request_id', 'original_lang', 'translated_desc']] 34 | dumping_df.head() 35 | ''' 36 | service_request_id original_lang translated_desc 37 | 0 93494 es The residents keep throwing stuff away 38 | 1 101502 en Couch, 4 chairs, mattress, carpet padding. thi... 39 | 2 101520 NaN NaN 40 | 3 101576 en On the South Side of Paradise Valley Road near... 41 | 4 101616 es There is a fridge on the street 42 | ''' 43 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/8. Getting request sentiment.py: -------------------------------------------------------------------------------- 1 | """ 2 | Getting request sentiment 3 | After successfully translating Get It Done cases for the Streets Director, he asked for one more thing. He really wants to understand how people in the City feel about his department's work. She believes she can answer that question via sentiment analysis of Get It Done requests. She has already loaded the CSV into the dumping_df variable and subset it to the following columns: 4 | 5 | Get It Done requests in many languages 6 | 7 | In this exercise, you will help Sam better understand the moods of the voices of the people that submit Get It Done cases, and whether they are coming into the interaction with the City in a positive mood or a negative one. 8 | 9 | Instructions 10 | 100 XP 11 | Detect the sentiment of 'public_description' for every row. 12 | Store the result in the 'sentiment' column. 13 | 14 | """ 15 | for index, row in dumping_df.iterrows(): 16 | # Get the translated_desc into a variable 17 | description = dumping_df.loc[index, 'public_description'] 18 | if description != '': 19 | # Get the detect_sentiment response 20 | response = comprehend.detect_sentiment( 21 | Text=description, 22 | LanguageCode='en') 23 | # Get the sentiment key value into sentiment column 24 | dumping_df.loc[index, 'sentiment'] = response['Sentiment'] 25 | # Preview the dataframe 26 | dumping_df.head() 27 | ''' 28 | service_request_id original_lang public_description sentiment 29 | 0 93494 es There is a lot of trash NEGATIVE 30 | 1 101502 en Couch, 4 chairs, mattress, carpet padding. this is a on going problem NEGATIVE 31 | 2 101520 NaN NaN NEUTRAL 32 | 3 101576 en On the South Side of Paradise Valley Road near the intersection with Jester St. 33 | Stuff in trash bags, rolling suitcases, and shopping carts. I suspect possessions 34 | of folk camping in the canyon. MIXED 35 | 4 101616 es The residents keep throwing stuff away''' 36 | -------------------------------------------------------------------------------- /Introduction to AWS Boto in Python/4. Pattern Rekognition/9. 
Scooter dispatch.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scooter dispatch 3 | The City Council were huge fans of Sam's prediction about whether scooter was blocking a sidewalk or not. So much so, they asked her to build a notification system to dispatch crews to impound scooters from sidewalks. 4 | 5 | With the dataset she created, Sam can dispatch crews to the case's coordinates when a request has negative sentiment. 6 | 7 | Scooter Dataframe 8 | 9 | In this exercise, you will help Sam implement a system that dispatches crews based on sentiment and image recognition. You will help Sam pair human and machine for effective City management! 10 | 11 | Instructions 12 | 100 XP 13 | Get the SNS topic ARN for 'scooter_notifications'. 14 | For every row, if sentiment is 'NEGATIVE' and there is an image of a scooter, construct a message to send. 15 | Publish the notification to the SNS topic. 16 | 17 | """ 18 | # Get topic ARN for scooter notifications 19 | topic_arn = sns.create_topic(Name='scooter_notifications')['TopicArn'] 20 | 21 | for index, row in scooter_requests.iterrows(): 22 | # Check if notification should be sent 23 | if (row['sentiment'] == 'NEGATIVE') & (row['img_scooter'] == 1): 24 | # Construct a message to publish to the scooter team. 25 | message = "Please remove scooter at {}, {}. Description: {}".format( 26 | row['long'], row['lat'], row['public_description']) 27 | 28 | # Publish the message to the topic! 29 | sns.publish(TopicArn = topic_arn, 30 | Message = message, 31 | Subject = "Scooter Alert") 32 | -------------------------------------------------------------------------------- /Introduction to Airflow in Python/I Intro to Airflow.py: -------------------------------------------------------------------------------- 1 | """ 2 | complete introduction to the components of Apache Airflow and learn how and why you should use them. 3 | 4 | \ workflow / 5 | 6 | > set of steps to accomplish a d.e task 7 | ..such as -downloading files -copying data -filtering info -writing to a database 8 | > of varying leves of complexity 9 | 10 | \ airflow / 11 | 12 | > platform to program workflows 13 | - creating 14 | - scheduling 15 | - monitoring 16 | 17 | > can implement any language, but workflows written : in Python 18 | > implement workflows as DAGs : Direct Acyclic Graphs 19 | > access via: code, command-line, web interface 20 | 21 | | DAG simple definition eg: | 22 | 23 | etl_dag = DAG( 24 | dag_id = 'etl_pipeline', 25 | default_args = {"start_data":"2020-01-08"} 26 | ) 27 | ------ 28 | 29 | | Run a workflow in Airflow eg: | 30 | 31 | airflow run 32 | airflow run example_etl download-file 2020-01-10 33 | ------- 34 | """ 35 | #| 36 | #| 37 | ### Running a task in Airflow 38 | """-Which command would you enter in the console to run the desired task?""" 39 | airflow run etl_pipeline download_file 2020-01-08 40 | #| 41 | #| 42 | ### Examining Airflow commands 43 | """-Which of the following is NOT an Airflow sub-command? 
44 | 45 | | get airflow documentation ariflow description | 46 | 47 | airflow -h 48 | 49 | -list_dags 50 | -edit_dag 51 | -test 52 | -scheduler 53 | """ 54 | # Answ: edit_dag 55 | #| 56 | #| 57 | """ 58 | \ What's a DAG / 59 | 60 | > Direct : flow representing dependencies 61 | > Acyclic : do not loop 62 | > Graph : actual set of components 63 | 64 | # written in python, components in any language 65 | # made up of components : tasks to be executed (operators, sensors) 66 | # dependencies defined (explicit or implicit) 67 | 68 | | Define a DAG eg:| 69 | 70 | from airflow.models import DAG 71 | from datetime import datetime 72 | 73 | default_arguments = { 74 | 'owner' : 'jdoe', 75 | 'emial' : 'jdoe@datacamp.com', 76 | 'start_date' : datetime(2020, 1, 20) 77 | } 78 | 79 | etl_dag = DAG( 'etl_workflow', default_args=default_arguments ) 80 | --------- 81 | """ 82 | #| 83 | #| 84 | ### Defining a simple DAG 85 | """1/3""" 86 | # Import the DAG object 87 | from airflow.models import DAG 88 | #| 89 | """2/3""" 90 | # Import the DAG object 91 | from airflow.models import DAG 92 | 93 | # Define the default_args dictionary 94 | default_args = { 95 | 'owner': 'dsmith', 96 | 'start_date': datetime(2020, 1, 14), 97 | 'retries' : 2 98 | } 99 | #| 100 | """3/3""" 101 | # Import the DAG object 102 | from airflow.models import DAG 103 | 104 | # Define the default_args dictionary 105 | default_args = { 106 | 'owner': 'dsmith', 107 | 'start_date': datetime(2020, 1, 14), 108 | 'retries': 2 109 | } 110 | 111 | # Instantiate the DAG object 112 | etl_dag = DAG( 'example_etl', default_args= default_args) 113 | #| 114 | #| 115 | ### Working with DAGs and the Airflow shell 116 | """-How many DAGs are present in the Airflow system from the command-line? 117 | 118 | | list DAGs inshell | 119 | 120 | airflow list_dags 121 | """ 122 | airflow list_dags 123 | # Answ: 2 124 | #| 125 | #| 126 | ### Troubleshooting DAG creation 127 | """Instructions 128 | 129 | Use the airflow shell command to determine which DAG is not being recognized correctly. 130 | After you determine the broken DAG, open the file and fix any Python errors. 131 | Once modified, verify that the DAG now appears within Airflow's output.""" 132 | airflow list_dags 133 | ---------- 134 | from airflow.models import DAG # fixed part 135 | default_args = { 136 | 'owner': 'jdoe', 137 | 'email': 'jdoe@datacamp.com' 138 | } 139 | dag = DAG( 'refresh_data', default_args=default_args ) 140 | #| 141 | #| 142 | """ 143 | \ Airflow web interface / 144 | . 145 | """ 146 | #| 147 | #| 148 | ### Airflow web interface 149 | """Which airflow command would you use to start the webserver on port 9090?""" 150 | airflow webserver -h # see documentation 151 | ------- 152 | airflow webserver -p 9090 153 | #| 154 | #| 155 | ### Navigating the Airflow UI 156 | """Which of the following events have NOT run on your Airflow instance? 157 | 158 | -cli_scheduler 159 | -cli_webserver 160 | -cli_worker""" 161 | # -browse -logs 162 | # Answ: cli_worker 163 | #| 164 | #| 165 | ### Examining DAGs with the Airflow UI 166 | """she would like to know which operator is NOT in use with the DAG called update_state, as your team is trying to verify the 167 | components used in production workflows. 168 | -Remember to select the operator NOT used in this DAG. 
169 | 170 | 171 | BashOperator 172 | PythonOperator 173 | JdbcOperator 174 | SimpleHttpOperator""" 175 | # -go inside update_state name 176 | # -graph view 177 | # ANSW: JdbcOperator 178 | 179 | 180 | -------------------------------------------------------------------------------- /Introduction to Bash Scripting/I From Command-Line to Bash Script.py: -------------------------------------------------------------------------------- 1 | """~~~ 2 | Save yourself time when performing complex repetitive tasks. You’ll begin this course by refreshing your knowledge of common command-line programs and arguments 3 | before quickly moving into the world of Bash scripting. You’ll create simple command-line pipelines and then transform these into Bash scripts. You’ll then boost 4 | your skills and learn about standard streams and feeding arguments to your Bash scripts. 5 | """ 6 | 7 | """ 8 | *** Shell command refreshers : 9 | (e)grep : filter input based on regex pattern matching 10 | cat : concatenates file contents line-by-line 11 | tail/head : give only the last/first -n (flag) lines 12 | wc : word or line count (flags -w -l) 13 | sed : pattern matched string replacement 14 | 15 | *** shell practice : 16 | - run : grep 'p' fruits.txt 17 | apple 18 | - run : grep '[pc]' fruits.txt 19 | apple, carrot 20 | - run : sort | uniq -c ----to count 21 | 22 | *** Word separation shell syntax : 23 | eg : repl:~$ cat two_cities.txt | egrep 'Sydney Carton|Charles Darnay' 24 | 25 | ** REGEX : regular expressions, a vital skill for bash scripting 26 | 27 | """ 28 | 29 | """ 30 | ### Extracting scores with shell 31 | 32 | There is a file in either the start_dir/first_dir, start_dir/second_dir or start_dir/third_dir directory called soccer_scores.csv. 33 | It has columns Year,Winner,Winner Goals for outcomes of a soccer league. 34 | 35 | -cd into the correct directory and use cat and grep to find who was the winner in 1959. You could also just ls from the top directory if you like! 36 | 37 | terminal: repl:~$ cd start_dir/second_dir/ 38 | repl:~/start_dir/second_dir$ cat soccer_scores.csv | grep 1959 39 | 1959,Dunav,2 40 | 41 | Answer : Dunav 42 | 43 | """ 44 | 45 | """ 46 | ### Searching a book with shell 47 | 48 | There is a copy of Charles Dickens's infamous 'Tale of Two Cities' in your home directory called two_cities.txt. 49 | 50 | -Use command line arguments such as cat, grep and wc with the right flag to count the number of lines in the book that contain either the 51 | character 'Sydney Carton' or 'Charles Darnay'. Use exactly these spellings and capitalizations. 52 | 53 | Terminal : repl:~$ cat two_cities.txt | egrep 'Sydney Carton|Charles Darnay' | wc -l 54 | 55 | Answer : 77 56 | 57 | """ 58 | 59 | """ 60 | Your first Bash script 61 | 62 | >>>>>> which bash : to check the location of the bash you're running 63 | >>>>>> bash script_name.sh or ./script_name.sh: run bash script 64 | 65 | *** bash Anatomy : .sh 66 | 67 | """ 68 | 69 | """ 70 | ### A simple Bash script 71 | 72 | For this environment bash is not located at /usr/bash but at /bin/bash. You can confirm this with the command which bash. 73 | 74 | There is a file in your working directory called server_log_with_todays_date.txt. 75 | -Your task is to write a simple Bash script that concatenates this file out to the terminal so you can see what is inside. 76 | -Create a single-line script that concatenates the mentioned file. 77 | -Save your script and run from the console.
78 | 79 | 80 | Pane : #!/bin/bash 81 | 82 | # Concatenate the file 83 | cat server_log_with_todays_date.txt 84 | 85 | 86 | # Now save and run! 87 | """ 88 | 89 | """ 90 | ### Shell pipelines to Bash scripts 91 | 92 | Your job is to create a Bash script from a shell piped command which will aggregate to see how many times each team has won. 93 | 94 | -Create a single-line pipe to cat the file, cut out the relevant field and aggregate (sort & uniq -c will help!) based on winning team. 95 | -Save your script and run from the console. 96 | 97 | Pane: #!/bin/bash 98 | 99 | # Create a single-line pipe 100 | cat soccer_scores.csv | cut -d "," -f 2 | tail -n +2 | sort | uniq -c 101 | 102 | # Now save and run! 103 | out: 13 Arda 104 | 8 Beroe 105 | 9 Botev 106 | 8 Cherno 107 | 17 Dunav 108 | 15 Etar 109 | 4 Levski 110 | 1 Lokomotiv 111 | """ 112 | 113 | """ 114 | ### Extract and edit using Bash scripts 115 | 116 | -You will need to create a Bash script that makes use of sed to change the required team names. 117 | 118 | -Create a pipe using sed twice to change the team Cherno to Cherno City first, and then Arda to Arda United. 119 | -Pipe the output to a file called soccer_scores_edited.csv. 120 | -Save your script and run from the console. Try opening soccer_scores_edited.csv using shell commands to confirm it worked (the first line should be changed)! 121 | 122 | Pane : #!/bin/bash 123 | 124 | # Create a sed pipe to a new file 125 | cat soccer_scores.csv | sed 's/Cherno/Cherno City/g' | sed 's/Arda/Arda United/g' > soccer_scores_edited.csv 126 | 127 | # Now save and run! 128 | 129 | """ 130 | 131 | """ 132 | Standard streams & arguments 133 | 134 | *** arguments : 135 | >>>>>> $1 $2 ... : access bash arguments by position 136 | >>>>>> $@ or $* : access all arguments 137 | >>>>>> $# : gives the number of arguments 138 | """ 139 | 140 | ### Using arguments in Bash scripts 141 | 142 | # Echo the first and second ARGV arguments 143 | echo $1 144 | echo $2 145 | 146 | # Echo out the entire ARGV array 147 | echo $@ 148 | 149 | # Echo out the size of ARGV 150 | echo $# 151 | 152 | 153 | repl:~/workspace$ bash script.sh Bird Fish Rabbit 154 | 155 | """ 156 | ### Echo the first ARGV argument 157 | 158 | -Echo the first ARGV argument so you can confirm it is being read in. 159 | -cat all the files in the directory /hire_data and pipe to grep to filter using the city name (your first ARGV argument). 160 | -On the same line, pipe out the filtered data to a new CSV called cityname.csv where cityname is taken from the first ARGV argument. 161 | -Save your script and run from the console twice (do not use the ./script.sh method). Once with the argument Seoul. Then once with the argument Tallinn.
162 | """ 163 | # Echo the first ARGV argument 164 | echo $1 165 | 166 | # Cat all the files 167 | # Then pipe to grep using the first ARGV argument 168 | # Then write out to a named csv using the first ARGV argument 169 | cat hire_data/* | grep "$1" > "$1".csv 170 | 171 | 172 | repl:~/workspace$ bash script.sh Seoul 173 | Seoul 174 | repl:~/workspace$ bash script.sh Tallinn 175 | Tallinn 176 | -------------------------------------------------------------------------------- /Introduction to Data Engineering/I Introduction to Data Engineering.py: -------------------------------------------------------------------------------- 1 | """"****************************************************************************** 2 | Explore the differences between a data engineer 3 | and a data scientist, get an overview of the various tools data engineers use and 4 | expand your understanding of how cloud technology plays a role in data engineering.... 5 | **************************************************************************************""" 6 | #///WHAT IS DATA ENGINEERING/// 7 | #Tasks of the data engineer 8 | """find best fit 9 | Possible Answers 10 | 11 | 1 Apply a statistical model to a large dataset to find outliers. 12 | OK 2 Set up scheduled ingestion of data from the application databases to an analytical database. 13 | 3 Come up with a database schema for an application. """ 14 | #--- 15 | #Data engineer or data scientist? 16 | """drag items into correct bucket 17 | 18 | {Data engineer} 19 | clean corrupt data 20 | develop scalable data architecture 21 | set up processes to bring data together 22 | streamline data acquisition 23 | cloud tech """ 24 | #--- 25 | #Data engineering problems 26 | """decide where you're best suited to be of help 27 | 28 | ok 1 Data scientists are querying the online store databases directly and slowing down the functioning of the application since it's using the same database. 29 | 2 Harmful product recommendations are affecting the sales numbers of the online store. 30 | 3 The online store is slow because the application's database server doesn't have enough memory. 31 | 32 | (data engineer should make sure there's a separate database for analytics.) """ 33 | #--- 34 | #///TOOLS/// 35 | #Kinds of databases 36 | """identify the database in the schematics 37 | 38 | 1 All database nodes are on the left. 39 | 2 All nodes on the left and the analytics node on the right are databases. 40 | ok 3 Accounting, Online Store, Product Catalog, and Analytics are databases. """ 41 | #--- 42 | #Processing tasks 43 | """select the most correct statement 44 | 45 | 1 Data processing is often done on a single, very powerful machine. 46 | ok 2 Data processing is distributed over clusters of virtual machines. 47 | 3 Data processing is often very complicated because you have to manually distribute workload over several computers. 48 | 49 | ( join, clean, or organize data is done in the data processing) """ 50 | #--- 51 | #Scheduling tools 52 | """ which one is not a responsibility of the scheduler? 53 | 54 | 1 Make sure jobs run in a specific order and all dependencies are resolved correctly. 55 | 2 Make sure the jobs run at midnight UTC each day. 56 | ok 3 Scale up the number of nodes when there's lots of data to be processed. """ 57 | #--- 58 | #///CLOUD PROVIDERS/// 59 | #Why cloud computing? 60 | """benefits of using cloud computing as opposed to self-hosting data centers. Can you select the most correct statement about cloud computing? 61 | 62 | 1 Cloud computing is always cheaper. 
63 | ok 2 The cloud can provide you with the resources you need, when you need them. 64 | 3 On premise machines give me full control over the situation when things break. 65 | 66 | (cloud elasticity) """ 67 | #--- 68 | #Big players in cloud computing 69 | """Can you order the big three correctly? 70 | 71 | 1 amazon 72 | 2 azure 73 | 3 G cloud """ 74 | #--- 75 | #Cloud services 76 | """Classify the services into the correct bucket. 77 | - storage 78 | azure blob 79 | amazon s3 80 | cloud storage 81 | - compute 82 | azure virtual machine 83 | AWS EC2 84 | Google Compute Engine 85 | - DB 86 | Amazon RDS 87 | Azure SQL DB 88 | G Cloud SQL """ 89 | -------------------------------------------------------------------------------- /Introduction to Data Engineering/README.me: -------------------------------------------------------------------------------- 1 | # It touches 2 | 3 | upon all things you need to know to streamline your data processing. This introductory course will give you enough context to start exploring the world of data engineering. It's perfect for people who work at a company with several data sources and don't have a clear idea of how to use all those data sources in a scalable way. Be the first one to introduce these techniques to your company and become the company star employee. 4 | -------------------------------------------------------------------------------- /Introduction to PySpark/I Getting to know PySpark.py: -------------------------------------------------------------------------------- 1 | """ 2 | how Spark manages data and how you can read and write tables from Python. 3 | 4 | \ What is Spark, anyway? / 5 | 6 | | parallel computation | ( split data in clusters ) 7 | 8 | Spark is a platform for cluster computing. It lets you spread data and computations over clusters with multiple nodes. 9 | Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data. 10 | parallel computation can make certain types of programming tasks much faster. 11 | 12 | | Using Spark in Python | 13 | 14 | - cluster: 15 | - master : ( main pc ) 16 | - worker : ( rest of computers in cluster ) 17 | 18 | . create connection : by creating an instance of the SparkContext class 19 | . attributes : eg sc.version 20 | 21 | | create SparkSession | 22 | 23 | # Import SparkSession 24 | > from pyspark.sql import SparkSession 25 | 26 | # create SparkSession builder 27 | > my_spark = SparkSession.builder.getOrCreate() 28 | 29 | # print spark tables 30 | > print(spark.catalog.listTables()) 31 | 32 | | SparkSession attributes | 33 | 34 | # always . 35 | - catalog : ( lists the tables/metadata in the cluster ) 36 | - .listTables() 37 | # returns the names of all tables in the cluster as a list 38 | > spark.catalog.listTables() 39 | 40 | - .read # read different data sources into Spark DataFrames 41 | > spark.read.csv(file_path, header=True) 42 | 43 | | SparkSession / DataFrame methods | 44 | 45 | # always . 46 | - .show() -> print 47 | - .sql() -> run a query ( takes the query as a string, returns a DataFrame ) 48 | - .toPandas() -> returns corresponding 'pandas' DataFrame 49 | - .createDataFrame() -> turn a pandas DataFrame into a Spark DataFrame (local/temporary) 50 | - .createTempView() -> add to the catalog, but as temporary (in session) 51 | - .createOrReplaceTempView("temp") -> creates the temp view if it doesn't exist yet, otherwise updates it 52 | """ 53 | #| 54 | #| 55 | ### How do you connect to a Spark cluster from PySpark? 56 | # ANSW: Create an instance of the SparkContext class.
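The notes above list the SparkSession and DataFrame helpers separately; the short sketch below (an added illustration, not one of the course exercises) chains them together. It assumes a local PySpark installation, a made-up CSV path "/tmp/flights_sample.csv", and a likewise hypothetical view name "flights_sample".

# A minimal sketch tying together the calls summarized above
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()                       # reuse an existing session or start one

df = spark.read.csv("/tmp/flights_sample.csv", header=True)      # hypothetical file path
df.createOrReplaceTempView("flights_sample")                     # register the DataFrame in the catalog
print(spark.catalog.listTables())                                # the temp view should now be listed
print(spark.sql("SELECT * FROM flights_sample LIMIT 5").toPandas())  # query it, then convert to pandas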
57 | #| 58 | #| 59 | ### Examining The SparkContext 60 | # Verify SparkContext in environment 61 | print(sc) 62 | 63 | # Print Spark version 64 | print(sc.version) 65 | #- 66 | #- 3.2.0 67 | #| 68 | #| 69 | """ 70 | \ Using Spark DataFrames / 71 | 72 | . Spark structure : Resilient Distributed Dataset (RDD) 73 | . behaves : like a SQL table 74 | # DataFrames are more optimized for complicated operations than RDDs. 75 | 76 | - create Spark Dataframe: create a object from your 77 | #interface <'spark'> 78 | #connection <'sc'> 79 | """ 80 | #| 81 | #| 82 | ### Which of the following is an advantage of Spark DataFrames over RDDs? 83 | # ANSW: Operations using DataFrames are automatically optimized. 84 | #| 85 | #| 86 | ### Creating a SparkSession 87 | # Import SparkSession from pyspark.sql 88 | from pyspark.sql import SparkSession 89 | 90 | # Create my_spark 91 | my_spark = SparkSession.builder.getOrCreate() 92 | 93 | # Print my_spark 94 | print(my_spark) 95 | #| 96 | #| 97 | ### Viewing tables 98 | # Print the tables in the catalog 99 | print(spark.catalog.listTables()) 100 | #| 101 | #| 102 | ### Are you query-ious? 103 | # Don't change this query 104 | query = "FROM flights SELECT * LIMIT 10" 105 | 106 | # Get the first 10 rows of flights 107 | flights10 = spark.sql(query) 108 | 109 | # Show the results 110 | flights10.show() 111 | #| 112 | #| 113 | ### Pandafy a Spark DataFrame 114 | # Don't change this query 115 | query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest" 116 | 117 | # Run the query 118 | flight_counts = spark.sql(query) 119 | 120 | # Convert the results to a pandas DataFrame 121 | pd_counts = flight_counts.toPandas() 122 | 123 | # Print the head of pd_counts 124 | print(pd_counts.head()) 125 | #| 126 | #| 127 | ### Put some Spark in your data 128 | # Create pd_temp 129 | pd_temp = pd.DataFrame(np.random.random(10)) 130 | 131 | # Create spark_temp from pd_temp 132 | spark_temp = spark.createDataFrame(pd_temp) 133 | 134 | # Examine the tables in the catalog 135 | print(spark.catalog.listTables()) 136 | 137 | # Add spark_temp to the catalog 138 | spark_temp.createOrReplaceTempView("temp") 139 | 140 | # Examine the tables in the catalog again 141 | print(spark_temp) 142 | #| 143 | #| 144 | ### Dropping the middle man 145 | # Don't change this file path 146 | file_path = "/usr/local/share/datasets/airports.csv" 147 | 148 | # Read in the airports data 149 | airports = spark.read.csv(file_path, header=True) # to view column names 150 | 151 | # Show the data 152 | airports.show() 153 | -------------------------------------------------------------------------------- /Introduction to PySpark/II. Manipulating data.py: -------------------------------------------------------------------------------- 1 | """ 2 | pyspark.sql module, which provides optimized data queries to your Spark session. 3 | 4 | | Pyspark attributes : manipulation | 5 | 6 | - .withColumn() -> create new column . 
('new_column', old_col + 1) 7 | - df.colName() -> extract column name 8 | 9 | | pyspark methods : manipulation | 10 | 11 | - spark.table() -> create df containing values of table in the .catalog 12 | - .filter() -> like a cut for SQL 13 | > flights.filter("air_time > 120").show() # return values cut #(SQL string) 14 | > flights.filter(flights.air_time > 120).show() # return bool 15 | 16 | - .select() -> returns only the columns you specify 17 | | > selectNoStr = flights.select(flights.origin, flights.dest, flights.carrier) 18 | | > selectStr = flights.select("tailnum", "origin", "dest") 19 | | > flights.selectExpr("origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed") 20 | - .withColumn() -> returns all columns in addition to the defined. 21 | -> create new df column 22 | - .withColumnRenamed 23 | -> rename columns 24 | 25 | - GroupedData 26 | | 27 | - .agg() -> pass an aggregate column expression that uses any of the aggregate functions from the pyspark.sql.functions submodule. 28 | # Import pyspark.sql.functions as F 29 | > import pyspark.sql.functions as F 30 | - F.stddev -> estandard deviation 31 | 32 | - .join() -> takes three arguments. 1.the second DataFrame to join, 2. on == key column(s) as a string, 3. how == specifies kind of join how="leftouter" 33 | """ 34 | #| 35 | #| 36 | ### Creating columns 37 | # Create the DataFrame flights 38 | flights = spark.table('flights') # create table 'name' 39 | 40 | # Show the head 41 | flights.show() # head() is default 42 | 43 | # Add duration_hrs from air_time 44 | flights = flights.withColumn('duration_hrs', flights.air_time /60) 45 | #| 46 | #| 47 | ### SQL in a nutshell 48 | """Which of the following queries returns a table of tail numbers and destinations for flights that lasted more than 10 hours?""" 49 | # ANSW: SELECT dest, tail_num FROM flights WHERE air_time > 600; 50 | #| 51 | #| 52 | ### SQL in a nutshell (2) 53 | """What information would this query get?""" 54 | """ SELECT AVG(air_time) / 60 FROM flights 55 | GROUP BY origin, carrier; """ 56 | # ANSW: The average length of each airline's flights from SEA and from PDX in hours. 
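To connect the SQL in this quiz back to the DataFrame methods listed at the top of the chapter, here is a small added sketch (not a course exercise) of how roughly the same aggregation could be written with .groupBy() and .agg(); it assumes the flights DataFrame and running spark session from the surrounding exercises.

# Average air time per origin and carrier, converted from minutes to hours
import pyspark.sql.functions as F

avg_hrs = (flights
           .groupBy("origin", "carrier")
           .agg((F.avg("air_time") / 60).alias("avg_air_time_hrs")))
avg_hrs.show()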
57 | #| 58 | #| 59 | ### Filtering Data 60 | # Filter flights by passing a string 61 | long_flights1 = flights.filter("distance > 1000") # sql string 62 | 63 | # Filter flights by passing a column of boolean values 64 | long_flights2 = flights.filter(flights.distance > 1000) 65 | 66 | # Print the data to check they're equal 67 | long_flights1.show() 68 | long_flights2.show() 69 | #- same result 70 | #| 71 | #| 72 | ### Selecting 73 | # Select the first set of columns 74 | selected1 = flights.select("tailnum", "origin", "dest") 75 | 76 | # Select the second set of columns 77 | temp = flights.select(flights.origin, flights.dest, flights.carrier) 78 | 79 | # Define first filter 80 | filterA = flights.origin == "SEA" 81 | 82 | # Define second filter 83 | filterB = flights.dest == "PDX" 84 | 85 | # Filter the data, first by filterA then by filterB 86 | selected2 = temp.filter(filterA).filter(filterB) 87 | #| 88 | #| 89 | ### Selecting II 90 | # Define avg_speed 91 | avg_speed = (flights.distance/(flights.air_time/60)).alias("avg_speed") 92 | 93 | # Select the correct columns 94 | speed1 = flights.select("origin", "dest", "tailnum", avg_speed) 95 | 96 | # Create the same table using a SQL expression 97 | speed2 = flights.selectExpr( 98 | "origin", "dest", "tailnum", "distance/(air_time/60) as avg_speed") 99 | #| 100 | #| 101 | ### Aggregating 102 | # Find the shortest flight from PDX in terms of distance 103 | # Perform the filtering by referencing the column directly, not passing a SQL string. 104 | flights.filter(flights.origin == "PDX").groupBy().min('distance').show() 105 | 106 | # Find the longest flight from SEA in terms of air time 107 | flights.filter(flights.origin == 'SEA').groupBy().max('air_time').show() 108 | #| 109 | #| 110 | ### Aggregating II 111 | # Average duration of Delta flights 112 | flights.filter(flights.carrier == "DL").filter(flights.origin == "SEA").groupBy().avg("air_time").show() 113 | 114 | # Total hours in the air 115 | flights.withColumn("duration_hrs", flights.air_time/60).groupBy().sum("duration_hrs").show() 116 | #| 117 | #| 118 | ### Grouping and Aggregating I 119 | # Group by tailnum 120 | by_plane = flights.groupBy("tailnum") 121 | 122 | # Number of flights each plane made 123 | by_plane.count().show() 124 | 125 | # Group by origin 126 | by_origin = flights.groupBy("origin") 127 | 128 | # Average duration of flights from PDX and SEA 129 | by_origin.avg("air_time").show() 130 | #| 131 | #| 132 | ### Grouping and Aggregating II 133 | # Import pyspark.sql.functions as F 134 | import pyspark.sql.functions as F 135 | 136 | # Group by month and dest 137 | by_month_dest = flights.groupBy('month','dest') 138 | 139 | # Average departure delay by month and destination 140 | by_month_dest.avg('dep_delay').show() 141 | 142 | # Standard deviation of departure delay 143 | by_month_dest.agg(F.stddev('dep_delay')).show() # stddev = standard deviation 144 | #| 145 | #| 146 | ### joining 147 | """Which of the following is not true? 148 | 149 | Joins combine tables. 150 | Joins add information to a table. 151 | Storing information in separate tables can reduce repetition. 152 | There is only one kind of join.""" 153 | # ANSW: There is only one kind of join. 
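Since the point of this quiz is that there is more than one kind of join, the short added sketch below (not a course exercise) shows how the how= argument of .join() picks the join type. It assumes the flights and airports DataFrames as loaded in the neighbouring exercises, with airports still keyed by its faa column.

# Same two tables, two different join types selected via how=
inner_joined = flights.join(airports, flights.dest == airports.faa, how="inner")      # keep only matching rows
left_joined = flights.join(airports, flights.dest == airports.faa, how="leftouter")   # keep every flights row
print(inner_joined.count(), left_joined.count())                                      # the counts can differ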
154 | #| 155 | #| 156 | ### Joining II 157 | # Examine the data 158 | print(airports.show()) 159 | 160 | # Rename the faa column 161 | airports = airports.withColumnRenamed('faa', 'dest') 162 | 163 | # Join the DataFrames 164 | flights_with_airports = flights.join(airports, on='dest', how='leftouter') 165 | 166 | # Examine the new DataFrame 167 | print(flights_with_airports.show()) 168 | -------------------------------------------------------------------------------- /Introduction to PySpark/III. Getting started with machine learning pipelines.py: -------------------------------------------------------------------------------- 1 | """ 2 | PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter. 3 | 4 | \ Machine Learning Pipelines / 5 | 6 | > summodule: pyspark.ml 7 | - Transformer class 8 | - Estimator class 9 | 10 | | methods for ML | 11 | 12 | # Transformer class 13 | - .transform() -> takes a DataFrame and returns a new DataFrame +1 column 14 | - Bucketizer -> create discrete bins from a continuous feature 15 | 16 | # Estimator class 17 | - .fit() -> they return a model object 18 | """ 19 | #| 20 | #| 21 | ### Machine Learning Pipelines 22 | """ Which of the following is not true about machine learning in Spark?""" 23 | # ANSW: Spark's algorithms give better results than other algorithms. 24 | #| 25 | #| 26 | ### Join the DataFrames 27 | # Rename year column to avoid duplicates 28 | planes = planes.withColumnRenamed('year', 'plane_year') 29 | 30 | # Join the DataFrames 31 | model_data = flights.join(planes, on='tailnum', how="leftouter") 32 | """ 33 | | Data types for ML | 34 | 35 | > change types 36 | .cast() 'integer','double' 37 | """ 38 | #| 39 | #| 40 | ### Data Types 41 | """What kind of data does Spark need for modeling?""" 42 | # ANSW: Numeric 43 | #| 44 | #| 45 | ### String to integer 46 | # Cast the columns to integers # df.col notation 47 | model_data = model_data.withColumn("arr_delay", model_data.arr_delay.cast('integer')) 48 | model_data = model_data.withColumn("air_time", model_data.air_time.cast('integer')) 49 | model_data = model_data.withColumn("month", model_data.month.cast('integer')) 50 | model_data = model_data.withColumn("plane_year", model_data.plane_year.cast('integer')) 51 | #| 52 | #| 53 | ### Create a new column 54 | # Create the column plane_age 55 | # Create the column plane_age 56 | model_data = model_data.withColumn( 57 | "plane_age", model_data.year - model_data.plane_year) 58 | #| 59 | #| 60 | ### Making a Boolean 61 | # Create is_late bool 62 | model_data = model_data.withColumn("is_late", model_data.arr_delay >0) 63 | 64 | # Convert to an integer w/ .cast() 65 | model_data = model_data.withColumn("label", model_data.is_late.cast('integer')) 66 | 67 | # Remove missing values w/ SQL string 68 | model_data = model_data.filter("arr_delay is not NULL and dep_delay is not NULL and air_time is not NULL and plane_year is not NULL") 69 | #| 70 | #| 71 | """ 72 | \ Strings and Factors / 73 | 74 | > pyspark.ml.features submodule 75 | 'one-hot vectors' -> all elements are zero except for at most one element, which has a value of one (1). 76 | 77 | 1 > create a 'StringIndexer'. 78 | carr_indexer = StringIndexer(inputCol='carrier',outputCol='carrier_index') 79 | 2 > encode w/ 'OneHotEncoder'. 80 | carr_encoder = OneHotEncoder(inputCol='carrier_index',outputCol='carrier_fact') 81 | 3 > 'Pipeline' will take care of the rest. 
82 | # Fit and transform the data 83 | piped_data = flights_pipe.fit(model_data).transform(model_data) 84 | ----------------------- 85 | > 'VectorAssembler' -> combine all of the columns containing our features into a single column 86 | inputCols = ['column_name1','c2','c3'] 87 | outputCol = 'features' 88 | """ 89 | #| 90 | #| 91 | ### One-hot vectors 92 | """Why do you have to encode a categorical feature as a one-hot vector?""" 93 | # ANSW: Spark can only model numeric features. 94 | #| 95 | #| 96 | ### Carrier 97 | # Create a StringIndexer 98 | carr_indexer = StringIndexer(inputCol='carrier',outputCol='carrier_index') 99 | 100 | # Create a OneHotEncoder 101 | carr_encoder = OneHotEncoder(inputCol='carrier_index',outputCol='carrier_fact') 102 | #| 103 | #| 104 | ### Destination 105 | # Create a StringIndexer 106 | dest_indexer = StringIndexer(inputCol='dest',outputCol='dest_index') 107 | 108 | # Create a OneHotEncoder 109 | dest_encoder = OneHotEncoder(inputCol='dest_index',outputCol='dest_fact') 110 | #| 111 | #| 112 | ### Assemble a vector 113 | # Make a VectorAssembler 114 | vec_assembler = VectorAssembler(inputCols=["month", "air_time", "carrier_fact", "dest_fact", "plane_age"], outputCol='features') 115 | #| 116 | #| 117 | """ \ create pipelines / 118 | 119 | > Import Pipeline 120 | from pyspark.ml import Pipeline 121 | > stages 122 | 123 | """ 124 | #| 125 | #| 126 | ### Create the pipeline 127 | # Import Pipeline 128 | from pyspark.ml import Pipeline 129 | 130 | # Make the pipeline 131 | flights_pipe = Pipeline(stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]) 132 | #| 133 | #| 134 | ### Test vs. Train 135 | """ Why is it important to use a separate test set for model evaluation?""" 136 | # ANSW: By evaluating your model with a test set you can get a good idea of performance on new data. 137 | #| 138 | #| 139 | ### Transform the data 140 | # Fit and transform the data 141 | piped_data = flights_pipe.fit(model_data).transform(model_data) 142 | #| 143 | #| 144 | ### Split the data 145 | # Split the data into training and test sets 146 | # training with 60% of the data, and test with 40% 147 | training, test = piped_data.randomSplit([.6, .4]) 148 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/I. Create the modeler.py: -------------------------------------------------------------------------------- 1 | """ 2 | ||||| 3 | create a model that predicts which flights will be delayed. 4 | ||||| 5 | 6 | \ Logistic regression model / 7 | 8 | similar to a linear regression, but instead of predicting a numeric variable, 9 | ->it 'predicts the probability (between 0 and 1) of an event.' 10 | 11 | - set 'cutoff' point 12 | - if prediction above cutoff == 'yes' 13 | - if it's below, you classify it as a 'no'! 14 | 15 | -> hyperparameter : just a value in the model that is supplied by the user to maximize performance.""" 16 | #| 17 | #| 18 | """ Why do you supply hyperparameters?""" 19 | # ANSW: They improve model performance. 20 | #| 21 | #| 22 | ### Create the modeler 23 | 24 | # Import LogisticRegression # Estimator 25 | from pyspark.ml.classification import LogisticRegression 26 | 27 | # Create a LogisticRegression Estimator 28 | lr = LogisticRegression() 29 | 30 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/II.
Create the Evaluator.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ Cross validation / 3 | 4 | estimates model's error on held out data 5 | 6 | -> k-fold : method of estimating the model's performance on unseen data 7 | 8 | - split data in few partitions (3) 9 | - partitions is set aside, and the model is fit to the others 10 | - error is measured against the held out partition 11 | - This is repeated for each of the partitions 12 | - Then the error is averaged.""" 13 | #| 14 | #| 15 | ### 16 | """ What does cross validation allow you to estimate?""" 17 | # ANSW: The model's error on held out data. 18 | #| 19 | #| 20 | ### Create the evaluator 21 | # Import the evaluation submodule 22 | import pyspark.ml.evaluation as evals 23 | 24 | # Create a BinaryClassificationEvaluator 25 | evaluator = evals.BinaryClassificationEvaluator(metricName="areaUnderROC") 26 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/III. Make a grid.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ make a grid / 3 | 4 | - Import the submodule pyspark.ml.tuning under the alias tune. 5 | - Call the class constructor ParamGridBuilder() with no arguments. Save this as grid. 6 | - Call the .addGrid() method on grid with lr.regParam as the first argument and np.arange(0, .1, .01) as the second argument. This second call is a function from the numpy module (imported as np) that creates a list of numbers from 0 to .1, incrementing by .01. Overwrite grid with the result. 7 | - Update grid again by calling the .addGrid() method a second time create a grid for lr.elasticNetParam that includes only the values [0, 1]. 8 | - Call the .build() method on grid and overwrite it with the output. 9 | """ 10 | 11 | # Import the tuning submodule 12 | import pyspark.ml.tuning as tune 13 | 14 | # Create the parameter grid 15 | grid = tune.ParamGridBuilder() 16 | 17 | # Add the hyperparameter 18 | grid = grid.addGrid(lr.regParam, np.arange(0, .1, .01)) 19 | grid = grid.addGrid(lr.elasticNetParam,[0,1]) 20 | 21 | # Build the grid 22 | grid = grid.build() 23 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/IV. Make the validator.py: -------------------------------------------------------------------------------- 1 | # Create the CrossValidator 2 | cv = tune.CrossValidator(estimator=lr, 3 | estimatorParamMaps=grid, 4 | evaluator=evaluator 5 | ) 6 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/V. Fit the model(s).py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Fit the model(s) / 4 | 5 | You're finally ready to fit the models and select the best one! 6 | 7 | Unfortunately, cross validation is a very computationally intensive procedure. Fitting all the models would take too long on DataCamp. 8 | 9 | To do this locally you would use the code: 10 | 11 | # Fit cross validation models 12 | models = cv.fit(training) 13 | 14 | # Extract the best model 15 | best_lr = models.bestModel 16 | Remember, the training data is called training and you're using lr to fit a logistic regression model. Cross validation selected the parameter values regParam=0 and elasticNetParam=0 as being the best. 
These are the default values, so you don't need to do anything else with lr before fitting the model. 17 | 18 | Instructions 19 | 100 XP 20 | 21 | - Create best_lr by calling lr.fit() on the training data. 22 | - Print best_lr to verify that it's an object of the LogisticRegressionModel class. 23 | 24 | """ 25 | # Call lr.fit() 26 | best_lr = lr.fit(training) 27 | 28 | # Print best_lr 29 | print(best_lr) 30 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/VI. Evaluating binary classifiers.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | \ Evaluating binary classifiers / 4 | 5 | For this course we'll be using a common metric for binary classification algorithms call the 6 | 7 | -> AUC (or area under the curve) : the closer the AUC is to one (1), the better the model is! 8 | 9 | """ 10 | 11 | '''If you've created a perfect binary classification model, what would the AUC be?''' 12 | # ANSW: 1 13 | -------------------------------------------------------------------------------- /Introduction to PySpark/IV. Model tuning and selection/VII. Evaluate the model.py: -------------------------------------------------------------------------------- 1 | """ 2 | \ Evaluate the model / 3 | Remember the test data that you set aside waaaaaay back in chapter 3? It's finally time to test your model on it! You can use the same evaluator you made to fit the 4 | model. 5 | 6 | Instructions 7 | 100 XP 8 | - Use your model to generate predictions by applying best_lr.transform() to the test data. Save this as test_results. 9 | - Call evaluator.evaluate() on test_results to compute the AUC. Print the output. 10 | 11 | """ 12 | # Use the model to predict the test set 13 | test_results = best_lr.transform(test) 14 | 15 | # Evaluate the predictions 16 | print(evaluator.evaluate(test_results)) 17 | 18 | # 0.7 19 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/1. Attributes of relational databases.py: -------------------------------------------------------------------------------- 1 | """ 2 | Attributes of relational databases 3 | In the video, we talked about some basic facts about relational databases. Which of the following statements does not hold true for databases? Relational databases … 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | … store different real-world entities in different tables. 10 | … allow to establish relationships between entities. 11 | ok … are called "relational" because they store data only about people. 12 | … use constraints, keys and referential integrity in order to assure data quality. 13 | 14 | """ 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/2. Query information_schema with SELECT.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Query information_schema with SELECT 3 | information_schema is a meta-database that holds information about your current database. 
information_schema has multiple tables you can query with the known SELECT * FROM syntax: 4 | 5 | tables: information about all tables in your current database 6 | columns: information about all columns in all of the tables in your current database 7 | … 8 | In this exercise, you'll only need information from the 'public' schema, which is specified as the column table_schema of the tables and columns tables. The 'public' schema holds information about user-defined tables and databases. The other types of table_schema hold system information – for this course, you're only interested in user-defined stuff. 9 | 10 | Instructions 4/4 11 | 25 XP 12 | - Get information on all table names in the current database, while limiting your query to the 'public' table_schema. 13 | - Now have a look at the columns in university_professors by selecting all entries in information_schema.columns that correspond to that table. 14 | - How many columns does the table university_professors have? 15 | - Finally, print the first five rows of the university_professors table. 16 | ''' 17 | #1 18 | #-- Query the right table in information_schema 19 | SELECT table_name 20 | FROM information_schema.tables 21 | #-- Specify the correct table_schema value 22 | WHERE table_schema = 'public'; 23 | 24 | #2 25 | #-- Query the right table in information_schema to get columns 26 | SELECT column_name, data_type 27 | FROM information_schema.columns 28 | WHERE table_name = 'university_professors' AND table_schema = 'public'; 29 | 30 | #3 31 | #How many columns does the table university_professors have? 32 | #8 33 | 34 | # 4 35 | -- Query the first five rows of our table 36 | SELECT * 37 | FROM university_professors 38 | LIMIT 5; 39 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/3. CREATE your first few TABLEs.py: -------------------------------------------------------------------------------- 1 | """ 2 | CREATE your first few TABLEs 3 | You'll now start implementing a better database model. For this, you'll create tables for the professors and universities entity types. The other tables will be created for you. 4 | 5 | The syntax for creating simple tables is as follows: 6 | 7 | CREATE TABLE table_name ( 8 | column_a data_type, 9 | column_b data_type, 10 | column_c data_type 11 | ); 12 | Attention: Table and columns names, as well as data types, don't need to be surrounded by quotation marks. 13 | 14 | Instructions 1/2 15 | - Create a table professors with two text columns: firstname and lastname. 16 | - Create a table universities with three text columns: university_shortname, university, and university_city. 17 | 18 | """ 19 | -- Create a table for the professors entity type 20 | CREATE TABLE professors ( 21 | firstname text, 22 | lastname text 23 | ); 24 | 25 | -- Print the contents of this table 26 | SELECT * 27 | FROM professors 28 | 29 | #2 30 | -- Create a table for the universities entity type 31 | CREATE TABLE universities ( 32 | university_shortname text, 33 | university text, 34 | university_city text 35 | ); 36 | -- Print the contents of this table 37 | SELECT * 38 | FROM universities 39 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/4. ADD a COLUMN with ALTER TABLE.py: -------------------------------------------------------------------------------- 1 | """ 2 | ADD a COLUMN with ALTER TABLE 3 | Oops! 
We forgot to add the university_shortname column to the professors table. You've probably already noticed: 4 | 5 | In chapter 4 of this course, you'll need this column for connecting the professors table with the universities table. 6 | 7 | However, adding columns to existing tables is easy, especially if they're still empty. 8 | 9 | To add columns you can use the following SQL query: 10 | 11 | ALTER TABLE table_name 12 | ADD COLUMN column_name data_type; 13 | Instructions 14 | 100 XP 15 | - Alter professors to add the text column university_shortname. 16 | 17 | """ 18 | -- Add the university_shortname column 19 | ALTER TABLE professors 20 | ADD COLUMN university_shortname text; 21 | 22 | -- Print the contents of this table 23 | SELECT * 24 | FROM professors 25 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/5. RENAME and DROP COLUMNs in affiliations.py: -------------------------------------------------------------------------------- 1 | """ 2 | RENAME and DROP COLUMNs in affiliations 3 | As mentioned in the video, the still empty affiliations table has some flaws. In this exercise, you'll correct them as outlined in the video. 4 | 5 | You'll use the following queries: 6 | 7 | To rename columns: 8 | ALTER TABLE table_name 9 | RENAME COLUMN old_name TO new_name; 10 | To delete columns: 11 | ALTER TABLE table_name 12 | DROP COLUMN column_name; 13 | 14 | Instructions 2/2 15 | - Rename the organisation column to organization in affiliations 16 | - Delete the university_shortname column in affiliations. 17 | 18 | """ 19 | #-- Rename the organisation column 20 | ALTER TABLE affiliations 21 | RENAME COLUMN organisation TO organization; 22 | 23 | #2 24 | #-- Delete the university_shortname column 25 | ALTER TABLE affiliations 26 | DROP COLUMN university_shortname; 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/6. Migrate data with INSERT INTO SELECT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Migrate data with INSERT INTO SELECT DISTINCT 3 | Now it's finally time to migrate the data into the new tables. You'll use the following pattern: 4 | 5 | INSERT INTO ... 6 | SELECT DISTINCT ... 7 | FROM ...; 8 | It can be broken up into two parts: 9 | 10 | First part: 11 | 12 | SELECT DISTINCT column_name1, column_name2, ... 13 | FROM table_a; 14 | This selects all distinct values in table table_a – nothing new for you. 15 | 16 | Second part: 17 | 18 | INSERT INTO table_b ...; 19 | Take this part and append it to the first, so it inserts all distinct rows from table_a into table_b. 20 | 21 | One last thing: It is important that you run all of the code at the same time once you have filled out the blanks. 22 | 23 | Instructions 2/2 24 | 25 | - Insert all DISTINCT professors from university_professors into professors. 26 | - Print all the rows in professors. 27 | 28 | - Insert all DISTINCT affiliations into affiliations from university_professors. 
29 | """ 30 | #-- Insert unique professors into the new table 31 | INSERT INTO professors 32 | SELECT DISTINCT firstname, lastname, university_shortname 33 | FROM university_professors; 34 | 35 | #-- Doublecheck the contents of professors 36 | SELECT * 37 | FROM professors; 38 | 39 | #2 40 | #-- Insert unique affiliations into the new table 41 | INSERT INTO affiliations 42 | SELECT DISTINCT firstname, lastname, function, organization 43 | FROM university_professors; 44 | 45 | #-- Doublecheck the contents of affiliations 46 | SELECT * 47 | FROM affiliations; 48 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/1. Your first database/7. Delete tables with DROP TABLE.py: -------------------------------------------------------------------------------- 1 | """ 2 | Delete tables with DROP TABLE 3 | Obviously, the university_professors table is now no longer needed and can safely be deleted. 4 | 5 | For table deletion, you can use the simple command: 6 | 7 | DROP TABLE table_name; 8 | Instructions 9 | 100 XP 10 | Delete the university_professors table. 11 | 12 | """ 13 | #-- Delete the university_professors table 14 | DROP TABLE university_professors; 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/0. query table data types INFORMATION_SCHEMA.txt: -------------------------------------------------------------------------------- 1 | > INFORMATION_SCHEMA 2 | 3 | | DATATYPES from one table: | 4 | 5 | SELECT * FROM INFORMATION_SCHEMA.COLUMNS 6 | WHERE TABLE_NAME = 'Address'; 7 | 8 | | DATATYPES from all tables | 9 | 10 | SELECT * FROM INFORMATION_SCHEMA.COLUMNS; 11 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/1.Better data quality with constraints-changing datatypes CAST.txt: -------------------------------------------------------------------------------- 1 | 1. attribute contraints : datatypes on columns 2 | 2. key contraints : primary keys 3 | 3. Referential integrity contraints : foreign keys 4 | 5 | WHY contraints ? 6 | ----------------- 7 | 8 | . give data struture 9 | . help w/ consistency and data quality 10 | . data quality : business advantage/ data science prerequisite 11 | . PostgreSQL helps w/ enforcing 12 | 13 | | changing dataTypes | 14 | 15 | # change wind_speed from 'text' to 'integer' 16 | SELECT temperature * CAST(wind_speed AS integer) 17 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/10. .py: -------------------------------------------------------------------------------- 1 | """ 2 | Make your columns UNIQUE with ADD CONSTRAINT 3 | As seen in the video, you add the UNIQUE keyword after the column_name that should be unique. This, of course, only works for new tables: 4 | 5 | CREATE TABLE table_name ( 6 | column_name UNIQUE 7 | ); 8 | If you want to add a unique constraint to an existing table, you do it like that: 9 | 10 | ALTER TABLE table_name 11 | ADD CONSTRAINT some_name UNIQUE(column_name); 12 | Note that this is different from the ALTER COLUMN syntax for the not-null constraint. Also, you have to give the constraint a name some_name. 
13 | 14 | Instructions 2/2 15 | - Add a unique constraint to the university_shortname column in universities. Give it the name university_shortname_unq. 16 | - Add a unique constraint to the organization column in organizations. Give it the name organization_unq. 17 | 18 | """ 19 | #-- Make universities.university_shortname unique 20 | ALTER TABLE universities 21 | ADD CONSTRAINT university_shortname_unq UNIQUE(university_shortname); 22 | 23 | #-- Make organizations.organization unique 24 | ALTER TABLE organizations 25 | ADD CONSTRAINT organization_unq UNIQUE(organization); 26 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/10. What happens if you try to enter NULLs?.py: -------------------------------------------------------------------------------- 1 | """ 2 | What happens if you try to enter NULLs? 3 | Execute the following statement: 4 | 5 | INSERT INTO professors (firstname, lastname, university_shortname) 6 | VALUES (NULL, 'Miller', 'ETH'); 7 | - Why does this throw an error? 8 | . Possible Answers 9 | 10 | Professors without first names do not exist 11 | OK Because a database constraint is violated. 12 | Error? This works just fine. 13 | NULL is not put in quotes. 14 | 15 | """ 16 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/2. Types of database constraints.py: -------------------------------------------------------------------------------- 1 | """ 2 | Types of database constraints 3 | Which of the following is not used to enforce a database constraint? 4 | 5 | Answer the question 6 | 50XP 7 | Possible Answers 8 | 9 | Foreign keys 10 | OK SQL aggregate functions # only does calculations 11 | The BIGINT data type 12 | Primary keys 13 | 14 | """ 15 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/3. Conforming with data types.py: -------------------------------------------------------------------------------- 1 | """ 2 | Conforming with data types 3 | For demonstration purposes, I created a fictional database table that only holds three records. The columns have the data types date, integer, and text, respectively. 4 | 5 | CREATE TABLE transactions ( 6 | transaction_date date, 7 | amount integer, 8 | fee text 9 | ); 10 | Have a look at the contents of the transactions table. 11 | 12 | The transaction_date accepts date values. According to the PostgreSQL documentation, it accepts values in the form of YYYY-MM-DD, DD/MM/YY, and so forth. 13 | 14 | Both columns amount and fee appear to be numeric, however, the latter is modeled as text – which you will account for in the next exercise. 15 | 16 | Instructions 17 | 100 XP 18 | Execute the given sample code. 19 | As it doesn't work, have a look at the error message and correct the statement accordingly – then execute it again. 20 | 21 | """ 22 | #-- Let's add a record to the table 23 | INSERT INTO transactions (transaction_date, amount, fee) 24 | VALUES ('2018-09-24', 5454, '30'); 25 | 26 | #-- Doublecheck the contents 27 | SELECT * 28 | FROM transactions; 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/4. 
Type CASTs.py: -------------------------------------------------------------------------------- 1 | """ 2 | Type CASTs 3 | In the video, you saw that type casts are a possible solution for data type issues. If you know that a certain column stores numbers as text, you can cast the column to a numeric form, i.e. to integer. 4 | 5 | SELECT CAST(some_column AS integer) 6 | FROM table; 7 | Now, the some_column column is temporarily represented as integer instead of text, meaning that you can perform numeric calculations on the column. 8 | 9 | Instructions 10 | 100 XP 11 | - Execute the given sample code. 12 | - As it doesn't work, add an integer type cast at the right place and execute it again. 13 | 14 | """ 15 | #-- Calculate the net amount as amount + fee 16 | #-- fee wasn't integer 17 | #-- saw that w/ SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'transactions'; 18 | SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 19 | FROM transactions; 20 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/5. Working with data types.txt: -------------------------------------------------------------------------------- 1 | | most common datatypes | 2 | 3 | . text 4 | . varchar(x) : variable-length string of up to x characters 5 | . char(x) : fixed-length string of exactly x characters 6 | . boolean : TRUE or 1, FALSE or 0, NULL 7 | 8 | - date, time, timestamp: date-time calculations 9 | - numeric : 3.134 10 | - integer: 3 -3 11 | 12 | | specifying numeric eg: | 13 | 14 | grades numeric(3, 2) # 3 digits in total, 2 of those after the decimal point # eg: 5.43 15 | 16 | | ALTER table and column after creation eg: | 17 | 18 | ALTER TABLE students 19 | ALTER COLUMN name 20 | TYPE varchar(50); 21 | 22 | | ALTER turning into INT and rounding eg: | 23 | 24 | ALTER TABLE students 25 | ALTER COLUMN average_grade 26 | TYPE integer 27 | USING ROUND(average_grade) 28 | # from 5.54 to 6 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/6. Change types with ALTER COLUMN.py: -------------------------------------------------------------------------------- 1 | """ 2 | Change types with ALTER COLUMN 3 | The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name column in table_name to varchar(10): 4 | 5 | ALTER TABLE table_name 6 | ALTER COLUMN column_name 7 | TYPE varchar(10) 8 | Now it's time to start adding constraints to your database. 9 | 10 | Instructions 3/3 11 | - Have a look at the distinct university_shortname values in the professors table and take note of the length of the strings. 12 | - Now specify a fixed-length character type with the correct length for university_shortname. 13 | - Change the type of the firstname column to varchar(64). 14 | 15 | """ 16 | #-- Select the university_shortname column 17 | SELECT DISTINCT(university_shortname) 18 | FROM professors; 19 | 20 | #-- Specify the correct fixed-length character type 21 | ALTER TABLE professors 22 | ALTER COLUMN university_shortname 23 | TYPE char(3); 24 | 25 | #-- Change the type of firstname 26 | ALTER TABLE professors 27 | ALTER COLUMN firstname 28 | TYPE varchar(64); 29 | 30 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2.
Enforce data consistency with attribute constraints/7. Convert types USING a function.py: -------------------------------------------------------------------------------- 1 | """ 2 | Convert types USING a function 3 | If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type. 4 | 5 | For this, you can use the following syntax: 6 | 7 | ALTER TABLE table_name 8 | ALTER COLUMN column_name 9 | TYPE varchar(x) 10 | USING SUBSTRING(column_name FROM 1 FOR x) 11 | You should read it like this: Because you want to reserve only x characters for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters of it, and throw away the rest. This way, the values will fit the varchar(x) requirement. 12 | 13 | Instructions 14 | 100 XP 15 | - Run the sample code as is and take note of the error. 16 | - Now use SUBSTRING() to reduce firstname to 16 characters so its type can be altered to varchar(16). 17 | 18 | """ 19 | #-- Convert the values in firstname to a max. of 16 characters 20 | ALTER TABLE professors 21 | ALTER COLUMN firstname 22 | TYPE varchar(16) 23 | USING SUBSTRING(firstname FROM 1 FOR 16) #--max of 16 chars 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/8. The not-null and unique constraints.txt: -------------------------------------------------------------------------------- 1 | -> NOT NULL : not empty (obligatory) 2 | 3 | | set not null after creation | 4 | 5 | ALTER TABLE students 6 | ALTER COLUMN phone 7 | SET NOT NULL; 8 | 9 | | ELiminate NOT NULL after creation | 10 | 11 | ALTER TABLE students 12 | ALTER COLUMN ssn 13 | DROP NOT NULL; 14 | 15 | -> UNIQUE 16 | 17 | | SET unique when creating | 18 | 19 | CREATE TABLE table_name( 20 | column_name UNIQUE); 21 | 22 | | SET UNIQUE AFTER CREATION | 23 | 24 | ALTER TABLE table_name 25 | ADD CONSTRAINT some_name UNIQUE(column_name); 26 | 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/2. Enforce data consistency with attribute constraints/9. Disallow NULL values with SET NOT NULL.py: -------------------------------------------------------------------------------- 1 | """ 2 | Disallow NULL values with SET NOT NULL 3 | The professors table is almost ready now. However, it still allows for NULLs to be entered. Although some information might be missing about some professors, there's certainly columns that always need to be specified. 4 | 5 | Instructions 2/2 6 | 7 | - Add a not-null constraint for the firstname column. 8 | - Add a not-null constraint for the lastname column. 9 | 10 | """ 11 | #-- Disallow NULL values in firstname 12 | ALTER TABLE professors 13 | ALTER COLUMN firstname SET NOT NULL; 14 | 15 | #-- Disallow NULL values in lastname 16 | ALTER TABLE professors 17 | ALTER COLUMN lastname SET NOT NULL 18 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/1. Get to know SELECT COUNT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Get to know SELECT COUNT DISTINCT 3 | Your database doesn't have any defined keys so far, and you don't know which columns or combinations of columns are suited as keys. 
4 | 5 | There's a simple way of finding out whether a certain column (or a combination) contains only unique values – and thus identifies the records in the table. 6 | 7 | You already know the SELECT DISTINCT query from the first chapter. Now you just have to wrap everything within the COUNT() function and PostgreSQL will return the number of unique rows for the given columns: 8 | 9 | SELECT COUNT(DISTINCT(column_a, column_b, ...)) 10 | FROM table; 11 | Instructions 2/2 12 | - First, find out the number of rows in universities. 13 | - Then, find out how many unique values there are in the university_city column. 14 | 15 | """ 16 | #-- Count the number of rows in universities 17 | SELECT COUNT(*) 18 | FROM universities; 19 | 20 | #-- Count the number of distinct values in the university_city column 21 | SELECT COUNT(DISTINCT(university_city)) 22 | FROM universities; 23 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/2. Identify keys with SELECT COUNT DISTINCT.py: -------------------------------------------------------------------------------- 1 | """ 2 | Identify keys with SELECT COUNT DISTINCT 3 | There's a very basic way of finding out what qualifies for a key in an existing, populated table: 4 | 5 | Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey. 6 | 7 | Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key. 8 | 9 | The table professors has 551 rows. It has only one possible candidate key, which is a combination of two attributes. You might want to try different combinations using the "Run code" button. Once you have found the solution, you can submit your answer. 10 | 11 | Instructions 12 | 100 XP 13 | - Using the above steps, identify the candidate key by trying out different combination of columns. 14 | 15 | """ 16 | #-- Try out different combinations 17 | #-- Identify keys w/ SELECT COUNT DISTINCT 18 | SELECT COUNT(DISTINCT(firstname, lastname)) # SUPER-KEY 19 | FROM professors; 20 | ''' 21 | count 22 | 551 23 | ''' 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/3. Identify the primary key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Identify the primary key 3 | Have a look at the example table from the previous video. As the database designer, you have to make a wise choice as to which column should be the primary key. 4 | 5 | license_no | serial_no | make | model | year 6 | --------------------+-----------+------------+---------+------ 7 | Texas ABC-739 | A69352 | Ford | Mustang | 2 8 | Florida TVP-347 | B43696 | Oldsmobile | Cutlass | 5 9 | New York MPO-22 | X83554 | Oldsmobile | Delta | 1 10 | California 432-TFY | C43742 | Mercedes | 190-D | 99 11 | California RSK-629 | Y82935 | Toyota | Camry | 4 12 | Texas RSK-629 | U028365 | Jaguar | XJS | 4 13 | - Which of the following column or column combinations could best serve as primary key? 
14 | 15 | Answer the question 16 | 50XP 17 | Possible Answers 18 | 19 | PK = {make} 20 | PK = {model, year} 21 | OK PK = {license_no} 22 | PK = {year, make} 23 | 24 | """ 25 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/4. ADD key CONSTRAINTs to the tables.py: -------------------------------------------------------------------------------- 1 | """ 2 | ADD key CONSTRAINTs to the tables 3 | Two of the tables in your database already have well-suited candidate keys consisting of one column each: organizations and universities with the organization and university_shortname columns, respectively. 4 | 5 | In this exercise, you'll rename these columns to id using the RENAME COLUMN command and then specify primary key constraints for them. This is as straightforward as adding unique constraints (see the last exercise of Chapter 2): 6 | 7 | ALTER TABLE table_name 8 | ADD CONSTRAINT some_name PRIMARY KEY (column_name) 9 | Note that you can also specify more than one column in the brackets. 10 | 11 | Instructions 1/2 12 | 13 | 1 14 | - Rename the organization column to id in organizations. 15 | - Make id a primary key and name it organization_pk. 16 | 2 17 | - Rename the university_shortname column to id in universities. 18 | - Make id a primary key and name it university_pk. 19 | 20 | """ 21 | #-- Rename the organization column to id 22 | ALTER TABLE organizations 23 | RENAME COLUMN organization TO id; 24 | 25 | #-- Make id a primary key 26 | ALTER TABLE organizations 27 | ADD CONSTRAINT organization_pk PRIMARY KEY (id); 28 | 29 | #2 30 | #-- Rename the university_shortname column to id 31 | ALTER TABLE universities 32 | RENAME COLUMN university_shortname TO id; 33 | 34 | #-- Make id a primary key 35 | ALTER TABLE universities 36 | ADD CONSTRAINT university_pk PRIMARY KEY (id); 37 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/5. Add a SERIAL surrogate key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Add a SERIAL surrogate key 3 | Since there's no single-column candidate key in professors (only a composite key candidate consisting of firstname, lastname), you'll add a new column id to that table. 4 | 5 | This column has a special data type serial, which turns the column into an auto-incrementing number. This means that, whenever you add a new professor to the table, it will automatically get an id that does not exist yet in the table: a perfect primary key! 6 | 7 | Instructions 3/3 8 | 9 | 1- Add a new column id with data type serial to the professors table. 10 | 2- Make id a primary key and name it professors_pkey. 11 | 3- Have a look at the first 10 rows of professors. 12 | 13 | """ 14 | #1 15 | #-- Add the new column to the table 16 | ALTER TABLE professors 17 | ADD COLUMN id serial; #-- serial auto-increments, so each new row gets a fresh id 18 | 19 | #2 20 | #-- Make id a primary key 21 | ALTER TABLE professors 22 | ADD CONSTRAINT professors_pkey PRIMARY KEY (id); 23 | 24 | #3 25 | #-- Have a look at the first 10 rows of professors 26 | SELECT * 27 | FROM professors 28 | LIMIT 10; 29 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/6.
CONCATenate columns to a surrogate key.py: -------------------------------------------------------------------------------- 1 | """ 2 | CONCATenate columns to a surrogate key 3 | Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT() function. 4 | 5 | Let's think of the following example table: 6 | 7 | CREATE TABLE cars ( 8 | make varchar(64) NOT NULL, 9 | model varchar(64) NOT NULL, 10 | mpg integer NOT NULL 11 | ) 12 | The table is populated with 10 rows of completely fictional data. 13 | 14 | Unfortunately, the table doesn't have a primary key yet. None of the columns consists of only unique values, so some columns can be combined to form a key. 15 | 16 | In the course of the following exercises, you will combine make and model into such a surrogate key. 17 | 18 | Instructions 4/4 19 | 20 | - Count the number of distinct rows with a combination of the make and model columns. 21 | - Add a new column id with the data type varchar(128). 22 | - Concatenate make and model into id using an UPDATE table_name SET column_name = ... query and the CONCAT() function. 23 | - Make id a primary key and name it id_pk. 24 | 25 | """ 26 | #1 27 | #-- Count the number of distinct rows with columns make, model 28 | SELECT COUNT(DISTINCT(make, model)) 29 | FROM cars; 30 | 31 | #2 32 | #-- Add the id column 33 | ALTER TABLE cars 34 | ADD COLUMN id varchar(128); 35 | 36 | #3 37 | #-- Update id with make + model 38 | -- UPDATE, SET x = 39 | UPDATE cars 40 | SET id = CONCAT(make, model); 41 | 42 | #4 43 | #-- Make id a primary key 44 | ALTER TABLE cars 45 | ADD CONSTRAINT id_pk PRIMARY KEY(id); 46 | 47 | #-- Have a look at the table 48 | SELECT * FROM cars; 49 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/3. Uniquely identify records with key constraints/7. Test your knowledge before advancing.py: -------------------------------------------------------------------------------- 1 | """ 2 | Test your knowledge before advancing 3 | Before you move on to the next chapter, let's quickly review what you've learned so far about attributes and key constraints. If you're unsure about the answer, please quickly review chapters 2 and 3, respectively. 4 | 5 | Let's think of an entity type "student". A student has: 6 | 7 | . a last name consisting of up to 128 characters (required), 8 | . a unique social security number, consisting only of integers, that should serve as a key, 9 | . a phone number of fixed length 12, consisting of numbers and characters (but some students don't have one). 10 | 11 | Instructions 12 | 100 XP 13 | - Given the above description of a student entity, create a table students with the correct column types. 14 | - Add a PRIMARY KEY for the social security number ssn. 15 | - Note that there is no formal length requirement for the integer column. The application would have to make sure it's a correct SSN! 16 | 17 | """ 18 | #-- Create the table 19 | CREATE TABLE students ( 20 | last_name varchar(128) NOT NULL, 21 | ssn integer PRIMARY KEY, 22 | phone_no char(12) 23 | ); 24 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/1. 
REFERENCE a table with a FOREIGN KEY.py: -------------------------------------------------------------------------------- 1 | """ 2 | REFERENCE a table with a FOREIGN KEY 3 | In your database, you want the professors table to reference the universities table. You can do that by specifying a column in professors table that references a column in the universities table. 4 | 5 | As just shown in the video, the syntax for that looks like this: 6 | 7 | ALTER TABLE a 8 | ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id); 9 | Table a should now refer to table b, via b_id, which points to id. a_fkey is, as usual, a constraint name you can choose on your own. 10 | 11 | Pay attention to the naming convention employed here: Usually, a foreign key referencing another primary key with name id is named x_id, where x is the name of the referencing table in the singular form. 12 | 13 | Instructions 2/2 14 | 1- Rename the university_shortname column to university_id in professors. 15 | 2- Add a foreign key on university_id column in professors that references the id column in universities. 16 | 2- Name this foreign key professors_fkey. 17 | 18 | """ 19 | #1 20 | #-- Rename the university_shortname column 21 | ALTER TABLE professors 22 | RENAME COLUMN university_shortname TO university_id; 23 | 24 | #2 25 | #-- Add a foreign key on professors referencing universities 26 | ALTER TABLE professors 27 | ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id); 28 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/10. Change the referential integrity behavior of a key.py: -------------------------------------------------------------------------------- 1 | """ 2 | Change the referential integrity behavior of a key 3 | So far, you implemented three foreign key constraints: 4 | 5 | professors.university_id to universities.id 6 | affiliations.organization_id to organizations.id 7 | affiliations.professor_id to professors.id 8 | These foreign keys currently have the behavior ON DELETE NO ACTION. Here, you're going to change that behavior for the column referencing organizations from affiliations. If an organization is deleted, all its affiliations (by any professor) should also be deleted. 9 | 10 | Altering a key constraint doesn't work with ALTER COLUMN. Instead, you have to DROP the key constraint and then ADD a new one with a different ON DELETE behavior. 11 | 12 | For deleting constraints, though, you need to know their name. This information is also stored in information_schema. 13 | 14 | Instructions 4/4 15 | - Have a look at the existing foreign key constraints by querying table_constraints in information_schema. 16 | - Delete the affiliations_organization_id_fkey foreign key constraint in affiliations. 17 | - Add a new foreign key to affiliations that CASCADEs deletion if a referenced record is deleted from organizations. Name it affiliations_organization_id_fkey. 18 | - Run the DELETE and SELECT queries to double check that the deletion cascade actually works. 
19 | 20 | """ 21 | #--1 22 | #-- Identify the correct constraint name 23 | SELECT constraint_name, table_name, constraint_type 24 | FROM information_schema.table_constraints 25 | WHERE constraint_type = 'FOREIGN KEY'; 26 | 27 | #--2 28 | #-- Drop the right foreign key constraint 29 | ALTER TABLE affiliations 30 | DROP CONSTRAINT affiliations_organization_id_fkey; 31 | 32 | #-- Add a new foreign key constraint from affiliations to organizations which cascades deletion 33 | ALTER TABLE affiliations 34 | ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE; 35 | #--. CASCADE : delete all refencing records 36 | 37 | #--3 38 | #-- Delete an organization 39 | DELETE FROM organizations 40 | WHERE id = 'CUREM'; 41 | 42 | #--4 43 | #-- Check that no more affiliations with this organization exist 44 | SELECT * FROM affiliations 45 | WHERE organization_id = 'CUREM'; 46 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/11. Count affiliations per university.py: -------------------------------------------------------------------------------- 1 | """ 2 | Count affiliations per university 3 | Now that your data is ready for analysis, let's run some exemplary SQL queries on the database. You'll now use already known concepts such as grouping by columns and joining tables. 4 | 5 | In this exercise, you will find out which university has the most affiliations (through its professors). For that, you need both affiliations and professors tables, as the latter also holds the university_id. 6 | 7 | As a quick repetition, remember that joins have the following structure: 8 | 9 | SELECT table_a.column1, table_a.column2, table_b.column1, ... 10 | FROM table_a 11 | JOIN table_b 12 | ON table_a.column = table_b.column 13 | This results in a combination of table_a and table_b, but only with rows where table_a.column is equal to table_b.column. 14 | 15 | Instructions 16 | 100 XP 17 | - Count the number of total affiliations by university. 18 | - Sort the result by that count, in descending order. 19 | 20 | """ 21 | #-- Count the total number of affiliations per university 22 | SELECT COUNT(*), professors.university_id 23 | FROM affiliations 24 | JOIN professors 25 | ON affiliations.professor_id = professors.id 26 | #-- Group by the ids of professors 27 | GROUP BY professors.university_id 28 | ORDER BY count DESC 29 | 30 | '''count university_id 31 | 579 EPF 32 | 273 USG...''' 33 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/12. .py: -------------------------------------------------------------------------------- 1 | """ 2 | Join all the tables together 3 | In this last exercise, your task is to find the university city of the professor with the most affiliations in the sector "Media & communication". 4 | 5 | For this, 6 | 7 | you need to join all the tables, 8 | group by some column, 9 | and then use selection criteria to get only the rows in the correct sector. 10 | Let's do this in three steps! 11 | 12 | Instructions 3/3 13 | 14 | - Join all tables in the database (starting with affiliations, professors, organizations, and universities) and look at the result. 15 | - Now group the result by organization sector, professor, and university city. 16 | Count the resulting number of rows. 
17 | - Only retain rows with "Media & communication" as organization sector, and sort the table by count, in descending order. 18 | 19 | """ 20 | #1 21 | #-- Join all tables 22 | SELECT * 23 | FROM affiliations 24 | JOIN professors 25 | ON affiliations.professor_id = professors.id 26 | JOIN organizations 27 | ON affiliations.organization_id = organizations.id 28 | JOIN universities 29 | ON professors.university_id = universities.id; 30 | 31 | #2 32 | #-- Group the table by organization sector, professor ID and university city 33 | SELECT COUNT(*), organizations.organization_sector, 34 | professors.id, universities.university_city 35 | FROM affiliations 36 | JOIN professors 37 | ON affiliations.professor_id = professors.id 38 | JOIN organizations 39 | ON affiliations.organization_id = organizations.id 40 | JOIN universities 41 | ON professors.university_id = universities.id 42 | GROUP BY organizations.organization_sector, 43 | professors.id, universities.university_city; 44 | 45 | #3 46 | #-- Filter the table and sort it 47 | SELECT COUNT(*), organizations.organization_sector, 48 | professors.id, universities.university_city 49 | FROM affiliations 50 | JOIN professors 51 | ON affiliations.professor_id = professors.id 52 | JOIN organizations 53 | ON affiliations.organization_id = organizations.id 54 | JOIN universities 55 | ON professors.university_id = universities.id 56 | WHERE organizations.organization_sector = 'Media & communication' 57 | GROUP BY organizations.organization_sector, 58 | professors.id, universities.university_city 59 | ORDER BY count DESC; 60 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/2. Explore foreign key constraints.py: -------------------------------------------------------------------------------- 1 | """ 2 | Explore foreign key constraints 3 | Foreign key constraints help you to keep order in your database mini-world. In your database, for instance, only professors belonging to Swiss universities should 4 | be allowed, as only Swiss universities are part of the universities table. 5 | 6 | The foreign key on professors referencing universities you just created thus makes sure that only existing universities can be specified when inserting new data. 7 | Let's test this! 8 | 9 | Instructions 10 | 100 XP 11 | - Run the sample code and have a look at the error message. 12 | - What's wrong? Correct the university_id so that it actually reflects where Albert Einstein wrote his dissertation and became a professor 13 | – at the University of Zurich (UZH)! 14 | 15 | """ 16 | #-- Try to insert a new professor 17 | INSERT INTO professors (firstname, lastname, university_id) 18 | VALUES ('Albert', 'Einstein', 'UZH'); 19 | # inserting a professor with non-existing university IDs violates the foreign key constraint 20 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/3. JOIN tables linked by a foreign key.py: -------------------------------------------------------------------------------- 1 | """ 2 | JOIN tables linked by a foreign key 3 | Let's join these two tables to analyze the data further! 4 | 5 | You might already know how SQL joins work from the Intro to SQL for Data Science course (last exercise) or from Joining Data in PostgreSQL. 6 | 7 | Here's a quick recap on how joins generally work: 8 | 9 | SELECT ... 
10 | FROM table_a 11 | JOIN table_b 12 | ON ... 13 | WHERE ... 14 | While foreign keys and primary keys are not strictly necessary for join queries, they greatly help by telling you what to expect. For instance, you can be sure that records referenced from table A will always be present in table B – so a join from table A will always find something in table B. If not, the foreign key constraint would be violated. 15 | 16 | Instructions 17 | 100 XP 18 | - JOIN professors with universities on professors.university_id = universities.id, i.e., retain all records where the foreign key of professors is equal to the primary key of universities. 19 | - Filter for university_city = 'Zurich'. 20 | 21 | """ 22 | #-- Select all professors working for universities in the city of Zurich 23 | SELECT professors.lastname, universities.id, universities.university_city 24 | FROM universities 25 | INNER JOIN professors 26 | ON professors.university_id = universities.id 27 | WHERE universities.university_city = 'Zurich'; 28 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/4. Model more complex relationships.txt: -------------------------------------------------------------------------------- 1 | | N:M relationship | 2 | 3 | . create a table 4 | . add foreign keys for every connected table 5 | . add additional attributes 6 | 7 | > CREATE TABLE affiliations ( 8 | professor_id integer REFERENCES professors (id), 9 | organization_id varchar(256) REFERENCES organizations (id), 10 | function varchar(256) 11 | ); 12 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/5. Add foreign keys to the "affiliations" table.py: -------------------------------------------------------------------------------- 1 | """ 2 | Add foreign keys to the "affiliations" table 3 | At the moment, the affiliations table has the structure {firstname, lastname, function, organization}, as you can see in the preview at the bottom right. In the next three exercises, you're going to turn this table into the form {professor_id, organization_id, function}, with professor_id and organization_id being foreign keys that point to the respective tables. 4 | 5 | You're going to transform the affiliations table in-place, i.e., without creating a temporary table to cache your intermediate results. 6 | 7 | Instructions 3/3 8 | - Add a professor_id column with integer data type to affiliations, and declare it to be a foreign key that references the id column in professors. 9 | - Rename the organization column in affiliations to organization_id. 10 | - Add a foreign key constraint on organization_id so that it references the id column in organizations. 11 | 12 | """ 13 | #1 14 | #-- Add a professor_id column to affiliations 15 | ALTER TABLE affiliations 16 | ADD COLUMN professor_id integer REFERENCES professors (id); 17 | 18 | #2 19 | #-- Rename the organization column to organization_id 20 | ALTER TABLE affiliations 21 | RENAME organization TO organization_id; 22 | 23 | #3 24 | #-- Add a foreign key on organization_id 25 | ALTER TABLE affiliations 26 | ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id); 27 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4.
Glue together tables with foreign keys/6. Populate the "professor_id" column.py: -------------------------------------------------------------------------------- 1 | """ 2 | Populate the "professor_id" column 3 | Now it's time to also populate professors_id. You'll take the ID directly from professors. 4 | 5 | Here's a way to update columns of a table based on values in another table: 6 | 7 | UPDATE table_a 8 | SET column_to_update = table_b.column_to_update_from 9 | FROM table_b 10 | WHERE condition1 AND condition2 AND ...; 11 | This query does the following: 12 | 13 | For each row in table_a, find the corresponding row in table_b where condition1, condition2, etc., are met. 14 | Set the value of column_to_update to the value of column_to_update_from (from that corresponding row). 15 | The conditions usually compare other columns of both tables, e.g. table_a.some_column = table_b.some_column. Of course, this query only makes sense if there is only one matching row in table_b. 16 | 17 | Instructions 3/3 18 | - First, have a look at the current state of affiliations by fetching 10 rows and all columns. 19 | - Update the professor_id column with the corresponding value of the id column in professors. 20 | "Corresponding" means rows in professors where the firstname and lastname are identical to the ones in affiliations. 21 | - Check out the first 10 rows and all columns of affiliations again. Have the professor_ids been correctly matched? 22 | 23 | """ 24 | #1 25 | #-- Have a look at the 10 first rows of affiliations 26 | SELECT * 27 | FROM affiliations 28 | LIMIT 10; 29 | 30 | #2 31 | #-- Set professor_id to professors.id where firstname, lastname correspond to rows in professors 32 | UPDATE affiliations 33 | SET professor_id = professors.id 34 | FROM professors 35 | WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname; 36 | 37 | #3 38 | #-- Have a look at the 10 first rows of affiliations again 39 | SELECT * 40 | FROM affiliations 41 | LIMIT 10; 42 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/7. Drop "firstname" and "lastname".py: -------------------------------------------------------------------------------- 1 | """ 2 | Drop "firstname" and "lastname" 3 | The firstname and lastname columns of affiliations were used to establish a link to the professors table in the last exercise – so the appropriate professor IDs could be copied over. This only worked because there is exactly one corresponding professor for each row in affiliations. In other words: {firstname, lastname} is a candidate key of professors – a unique combination of columns. 4 | 5 | It isn't one in affiliations though, because, as said in the video, professors can have more than one affiliation. 6 | 7 | Because professors are referenced by professor_id now, the firstname and lastname columns are no longer needed, so it's time to drop them. After all, one of the goals of a database is to reduce redundancy where possible. 8 | 9 | Instructions 10 | 100 XP 11 | - Drop the firstname and lastname columns from the affiliations table. 12 | 13 | """ 14 | #-- Drop the firstname column 15 | ALTER TABLE affiliations 16 | DROP COLUMN firstname; 17 | 18 | #-- Drop the lastname column 19 | ALTER TABLE affiliations 20 | DROP COLUMN lastname; 21 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. 
Glue together tables with foreign keys/8. Referential Integrity ON DELETE.txt: -------------------------------------------------------------------------------- 1 | | Dealing w/ violations | 2 | 3 | > ON DELETE 4 | 5 | . NO ACTION throw an error 6 | . CASCADE : delete all refencing records 7 | . RESTRICT : Throw an error 8 | . SET NULL : set referencing column to null 9 | . SET DEFAULT : set referencing column to default value 10 | -------------------------------------------------------------------------------- /Introduction to Relational Databases in SQL/4. Glue together tables with foreign keys/9. Referential integrity violations.py: -------------------------------------------------------------------------------- 1 | """ 2 | Referential integrity violations 3 | Given the current state of your database, what happens if you execute the following SQL statement? 4 | 5 | DELETE FROM universities WHERE id = 'EPF'; 6 | 7 | Instructions 8 | 50 XP 9 | Possible Answers 10 | 11 | It throws an error because the university with ID "EPF" does not exist. 12 | The university with ID "EPF" is deleted. 13 | It fails because referential integrity from universities to professors is violated. 14 | OK It fails because referential integrity from professors to universities is violated. 15 | 16 | """ 17 | SELECT * 18 | FROM universities 19 | WHERE id = 'EPF'; 20 | 21 | DELETE FROM universities WHERE id = 'EPF'; 22 | 23 | # ERROR : DETAIL: Key (id)=(EPF) is still referenced from table "professors". 24 | -------------------------------------------------------------------------------- /Introduction to Shell/I Manipulating files and directories.py: -------------------------------------------------------------------------------- 1 | """ brief introduction to the Unix shell. You'll learn why it is still in use after almost 50 years, how it compares to the graphical tools 2 | you may be more familiar with, how to move around in the shell, and how to create, modify, and delete files and folders. 3 | """ 4 | 5 | 6 | ## How does the shell compare to a desktop interface? 7 | """---What is the relationship between the graphical file explorer that most people use and the command-line shell?""" 8 | # They are both interfaces for issuing commands to the operating system. 9 | 10 | 11 | ## Where am I? 12 | """ >>>>>> $ pwd === print working directory.""" 13 | # /home/repl 14 | 15 | 16 | ## How can I identify files and directories? 17 | """ >>>>>> $ cd === go inside directory.""" 18 | """ >>>>>> $ ls === list the files in the directory.""" 19 | 20 | """---Which of these files is not in that directory?.""" 21 | # $ cd seasonal 22 | # $ ls 23 | # fall.csv 24 | 25 | 26 | ## How else can I identify files and directories? 27 | """ **PATH : If it begins with /, it is absolute. If it does not begin with /, it is relative.""" 28 | # ls course.txt 29 | # $ ls seasonal/summer.csv 30 | 31 | 32 | ## How can I move to another directory? 33 | # cd seasonal 34 | # pwd 35 | # ls 36 | 37 | 38 | ## How can I move up a directory? 39 | """---If you are in /home/repl/seasonal, where does cd ~/../. take you?""" 40 | # /home 41 | 42 | 43 | ## How can I copy files? 
44 | """ >>>>>> cp original.txt duplicate.txt === creates a copy and renames.""" 45 | """copy summer.csv in seasonal, to backup file, as summer.bck""" 46 | # cp original.txt duplicate.txt 47 | 48 | """Copy spring.csv and summer.csv from the seasonal directory into the backup directory without changing your current working directory (/home/repl).""" 49 | # cp seasonal/spring.csv seasonal/summer.csv backup 50 | 51 | 52 | ## How can I move a file? 53 | """ >>>>>> mv === move a file,rename a file. same syntax as cp.""" 54 | """You are in /home/repl, which has sub-directories seasonal and backup. Using a single command, move spring.csv and summer.csv from seasonal to backup.""" 55 | # mv seasonal/spring.csv seasonal/summer.csv backup 56 | 57 | 58 | ## How can I rename files? 59 | """Rename the file winter.csv to be winter.csv.bck.""" 60 | # $ cd seasonal 61 | # $ ls 62 | # autumn.csv spring.csv summer.csv winter.csv 63 | # $ mv winter.csv winter.csv.bck 64 | # $ ls 65 | # autumn.csv spring.csv summer.csv winter.csv.bck 66 | 67 | 68 | ## How can I delete files? 69 | """ >>>>>> rm === remove file through terminal.""" 70 | """Remove autumn.csv from seasonal, then go back and remove autumn.csv from seasonal.""" 71 | # $ cd seasonal 72 | # $ rm autumn.csv 73 | # $ cd .. 74 | # $ rm seasonal/summer.csv 75 | 76 | 77 | ## How can I create and delete directories? 78 | """Without changing directories, delete the file agarwal.txt in the people directory.""" 79 | # rm -r people/agarwal.txt 80 | """Now that the people directory is empty, use a single command to delete it.""" 81 | # rmdir people 82 | """ >>>>>> rmdir === remove empty folder.""" 83 | """ >>>>>> mkdir === create a directory.""" 84 | """create a directory/folder""" 85 | # mkdir yearly 86 | """early exists, create another directory called 2017 inside it without leaving your home directory.""" 87 | # $ mkdir yearly/2017 88 | 89 | 90 | ## Wrapping up 91 | """ ** create intermediate files when analyzing data. Rather than storing them in your home directory, you can put them in /tmp,/tmp is 92 | ** immediately below the root directory / """ 93 | """ >>>>>> /tmp ===temp folder""" 94 | """ >>>>>> ~ ===shortcut for '/home/repl' """ 95 | """|| Move /home/repl/people/agarwal.txt into /tmp/scratch""" 96 | # $ cd /tmp 97 | # $ ls 98 | # $ mkdir scratch 99 | # $ mv ~/people/agarwal.txt /tmp/scratch 100 | -------------------------------------------------------------------------------- /Introduction to Shell/II Manipulating data.py: -------------------------------------------------------------------------------- 1 | """ 2 | This chapter will show you how to work with the data in those files. The tools we’ll use are fairly simple, but are solid building blocks. 3 | """ 4 | 5 | ## How can I view a file's contents? 6 | """ >>>>>> cat === quick view of contents.""" 7 | # $ cat course.txt 8 | """Introduction to the Unix Shell for Data Science......""" 9 | 10 | 11 | ## How can I view a file's contents piece by piece? 12 | """ >>>>>> less === one page at a time, passing with [SPACE], leaving with [q] 13 | >>>>>> :n , :p === next and previous file when using less. 14 | >>>>>> :q === quit.""" 15 | # $ less seasonal/spring.csv seasonal/summer.csv 16 | # :n 17 | # :p 18 | # :q 19 | 20 | 21 | ## How can I look at the start of a file? 22 | """ >>>>>> head === look at the start of a file.""" 23 | # $ head people/agarwal.txt 24 | # | Display as many lines as there are. 25 | 26 | 27 | ## How can I type less? 
28 | """ >>>>>> [TAB] === autocompletition""" 29 | """Run head seasonal/autumn.csv without typing the full filename.""" 30 | # $ head seasonal/autumn.csv 31 | # $ head seasonal/spring.csv 32 | 33 | 34 | ## How can I control what commands do? 35 | """FLAGS 36 | ---------""" 37 | """ >>>>>> head -n 10 === first 10 results.""" 38 | # $ head -n 5 seasonal/winter.csv 39 | 40 | 41 | ## How can I list everything below a directory? 42 | """ >>>>>> ls -R === see every file and directory in the current level, then everything in each sub-directory. 43 | >>>>>> ls -F === puts '/' after a directory. puts '*' after runnable program.""" 44 | # $ ls -R -F.: 45 | """backup/ bin/ course.txt people/ seasonal/ 46 | ./backup: 47 | 48 | ./bin: 49 | 50 | ./people: 51 | agarwal.txt 52 | 53 | ./seasonal: 54 | autumn.csv spring.csv summer.csv winter.csv""" 55 | 56 | 57 | ## How can I get help for a command? 58 | """ >>>>>> man === (maual) find out what commands DO.---invokes 'less', use [SPACE] to see more [q] to quit""" 59 | # $ man tail [SPACE][q] 60 | # $ tail -n +7 seasonal/spring.csv 61 | 62 | 63 | ## How can I select COLUMNS from a file? 64 | """ >>>>>> cut === select columns.""" 65 | # cut -f 2-5,8 -d , values.csv 66 | #-f(FIELDS, specify columns), -d(delimiter to specify the separator)--because some files may use spaces, tabs, or colons 67 | """---What command will select the first column (containing dates) from the file spring.csv?""" 68 | """cut -d , -f 1 seasonal/spring.csv 69 | cut -d, -f1 seasonal/spring.csv""" 70 | # Either of the above. 71 | 72 | 73 | ## What can't cut do? 74 | """---What is the output of cut -d : -f 2-4 on the line:""" 75 | # second:third: 76 | # | The trailing colon creates an empty fourth field. 77 | 78 | 79 | ## How can I repeat commands? 80 | """ >>>>>> history === see latest command history.""" 81 | # $ head summer.csv 82 | # $ cd seasonal/ 83 | # $ !head 84 | """head summer.csv 85 | Date,Tooth 86 | 2017-01-11,canine 87 | 2017-01-18,wisdom 88 | 2017-01-21,bicuspid 89 | 2017-02-02,molar.....""" 90 | # $ history 91 | """ 1 head summer.csv 92 | 2 cd seasonal/ 93 | 3 head summer.csv 94 | 4 history""" 95 | 96 | ## How can I select lines containing specific values? 97 | """ >>>>>> grep === selects lines according to what they contain""" 98 | # ...... grep bicuspid seasonal/winter.csv ---------------prints lines from winter.csv that contain "bicuspid". 99 | """grep's more common flags:----------""" 100 | # -c: print a count of matching lines rather than the lines themselves 101 | # -h: do not print the names of files when searching multiple files 102 | # -i: ignore case (e.g., treat "Regression" and "regression" as matches) 103 | # -l: print the names of files that contain matches, not the matches 104 | # -n: print line numbers for matching lines 105 | # -v: invert the match, i.e., only show lines that don't match 106 | """Print the contents of all of the lines containing the word molar in seasonal/autumn.csv by running a single command while in your home directory. Don't use any flags.""" 107 | # $ grep molar seasonal/autumn.csv 108 | """Invert the match to find all of the lines that don't contain the word molar in seasonal/spring.csv. and show their line numbers.""" 109 | # $ grep -n -v molar seasonal/spring.csv 110 | """out |1:Date,Tooth 111 | 2:2017-01-25,wisdom 112 | 3:2017-02-19,canine 113 | 4:2017-02-24,canine 114 | 5:2017-02-28,wisdom""" 115 | """Count how many lines contain the word incisor in autumn.csv and winter.csv combined. 
(Again, run a single command from your home directory.)""" 116 | # $ grep incisor -c seasonal/autumn.csv seasonal/winter.csv 117 | """ out | seasonal/autumn.csv:3 118 | seasonal/winter.csv:6""" 119 | 120 | 121 | ## Why isn't it always safe to treat data as text? 122 | # $ man paste 123 | # $ paste -d -n seasonal/autumn.csv seasonal/winter.csv 124 | # | The last few rows have the wrong number of columns. 125 | -------------------------------------------------------------------------------- /Introduction to Shell/III Combining tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | This chapter will show you how to use this power to select the data you want, and introduce commands for sorting values and removing duplicates. 3 | """ 4 | 5 | # How can I store a command's output in a file? 6 | """Combine tail with redirection to save the last 5 lines of seasonal/winter.csv in a file called last.csv.""" 7 | # $ tail -n 5 seasonal/winter.csv > last.csv 8 | 9 | 10 | ## How can I use a command's output as an input? 11 | """Select the last two lines from seasonal/winter.csv and save them in a file called bottom.csv.""" 12 | # $ tail -n 2 seasonal/winter.csv > bottom.csv 13 | """Select the first line from bottom.csv in order to get the second-to-last line of the original file.""" 14 | # $ head -n 1 bottom.csv 15 | 16 | 17 | ## What's a better way to combine commands? 18 | """ ** output | use the output -----> '|'== a pipe 19 | """ 20 | """Use cut to select all of the tooth names from column 2 of the comma delimited file seasonal/summer.csv, then pipe the result to grep, 21 | with an inverted match, to exclude the header line containing the word "Tooth""" 22 | # $ cut -d , -f 2 seasonal/summer.csv | grep -v Tooth 23 | 24 | 25 | ## How can I combine many commands? 26 | """Extend this pipeline with a head command to only select the very first tooth name.""" 27 | # $ cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1 28 | 29 | 30 | ## How can I count the records in a file? 31 | """Count how many records in seasonal/spring.csv have dates in July 2017 (2017-07).""" 32 | """use grep with a partial date to select the lines and pipe this result into wc(short for "word count") with an appropriate flag to count the lines.""" 33 | # $ grep 2017-07 seasonal/spring.csv | wc -l 34 | # out| 3 35 | 36 | 37 | ## How can I specify many files at once? 38 | """ ** wildcards === to specify a list of files with a single expression ---- '*' 39 | ** eg.---> cut -d , -f 1 seasonal/* 40 | ** eg---> cut -d , -f 1 seasonal/*.csv 41 | """ 42 | """head to get the first three lines from both seasonal/spring.csv and seasonal/summer.csv""" 43 | # head -n 3 seasonal/spring.csv seasonal/summer.csv 44 | """ out | ==> seasonal/spring.csv <== 45 | Date,Tooth 46 | 2017-01-25,wisdom 47 | 2017-02-19,canine 48 | 49 | ==> seasonal/summer.csv <== 50 | Date,Tooth 51 | 2017-01-11,canine 52 | 2017-01-18,wisdom""" 53 | 54 | 55 | ## What other wildcards can I use? 56 | """ ** wildcards less commonly used 57 | ** ? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt. 58 | ** [...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt. 59 | ** {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt, *.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf. 
60 | 61 | --Which expression would match 62 | singh.pdf and johel.txt but not sandhu.pdf or sandhu.txt?""" 63 | # {singh.pdf, j*.txt} 64 | 65 | 66 | ## How can I sort lines of text? 67 | """ >>>>>> sort === puts data in order; by default it does this in ascending alphabetical order 68 | >>>>>> flags --- -n(sort numerically), -r(reverse), -b(ignore leading blanks), -f(fold case) 69 | ** Pipelines often use 'grep' to get rid of unwanted records and then 'sort' to put the remaining records in order.""" 70 | """eg. cut -d , -f 2 seasonal/summer.csv | grep -v Tooth 71 | sort the names of the teeth in seasonal/winter.csv (not summer.csv) in descending alphabetical order. To do this, extend the pipeline with a sort step.""" 72 | # $ cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort -r 73 | 74 | 75 | ## How can I remove duplicate lines? 76 | """ >>>>>> sort | uniq === eliminate duplicates (uniq only removes adjacent duplicates, so sort first).""" 77 | """ The start of your pipeline : cut -d , -f 2 seasonal/winter.csv | grep -v Tooth 78 | Extend it with a sort command, and use uniq -c to display unique lines with a count of how often each occurs """ 79 | # $ cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c 80 | 81 | 82 | ## How can I save the output of a pipe? 83 | """ >>>>>> > teeth-only.txt === save output in teeth-only.txt 84 | ** must appear at the end of the pipeline""" 85 | """---What happens if we put redirection at the front of a pipeline as in: 86 | > result.txt head -n 3 seasonal/winter.csv""" 87 | # The command's output is redirected to the file as usual. 88 | 89 | 90 | ## How can I stop a running program? 91 | """ >>>>>> Ctrl + C or ^C === to end it.""" 92 | 93 | 94 | ## Wrapping up 95 | """build a pipeline to find out how many records are in the shortest of the seasonal data files.""" 96 | """Use wc with appropriate parameters to list the number of lines in all of the seasonal data files. (Use a wildcard for the filenames instead of typing them all in by hand.)""" 97 | # $ wc -l seasonal/* 98 | # out | 21 seasonal/autumn.csv 99 | # 24 seasonal/spring.csv 100 | # 25 seasonal/summer.csv 101 | # 26 seasonal/winter.csv 102 | # 96 total 103 | """Add another command to the previous one using a pipe to remove the line containing the word "total".""" 104 | # $ wc -l seasonal/* | grep -v total 105 | # out | 21 seasonal/autumn.csv 106 | # 24 seasonal/spring.csv 107 | # 25 seasonal/summer.csv 108 | # 26 seasonal/winter.csv 109 | """Add two more stages to the pipeline that use sort -n and head -n 1 to find the file containing the fewest lines.""" 110 | # $ wc -l seasonal/* | grep -v total | sort -n | head -n 1 111 | # out | 21 seasonal/autumn.csv 112 | -------------------------------------------------------------------------------- /Introduction to Shell/IV Batch processing.py: -------------------------------------------------------------------------------- 1 | """ 2 | Shell commands will process many files at once. This chapter shows you how to make your own pipelines do that. Along the way, you will 3 | see how the shell uses variables to store information. 4 | """ 5 | 6 | ## How does the shell store information? 7 | """ ** environment variables === 8 | HOME User's home directory /home/repl 9 | PWD Present working directory Same as pwd command 10 | SHELL Which shell program is being used /bin/bash 11 | USER User's ID repl 12 | set To get a complete list""" 13 | """Use set and grep with a pipe to display the value of HISTFILESIZE, which determines how many old commands are stored in your command history.
What is its value?""" 14 | # set | grep HISTFILESIZE 15 | 16 | 17 | 18 | ## How can I print a variable's value? 19 | """ >>>>>> echo || echo $USER find a variable's value.""" 20 | 21 | """The variable OSTYPE holds the name of the kind of operating system you are using. Display its value using echo.""" 22 | # $ echo $OSTYPE 23 | # out | linux-gnu 24 | 25 | 26 | 27 | ## How else does the shell store information? 28 | """ ** shell variable == local variable in a programming language. 29 | >>>>>> training=seasonal/summer.csv === To create a shell variable (simply assign a value to a name:) without any spaces """ 30 | """Define a variable called testing with the value seasonal/winter.csv.""" 31 | # $ testing=seasonal/winter.csv 32 | """Use head -n 1 SOMETHING to get the first line from seasonal/winter.csv using the value of the variable testing instead of the name of the file.""" 33 | # $ head -n 1 $testing 34 | # out | Date,Tooth 35 | 36 | 37 | 38 | ## How can I repeat a command many times? 39 | """ ** loops 40 | eg.: for filetype in gif jpg png; do echo $filetype; done 41 | output: gif 42 | jpg 43 | png 44 | """ 45 | """Modify the loop so that it prints: docx ,odt, pdf""" 46 | # $ for filetype in docx odt pdf; do echo $filetype; done 47 | 48 | 49 | 50 | ## How can I repeat a command once for each file? 51 | """Modify the wildcard expression to people/* so that the loop prints the names of the files in the people directory 52 | regardless of what suffix they do or don't have. Please use filename as the name of your loop variable. 53 | """ 54 | # $ for filename in people/*; do echo $filename; done 55 | 56 | 57 | 58 | ## How can I record the names of a set of files? 59 | """ 1| asign a var : datasets=seasonal/*.csv 60 | 2| display the files' names later: for filename in $datasets; do echo $filename; done""" 61 | """---If you run these two commands in your home directory, how many lines of output will they print? 62 | files=seasonal/*.csv 63 | for f in $files; do echo $f; done""" 64 | # Four: the names of all four seasonal data files. 65 | 66 | 67 | 68 | ## A variable's name versus its value 69 | """---If you were to run these two commands in your home directory, what output would be printed? 70 | files=seasonal/*.csv 71 | for f in files; do echo $f; done.""" 72 | # One line: the word "files". 73 | # correct : the loop uses files instead of $files, so the list consists of the word "files". 74 | 75 | 76 | 77 | ## How can I run many commands in a single loop? 78 | """ eg: for file in seasonal/*.csv; do head -n 2 $file | tail -n 1; done 79 | Write a loop that prints the last entry from July 2017 (2017-07) in every seasonal file. It should produce a similar output to: 80 | similar: grep 2017-07 seasonal/winter.csv | tail -n 1 81 | but for each seasonal file separately. Please use file as the name of the loop variable, and remember to loop through the list of files seasonal/*.csv""" 82 | # $ for file in seasonal/*.csv; do grep 2017-07 $file | tail -n 1; done 83 | 84 | 85 | 86 | ## Why shouldn't I use spaces in filenames? 87 | # Both of the above. 88 | # Use single quotes, ', or double quotes, ", around the file names. 89 | 90 | 91 | 92 | # 93 | """Suppose you forget the semi-colon between the echo and head commands in the previous loop, so that you ask the shell to run: 94 | eg: for f in seasonal/*.csv; do echo $f head -n 2 $f | tail -n 1; done 95 | ---What will the shell do?.""" 96 | # Print one line for each of the four files. 
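As a wrap-up for this chapter, here is a minimal sketch (not one of the course solutions) that combines a shell variable, a wildcard, and a loop running two commands per file; it assumes the seasonal/*.csv files used above are present.
# $ files=seasonal/*.csv
# $ for f in $files; do echo $f; head -n 1 $f | cut -d , -f 1; done
# out | seasonal/autumn.csv
#       Date
#       ... (one filename/header pair per seasonal file)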
97 | 98 | 99 | -------------------------------------------------------------------------------- /Introduction to Shell/v Creating new tools.py: -------------------------------------------------------------------------------- 1 | """ 2 | History lets you repeat things with just a few keystrokes, and pipes let you combine existing commands to create new ones. In this chapter, 3 | you will see how to go one step further and create new commands of your own. 4 | """ 5 | 6 | ## How can I edit a file? 7 | """text editor Nano. If you type 8 | >>>>>> nano filename === it will open/create/edit filename for editing 9 | ** operations with control-key combinations: 10 | Ctrl + K: delete a line. 11 | Ctrl + U: un-delete a line. 12 | Ctrl + O: save the file ('O' stands for 'output'). You will also need to press Enter to confirm the filename! 13 | Ctrl + X: exit the editor.""" 14 | """---Run nano names.txt to edit a new file in your home directory and enter the following four lines:""" 15 | # $ nano names.txt 16 | # paste| Lovelace 17 | # Hopper 18 | # Johnson 19 | # Wilson 20 | # Ctrl + O : save, then press ENTER 21 | # Ctrl + X : leave the file 22 | 23 | 24 | 25 | ## How can I record what I just did? 26 | """Copy the files seasonal/spring.csv and seasonal/summer.csv to your home directory.""" 27 | """ >>>>>> cp === copy.""" 28 | # $ cp seasonal/s* ~ 29 | """Use grep with the -h flag (to stop it from printing filenames) and -v Tooth (to select lines that don't match the header line) 30 | to select the data records from spring.csv and summer.csv in that order and redirect the output to temp.csv.""" 31 | """ >>>>>> grep -h === don't print filenames. 32 | >>>>>> > temp.csv === redirect output to temp.csv.""" 33 | # $ grep -h -v Tooth seasonal/s* > temp.csv 34 | """Pipe history into tail -n 3 and redirect the output to steps.txt to save the last three commands in a file. 35 | (You need to save three instead of just two because the history command itself will be in the list.)""" 36 | """ >>>>>> history | tail -n 3 === show the last 3 commands in the history""" 37 | # $ history | tail -n 3 > steps.txt 38 | 39 | 40 | 41 | ## How can I save commands to re-run later? 42 | """Use nano dates.sh to create a file called dates.sh that contains this command: 43 | cut -d , -f 1 seasonal/*.csv 44 | to extract the first column from all of the CSV files in seasonal.""" 45 | """ >>>>>>> nano filename.sh === create a saved command in a bash program 46 | >>>>>>> bash filename.sh === run the saved bash program""" 47 | # $ nano dates.sh 48 | # cut -d , -f 1 seasonal/*.csv 49 | # (save and exit) 50 | # $ bash dates.sh 51 | 52 | 53 | 54 | ## How can I re-use pipes? 55 | """ ** a file full of shell commands is called a shell script, or sometimes just a 'script' """ 56 | """A file teeth.sh in your home directory has been prepared for you, but contains some blanks. Use Nano to edit the file and replace the 57 | two ____ placeholders with seasonal/*.csv and -c so that this script prints a count of the number of times each tooth name appears in the CSV 58 | files in the seasonal directory.""" 59 | # $ nano teeth.sh 60 | # $ bash teeth.sh > teeth.out 61 | # $ cat teeth.out 62 | # out | 15 bicuspid 63 | # 31 canine 64 | # 18 incisor 65 | # 11 molar 66 | # 17 wisdom 67 | 68 | 69 | 70 | ## How can I pass filenames to scripts?
71 | """Edit the script count-records.sh with Nano and fill in the two ____ placeholders with $@ and -l (the letter) respectively so that it counts the number of lines 72 | in one or more files, excluding the first line of each.""" 73 | """ >>>>>> $@ === pass filenames to scripts.""" 74 | # $ nano cout-records.sh 75 | # tail -q -n +2 $@ | wc -l 76 | # (save and close) 77 | # $ bash count-records.sh seasonal/*.csv > num-records.out 78 | 79 | 80 | 81 | ## How can I process a single argument? 82 | """ >>>>>> $1, $2, and so on === to refer to specific command-line parameters.""" 83 | """--- 84 | bash get-field.sh seasonal/summer.csv 4 2 85 | should select the second field from line 4 of seasonal/summer.csv. Which of the following commands should be put in get-field.sh to do that?""" 86 | # head -n $2 $1 | tail -n 1 | cut -d , -f $3 87 | 88 | 89 | 90 | ## How can one shell script do many things? 91 | """edit the script range.sh and replace the two ____ placeholders with $@ and -v so that it lists the names and number of lines in all of the files given on 92 | the command line without showing the total number of lines in all files. (Do not try to subtract the column header lines from the files.)""" 93 | # $ nano range.sh 94 | # wc -l $@ | grep -v total 95 | """add sort -n and head -n 1 in that order to the pipeline in range.sh to display the name and line count of the shortest file given to it.""" 96 | # wc -l $@ | grep -v total | sort -n | head -n 1 97 | """ add a second line to range.sh to print the name and record count of the longest file in the directory as well as the shortest. 98 | This line should be a duplicate of the one you have already written, but with sort -n -r rather than sort -n.""" 99 | # wc -l $@ | grep -v total | sort -n | head -n 1 100 | # wc -l $@ | grep -v total | sort -n -r | head -n 1 101 | """Run the script on the files in the seasonal directory using seasonal/*.csv to match all of the files and redirect the output using > to a file called range.out""" 102 | # $ bash range.sh seasonal/*.csv > range.out 103 | 104 | 105 | 106 | ## How can I write loops in a shell script? 107 | """ eg : 108 | # Print the first and last data records of each file. 109 | for filename in $@ 110 | do 111 | head -n 2 $filename | tail -n 1 112 | tail -n 1 $filename 113 | done""" 114 | """Fill in the placeholders in the script date-range.sh with $filename (twice), head, and tail so that it prints the first and last date from one or more files.""" 115 | # # Print the first and last date from each data file. 116 | # for filename in $@ 117 | # do 118 | # cut -d , -f 1 $filename | grep -v Date | sort | head -n 1 119 | # cut -d , -f 1 $filename | grep -v Date | sort | tail -n 1 120 | # done 121 | """Run date-range.sh on all four of the seasonal data files using seasonal/*.csv to match their names, and pipe its output to sort""" 122 | # $ bash date-range.sh seasonal/*.csv | sort 123 | 124 | 125 | 126 | ## What happens when I don't provide filenames? 127 | """--- 128 | Suppose you do accidentally type: 129 | head -n 5 | tail -n 3 somefile.txt 130 | What should you do next?""" 131 | #| Use Ctrl + C to stop the running head program. 
132 | 133 | -------------------------------------------------------------------------------- /OOP-python-DataCamp/README.md: -------------------------------------------------------------------------------- 1 | # Object-Oriented-Programming-in-Python-DataCamp 2 | all answers of oop object oriented programming from datacamp course 3 | -------------------------------------------------------------------------------- /Python-Data-Science-Toolbox-Part-2--case-study-main/README.md: -------------------------------------------------------------------------------- 1 | # Python-Data-Science-Toolbox-Part-2--case-study 2 | Python Data Science Toolbox (Part 2)-case study datacamp `World bank data` 3 | 4 | - Data on world economies for over half a century 5 | 6 | - Indicators 7 | 8 | -- Population 9 | 10 | -- Electricity consumption 11 | 12 | 13 | -- CO2 emissions 14 | 15 | -- Literacy rates 16 | 17 | -- Unemployment 18 | 19 | -- Mortality rates 20 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # datacamp-Data-Engineer-with-Python-course 2 | - CAREER TRACK 3 | 4 | **Data Engineer with Python** 5 | --- 6 | ![logo](https://user-images.githubusercontent.com/51888893/194946662-fc6eea27-e064-481f-855e-7a78a5147f43.png) 7 | 8 | 9 | In this track, you’ll discover how to build an effective data architecture, streamline data processing, and maintain large-scale data systems. In addition to working 10 | 11 | with Python, you’ll also grow your language skills as you work with Shell, SQL, and Scala, to create data engineering pipelines, automate common file system tasks, and 12 | 13 | build a high-performance database. 14 | 15 | 16 | 17 | Through hands-on exercises, you’ll add cloud and big data tools such as AWS Boto, PySpark, Spark SQL, and MongoDB, to your data engineering toolkit to help you create 18 | 19 | and query databases, wrangle data, and configure schedules to run your pipelines. By the end of this track, you’ll have mastered the critical database, scripting, and 20 | 21 | process skills you need to progress your career. 22 | 23 | 24 | Please note this track assumes a fundamental knowledge of Python and SQL. 25 | -------------------------------------------------------------------------------- /Streamlined Data Ingestion with pandas/README.md: -------------------------------------------------------------------------------- 1 | 2 | Instructor : Amany Mahfouz 3 | # This course teaches you 4 | 5 | how to build pipelines to import data kept in common storage formats. 6 | 7 | You’ll use pandas, a major Python library for analytics, to get data from a variety of sources, 8 | 9 | from spreadsheets of survey responses, to a database of public service requests, to an API for a popular review site. 10 | 11 | Along the way, you’ll learn how to fine-tune imports to get only what you need and to address issues like incorrect data types. 12 | 13 | Finally, you’ll assemble a custom dataset from a mix of sources. 14 | -------------------------------------------------------------------------------- /Unit Testing for Data Science in Python/.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | install: 5 | - pip install -e . 
6 | - pip install codecov pytest-cov 7 | script: 8 | - pytest --cov=src tests 9 | after_success: 10 | - codecov 11 | -------------------------------------------------------------------------------- /Unit Testing for Data Science in Python/readme.md: -------------------------------------------------------------------------------- 1 | [![Build Status](https://travis-ci.com/gutfeeling/univariate-linear-regression.svg?branch=master)](https://travis-ci.com/gutfeeling/univariate-linear-regression) 2 | [![codecov](https://codecov.io/gh/gutfeeling/univariate-linear-regression/branch/master/graph/badge.svg)](https://codecov.io/gh/gutfeeling/univariate-linear-regression) 3 | --------------------------------------------------------------------------------
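The Travis configuration above installs the package in editable mode and measures test coverage before uploading it to Codecov. Assuming the same src/ and tests/ layout the config implies, the CI step can be reproduced locally with something like:
# $ pip install -e .
# $ pip install codecov pytest-cov
# $ pytest --cov=src tests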