├── High Level Infra.drawio.png ├── README.md ├── airflow-redshift-template.yaml ├── dags ├── openweather_api.py └── transform_redshift_load.py ├── requirements.txt └── scripts └── transform.py /High Level Infra.drawio.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AnandDedha/aws-airflow-dataengineering-pipeline/c3370b5ed682b24889484b68f9f5d0c515144f01/High Level Infra.drawio.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Data Engineering Project on AWS 3 | 4 | In this project, we build an AWS-based data pipeline using Airflow, Python, Spark, and AWS services. The objective is to extract weather data from the OpenWeather API, store it in S3, and load it into Redshift, with Airflow orchestrating the whole flow. The project's high-level architecture is as follows: 5 | 6 | ![alt text](https://github.com/AnandDedha/aws-airflow-dataengineering-pipeline/blob/main/High%20Level%20Infra.drawio.png) 7 | 8 | The entire architecture is provisioned through infrastructure as code (IaC) with the CloudFormation template in this repository (a deployment sketch appears right after this README). On top of that infrastructure, two separate Airflow DAGs are configured. 9 | 10 | **Setting Up the First Airflow DAG:** 11 | This DAG extracts data from the weather API and loads it into an S3 bucket. 12 | 13 | **Task 1: Extracting Data from the OpenWeather API:** A Python operator retrieves weather data from the OpenWeather API, using the requests library to call the API and fetch the required data. 14 | 15 | **Task 2: Storing Data in S3:** Once the data is extracted, it is flattened into a pandas DataFrame and written to an S3 bucket using the S3CreateObjectOperator. 16 | 17 | **Setting Up the Second Airflow DAG:** 18 | This DAG reads the raw data from S3, applies transformations, and loads the processed data into Redshift. 19 | 20 | **Task 1: Waiting for Completion of the First DAG:** Before proceeding, an ExternalTaskSensor confirms that the first DAG has completed successfully. 21 | 22 | **Task 2: ETL to Redshift Using the Glue Operator:** A Glue job uses the GlueContext and SparkContext for the extract, transform, load (ETL) work; the transformed data is written to the Redshift database. 23 | 24 | **Conclusion:** 25 | Following these steps yields a complete ETL pipeline on AWS: weather data is extracted from OpenWeather, stored in S3, and loaded into Redshift. Combining Apache Airflow with the Glue operator keeps the weather data in Redshift refreshed automatically.
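The template and DAG files are deployed to AWS outside of the DAGs themselves. Below is a minimal sketch of what that deployment could look like with boto3; the stack name, region, and local file paths are assumptions, not values taken from this repository. The DAGs also expect an Airflow Variable named `key` holding the OpenWeather API key and an Airflow connection with conn id `AWS_CONN`, both of which can be created in the Airflow UI.

```python
# Hypothetical deployment helper -- stack name, region, and paths are assumptions.
import boto3

REGION = "us-east-1"        # assumed; matches the region used in the Glue job DAG
STACK_NAME = "dataproject"  # assumed stack name

cfn = boto3.client("cloudformation", region_name=REGION)
s3 = boto3.client("s3", region_name=REGION)

# 1. Create the stack from the template in this repo. IAM capabilities are needed
#    because the template creates IAM roles; MWAA creation can take ~30 minutes.
with open("airflow-redshift-template.yaml") as f:
    cfn.create_stack(
        StackName=STACK_NAME,
        TemplateBody=f.read(),
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)

# 2. Find the MWAA source bucket the stack created and upload the DAG files
#    (MWAA reads DAGs from the "dags/" prefix, per DagS3Path in the template).
bucket = cfn.describe_stack_resource(
    StackName=STACK_NAME, LogicalResourceId="AirflowEnvironmentBucket"
)["StackResourceDetail"]["PhysicalResourceId"]

for path in ["dags/openweather_api.py", "dags/transform_redshift_load.py", "requirements.txt"]:
    s3.upload_file(path, bucket, path)
```

Note that the MWAA environment in the template does not set `RequirementsS3Path`, so uploading requirements.txt alone will not install the listed packages; that property would need to be added to the environment for MWAA to pick it up.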
26 | -------------------------------------------------------------------------------- /airflow-redshift-template.yaml: -------------------------------------------------------------------------------- 1 | AWSTemplateFormatVersion: "2010-09-09" 2 | 3 | Mappings: 4 | ClusterConfigurations: 5 | redshift: 6 | userName: redshiftmasteruser 7 | dbName: dev 8 | nodeType: dc2.large 9 | # Make sure you use the right value for "nodeCount" when choosing between "multi-node" and "single-node" 10 | clusterType: multi-node 11 | nodeCount: 2 12 | 13 | Parameters: 14 | 15 | EnvironmentName: 16 | Description: An environment name that is prefixed to resource names 17 | Type: String 18 | Default: MWAAEnvironment 19 | 20 | VpcCIDR: 21 | Description: The IP range (CIDR notation) for this VPC 22 | Type: String 23 | Default: 10.192.0.0/16 24 | 25 | PublicSubnet1CIDR: 26 | Description: The IP range (CIDR notation) for the public subnet in the first Availability Zone 27 | Type: String 28 | Default: 10.192.10.0/24 29 | 30 | PublicSubnet2CIDR: 31 | Description: The IP range (CIDR notation) for the public subnet in the second Availability Zone 32 | Type: String 33 | Default: 10.192.11.0/24 34 | 35 | PrivateSubnet1CIDR: 36 | Description: The IP range (CIDR notation) for the private subnet in the first Availability Zone 37 | Type: String 38 | Default: 10.192.20.0/24 39 | PrivateSubnet2CIDR: 40 | Description: The IP range (CIDR notation) for the private subnet in the second Availability Zone 41 | Type: String 42 | Default: 10.192.21.0/24 43 | MaxWorkerNodes: 44 | Description: The maximum number of workers that can run in the environment 45 | Type: Number 46 | Default: 4 47 | DagProcessingLogs: 48 | Description: Log level for DagProcessing 49 | Type: String 50 | Default: INFO 51 | SchedulerLogsLevel: 52 | Description: Log level for SchedulerLogs 53 | Type: String 54 | Default: INFO 55 | TaskLogsLevel: 56 | Description: Log level for TaskLogs 57 | Type: String 58 | Default: INFO 59 | WorkerLogsLevel: 60 | Description: Log level for WorkerLogs 61 | Type: String 62 | Default: INFO 63 | WebserverLogsLevel: 64 | Description: Log level for WebserverLogs 65 | Type: String 66 | Default: INFO 67 | 68 | Resources: 69 | ##################################################################################################################### 70 | # CREATE VPC 71 | ##################################################################################################################### 72 | 73 | VPC: 74 | Type: AWS::EC2::VPC 75 | Properties: 76 | CidrBlock: !Ref VpcCIDR 77 | EnableDnsSupport: true 78 | EnableDnsHostnames: true 79 | Tags: 80 | - Key: Name 81 | Value: MWAAEnvironment 82 | 83 | InternetGateway: 84 | Type: AWS::EC2::InternetGateway 85 | Properties: 86 | Tags: 87 | - Key: Name 88 | Value: MWAAEnvironment 89 | 90 | InternetGatewayAttachment: 91 | Type: AWS::EC2::VPCGatewayAttachment 92 | Properties: 93 | InternetGatewayId: !Ref InternetGateway 94 | VpcId: !Ref VPC 95 | 96 | PublicSubnet1: 97 | Type: AWS::EC2::Subnet 98 | Properties: 99 | VpcId: !Ref VPC 100 | AvailabilityZone: !Select [ 0, !GetAZs '' ] 101 | CidrBlock: !Ref PublicSubnet1CIDR 102 | MapPublicIpOnLaunch: true 103 | Tags: 104 | - Key: Name 105 | Value: !Sub ${EnvironmentName} Public Subnet (AZ1) 106 | 107 | PublicSubnet2: 108 | Type: AWS::EC2::Subnet 109 | Properties: 110 | VpcId: !Ref VPC 111 | AvailabilityZone: !Select [ 1, !GetAZs '' ] 112 | CidrBlock: !Ref PublicSubnet2CIDR 113 | MapPublicIpOnLaunch: true 114 | Tags: 115 | - Key: Name 116 | Value: !Sub 
${EnvironmentName} Public Subnet (AZ2) 117 | 118 | PrivateSubnet1: 119 | Type: AWS::EC2::Subnet 120 | Properties: 121 | VpcId: !Ref VPC 122 | AvailabilityZone: !Select [ 0, !GetAZs '' ] 123 | CidrBlock: !Ref PrivateSubnet1CIDR 124 | MapPublicIpOnLaunch: false 125 | Tags: 126 | - Key: Name 127 | Value: !Sub ${EnvironmentName} Private Subnet (AZ1) 128 | 129 | PrivateSubnet2: 130 | Type: AWS::EC2::Subnet 131 | Properties: 132 | VpcId: !Ref VPC 133 | AvailabilityZone: !Select [ 1, !GetAZs '' ] 134 | CidrBlock: !Ref PrivateSubnet2CIDR 135 | MapPublicIpOnLaunch: false 136 | Tags: 137 | - Key: Name 138 | Value: !Sub ${EnvironmentName} Private Subnet (AZ2) 139 | 140 | NatGateway1EIP: 141 | Type: AWS::EC2::EIP 142 | DependsOn: InternetGatewayAttachment 143 | Properties: 144 | Domain: vpc 145 | 146 | NatGateway2EIP: 147 | Type: AWS::EC2::EIP 148 | DependsOn: InternetGatewayAttachment 149 | Properties: 150 | Domain: vpc 151 | 152 | NatGateway1: 153 | Type: AWS::EC2::NatGateway 154 | Properties: 155 | AllocationId: !GetAtt NatGateway1EIP.AllocationId 156 | SubnetId: !Ref PublicSubnet1 157 | 158 | NatGateway2: 159 | Type: AWS::EC2::NatGateway 160 | Properties: 161 | AllocationId: !GetAtt NatGateway2EIP.AllocationId 162 | SubnetId: !Ref PublicSubnet2 163 | 164 | PublicRouteTable: 165 | Type: AWS::EC2::RouteTable 166 | Properties: 167 | VpcId: !Ref VPC 168 | Tags: 169 | - Key: Name 170 | Value: !Sub ${EnvironmentName} Public Routes 171 | 172 | DefaultPublicRoute: 173 | Type: AWS::EC2::Route 174 | DependsOn: InternetGatewayAttachment 175 | Properties: 176 | RouteTableId: !Ref PublicRouteTable 177 | DestinationCidrBlock: 0.0.0.0/0 178 | GatewayId: !Ref InternetGateway 179 | 180 | PublicSubnet1RouteTableAssociation: 181 | Type: AWS::EC2::SubnetRouteTableAssociation 182 | Properties: 183 | RouteTableId: !Ref PublicRouteTable 184 | SubnetId: !Ref PublicSubnet1 185 | 186 | PublicSubnet2RouteTableAssociation: 187 | Type: AWS::EC2::SubnetRouteTableAssociation 188 | Properties: 189 | RouteTableId: !Ref PublicRouteTable 190 | SubnetId: !Ref PublicSubnet2 191 | 192 | 193 | PrivateRouteTable1: 194 | Type: AWS::EC2::RouteTable 195 | Properties: 196 | VpcId: !Ref VPC 197 | Tags: 198 | - Key: Name 199 | Value: !Sub ${EnvironmentName} Private Routes (AZ1) 200 | 201 | DefaultPrivateRoute1: 202 | Type: AWS::EC2::Route 203 | Properties: 204 | RouteTableId: !Ref PrivateRouteTable1 205 | DestinationCidrBlock: 0.0.0.0/0 206 | NatGatewayId: !Ref NatGateway1 207 | 208 | PrivateSubnet1RouteTableAssociation: 209 | Type: AWS::EC2::SubnetRouteTableAssociation 210 | Properties: 211 | RouteTableId: !Ref PrivateRouteTable1 212 | SubnetId: !Ref PrivateSubnet1 213 | 214 | PrivateRouteTable2: 215 | Type: AWS::EC2::RouteTable 216 | Properties: 217 | VpcId: !Ref VPC 218 | Tags: 219 | - Key: Name 220 | Value: !Sub ${EnvironmentName} Private Routes (AZ2) 221 | 222 | DefaultPrivateRoute2: 223 | Type: AWS::EC2::Route 224 | Properties: 225 | RouteTableId: !Ref PrivateRouteTable2 226 | DestinationCidrBlock: 0.0.0.0/0 227 | NatGatewayId: !Ref NatGateway2 228 | 229 | PrivateSubnet2RouteTableAssociation: 230 | Type: AWS::EC2::SubnetRouteTableAssociation 231 | Properties: 232 | RouteTableId: !Ref PrivateRouteTable2 233 | SubnetId: !Ref PrivateSubnet2 234 | 235 | SecurityGroup: 236 | Type: AWS::EC2::SecurityGroup 237 | Properties: 238 | GroupName: "mwaa-security-group" 239 | GroupDescription: "Security group with a self-referencing inbound rule." 
240 | VpcId: !Ref VPC 241 | 242 | SecurityGroupIngress: 243 | Type: AWS::EC2::SecurityGroupIngress 244 | Properties: 245 | GroupId: !Ref SecurityGroup 246 | IpProtocol: "-1" 247 | SourceSecurityGroupId: !Ref SecurityGroup 248 | 249 | S3Endpoint: 250 | Type: 'AWS::EC2::VPCEndpoint' 251 | Properties: 252 | PolicyDocument: 253 | Version: 2012-10-17 254 | Statement: 255 | - Effect: Allow 256 | Action: '*' 257 | Principal: '*' 258 | Resource: '*' 259 | RouteTableIds: 260 | - !Ref PrivateRouteTable1 261 | - !Ref PrivateRouteTable2 262 | - !Ref PublicRouteTable 263 | ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3' 264 | VpcId: !Ref VPC 265 | 266 | AirflowEnvironmentBucket: 267 | Type: AWS::S3::Bucket 268 | Properties: 269 | VersioningConfiguration: 270 | Status: Enabled 271 | PublicAccessBlockConfiguration: 272 | BlockPublicAcls: true 273 | BlockPublicPolicy: true 274 | IgnorePublicAcls: true 275 | RestrictPublicBuckets: true 276 | 277 | 278 | RedshiftTempDataBucket: 279 | Type: AWS::S3::Bucket 280 | Properties: 281 | PublicAccessBlockConfiguration: 282 | BlockPublicAcls: True 283 | BlockPublicPolicy: True 284 | IgnorePublicAcls: True 285 | RestrictPublicBuckets: True 286 | 287 | ##################################################################################################################### 288 | # CREATE MWAA 289 | ##################################################################################################################### 290 | 291 | MwaaEnvironment: 292 | Type: AWS::MWAA::Environment 293 | DependsOn: MwaaExecutionRole 294 | Properties: 295 | Name: !Sub "${AWS::StackName}-MwaaEnvironment" 296 | SourceBucketArn: !GetAtt AirflowEnvironmentBucket.Arn 297 | ExecutionRoleArn: !GetAtt MwaaExecutionRole.Arn 298 | DagS3Path: dags 299 | NetworkConfiguration: 300 | SecurityGroupIds: 301 | - !GetAtt SecurityGroup.GroupId 302 | SubnetIds: 303 | - !Ref PrivateSubnet1 304 | - !Ref PrivateSubnet2 305 | WebserverAccessMode: PUBLIC_ONLY 306 | MaxWorkers: !Ref MaxWorkerNodes 307 | LoggingConfiguration: 308 | DagProcessingLogs: 309 | LogLevel: !Ref DagProcessingLogs 310 | Enabled: true 311 | SchedulerLogs: 312 | LogLevel: !Ref SchedulerLogsLevel 313 | Enabled: true 314 | TaskLogs: 315 | LogLevel: !Ref TaskLogsLevel 316 | Enabled: true 317 | WorkerLogs: 318 | LogLevel: !Ref WorkerLogsLevel 319 | Enabled: true 320 | WebserverLogs: 321 | LogLevel: !Ref WebserverLogsLevel 322 | Enabled: true 323 | SecurityGroup: 324 | Type: AWS::EC2::SecurityGroup 325 | Properties: 326 | VpcId: !Ref VPC 327 | GroupDescription: !Sub "Security Group for Amazon MWAA Environment ${AWS::StackName}-MwaaEnvironment" 328 | GroupName: !Sub "airflow-security-group-${AWS::StackName}-MwaaEnvironment" 329 | 330 | SecurityGroupIngress: 331 | Type: AWS::EC2::SecurityGroupIngress 332 | Properties: 333 | GroupId: !Ref SecurityGroup 334 | IpProtocol: "-1" 335 | SourceSecurityGroupId: !Ref SecurityGroup 336 | 337 | SecurityGroupEgress: 338 | Type: AWS::EC2::SecurityGroupEgress 339 | Properties: 340 | GroupId: !Ref SecurityGroup 341 | IpProtocol: "-1" 342 | CidrIp: "0.0.0.0/0" 343 | 344 | RedshiftSubnetGroup: 345 | Type: 'AWS::Redshift::ClusterSubnetGroup' 346 | Properties: 347 | Description: 'Redshift cluster subnet group' 348 | SubnetIds: 349 | - !Ref PrivateSubnet1 350 | - !Ref PrivateSubnet2 351 | 352 | MwaaExecutionRole: 353 | Type: AWS::IAM::Role 354 | Properties: 355 | AssumeRolePolicyDocument: 356 | Version: 2012-10-17 357 | Statement: 358 | - Effect: Allow 359 | Principal: 360 | Service: 361 | - airflow-env.amazonaws.com 362 | 
- airflow.amazonaws.com 363 | Action: 364 | - "sts:AssumeRole" 365 | Path: "/service-role/" 366 | ManagedPolicyArns: 367 | - arn:aws:iam::aws:policy/AdministratorAccess 368 | 369 | ##################################################################################################################### 370 | # CREATE Redshift 371 | ##################################################################################################################### 372 | RedshiftClusterParameterGroup: 373 | Type: 'AWS::Redshift::ClusterParameterGroup' 374 | Properties: 375 | Description: !Join [ " ", [ !Ref 'AWS::StackName', " - Redshift Cluster Parameter group" ]] 376 | ParameterGroupFamily: redshift-1.0 377 | RedshiftCreds: 378 | Type: 'AWS::SecretsManager::Secret' 379 | Properties: 380 | Description: !Sub Redshift cluster master user credentials for ${AWS::StackName} 381 | GenerateSecretString: 382 | SecretStringTemplate: !Join [ '', [ '{"username": "', !FindInMap [ ClusterConfigurations, redshift, userName ], '"}' ]] 383 | GenerateStringKey: 'password' 384 | PasswordLength: 16 385 | ExcludePunctuation: true 386 | Tags: 387 | - 388 | Key: RedshiftGlueBlogCred 389 | Value: 'true' 390 | RedshiftIamRole: 391 | Type: 'AWS::IAM::Role' 392 | Properties: 393 | AssumeRolePolicyDocument: 394 | Version: 2012-10-17 395 | Statement: 396 | - Effect: Allow 397 | Principal: 398 | Service: 399 | - redshift.amazonaws.com 400 | - glue.amazonaws.com 401 | Action: 'sts:AssumeRole' 402 | ManagedPolicyArns: 403 | - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 404 | # These are inline policies that would otherwise be defined manually -- remove these lines if you manage the policies outside of CloudFormation 405 | Policies: 406 | - PolicyName: Redshift-IAM-Policy 407 | PolicyDocument: 408 | Version: 2012-10-17 409 | Statement: 410 | - Sid: AllowAccesstoRedshiftTempDataBucket 411 | Effect: Allow 412 | Action: 413 | - 's3:AbortMultipartUpload' 414 | - 's3:DeleteObject' 415 | - 's3:GetBucketVersioning' 416 | - 's3:GetObject' 417 | - 's3:GetObjectTagging' 418 | - 's3:GetObjectVersion' 419 | - 's3:ListBucket' 420 | - 's3:ListBucketMultipartUploads' 421 | - 's3:ListBucketVersions' 422 | - 's3:ListMultipartUploadParts' 423 | - 's3:PutBucketVersioning' 424 | - 's3:PutObject' 425 | - 's3:PutObjectTagging' 426 | Resource: 427 | - !Sub 'arn:aws:s3:::${RedshiftTempDataBucket}' 428 | - !Sub 'arn:aws:s3:::${RedshiftTempDataBucket}/*' 429 | - Sid: ListPermissions 430 | Effect: Allow 431 | Action: 432 | - 's3:ListBucket' 433 | - 's3:DeleteObject' 434 | - 's3:GetBucketVersioning' 435 | - 's3:GetObject' 436 | - 's3:GetObjectTagging' 437 | - 's3:GetObjectVersion' 438 | - 's3:ListBucket' 439 | - 's3:ListBucketMultipartUploads' 440 | - 's3:ListBucketVersions' 441 | - 's3:ListMultipartUploadParts' 442 | - 's3:PutBucketVersioning' 443 | - 's3:PutObject' 444 | Resource: 445 | - '*' 446 | - Sid: AllowIAMPass 447 | Effect: Allow 448 | Action: 449 | - 'iam:GetRole' 450 | - 'iam:PassRole' 451 | Resource: 452 | - '*' 453 | RedshiftCluster: 454 | Type: 'AWS::Redshift::Cluster' 455 | DependsOn: 456 | - RedshiftCreds 457 | Properties: 458 | ClusterIdentifier: !Sub ${AWS::StackName}-Redshift-Cluster 459 | DBName: !FindInMap [ ClusterConfigurations, redshift, dbName ] 460 | MasterUsername: !Join [ '', [ '{{resolve:secretsmanager:', !Ref RedshiftCreds, ':SecretString:username}}' ]] 461 | MasterUserPassword: !Join [ '', [ '{{resolve:secretsmanager:', !Ref RedshiftCreds, ':SecretString:password}}' ]] 462 | NodeType: !FindInMap [ ClusterConfigurations, redshift, nodeType ] 463 | ClusterType: 
!FindInMap [ ClusterConfigurations, redshift, clusterType ] 464 | NumberOfNodes: !FindInMap [ ClusterConfigurations, redshift, nodeCount ] 465 | PubliclyAccessible: false 466 | VpcSecurityGroupIds: 467 | - !Ref SecurityGroup 468 | IamRoles: 469 | - !GetAtt RedshiftIamRole.Arn 470 | ClusterSubnetGroupName: !Ref RedshiftSubnetGroup 471 | ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup 472 | GlueRedshiftConnection: 473 | Type: AWS::Glue::Connection 474 | Properties: 475 | CatalogId: !Ref AWS::AccountId 476 | ConnectionInput: 477 | ConnectionType: JDBC 478 | Name: redshift-demo-connection 479 | PhysicalConnectionRequirements: 480 | SecurityGroupIdList: 481 | - !Ref SecurityGroup 482 | SubnetId: !Ref PrivateSubnet1 483 | AvailabilityZone: !Select 484 | - 0 485 | - !GetAZs 486 | Ref: 'AWS::Region' 487 | ConnectionProperties: { 488 | "JDBC_CONNECTION_URL": !Join [ '', [ 'jdbc:redshift://', !GetAtt RedshiftCluster.Endpoint.Address, ':', !GetAtt RedshiftCluster.Endpoint.Port, '/', !FindInMap [ ClusterConfigurations, redshift, dbName ]]], 489 | "USERNAME": !Join [ '', [ '{{resolve:secretsmanager:', !Ref RedshiftCreds, ':SecretString:username}}' ]], 490 | "PASSWORD": !Join [ '', [ '{{resolve:secretsmanager:', !Ref RedshiftCreds, ':SecretString:password}}' ]] 491 | } 492 | 493 | Outputs: 494 | VPC: 495 | Description: A reference to the created VPC 496 | Value: !Ref VPC 497 | 498 | PublicSubnets: 499 | Description: A list of the public subnets 500 | Value: !Join [ ",", [ !Ref PublicSubnet1, !Ref PublicSubnet2 ]] 501 | 502 | PrivateSubnets: 503 | Description: A list of the private subnets 504 | Value: !Join [ ",", [ !Ref PrivateSubnet1, !Ref PrivateSubnet2 ]] 505 | 506 | PublicSubnet1: 507 | Description: A reference to the public subnet in the 1st Availability Zone 508 | Value: !Ref PublicSubnet1 509 | 510 | PublicSubnet2: 511 | Description: A reference to the public subnet in the 2nd Availability Zone 512 | Value: !Ref PublicSubnet2 513 | 514 | PrivateSubnet1: 515 | Description: A reference to the private subnet in the 1st Availability Zone 516 | Value: !Ref PrivateSubnet1 517 | 518 | PrivateSubnet2: 519 | Description: A reference to the private subnet in the 2nd Availability Zone 520 | Value: !Ref PrivateSubnet2 521 | 522 | SecurityGroupIngress: 523 | Description: Security group with self-referencing inbound rule 524 | Value: !Ref SecurityGroupIngress 525 | 526 | MwaaApacheAirflowUI: 527 | Description: MWAA Environment 528 | Value: !Sub "https://${MwaaEnvironment.WebserverUrl}" 529 | 530 | RedshiftClusterJdbcUrl: 531 | Description: JDBC URL for Redshift Cluster 532 | Value: !Join [ '', [ 'jdbc:redshift:iam://', !GetAtt RedshiftCluster.Endpoint.Address, ':', !GetAtt RedshiftCluster.Endpoint.Port, '/', !FindInMap [ ClusterConfigurations, redshift, dbName ]]] 533 | 534 | RedshiftS3TempPath: 535 | Description: Temporary path used by Redshift to store data 536 | Value: !Join [ '', [ 's3://', !Ref RedshiftTempDataBucket, '/redshift-temp-dir/' ]] 537 | -------------------------------------------------------------------------------- /dags/openweather_api.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.operators.python_operator import PythonOperator 3 | from airflow.sensors.http_sensor import HttpSensor 4 | from airflow.models import Variable 5 | from datetime import datetime, timedelta 6 | import json 7 | from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator 8 | import pandas as pd 9 | import 
requests 10 | 11 | default_args = { 12 | 'owner': 'airflow', 13 | 'depends_on_past': False, 14 | 'start_date': datetime(2023, 8, 12), 15 | 'retries': 2, 16 | 'retry_delay': timedelta(minutes=5), 17 | } 18 | 19 | dag = DAG('openweather_api_dag', default_args=default_args, schedule_interval="@once", catchup=False) 20 | 21 | # Set your OpenWeather API endpoint and parameters 22 | #api_endpoint = "https://api.openweathermap.org/data/2.5/weather" 23 | api_endpoint = "https://api.openweathermap.org/data/2.5/forecast" 24 | api_params = { 25 | "q": "Toronto,Canada", 26 | "appid": Variable.get("key")  # OpenWeather API key stored as an Airflow Variable named "key" 27 | } 28 | 29 | def extract_openweather_data(**kwargs): 30 | print("Extracting started ") 31 | ti = kwargs['ti'] 32 | response = requests.get(api_endpoint, params=api_params) 33 | data = response.json() 34 | print(data) 35 | df = pd.json_normalize(data['list'])  # flatten nested forecast fields into dotted columns 36 | #df = pd.DataFrame(data['weather']) 37 | print(df) 38 | ti.xcom_push(key='final_data', value=df.to_csv(index=False)) 39 | 40 | 41 | extract_api_data = PythonOperator( 42 | task_id='extract_api_data', 43 | python_callable=extract_openweather_data, 44 | provide_context=True, 45 | dag=dag, 46 | ) 47 | 48 | upload_to_s3 = S3CreateObjectOperator( 49 | task_id="upload_to_S3", 50 | aws_conn_id='AWS_CONN', 51 | s3_bucket='airflowoutputtos3bucket', 52 | s3_key='raw/weather_api_data.csv', 53 | data="{{ ti.xcom_pull(key='final_data') }}", 54 | dag=dag, 55 | ) 56 | 57 | # Set task dependencies 58 | extract_api_data >> upload_to_s3 59 | -------------------------------------------------------------------------------- /dags/transform_redshift_load.py: -------------------------------------------------------------------------------- 1 | from airflow import DAG 2 | from airflow.providers.amazon.aws.operators.glue import GlueJobOperator 3 | from airflow.sensors.external_task import ExternalTaskSensor  # non-deprecated import path; unused ExternalTaskMarker removed 4 | from datetime import datetime, timedelta 5 | 6 | default_args = { 7 | 'owner': 'airflow', 8 | 'depends_on_past': False, 9 | 'start_date': datetime(2023, 8, 12), 10 | 'retries': 1, 11 | 'retry_delay': timedelta(minutes=5), 12 | } 13 | 14 | dag = DAG('transform_redshift_dag', default_args=default_args, schedule_interval="@once", catchup=False) 15 | 16 | 17 | # Define the Glue Job 18 | transform_task = GlueJobOperator( 19 | task_id='transform_task', 20 | job_name='glue_transform_task', 21 | script_location='s3://aws-glue-assets-262136919150-us-east-1/scripts/transform.py', 22 | s3_bucket='aws-glue-assets-262136919150-us-east-1', # S3 bucket (name only, no s3:// prefix) where logs and the local ETL script will be uploaded 23 | aws_conn_id='AWS_CONN', # You'll need to set up an AWS connection in Airflow 24 | region_name="us-east-1", 25 | iam_role_name='dataproject-RedshiftIamRole-1PIMCBQ8B05RY', 26 | create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 4, "WorkerType": "G.1X", "Connections":{"Connections":["redshift-demo-connection"]},}, 27 | dag=dag, 28 | ) 29 | 30 | wait_openweather_api = ExternalTaskSensor( 31 | task_id='wait_openweather_api', 32 | external_dag_id='openweather_api_dag', 33 | external_task_id=None,  # wait for the whole openweather_api_dag run, not a single task 34 | timeout=2000, 35 | dag=dag, 36 | mode='reschedule', 37 | allowed_states=["success"] 38 | ) 39 | 40 | # Set task dependencies 41 | wait_openweather_api >> transform_task 42 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==1.3.5 2 | requests==2.31.0 3 | 4 | 
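Before reading scripts/transform.py below, it helps to see the column names that `pandas.json_normalize` produces for the OpenWeather forecast payload, since those are the names the Glue mapping refers to with backticks. A self-contained sketch with made-up values (not a real API response) is shown here:

```python
# Illustrative only -- a hand-written record shaped like one entry of the
# OpenWeather /forecast "list" array; the values are made up.
import pandas as pd

sample = {
    "list": [
        {
            "dt": 1691856000,
            "visibility": 10000,
            "weather": [{"main": "Clear", "description": "clear sky"}],
            "main": {
                "temp": 298.1, "feels_like": 298.0, "temp_min": 296.9,
                "temp_max": 299.2, "pressure": 1014, "sea_level": 1014,
                "grnd_level": 1000, "humidity": 55,
            },
            "wind": {"speed": 3.6},
        }
    ]
}

df = pd.json_normalize(sample["list"])
print(sorted(df.columns))
# Nested keys become dotted column names such as 'main.temp', 'main.humidity',
# and 'wind.speed', which is why the Glue ApplyMapping below refers to them
# with backticks, e.g. ("`main.temp`", "string", "temp", "string").
```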
-------------------------------------------------------------------------------- /scripts/transform.py: -------------------------------------------------------------------------------- 1 | import sys 2 | from awsglue.transforms import * 3 | from awsglue.utils import getResolvedOptions 4 | from pyspark.context import SparkContext 5 | from awsglue.context import GlueContext 6 | from awsglue.job import Job 7 | from awsglue.dynamicframe import DynamicFrame 8 | 9 | 10 | args = getResolvedOptions(sys.argv, ["JOB_NAME"]) 11 | sc = SparkContext() 12 | glueContext = GlueContext(sc) 13 | spark = glueContext.spark_session 14 | job = Job(glueContext) 15 | job.init(args["JOB_NAME"], args) 16 | 17 | # Script generated for node Amazon S3 18 | weather_dyf = glueContext.create_dynamic_frame.from_options( 19 | format_options={"quoteChar": '"', "withHeader": True, "separator": ","}, 20 | connection_type="s3", 21 | format="csv", 22 | connection_options={ 23 | "paths": ["s3://airflowoutputtos3bucket/raw/weather_api_data.csv"], 24 | "recurse": True, 25 | }, 26 | transformation_ctx="weather_dyf", 27 | ) 28 | 29 | # Script generated for node Change Schema 30 | changeschema_weather_dyf = ApplyMapping.apply( 31 | frame=weather_dyf, 32 | mappings=[ 33 | ("dt", "string", "dt", "string"), 34 | ("weather", "string", "weather", "string"), 35 | ("visibility", "string", "visibility", "string"), 36 | ("`main.temp`", "string", "temp", "string"), 37 | ("`main.feels_like`", "string", "feels_like", "string"), 38 | ("`main.temp_min`", "string", "min_temp", "string"), 39 | ("`main.temp_max`", "string", "max_temp", "string"), 40 | ("`main.pressure`", "string", "pressure", "string"), 41 | ("`main.sea_level`", "string", "sea_level", "string"), 42 | ("`main.grnd_level`", "string", "ground_level", "string"), 43 | ("`main.humidity`", "string", "humidity", "string"), 44 | ("`wind.speed`", "string", "wind", "string"), 45 | ], 46 | transformation_ctx="changeschema_weather_dyf", 47 | ) 48 | 49 | #changeschema_weather_dyf.show() 50 | 51 | redshift_output = glueContext.write_dynamic_frame.from_options( 52 | frame=changeschema_weather_dyf, 53 | connection_type="redshift", 54 | connection_options={ 55 | "redshiftTmpDir": "s3://aws-glue-assets-262136919150-us-east-1/temporary/", 56 | "useConnectionProperties": "true", 57 | "aws_iam_role": "arn:aws:iam::262136919150:role/dataproject-RedshiftIamRole-1PIMCBQ8B05RY", 58 | "dbtable": "public.weather_data", 59 | "connectionName": "redshift-demo-connection", 60 | "preactions": "DROP TABLE IF EXISTS public.weather_data; CREATE TABLE IF NOT EXISTS public.weather_data (dt VARCHAR, weather VARCHAR, visibility VARCHAR, temp VARCHAR, feels_like VARCHAR, min_temp VARCHAR, max_temp VARCHAR, pressure VARCHAR, sea_level VARCHAR, ground_level VARCHAR, humidity VARCHAR, wind VARCHAR);", 61 | }, 62 | transformation_ctx="redshift_output", 63 | ) 64 | 65 | job.commit() 66 | --------------------------------------------------------------------------------
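Once both DAGs have run, one way to confirm that the Glue job loaded the table is to query public.weather_data through the Redshift Data API. The sketch below is hypothetical: the cluster identifier, account ID, and secret ARN are placeholders to be replaced with the values your CloudFormation stack created.

```python
# Hypothetical post-load check using the Redshift Data API.
import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

resp = client.execute_statement(
    ClusterIdentifier="dataproject-redshift-cluster",  # placeholder: <stack-name>-redshift-cluster
    Database="dev",                                    # dbName from the template's mapping
    SecretArn="arn:aws:secretsmanager:us-east-1:111111111111:secret:redshift-creds-example",  # placeholder
    Sql="SELECT dt, temp, humidity, wind FROM public.weather_data LIMIT 10;",
)

# Poll until the statement finishes, then print the returned rows.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

if desc["Status"] == "FINISHED":
    for row in client.get_statement_result(Id=resp["Id"])["Records"]:
        print([list(col.values())[0] for col in row])
else:
    print("Query did not finish:", desc.get("Error"))
```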