├── LICENSE ├── README.md ├── consumer.py ├── producer.py └── snowflake_queries.sql /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # real-time_crypto_data_pipeline_using_kafka 2 | I am using confluent Kafka cluster to produce and consume scraped data. 3 | 4 | 5 | In this project, I've created a real-time data pipeline that utilizes Kafka to scrape, process, and load data onto S3 in JSON format. With a producer-consumer architecture, I ensure that the data is in the right format for loading onto S3 by performing minor transformations while consuming it. 6 | 7 | But that's not all - I've also used AWS crawler to crawl the data and generate a schema catalog. Athena utilizes this catalog, allowing me to query the data directly from S3 without loading it first. This saves time and resources and enables me to get insights from the data much faster! 8 | 9 | Moreover, I've connected S3 with Snowflake using Snowpipe. As data is loaded onto S3, a SNS notification is sent to Snowpipe, which then automatically starts loading the data into Snowflake. This makes data loading a seamless and automated process, freeing up time for other important tasks. 10 | 11 | 12 | ![kafka_proj](https://user-images.githubusercontent.com/128234000/235583169-ae099338-60e4-4c04-a4fb-b4707a6e743a.png) 13 | -------------------------------------------------------------------------------- /consumer.py: -------------------------------------------------------------------------------- 1 | #import os 2 | import pandas as pd 3 | import boto3 4 | import json 5 | import datetime 6 | 7 | 8 | from confluent_kafka import Consumer 9 | from confluent_kafka.serialization import SerializationContext, MessageField 10 | from confluent_kafka.schema_registry.json_schema import JSONDeserializer 11 | from confluent_kafka.schema_registry import SchemaRegistryClient 12 | 13 | 14 | 15 | API_KEY = '' 16 | ENDPOINT_SCHEMA_URL = '' 17 | API_SECRET_KEY = '' 18 | BOOTSTRAP_SERVER = '' 19 | SECURITY_PROTOCOL = 'SASL_SSL' 20 | SSL_MACHENISM = 'PLAIN' 21 | SCHEMA_REGISTRY_API_KEY = '' 22 | SCHEMA_REGISTRY_API_SECRET = '' 23 | 24 | 25 | 26 | 27 | def sasl_conf(): 28 | 29 | sasl_conf = {'sasl.mechanism': SSL_MACHENISM, 30 | # Set to SASL_SSL to enable TLS support. 31 | # 'security.protocol': 'SASL_PLAINTEXT'} 32 | 'bootstrap.servers':BOOTSTRAP_SERVER, 33 | 'security.protocol': SECURITY_PROTOCOL, 34 | 'sasl.username': API_KEY, 35 | 'sasl.password': API_SECRET_KEY 36 | } 37 | return sasl_conf 38 | 39 | 40 | 41 | def schema_config(): 42 | return {'url':ENDPOINT_SCHEMA_URL, 43 | 'basic.auth.user.info':f"{SCHEMA_REGISTRY_API_KEY}:{SCHEMA_REGISTRY_API_SECRET}"} 44 | 45 | 46 | class CryptoRec: 47 | def __init__(self,record:dict): 48 | for k,v in record.items(): 49 | setattr(self,k,v) 50 | self.record=record 51 | 52 | @staticmethod 53 | def dict_to_CryptoRec(data:dict,ctx): 54 | return CryptoRec(record=data) 55 | 56 | def __str__(self): 57 | return f"{self.record}" 58 | 59 | 60 | 61 | s3 = boto3.client('s3') 62 | bucket_name = "your-bucket-name" 63 | def consumeData(topic): 64 | 65 | schema_registry_conf = schema_config() 66 | schema_registry_client = SchemaRegistryClient(schema_registry_conf) 67 | # subjects = schema_registry_client.get_subjects() 68 | # print(subjects) 69 | subject = topic+'-value' 70 | 71 | schema = schema_registry_client.get_latest_version(subject) 72 | schema_str=schema.schema.schema_str 73 | 74 | json_deserializer = JSONDeserializer(schema_str, 75 | from_dict=CryptoRec.dict_to_CryptoRec) 76 | 77 | consumer_conf = sasl_conf() 78 | consumer_conf.update({ 79 | 'group.id': 'group1', 80 | 'auto.offset.reset': "earliest"}) 81 | 82 | consumer = Consumer(consumer_conf) 83 | consumer.subscribe([topic]) 84 | 85 | 86 | while True: 87 | try: 88 | # SIGINT can't be handled when polling, limit timeout to 1 second. 89 | msg = consumer.poll(1.0) 90 | if msg is None: 91 | continue 92 | 93 | #to hold tranformed data 94 | transform_data = {} 95 | 96 | cryptoRecord = json_deserializer(msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)) 97 | 98 | if cryptoRecord is not None: 99 | #print("User record {}: crypto_Data: {}\n".format(msg.key(), cryptoRecord)) 100 | transform_data['SYSTEM_INSERTED_TIMESTAMP'] = datetime.datetime.fromtimestamp(cryptoRecord.record['SYSTEM_INSERTED_TIMESTAMP'] / 1000.0).strftime('%Y-%m-%d %H:%M:%S') 101 | transform_data['RANK'] = int(cryptoRecord.record['RANK']) 102 | transform_data['NAME'] = cryptoRecord.record['NAME'] 103 | transform_data['SYMBOL'] = cryptoRecord.record['SYMBOL'] 104 | transform_data['PRICE'] = float((cryptoRecord.record['PRICE'].replace('$', '').replace(',', '').replace(' ', ''))) 105 | transform_data['PERCENT_CHANGE_24H'] = float(cryptoRecord.record['PERCENT_CHANGE_24H'].replace('%', '').replace(',', '').replace(' ', '')) 106 | transform_data['VOLUME_24H'] = float(cryptoRecord.record['VOLUME_24H'].replace('$', '').replace('B', 'E9').replace('M', 'E6').replace(',', '').replace(' ', '')) 107 | transform_data['MARKET_CAP'] = float(cryptoRecord.record['MARKET_CAP'].replace('$', '').replace('B', 'E9').replace('M', 'E6').replace(',', '').replace(' ', '')) 108 | transform_data['CURRENCY'] = 'USD' 109 | 110 | json_str = json.dumps(transform_data) 111 | #print(json_str) 112 | file_name = "real_time_data/top_100_crypto_data_" + str(cryptoRecord.record['SYSTEM_INSERTED_TIMESTAMP']) + '_' + str(cryptoRecord.record['RANK']) + '.json' 113 | 114 | response = s3.put_object( 115 | Bucket=bucket_name, 116 | Key=file_name, 117 | Body=json_str) 118 | print("file_uploaded: ",file_name) 119 | except KeyboardInterrupt: 120 | break 121 | 122 | consumer.close() 123 | 124 | 125 | 126 | 127 | def main(): 128 | 129 | topic = 'top_100_crypto' 130 | print("starting consumer: ",topic) 131 | consumeData(topic) 132 | 133 | 134 | main() 135 | 136 | 137 | -------------------------------------------------------------------------------- /producer.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import requests 3 | import time 4 | import datetime 5 | import pandas as pd 6 | import argparse 7 | import json 8 | from json import dumps 9 | from time import sleep 10 | 11 | from uuid import uuid4 12 | from confluent_kafka import Producer 13 | from confluent_kafka.serialization import StringSerializer, SerializationContext, MessageField 14 | from confluent_kafka.schema_registry import SchemaRegistryClient 15 | from confluent_kafka.schema_registry.json_schema import JSONSerializer 16 | 17 | 18 | 19 | API_KEY = '' 20 | ENDPOINT_SCHEMA_URL = '' 21 | API_SECRET_KEY = '' 22 | BOOTSTRAP_SERVER = '' 23 | SECURITY_PROTOCOL = 'SASL_SSL' 24 | SSL_MACHENISM = 'PLAIN' 25 | SCHEMA_REGISTRY_API_KEY = '' 26 | SCHEMA_REGISTRY_API_SECRET = '' 27 | 28 | 29 | parser = argparse.ArgumentParser(description='To gett arguments passed during run time') 30 | parser.add_argument('topic', help='topic help') 31 | parser.add_argument('time', help='time help') 32 | args = parser.parse_args() 33 | 34 | topic = args.topic 35 | time = int(args.time) 36 | 37 | 38 | 39 | #config function 40 | def sasl_conf(): 41 | 42 | sasl_conf = {'sasl.mechanism': SSL_MACHENISM, 43 | # Set to SASL_SSL to enable TLS support. 44 | # 'security.protocol': 'SASL_PLAINTEXT'} 45 | 'bootstrap.servers':BOOTSTRAP_SERVER, 46 | 'security.protocol': SECURITY_PROTOCOL, 47 | 'sasl.username': API_KEY, 48 | 'sasl.password': API_SECRET_KEY 49 | } 50 | return sasl_conf 51 | 52 | #config function 53 | def schema_config(): 54 | return {'url':ENDPOINT_SCHEMA_URL, 55 | 'basic.auth.user.info':f"{SCHEMA_REGISTRY_API_KEY}:{SCHEMA_REGISTRY_API_SECRET}"} 56 | 57 | #Error detector function. Gives report if data is sent succesfully or not 58 | def delivery_report(err, msg): 59 | """ 60 | Reports the success or failure of a message delivery. 61 | Args: 62 | err (KafkaError): The error that occurred on None on success. 63 | msg (Message): The message that was produced or failed. 64 | """ 65 | 66 | if err is not None: 67 | print("Delivery failed for User record {}: {}".format(msg.key(), err)) 68 | return 69 | print('User record {} successfully produced to {} [{}] at offset {}'.format( 70 | msg.key(), msg.topic(), msg.partition(), msg.offset())) 71 | 72 | 73 | 74 | #function for scraping data when called 75 | def scrape_data(url): 76 | 77 | allRecordsCombined = [] 78 | 79 | for page in range(1,3): 80 | # Make a request to the website 81 | response = requests.get(url+str(page)) 82 | current_timestamp = datetime.datetime.now() 83 | 84 | soup = BeautifulSoup(response.content, 'html.parser') 85 | 86 | # Find the table containing the top 100 cryptocurrencies 87 | treeTag = soup.find_all('tr') 88 | 89 | #print(treeTag) 90 | 91 | for tree in treeTag[1:]: 92 | rank = tree.find('td',{'class': 'css-w6jew4'}).get_text() 93 | name = tree.find('p',{'class': 'chakra-text css-rkws3'}).get_text() 94 | symbol = tree.find('span',{'class': 'css-1jj7b1a'}).get_text() 95 | market_cap = tree.find('td',{'class':'css-1nh9lk8'}).get_text() 96 | change_24h = "" 97 | price_arr = str(tree.find('div',{'class':'css-16q9pr7'}).get_text()) 98 | if('-' in price_arr): 99 | price_arr = price_arr.split('-') 100 | change_24h = '-'+price_arr[1] 101 | else: 102 | price_arr = price_arr.split('+') 103 | change_24h = '+'+price_arr[1] 104 | price = price_arr[0] 105 | volume_24 = tree.find('td',{'class':'css-1nh9lk8'}).get_text() 106 | 107 | 108 | #print("Rank: ", rank) 109 | #print("NAME: ", name) 110 | #print("symbol: ", symbol) 111 | #print("price: ", price) 112 | #print("market_cap: ", market_cap) 113 | #print("volume_24: ", volume_24) 114 | #print("change_24h: ", change_24h) 115 | 116 | allRecordsCombined.append([current_timestamp, rank, name, symbol, price, change_24h, volume_24, market_cap]) 117 | 118 | columns = ['SYSTEM_INSERTED_TIMESTAMP', 'RANK','NAME', 'SYMBOL', 'PRICE', 'PERCENT_CHANGE_24H','VOLUME_24H', 'MARKET_CAP'] 119 | df = pd.DataFrame(columns=columns, data=allRecordsCombined) 120 | current_timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") 121 | # Convert data frame to JSON string 122 | json_export = df.to_json(orient='records') 123 | #df.to_csv('s3://coinmarketcap-bucket/raw_layer/{}.csv'.format(current_timestamp), index=False) 124 | #df.to_csv(f"{dag_path}/dags/output/{current_timestamp}.csv", index=False) 125 | #print(f"FILE created at: {dag_path}/output/{current_timestamp}.csv") 126 | return json.loads(json_export) 127 | 128 | 129 | #class craeted to hold scraped data in object format 130 | class CryptoRec: 131 | def __init__(self,record:dict): 132 | self.record=record 133 | 134 | @staticmethod 135 | def dict_to_CryptoRec(data:dict,ctx): 136 | return CryptoRec(record=data) 137 | 138 | def __str__(self): 139 | return f"{self.record}" 140 | 141 | 142 | 143 | #helper funtion to convert object to dict data type 144 | def cryptoRec_to_dict(crypto:CryptoRec, ctx): 145 | """ 146 | Returns a dict representation of a User instance for serialization. 147 | Args: 148 | user (User): User instance. 149 | ctx (SerializationContext): Metadata pertaining to the serialization 150 | operation. 151 | Returns: 152 | dict: Dict populated with user attributes to be serialized. 153 | """ 154 | 155 | # User._address must not be serialized; omit from dict 156 | return crypto.record 157 | 158 | 159 | #This producer function will produce and send data to Confluent kafka 160 | def produceData(topic): 161 | 162 | schema_registry_conf = schema_config() 163 | schema_registry_client = SchemaRegistryClient(schema_registry_conf) 164 | 165 | # subjects = schema_registry_client.get_subjects() 166 | # print(subjects) 167 | subject = topic+'-value' 168 | 169 | schema = schema_registry_client.get_latest_version(subject) 170 | schema_str=schema.schema.schema_str 171 | 172 | string_serializer = StringSerializer('utf_8') 173 | json_serializer = JSONSerializer(schema_str, schema_registry_client, cryptoRec_to_dict) 174 | 175 | producer = Producer(sasl_conf()) 176 | 177 | for i in range(0,time): 178 | url = 'https://crypto.com/price?page=' 179 | json_file = scrape_data(url) 180 | #print(json_file) 181 | 182 | print("Producing user records to topic {}. ^C to exit.".format(topic)) 183 | 184 | try: 185 | for temp_rec in json_file: 186 | rec=CryptoRec(temp_rec) 187 | producer.produce( 188 | topic=topic, 189 | key=string_serializer(str(uuid4())), 190 | value=json_serializer(rec, SerializationContext(topic, MessageField.VALUE)), 191 | on_delivery=delivery_report) 192 | producer.poll(0.2) 193 | except ValueError: 194 | print("Invalid input, discarding record...") 195 | pass 196 | 197 | #5 sec delay 198 | sleep(5) 199 | 200 | print("\nFlushing records...") 201 | producer.flush() 202 | 203 | 204 | 205 | def main(): 206 | 207 | #topic = 'top_100_crypto' 208 | print("starting producer: ",topic) 209 | produceData(topic) 210 | 211 | 212 | main() 213 | 214 | 215 | -------------------------------------------------------------------------------- /snowflake_queries.sql: -------------------------------------------------------------------------------- 1 | CREATE DATABASE KAFKA_LIVE_DATA; 2 | 3 | USE DATABASE KAFKA_LIVE_DATA; 4 | 5 | 6 | CREATE TABLE top_100_crypto_data ( 7 | SYSTEM_INSERTED_TIMESTAMP TIMESTAMP, 8 | RANK INTEGER, 9 | NAME VARCHAR, 10 | SYMBOL VARCHAR, 11 | PRICE NUMBER, 12 | PERCENT_CHANGE_24H FLOAT, 13 | VOLUME_24H NUMBER, 14 | MARKET_CAP NUMBER, 15 | CURRENCY VARCHAR 16 | ); 17 | 18 | 19 | CREATE TABLE top_100_crypto_data ( 20 | json_data VARIANT 21 | ); 22 | 23 | CREATE OR REPLACE STAGE ext_stage 24 | URL = 's3://coinmarketcap-bucket/real_time_data/' 25 | CREDENTIALS = ( 26 | AWS_KEY_ID='', 27 | AWS_SECRET_KEY='' 28 | ); 29 | 30 | 31 | 32 | 33 | 34 | CREATE OR REPLACE PIPE live_crypto_data 35 | AUTO_INGEST = TRUE 36 | AS 37 | COPY INTO KAFKA_LIVE_DATA.PUBLIC.TOP_100_CRYPTO_DATA 38 | FROM @ext_stage 39 | FILE_FORMAT = (TYPE=JSON); 40 | 41 | 42 | --for manually refreshing the snowpipe 43 | ALTER PIPE live_crypto_data REFRESH; 44 | 45 | SHOW PIPES; 46 | 47 | 48 | SELECT count(*) AS COUNT FROM kafka_live_data.public.top_100_crypto_data; 49 | 50 | 51 | CREATE TABLE top_100_crypto_data_sink ( 52 | SYSTEM_INSERTED_TIMESTAMP TIMESTAMP, 53 | RANK INTEGER, 54 | NAME VARCHAR, 55 | SYMBOL VARCHAR, 56 | PRICE NUMBER, 57 | PERCENT_CHANGE_24H FLOAT, 58 | VOLUME_24H NUMBER, 59 | MARKET_CAP NUMBER, 60 | CURRENCY VARCHAR 61 | ); 62 | 63 | --------------------------------------------------------------------------------