├── dio-aws-mostcommon_word.py
└── README.md

/dio-aws-mostcommon_word.py:
--------------------------------------------------------------------------------
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Raw string so the \w escape is interpreted by the regex engine, not Python.
REGEX_ONLY_WORDS = r"[\w']+"

class MRDataMining(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every word in the line, normalized to lower case.
        words = re.findall(REGEX_ONLY_WORDS, line)
        for word in words:
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Send all (num_occurrences, word) pairs to the same reducer (key=None).
        # Putting num_occurrences first lets the final step use Python's max().
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # Each item of count_word_pairs is (count, word), so yielding the
        # largest one produces key=count, value=word for the most frequent word.
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRDataMining.run()
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# PROJECT BUILDING A BIG DATA ECOSYSTEM IN THE AWS CLOUD

- Project developed during the Cognizant Cloud Data Engineer Bootcamp on the Digital Innovation One platform. The goal is to extract and count the words of a book in plain-text format and display the most frequent word, using a Python algorithm.

- The algorithm was developed with the mrjob library (the MRJob and MRStep classes) to create MapReduce jobs. The code itself creates the cluster in the Elastic MapReduce (EMR) service, following the infrastructure-as-code approach, so there is no need to create the cluster manually through the AWS console. EMR processes the book (input data) and stores the word counts (output data) in S3.

## Project steps

- Create a data lake (bucket) structure in the S3 storage service.

- Inside the bucket, create the following folders: `data`, `output`, `temp` (a boto3 sketch of this setup is included at the end of this README).

- Create an SSH key pair from the EC2 console and download the `.pem` file. The SSH key allows remote access from the local machine to the AWS instances.

- Get the AWS access key ID and secret access key to configure mrjob.

- Create a Python virtual environment on a Linux virtual machine.

- In the Python virtual environment:
  - Create the `dio-aws-mostcommon_word.py` word-counting algorithm in VS Code.
  - Install boto3: `pip install boto3`
  - Install mrjob: `pip install mrjob`
  - Configure the mrjob configuration file (`~/.mrjob.conf` or `/etc/mrjob.conf`); an example is sketched at the end of this README.
  - Access S3 and upload the book file to the bucket.

- In the Python virtual environment, run the job from the VM's Linux terminal (see the command sketched at the end of this README). Python creates the cluster in EMR, processes the data, and stores the output data in the appropriate S3 folders.

- Access EMR and verify the cluster that was created.

- Access S3 and check the data generated in the folders.
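
## Example sketches

The snippets below are illustrative sketches only; bucket names, file names, credentials, and instance settings are placeholders, not values from the original project.

The bucket, its folder layout, and the upload of the book can be scripted with boto3 (installed in the project's virtual environment). A minimal sketch, assuming a hypothetical bucket called `dio-mostcommon-word-datalake` and a local file `book.txt`:

```python
import boto3

BUCKET = "dio-mostcommon-word-datalake"  # hypothetical bucket name

s3 = boto3.client("s3")

# Create the data lake bucket (names must be globally unique; outside
# us-east-1, also pass CreateBucketConfiguration with a LocationConstraint).
s3.create_bucket(Bucket=BUCKET)

# Create the folder structure used by the job: data/, output/, temp/.
for folder in ("data/", "output/", "temp/"):
    s3.put_object(Bucket=BUCKET, Key=folder)

# Upload the plain-text book into the data/ folder (input data).
s3.upload_file("book.txt", BUCKET, "data/book.txt")
```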
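
For the EMR runner, the mrjob configuration file might look like the sketch below; all values are placeholders (mrjob can also pick up credentials from the usual AWS environment variables):

```yaml
# ~/.mrjob.conf (or /etc/mrjob.conf) -- placeholder values only
runners:
  emr:
    aws_access_key_id: <YOUR_ACCESS_KEY_ID>
    aws_secret_access_key: <YOUR_SECRET_ACCESS_KEY>
    region: us-east-1
    ec2_key_pair: <YOUR_KEY_PAIR_NAME>
    ec2_key_pair_file: ~/<YOUR_KEY_PAIR_NAME>.pem
    instance_type: m5.xlarge
    num_core_instances: 2
```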
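
The job can be tested locally and then launched on EMR from the VM's terminal with commands along these lines (bucket and file names are illustrative):

```bash
# Local smoke test, no cluster involved:
python dio-aws-mostcommon_word.py book.txt

# Run on EMR: read the book from the bucket's data/ folder and
# write the result to its output/ folder.
python dio-aws-mostcommon_word.py -r emr \
    s3://dio-mostcommon-word-datalake/data/book.txt \
    --output-dir s3://dio-mostcommon-word-datalake/output
```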