├── dio-aws-mostcommon_word.py
└── README.md

/dio-aws-mostcommon_word.py:
--------------------------------------------------------------------------------
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

# Raw string so the \w escape is interpreted by the regex engine, not Python.
REGEX_ONLY_WORDS = r"[\w']+"

class MRDataMining(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words, reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every word in the line, normalized to lower case.
        words = re.findall(REGEX_ONLY_WORDS, line)
        for word in words:
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Send all (num_occurrences, word) pairs to the same reducer (key=None).
        # Putting num_occurrences first lets the final step use Python's max().
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # Each item of count_word_pairs is (count, word), so yielding the
        # largest one produces key=count, value=word for the most frequent word.
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRDataMining.run()
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# PROJECT BUILDING A BIG DATA ECOSYSTEM IN THE AWS CLOUD

- Project developed during the Cognizant Cloud Data Engineer Bootcamp on the Digital Innovation One platform. The goal is to extract and count the words of a book in plain-text format and display the most frequent word, using a Python algorithm.

- The algorithm was developed with the mrjob library (the MRJob and MRStep classes) to create MapReduce jobs. The code itself creates the cluster in the Elastic MapReduce (EMR) service, following the infrastructure-as-code approach, so there is no need to create the cluster manually through the AWS console. EMR processes the book (input data) and stores the word counts (output data) in S3.

## Project steps

- Create a data lake (bucket) structure in the S3 storage service.

- Inside the bucket, create the following folders: `data`, `output`, `temp` (a boto3 sketch of this setup is included at the end of this README).

- Create an SSH key pair from the EC2 console and download the `.pem` file. The SSH key allows remote access from the local machine to the AWS instances.

- Get the AWS access key ID and secret access key to configure mrjob.

- Create a Python virtual environment on a Linux virtual machine.

- In the Python virtual environment:
  - Create the `dio-aws-mostcommon_word.py` word-counting algorithm in VS Code.
  - Install boto3: `pip install boto3`
  - Install mrjob: `pip install mrjob`
  - Configure the mrjob configuration file (`~/.mrjob.conf` or `/etc/mrjob.conf`); an example is sketched at the end of this README.
  - Access S3 and upload the book file to the bucket.

- In the Python virtual environment, run the job from the VM's Linux terminal (see the command sketched at the end of this README). Python creates the cluster in EMR, processes the data, and stores the output data in the appropriate S3 folders.

- Access EMR and verify the cluster that was created.

- Access S3 and check the data generated in the folders.
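
## Example sketches

The snippets below are illustrative sketches only; bucket names, file names, credentials, and instance settings are placeholders, not values from the original project.

The bucket, its folder layout, and the upload of the book can be scripted with boto3 (installed in the project's virtual environment). A minimal sketch, assuming a hypothetical bucket called `dio-mostcommon-word-datalake` and a local file `book.txt`:

```python
import boto3

BUCKET = "dio-mostcommon-word-datalake"  # hypothetical bucket name

s3 = boto3.client("s3")

# Create the data lake bucket (names must be globally unique; outside
# us-east-1, also pass CreateBucketConfiguration with a LocationConstraint).
s3.create_bucket(Bucket=BUCKET)

# Create the folder structure used by the job: data/, output/, temp/.
for folder in ("data/", "output/", "temp/"):
    s3.put_object(Bucket=BUCKET, Key=folder)

# Upload the plain-text book into the data/ folder (input data).
s3.upload_file("book.txt", BUCKET, "data/book.txt")
```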
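
For the EMR runner, the mrjob configuration file might look like the sketch below; all values are placeholders (mrjob can also pick up credentials from the usual AWS environment variables):

```yaml
# ~/.mrjob.conf (or /etc/mrjob.conf) -- placeholder values only
runners:
  emr:
    aws_access_key_id: <YOUR_ACCESS_KEY_ID>
    aws_secret_access_key: <YOUR_SECRET_ACCESS_KEY>
    region: us-east-1
    ec2_key_pair: <YOUR_KEY_PAIR_NAME>
    ec2_key_pair_file: ~/<YOUR_KEY_PAIR_NAME>.pem
    instance_type: m5.xlarge
    num_core_instances: 2
```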
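
The job can be tested locally and then launched on EMR from the VM's terminal with commands along these lines (bucket and file names are illustrative):

```bash
# Local smoke test, no cluster involved:
python dio-aws-mostcommon_word.py book.txt

# Run on EMR: read the book from the bucket's data/ folder and
# write the result to its output/ folder.
python dio-aws-mostcommon_word.py -r emr \
    s3://dio-mostcommon-word-datalake/data/book.txt \
    --output-dir s3://dio-mostcommon-word-datalake/output
```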