├── ASL Recognition with Deep Learning.ipynb ├── README.md ├── a_network_analysis_of_game_of_thrones.ipynb ├── a_new_era_of_data_analysis_in_baseball.ipynb ├── a_visual_history_of_nobel_prize_winners.ipynb ├── a_visual_history_of_nobel_prize_winners_python.ipynb ├── bad_passwords_and_the_NIST_guidelines.ipynb ├── classify_song_genres_from_audio_data.ipynb ├── clustering_heart_disease_patient_data.ipynb ├── degrees_that_pay_you_back.ipynb ├── dr_semmelweis_and_the_discovery_of_handwashing.ipynb ├── exploring_67_years_of_lego.ipynb ├── exploring_the_bitcoin_cryptocurrency_market.ipynb ├── exploring_the_evolution_of_linux.ipynb ├── exploring_the_kaggle_data_science_survey.ipynb ├── find_movie_similarity_from_plot_summaries.ipynb ├── functions_for_food_price_forecasts.ipynb ├── generating_keywords_for_google_adwords.ipynb ├── give_life_predict_blood_donations.ipynb ├── level_difficulty_in_candy_crush_saga.ipynb ├── mobile_games_ab-testing_with_cookie_cats.ipynb ├── naive_Bees_predict_species_from_images.ipynb ├── naive_bees_deep_learning_with_images.ipynb ├── naive_bees_image_loading_and_processing.ipynb ├── name_game_genderprediction_using_sound.ipynb ├── phyllotaxis_draw_flowers_using_mathematics.ipynb ├── predict_taxi_fares_with_random_forests.ipynb ├── predicting_credit_card_approvals.ipynb ├── recreating_john_snow_s_ghost_map.ipynb ├── reducing_traffic_mortality_in_the_USA.ipynb ├── rise_and_fall_of_programming_languages.ipynb ├── risk_and_returns_the_sharpe_ratio.ipynb ├── scout_your_athletics_fantasy_team.ipynb ├── the_android_app_market_on_google_play.ipynb ├── the_gitHub_history_of_the_scala_language.ipynb ├── the_hottest_topics_in_machine_learning.ipynb ├── tv_halftime_shows_and_the_big_game.ipynb ├── up_and_down_with_the_kardashians.ipynb ├── visualizing_inequalities_in_life_expectancy.ipynb ├── what_makes_a_pokémon_legendary.ipynb ├── whats_in_a_name.ipynb ├── which_debts_are_worth_the_Bank_s_effort.ipynb ├── who_is_drunk_and_when_in_ames_iowa.ipynb ├── who_is_drunk_and_when_in_ames_iowa_python.ipynb ├── word_frequency_in_moby_dick.ipynb └── wrangling_and_visualizing_musical_data.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # DataCamp Projects 2 | 3 | Each file is the solution to a project in [DataCamp](https://www.datacamp.com). 
4 | 5 | ## R Projects 6 | - [Rise and Fall of Programming Languages](https://github.com/ChristianNogueira/datacamp_projects/blob/master/rise_and_fall_of_programming_languages.ipynb) 7 | - [Predict Taxi Fares with Random Forests](https://github.com/ChristianNogueira/datacamp_projects/blob/master/predict_taxi_fares_with_random_forests.ipynb) 8 | - [Scout Your Athletics Fantasy Team](https://github.com/ChristianNogueira/datacamp_projects/blob/master/scout_your_athletics_fantasy_team.ipynb) 9 | - [Visualizing Inequalities in Life Expectancy](https://github.com/ChristianNogueira/datacamp_projects/blob/master/visualizing_inequalities_in_life_expectancy.ipynb) 10 | - [A Visual History of Nobel Prize Winners](https://github.com/ChristianNogueira/datacamp_projects/blob/master/a_visual_history_of_nobel_prize_winners.ipynb) 11 | - [Who Is Drunk and When in Ames, Iowa?](https://github.com/ChristianNogueira/datacamp_projects/blob/master/who_is_drunk_and_when_in_ames_iowa.ipynb) 12 | - [Bad passwords and the NIST guidelines] 13 | - [Wrangling and Visualizing Musical Data](https://github.com/ChristianNogueira/datacamp_projects/blob/master/wrangling_and_visualizing_musical_data.ipynb) 14 | - [Level difficulty in Candy Crush Saga](https://github.com/ChristianNogueira/datacamp_projects/blob/master/level_difficulty_in_candy_crush_saga.ipynb) 15 | - [Exploring the Kaggle Data Science Survey](https://github.com/ChristianNogueira/datacamp_projects/blob/master/exploring_the_kaggle_data_science_survey.ipynb) 16 | - [Phyllotaxis: Draw flowers using mathematics](https://github.com/ChristianNogueira/datacamp_projects/blob/master/phyllotaxis_draw_flowers_using_mathematics.ipynb) 17 | - [Dr. Semmelweis and the discovery of handwashing] 18 | - [Introduction to DataCamp Projects] 19 | 20 | ## Python Projects 21 | - [Predicting Credit Card Approvals](https://github.com/ChristianNogueira/datacamp_projects/blob/master/predicting_credit_card_approvals.ipynb) 22 | - [Give Life, Predict Blood Donations](https://github.com/ChristianNogueira/datacamp_projects/blob/master/give_life_predict_blood_donations.ipynb) 23 | - [Classify Song Genres from Audio Data](https://github.com/ChristianNogueira/datacamp_projects/blob/master/classify_song_genres_from_audio_data.ipynb) 24 | - [A Visual History of Nobel Prize Winners](https://github.com/ChristianNogueira/datacamp_projects/blob/master/a_visual_history_of_nobel_prize_winners_python.ipynb) 25 | - [Who Is Drunk and When in Ames, Iowa? 
(Python version)](https://github.com/ChristianNogueira/datacamp_projects/blob/master/who_is_drunk_and_when_in_ames_iowa_python.ipynb) 26 | - [Naïve Bees: Predict Species from Images](https://github.com/ChristianNogueira/datacamp_projects/blob/master/naive_Bees_predict_species_from_images.ipynb) 27 | - [Generating Keywords for Google AdWords](https://github.com/ChristianNogueira/datacamp_projects/blob/master/generating_keywords_for_google_adwords.ipynb) 28 | - [Naïve Bees: Image Loading and Processing](https://github.com/ChristianNogueira/datacamp_projects/blob/master/naive_bees_image_loading_and_processing.ipynb) 29 | - [The GitHub History of the Scala Language](https://github.com/ChristianNogueira/datacamp_projects/blob/master/the_gitHub_history_of_the_scala_language.ipynb) 30 | - [The Hottest Topics in Machine Learning](https://github.com/ChristianNogueira/datacamp_projects/blob/master/the_hottest_topics_in_machine_learning.ipynb) 31 | - [Recreating John Snow's Ghost Map](https://github.com/ChristianNogueira/datacamp_projects/blob/master/recreating_john_snow_s_ghost_map.ipynb) 32 | - [A New Era of Data Analysis in Baseball](https://github.com/ChristianNogueira/datacamp_projects/blob/master/a_new_era_of_data_analysis_in_baseball.ipynb) 33 | - [Mobile Games A/B Testing with Cookie Cats](https://github.com/ChristianNogueira/datacamp_projects/blob/master/mobile_games_ab-testing_with_cookie_cats.ipynb) 34 | - [A Network Analysis of Game of Thrones](https://github.com/ChristianNogueira/datacamp_projects/blob/master/a_network_analysis_of_game_of_thrones.ipynb) 35 | - [Name Game: Gender Prediction using Sound](https://github.com/ChristianNogueira/datacamp_projects/blob/master/name_game_genderprediction_using_sound.ipynb) 36 | - [Risk and Returns: The Sharpe Ratio](https://github.com/ChristianNogueira/datacamp_projects/blob/master/risk_and_returns_the_sharpe_ratio.ipynb) 37 | - [Exploring the Bitcoin cryptocurrency market](https://github.com/ChristianNogueira/datacamp_projects/blob/master/exploring_the_bitcoin_cryptocurrency_market.ipynb) 38 | - [Word frequency in Moby Dick](https://github.com/ChristianNogueira/datacamp_projects/blob/master/word_frequency_in_moby_dick.ipynb) 39 | - [Bad passwords and the NIST guidelines](https://github.com/ChristianNogueira/datacamp_projects/blob/master/bad_passwords_and_the_NIST_guidelines.ipynb) 40 | - [Dr. Semmelweis and the discovery of handwashing](https://github.com/ChristianNogueira/datacamp_projects/blob/master/dr_semmelweis_and_the_discovery_of_handwashing.ipynb) 41 | - [Exploring the evolution of Linux](https://github.com/ChristianNogueira/datacamp_projects/blob/master/exploring_the_evolution_of_linux.ipynb) 42 | - [Exploring 67 years of LEGO](https://github.com/ChristianNogueira/datacamp_projects/blob/master/exploring_67_years_of_lego.ipynb) 43 | - [Introduction to DataCamp Projects] 44 | 45 | ## Projects no longer available on DataCamp (or that I missed) 46 | - [What's in a Name](https://github.com/ChristianNogueira/datacamp_projects/blob/master/whats_in_a_name.ipynb) 47 | -------------------------------------------------------------------------------- /bad_passwords_and_the_NIST_guidelines.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","source":"## 1. The NIST Special Publication 800-63B\n
If you – 50 years ago – needed to come up with a secret password you were probably part of a secret espionage organization or (more likely) you were pretending to be a spy when playing as a kid. Today, many of us are forced to come up with new passwords all the time when signing into sites and apps. As a password inventor, it is your responsibility to come up with good, hard-to-crack passwords. But it is also in the interest of sites and apps to make sure that you use good passwords. The problem is that it's really hard to define what makes a good password. However, the National Institute of Standards and Technology (NIST) knows what the second-best thing is: To make sure you're at least not using a bad password.
\nIn this notebook, we will go through the rules in NIST Special Publication 800-63B, which details what checks a verifier (what the NIST calls a second party responsible for storing and verifying passwords) should perform to make sure users don't pick bad passwords. We will go through the passwords of users from a fictional company and use Python to flag the users with bad passwords. But the very fact that we can do this means the fictional company is already breaking one of the rules of 800-63B:
\n\n\nVerifiers SHALL store memorized secrets in a form that is resistant to offline attacks. Memorized secrets SHALL be salted and hashed using a suitable one-way key derivation function.
\n
That is, never store users' passwords in plaintext; always salt and hash them! Keeping this in mind for the next time we're building a password management system, let's load in the data.
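As a minimal sketch of what that rule means in practice (using only Python's standard library; a production verifier would typically reach for a dedicated scheme such as bcrypt, scrypt or Argon2 instead of this bare-bones PBKDF2 call):

```python
import hashlib
import hmac
import os

def hash_password(password):
    """Salt and hash a password with a one-way key derivation function."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    return salt, digest

def verify_password(password, salt, digest):
    """Re-derive the hash from the candidate password and compare safely."""
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100000)
    return hmac.compare_digest(candidate, digest)
```

With this scheme the verifier never needs to keep the plaintext around: only the salt and the digest are stored.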
\nWarning: The list of passwords and the fictional user database both contain real passwords leaked from real websites. These passwords have not been filtered in any way and include words that are explicit, derogatory and offensive.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"3"}}},{"cell_type":"code","source":"# Importing the pandas module\nimport pandas as pd\n\n# Loading in datasets/users.csv \nusers = pd.read_csv('datasets/users.csv')\n\n# Printing out how many users we've got\nprint(len(users.index))\n\n# Taking a look at the 12 first users\nusers.head(n=12)","metadata":{"dc":{"key":"3"},"trusted":true,"tags":["sample_code"]},"execution_count":98,"outputs":[{"text":"982\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":" id user_name password\n0 1 vance.jennings joobheco\n1 2 consuelo.eaton 0869347314\n2 3 mitchel.perkins fabypotter\n3 4 odessa.vaughan aharney88\n4 5 araceli.wilder acecdn3000\n5 6 shawn.harrington 5278049\n6 7 evelyn.gay master\n7 8 noreen.hale murphy\n8 9 gladys.ward lwsves2\n9 10 brant.zimmerman 1190KAREN5572497\n10 11 leanna.abbott aivlys24\n11 12 milford.hubbard hubbard","text/html":"\n | id | \nuser_name | \npassword | \n
---|---|---|---|
0 | \n1 | \nvance.jennings | \njoobheco | \n
1 | \n2 | \nconsuelo.eaton | \n0869347314 | \n
2 | \n3 | \nmitchel.perkins | \nfabypotter | \n
3 | \n4 | \nodessa.vaughan | \naharney88 | \n
4 | \n5 | \naraceli.wilder | \nacecdn3000 | \n
5 | \n6 | \nshawn.harrington | \n5278049 | \n
6 | \n7 | \nevelyn.gay | \nmaster | \n
7 | \n8 | \nnoreen.hale | \nmurphy | \n
8 | \n9 | \ngladys.ward | \nlwsves2 | \n
9 | \n10 | \nbrant.zimmerman | \n1190KAREN5572497 | \n
10 | \n11 | \nleanna.abbott | \naivlys24 | \n
11 | \n12 | \nmilford.hubbard | \nhubbard | \n
If we take a look at the first 12 users above, we already see some bad passwords. But let's not get ahead of ourselves and start flagging passwords manually. What is the first thing we should check according to the NIST Special Publication 800-63B?
\n\n\nVerifiers SHALL require subscriber-chosen memorized secrets to be at least 8 characters in length.
\n
Ok, so the passwords of our users shouldn't be too short. Let's start by checking that!
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"10"}}},{"cell_type":"code","source":"# Calculating the lengths of users' passwords\nusers['length'] = users['password'].str.len()\n\n# Flagging the users with too short passwords\nusers['too_short'] = users['length'] < 8\n\n# Counting and printing the number of users with too short passwords\nprint(users['too_short'].sum())\n\n# Taking a look at the 12 first rows\nusers.head(n=12)","metadata":{"dc":{"key":"10"},"trusted":true,"tags":["sample_code"]},"execution_count":100,"outputs":[{"text":"376\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":" id user_name password length too_short\n0 1 vance.jennings joobheco 8 False\n1 2 consuelo.eaton 0869347314 10 False\n2 3 mitchel.perkins fabypotter 10 False\n3 4 odessa.vaughan aharney88 9 False\n4 5 araceli.wilder acecdn3000 10 False\n5 6 shawn.harrington 5278049 7 True\n6 7 evelyn.gay master 6 True\n7 8 noreen.hale murphy 6 True\n8 9 gladys.ward lwsves2 7 True\n9 10 brant.zimmerman 1190KAREN5572497 16 False\n10 11 leanna.abbott aivlys24 8 False\n11 12 milford.hubbard hubbard 7 True","text/html":"\n | id | \nuser_name | \npassword | \nlength | \ntoo_short | \n
---|---|---|---|---|---|
0 | \n1 | \nvance.jennings | \njoobheco | \n8 | \nFalse | \n
1 | \n2 | \nconsuelo.eaton | \n0869347314 | \n10 | \nFalse | \n
2 | \n3 | \nmitchel.perkins | \nfabypotter | \n10 | \nFalse | \n
3 | \n4 | \nodessa.vaughan | \naharney88 | \n9 | \nFalse | \n
4 | \n5 | \naraceli.wilder | \nacecdn3000 | \n10 | \nFalse | \n
5 | \n6 | \nshawn.harrington | \n5278049 | \n7 | \nTrue | \n
6 | \n7 | \nevelyn.gay | \nmaster | \n6 | \nTrue | \n
7 | \n8 | \nnoreen.hale | \nmurphy | \n6 | \nTrue | \n
8 | \n9 | \ngladys.ward | \nlwsves2 | \n7 | \nTrue | \n
9 | \n10 | \nbrant.zimmerman | \n1190KAREN5572497 | \n16 | \nFalse | \n
10 | \n11 | \nleanna.abbott | \naivlys24 | \n8 | \nFalse | \n
11 | \n12 | \nmilford.hubbard | \nhubbard | \n7 | \nTrue | \n
Already this simple rule flagged a couple of offenders among the first 12 users. Next up in Special Publication 800-63B is the rule that
\n\n\nverifiers SHALL compare the prospective secrets against a list that contains values known to be commonly-used, expected, or compromised.
\n\n
\n- Passwords obtained from previous breach corpuses.
\n- Dictionary words.
\n- Repetitive or sequential characters (e.g. ‘aaaaaa’, ‘1234abcd’).
\n- Context-specific words, such as the name of the service, the username, and derivatives thereof.
\n
We're going to check these in order and start with Passwords obtained from previous breach corpuses, that is, websites where hackers have leaked all the users' passwords. Because many websites don't follow the NIST guidelines and store passwords unhashed, there now exist large lists of the most popular passwords. Let's start by loading in the 10,000 most common passwords, which I've taken from here.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"17"}}},{"cell_type":"code","source":"# Reading in the top 10000 passwords\ncommon_passwords = pd.read_csv('datasets/10_million_password_list_top_10000.txt', header=None, squeeze=True)\n\n# Taking a look at the top 20\ncommon_passwords.head(n=20)","metadata":{"dc":{"key":"17"},"trusted":true,"tags":["sample_code"]},"execution_count":102,"outputs":[{"data":{"text/plain":"0 123456\n1 password\n2 12345678\n3 qwerty\n4 123456789\n5 12345\n6 1234\n7 111111\n8 1234567\n9 dragon\n10 123123\n11 baseball\n12 abc123\n13 football\n14 monkey\n15 letmein\n16 696969\n17 shadow\n18 master\n19 666666\nName: 0, dtype: object"},"output_type":"execute_result","metadata":{},"execution_count":102}]},{"cell_type":"markdown","source":"## 4. Passwords should not be common passwords\nThe list of passwords was ordered, with the most common passwords first, and so we shouldn't be surprised to see passwords like 123456
and qwerty
above. As hackers also have access to this list of common passwords, it's important that none of our users use these passwords!
Let's flag all the passwords in our user database that are among the top 10,000 used passwords.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"24"}}},{"cell_type":"code","source":"# Flagging the users with passwords that are common passwords\nusers['common_password'] = users['password'].isin(common_passwords)\n\n# Counting and printing the number of users using common passwords\nprint(users['common_password'].sum())\n\n# Taking a look at the 12 first rows\nusers.head(n=12)","metadata":{"dc":{"key":"24"},"trusted":true,"tags":["sample_code"]},"execution_count":104,"outputs":[{"text":"129\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":" id user_name password length too_short common_password\n0 1 vance.jennings joobheco 8 False False\n1 2 consuelo.eaton 0869347314 10 False False\n2 3 mitchel.perkins fabypotter 10 False False\n3 4 odessa.vaughan aharney88 9 False False\n4 5 araceli.wilder acecdn3000 10 False False\n5 6 shawn.harrington 5278049 7 True False\n6 7 evelyn.gay master 6 True True\n7 8 noreen.hale murphy 6 True True\n8 9 gladys.ward lwsves2 7 True False\n9 10 brant.zimmerman 1190KAREN5572497 16 False False\n10 11 leanna.abbott aivlys24 8 False False\n11 12 milford.hubbard hubbard 7 True False","text/html":"\n | id | \nuser_name | \npassword | \nlength | \ntoo_short | \ncommon_password | \n
---|---|---|---|---|---|---|
0 | \n1 | \nvance.jennings | \njoobheco | \n8 | \nFalse | \nFalse | \n
1 | \n2 | \nconsuelo.eaton | \n0869347314 | \n10 | \nFalse | \nFalse | \n
2 | \n3 | \nmitchel.perkins | \nfabypotter | \n10 | \nFalse | \nFalse | \n
3 | \n4 | \nodessa.vaughan | \naharney88 | \n9 | \nFalse | \nFalse | \n
4 | \n5 | \naraceli.wilder | \nacecdn3000 | \n10 | \nFalse | \nFalse | \n
5 | \n6 | \nshawn.harrington | \n5278049 | \n7 | \nTrue | \nFalse | \n
6 | \n7 | \nevelyn.gay | \nmaster | \n6 | \nTrue | \nTrue | \n
7 | \n8 | \nnoreen.hale | \nmurphy | \n6 | \nTrue | \nTrue | \n
8 | \n9 | \ngladys.ward | \nlwsves2 | \n7 | \nTrue | \nFalse | \n
9 | \n10 | \nbrant.zimmerman | \n1190KAREN5572497 | \n16 | \nFalse | \nFalse | \n
10 | \n11 | \nleanna.abbott | \naivlys24 | \n8 | \nFalse | \nFalse | \n
11 | \n12 | \nmilford.hubbard | \nhubbard | \n7 | \nTrue | \nFalse | \n
Ay ay ay! It turns out many of our users use common passwords; two of the first 12 users alone are among them. However, as most common passwords also tend to be short, they were already flagged as being too short. What is the next thing we should check?
\n\n\nVerifiers SHALL compare the prospective secrets against a list that contains [...] dictionary words.
\n
This follows the same logic as before: it is easy for hackers to check users' passwords against common English words, and common English words therefore make bad passwords. Let's check our users' passwords against the top 10,000 English words from Google's Trillion Word Corpus.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"31"}}},{"cell_type":"code","source":"# Reading in a list of the 10000 most common words\nwords = pd.read_csv('datasets/google-10000-english.txt', header=None, squeeze=True)\n\n# Flagging the users with passwords that are common words\nusers['common_word'] = users['password'].str.lower().isin(words)\n\n# Counting and printing the number of users using common words as passwords\nprint(users['common_word'].sum())\n\n# Taking a look at the 12 first rows\nusers.head(n=12)","metadata":{"dc":{"key":"31"},"trusted":true,"tags":["sample_code"]},"execution_count":106,"outputs":[{"text":"137\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":" id user_name password length too_short common_password \\\n0 1 vance.jennings joobheco 8 False False \n1 2 consuelo.eaton 0869347314 10 False False \n2 3 mitchel.perkins fabypotter 10 False False \n3 4 odessa.vaughan aharney88 9 False False \n4 5 araceli.wilder acecdn3000 10 False False \n5 6 shawn.harrington 5278049 7 True False \n6 7 evelyn.gay master 6 True True \n7 8 noreen.hale murphy 6 True True \n8 9 gladys.ward lwsves2 7 True False \n9 10 brant.zimmerman 1190KAREN5572497 16 False False \n10 11 leanna.abbott aivlys24 8 False False \n11 12 milford.hubbard hubbard 7 True False \n\n common_word \n0 False \n1 False \n2 False \n3 False \n4 False \n5 False \n6 True \n7 True \n8 False \n9 False \n10 False \n11 False ","text/html":"\n | id | \nuser_name | \npassword | \nlength | \ntoo_short | \ncommon_password | \ncommon_word | \n
---|---|---|---|---|---|---|---|
0 | \n1 | \nvance.jennings | \njoobheco | \n8 | \nFalse | \nFalse | \nFalse | \n
1 | \n2 | \nconsuelo.eaton | \n0869347314 | \n10 | \nFalse | \nFalse | \nFalse | \n
2 | \n3 | \nmitchel.perkins | \nfabypotter | \n10 | \nFalse | \nFalse | \nFalse | \n
3 | \n4 | \nodessa.vaughan | \naharney88 | \n9 | \nFalse | \nFalse | \nFalse | \n
4 | \n5 | \naraceli.wilder | \nacecdn3000 | \n10 | \nFalse | \nFalse | \nFalse | \n
5 | \n6 | \nshawn.harrington | \n5278049 | \n7 | \nTrue | \nFalse | \nFalse | \n
6 | \n7 | \nevelyn.gay | \nmaster | \n6 | \nTrue | \nTrue | \nTrue | \n
7 | \n8 | \nnoreen.hale | \nmurphy | \n6 | \nTrue | \nTrue | \nTrue | \n
8 | \n9 | \ngladys.ward | \nlwsves2 | \n7 | \nTrue | \nFalse | \nFalse | \n
9 | \n10 | \nbrant.zimmerman | \n1190KAREN5572497 | \n16 | \nFalse | \nFalse | \nFalse | \n
10 | \n11 | \nleanna.abbott | \naivlys24 | \n8 | \nFalse | \nFalse | \nFalse | \n
11 | \n12 | \nmilford.hubbard | \nhubbard | \n7 | \nTrue | \nFalse | \nFalse | \n
It turns out many of our passwords were common English words too! Next up on the NIST list:
\n\n\nVerifiers SHALL compare the prospective secrets against a list that contains [...] context-specific words, such as the name of the service, the username, and derivatives thereof.
\n
Ok, so there are many things we could check here. One thing to notice is that our users' usernames consist of their first names and last names separated by a dot. For now, let's just flag passwords that are the same as either a user's first or last name.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"38"}}},{"cell_type":"code","source":"# Extracting first and last names into their own columns\nusers['first_name'] = users['user_name'].str.extract('(^\\w+)')\nusers['last_name'] = users['user_name'].str.extract('(\\w+$)')\n\n# Flagging the users with passwords that matches their names\nusers['uses_name'] = (users['password'] == users['first_name']) | (users['password'] == users['last_name'])\n\n# Counting and printing the number of users using names as passwords\nprint(users['uses_name'].sum())\n\n# Taking a look at the 12 first rows\nusers.head(n=12)","metadata":{"dc":{"key":"38"},"trusted":true,"tags":["sample_code"]},"execution_count":108,"outputs":[{"text":"50\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":" id user_name password length too_short common_password \\\n0 1 vance.jennings joobheco 8 False False \n1 2 consuelo.eaton 0869347314 10 False False \n2 3 mitchel.perkins fabypotter 10 False False \n3 4 odessa.vaughan aharney88 9 False False \n4 5 araceli.wilder acecdn3000 10 False False \n5 6 shawn.harrington 5278049 7 True False \n6 7 evelyn.gay master 6 True True \n7 8 noreen.hale murphy 6 True True \n8 9 gladys.ward lwsves2 7 True False \n9 10 brant.zimmerman 1190KAREN5572497 16 False False \n10 11 leanna.abbott aivlys24 8 False False \n11 12 milford.hubbard hubbard 7 True False \n\n common_word first_name last_name uses_name \n0 False vance jennings False \n1 False consuelo eaton False \n2 False mitchel perkins False \n3 False odessa vaughan False \n4 False araceli wilder False \n5 False shawn harrington False \n6 True evelyn gay False \n7 True noreen hale False \n8 False gladys ward False \n9 False brant zimmerman False \n10 False leanna abbott False \n11 False milford hubbard True ","text/html":"\n | id | \nuser_name | \npassword | \nlength | \ntoo_short | \ncommon_password | \ncommon_word | \nfirst_name | \nlast_name | \nuses_name | \n
---|---|---|---|---|---|---|---|---|---|---|
0 | \n1 | \nvance.jennings | \njoobheco | \n8 | \nFalse | \nFalse | \nFalse | \nvance | \njennings | \nFalse | \n
1 | \n2 | \nconsuelo.eaton | \n0869347314 | \n10 | \nFalse | \nFalse | \nFalse | \nconsuelo | \neaton | \nFalse | \n
2 | \n3 | \nmitchel.perkins | \nfabypotter | \n10 | \nFalse | \nFalse | \nFalse | \nmitchel | \nperkins | \nFalse | \n
3 | \n4 | \nodessa.vaughan | \naharney88 | \n9 | \nFalse | \nFalse | \nFalse | \nodessa | \nvaughan | \nFalse | \n
4 | \n5 | \naraceli.wilder | \nacecdn3000 | \n10 | \nFalse | \nFalse | \nFalse | \naraceli | \nwilder | \nFalse | \n
5 | \n6 | \nshawn.harrington | \n5278049 | \n7 | \nTrue | \nFalse | \nFalse | \nshawn | \nharrington | \nFalse | \n
6 | \n7 | \nevelyn.gay | \nmaster | \n6 | \nTrue | \nTrue | \nTrue | \nevelyn | \ngay | \nFalse | \n
7 | \n8 | \nnoreen.hale | \nmurphy | \n6 | \nTrue | \nTrue | \nTrue | \nnoreen | \nhale | \nFalse | \n
8 | \n9 | \ngladys.ward | \nlwsves2 | \n7 | \nTrue | \nFalse | \nFalse | \ngladys | \nward | \nFalse | \n
9 | \n10 | \nbrant.zimmerman | \n1190KAREN5572497 | \n16 | \nFalse | \nFalse | \nFalse | \nbrant | \nzimmerman | \nFalse | \n
10 | \n11 | \nleanna.abbott | \naivlys24 | \n8 | \nFalse | \nFalse | \nFalse | \nleanna | \nabbott | \nFalse | \n
11 | \n12 | \nmilford.hubbard | \nhubbard | \n7 | \nTrue | \nFalse | \nFalse | \nmilford | \nhubbard | \nTrue | \n
Milford Hubbard (user number 12 above), what were you thinking!? Ok, so the last thing we are going to check is a bit tricky:
\n\n\nverifiers SHALL compare the prospective secrets [so that they don't contain] repetitive or sequential characters (e.g. ‘aaaaaa’, ‘1234abcd’).
\n
This is tricky to check because what is repetitive is hard to define. Is 11111
repetitive? Yes! Is 12345
repetitive? Well, kind of. Is 13579
repetitive? Maybe not..? Checking for repetitiveness can get arbitrarily complex, but here we're only going to do something simple: we're going to flag all passwords that contain 4 or more repeated characters.
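A minimal sketch of such a check, using a regex backreference; the column name too_many_repeats matches the output below:

```python
# Flag passwords containing the same character 4 or more times in a row:
# (.) captures any single character, \1\1\1 requires three more copies of it
users['too_many_repeats'] = users['password'].str.contains(r'(.)\1\1\1')

# Take a look at the users that were flagged
users[users['too_many_repeats']]
```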
\n | id | \nuser_name | \npassword | \nlength | \ntoo_short | \ncommon_password | \ncommon_word | \nfirst_name | \nlast_name | \nuses_name | \ntoo_many_repeats | \n
---|---|---|---|---|---|---|---|---|---|---|---|
146 | \n147 | \npatti.dixon | \n555555 | \n6 | \nTrue | \nTrue | \nFalse | \npatti | \ndixon | \nFalse | \nTrue | \n
572 | \n573 | \ncornelia.bradley | \n555555 | \n6 | \nTrue | \nTrue | \nFalse | \ncornelia | \nbradley | \nFalse | \nTrue | \n
644 | \n645 | \nessie.lopez | \n11111 | \n5 | \nTrue | \nTrue | \nFalse | \nessie | \nlopez | \nFalse | \nTrue | \n
798 | \n799 | \ncharley.key | \n888888 | \n6 | \nTrue | \nTrue | \nFalse | \ncharley | \nkey | \nFalse | \nTrue | \n
941 | \n942 | \nmitch.ferguson | \naaaaaa | \n6 | \nTrue | \nTrue | \nFalse | \nmitch | \nferguson | \nFalse | \nTrue | \n
Now we have implemented all the basic tests for bad passwords suggested by NIST Special Publication 800-63B! What's left is just to flag all bad passwords and maybe send these users an e-mail that strongly suggests they change their password.
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"52"}}},{"cell_type":"code","source":"# Flagging all passwords that are bad\nusers['bad_password'] = users['too_short'] |\\\n users['common_password'] |\\\n users['common_word'] |\\\n users['uses_name'] |\\\n users['too_many_repeats']\n\n# Counting and printing the number of bad passwords\nprint(users['bad_password'].sum())\n\n# Looking at the first 25 bad passwords\nusers.loc[users['bad_password'] == True]['password'].head(n=25)","metadata":{"dc":{"key":"52"},"trusted":true,"tags":["sample_code"]},"execution_count":112,"outputs":[{"text":"423\n","name":"stdout","output_type":"stream"},{"data":{"text/plain":"5 5278049\n6 master\n7 murphy\n8 lwsves2\n11 hubbard\n13 310356\n15 oZ4k0QE\n16 chelsea\n17 zvc1939\n18 nickgd\n21 cocacola\n22 woodard\n25 AJ9Da\n26 ewokzs\n28 YyGjz8E\n30 reid\n34 jOYZBs8\n38 wwewwf1\n43 225377\n45 NdZ7E6\n47 CQB3Z\n48 diffo\n51 123456789\n52 y8uM7D6\n56 mikeloo\nName: password, dtype: object"},"output_type":"execute_result","metadata":{},"execution_count":112}]},{"cell_type":"markdown","source":"## 9. Otherwise, the password should be up to the user\nIn this notebook, we've implemented the password checks recommended by the NIST Special Publication 800-63B. It's certainly possible to better implement these checks, for example, by using a longer list of common passwords. Also note that the NIST checks in no way guarantee that a chosen password is good, just that it's not obviously bad.
\nApart from the checks we've implemented above, the NIST is also clear about which password rules should not be imposed:
\n\n\nVerifiers SHOULD NOT impose other composition rules (e.g., requiring mixtures of different character types or prohibiting consecutively repeated characters) for memorized secrets. Verifiers SHOULD NOT require memorized secrets to be changed arbitrarily (e.g., periodically).
\n
So the next time a website or app tells you to \"include both a number, symbol and an upper and lower case character in your password\", you should send them a copy of NIST Special Publication 800-63B.
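As a wrap-up, here is a hypothetical helper that bundles the notebook's checks into one function; the name and signature are our own, and it assumes usernames follow the first.last pattern used in this dataset:

```python
import re

def passes_nist_checks(password, username, common_passwords, common_words):
    """Return True if a password survives the basic 800-63B checks above."""
    first_name, last_name = username.split('.')
    if len(password) < 8:                    # too short
        return False
    if password in common_passwords:         # known breached password
        return False
    if password.lower() in common_words:     # dictionary word
        return False
    if password in (first_name, last_name):  # context-specific word
        return False
    if re.search(r'(.)\1\1\1', password):    # 4+ repeated characters
        return False
    return True
```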
","metadata":{"editable":false,"tags":["context"],"run_control":{"frozen":true},"deletable":false,"dc":{"key":"59"}}},{"cell_type":"code","source":"# Enter a password that passes the NIST requirements\n# PLEASE DO NOT USE AN EXISTING PASSWORD HERE\nnew_password = \"i_like_pie\"","metadata":{"dc":{"key":"59"},"collapsed":true,"trusted":true,"tags":["sample_code"]},"execution_count":114,"outputs":[]}],"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3","language":"python"},"language_info":{"name":"python","mimetype":"text/x-python","pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.5.2","file_extension":".py","codemirror_mode":{"name":"ipython","version":3}}},"nbformat":4,"nbformat_minor":2} -------------------------------------------------------------------------------- /exploring_67_years_of_lego.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat_minor":2,"metadata":{"kernelspec":{"display_name":"Python 3","name":"python3","language":"python"},"language_info":{"nbconvert_exporter":"python","mimetype":"text/x-python","version":"3.5.2","pygments_lexer":"ipython3","name":"python","codemirror_mode":{"version":3,"name":"ipython"},"file_extension":".py"}},"cells":[{"source":"## Introduction\nEveryone loves Lego (unless you ever stepped on one). Did you know by the way that \"Lego\" was derived from the Danish phrase leg godt, which means \"play well\"? Unless you speak Danish, probably not.
\nIn this project, we will analyze a fascinating dataset on every single lego block that has ever been built!
\nThis comprehensive database of lego blocks is provided by Rebrickable. The data is available as csv files and the schema is shown below.
\nLet us start by reading in the colors data to get a sense of the diversity of lego sets!
","metadata":{"tags":["context"],"run_control":{"frozen":true},"deletable":false,"editable":false,"dc":{"key":"044b2cef41"}},"cell_type":"markdown"},{"source":"# Import modules\nimport pandas as pd\n\n# Read colors data\ncolors = pd.read_csv('datasets/colors.csv')\n\n# Print the first few rows\ncolors.head()","metadata":{"trusted":true,"tags":["sample_code"],"dc":{"key":"044b2cef41"}},"execution_count":65,"cell_type":"code","outputs":[{"data":{"text/html":"\n | id | \nname | \nrgb | \nis_trans | \n
---|---|---|---|---|
0 | \n-1 | \nUnknown | \n0033B2 | \nf | \n
1 | \n0 | \nBlack | \n05131D | \nf | \n
2 | \n1 | \nBlue | \n0055BF | \nf | \n
3 | \n2 | \nGreen | \n237841 | \nf | \n
4 | \n3 | \nDark Turquoise | \n008F9B | \nf | \n
Now that we have read the colors
data, we can start exploring it! Let us start by understanding the number of colors available.
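A minimal sketch of that count (the variable name num_colors is an assumption):

```python
# Count the number of distinct colors available
num_colors = len(colors)
print(num_colors)
```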
The colors
data has a column named is_trans
that indicates whether a color is transparent or not. It would be interesting to explore the distribution of transparent vs. non-transparent colors.
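A minimal sketch of that summary, grouping on the is_trans flag:

```python
# Summarize colors grouped by their transparency flag
colors_summary = colors.groupby('is_trans').count()
print(colors_summary)
```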
Another interesting dataset available in this database is the sets
data. It contains a comprehensive list of sets over the years and the number of parts that each of these sets contained.
Let us use this data to explore how the average number of parts in lego sets has varied over the years.
","metadata":{"tags":["context"],"run_control":{"frozen":true},"deletable":false,"editable":false,"dc":{"key":"c9d0e58653"}},"cell_type":"markdown"},{"source":"%matplotlib inline\n# Read sets data as `sets`\nsets = pd.read_csv('datasets/sets.csv')\n# Create a summary of average number of parts by year: `parts_by_year`\nparts_by_year = sets.groupby('year')['num_parts'].mean().reset_index()\nprint(parts_by_year.head())\n# Plot trends in average number of parts by year\nimport matplotlib.pyplot as plt\n\nplt.scatter(x = parts_by_year['year'], y = parts_by_year['num_parts'])","metadata":{"trusted":true,"tags":["sample_code"],"dc":{"key":"c9d0e58653"}},"execution_count":71,"cell_type":"code","outputs":[{"output_type":"stream","text":" year num_parts\n0 1950 10.142857\n1 1953 16.500000\n2 1954 12.357143\n3 1955 36.857143\n4 1956 18.500000\n","name":"stdout"},{"data":{"text/plain":"Lego blocks ship under multiple themes. Let us try to get a sense of how the number of themes shipped has varied over the years.
","metadata":{"tags":["context"],"run_control":{"frozen":true},"deletable":false,"editable":false,"dc":{"key":"266a3f390c"}},"cell_type":"markdown"},{"source":"# themes_by_year: Number of themes shipped by year\nthemes_by_year = sets.groupby('year')['theme_id'].nunique().reset_index()\nprint(themes_by_year.head())","metadata":{"trusted":true,"tags":["sample_code"],"dc":{"key":"266a3f390c"}},"execution_count":73,"cell_type":"code","outputs":[{"output_type":"stream","text":" year theme_id\n0 1950 2\n1 1953 1\n2 1954 2\n3 1955 4\n4 1956 3\n","name":"stdout"}]},{"source":"## Wrapping It All Up!\nLego blocks offer an unlimited amoung of fun across ages. We explored some interesting trends around colors, parts and themes.
","metadata":{"tags":["context"],"run_control":{"frozen":true},"deletable":false,"editable":false,"dc":{"key":"a293e5076e"}},"cell_type":"markdown"},{"source":"# Nothing to do here","metadata":{"collapsed":true,"trusted":true,"tags":["sample_code"],"dc":{"key":"a293e5076e"}},"execution_count":75,"cell_type":"code","outputs":[]}],"nbformat":4} -------------------------------------------------------------------------------- /exploring_the_evolution_of_linux.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","name":"python3","display_name":"Python 3"},"language_info":{"nbconvert_exporter":"python","name":"python","file_extension":".py","mimetype":"text/x-python","version":"3.5.2","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3"}},"cells":[{"metadata":{"editable":false,"tags":["context"],"deletable":false,"run_control":{"frozen":true},"dc":{"key":"4"}},"cell_type":"markdown","source":"## 1. Introduction\nVersion control repositories like CVS, Subversion or Git can be a real gold mine for software developers. They contain every change to the source code including the date (the \"when\"), the responsible developer (the \"who\"), as well as little message that describes the intention (the \"what\") of a change.
\n\nIn this notebook, we will analyze the evolution of a very famous open-source project – the Linux kernel. The Linux kernel is the heart of some Linux distributions like Debian, Ubuntu or CentOS.
\nWe get some first insights into the work of the development efforts by
\nLinus Torvalds, the (spoiler alert!) main contributor to the Linux kernel (and also the creator of Git), created a mirror of the Linux repository on GitHub. It contains the complete history of kernel development for the last 13 years.
\nFor our analysis, we will use a Git log file with the following content:
"},{"execution_count":80,"outputs":[{"output_type":"stream","name":"stdout","text":"['1502382966#Linus Torvalds\\n', '1501368308#Max Gurtovoy\\n', '1501625560#James Smart\\n', '1501625559#James Smart\\n', '1500568442#Martin Wilck\\n']\n"}],"metadata":{"tags":["sample_code"],"trusted":true,"dc":{"key":"4"}},"cell_type":"code","source":"# Printing the content of git_log_excerpt.csv\nwith open(\"datasets/git_log_excerpt.csv\") as myfile:\n firstNlines=myfile.readlines()[0:5]\n \nprint(firstNlines)"},{"metadata":{"editable":false,"tags":["context"],"deletable":false,"run_control":{"frozen":true},"dc":{"key":"11"}},"cell_type":"markdown","source":"## 2. Reading in the dataset\nThe dataset was created by using the command git log --encoding=latin-1 --pretty=\"%at#%aN\"
. The latin-1
encoded text output was saved in a header-less csv file. In this file, each row is a commit entry with the following information:
timestamp
: the time of the commit as a UNIX timestamp in seconds since 1970-01-01 00:00:00 (Git log placeholder \"%at
\")author
: the name of the author that performed the commit (Git log placeholder \"%aN
\")The columns are separated by the number sign #
. The complete dataset is in the datasets/
directory. It is a gz
-compressed csv file named git_log.gz
.
\n | timestamp | \nauthor | \n
---|---|---|
0 | \n1502826583 | \nLinus Torvalds | \n
1 | \n1501749089 | \nAdrian Hunter | \n
2 | \n1501749088 | \nAdrian Hunter | \n
3 | \n1501882480 | \nKees Cook | \n
4 | \n1497271395 | \nRob Clark | \n
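Given the description above (a '#'-separated, latin-1 encoded, gz-compressed, header-less csv), a minimal sketch of the loading step that would produce the preview just shown:

```python
import pandas as pd

# Read the compressed log; pandas infers gzip compression from the extension
git_log = pd.read_csv(
    'datasets/git_log.gz',
    sep='#',
    encoding='latin-1',
    header=None,
    names=['timestamp', 'author'])

# Print out the first 5 rows
git_log.head(5)
```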
The dataset contains the information about every single code contribution (a \"commit\") to the Linux kernel over the last 13 years. We'll first take a look at the number of authors and their commits to the repository.
"},{"execution_count":84,"outputs":[{"output_type":"stream","name":"stdout","text":"17385 authors committed 699071 code changes.\n"}],"metadata":{"tags":["sample_code"],"trusted":true,"dc":{"key":"18"}},"cell_type":"code","source":"# calculating number of commits\nnumber_of_commits = len(git_log.index)\n\n# calculating number of authors\nnumber_of_authors = git_log['author'].nunique()\n\n# printing out the results\nprint(\"%s authors committed %s code changes.\" % (number_of_authors, number_of_commits))"},{"metadata":{"editable":false,"tags":["context"],"deletable":false,"run_control":{"frozen":true},"dc":{"key":"25"}},"cell_type":"markdown","source":"## 4. Finding the TOP 10 contributors\nThere are some very important people that changed the Linux kernel very often. To see if there are any bottlenecks, we take a look at the TOP 10 authors with the most commits.
"},{"execution_count":86,"outputs":[{"execution_count":86,"metadata":{},"output_type":"execute_result","data":{"text/html":"\n | count | \n
---|---|
author | \n\n |
Linus Torvalds | \n23361 | \n
David S. Miller | \n9106 | \n
Mark Brown | \n6802 | \n
Takashi Iwai | \n6209 | \n
Al Viro | \n6006 | \n
H Hartley Sweeten | \n5938 | \n
Ingo Molnar | \n5344 | \n
Mauro Carvalho Chehab | \n5204 | \n
Arnd Bergmann | \n4890 | \n
Greg Kroah-Hartman | \n4580 | \n
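A minimal sketch of how this table can be computed (the variable name is an assumption):

```python
# Identify the TOP 10 authors by number of commits
top_10_authors = git_log['author'].value_counts().head(10)
top_10_authors
```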
For our analysis, we want to visualize the contributions over time. For this, we use the information in the timestamp
column to create a time series-based column.
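A minimal sketch of that conversion, assuming the git_log frame from the loading sketch above:

```python
# Convert the UNIX epoch seconds into proper datetime values
git_log['timestamp'] = pd.to_datetime(git_log['timestamp'], unit='s')

# Summarize the converted column
git_log['timestamp'].describe()
```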
As we can see from the results above, some contributors had their operating system's time incorrectly set when they committed to the repository. We'll clean up the timestamp
column by dropping the rows with the incorrect timestamps.
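A minimal sketch of the clean-up; the exact bounds are assumptions (the kernel's Git history starts in April 2005, and the text above mentions a roughly 13-year window):

```python
# Keep only commits with plausible timestamps
first_commit = pd.to_datetime('2005-04-16')  # assumed start of the Git history
last_commit = pd.to_datetime('2018-01-01')   # assumed upper bound for the snapshot
corrected_log = git_log[
    (git_log['timestamp'] >= first_commit) &
    (git_log['timestamp'] <= last_commit)]
```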
To find out how the development activity has increased over time, we'll group the commits by year and count them up.
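A minimal sketch of the grouping that would yield the table below (the variable name commits_per_year is an assumption):

```python
# Count commits per year on the cleaned log
commits_per_year = corrected_log.groupby(
    pd.Grouper(key='timestamp', freq='AS')).count()
commits_per_year.head()
```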
"},{"execution_count":92,"outputs":[{"execution_count":92,"metadata":{},"output_type":"execute_result","data":{"text/html":"\n | author | \n
---|---|
timestamp | \n\n |
2005-01-01 | \n16229 | \n
2006-01-01 | \n29255 | \n
2007-01-01 | \n33759 | \n
2008-01-01 | \n48847 | \n
2009-01-01 | \n52572 | \n
Finally, we'll make a plot out of these counts to better see how the development effort on Linux has increased over the the last few years.
"},{"execution_count":94,"outputs":[{"metadata":{},"output_type":"display_data","data":{"text/plain":"Thanks to the solid foundation and caretaking of Linux Torvalds, many other developers are now able to contribute to the Linux kernel as well. There is no decrease of development activity at sight!
"},{"execution_count":96,"outputs":[],"metadata":{"collapsed":true,"tags":["sample_code"],"trusted":true,"dc":{"key":"60"}},"cell_type":"code","source":"# calculating or setting the year with the most commits to Linux\nyear_with_most_commits = 2016 "}],"nbformat":4,"nbformat_minor":2} -------------------------------------------------------------------------------- /generating_keywords_for_google_adwords.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"metadata":{"language_info":{"name":"python","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2","mimetype":"text/x-python"},"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"}},"nbformat_minor":2,"cells":[{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"4"},"deletable":false,"tags":["context"],"editable":false},"source":"## 1. The brief\nImagine working for a digital marketing agency, and the agency is approached by a massive online retailer of furniture. They want to test our skills at creating large campaigns for all of their website. We are tasked with creating a prototype set of keywords for search campaigns for their sofas section. The client says that they want us to generate keywords for the following products:
\nThe brief: The client is generally a low-cost retailer, offering many promotions and discounts. We will need to focus on such keywords. We will also need to move away from luxury keywords and topics, as we are targeting price-sensitive customers. Because we are going to be tight on budget, it would be good to focus on a tightly targeted set of keywords and make sure they are all set to exact and phrase match.
\nBased on the brief above we will first need to generate a list of words, that together with the products given above would make for good keywords. Here are some examples:
\nThe resulting keywords: 'buy sofas', 'sofas buy', 'buy recliners', 'recliners buy',\n 'prices sofas', 'sofas prices', 'prices recliners', 'recliners prices'.
\nAs a final result, we want to have a DataFrame that looks like this:
\nCampaign | \nAd Group | \nKeyword | \nCriterion Type | \n
---|---|---|---|
Campaign1 | \nAdGroup_1 | \nkeyword 1a | \nExact | \n
Campaign1 | \nAdGroup_1 | \nkeyword 1a | \nPhrase | \n
Campaign1 | \nAdGroup_1 | \nkeyword 1b | \nExact | \n
Campaign1 | \nAdGroup_1 | \nkeyword 1b | \nPhrase | \n
Campaign1 | \nAdGroup_2 | \nkeyword 2a | \nExact | \n
Campaign1 | \nAdGroup_2 | \nkeyword 2a | \nPhrase | \n
The first step is to come up with a list of words that users might use to express their desire in buying low-cost sofas.
"},{"cell_type":"code","execution_count":176,"metadata":{"trusted":true,"dc":{"key":"4"},"tags":["sample_code"]},"outputs":[{"name":"stdout","text":"['buy', 'discount', 'promotion', 'price', 'promo', 'shop', 'cheap']\n","output_type":"stream"}],"source":"# List of words to pair with products\nwords = ['buy', 'discount', 'promotion', 'price', 'promo', 'shop', 'cheap']\n\n# Print list of words\nprint(words)"},{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"11"},"deletable":false,"tags":["context"],"editable":false},"source":"## 2. Combine the words with the product names\nImagining all the possible combinations of keywords can be stressful! But not for us, because we are keyword ninjas! We know how to translate campaign briefs into Python data structures and can imagine the resulting DataFrames that we need to create.
\nNow that we have brainstormed the words that work well with the brief that we received, it is now time to combine them with the product names to generate meaningful search keywords. We want to combine every word with every product once before, and once after, as seen in the example above.
\nAs a quick reminder, for the product 'recliners' and the words 'buy' and 'price' for example, we would want to generate the following combinations:
\nbuy recliners
\nrecliners buy
\nprice recliners
\nrecliners price
\n...
and so on for all the words and products that we have.
"},{"cell_type":"code","execution_count":178,"metadata":{"trusted":true,"dc":{"key":"11"},"tags":["sample_code"]},"outputs":[{"name":"stdout","text":"[['sofas', 'sofas buy'], ['sofas', 'buy sofas'], ['sofas', 'sofas discount'], ['sofas', 'discount sofas'], ['sofas', 'sofas promotion'], ['sofas', 'promotion sofas'], ['sofas', 'sofas price'], ['sofas', 'price sofas'], ['sofas', 'sofas promo'], ['sofas', 'promo sofas'], ['sofas', 'sofas shop'], ['sofas', 'shop sofas'], ['sofas', 'sofas cheap'], ['sofas', 'cheap sofas'], ['convertible sofas', 'convertible sofas buy'], ['convertible sofas', 'buy convertible sofas'], ['convertible sofas', 'convertible sofas discount'], ['convertible sofas', 'discount convertible sofas'], ['convertible sofas', 'convertible sofas promotion'], ['convertible sofas', 'promotion convertible sofas'], ['convertible sofas', 'convertible sofas price'], ['convertible sofas', 'price convertible sofas'], ['convertible sofas', 'convertible sofas promo'], ['convertible sofas', 'promo convertible sofas'], ['convertible sofas', 'convertible sofas shop'], ['convertible sofas', 'shop convertible sofas'], ['convertible sofas', 'convertible sofas cheap'], ['convertible sofas', 'cheap convertible sofas'], ['love seats', 'love seats buy'], ['love seats', 'buy love seats'], ['love seats', 'love seats discount'], ['love seats', 'discount love seats'], ['love seats', 'love seats promotion'], ['love seats', 'promotion love seats'], ['love seats', 'love seats price'], ['love seats', 'price love seats'], ['love seats', 'love seats promo'], ['love seats', 'promo love seats'], ['love seats', 'love seats shop'], ['love seats', 'shop love seats'], ['love seats', 'love seats cheap'], ['love seats', 'cheap love seats'], ['recliners', 'recliners buy'], ['recliners', 'buy recliners'], ['recliners', 'recliners discount'], ['recliners', 'discount recliners'], ['recliners', 'recliners promotion'], ['recliners', 'promotion recliners'], ['recliners', 'recliners price'], ['recliners', 'price recliners'], ['recliners', 'recliners promo'], ['recliners', 'promo recliners'], ['recliners', 'recliners shop'], ['recliners', 'shop recliners'], ['recliners', 'recliners cheap'], ['recliners', 'cheap recliners'], ['sofa beds', 'sofa beds buy'], ['sofa beds', 'buy sofa beds'], ['sofa beds', 'sofa beds discount'], ['sofa beds', 'discount sofa beds'], ['sofa beds', 'sofa beds promotion'], ['sofa beds', 'promotion sofa beds'], ['sofa beds', 'sofa beds price'], ['sofa beds', 'price sofa beds'], ['sofa beds', 'sofa beds promo'], ['sofa beds', 'promo sofa beds'], ['sofa beds', 'sofa beds shop'], ['sofa beds', 'shop sofa beds'], ['sofa beds', 'sofa beds cheap'], ['sofa beds', 'cheap sofa beds']]\n","output_type":"stream"}],"source":"products = ['sofas', 'convertible sofas', 'love seats', 'recliners', 'sofa beds']\n\n# Create an empty list\nkeywords_list = []\n\n# Loop through products\nfor product in products:\n # Loop through words\n for word in words:\n # Append combinations\n keywords_list.append([product, product + ' ' + word])\n keywords_list.append([product, word + ' ' + product])\n \n# Inspect keyword list\nprint(keywords_list)"},{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"18"},"deletable":false,"tags":["context"],"editable":false},"source":"## 3. Convert the list of lists into a DataFrame\nNow we want to convert this list of lists into a DataFrame so we can easily manipulate it and manage the final output.
"},{"cell_type":"code","execution_count":180,"metadata":{"trusted":true,"dc":{"key":"18"},"tags":["sample_code"]},"outputs":[{"name":"stdout","text":" 0 1\n0 sofas sofas buy\n1 sofas buy sofas\n2 sofas sofas discount\n3 sofas discount sofas\n4 sofas sofas promotion\n","output_type":"stream"}],"source":"# Load library\nimport pandas as pd\n\n# Create a DataFrame from list\nkeywords_df = pd.DataFrame.from_records(keywords_list)\n\n# Print the keywords DataFrame to explore it\nprint(keywords_df.head())"},{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"25"},"deletable":false,"tags":["context"],"editable":false},"source":"## 4. Rename the columns of the DataFrame\nBefore we can upload this table of keywords, we will need to give the columns meaningful names. If we inspect the DataFrame we just created above, we can see that the columns are currently named 0
and 1
. Ad Group
(example: \"sofas\") and Keyword
(example: \"sofas buy\") are much more appropriate names.
Now we need to add some additional information to our DataFrame. \nWe need a new column called Campaign
for the campaign name. We want campaign names to be descriptive of our group of keywords and products, so let's call this campaign 'SEM_Sofas'.
There are different keyword match types. One is exact match, which is for matching the exact term or are close variations of that exact term. Another match type is broad match, which means ads may show on searches that include misspellings, synonyms, related searches, and other relevant variations.
\nStraight from Google's AdWords documentation:
\n\n\nIn general, the broader the match type, the more traffic potential that keyword will have, since your ads may be triggered more often. Conversely, a narrower match type means that your ads may show less often—but when they do, they’re likely to be more related to someone’s search.
\n
Since the client is tight on budget, we want to make sure all the keywords are in exact match at the beginning.
"},{"cell_type":"code","execution_count":186,"metadata":{"trusted":true,"dc":{"key":"39"},"tags":["sample_code"],"collapsed":true},"outputs":[],"source":"# Add a criterion type column\nkeywords_df['Criterion Type'] = 'Exact'"},{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"46"},"deletable":false,"tags":["context"],"editable":false},"source":"## 7. Duplicate all the keywords into 'phrase' match\nThe great thing about exact match is that it is very specific, and we can control the process very well. The tradeoff, however, is that:
\nSo it's good to use another match called phrase match as a discovery mechanism to allow our ads to be triggered by keywords that include our exact match keywords, together with anything before (or after) them.
\nLater on, when we launch the campaign, we can explore with modified broad match, broad match, and negative match types, for better visibility and control of our campaigns.
"},{"cell_type":"code","execution_count":188,"metadata":{"trusted":true,"dc":{"key":"46"},"tags":["sample_code"],"collapsed":true},"outputs":[],"source":"# Make a copy of the keywords DataFrame\nkeywords_phrase = keywords_df.copy()\n\n# Change criterion type match to phrase\nkeywords_phrase['Criterion Type'] = 'Phrase'\n\n# Append the DataFrames\nkeywords_df_final = keywords_df.append(keywords_phrase)"},{"cell_type":"markdown","metadata":{"run_control":{"frozen":true},"dc":{"key":"53"},"deletable":false,"tags":["context"],"editable":false},"source":"## 8. Save and summarize!\nTo upload our campaign, we need to save it as a CSV file. Then we will be able to import it to AdWords editor or BingAds editor. There is also the option of pasting the data into the editor if we want, but having easy access to the saved data is great so let's save to a CSV file!
\nLooking at a summary of our campaign structure is good now that we've wrapped up our keyword work. We can do that by grouping by ad group and criterion type and counting by keyword. This summary shows us that we assigned specific keywords to specific ad groups, which are each part of a campaign. In essence, we are telling Google (or Bing, etc.) that we want any of the words in each ad group to trigger one of the ads in the same ad group. Separately, we will have to create another table for ads, which is a task for another day and would look something like this:
\nCampaign | \nAd Group | \nHeadline 1 | \nHeadline 2 | \nDescription | \nFinal URL | \n
---|---|---|---|---|---|
SEM_Sofas | \nSofas | \nLooking for Quality Sofas? | \nExplore Our Massive Collection | \n30-day Returns With Free Delivery Within the US. Start Shopping Now | \nDataCampSofas.com/sofas | \n
SEM_Sofas | \nSofas | \nLooking for Affordable Sofas? | \nCheck Out Our Weekly Offers | \n30-day Returns With Free Delivery Within the US. Start Shopping Now | \nDataCampSofas.com/sofas | \n
SEM_Sofas | \nRecliners | \nLooking for Quality Recliners? | \nExplore Our Massive Collection | \n30-day Returns With Free Delivery Within the US. Start Shopping Now | \nDataCampSofas.com/recliners | \n
SEM_Sofas | \nRecliners | \nNeed Affordable Recliners? | \nCheck Out Our Weekly Offers | \n30-day Returns With Free Delivery Within the US. Start Shopping Now | \nDataCampSofas.com/recliners | \n
Together, these tables get us the sample keywords -> ads -> landing pages mapping shown in the diagram below.
\nBlood transfusion saves lives - from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that there's enough blood in supply whenever needed is a serious challenge for the health professionals. According to WebMD, \"about 5 million Americans need a blood transfusion every year\".
\n", 22 | "Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.
\n", 23 | "The data is stored in datasets/transfusion.data
and it is structured according to RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.
We now know that we are working with a typical CSV file (i.e., the delimiter is ,
, etc.). We proceed to loading the data into memory.
\n", 113 | " | Recency (months) | \n", 114 | "Frequency (times) | \n", 115 | "Monetary (c.c. blood) | \n", 116 | "Time (months) | \n", 117 | "whether he/she donated blood in March 2007 | \n", 118 | "
---|---|---|---|---|---|
0 | \n", 123 | "2 | \n", 124 | "50 | \n", 125 | "12500 | \n", 126 | "98 | \n", 127 | "1 | \n", 128 | "
1 | \n", 131 | "0 | \n", 132 | "13 | \n", 133 | "3250 | \n", 134 | "28 | \n", 135 | "1 | \n", 136 | "
2 | \n", 139 | "1 | \n", 140 | "16 | \n", 141 | "4000 | \n", 142 | "35 | \n", 143 | "1 | \n", 144 | "
3 | \n", 147 | "2 | \n", 148 | "20 | \n", 149 | "5000 | \n", 150 | "45 | \n", 151 | "1 | \n", 152 | "
4 | \n", 155 | "1 | \n", 156 | "24 | \n", 157 | "6000 | \n", 158 | "77 | \n", 159 | "0 | \n", 160 | "
Let's briefly return to our discussion of RFM model. RFM stands for Recency, Frequency and Monetary Value and it is commonly used in marketing for identifying your best customers. In our case, our customers are blood donors.
\n", 215 | "RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:
\n", 216 | "It looks like every column in our DataFrame has the numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.
" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 157, 229 | "metadata": { 230 | "dc": { 231 | "key": "17" 232 | }, 233 | "tags": [ 234 | "sample_code" 235 | ] 236 | }, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "We are aiming to predict the value in whether he/she donated blood in March 2007
column. Let's rename this it to target
so that it's more convenient to work with.
\n", 313 | " | Recency (months) | \n", 314 | "Frequency (times) | \n", 315 | "Monetary (c.c. blood) | \n", 316 | "Time (months) | \n", 317 | "target | \n", 318 | "
---|---|---|---|---|---|
0 | \n", 323 | "2 | \n", 324 | "50 | \n", 325 | "12500 | \n", 326 | "98 | \n", 327 | "1 | \n", 328 | "
1 | \n", 331 | "0 | \n", 332 | "13 | \n", 333 | "3250 | \n", 334 | "28 | \n", 335 | "1 | \n", 336 | "
We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:
\n", 385 | "0
- the donor will not give blood1
- the donor will give bloodTarget incidence is defined as the number of cases of each individual target value in a dataset. That is, how many 0s in the target column compared to how many 1s? Target incidence gives us an idea of how balanced (or imbalanced) is our dataset.
" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 161, 395 | "metadata": { 396 | "dc": { 397 | "key": "31" 398 | }, 399 | "tags": [ 400 | "sample_code" 401 | ] 402 | }, 403 | "outputs": [ 404 | { 405 | "data": { 406 | "text/plain": [ 407 | "0 0.762\n", 408 | "1 0.238\n", 409 | "Name: target, dtype: float64" 410 | ] 411 | }, 412 | "execution_count": 161, 413 | "metadata": {}, 414 | "output_type": "execute_result" 415 | } 416 | ], 417 | "source": [ 418 | "# Print target incidence proportions, rounding output to 3 decimal places\n", 419 | "transfusion.target.value_counts(normalize=True).round(3)" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": { 425 | "dc": { 426 | "key": "38" 427 | }, 428 | "deletable": false, 429 | "editable": false, 430 | "run_control": { 431 | "frozen": true 432 | }, 433 | "tags": [ 434 | "context" 435 | ] 436 | }, 437 | "source": [ 438 | "## 6. Splitting transfusion into train and test datasets\n", 439 | "We'll now use train_test_split()
method to split transfusion
DataFrame.
Target incidence informed us that in our dataset 0
s appear 76% of the time. We want to keep the same structure in train and test datasets, i.e., both datasets must have 0 target incidence of 76%. This is very easy to do using the train_test_split()
method from the scikit learn
library - all we need to do is specify the stratify
parameter. In our case, we'll stratify on the target
column.
\n", 476 | " | Recency (months) | \n", 477 | "Frequency (times) | \n", 478 | "Monetary (c.c. blood) | \n", 479 | "Time (months) | \n", 480 | "
---|---|---|---|---|
334 | \n", 485 | "16 | \n", 486 | "2 | \n", 487 | "500 | \n", 488 | "16 | \n", 489 | "
99 | \n", 492 | "5 | \n", 493 | "7 | \n", 494 | "1750 | \n", 495 | "26 | \n", 496 | "
TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
\n", 549 | "TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a scikit-learn pipeline, meaning it will include any pre-processing steps as well as the model.
\n", 551 | "We are using TPOT to help us zero in on one model that we can then explore and optimize further.
" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": 165, 557 | "metadata": { 558 | "dc": { 559 | "key": "45" 560 | }, 561 | "tags": [ 562 | "sample_code" 563 | ] 564 | }, 565 | "outputs": [ 566 | { 567 | "data": { 568 | "application/vnd.jupyter.widget-view+json": { 569 | "model_id": "85bd7069fff74a75a8c36926bb8b9586", 570 | "version_major": 2, 571 | "version_minor": 0 572 | }, 573 | "text/plain": [ 574 | "HBox(children=(IntProgress(value=0, description='Optimization Progress', max=120, style=ProgressStyle(descript…" 575 | ] 576 | }, 577 | "metadata": {}, 578 | "output_type": "display_data" 579 | }, 580 | { 581 | "name": "stdout", 582 | "output_type": "stream", 583 | "text": [ 584 | "Generation 1 - Current best internal CV score: 0.7433977184592779\n", 585 | "Generation 2 - Current best internal CV score: 0.7433977184592779\n", 586 | "Generation 3 - Current best internal CV score: 0.7433977184592779\n", 587 | "Generation 4 - Current best internal CV score: 0.7433977184592779\n", 588 | "Generation 5 - Current best internal CV score: 0.7433977184592779\n", 589 | "\n", 590 | "Best pipeline: LogisticRegression(input_matrix, C=0.5, dual=False, penalty=l2)\n", 591 | "\n", 592 | "AUC score: 0.7850\n", 593 | "\n", 594 | "Best pipeline steps:\n", 595 | "1. LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,\n", 596 | " intercept_scaling=1, max_iter=100, multi_class='warn',\n", 597 | " n_jobs=None, penalty='l2', random_state=None, solver='warn',\n", 598 | " tol=0.0001, verbose=0, warm_start=False)\n" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | "# Import TPOTClassifier and roc_auc_score\n", 604 | "from tpot import TPOTClassifier\n", 605 | "from sklearn.metrics import roc_auc_score\n", 606 | "\n", 607 | "# Instantiate TPOTClassifier\n", 608 | "tpot = TPOTClassifier(\n", 609 | " generations=5,\n", 610 | " population_size=20,\n", 611 | " verbosity=2,\n", 612 | " scoring='roc_auc',\n", 613 | " random_state=42,\n", 614 | " disable_update_check=True,\n", 615 | " config_dict='TPOT light'\n", 616 | ")\n", 617 | "tpot.fit(X_train, y_train)\n", 618 | "\n", 619 | "# AUC score for tpot model\n", 620 | "tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])\n", 621 | "print(f'\\nAUC score: {tpot_auc_score:.4f}')\n", 622 | "\n", 623 | "# Print best pipeline steps\n", 624 | "print('\\nBest pipeline steps:', end='\\n')\n", 625 | "for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):\n", 626 | " # Print idx and transform\n", 627 | " print(f'{idx}. {transform}')" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": { 633 | "dc": { 634 | "key": "52" 635 | }, 636 | "deletable": false, 637 | "editable": false, 638 | "run_control": { 639 | "frozen": true 640 | }, 641 | "tags": [ 642 | "context" 643 | ] 644 | }, 645 | "source": [ 646 | "## 8. Checking the variance\n", 647 | "TPOT picked LogisticRegression
as the best model for our dataset with no pre-processing steps, giving us the AUC score of 0.7850. This is a great starting point. Let's see if we can make it better.
One of the assumptions for linear regression models is that the data and the features we are giving it are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a high variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.
\n", 649 | "Correcting for high variance is called normalization. It is one of the possible transformations you do before training a model. Let's check the variance to see if such transformation is needed.
" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": 167, 655 | "metadata": { 656 | "dc": { 657 | "key": "52" 658 | }, 659 | "tags": [ 660 | "sample_code" 661 | ] 662 | }, 663 | "outputs": [ 664 | { 665 | "data": { 666 | "text/plain": [ 667 | "Recency (months) 66.929\n", 668 | "Frequency (times) 33.830\n", 669 | "Monetary (c.c. blood) 2114363.700\n", 670 | "Time (months) 611.147\n", 671 | "dtype: float64" 672 | ] 673 | }, 674 | "execution_count": 167, 675 | "metadata": {}, 676 | "output_type": "execute_result" 677 | } 678 | ], 679 | "source": [ 680 | "# X_train's variance, rounding the output to 3 decimal places\n", 681 | "X_train.var().round(3)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": { 687 | "dc": { 688 | "key": "59" 689 | }, 690 | "deletable": false, 691 | "editable": false, 692 | "run_control": { 693 | "frozen": true 694 | }, 695 | "tags": [ 696 | "context" 697 | ] 698 | }, 699 | "source": [ 700 | "## 9. Log normalization\n", 701 | "Monetary (c.c. blood)
's variance is very high in comparison to any other column in the dataset. This means that, unless accounted for, this feature may get more weight by the model (i.e., be seen as more important) than any other feature.
One way to correct for high variance is to use log normalization.
" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": 169, 708 | "metadata": { 709 | "dc": { 710 | "key": "59" 711 | }, 712 | "tags": [ 713 | "sample_code" 714 | ] 715 | }, 716 | "outputs": [ 717 | { 718 | "data": { 719 | "text/plain": [ 720 | "Recency (months) 66.929\n", 721 | "Frequency (times) 33.830\n", 722 | "Time (months) 611.147\n", 723 | "monetary_log 0.837\n", 724 | "dtype: float64" 725 | ] 726 | }, 727 | "execution_count": 169, 728 | "metadata": {}, 729 | "output_type": "execute_result" 730 | } 731 | ], 732 | "source": [ 733 | "# Import numpy\n", 734 | "import numpy as np\n", 735 | "\n", 736 | "# Copy X_train and X_test into X_train_normed and X_test_normed\n", 737 | "X_train_normed, X_test_normed = X_train.copy(), X_test.copy()\n", 738 | "\n", 739 | "# Specify which column to normalize\n", 740 | "col_to_normalize = 'Monetary (c.c. blood)'\n", 741 | "\n", 742 | "# Log normalization\n", 743 | "for df_ in [X_train_normed, X_test_normed]:\n", 744 | " # Add log normalized column\n", 745 | " df_['monetary_log'] = np.log(df_[col_to_normalize])\n", 746 | " # Drop the original column\n", 747 | " df_.drop(columns=col_to_normalize, inplace=True)\n", 748 | "\n", 749 | "# Check the variance for X_train_normed\n", 750 | "X_train_normed.var().round(3)" 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": { 756 | "dc": { 757 | "key": "66" 758 | }, 759 | "deletable": false, 760 | "editable": false, 761 | "run_control": { 762 | "frozen": true 763 | }, 764 | "tags": [ 765 | "context" 766 | ] 767 | }, 768 | "source": [ 769 | "## 10. Training the linear regression model\n", 770 | "The variance looks much better now. Notice that now Time (months)
has the largest variance, but it's not the orders of magnitude higher than the rest of the variables, so we'll leave it as is.
We are now ready to train the linear regression model.
" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 171, 777 | "metadata": { 778 | "dc": { 779 | "key": "66" 780 | }, 781 | "tags": [ 782 | "sample_code" 783 | ] 784 | }, 785 | "outputs": [ 786 | { 787 | "name": "stdout", 788 | "output_type": "stream", 789 | "text": [ 790 | "\n", 791 | "AUC score: 0.7891\n" 792 | ] 793 | } 794 | ], 795 | "source": [ 796 | "# Importing modules\n", 797 | "from sklearn import linear_model\n", 798 | "\n", 799 | "# Instantiate LogisticRegression\n", 800 | "logreg = linear_model.LogisticRegression(\n", 801 | " solver='liblinear',\n", 802 | " random_state=42\n", 803 | ")\n", 804 | "\n", 805 | "# Train the model\n", 806 | "logreg.fit(X_train_normed, y_train)\n", 807 | "\n", 808 | "# AUC score for tpot model\n", 809 | "logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])\n", 810 | "print(f'\\nAUC score: {logreg_auc_score:.4f}')" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": { 816 | "dc": { 817 | "key": "73" 818 | }, 819 | "deletable": false, 820 | "editable": false, 821 | "run_control": { 822 | "frozen": true 823 | }, 824 | "tags": [ 825 | "context" 826 | ] 827 | }, 828 | "source": [ 829 | "## 11. Conclusion\n", 830 | "The demand for blood fluctuates throughout the year. As one prominent example, blood donations slow down during busy holiday seasons. An accurate forecast for the future supply of blood allows for an appropriate action to be taken ahead of time and therefore saving more lives.
\n", 831 | "In this notebook, we explored automatic model selection using TPOT and AUC score we got was 0.7850. This is better than simply choosing 0
all the time (the target incidence suggests that such a model would have 76% success rate). We then log normalized our training data and improved the AUC score by 0.5%. In the field of machine learning, even small improvements in accuracy can be important, depending on the purpose.
Another benefit of using logistic regression model is that it is interpretable. We can analyze how much of the variance in the response variable (target
) can be explained by other variables in our dataset.
Grey and Gray. Colour and Color. Words like these have been the cause of many heated arguments between Brits and Americans. Accents (and jokes) aside, there are many words that are pronounced the same way but have different spellings. While it is easy for us to realize their equivalence, basic programming commands will fail to equate such two strings.
\nMore extreme than word spellings are names because people have more flexibility in choosing to spell a name in a certain way. To some extent, tradition sometimes governs the way a name is spelled, which limits the number of variations of any given English name. But if we consider global names and their associated English spellings, you can only imagine how many ways they can be spelled out.
\nOne way to tackle this challenge is to write a program that checks if two strings sound the same, instead of checking for equivalence in spellings. We'll do that here using fuzzy name matching.
"},{"outputs":[{"text":"TANAR\nTANAR\n","output_type":"stream","name":"stdout"}],"metadata":{"dc":{"key":"3"},"trusted":true,"tags":["sample_code"]},"execution_count":95,"cell_type":"code","source":"import fuzzy\n\n# Exploring the output of fuzzy.nysiis\nprint(fuzzy.nysiis('tomorrow'))\n\n# Testing equivalence of similar sounding words (misspelled word)\nprint(fuzzy.nysiis('tommorow'))"},{"metadata":{"run_control":{"frozen":true},"dc":{"key":"10"},"deletable":false,"editable":false,"tags":["context"]},"cell_type":"markdown","source":"## 2. Authoring the authors\nThe New York Times puts out a weekly list of best-selling books from different genres, and which has been published since the 1930’s. We’ll focus on Children’s Picture Books, and analyze the gender distribution of authors to see if there have been changes over time. We'll begin by reading in the data on the best selling authors from 2008 to 2017.
"},{"outputs":[{"data":{"text/html":"\n | Year | \nBook Title | \nAuthor | \nBesteller this year | \nfirst_name | \n
---|---|---|---|---|---|
0 | \n2017 | \nDRAGONS LOVE TACOS | \nAdam Rubin | \n49 | \nAdam | \n
1 | \n2017 | \nTHE WONDERFUL THINGS YOU WILL BE | \nEmily Winfield Martin | \n48 | \nEmily | \n
2 | \n2017 | \nTHE DAY THE CRAYONS QUIT | \nDrew Daywalt | \n44 | \nDrew | \n
3 | \n2017 | \nROSIE REVERE, ENGINEER | \nAndrea Beaty | \n38 | \nAndrea | \n
4 | \n2017 | \nADA TWIST, SCIENTIST | \nAndrea Beaty | \n28 | \nAndrea | \n
When we were young children, we were taught to read using phonics; sounding out the letters that compose words. So let's relive history and do that again, but using python this time. We will now create a new column or list that contains the phonetic equivalent of every first name that we just extracted.
\nTo make sure we're on the right track, let's compare the number of unique values in the first_name
column and the number of unique values in the nysiis coded column. As a rule of thumb, the number of unique nysiis first names should be less than or equal to the number of actual first names.
We'll use babynames_nysiis.csv
, a dataset that is derived from the Social Security Administration’s baby name data, to identify author genders. The dataset contains unique NYSIIS versions of baby names, and also includes the percentage of times the name appeared as a female name (perc_female
) and the percentage of times it appeared as a male name (perc_male
).
We'll use this data to create a list of gender
. Let's make the following simplifying assumption: For each name, if perc_female
is greater than perc_male
then assume the name is female, if perc_female
is less than perc_male
then assume it is a male name, and if the percentages are equal then it's a \"neutral\" name.
\n | babynysiis | \nperc_female | \nperc_male | \ngender | \n
---|---|---|---|---|
0 | \nNaN | \n62.50 | \n37.50 | \nF | \n
1 | \nRAX | \n63.64 | \n36.36 | \nF | \n
2 | \nESAR | \n44.44 | \n55.56 | \nM | \n
3 | \nDJANG | \n0.00 | \n100.00 | \nM | \n
4 | \nPARCAL | \n25.00 | \n75.00 | \nM | \n
5 | \nVALCARY | \n100.00 | \n0.00 | \nF | \n
6 | \nFRANCASC | \n63.64 | \n36.36 | \nF | \n
7 | \nCABAT | \n50.00 | \n50.00 | \nN | \n
8 | \nXANDAR | \n16.67 | \n83.33 | \nM | \n
9 | \nRACSAN | \n33.33 | \n66.67 | \nM | \n
Now that we have identified the likely genders of different names, let's find author genders by searching for each author's name in the babies_df
DataFrame, and extracting the associated gender.
From the results above see that there are more female authors on the New York Times best seller's list than male authors. Our dataset spans 2008 to 2017. Let's find out if there have been changes over time.
"},{"outputs":[{"text":"[15, 45, 48, 51, 46, 51, 34, 30, 32, 43]\n[8, 19, 27, 21, 21, 11, 21, 18, 25, 20]\n[1, 0, 1, 1, 2, 1, 1, 0, 1, 0]\n","output_type":"stream","name":"stdout"}],"metadata":{"dc":{"key":"38"},"trusted":true,"tags":["sample_code"]},"execution_count":105,"cell_type":"code","source":"# Creating a list of unique years, sorted in ascending order.\nyears = np.unique(author_df['Year'])\n\n# Initializing lists\nmales_by_yr = []\nfemales_by_yr = []\nunknown_by_yr = []\n\n# Looping through years to find the number of male, female and unknown authors per year\nfor year in years:\n females_by_yr.append(len(author_df[(author_df['author_gender']=='F') & (author_df['Year']==year)]))\n males_by_yr.append(len(author_df[(author_df['author_gender']=='M') & (author_df['Year']==year)]))\n unknown_by_yr.append(len(author_df[(author_df['author_gender']=='N') & (author_df['Year']==year)]))\n\n# Printing out yearly values to examine changes over time\nprint(females_by_yr)\nprint(males_by_yr)\nprint(unknown_by_yr)"},{"metadata":{"run_control":{"frozen":true},"dc":{"key":"45"},"deletable":false,"editable":false,"tags":["context"]},"cell_type":"markdown","source":"## 7. Foreign-born authors?\nOur gender data comes from social security applications of individuals born in the US. Hence, one possible explanation for why there are \"unknown\" genders associated with some author names is because these authors were foreign-born. While making this assumption, we should note that these are only a subset of foreign-born authors as others will have names that have a match in baby_df
(and in the social security dataset).
Using a bar chart, let's explore the trend of foreign-born authors with no name matches in the social security dataset.
"},{"outputs":[{"data":{"text/plain":"Text(0.5,0,'year')"},"metadata":{},"execution_count":107,"output_type":"execute_result"},{"data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAYwAAAEWCAYAAAB1xKBvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAG7xJREFUeJzt3XuUHWWd7vHvY7ipMBBMRMiFoKKCR7nYA14HvGFAJXocl2FEUWFldASv4xwYZwHieAQ93mXEHIyIDqCDohlFkSMwjCKaBAEFBEIMJhFMJNxBMPCcP+pt3TTd6beTXb130s9nrVq9663L/r3dnTxdb9Wukm0iIiJG85heFxAREZuGBEZERFRJYERERJUERkREVElgRERElQRGRERUSWDEuJP0dElXSrpb0rta2P89kp7c7f2WfVvSU9vY93iTdImko3pdR2w6tuh1ATEh/RNwse2929i57W3b2O+mTNKJwFNtH97rWmLTlSOM6IVdgWs2ZENJm8UfOZtLP2Dz6kusXwIjxpWki4AXA58vQ0dPk7S9pDMlrZF0s6R/kfSYsv5bJP1E0qck3QacWNrfJuk6SbdLukDSrh3v8edhI0lPkPSfku6StEjSv0r68ZB13y7pRkl3SDpVkkbpxiGSlkn6g6SPd9T6mFL7zZJWlz5tX5bNKu91pKTfAhd1tB0h6bdlfx9cz/fulZJ+Ufqyohw1DC47UNLKIesvl/QySbOBfwbeUL7nV3Wstmv5/t4t6YeSpnRsf6ika8r35RJJewzZ9/+SdDVwr6Qtyvyqsq/rJb10lO9jbGpsZ8o0rhNwCXBUx/yZwHeA7YBZwA3AkWXZW4B1wDE0Q6iPBeYAS4E9Stu/AJd17M80wy8A55TpccCewArgx0PW/S6wAzATWAPMXk/tBi4Gdizr3zDYF+Btpa4nA9sC3wK+WpbNKtueCTy+9GOw7f+W+b2AB4A9RnjvA4Fn0fyh92zg98BrOpatHLL+cuBl5fWJwNeG+TncBDytvP8lwMll2dOAe4GXA1vSDCMuBbbq2PeVwIyy7dPL93aXjv4+pde/a5m6O+UII3pK0iRgLnCc7bttLwc+AbypY7Xf2f6c7XW27wfeDnzU9nW21wH/G9i78yijY9+vA06wfZ/ta4GvDFPGybbvsP1bmjAY7dzKKbbXlvU/DRxW2t8IfNL2Mtv3AMcBc4cM2Zxo+97Sj0Efsn2/7auAq2iC41FsX2L7l7Yftn01cDZwwCi1jubLtm8o9XyDv/T9DcD3bF9o+0/A/6EJhud3bPtZ2yvKtg8BWwN7StrS9nLbN21kbdFnEhjRa1No/oK9uaPtZmBax/yKIdvsCnymDJXcAawFNGQbgKk0RyCd2w/dF8CtHa/vozk6oAzH3FOmF42wj5uBXcrrXYbpxxbAThvy/kNJ2l/SxWXo7k6a4Jwy3LpjMNJ7P6Ivth+mqX3Yn4vtpcB7aI5kVks6R9IuxGYlgRG99gfgTzQhMGgmsKpjfugtlVcAf297h47psbYvG7LeGprhrOkdbTNqC7P9TNvblum/R9jHTOB35fXvhunHOpqho5H6MhZnAQuBGba3B06jCUpoho8eN7hiObqauhHv+4i+lPM6M1jPz8X2WbZfWLYzcMoY3zP6XAIjesr2QzRDIR+RtF0ZVnof8LX1bHYacJykZwKUk+avH2Hf3wJOlPQ4Sc8A3tyFsj8gabKkGcC7ga+X9rOB90raTdK2NENlXy/DZt2wHbDW9h8l7Qf8XceyG4BtyonxLWnO62zdsfz3wKzBE/QVvgG8UtJLy/7eT3N+ZWgoA3/+bM1LJG0N/BG4H3h4LJ2L/pfAiH5wDM1fyMuAH9P8Jb1gpJVtn0fz1+s5ku4CfgUcPMLqRwPb0wy9fJXmP/UHNrLe7wBLaE76fg/4UmlfUN7jUuA3NP9xHrOR79XpH4CTJN0NHE/znzoAtu8sy0+nOQq4F+i8auo/ytfbJF0x2hvZvh44HPgczVHgq4FX235whE22Bk4u694KPJHmHE5sRmTnAUoxcUg6BXiS7SN6XUvEpiZHGLFZk/QMSc9WYz/gSOC8XtcVsSnKJzRjc7cdzTDULjTj+J+gGVKKiDHKkFRERFTJkFRERFTZrIakpkyZ4lmzZvW6jIiITcaSJUv+YHvq6GtuZoExa9YsFi9e3OsyIiI2GZJuHn2tRoakIiKiSgIjIiKqJDAiIqJKAiMiIqokMCIiokoCIyIiqrQWGJJmlIe9XFseRPPuYdaRpM9KWirpakn7diw7ojxn+UZJuVFcRESPtfk5jHXA+21fIWk7YImkC8tjMgcdDOxepv2BLwD7S9oROAEYoHkQyxJJC23f3mK9ERGxHq0dYdi+xfYV5fXdwHU8+hGac4Az3bgc2EHSzsArgAvLc5NvBy4EZrdVa0REjG5cPuktaRawD/CzIYum8chnHK8sbSO1D7fvecA8gJkzZ3al3tj8zTr2e62/x/KTX9n6e0SMp9ZPepdHVX4TeI/tu7q9f9vzbQ/YHpg6tep2KBERsQFaDYzyLOBvAv9u+1vDrLKK5sHyg6aXtpHaIyKiR9q8Sko0zzq+zvYnR1htIfDmcrXUc4E7bd8CXAAcJGmypMnAQaUtIiJ6pM1zGC8A3gT8UtKVpe2fgZkAtk8DzgcOAZYC9wFvLcvWSvowsKhsd5LttS3WGhERo2gtMGz/GNAo6xh45wjLFgALWigtIiI2QD7pHRERVRIYERFRJYERERFVEhgREVElgREREVUSGBERUSWBERERVRIYERFRJYERERFVEhgREVElgREREVUSGBERUSWBERERVRIYERFRJYERERFVEhgREVGltQcoSVoAvApYbft/DLP8A8AbO+rYA5hanra3HLgbeAhYZ3ugrTojIqJOm0cYZwCzR1po++O297a9N3Ac8F9DHsP64rI8YRER0QdaCwzblwK1z+E+DDi7rVoiImLj9fwchqTH0RyJfLOj2cAPJS2RNK83lUVERKfWzmGMwauBnwwZjnqh7VWSnghcKOnX5YjlUUqgzAOYOXNm+9VGRExQPT/CAOYyZDjK9qrydTVwHrDfSBvbnm97wPbA1KlTWy00ImIi62lgSNoeOAD4Tkfb4yVtN/gaOAj4VW8qjIiIQW1eVns2cCAwRdJK4ARgSwDbp5XVXgv80Pa9HZvuBJwnabC+s2z/oK06IyKiTmuBYfuwinXOoLn8trNtGbBXO1VFRMSG6odzGBERsQlIYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUaS0wJC2QtFrSsM/jlnSgpDslXVmm4z
uWzZZ0vaSlko5tq8aIiKjX5hHGGcDsUdb5b9t7l+kkAEmTgFOBg4E9gcMk7dlinRERUaG1wLB9KbB2AzbdD1hqe5ntB4FzgDldLS4iIsas1+cwnifpKknfl/TM0jYNWNGxzsrSNixJ8yQtlrR4zZo1bdYaETGh9TIwrgB2tb0X8Dng2xuyE9vzbQ/YHpg6dWpXC4yIiL/oWWDYvsv2PeX1+cCWkqYAq4AZHatOL20REdFDPQsMSU+SpPJ6v1LLbcAiYHdJu0naCpgLLOxVnRER0diirR1LOhs4EJgiaSVwArAlgO3TgL8F3iFpHXA/MNe2gXWSjgYuACYBC2xf01adERFRp7XAsH3YKMs/D3x+hGXnA+e3UVdERGyYXl8lFRERm4gERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESV1gJD0gJJqyX9aoTlb5R0taRfSrpM0l4dy5aX9islLW6rxoiIqNfmEcYZwOz1LP8NcIDtZwEfBuYPWf5i23vbHmipvoiIGIM2n+l9qaRZ61l+Wcfs5cD0tmqJiIiN1y/nMI4Evt8xb+CHkpZImre+DSXNk7RY0uI1a9a0WmRExETW2hFGLUkvpgmMF3Y0v9D2KklPBC6U9Gvblw63ve35lOGsgYEBt15wRMQE1dMjDEnPBk4H5ti+bbDd9qrydTVwHrBfbyqMiIhBVYEh6dWSuhoukmYC3wLeZPuGjvbHS9pu8DVwEDDslVYRETF+aoek3gB8WtI3gQW2fz3aBpLOBg4EpkhaCZwAbAlg+zTgeOAJwL9JAlhXrojaCTivtG0BnGX7B2PpVEREdF9VYNg+XNJfAYcBZ0gy8GXgbNt3j7DNYaPs8yjgqGHalwF7PXqLiIjopephJtt3AecC5wA7A68FrpB0TEu1RUREH6k9hzFH0nnAJTTDSvvZPpjmSOD97ZUXERH9ovYcxv8EPjX00lbb90k6svtlRUREv6kdkrp1aFhIOgXA9o+6XlVERPSd2sB4+TBtB3ezkIiI6G/rHZKS9A7gH4CnSLq6Y9F2wE/aLCwiIvrLaOcwzqK5x9NHgWM72u+2vba1qiIiou+MFhi2vVzSO4cukLRjQiMiYuKoOcJ4FbCE5g6y6lhm4Mkt1RUREX1mvYFh+1Xl627jU05ERPSr2g/uPerS2eHaIiJi8zXaVVLbAI+juYHgZP4yJPVXwLSWa4uIiD4y2jmMvwfeA+xCcx5jMDDuAj7fYl0REdFnRjuH8RngM5KOsf25caopIiL6UO3tzT8n6fnArM5tbJ/ZUl0REdFnqgJD0leBpwBXAg+VZgMJjIiICaL2brUDwJ623WYxERHRv2pvPvgr4Elj3bmkBZJWSxr2mdxqfFbSUklXS9q3Y9kRkm4s0xFjfe+IiOiu2iOMKcC1kn4OPDDYaPvQUbY7g+ZqqpGGrg4Gdi/T/sAXgP0l7UjzDPABmqGvJZIW2r69st6IiOiy2sA4cUN2bvtSSbPWs8oc4Mwy1HW5pB0k7QwcCFw4eK8qSRcCs4GzN6SOiIjYeLVXSf1XS+8/DVjRMb+ytI3U/iiS5gHzAGbOnLnBhcw69nsbvG2t5Se/svX3GKte9jvf8/bke/5IE7Xf3VZ7a5DnSlok6R5JD0p6SNJdbRdXw/Z82wO2B6ZOndrrciIiNlu1J70/DxwG3Ag8FjgKOLUL778KmNExP720jdQeERE9UhsY2F4KTLL9kO0v05xT2FgLgTeXq6WeC9xp+xbgAuAgSZPLPawOKm0REdEjtSe975O0FXClpI8Bt1ARNpLOpjmBPUXSSporn7YEsH0acD5wCLAUuA94a1m2VtKHgUVlVyflYU0REb1VGxhvAiYBRwPvpRkuet1oG9k+bJTlBh71NL+ybAGwoLK+iIhoWe1VUjeXl/cDH2qvnIiI6Fe195L6Dc0H6B7Bdh7RGhExQYzlXlKDtgFeD+zY/XIiIqJfVV0lZfu2jmmV7U8Dm/+nVCIi4s9qh6T27Zh9DM0RR+3RSUREbAZq/9P/BH85h7EOWE4zLBURERNEbWB8lyYwBp/pbeBVUjNr+5PdLy0iIvpJbWA8B/hr4Ds0ofFq4Oc0twqJiIgJoDYwpgP72r4bQNKJwPdsH95WYRER0V9q7yW1E/Bgx/yDpS0iIiaI2iOMM4GfSzqvzL+G5ml6ERExQdTeGuQjkr4PvKg0vdX2L9orKyIi+k31ZylsXwFc0WItERHRx6qfhxERERNbAiMiIqokMCIiokoCIyIiqrQaGJJmS7pe0lJJxw6z/FOSrizTDZLu6Fj2UMeyhW3WGRERo2vtjrOSJgGnAi8HVgKLJC20fe3gOrbf27H+McA+Hbu43/bebdUXERFj0+YRxn7AUtvLbD8InAPMWc/6hwFnt1hPRERshDYDYxqwomN+ZWl7FEm7ArsBF3U0byNpsaTLJb1mpDeRNK+st3jNmjXdqDsiIobRLye95wLn2n6oo21X2wPA3wGflvSU4Ta0Pd/2gO2BqVOnjketERETUpuBsQqY0TE/vbQNZy5DhqNsrypflwGX8MjzGxERMc7aDIxFwO6SdpO0FU0oPOpqJ0nPACYDP+1omyxp6/J6CvAC4Nqh20ZExPhp7Sop2+skHQ1cAEwCFti+RtJJwGLbg+ExFzjHtjs23wP4oqSHaULt5M6rqyIiYvy1FhgAts8Hzh/SdvyQ+ROH2e4y4Flt1hYREWPTLye9IyKizyUwIiKiSgIjIiKqJDAiIqJKAiMiIqokMCIiokoCIyIiqiQwIiKiSgIjIiKqJDAiIqJKAiMiIqokMCIiokoCIyIiqiQwIiKiSgIjIiKqJDAiIqJKAiMiIqq0GhiSZku6XtJSSccOs/wtktZIurJMR3UsO0LSjWU6os06IyJidK09olXSJOBU4OXASmCRpIXDPJv767aPHrLtjsAJwABgYEnZ9va26o2IiPVr8whjP2Cp7WW2HwTOAeZUbvsK4ELba0tIXAjMbqnOiIio0GZgTANWdMyvLG1DvU7S1ZLOlTRjjNsiaZ6kxZIWr1mzpht1R0TEMHp90vs/gVm2n01zFPGVse7A9nzbA7YHpk6d2vUCIyKi0WZgrAJmdMxPL21/Zvs22w+U2dOB59RuGxER46vNwFgE7C5pN0lbAXOBhZ0rSNq5Y/ZQ4Lry+gLgIEmTJU0GDiptERHRI61dJWV7naSjaf6jnwQssH2NpJOAxbYXAu+SdCiwDlgLvKVsu1bSh2lCB+Ak22vbqjUiIkbXWmAA2D4fOH9I2/Edr48Djhth2wXAgjbri4iIer0+6R0REZuIBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElVYDQ9JsSddLWirp2GGWv0/StZKulvQjSbt2LHtI0pVlWjh024iIGF+tPaJV0
iTgVODlwEpgkaSFtq/tWO0XwIDt+yS9A/gY8Iay7H7be7dVX0REjE2bRxj7AUttL7P9IHAOMKdzBdsX276vzF4OTG+xnoiI2AhtBsY0YEXH/MrSNpIjge93zG8jabGkyyW9ZqSNJM0r6y1es2bNxlUcEREjam1IaiwkHQ4MAAd0NO9qe5WkJwMXSfql7ZuGbmt7PjAfYGBgwONScETEBNTmEcYqYEbH/PTS9giSXgZ8EDjU9gOD7bZXla/LgEuAfVqsNSIiRtFmYCwCdpe0m6StgLnAI652krQP8EWasFjd0T5Z0tbl9RTgBUDnyfKIiBhnrQ1J2V4n6WjgAmASsMD2NZJOAhbbXgh8HNgW+A9JAL+1fSiwB/BFSQ/ThNrJQ66uioiIcdbqOQzb5wPnD2k7vuP1y0bY7jLgWW3WFhERY5NPekdERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVVgND0mxJ10taKunYYZZvLenrZfnPJM3qWHZcab9e0ivarDMiIkbXWmBImgScChwM7AkcJmnPIasdCdxu+6nAp4BTyrZ7AnOBZwKzgX8r+4uIiB5p8whjP2Cp7WW2HwTOAeYMWWcO8JXy+lzgpZJU2s+x/YDt3wBLy/4iIqJHtmhx39OAFR3zK4H9R1rH9jpJdwJPKO2XD9l22nBvImkeMK/M3iPp+o0vvcoU4A9j2UCntFTJ+Nqk+t3F956o/YYx9n0z+T2HidPvXWtXbDMwxoXt+cD88X5fSYttD4z3+/Za+j3xTNS+T9R+r0+bQ1KrgBkd89NL27DrSNoC2B64rXLbiIgYR20GxiJgd0m7SdqK5iT2wiHrLASOKK//FrjItkv73HIV1W7A7sDPW6w1IiJG0dqQVDkncTRwATAJWGD7GkknAYttLwS+BHxV0lJgLU2oUNb7BnAtsA54p+2H2qp1A437MFifSL8nnona94na7xGp+YM+IiJi/fJJ74iIqJLAiIiIKgmMQtIMSRdLulbSNZLeXdp3lHShpBvL18mlXZI+W25fcrWkfTv29bGyj+vKOupVv0azAf1+hqSfSnpA0j8O2dd6bwXTb7rV95H206+6+TMvyydJ+oWk7453X8aiy7/rO0g6V9Kvy7/z5/WiT+POdqbmPM7OwL7l9XbADTS3NPkYcGxpPxY4pbw+BPg+IOC5wM9K+/OBn9Cc6J8E/BQ4sNf962K/nwj8NfAR4B879jMJuAl4MrAVcBWwZ6/7N059H3Y/ve5f2/3u2N/7gLOA7/a6b+PVb5o7VBxVXm8F7NDr/o3HlCOMwvYttq8or+8GrqP5dHnn7Uu+ArymvJ4DnOnG5cAOknYGDGxD80u0NbAl8Ptx68gYjbXftlfbXgT8aciuam4F01e61ff17KcvdfFnjqTpwCuB08eh9I3SrX5L2h74G5qrPLH9oO07xqUTPZbAGIaau+buA/wM2Mn2LWXRrcBO5fVwtz6ZZvunwMXALWW6wPZ141D2Rqvs90iG/X50ucTWbGTfR9pP3+tCvz8N/BPwcBv1tWUj+70bsAb4chmKO13S49uqtZ8kMIaQtC3wTeA9tu/qXObm+HO91yFLeiqwB82n06cBL5H0opbK7ZqN7femrFt9X99++lEXftdfBay2vaS9KruvCz/vLYB9gS/Y3ge4l2Yoa7OXwOggaUuaX6R/t/2t0vz7MtRE+bq6tI90+5LXApfbvsf2PTTnOfr6hNgY+z2STfJ2Ll3q+0j76Vtd6vcLgEMlLacZgnyJpK+1VHJXdKnfK4GVtgePIs+lCZDNXgKjKFcyfQm4zvYnOxZ13r7kCOA7He1vLldLPRe4sxzW/hY4QNIW5ZfzAJqx0r60Af0eSc2tYPpKt/q+nv30pW712/ZxtqfbnkXz877I9uEtlNwVXez3rcAKSU8vTS+luSvF5q/XZ937ZQJeSHMoejVwZZkOobnd+o+AG4H/B+xY1hfNA6JuAn4JDJT2ScAXaULiWuCTve5bl/v9JJq/sO4C7iiv/6osO4TmypObgA/2um/j1feR9tPr/o3Hz7xjnwfS/1dJdfN3fW9gcdnXt4HJve7feEy5NUhERFTJkFRERFRJYERERJUERkREVElgRERElQRGRERUSWBERESVBEZEH5E0qdc1RIwkgRGxgSSdJOk9HfMfkfRuSR+QtEjNc1I+1LH825KWlGcxzOtov0fSJyRdRZ/fRiYmtgRGxIZbALwZQNJjaG6PcSuwO83t3vcGniPpb8r6b7P9HGAAeJekJ5T2x9M8T2Uv2z8ezw5EjMUWvS4gYlNle7mk2yTtQ3NL7F/QPHDnoPIaYFuaALmUJiReW9pnlPbbgIdobogX0dcSGBEb53TgLTT3HVpAcyO6j9r+YudKkg4EXgY8z/Z9ki6hedAWwB9tPzReBUdsqAxJRWyc84DZNEcWF5TpbeWZC0iaJumJwPbA7SUsnkHzWN+ITUqOMCI2gu0HJV0M3FGOEn4oaQ/gp83dtLkHOBz4AfB2SdcB1wOX96rmiA2Vu9VGbIRysvsK4PW2b+x1PRFtypBUxAaStCewFPhRwiImghxhRERElRxhRERElQRGRERUSWBERESVBEZERFRJYERERJX/D/M7CLuliBgWAAAAAElFTkSuQmCC\n","text/plain":"What’s more exciting than a bar chart is a grouped bar chart. This type of chart is good for displaying changes over time while also comparing two or more groups. Let’s use a grouped bar chart to look at the distribution of male and female authors over time.
"},{"outputs":[{"data":{"text/plain":"Text(0.5,0,'year')"},"metadata":{},"execution_count":109,"output_type":"execute_result"},{"data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAF6BJREFUeJzt3Xu0ZGV9p/HnSzdIBBSQtkWatjE4GmYyiraI8QIRzSCiIFFjxksbMR1nEqOJl6CZmeCMjJe1NJLociSigHjBG8Iwo0ZRdKKgNIiotARQCCBNt0jLzUjA3/yx9wnl4VzqdNeuOof9fNaqdfZ9/95z6tS39rtr70pVIUnqrx0mXYAkabIMAknqOYNAknrOIJCknjMIJKnnDAJJ6jmDQP8qySlJ3rLY60jyliQ/SbJpzHUdn+T0ce5zNkkqyf4dbfvQJNd1sW0tTgaBlpQkq4HXAgdU1UMmXc84JDkvySsmXYfuuwwCLTWrgZuqavOkC7kvSLJ80jWMQpJlk65hKTMIlrgkVyd5fZJLk9ye5OQkK5N8LsmtSb6UZI+B5T+ZZFOSnyX5WpJ/O8e2j0xySZKtSb6R5N/PseyJSa5NckuSi5I8ZWDe8Uk+keS0tqbvJ1k7MP/AJBe3884Adp5lH08Hvgg8NMltSU5ppx/c1rc1yXeSHDqwznltV9I32nX+d5IHJflIW+uFSdYM044Z6pl1vzMse1ySq9o2XpbkudN+P6cPjK9pu36WJzkBeArwnrb+9wxs9ulJrmj3/94kadffIcl/SXJNks3t7/2B07Z9bJJ/Ar48R81varvgrk7yonba45PcOPjCm+SYJN+ZYf05l23rnPq93NQ+R/YcWHbW52qa7sP3Jfm/SW4Hfnu2dmgIVeVjCT+Aq4ELgJXAPsBm4GLgQJoX1C8DfzWw/MuB3YD7Ae8GLhmYdwrwlnb4wHZbTwCWAevafd1vljpeDDwIWE7TdbMJ2Lmddzzwz8AR7bbeClzQztsJuAb4M2BH4HnAv0zVMcN+DgWuGxjfB7ip3fYOwDPa8RXt/POAK4FfBx4IXAb8I/D0ttbTgA8toB2nD7PfGep+PvDQdtnfA24H9p6+3XZ8DVDA8oE2vGLa9go4B9id5ihpC3D4wN/4SuDhwK7AZ4APT9v2acAuwK/N8ju+C3hX+zw5pK33ke38y4BnDix/JvDaWdo967LAq2meu6va/bwf+NgCnqs/A57U/k53nvT/4lJ+TLwAH9v5B2xenF80MP5p4H0D468CPjvLuru3LwoPbMdP4Z4geB/wP6YtfzlwyJB13Qw8uh0+HvjSwLwDgJ+3w08FfgxkYP43GD4I/mLqRW5g2heAde3wecBfDsx7J/C5gfFnD77ADNGO04fZ7xC/n0uAo6Zvtx1fw3BB8OSB8U8Ax7XD5wL/eWDeI2nCdfnAth8+R22H0gTBLtO2/18H2v6RdnhP4A7aUJthW7MuC2wEDhtYdu+pOod8rp42jv+xPjzsGrpvuHFg+OczjO8KTT9qkre1h+K30IQIwF4zbPNhwGvbboetSbYC+9K8q72XJK9LsrE9jN9K8+57cLuDn/C5A9g5Tf/0Q4Hrq/3vbl0zT3un1/n8aXU+meZFZcpQv58h27GQ/f6rJC8d6GbbCvy7Wba7ENN/p1PteCi/+ju8hiYEVg5Mu3aebd9cVbdP28bU3/504NlJdgFeAPy/qrphlu3MtezDgDMHficbgbuBlUM+V+drg4Z0nzhRpKH9R+Aomm6Rq2le5G4GMsOy1wInVNUJ82207Ud/A3AY8P2q+mWS2bY73Q3APkkyEAargauGWHeqzg9X1R8OufysFtiOofeb5GHA37XbPb+q7k5yycB2bwfuP7DK9E9DLfQWwT+meZGdsprmHf6NNN0ww2xzjyS7DITBauB7AFV1fZLzgWOAl9AcPc5onmWvBV5eVV+fvl6SlzD/c9VbJ4+IRwT9shvwC5q+7PsD/3OOZf8OeGWSJ6SxS5JnJdltlu3eRdNPvTzJfwMeMGRN57fr/mmSHZMcAxw05LpwzzvO/9C+i9w5zefgV8275r0tpB0L2e8uNC9aWwCS/AHNEcGUS4CnJlndntR947T1b6Tp7x/Wx4A/S7Jfkl1p/s5nVNVdC9gGwJuT7NQG5JHAJwfmnUYTmr9Jcw5iLrMt+7+AE9qgJMmKJEe18xbyXNV2Mgj65TSaQ/zraU7iXTDbglW1AfhD4D0078SuBF42y+JfAD5PcxL2GpoTw0MdtlfVnTTvFl8G/JTmROp8LyyD619L887xTTQvtNcCr2fbnttDt2Mh+62qy2jOTZxP86L+m8DXB+Z/ETgDuBS4iOYk8KATgecluTnJ3wzRjg8CHwa+Bvyobcerhlhv0Caav/uPgY8Ar6yqHwzMP5O2a6eq7phnW7MteyJwNvD3SW6leT4+oZ039HNV2y+/2jUrScNJchXwR1X1pVEuq/HziEDSgiX5XZrurlmvQ9iWZTUZniyWtCBJzqP5CPBLquqXo1pWk2PXkCT1nF1DktRzS6JraK+99qo1a9ZMugxJWlIuuuiin1TVivmWWxJBsGbNGjZs2DDpMiRpSUky1FX6dg1JUs8ZBJLUcwaBJPWcQSBJPWcQSFLPGQSS1HMGgST1nEEgST1nEEhSzy2JK4t1H/bVea4YP2TteOqYlL63X4tCp0GQ5GrgVpovpL6rqtYm2ZPm25jW0HwX6Quq6uYu65AkzW4cXUO/XVWPqaqptzbHAedW1SOAc9txSdKETOIcwVHAqe3wqcDRE6hBktTqOgiK5oupL0qyvp22sqpuaIc3AStnWjHJ+iQbkmzYsmVLx2VKUn91fbL4yVV1fZIHA19M8oPBmVVVSWb8irSqOgk4CWDt2rV+jZokdaTTI4Kqur79uRk4EzgIuDHJ3gDtz81d1iBJmltnQZBklyS7TQ0DvwN8DzgbWNcutg44q6saJEnz67JraCVwZpKp/Xy0qj6f5ELgE0mOBa4BXtBhDZKkeXQWBFX1Q+DRM0y/CTisq/1KkhbGW0xIUs8ZBJLUc95rqO+8143Uex4RSFLPGQSS1HMGgST1nEEgST1nEEhSzxkEktRzBoEk9ZzXEajfvI5C8ohAkvrOIJCknjMIJKnnPEcwafZRS5owjwgkqecMAknqOYNAknrOIJCknjMIJKnnDAJJ6jmDQJJ6zusIJGlSFsl1RB4RSFLPGQSS1HMGgST1nEEgST1nEEhSzxkEktRzBoEk9ZxBIEk913kQJFmW5NtJzmnH90vyzSRXJjkjyU5d1yBJmt04jgheDWwcGH878NdVtT9wM3DsGGqQJM2i0yBIsgp4FvCBdjzA04BPtYucChzdZQ2SpLl1fUTwbuANwC/b8QcBW6vqrnb8OmCfmV
ZMsj7JhiQbtmzZ0nGZktRfnQVBkiOBzVV10basX1UnVdXaqlq7YsWKEVcnSZrS5d1HnwQ8J8kRwM7AA4ATgd2TLG+PClYB13dYgyRpHp0dEVTVG6tqVVWtAV4IfLmqXgR8BXheu9g64KyuapAkzW8S1xH8BfDnSa6kOWdw8gRqkCS1xvLFNFV1HnBeO/xD4KBx7FeSND+vLJaknjMIJKnnDAJJ6jmDQJJ6ziCQpJ4zCCSp5wwCSeo5g0CSes4gkKSeMwgkqecMAknqOYNAknrOIJCknjMIJKnnDAJJ6jmDQJJ6ziCQpJ4zCCSp5wwCSeq5sXxnsaRF6qsb5p5/yNrx1KGJ8ohAknrOIJCknjMIJKnnDAJJ6jmDQJJ6ziCQpJ4zCCSp57yOQFJ/eR0F4BGBJPWeQSBJPTdUECR5dhJDQ5Lug4Z9cf894Iok70jyqC4LkiSN11BBUFUvBg4ErgJOSXJ+kvVJdpttnSQ7J/lWku8k+X6SN7fT90vyzSRXJjkjyU4jaYkkaZsM3d1TVbcAnwI+DuwNPBe4OMmrZlnlF8DTqurRwGOAw5McDLwd+Ouq2h+4GTh2O+qXJG2nYc8RHJXkTOA8YEfgoKp6JvBo4LUzrVON29rRHdtHAU+jCRSAU4Gjt7l6SdJ2G/Y6gmNo3sV/bXBiVd2RZNZ39EmWARcB+wPvpela2lpVd7WLXAfsM8u664H1AKtXrx6yTElLip/jXxSG7RraND0EkrwdoKrOnW2lqrq7qh4DrAIOAoY+0VxVJ1XV2qpau2LFimFXkyQt0LBB8IwZpj1z2J1U1VbgK8ATgd2TTB2JrAKuH3Y7kqTRmzMIkvynJN8FHpXk0oHHj4BL51l3RZLd2+FfowmTjTSB8Lx2sXXAWdvbCEnStpvvHMFHgc8BbwWOG5h+a1X9dJ519wZObc8T7AB8oqrOSXIZ8PEkbwG+DZy8baVLkkZhviCoqro6yR9Pn5Fkz7nCoKoupbn2YPr0H9KcL5AkLQLDHBEcSfPJnwIyMK+Ah3dUlyRpTOYMgqo6sv2533jKkSSN27AXlN3rI6IzTZMkLT1zHhEk2Rm4P7BXkj24p2voAcxyIZgkaWmZ7xzBHwGvAR5Kc55gKghuAd7TYV2SpDGZ7xzBicCJSV5VVX87ppokSWM01L2Gqupvk/wWsGZwnao6raO6NCKfufyGOecfM6Y6JC1eQwVBkg8Dvw5cAtzdTi7AIJCkJW7Yu4+uBQ6oquqyGEnS+A1707nvAQ/pshBJ0mQMe0SwF3BZkm/RfPMYAFX1nE6qkiSNzbBBcHyXRUiSJmfYTw19tetCJEmTMewtJg5OcmGS25LcmeTuJLd0XZwkqXvDdg29B3gh8EmaTxC9FPg3XRUlSaPgdTTDGfZTQ1TVlcCy9nuIPwQc3l1ZkqRxGfaI4I4kOwGXJHkHcAMLCBFJ0uI17Iv5S4BlwJ8AtwP7Ar/bVVGSpPEZ9lND17SDPwfe3F05kqRxG/ZeQz+iubfQr6gqv6pSkpa4hdxraMrOwPOBPUdfjiRp3IY6R1BVNw08rq+qdwPP6rg2SdIYDNs19NiB0R1ojhCGPZqQJC1iw76Yv5N7zhHcBVxN0z0kSVrihg2Cc2iCYOo7iws4MmlGq+pdoy9NkjQOwwbB44DHA2fRhMGzgW8BV3RUlyRpTIYNglXAY6vqVoAkxwP/p6pe3FVhkqTxGPbK4pXAnQPjd7bTJElL3LBHBKcB30pyZjt+NHBKJxVJksZq2FtMnJDkc8BT2kl/UFXf7q4sSdK4DH0tQFVdDFzcYS2SpAno7KKwJPvSdCmtpPm46UlVdWKSPYEzgDU01yO8oKpu7qoOTdakvxhk0vuftL63X8Pp8jsF7gJeW1UHAAcDf5zkAOA44NyqegRwbjsuSZqQzoKgqm5ou5NoP3a6EdgHOAo4tV3sVJoTz5KkCRnLt4wlWQMcCHwTWFlVU8erm/BjqJI0UZ3fOC7JrsCngddU1S1Tt6UAqKpKcq/vOWjXWw+sB1i9enXXZUrSyC2VczSdHhEk2ZEmBD5SVZ9pJ9+YZO92/t7A5pnWraqTqmptVa1dsWJFl2VKUq91FgRp3vqfDGycdlO6s4F17fA6mvsXSZImpMuuoSfRfOn9d5Nc0k57E/A24BNJjgWuAV7QYQ2SpHl0FgRV9Q/cc9vq6Q7rar+LzVLpI5TUX2P51JAkafEyCCSp5wwCSeo5v4BeUmc8R7Y0eEQgST1nEEhSzxkEktRzBoEk9ZxBIEk9ZxBIUs8ZBJLUcwaBJPWcQSBJPWcQSFLPGQSS1HPea+irG+aef8ja8dQhSRPiEYEk9ZxBIEk9ZxBIUs8ZBJLUcwaBJPWcQSBJPWcQSFLPGQSS1HMGgST1nEEgST1nEEhSzxkEktRzBoEk9ZxBIEk9ZxBIUs/5fQRShz5z+Q1zzj9mTHVIc+nsiCDJB5NsTvK9gWl7Jvlikivan3t0tX9J0nC67Bo6BTh82rTjgHOr6hHAue24JGmCOguCqvoa8NNpk48CTm2HTwWO7mr/kqThjPtk8cqqmuo03QSsnG3BJOuTbEiyYcuWLeOpTpJ6aGKfGqqqAmqO+SdV1dqqWrtixYoxViZJ/TLuILgxyd4A7c/NY96/JGmacQfB2cC6dngdcNaY9y9JmqbLj49+DDgfeGSS65IcC7wNeEaSK4Cnt+OSpAnq7IKyqvr9WWYd1tU+JUkL5y0mJKnnDAJJ6rn7/L2GvNeLJM3NIwJJ6jmDQJJ6ziCQpJ4zCCSp5wwCSeo5g0CSes4gkKSeMwgkqecMAknqOYNAknrOIJCknjMIJKnnDAJJ6jmDQJJ6ziCQpJ4zCCSp5wwCSeo5g0CSes4gkKSeMwgkqecMAknqOYNAknrOIJCknjMIJKnnDAJJ6jmDQJJ6ziCQpJ4zCCSp5wwCSeq5iQRBksOTXJ7kyiTHTaIGSVJj7EGQZBnwXuCZwAHA7yc5YNx1SJIakzgiOAi4sqp+WFV3Ah8HjppAHZIkYPkE9rkPcO3A+HXAE6YvlGQ9sL4dvS3J5SPa/17AT0a0raXI9tt+298fDxtmoUkEwVCq6iTgpFFvN8mGqlo76u0uFbbf9tv+/rZ/NpPoGroe2HdgfFU7TZI0AZMIgguBRyTZL8lOwAuBsydQhySJCXQNVdVdSf4E+AKwDPhgVX1/jCWMvLtpibH9/Wb7dS+pqknXIEmaIK8slqSeMwgkqeeWfBAk2TfJV5JcluT7SV7dTt8zyReTXNH+3KOdniR/097e4tIkjx3Y1jvabWxsl8mk2jWsbWj/o5Kcn+QXSV43bVtL7tYfo2r/bNtZ7Eb592/nL0vy7STnjLst22LEz//dk3wqyQ/a14AnTqJNE1FVS/oB7A08th3eDfhHmltXvAM4rp1+HPD2dvgI4HNAgIOBb7bTfwv4Os0J7GXA+cChk25fB+1/MPB44ATgdQPbWQZcBTwc2An4DnDApNs3xvbPuJ1Jt29c7R/Y3p8DHwXOmXTbxt1+4FTgFe3wTsDuk27fuB5L/oigqm6oqovb4VuBjTRXLx9F84el/
Xl0O3wUcFo1LgB2T7I3UMDONE+A+wE7AjeOrSHbaKHtr6rNVXUh8C/TNrUkb/0xqvbPsZ1FbYR/f5KsAp4FfGAMpY/EqNqf5IHAU4GT2+XurKqtY2nEIrDkg2BQkjXAgcA3gZVVdUM7axOwsh2e6RYX+1TV+cBXgBvaxxeqauMYyh6ZIds/mxl/LyMusVPb2f7ZtrNkjKD97wbeAPyyi/q6tp3t3w/YAnyo7Rr7QJJduqp1sbnPBEGSXYFPA6+pqlsG51VzrDfn52ST7A/8Bs2VzvsAT0vylI7KHbntbf9SN6r2z7WdxWwEz/8jgc1VdVF3VXZnBH//5cBjgfdV1YHA7TRdSr1wnwiCJDvSPAk+UlWfaSff2Hb50P7c3E6f7RYXzwUuqKrbquo2mvMIS+Jk0QLbP5sle+uPEbV/tu0seiNq/5OA5yS5mqZb8GlJTu+o5JEaUfuvA66rqqmjwE/RBEMvLPkgaD/ZczKwsareNTDrbGBdO7wOOGtg+kvbTw8dDPysPYT8J+CQJMvbJ9YhNP2Ni9o2tH82S/LWH6Nq/xzbWdRG1f6qemNVraqqNTR/+y9X1Ys7KHmkRtj+TcC1SR7ZTjoMuGzE5S5ekz5bvb0P4Mk0h32XApe0jyOABwHnAlcAXwL2bJcPzRfjXAV8F1jbTl8GvJ/mxf8y4F2TbltH7X8IzbufW4Ct7fAD2nlH0Hzq4irgLyfdtnG2f7btTLp94/z7D2zzUJbOp4ZG+fx/DLCh3dZngT0m3b5xPbzFhCT13JLvGpIkbR+DQJJ6ziCQpJ4zCCSp5wwCSeo5g0CSes4gkMYgybJJ1yDNxiCQpkny35O8ZmD8hCSvTvL6JBem+R6LNw/M/2ySi9r74a8fmH5bkncm+Q5L5HYl6ieDQLq3DwIvBUiyA80tFzYBj6C5XfdjgMcleWq7/Mur6nHAWuBPkzyonb4LzfddPLqq/mGcDZAWYvmkC5AWm6q6OslNSQ6kuX3xt2m+zOR32mGAXWmC4Ws0L/7Pbafv206/Cbib5mZo0qJmEEgz+wDwMpp703yQ5iZkb62q9w8ulORQ4OnAE6vqjiTn0XzBEcA/V9Xd4ypY2lZ2DUkzOxM4nOZI4Avt4+Xtfe9Jsk+SBwMPBG5uQ+BRNF9/Ki0pHhFIM6iqO5N8Bdjavqv/+yS/AZzf3PmY24AXA58HXplkI3A5cMGkapa2lXcflWbQniS+GHh+VV0x6XqkLtk1JE2T5ADgSuBcQ0B94BGBJPWcRwSS1HMGgST1nEEgST1nEEhSzxkEktRz/x8bXmhPNw4i1gAAAABJRU5ErkJggg==\n","text/plain":"
What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?
\nIn this notebook, we'll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python package requests
. Then we'll extract words from this web data using BeautifulSoup
. Finally, we'll dive into analyzing the distribution of words using the Natural Language ToolKit (nltk
).
The Data Science pipeline we'll build in this notebook can be used to visualize the word frequency distributions of any novel that you can find on Project Gutenberg. The natural language processing tools used here apply to much of the data that data scientists encounter as a vast proportion of the world's data is unstructured data and includes a great deal of text.
\nLet's start by loading in the three main python packages we are going to use.
","cell_type":"markdown"},{"execution_count":2,"outputs":[],"metadata":{"tags":["sample_code"],"dc":{"key":"3"},"trusted":true,"collapsed":true},"source":"# Importing requests, BeautifulSoup and nltk\nimport requests\nimport nltk\nfrom bs4 import BeautifulSoup","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"10"},"run_control":{"frozen":true},"deletable":false},"source":"## 2. Request Moby Dick\nTo analyze Moby Dick, we need to get the contents of Moby Dick from somewhere. Luckily, the text is freely available online at Project Gutenberg as an HTML file: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm .
\nNote that HTML stands for Hypertext Markup Language and is the standard markup language for the web.
\nTo fetch the HTML file with Moby Dick we're going to use the request
package to make a GET
request for the website, which means we're getting data from it. This is what you're doing through a browser when visiting a webpage, but now we're getting the requested page directly into python instead.
\\r\\n\\r\\nThe Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville\\r\\n\\r\\nThis eBook is for the use of anyone anywh'"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"10"},"trusted":true},"source":"# Getting the Moby Dick HTML \nr = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')\n\n# Setting the correct text encoding of the HTML page\nr.encoding = 'utf-8'\n\n# Extracting the HTML from the request object\nhtml = r.text\n\n# Printing the first 2000 characters in html\nhtml[0:2000]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"17"},"run_control":{"frozen":true},"deletable":false},"source":"## 3. Get the text from the HTML\nThis HTML is not quite what we want. However, it does contain what we want: the text of Moby Dick. What we need to do now is wrangle this HTML to extract the text of the novel. For this we'll use the package
\nBeautifulSoup
.Firstly, a word on the name of the package: Beautiful Soup? In web development, the term \"tag soup\" refers to structurally or syntactically incorrect HTML code written for a web page. What Beautiful Soup does best is to make tag soup beautiful again and to extract information from it with ease! In fact, the main object created and queried when using this package is called
","cell_type":"markdown"},{"execution_count":6,"outputs":[{"output_type":"execute_result","execution_count":6,"data":{"text/plain":"'r which the beech tree\\r\\n extended its branches.” —Darwin’s Voyage of a Naturalist.\\r\\n \\n\\r\\n “‘Stern all!’ exclaimed the mate, as upon turning his head, he saw the\\r\\n distended jaws of a large Sperm Whale close to the head of the boat,\\r\\n threatening it with instant destruction;—‘Stern all, for your\\r\\n lives!’” —Wharton the Whale Killer.\\r\\n \\n\\r\\n “So be cheery, my lads, let your hearts never fail, While the bold\\r\\n harpooneer is striking the whale!” —Nantucket Song.\\r\\n \\n\\r\\n “Oh, the rare old Whale, mid storm and gale\\r\\n In his ocean home will be\\r\\n A giant in might, where might is right,\\r\\n And King of the boundless sea.”\\r\\n —Whale Song.\\r\\n\\n\\n\\n\\n\\n \\n\\n\\n\\n\\n\\r\\n CHAPTER 1. Loomings.\\r\\n \\n\\r\\n Call me Ishmael. Some years ago—never mind how long precisely—having\\r\\n little or no money in my purse, and nothing particular to interest me on\\r\\n shore, I thought I would sail about a little and see the watery part of\\r\\n the world. It is a way I have of driving off the spleen and regulating the\\r\\n circulation. Whenever I find myself growing grim about the mouth; whenever\\r\\n it is a damp, drizzly November in my soul; whenever I find myself\\r\\n involuntarily pausing before coffin warehouses, and bringing up the rear\\r\\n of every funeral I meet; and especially whenever my hypos get such an\\r\\n upper hand of me, that it requires a strong moral principle to prevent me\\r\\n from deliberately stepping into the street, and methodically knocking\\r\\n people’s hats off—then, I account it high time to get to sea as soon\\r\\n as I can. This is my substitute for pistol and ball. With a philosophical\\r\\n flourish Cato throws himself upon his sword; I quietly take to the ship.\\r\\n There is nothing surprising in this. If they but knew it, almost all men\\r\\n in their degree, some time or other, cherish very nearly the same feelings\\r\\n towards the ocean with me.\\r\\n \\n\\r\\n Ther'"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"17"},"trusted":true},"source":"# Creating a BeautifulSoup object from the HTML\nsoup = BeautifulSoup(html, 'html.parser')\n\n# Getting the text out of the soup\ntext = soup.get_text()\n\n# Printing out text between characters 32000 and 34000\ntext[32000:34000]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"24"},"run_control":{"frozen":true},"deletable":false},"source":"## 4. Extract the words\nBeautifulSoup
. After creating the soup, we can use its.get_text()
method to extract the text.We now have the text of the novel! There is some unwanted stuff at the start and some unwanted stuff at the end. We could remove it, but this content is so much smaller in amount than the text of Moby Dick that, to a first approximation, it is okay to leave it in.
\nNow that we have the text of interest, it's time to count how many times each word appears, and for this we'll use
","cell_type":"markdown"},{"execution_count":8,"outputs":[{"output_type":"execute_result","execution_count":8,"data":{"text/plain":"['Moby', 'Dick', 'Or', 'the', 'Whale', 'by', 'Herman', 'Melville']"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"24"},"trusted":true},"source":"# Creating a tokenizer\ntokenizer = nltk.tokenize.RegexpTokenizer('\\w+')\n\n# Tokenizing the text\ntokens = tokenizer.tokenize(text)\n\n# Printing out the first 8 words / tokens \ntokens[:8]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"31"},"run_control":{"frozen":true},"deletable":false},"source":"## 5. Make the words lowercase\nnltk
– the Natural Language Toolkit. We'll start by tokenizing the text, that is, remove everything that isn't a word (whitespace, punctuation, etc.) and then split the text into a list of words.OK! We're nearly there. Note that in the above 'Or' has a capital 'O' and that in other places it may not, but both 'Or' and 'or' should be counted as the same word. For this reason, we should build a list of all words in Moby Dick in which all capital letters have been made lower case.
","cell_type":"markdown"},{"execution_count":10,"outputs":[{"output_type":"execute_result","execution_count":10,"data":{"text/plain":"['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"31"},"trusted":true},"source":"# A new list to hold the lowercased words\nwords = []\n\n# Looping through the tokens and make them lower case\nfor word in tokens:\n words.append(word.lower())\n\n# Printing out the first 8 words / tokens \nwords[:8]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"38"},"run_control":{"frozen":true},"deletable":false},"source":"## 6. Load in stop words\nIt is common practice to remove words that appear a lot in the English language such as 'the', 'of' and 'a' because they're not so interesting. Such words are known as stop words. The package
","cell_type":"markdown"},{"execution_count":12,"outputs":[{"output_type":"execute_result","execution_count":12,"data":{"text/plain":"['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"38"},"trusted":true},"source":"# Getting the English stop words from nltk\nsw = nltk.corpus.stopwords.words('english')\n\n# Printing out the first eight stop words\nsw[:8]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"45"},"run_control":{"frozen":true},"deletable":false},"source":"## 7. Remove stop words in Moby Dick\nnltk
includes a good list of stop words in English that we can use.We now want to create a new list with all
","cell_type":"markdown"},{"execution_count":14,"outputs":[{"output_type":"execute_result","execution_count":14,"data":{"text/plain":"['moby', 'dick', 'whale', 'herman', 'melville', 'body', 'background', 'faebd0']"},"metadata":{}}],"metadata":{"tags":["sample_code"],"dc":{"key":"45"},"trusted":true},"source":"# A new list to hold Moby Dick with No Stop words\nwords_ns = []\n\n# Appending to words_ns all words that are in words but not in sw\nfor word in words:\n if not word in sw:\n words_ns.append(word)\n\n# Printing the first 5 words_ns to check that stop words are gone\nwords_ns[:8]","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"52"},"run_control":{"frozen":true},"deletable":false},"source":"## 8. We have the answer\nwords
in Moby Dick, except those that are stop words (that is, those words listed insw
). One way to get this list is to loop over all elements ofwords
and add each word to a new list if they are not insw
.Our original question was:
\n\n\nWhat are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?
\nWe are now ready to answer that! Let's create a word frequency distribution plot using
","cell_type":"markdown"},{"execution_count":16,"outputs":[{"output_type":"display_data","metadata":{},"data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAagAAAEYCAYAAAAJeGK1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAIABJREFUeJztnXmcFNW1x78HEIZFGEQFV1QUFwwOBoREjHvUxGhixBh31ERjjBp8KhqNMfp8LkmMZtNE3DVR9BlxQ54KbrggMIAIiLugQkQGBRURzvvjVjE9zUxPV/ft7ts15/v51Ke7blf9+vZWp+85554rqophGIZhhEa7SnfAMAzDMJrDDJRhGIYRJGagDMMwjCAxA2UYhmEEiRkowzAMI0jMQBmGYRhBUlIDJSJjRGSRiMxs5rGzRWSNiGyQ0XadiMwXkXoRqctoP15EXhOReSJyXCn7bBiGYYRBqUdQNwMHZDeKyObA/sA7GW0HAf1UdTvgFOD6qL0n8GtgCDAUuFhEepS434ZhGEaFKamBUtVngaXNPHQNcE5W26HAbdF5LwI9RKQ3zsBNUNVlqtoATAAOLF2vDcMwjBAoewxKRA4B3lPVWVkPbQa8l7G/IGrLbl8YtRmGYRgppkM5n0xEOgMX4Nx7rR6eVH/bbbfV5cuXs2jRIgD69evH+uuvT319PQB1dS6sZfu2b/u2b/uV3e/duzfA2uu1qq57zVfVkm5AX2BmdH9n4EPgTeAtYBXwNrAxLub0o4zz5gK9gSOB6zPamxyX9Vzqi4svvjiVOj61QtPxqRWajk+ttOr41ApNx6dWaDqqqtG1e51rejlcfBJtqOorqtpHVbdR1a1xbrxBqroYGAccByAiw4AGVV0EPAbsLyI9ooSJ/aM2wzAMI8WUOs38LmAy0F9E3hWRkVmHKI3G6xHgLRF5HbgBOC1qXwpcCrwMvAhcoi5ZYh3iIaMPvvjii1Tq+NQKTcenVmg6PrXSquNTKzQdn1qh6eSipDEoVT2qlce3ydo/vYXjbgFuae35unXrlqB3uRk+fHgqdXxqhabjUys0HZ9aadXxqRWajk+t0HRyIZqi9aBERNP0egzDMNoCItJskoSVOjIMwzCCJFUGKk5j9EFDQ7NhrqrX8akVmo5PrdB0fGqlVcenVmg6PrVC08lFqgyUYRiGkR4sBmUYhmFUFItBGYZhGFVFqgyUxaDKqxWajk+t0HR8aqVVx6dWaDo+tULTyUWqDJRhGIaRHiwGZRiGYVQUi0EZhmEYVUWqDJTFoMqrFZqOT63QdHxqpVXHp1ZoOj61QtPJRaoMlGEYhpEeLAZlGIZhVBSLQRmGYRhVRaoMlMWgyqsVmo5PrdB0fGqlVcenVmg6PrVC08lFqgyUYRiGkR4sBmUYhmFUFItBGYZhGFVFqgyUxaDKqxWajk+t0HR8aqVVx6dWaDo+tULTyUWqDJRhGIaRHlIXg3rjDWWbbSrdE8MwDCNf2kwM6u67K90DwzAMwwepMlB1dXXcc48frdD8tObLLq9WaDo+tdKq41MrNB2fWqHp5CJVBgqgvh5ee63SvTAMwzCKJXUxKFAuuwx+9atK98YwDMPIh4rEoERkjIgsEpGZGW1XicgcEakXkftEpHvGY+eLyPzo8W9ntB8oInNF5DUROa+15/Xl5jMMwzAqR6ldfDcDB2S1TQAGqGodMB84H0BEdgKOAHYEDgL+Ko52wJ8jnQHAj0Vkh+aerK6ujh49YOZMmDu3uI6H5qc1X3Z5tULT8amVVh2fWqHp+NQKTScXJTVQqvossDSr7XFVXRPtvgBsHt0/BPiXqn6lqm/jjNdu0TZfVd9R1VXAv4BDW3rOH/zA3Y4d6+91GIZhGOWn0kkSJwKPRPc3A97LeGxh1JbdviBqW4f6+nqOOMLdL9ZA1dbWFicQqI5PrdB0fGqFpuNTK606PrVC0/GpFZpOLjqU/BlaQER+BaxS1X/60uzXrx9PPjmaTp1qmDULbrxxMIcfPnztGxkPSW3f9m3f9m2/cvuTJk1i/PjxANTU1NAiqlrSDegLzMxqOwF4DuiU0TYaOC9jfzwwFBgGjG/puMytrq5OVVVHjlQF1Usu0YJZunRp4ScHrONTKzQdn1qh6fjUSquOT63QdHxqhaajqupM0brX9HK4+CTa3I7IgcA5wCGqujLjuHHAkSLSUUS2BrYFXgKmANuKSF8R6QgcGR3bIrGbz7L5DMMwqpeSzoMSkbuAvYBewCLgYuACoCOwJDrsBVU9LTr+fOAkYBVwpqpOiNoPBK7FxczGqOoVLTyfqiqrVkHv3rB0KbzyCgwYULKXaBiGYRRJS/OgUjdRN349J58MY8bAxRfDb35T2X4ZhmEYLdMmisVmrgeV6eYrxAaHNlfA5lOUVys0HZ9aadXxqRWajk+t0HRykSoDlcnee0OvXjBnDsyeXeneGIZhGElJrYsP4Kc/hX/8A379a7jkkgp2zDAMw2iRNuHiy2bECHdbqJvPMAzDqBypMlCZMShodPPNneuy+ZIQmp/WfNnl1QpNx6dWWnV8aoWm41MrNJ1cpMpAZdOhA/zwh+6+zYkyDMOoLlIdgwJ44gnYbz/o39+NpGQdL6dhGIZRSdrcPKiYr76CTTeF//zHrba7yy4V6pxhGIbRLG0iSSI7BgWFu/lC89OaL7u8WqHp+NRKq45PrdB0fGqFppOLVBmolih20q5hGIZRflLv4gNYvdq5+RYvhmnTYNCgCnTOMAzDaJY24eJrifbtG918ttKuYRhGdZAqA9VcDComqZsvND+t+bLLqxWajk+ttOr41ApNx6dWaDq5SJWBysUee7glON54A6ZPr3RvDMMwjNZoEzGomNNPh7/8Bc47D65odkUpwzAMo9y06RhUjGXzGYZhVA+pMlC5YlAAu+8Om2wCb70FU6fm1grNT2u+7PJqhabjUyutOj61QtPxqRWaTi5SZaBao317OPxwd99q8xmGYYRNm4pBATzzDHzrW9C3rxtJWW0+wzCMymIxqIjYzffOOzBlSqV7YxiGYbREqgxUazEogHbtGhcyzDVpNzQ/rfmyy6sVmo5PrbTq+NQKTcenVmg6uUiVgcoXy+YzDMMInzYXgwJYswa23BIWLoQXXoChQ8vQOcMwDKNZLAaVQaabz7L5DMMwwiRVBiqfGFRM7OYbO9aNqLIJzU9rvuzyaoWm41MrrTo+tULT8akVmk4uSmqgRGSMiCwSkZkZbT1FZIKIzBORx0SkR8Zj14nIfBGpF5G6jPbjReS16JzjfPRt6FDYYgt47z148UUfioZhGIZPShqDEpHhwHLgNlUdGLVdCSxR1atE5Dygp6qOFpGDgNNV9bsiMhS4VlWHiUhP4GVgV0CAqcCuqrqsmefLKwYVM2oUXHMNnHWWuzUMwzDKT0ViUKr6LLA0q/lQ4Nbo/q3Rftx+W3Tei0APEekNHABMUNVlqtoATAAO9NG/1tx8hmEYRuWoRAxqY1VdBKCqHwK9o/bNgPcyjls
QtWW3L4za1iFJDAqcmy/O5nv++aaPheanNV92ebVC0/GplVYdn1qh6fjUCk0nFx1K/gyt05JPLnERou7duzN69GhqamoAGDx4MMOHD6e2thZofEPj/WXLGjj1VLjgglrGjoUBA5o+nn18IfvLly/3qudjP6ZYveXLlwfVnxDfb5/9Ce39Dq0/9vlXz+c/adIkxo8fD7D2et0cJZ8HJSJ9gQczYlBzgL1UdZGI9AEmquqOInJ9dP/u6Li5wJ7A3tHxp0btTY7Leq5EMSiAl15yI6lNN3UJE+1SlddoGIYRPpWcByU0HQ2NA06I7p8APJDRfhyAiAwDGiJX4GPA/iLSI0qY2D9q88KQIa5w7Pvvw+TJvlQNwzCMYil1mvldwGSgv4i8KyIjgStwBmcesE+0j6o+ArwlIq8DNwCnRe1LgUtxmXwvApdEyRLrkDQG5frYtPRRTGh+WvNll1crNB2fWmnV8akVmo5PrdB0clHSGJSqHtXCQ/u1cPzpLbTfAtzip1frcsQRcPXVcO+9Lt28fftSPZNhGIaRL22yFl82qrDNNvD22/DUU269KMMwDKM8WC2+HLTk5jMMwzAqR6oMVCExqJjYQN17L6xeHZ6f1nzZ5dUKTcenVlp1fGqFpuNTKzSdXKTKQBXDrrs6N9+iRW5ZeMMwDKOyWAwqg/PPhyuugNNOg7/8xWPHDMMwjBZpKQZlBiqD6dPdSGrjjd28KMvmMwzDKD1tIkmimBiUOx+23RYWL4annw7LT2u+7PJqhabjUyutOj61QtPxqRWaTi5SZaCKJTObb+LEyvbFMAyjrWMuvixmzoRddoGePZ2bL0cdQ8MwDMMDbcLF54OBA10caulSuP/+SvfGMAyj7ZIqA1VsDCrmpJOgrq6BG28sXitEf29ofbLXVl6ttOr41ApNx6dWaDq5SJWB8sVRR0HHjvDkk/Dmm5XujWEYRtvEYlAtcOyxcMcdcOGFcOmlXiQNwzCMZrAYVEJOOsnd3nKLK31kGIZhlJdUGShfMSiAXXZpoF8/WLAAJkwoXCdEf29ofbLXVl6ttOr41ApNx6dWaDq5SGygRKSniAwsRWdCQgROPNHdHzOmsn0xDMNoi+QVgxKRScAhuAUOpwKLgedUdVRJe5cQnzEogIULYcstXcmjBQtcCSTDMAzDL8XGoHqo6ifAYcBtqjqUFlbFTRObbQbf+Q6sWgW3317p3hiGYbQt8jVQHURkE+AI4KES9qcofMagYv9qnCwxZoxbebdQHV/9CUkrNB2fWqHp+NRKq45PrdB0fGqFppOLfA3UJcBjwOuqOkVEtgHml65b4fDd70Lv3jBnDrzwQqV7YxiG0XbINwa1u6o+11pbpfEdg4o591y4+mo3mvJRXcIwDMNopKj1oERkmqru2lpbpSmVgZo7F3bcEbp1gw8+cLeGYRiGHwpKkhCRb4jI2cBGIjIqY/sNENxyfqWIQQHssAPsvjssXw733FO4jq/+hKIVmo5PrdB0fGqlVcenVmg6PrVC08lFazGojkA3XHr5+hnbJ8Dhpe1aWJx8sru1OVGGYRjlIV8XX19VfacM/SmKUrn4AFasgE02gU8/hVdfdS4/wzAMo3iKnQfVSUT+LiITROTJeCuyQ78UkVdEZKaI3CkiHUVkKxF5QUReE5F/ikiH6NiOIvIvEZkvIs+LyJbFPHchdO0KRx7p7tsoyjAMo/Tka6DGAtOBC4FzMraCEJFNgV8Au6rqQJwL8cfAlcDvVbU/0ABEs5A4CfhYVbcD/ghc1ZxuqWJQMfGcqNtugy+/LFzHV38qrRWajk+t0HR8aqVVx6dWaDo+tULTyUW+BuorVf2bqr6kqlPjrcjnbg90jUZJnYH3gb2B+6LHbwW+H90/NNoHuBfYt8jnLojddoMBA+A//4GHgp2ubBiGkQ7yjUH9Bld/735gZdyuqh8X/MQiZwD/DXwGTADOAp6PRk+IyObAI6o6UERmAQeo6vvRY/OBodnPX8oYVMw118CoUa4E0sMPl/SpDMMw2gQtxaA65Hn+8dFtpltPgW0K7EwtblTUF1iGcyEemESiucZ+/foxevRoampqABg8eDDDhw+ntrYWaBySFrN/2GFw3nm1jB8P8+c3sNFGxenZvu3bvu23tf1JkyYxfvx4gLXX62ZR1bJvuBT1f2TsHwv8FTdKaxe1DQMeje6Px42YwLkGFzenW1dXp75YunRpi4+NGKEKqpddVpyOr/5USis0HZ9aoen41Eqrjk+t0HR8aoWmo6rqTNG61/S8YlAiclxzWz7ntsC7wDARqRERwcWUZgMTgRHRMccDD0T3x9E4ihsBFJVBWCyZBWTXrKlkTwzDMNJLvjGoP2Xs1uAMyjRVLXiyrohcDBwJrMJlCJ4MbA78C+gZtR2jqqtEpBNwOzAIWAIcqapvN6Op+byeYlm9GrbeGt57D554AvbZp+RPaRiGkVqKqsXXjFgt8C9VTRI3KjnlMlAAF18Mv/0tHHUU3HlnWZ7SMAwjlRQ7UTebFcDWxXXJP6WeB5XJyJFuWfj77oOlSwvX8dWfSmiFpuNTKzQdn1pp1fGpFZqOT63QdHKRbwzqQREZF20PA/NwKedtlq22gv32g5Ur4a67Kt0bwzCM9JFvDGrPjN2vgHdUdUHJelUg5XTxAdx9tyt/NGgQTJtWtqc1DMNIFUXHoESkNzAk2n1JVRd77J8Xym2gVq6ETTeFjz+GqVNh16BWxzIMw6gOiopBicgRwEu4FO8jgBdFJLjlNsoZgwLo1AmOOcbdb6mAbIj+3tD6ZK+tvFpp1fGpFZqOT63QdHKRb5LEr4Ahqnq8qh4H7AZcVLpuVQ/xnKg774TPP69sXwzDMNJEvjGoWar6tYz9dsCMzLYQKLeLL2bIEHj5ZbjjDjj66LI/vWEYRlVTbJr5eBF5TEROEJETgIeBR3x2sJqx1XYNwzD8k9NAici2IrK7qp4D3AAMjLbngb+XoX+JKHcMKubII6FzZ5g4Ed54o3AdX/0pl1ZoOj61QtPxqZVWHZ9aoen41ApNJxetjaD+CHwCoKr/q6qjVHUUbg7UH0vduWqhRw8YEVUQvPnmyvbFMAwjLeSMQYnIFFUd0sJjsywG1cjTT8Oee8Jmm8Hbb0OHfBcyMQzDaOMUGoOqzfFY5+K6lC722AO22w4WLoTHHqt0bwzDMKqf1gzUyyLyk+xGETkZKHbJd+9UKgYFri7fiSe6+5nJEiH6e0Prk7228mqlVcenVmg6PrVC08lFa46os4D7ReRoGg3SYKAj8INSdqwaOf54uPBCePBBWLQIeveudI8MwzCql3znQe0N7BztzlbVii4Y2BKVjEHFHHoojBsHV18N//VfFe2KYRhGVeB1PahQCcFAjRvnjNQOO8CrrzrXn2EYhtEyvteDCpJKxqBivvMd6NMH5s6F558P098bWp/stZVXK606PrVC0/GpFZpOLlJloEKgQwcXiwKrLGEYhlEM5uIrAa+9BttvD127wgcfwPrrV7pHhmEY4dImXHyh0L+/mxe1YoVb1NAwDMNITqoMVAgxqJi4gOzjjz
[...base64-encoded PNG truncated: frequency distribution plot of the 20 most common words in Moby Dick...]","text/plain":"<frequency distribution plot of the 20 most common words>"}}],"metadata":{"tags":["sample_code"],"dc":{"key":"52"},"trusted":true},"source":"# This command displays figures inline\n%matplotlib inline\n\n# Creating the word frequency distribution\nfreqdist = nltk.FreqDist(words_ns)\n\n# Plotting the frequency distribution of the 20 most common words\nfreqdist.plot(20)","cell_type":"code"},{"metadata":{"tags":["context"],"editable":false,"dc":{"key":"59"},"run_control":{"frozen":true},"deletable":false},"source":"## 9. The most common word\nNice! The frequency distribution plot above is the answer to our question.
\nThe natural language processing skills we used in this notebook are also applicable to much of the data that data scientists encounter, since the vast majority of the world's data is unstructured and includes a great deal of text.
\nSo, which word (not surprisingly) turned out to be the most common in Moby Dick?
","cell_type":"markdown"},{"execution_count":18,"outputs":[],"metadata":{"tags":["sample_code"],"dc":{"key":"59"},"trusted":true,"collapsed":true},"source":"# What's the most common word in Moby Dick?\nmost_common_word = 'whale'","cell_type":"code"}],"nbformat":4,"metadata":{"language_info":{"codemirror_mode":{"name":"ipython","version":3},"nbconvert_exporter":"python","file_extension":".py","mimetype":"text/x-python","pygments_lexer":"ipython3","name":"python","version":"3.5.2"},"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"}}} --------------------------------------------------------------------------------