├── .gitignore
├── Android Apps Google Play
    ├── Dataset
    │   ├── apps.csv
    │   └── user_reviews.csv
    └── Google_Play.ipynb
├── Nyc Airbnb
    ├── Exploring Airbnb Market Trends.ipynb
    └── datasets
    │   ├── airbnb_last_review.tsv
    │   ├── airbnb_price.csv
    │   └── airbnb_room_type.xlsx
├── README.md
├── Word Frequency in Classic Novels
    └── Word Frequency in Classic Novels.ipynb
└── codewello-banner.png

/.gitignore:
--------------------------------------------------------------------------------
1 | text-readme.txt

--------------------------------------------------------------------------------
/Nyc Airbnb/Exploring Airbnb Market Trends.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyO2OMYcvsNrWwnc8swuO0de"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# NYC Airbnb\n","\n","Welcome to NYC, a top tourist spot. Many Airbnb options cater to short or long stays. This notebook explores NYC's Airbnb scene using data from .csv, .tsv, and .xlsx files.\n","\n","\n","**The goals of this project are to answer these questions:**\n","\n","* What is the average price, per night, of an Airbnb listing in NYC?\n","* How does the average price of an Airbnb listing, per month, compare to the private rental market?\n","* How many adverts are for private rooms?\n","* How do Airbnb listing prices compare across the five NYC boroughs?\n","\n","\n","**We will be working with three datasets:**\n","\n","\"datasets/airbnb_price.csv\"\n","\n","\"datasets/airbnb_room_type.xlsx\"\n","\n","\"datasets/airbnb_last_review.tsv\"\n","\n"],"metadata":{"id":"o2plAFSi0QJU"}},{"cell_type":"markdown","source":["# 1. 
Importing the data\n","\n","* Load \"datasets/airbnb_price.csv\" as a DataFrame called prices.\n","* Load \"datasets/airbnb_room_type.xlsx\" as an ExcelFile object called xls, and the first sheet from xls as a DataFrame called room_types.\n","* Load \"datasets/airbnb_last_review.tsv\" as a DataFrame called reviews."],"metadata":{"id":"_w4R7srY3_dN"}},{"cell_type":"code","execution_count":1,"metadata":{"id":"rP3jSI7h0DE2","executionInfo":{"status":"ok","timestamp":1700486267783,"user_tz":-120,"elapsed":650,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}}},"outputs":[],"source":["import numpy as np\n","import pandas as pd\n","import datetime as dt"]},{"cell_type":"code","source":["# prices df\n","prices = pd.read_csv(\"/content/airbnb_price.csv\")\n","\n","# xls room types df\n","xls = pd.ExcelFile(\"/content/airbnb_room_type.xlsx\")\n","room_types = xls.parse(0)\n","\n","# reviews df (tab-separated file)\n","reviews = pd.read_csv(\"/content/airbnb_last_review.tsv\", sep=\"\\t\")\n","\n","print(\n"," f\"prices: {prices.head()}\",\n"," \"\\n\",\n"," f\"room_types: {room_types.head()}\",\n"," \"\\n\",\n"," f\"reviews: {reviews.head()}\"\n",")"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OCX7pAvD4R37","executionInfo":{"status":"ok","timestamp":1700487667541,"user_tz":-120,"elapsed":2038,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"4b854687-802f-46d2-cbc9-0f09dc4003c3"},"execution_count":9,"outputs":[{"output_type":"stream","name":"stdout","text":["prices: listing_id price nbhood_full\n","0 2595 225 dollars Manhattan, Midtown\n","1 3831 89 dollars Brooklyn, Clinton Hill\n","2 5099 200 dollars Manhattan, Murray Hill\n","3 5178 79 dollars Manhattan, Hell's Kitchen\n","4 5238 150 dollars Manhattan, Chinatown \n"," room_types: Unnamed: 0 listing_id room_type number_of_reviews\n","0 0 2595 Entire home/apt 48\n","1 1 3831 Entire home/apt 295\n","2 2 5099 Entire home/apt 78\n","3 3 5121 Private room 49\n","4 4 5178 Private room 454 \n"," reviews: 
listing_id host_name last_review\n","0 2595 Jennifer May 21 2019\n","1 3831 LisaRoxanne July 05 2019\n","2 5099 Chris June 22 2019\n","3 5178 Shunichi June 24 2019\n","4 5238 Ben June 09 2019\n"]}]},{"cell_type":"markdown","source":["# 2. Cleaning the price column\n","\n","Now that the DataFrames have been loaded, the first step is to calculate the average price per listing by room_type.\n","\n","You may have noticed that the price column in the prices DataFrame currently stores each value as a string followed by the currency (dollars), e.g.,\n","\n","price\n","225 dollars\n","89 dollars\n","200 dollars\n","\n","We will need to clean the column in order to calculate the average price."],"metadata":{"id":"MhaA-LvK5pah"}},{"cell_type":"code","source":["# Remove whitespace and string characters from prices column\n","# (run once on the freshly loaded data; the column is a string at this point)\n","prices[\"price\"] = prices[\"price\"].str.replace(\" dollars\", \"\")\n","\n","# Convert prices column to numeric datatype\n","prices[\"price\"] = pd.to_numeric(prices[\"price\"])\n","\n","# Print head\n","print(prices[\"price\"].head())\n","\n","# Print descriptive statistics for the price column\n","print(prices[\"price\"].describe())"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"PGPxXlgE5o42","executionInfo":{"status":"ok","timestamp":1700487934695,"user_tz":-120,"elapsed":481,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"7a2de068-9970-46cf-ae4c-91755512738c"},"execution_count":13,"outputs":[{"output_type":"stream","name":"stdout","text":["0 225\n","1 89\n","2 200\n","3 79\n","4 150\n","Name: price, dtype: int64\n","count 25209.000000\n","mean 141.777936\n","std 147.349137\n","min 0.000000\n","25% 69.000000\n","50% 105.000000\n","75% 175.000000\n","max 7500.000000\n","Name: price, dtype: float64\n"]}]},{"cell_type":"markdown","source":["# 3. 
Calculating average price\n","\n","We can see three quarters of listings cost $175 per night or less.\n","\n","However, there are some outliers, including a maximum price of $7,500 per night!\n","\n","Some listings are actually showing as free. Let's remove these from the DataFrame and calculate the average price."],"metadata":{"id":"IbhiH52L-uxz"}},{"cell_type":"code","source":["prices.info()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"kUFGp-WD-1qZ","executionInfo":{"status":"ok","timestamp":1700487983616,"user_tz":-120,"elapsed":303,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"cccd3ed1-778d-411a-ce28-a9eb2e7904e4"},"execution_count":14,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","RangeIndex: 25209 entries, 0 to 25208\n","Data columns (total 3 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 listing_id 25209 non-null int64 \n"," 1 price 25209 non-null int64 \n"," 2 nbhood_full 25209 non-null object\n","dtypes: int64(2), object(1)\n","memory usage: 591.0+ KB\n"]}]},{"cell_type":"code","source":["# Build a boolean mask, free_listings, that is True for listings costing $0\n","free_listings = prices[\"price\"] == 0\n","print(type(free_listings))\n","print(free_listings.shape)\n","\n","# Update prices by removing all free listings from prices\n","# Similar to SQL's concept of \"NOT IN\"\n","prices = prices.loc[~free_listings]\n","\n","# Calculate the average price, rounded to 2 decimal places, avg_price\n","avg_price = round(prices[\"price\"].mean(), 2)\n","\n","# Print the average price\n","print(\"The average price per night for an Airbnb listing in NYC is ${}.\".format(avg_price))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"cB3pHCIr_CVa","executionInfo":{"status":"ok","timestamp":1700488032347,"user_tz":-120,"elapsed":3,"user":{"displayName":"Sam 
Eldin","userId":"17811646166394375086"}},"outputId":"673283b6-f452-440f-86d1-7c269cb1fd24"},"execution_count":15,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","(25209,)\n","The average price per night for an Airbnb listing in NYC is $141.82.\n"]}]},{"cell_type":"markdown","source":["#4. Comparing costs to the private rental market\n","\n","Now we know how much a listing costs, on average, per night, but it would be useful to have a benchmark for comparison. According to Zumper, a 1 bedroom apartment in New York City costs, on average, $3,100 per month. Let's convert the per night prices of our listings into monthly costs, so we can compare to the private market.\n","\n"],"metadata":{"id":"n5dOjJB1_KHM"}},{"cell_type":"code","source":[],"metadata":{"id":"NA2gfKbQ_3OO"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["prices.head()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"P394VJ4E_NQG","executionInfo":{"status":"ok","timestamp":1700488111462,"user_tz":-120,"elapsed":283,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"256180d1-9c3d-4805-bc11-f951d8383027"},"execution_count":17,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" listing_id price nbhood_full\n","0 2595 225 Manhattan, Midtown\n","1 3831 89 Brooklyn, Clinton Hill\n","2 5099 200 Manhattan, Murray Hill\n","3 5178 79 Manhattan, Hell's Kitchen\n","4 5238 150 Manhattan, Chinatown"],"text/html":["\n","
\n"]},"metadata":{},"execution_count":17}]},{"cell_type":"code","source":["# Add a new column to the prices DataFrame, price_per_month\n","prices['price_per_month'] = prices['price'] * 365 / 12\n","#print((prices[\"price_per_month\"]))\n","\n","average_price_per_month = round(prices['price_per_month'].mean(),2)\n","\n","# Compare Airbnb and rental market\n","difference = round(average_price_per_month - 3100,2)\n","print(\"Airbnb monthly costs are ${}, while in the private market you would pay {}.\".format(average_price_per_month, \"$3,100.00\"))\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"QLdMws76_oEG","executionInfo":{"status":"ok","timestamp":1700488378955,"user_tz":-120,"elapsed":338,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"3c7e6d23-00b8-4735-cb6c-a3abfee6d00c"},"execution_count":25,"outputs":[{"output_type":"stream","name":"stdout","text":["Airbnb monthly costs are $4313.61, while in the private market you would pay $3,100.00.\n"]}]},{"cell_type":"markdown","source":["#5. Cleaning the room type column\n","\n","Unsurprisingly, using Airbnb appears to be substantially more expensive than the private rental market. We should, however, consider that these Airbnb listings include single private rooms or even rooms to share, as well as entire homes/apartments.\n","\n","Let's dive deeper into the room_type column to find out the breakdown of listings by type of room. 
The room_type column has several variations for private room listings, specifically:\n","\n","\"Private room\"\n","\"private room\"\n","\"PRIVATE ROOM\"\n","We can solve this by converting all string characters to lower case (upper case would also work just fine)."],"metadata":{"id":"cUhfbvjVAhj7"}},{"cell_type":"code","source":["room_types.info()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"HfVMXMxd_syd","executionInfo":{"status":"ok","timestamp":1700488434592,"user_tz":-120,"elapsed":314,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"2900135f-4147-4357-ed88-463f5b6ed6b3"},"execution_count":26,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","RangeIndex: 17614 entries, 0 to 17613\n","Data columns (total 4 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 Unnamed: 0 17614 non-null int64 \n"," 1 listing_id 17614 non-null int64 \n"," 2 room_type 17614 non-null object\n"," 3 number_of_reviews 17614 non-null int64 \n","dtypes: int64(3), object(1)\n","memory usage: 550.6+ KB\n"]}]},{"cell_type":"code","source":["room_types.head()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"tEwrtAKdAn8W","executionInfo":{"status":"ok","timestamp":1700488447518,"user_tz":-120,"elapsed":11,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"95e0162d-fadd-4ea4-c178-913bd9879178"},"execution_count":27,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" Unnamed: 0 listing_id room_type number_of_reviews\n","0 0 2595 Entire home/apt 48\n","1 1 3831 Entire home/apt 295\n","2 2 5099 Entire home/apt 78\n","3 3 5121 Private room 49\n","4 4 5178 Private room 454"],"text/html":["\n","
\n"]},"metadata":{},"execution_count":27}]},{"cell_type":"code","source":["# Convert the room_type column to lowercase\n","room_types[\"room_type\"] = room_types[\"room_type\"].str.lower()\n","\n","# Update the room_type column to category data type\n","# https://pandas.pydata.org/docs/user_guide/categorical.html\n","room_types[\"room_type\"] = room_types[\"room_type\"].astype(\"category\")\n","\n","# Create the variable room_frequencies\n","room_frequencies = room_types[\"room_type\"].value_counts()\n","\n","# Print room_frequencies\n","print(room_frequencies)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"gi-3nTG4A0aD","executionInfo":{"status":"ok","timestamp":1700488499966,"user_tz":-120,"elapsed":5,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"7823f8e6-b827-41eb-b0da-8ae00ea2c82a"},"execution_count":28,"outputs":[{"output_type":"stream","name":"stdout","text":["entire home/apt 9405\n","private room 7752\n","shared room 357\n","hotel room 100\n","Name: room_type, dtype: int64\n"]}]},{"cell_type":"markdown","source":["#6. What timeframe are we working with?"],"metadata":{"id":"qtAt6z4PA4Rs"}},{"cell_type":"markdown","source":["It seems there is a fairly similar-sized market opportunity for both private rooms (about 44% of the 17,614 listings counted above) and entire homes/apartments (about 53%) on the Airbnb platform in NYC.\n","\n","\n","Now let's turn our attention to the reviews DataFrame. The last_review column contains the date of the last review in the format of \"Month Day Year\" e.g., May 21 2019. 
We've been asked to find out the earliest and latest review dates in the DataFrame, and ensure the format allows this analysis to be easily conducted going forwards."],"metadata":{"id":"FJ6CfBCgA8ik"}},{"cell_type":"code","source":["reviews.head()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"RIy4JcQ4A9xI","executionInfo":{"status":"ok","timestamp":1700488540827,"user_tz":-120,"elapsed":304,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"6c61706b-199d-470b-fbaa-490d06134f8f"},"execution_count":29,"outputs":[{"output_type":"execute_result","data":{"text/plain":[" listing_id host_name last_review\n","0 2595 Jennifer May 21 2019\n","1 3831 LisaRoxanne July 05 2019\n","2 5099 Chris June 22 2019\n","3 5178 Shunichi June 24 2019\n","4 5238 Ben June 09 2019"],"text/html":["\n","
\n"]},"metadata":{},"execution_count":29}]},{"cell_type":"code","source":["reviews.info()\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"4MaADt50BCOe","executionInfo":{"status":"ok","timestamp":1700488555551,"user_tz":-120,"elapsed":310,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"38742eec-66fa-43c7-931a-f397ff507750"},"execution_count":30,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","RangeIndex: 25209 entries, 0 to 25208\n","Data columns (total 3 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 listing_id 25209 non-null int64 \n"," 1 host_name 25201 non-null object\n"," 2 last_review 25209 non-null object\n","dtypes: int64(1), object(2)\n","memory usage: 591.0+ KB\n"]}]},{"cell_type":"code","source":["# Change the data type of the last_review column to datetime\n","# https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html\n","reviews[\"last_review\"] = pd.to_datetime(reviews[\"last_review\"])\n","print(type(reviews[\"last_review\"]))\n","\n","# Create first_reviewed, the earliest review date\n","first_reviewed = reviews[\"last_review\"].dt.date.min()\n","\n","# Create last_reviewed, the most recent review date\n","last_reviewed = reviews[\"last_review\"].dt.date.max()\n","\n","# Print the oldest and newest reviews from the DataFrame\n","print(\"The latest Airbnb review is {}, the earliest review is {}\".format(last_reviewed, first_reviewed))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"EhZ3tRPWBFUn","executionInfo":{"status":"ok","timestamp":1700488580953,"user_tz":-120,"elapsed":265,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"9bddab49-5d0c-486a-9255-cb35806dcd51"},"execution_count":31,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","The latest Airbnb review is 2019-07-09, the earliest review is 
2019-01-01\n"]}]},{"cell_type":"code","source":["print(reviews.dtypes[\"last_review\"])\n","reviews[\"last_review\"].head()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"xdWoN_wJBMW5","executionInfo":{"status":"ok","timestamp":1700488597535,"user_tz":-120,"elapsed":276,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"0464fc25-8d1b-44bc-f887-3fb0c13eb1a9"},"execution_count":32,"outputs":[{"output_type":"stream","name":"stdout","text":["datetime64[ns]\n"]},{"output_type":"execute_result","data":{"text/plain":["0 2019-05-21\n","1 2019-07-05\n","2 2019-06-22\n","3 2019-06-24\n","4 2019-06-09\n","Name: last_review, dtype: datetime64[ns]"]},"metadata":{},"execution_count":32}]},{"cell_type":"markdown","source":["# 7. Joining the DataFrames."],"metadata":{"id":"C1FTilFEBOkQ"}},{"cell_type":"code","source":["# Merge prices and room_types to create rooms_and_prices\n","# https://pandas.pydata.org/docs/user_guide/merging.html\n","rooms_and_prices = pd.merge(prices, room_types,\n"," how=\"outer\",\n"," on=\"listing_id\")\n","\n","# Merge rooms_and_prices with the reviews DataFrame to create airbnb_merged\n","airbnb_merged = pd.merge(rooms_and_prices, reviews,\n"," how=\"outer\",\n"," on=\"listing_id\")\n","\n","# Drop missing values from airbnb_merged\n","airbnb_merged.dropna(inplace=True)\n","\n","# Check if there are any duplicate values\n","print(\"There are {} duplicates in the DataFrame.\".format(airbnb_merged.duplicated().sum()))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"AiGOY5suBSel","executionInfo":{"status":"ok","timestamp":1700488638839,"user_tz":-120,"elapsed":287,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"35dfd87e-b34a-4b0a-b940-c684ae0da7cb"},"execution_count":33,"outputs":[{"output_type":"stream","name":"stdout","text":["There are 0 duplicates in the 
DataFrame.\n"]}]},{"cell_type":"code","source":["print(airbnb_merged.info())\n","airbnb_merged.head()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":621},"id":"DHdAXXDlBZjh","executionInfo":{"status":"ok","timestamp":1700488651599,"user_tz":-120,"elapsed":15,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"c6c7f5b6-ecd8-486e-d793-5baf35ad5534"},"execution_count":34,"outputs":[{"output_type":"stream","name":"stdout","text":["\n","Int64Index: 13663 entries, 0 to 19192\n","Data columns (total 9 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 listing_id 13663 non-null int64 \n"," 1 price 13663 non-null float64 \n"," 2 nbhood_full 13663 non-null object \n"," 3 price_per_month 13663 non-null float64 \n"," 4 Unnamed: 0 13663 non-null float64 \n"," 5 room_type 13663 non-null category \n"," 6 number_of_reviews 13663 non-null float64 \n"," 7 host_name 13663 non-null object \n"," 8 last_review 13663 non-null datetime64[ns]\n","dtypes: category(1), datetime64[ns](1), float64(4), int64(1), object(2)\n","memory usage: 974.2+ KB\n","None\n"]},{"output_type":"execute_result","data":{"text/plain":[" listing_id price nbhood_full price_per_month Unnamed: 0 \\\n","0 2595 225.0 Manhattan, Midtown 6843.750000 0.0 \n","1 3831 89.0 Brooklyn, Clinton Hill 2707.083333 1.0 \n","2 5099 200.0 Manhattan, Murray Hill 6083.333333 2.0 \n","3 5178 79.0 Manhattan, Hell's Kitchen 2402.916667 4.0 \n","4 5238 150.0 Manhattan, Chinatown 4562.500000 6.0 \n","\n"," room_type number_of_reviews host_name last_review \n","0 entire home/apt 48.0 Jennifer 2019-05-21 \n","1 entire home/apt 295.0 LisaRoxanne 2019-07-05 \n","2 entire home/apt 78.0 Chris 2019-06-22 \n","3 private room 454.0 Shunichi 2019-06-24 \n","4 entire home/apt 161.0 Ben 2019-06-09 "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":34}]},{"cell_type":"markdown","source":["# 8. Analyzing listing prices by NYC borough"],"metadata":{"id":"VXk7iYeIBjSz"}},{"cell_type":"code","source":["# Extract information from the nbhood_full column and store as a new column, borough\n","# Either use `.str.partition()` or `.str.split()`\n","airbnb_merged[\"borough\"] = airbnb_merged[\"nbhood_full\"].str.partition(\",\", expand=True)[0]\n","\n","# Group by borough and calculate summary statistics\n","boroughs = airbnb_merged.groupby(\"borough\")[\"price\"].agg([\"sum\", \"mean\", \"median\", \"count\"])\n","\n","# Round boroughs to 2 decimal places, and sort by mean in descending order\n","boroughs = boroughs.round(2).sort_values(\"mean\", ascending=False)\n","\n","# Print boroughs\n","print(boroughs)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"oSVIOcycBoCi","executionInfo":{"status":"ok","timestamp":1700488734894,"user_tz":-120,"elapsed":273,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"932fe0fb-6a7b-4c2a-ea36-e3a26aeee394"},"execution_count":35,"outputs":[{"output_type":"stream","name":"stdout","text":[" sum mean median count\n","borough \n","Manhattan 893869.0 171.70 139.0 5206\n","Brooklyn 742816.0 123.02 100.0 6038\n","Queens 174429.0 92.73 73.0 1881\n","Staten Island 13439.0 86.15 70.5 156\n","Bronx 30954.0 81.03 63.5 382\n"]}]},{"cell_type":"markdown","source":["# 9. Price range by borough\n","\n","The above output gives us a summary of prices for listings across the 5 boroughs. In this final task we would like to categorize listings based on whether they fall into specific price ranges, and view this by borough.\n","\n","We can do this using percentiles and labels to create a new column, price_range, in the DataFrame. 
Once we have created the labels, we can then group the data and count frequencies for listings in each price range by borough.\n","\n","We will assign the following categories and price ranges (per night, taken from the bins in the code below):\n","\n","* Budget: \\$0-69\n","* Average: \\$70-175\n","* Expensive: \\$176-350\n","* Extravagant: more than \\$350"],"metadata":{"id":"rh5Ne8ypB4Cz"}},{"cell_type":"code","source":["# Create labels for the price range, label_names\n","label_names = [\"Budget\", \"Average\", \"Expensive\", \"Extravagant\"]\n","\n","# Create the label ranges, ranges\n","ranges = [0, 69, 175, 350, np.inf]\n","\n","# Insert new column, price_range, into DataFrame\n","# Use `pd.cut` to segment and sort data values into bins\n","# Useful for going from a continuous variable to a categorical variable\n","airbnb_merged[\"price_range\"] = pd.cut(airbnb_merged[\"price\"], bins=ranges, labels=label_names)\n","\n","# Calculate borough and price_range frequencies, prices_by_borough\n","prices_by_borough = airbnb_merged.groupby([\"borough\", \"price_range\"])[\"price_range\"].agg(\"count\")\n","print(prices_by_borough)"],"metadata":{"id":"iqBzIdSLB9jo","executionInfo":{"status":"ok","timestamp":1700488799700,"user_tz":-120,"elapsed":9,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"9eac1d87-edd7-4707-9fed-226228d7bd8a","colab":{"base_uri":"https://localhost:8080/"}},"execution_count":36,"outputs":[{"output_type":"stream","name":"stdout","text":["borough price_range\n","Bronx Budget 209\n"," Average 155\n"," Expensive 14\n"," Extravagant 4\n","Brooklyn Budget 1697\n"," Average 3324\n"," Expensive 888\n"," Extravagant 129\n","Manhattan Budget 590\n"," Average 2868\n"," Expensive 1456\n"," Extravagant 292\n","Queens Budget 870\n"," Average 847\n"," Expensive 145\n"," Extravagant 19\n","Staten Island Budget 71\n"," Average 73\n"," Expensive 12\n"," Extravagant 0\n","Name: price_range, dtype: int64\n"]}]}]} -------------------------------------------------------------------------------- /Nyc Airbnb/datasets/airbnb_room_type.xlsx: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/HossamEldinx/NLP-Projects/fd79a6fdb9ee0b74be5b7dd64cd65708f471510d/Nyc Airbnb/datasets/airbnb_room_type.xlsx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## NLP Projects with Data Analysis 2 | 3 |
4 |

5 | 6 | 10 | 11 |

12 |
13 | 14 | [Linkedin](https://www.linkedin.com/in/hossam-eldin/) | [Youtube](https://www.youtube.com/@codewello) | 15 | 16 | 17 |
18 |
19 | 20 | 21 | ## 👋 Hello, my name is Hossam Eldin 22 | 23 | 24 | I've put together this awesome list to help you and me create cool projects in NLP and data analysis! It's a step-by-step guide that works for beginner to intermediate levels and uses all sorts of libraries. Whether you're a newbie or a bit more experienced, you'll find this super useful. 25 | 26 | 27 | | **Num** | **Project** | **Repository** | 28 | |:----:|:------------:|:-------------------------------------------------:| 29 | | 1 | [Ecommerce-review-sentiments](https://github.com/ninda-code/ecommerce-review-sentiments/tree/main) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/ninda-code/ecommerce-review-sentiments/tree/main) | 30 | | 2 | [NLP-Chatbot](https://github.com/ankitamasand/nlp-chatbot) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/ankitamasand/nlp-chatbot) | 31 | | 3 | [Topic-Modeling](https://github.com/yanhan-si/NLP-and-Topic-Modeling-on-User-Review-Dataset) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/yanhan-si/NLP-and-Topic-Modeling-on-User-Review-Dataset) | 32 | | 4 | [Text Summarization Using spaCy](https://github.com/sainivarsha97/spacy-Tutorial/tree/master) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/sainivarsha97/spacy-Tutorial/tree/master) | 33 | | 5 | [Spam Classification](https://github.com/manulthanura/Spam-Classification-using-NLP) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/manulthanura/Spam-Classification-using-NLP) | 34 | | 6 | [Text Classification](https://github.com/Idilismiguzel/NLP-with-Python/blob/master/Text-Classification.ipynb) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/Idilismiguzel/NLP-with-Python/blob/master/Text-Classification.ipynb) | 35 | | 7 | [Market Basket Analysis](https://github.com/limchiahooi/market-basket-analysis) | 
[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/limchiahooi/market-basket-analysis) | 36 | | 8 | [Question Tagging System](https://github.com/kushagra2103/Auto-Tagging-System) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/kushagra2103/Auto-Tagging-System) | 37 | | 9 | [Resume Parsing](https://github.com/anshulmahajan01/NLP/blob/master/Resume%20Parsing%20.ipynb) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/anshulmahajan01/NLP/blob/master/Resume%20Parsing%20.ipynb) | 38 | | 10 | [Disease Prediction](https://github.com/anujdutt9/Disease-Prediction-from-Symptoms) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/anujdutt9/Disease-Prediction-from-Symptoms) | 39 | | 11 | [Image-Caption-Generator](https://github.com/MiteshPuthran/Image-Caption-Generator/tree/master) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/MiteshPuthran/Image-Caption-Generator/tree/master) | 40 | | 12 | [Speech-Emotion-Analyzer](https://github.com/MiteshPuthran/Speech-Emotion-Analyzer) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/MiteshPuthran/Speech-Emotion-Analyzer) | 41 | | 13 | [Detecting Paraphrases](https://github.com/wasiahmad/paraphrase_identification) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/wasiahmad/paraphrase_identification) | 42 | | 14 | [Hate Speech Detection](https://github.com/NakulLakhotia/Hate-Speech-Detection-in-Social-Media-using-Python/blob/master/final_customization.ipynb) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/NakulLakhotia/Hate-Speech-Detection-in-Social-Media-using-Python/blob/master/final_customization.ipynb) | 43 | | 15 | [ The Android App Market on Google Play ](https://github.com/HossamEldinx/NLP-Projects/tree/main/Android%20Apps%20Google%20Play) | 
[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/HossamEldinx/NLP-Projects/tree/main/Android%20Apps%20Google%20Play) | 44 | | 16 | [Exploring the NYC Airbnb Market](https://github.com/HossamEldinx/NLP-Projects/tree/main/Nyc%20Airbnb) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/HossamEldinx/NLP-Projects/tree/main/Nyc%20Airbnb) | 45 | | 17 | [Word Frequency in Classic Novels](https://github.com/HossamEldinx/NLP-Projects/tree/main/Word%20Frequency%20in%20Classic%20Novels) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/HossamEldinx/NLP-Projects/tree/main/Word%20Frequency%20in%20Classic%20Novels) | 46 | | 18 | [Find Movie Similarity from Plot Summaries](https://github.com/mrbarkis/DataCamp_projects/tree/master/Find%20Movie%20Similarity%20from%20Plot%20Summaries) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/mrbarkis/DataCamp_projects/tree/master/Find%20Movie%20Similarity%20from%20Plot%20Summaries) | 47 | | 19 | [Real-time Insights from Social Media Data](https://github.com/mrbarkis/DataCamp_projects/tree/master/Real-time%20Insights%20from%20Social%20Media%20Data) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/mrbarkis/DataCamp_projects/tree/master/Real-time%20Insights%20from%20Social%20Media%20Data) | 48 | | 20 | [Classify Song Genres from Audio Data](https://github.com/kayveen/Classify-Song-Genres-from-Audio-Data) | [![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/kayveen/Classify-Song-Genres-from-Audio-Data) | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /Word Frequency in Classic Novels/Word Frequency in Classic Novels.ipynb: -------------------------------------------------------------------------------- 1 | 
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyNgaKv0fE/b2k8/tzFZTzek"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# 1. Tools for text processing\n","\n","We'll find out which words occur most often in Herman Melville's Moby Dick, and how often.\n","\n","We'll fetch the novel with Python's requests, extract its text with BeautifulSoup, and tokenize and count the words with the Natural Language Toolkit (nltk)."],"metadata":{"id":"nbg_SKPWyVft"}},{"cell_type":"code","execution_count":1,"metadata":{"id":"OtLI6lsbyQyP","executionInfo":{"status":"ok","timestamp":1700551975655,"user_tz":-120,"elapsed":1732,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}}},"outputs":[],"source":["# Importing requests, BeautifulSoup and nltk\n","import requests\n","from bs4 import BeautifulSoup\n","import nltk"]},{"cell_type":"markdown","source":["# 2. Request Moby Dick\n","\n","HTML file on Project Gutenberg: https://www.gutenberg.org/files/2701/2701-h/2701-h.htm (the cell below requests a DataCamp-hosted copy)"],"metadata":{"id":"CZ5MOWWby9A_"}},{"cell_type":"code","source":["# Download the HTML of the novel\n","r = requests.get(\"https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm\")\n","\n","# Decode the response body as UTF-8\n","r.encoding = 'utf-8'\n","\n","html = r.text\n","\n","# Preview the first 2000 characters\n","html[:2000]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":163},"id":"eYKCe_zUzARN","executionInfo":{"status":"ok","timestamp":1700552046352,"user_tz":-120,"elapsed":1097,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"9e36544b-7ba1-4448-fede-e4fb0efc3ba7"},"execution_count":2,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'\\r\\n\\r\\n\\r\\n\\r\\n\\r\\n \\r\\n \\r\\n Moby Dick; Or the Whale, by Herman Melville\\r\\n \\r\\n \\r\\n \\r\\n \\r\\n
\\r\\n\\r\\nThe Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville\\r\\n\\r\\nThis eBook is for the use of anyone anywh'"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":2}]},{"cell_type":"markdown","source":["# 3. Get the text from the HTML\n","\n"," For this we'll use the package BeautifulSoup\n",""],"metadata":{"id":"UFYL30OpzRhj"}},{"cell_type":"code","source":["soup = BeautifulSoup(html,\"lxml\")\n","\n","text = soup.get_text()\n","\n","text[32000:34000]\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":220},"id":"-8Bor8f4zbaV","executionInfo":{"status":"ok","timestamp":1700552109434,"user_tz":-120,"elapsed":461,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"95cd6ac7-10b5-47f4-ef00-8c7654f85f74"},"execution_count":3,"outputs":[{"output_type":"stream","name":"stderr","text":["/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features=\"xml\"` into the BeautifulSoup constructor.\n","  warnings.warn(\n"]},{"output_type":"execute_result","data":{"text/plain":["'t me\\r\\n      from deliberately stepping into the street, and methodically knocking\\r\\n      people’s hats off—then, I account it high time to get to sea as soon\\r\\n      as I can. This is my substitute for pistol and ball. With a philosophical\\r\\n      flourish Cato throws himself upon his sword; I quietly take to the ship.\\r\\n      There is nothing surprising in this. 
If they but knew it, almost all men\\r\\n      in their degree, some time or other, cherish very nearly the same feelings\\r\\n      towards the ocean with me.\\r\\n    \\n\\r\\n      There now is your insular city of the Manhattoes, belted round by wharves\\r\\n      as Indian isles by coral reefs—commerce surrounds it with her surf.\\r\\n      Right and left, the streets take you waterward. Its extreme downtown is\\r\\n      the battery, where that noble mole is washed by waves, and cooled by\\r\\n      breezes, which a few hours previous were out of sight of land. Look at the\\r\\n      crowds of water-gazers there.\\r\\n    \\n\\r\\n      Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears\\r\\n      Hook to Coenties Slip, and from thence, by Whitehall, northward. What do\\r\\n      you see?—Posted like silent sentinels all around the town, stand\\r\\n      thousands upon thousands of mortal men fixed in ocean reveries. Some\\r\\n      leaning against the spiles; some seated upon the pier-heads; some looking\\r\\n      over the bulwarks of ships from China; some high aloft in the rigging, as\\r\\n      if striving to get a still better seaward peep. But these are all\\r\\n      landsmen; of week days pent up in lath and plaster—tied to counters,\\r\\n      nailed to benches, clinched to desks. How then is this? Are the green\\r\\n      fields gone? What do they here?\\r\\n    \\n\\r\\n      But look! here come more crowds, pacing straight for the water, and\\r\\n      seemingly bound for a dive. Strange! Nothing will content them but the\\r\\n      extremest limit of the land; loitering under the shady lee of yonder\\r\\n      warehouses will not suffice. No. They must get just as nigh the '"],"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"}},"metadata":{},"execution_count":3}]},{"cell_type":"markdown","source":["# 4. 
Extract the words\n","\n","For how to use nltk.tokenize.RegexpTokenizer, see the example in [the nltk documentation](https://www.nltk.org/api/nltk.tokenize.html?highlight=regexp#module-nltk.tokenize.regexp).\n","\n"],"metadata":{"id":"vVxenmwUzl5F"}},{"cell_type":"code","source":["# Match runs of word characters; the raw string avoids an invalid-escape warning\n","tokenizer = nltk.tokenize.RegexpTokenizer(pattern=r'\\w+')\n","\n","tokens = tokenizer.tokenize(text=text)\n","\n","tokens[:10]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"MBg5xCzCz9jv","executionInfo":{"status":"ok","timestamp":1700552257582,"user_tz":-120,"elapsed":5,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"852f330e-9a55-44fe-c50b-5052bc3ec4ce"},"execution_count":4,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['Moby',\n"," 'Dick',\n"," 'Or',\n"," 'the',\n"," 'Whale',\n"," 'by',\n"," 'Herman',\n"," 'Melville',\n"," 'The',\n"," 'Project']"]},"metadata":{},"execution_count":4}]},{"cell_type":"markdown","source":["# 5. Make the words lowercase"],"metadata":{"id":"3-y5BGiI0HV5"}},{"cell_type":"code","source":["words = [token.lower() for token in tokens]\n","\n","# Printing out the first 8 words / tokens\n","words[:8]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"yHHdKFVy0Mxj","executionInfo":{"status":"ok","timestamp":1700552301088,"user_tz":-120,"elapsed":297,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"d3b7d20f-5a6c-47da-b70b-0a6a8f01ae01"},"execution_count":5,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['moby', 'dick', 'or', 'the', 'whale', 'by', 'herman', 'melville']"]},"metadata":{},"execution_count":5}]},{"cell_type":"markdown","source":["# 6. Load in stop words\n","\n","Common English words such as 'the', 'of', and 'a' are usually removed before counting because they say little about the content of a text. These words are called stop words. 
The nltk package has a helpful list of stop words in English that we can use.\n","\n","\n","\n","\n","\n"],"metadata":{"id":"ODi-qvfL0SFj"}},{"cell_type":"code","source":["#get stopwords\n","nltk.download('stopwords')\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FslNuOos0fnT","executionInfo":{"status":"ok","timestamp":1700552414919,"user_tz":-120,"elapsed":282,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"2daebed5-9b36-4d7a-99c1-3604021e3629"},"execution_count":6,"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package stopwords to /root/nltk_data...\n","[nltk_data]   Unzipping corpora/stopwords.zip.\n"]},{"output_type":"execute_result","data":{"text/plain":["True"]},"metadata":{},"execution_count":6}]},{"cell_type":"code","source":["sw = nltk.corpus.stopwords.words('english')\n"],"metadata":{"id":"w5tfvN9g0qTF","executionInfo":{"status":"ok","timestamp":1700552438435,"user_tz":-120,"elapsed":397,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}}},"execution_count":8,"outputs":[]},{"cell_type":"code","source":["sw[:15]"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"gfhDX9Qh0tbK","executionInfo":{"status":"ok","timestamp":1700552440002,"user_tz":-120,"elapsed":4,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"d052e096-cad1-43d9-ef88-ff5ea7bc4b81"},"execution_count":9,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['i',\n"," 'me',\n"," 'my',\n"," 'myself',\n"," 'we',\n"," 'our',\n"," 'ours',\n"," 'ourselves',\n"," 'you',\n"," \"you're\",\n"," \"you've\",\n"," \"you'll\",\n"," \"you'd\",\n"," 'your',\n"," 'yours']"]},"metadata":{},"execution_count":9}]},{"cell_type":"markdown","source":["# 7. Remove stop words in Moby Dick\n","\n","We want to make a list of words from Moby Dick, but without using certain common words (stop words). 
To do this, we'll go through each word in the original list and add it to a new list only if it's not a stop word.\n","\n","\n","\n","\n","\n"],"metadata":{"id":"3wF50USg0xbC"}},{"cell_type":"code","source":["# Keep only the words that are not stop words\n","words_ns = [word for word in words if word not in sw]\n","\n","words_ns[:5]\n","\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"m2Y3nkuG00Yw","executionInfo":{"status":"ok","timestamp":1700552474918,"user_tz":-120,"elapsed":462,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"3b76c2ff-3bae-42cf-e45d-64426b2fc7d5"},"execution_count":10,"outputs":[{"output_type":"execute_result","data":{"text/plain":["['moby', 'dick', 'whale', 'herman', 'melville']"]},"metadata":{},"execution_count":10}]},{"cell_type":"markdown","source":["# 8. We have the answer\n","\n","Our original question was:\n","\n","What are the most frequent words in Herman Melville's novel Moby Dick and how often do they occur?\n","\n","We are now ready to answer that! Let's create a word frequency distribution plot using nltk.\n","\n","See the nltk documentation for how to use nltk.FreqDist()\n"],"metadata":{"id":"DujdVu2B1EmH"}},{"cell_type":"code","source":["%matplotlib inline"],"metadata":{"id":"6hJTtkx11y7K","executionInfo":{"status":"ok","timestamp":1700552724427,"user_tz":-120,"elapsed":2,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}}},"execution_count":11,"outputs":[]},{"cell_type":"code","source":["freqdist = nltk.FreqDist(words_ns)\n","\n","freqdist.plot(25)"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":508},"id":"prJqvsq010qa","executionInfo":{"status":"ok","timestamp":1700552734225,"user_tz":-120,"elapsed":1209,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"c6f086dd-8e12-487b-a76c-4540c048a94c"},"execution_count":12,"outputs":[{"output_type":"display_data","data":{"text/plain":["
"],"image/png":"\n"},"metadata":{}},{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":12}]},{"cell_type":"code","source":["import matplotlib.pyplot as plt\n","\n","freqdist = nltk.FreqDist(words_ns)\n","\n","# The 25 most frequent (word, count) pairs, most frequent first\n","top_words = freqdist.most_common(25)\n","\n","# Use a new name so the token list `words` from step 5 is not overwritten\n","top_labels, frequencies = zip(*top_words)\n","\n","most_common_word, most_common_frequency = top_words[0]\n","\n","# Create a bar chart\n","plt.figure(figsize=(10, 6))\n","plt.bar(range(len(top_labels)), frequencies, tick_label=top_labels)\n","plt.xlabel('Words')\n","plt.ylabel('Frequency')\n","plt.title('Top 25 Words Frequency Distribution')\n","plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better visibility\n","plt.tight_layout()\n","plt.show()\n","\n","print(f\"The most common word is '{most_common_word}' with a frequency of {most_common_frequency}.\")\n"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":625},"id":"7r5MpQHl3JUu","executionInfo":{"status":"ok","timestamp":1700553632017,"user_tz":-120,"elapsed":874,"user":{"displayName":"Sam Eldin","userId":"17811646166394375086"}},"outputId":"82163bdc-ee5a-4145-88d6-69b64b523697"},"execution_count":15,"outputs":[{"output_type":"display_data","data":{"text/plain":["
"],"image/png":"\n"},"metadata":{}},{"output_type":"stream","name":"stdout","text":["The most common word is 'whale' with a frequency of 1246.\n"]}]}]} -------------------------------------------------------------------------------- /codewello-banner.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/HossamEldinx/NLP-Projects/fd79a6fdb9ee0b74be5b7dd64cd65708f471510d/codewello-banner.png --------------------------------------------------------------------------------