A fully configured DagsHub environment that suits YOUR needs.\n",
24 | "\n",
25 | "---\n",
26 | "\n",
27 | "With this notebook, you can easily pull all of your project’s components from DagsHub to the Colab runtime, train the model, log the experiments, version the changes, and push them to DagsHub remotes.\n",
1149 | " "
1150 | ]
1151 | },
1152 | "metadata": {},
1153 | "execution_count": 36
1154 | }
1155 | ]
1156 | },
1157 | {
1158 | "cell_type": "code",
1159 | "source": [],
1160 | "metadata": {
1161 | "id": "JqYo0nIIKlp4"
1162 | },
1163 | "execution_count": null,
1164 | "outputs": []
1165 | },
1166 | {
1167 | "cell_type": "code",
1168 | "source": [
1169 | "# Using Logistic Regression for the second submission, without parameter tuning\n",
1170 | "\n",
1171 | "logreg = LogisticRegression()\n",
1172 | "\n",
1173 | "logreg.fit(X, y)"
1174 | ],
1175 | "metadata": {
1176 | "colab": {
1177 | "base_uri": "https://localhost:8080/"
1178 | },
1179 | "id": "Z3d4abnbLOVA",
1180 | "outputId": "793441b2-fc00-42b9-d9c4-62c1673918be"
1181 | },
1182 | "execution_count": 48,
1183 | "outputs": [
1184 | {
1185 | "output_type": "execute_result",
1186 | "data": {
1187 | "text/plain": [
1188 | "LogisticRegression()"
1189 | ]
1190 | },
1191 | "metadata": {},
1192 | "execution_count": 48
1193 | }
1194 | ]
1195 | },
1196 | {
1197 | "cell_type": "code",
1198 | "source": [
1199 | "predictions_2 = logreg.predict(X_test)\n",
1200 | "predictions_2.shape"
1201 | ],
1202 | "metadata": {
1203 | "colab": {
1204 | "base_uri": "https://localhost:8080/"
1205 | },
1206 | "id": "6XhIZazMLZpX",
1207 | "outputId": "0a5e7a70-d538-4a3c-9ace-a8de21b45133"
1208 | },
1209 | "execution_count": 49,
1210 | "outputs": [
1211 | {
1212 | "output_type": "execute_result",
1213 | "data": {
1214 | "text/plain": [
1215 | "(418,)"
1216 | ]
1217 | },
1218 | "metadata": {},
1219 | "execution_count": 49
1220 | }
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "source": [
1226 | "output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': predictions_2})\n",
1227 | "output.to_csv('submission_2.csv', index=False)\n",
1228 | "print(\"Your submission was successfully saved!\")"
1229 | ],
1230 | "metadata": {
1231 | "colab": {
1232 | "base_uri": "https://localhost:8080/"
1233 | },
1234 | "id": "AqlpwthdL8FI",
1235 | "outputId": "650108c6-80a0-441f-a280-df188cd1dd06"
1236 | },
1237 | "execution_count": 50,
1238 | "outputs": [
1239 | {
1240 | "output_type": "stream",
1241 | "name": "stdout",
1242 | "text": [
1243 | "Your submission was successfully saved!\n"
1244 | ]
1245 | }
1246 | ]
1247 | },
1248 | {
1249 | "cell_type": "code",
1250 | "source": [],
1251 | "metadata": {
1252 | "id": "vu9T6D4UMJkn"
1253 | },
1254 | "execution_count": null,
1255 | "outputs": []
1256 | }
1257 | ]
1258 | }
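The logistic-regression cells above depend on `X`, `y`, `X_test`, and `titanic_test` defined in earlier cells that are not shown here. As a minimal, self-contained sketch of the same fit → predict → save workflow, with synthetic stand-ins for the Titanic features and placeholder PassengerId values (both are assumptions, not the notebook's real data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: the real X, y, X_test and titanic_test come from
# earlier (elided) cells of the notebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # stand-in training features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # stand-in binary "Survived" labels
X_test = rng.normal(size=(418, 4))            # 418 test rows, matching the shape above

# Fit without any hyperparameter tuning, as in the notebook
logreg = LogisticRegression()
logreg.fit(X, y)
predictions_2 = logreg.predict(X_test)

# Kaggle expects a two-column CSV: PassengerId, Survived
passenger_ids = np.arange(1, len(predictions_2) + 1)  # placeholder IDs
output = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions_2})
output.to_csv("submission_2.csv", index=False)
print("Your submission was successfully saved!")
```

The `(418,)` shape check in the notebook is a quick sanity test that one prediction was produced per test passenger before writing the submission file.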
--------------------------------------------------------------------------------
/decision-tree-notes.ipynb:
--------------------------------------------------------------------------------
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","source":"## Decision Tree\n\nThe decision tree algorithm is one of the most versatile algorithms in machine learning; it can perform both classification and regression analysis. It is powerful, works well with complex datasets, and is easy to understand and read, which makes it popular. When coupled with ensemble techniques – which we will learn about soon – it performs even better.\nAs the name suggests, this algorithm works by dividing the whole dataset into a tree-like structure based on some rules and conditions, and then gives predictions based on those conditions.\nLet’s understand the decision tree approach with a basic scenario. \nSuppose it’s Friday night and you are not able to decide if you should go out or stay at home. Let the decision tree decide it for you.\n\n\n\n \nAlthough we may or may not use a decision tree for such decisions, this was a basic example to help you understand how a decision tree makes a decision.\nSo how did it work?\n*\tIt selects a root node based on a given condition, e.g. our root node was chosen as time >10 pm.\n*\tThen, the root node was split into child nodes based on the given condition. The right child node in the above figure fulfilled the condition, so no more questions were asked.\n*\tThe left child node didn’t fulfil the condition, so again it was split based on a new condition.\n*\tThis process continues until all the conditions are met or until you reach the predefined depth of your tree, e.g. 
the depth of our tree is 3, and it reached there when all the conditions were exhausted.\n\nLet’s see how the parent nodes and splitting conditions are chosen.\n\n#### Decision Tree for Regression\nWhen performing regression with a decision tree, we try to divide the given values of X into distinct and non-overlapping regions, e.g. for a set of possible values X1, X2,..., Xp; we will try to divide them into J distinct and non-overlapping regions R1, R2, . . . , RJ.\nFor a given observation falling into the region Rj, the prediction is equal to the mean of the response (y) values of the training observations (x) in the region Rj. \nThe regions R1, R2, . . . , RJ are selected in a way that minimizes the following sum of squared residuals:\n\n RSS = Σ (j=1..J) Σ (i ∈ Rj) (yi – yRj)^2 \n \nWhere yRj (the second term) is the mean of all the response values in the region ‘j’.\n\n\n\n#### Recursive binary splitting (greedy approach)\nAs mentioned above, we try to divide the X values into J regions, but it is computationally very expensive to consider every possible partition of the X values into J regions. Thus, the decision tree opts for a top-down greedy approach in which a node is divided into two regions based on a given condition, i.e. not every node is split, but the ones which satisfy the condition are split into two branches. It is called greedy because it makes the best split at the current step rather than looking ahead for a split that would produce a better tree in later steps. It picks a feature Xj and a threshold value (say s) that divide the observations into the two regions R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj >= s}, so as to minimize:\n\n Σ (i: xi ∈ R1(j,s)) (yi – yR1)^2 + Σ (i: xi ∈ R2(j,s)) (yi – yR2)^2 \n \nHere, j and s are found such that this expression has the minimum value, and the regions R1, R2 are selected based on those values of j and s.\nSimilarly, more regions are split out of the regions created above based on some condition with the same logic. 
This continues until a (predefined) stopping criterion is reached.\nOnce all the regions are split, the prediction for a region is the mean of the observations in that region.\n\nThe process described above has a high chance of overfitting the training data, as the resulting tree can be very complex. \n\n\n### Classification Trees\n\nRegression trees are used for quantitative data. In the case of qualitative or categorical data, we use classification trees. In regression trees, we split the nodes based on the RSS criterion, but in classification it is done using the classification error rate, Gini impurity and entropy.\nLet’s understand these terms in detail.\n\n#### Entropy\nEntropy is the measure of randomness in the data. In other words, it gives the impurity present in the dataset:\n\n E = -Σ pi*log2(pi) \n \nwhere pi is the proportion of observations in class i.\nWhen we split a node into two regions and put different observations in the two regions, the main goal is to reduce the entropy, i.e. reduce the randomness in the region and divide our data more cleanly than in the previous node. If splitting the node doesn’t lead to a reduction in entropy, we try to split based on a different condition, or we stop. 
\nA region is clean (low entropy) when it contains data with the same labels, and random when there is a mixture of labels present (high entropy).\nLet’s suppose there are ‘m’ observations and we need to classify them into categories 1 and 2.\nLet’s say that category 1 has ‘n’ observations and category 2 has ‘m-n’ observations.\n\np = n/m and q = (m-n)/m = 1-p\n\nThen, the entropy for the given set is:\n\n\n E = -p*log2(p) – q*log2(q) \n \n \nWhen all the observations belong to category 1, then p = 1; when all observations belong to category 2, then p = 0. In both cases E = 0, as there is no randomness in the categories.\nIf half of the observations are in category 1 and the other half in category 2, then p = 1/2 and q = 1/2, and the entropy is at its maximum, E = 1.\n\n\n\n \n\n#### Information Gain\nInformation gain calculates the decrease in entropy after splitting a node. It is the difference between the entropies before and after the split. The more the information gain, the more entropy is removed:\n\n Gain(T, X) = E(T) – Σ (|Tv|/|T|) * E(Tv) \n \nWhere T is the parent node before the split, X is the attribute used to split T, and the sum runs over the child nodes Tv produced by the split.\n\nA tree split on the basis of entropy and information gain looks like:\n\n\n\n#### Gini Impurity\nAccording to Wikipedia, ‘Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it was randomly labelled according to the distribution of labels in the subset.’\nIt is calculated by summing, over all classes, the probability of picking an observation of a given class multiplied by the probability of misclassifying it.\nLet’s suppose there are k classes; then the Gini impurity is given as:\n\n G = Σ (i=1..k) pi*(1 – pi) = 1 – Σ pi^2 \n \nThe Gini impurity value lies between 0 and 1, with 0 meaning no impurity and 1 denoting a random distribution.\nThe node for which the Gini impurity is lowest is selected as the root node to split.\n\n\nA tree split on the basis of the Gini impurity value looks like:\n\n\n\n\n\n","metadata":{"id":"XDcUAXjpGqVJ"}},{"cell_type":"markdown","source":"### Maths behind Decision Tree Classifier\nBefore we see the Python implementation of the decision tree, let's first understand the math behind decision tree classification. We will see how all the terms mentioned above are used for splitting.\n\nWe will use a simple dataset which contains information about students of different classes and genders, and see whether they stay in the school's hostel or not.","metadata":{"id":"wGKGzVu0GqVS"}},{"cell_type":"markdown","source":"This is how our dataset looks:\n\n\n","metadata":{"id":"kHVbisn7GqVT"}},{"cell_type":"markdown","source":"Let's try and understand how the root node is selected by calculating the Gini impurity. We will use the data mentioned above.\n\nWe have two features which we can use for nodes: \"Class\" and \"Gender\".\nWe will calculate the Gini impurity for each of the features and then select the feature with the least Gini impurity.\n\nLet's review the formula for calculating Gini impurity:\n\n\n\nLet's start with \"Class\"; we will calculate the Gini impurity for all the different values in \"Class\". 
\n\nThis is how our decision tree node is selected, by calculating the Gini impurity for each node individually.\nIf the number of features increases, we just need to repeat the same steps after the selection of the root node.","metadata":{"id":"XEAo25FiGqVT"}},{"cell_type":"markdown","source":"We will try and find the root node for the same dataset by calculating entropy and information gain.\n\nDataSet:\n\n\n\nWe have two features, and we will try to choose the root node by calculating the information gain obtained by splitting on each feature.\n\nLet's review the formulas for entropy and information gain:\n\n\n\nLet's start with the feature \"class\":\n\n\n\nLet's see the information gain from the feature \"gender\":\n\n\n\n","metadata":{"id":"IWlyaD_4GqVU"}},{"cell_type":"markdown","source":"### Different Algorithms for Decision Tree\n\n\n* ID3 (Iterative Dichotomiser): It is one of the algorithms used to construct decision trees for classification. It uses information gain as the criterion for finding the root nodes and splitting them. It only accepts categorical attributes.\n\n* C4.5: It is an extension of the ID3 algorithm, and better than ID3 as it handles both continuous and discrete values. It is also used for classification purposes.\n\n\n* Classification and Regression Trees (CART): It is the most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting root nodes; however, one can use \"entropy\" as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation. \n\n\nEntropy and Gini impurity can be used interchangeably; it doesn't affect the result much. However, Gini is easier to compute than entropy, since entropy involves a log calculation. That's why the CART algorithm uses Gini as the default criterion.\n\nIf we plot Gini vs. entropy, we can see there is not much difference between them:\n\n\n\n","metadata":{"id":"NGF42IqYGqVU"}},{"cell_type":"markdown","source":"##### Advantages of Decision Tree:\n\n * It can be used for both Regression and Classification problems.\n * Decision Trees are very easy to grasp, as the splitting rules are clearly stated.\n * Complex decision tree models become simple when visualized; they can be understood just by looking at the tree.\n * Scaling and normalization are not needed.\n\n\n##### Disadvantages of Decision Tree:\n\n\n * A small change in the data can cause instability in the model because of the greedy approach.\n * The probability of overfitting is very high for Decision Trees.\n * It takes more time to train a decision tree model than other classification algorithms.","metadata":{"id":"vPxt4lUfGqVV"}},{"cell_type":"markdown","source":"## Business Case: based on the given features, we need to find whether an employee will leave the company or not.","metadata":{"id":"81MWsN13GqVV"}},{"cell_type":"code","source":"## Importing the libraries\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n%matplotlib inline","metadata":{"executionInfo":{"elapsed":1748,"status":"ok","timestamp":1619774881813,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"GTvp_TcrGqVW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Target variable:-","metadata":{"id":"h9OlcUPLGqVW"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Loading the 
data\ndata=pd.read_csv('HR-Employee-Attrition.csv')","metadata":{"id":"Tt_FHAtHGqVX"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()\npd.set_option('display.max_columns',None)","metadata":{"id":"pHCvnVhtGqVX","outputId":"14dfc600-ab01-4f18-cc40-38c872c064c2"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Getting some rows\ndata.columns","metadata":{"id":"3hvXDP0MGqVY","outputId":"deccd908-f355-4142-db22-a46da4485ff0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.set_option('display.max_columns',None)","metadata":{"id":"XrfN1JniGqVY"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Basic Checks","metadata":{"id":"SifTu1HxGqVZ"}},{"cell_type":"code","source":"data.tail()","metadata":{"id":"VaotZpJ3GqVZ","outputId":"f7c60733-2736-484d-8805-09fe9bd236a0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe()","metadata":{"id":"DQJs1mkVGqVZ","outputId":"7c6c572f-821a-4257-b1cf-a26980f5507f"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe(include=['O'])","metadata":{"id":"nHxP8RSpGqVa","outputId":"685d7fea-c131-4bf3-d522-1f99123f89de"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.info()","metadata":{"id":"2grCr8khGqVa"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Exploratory Data Analysis","metadata":{"id":"lgxJSrcjGqVa"}},{"cell_type":"code","source":"## Univariate Analysis\n!pip install sweetviz","metadata":{"executionInfo":{"elapsed":6423,"status":"ok","timestamp":1619775151488,"user":{"displayName":"DHANUSH 
Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"4ptW3IqPGqVa","outputId":"bb95af17-7530-441b-c08b-5d2bf46de589"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import sweetviz as sv\nmy_report = sv.analyze(data)\nmy_report.show_html() # Default arguments will generate to \"SWEETVIZ_REPORT.html\"","metadata":{"executionInfo":{"elapsed":6980,"status":"error","timestamp":1619775152057,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"IYqXdgGvGqVb","outputId":"361846ee-10bc-4fa4-aad0-1faf54824ef5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Insights from univariate analysis-->task","metadata":{"executionInfo":{"elapsed":6968,"status":"aborted","timestamp":1619775152048,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"UGde0x-iGqVb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Bivaraite Analysis checking relationship of all variables with respect to ","metadata":{"executionInfo":{"elapsed":6967,"status":"aborted","timestamp":1619775152050,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"t-ljpiZyGqVb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"categorical_col = []\nfor column in data.columns:\n if data[column].dtype == object and len(data[column].unique()) <= 50:\n categorical_col.append(column)\n print(f\"{column} : {data[column].unique()}\")\n 
print(\"====================================\")","metadata":{"executionInfo":{"elapsed":6961,"status":"aborted","timestamp":1619775152051,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"bhAo8-QaGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Categorical Data","metadata":{"executionInfo":{"elapsed":6955,"status":"aborted","timestamp":1619775152052,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"rQ_KEiy5GqVc"}},{"cell_type":"code","source":"## Create a new dataframe with categorical variables only\ndata1=data[['Attrition',\n 'BusinessTravel',\n 'Department',\n 'EducationField',\n 'Gender',\n 'JobRole',\n 'MaritalStatus',\n 'Over18',\n 'OverTime']]","metadata":{"executionInfo":{"elapsed":6952,"status":"aborted","timestamp":1619775152053,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"FX0KQNHSGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data1","metadata":{"executionInfo":{"elapsed":6947,"status":"aborted","timestamp":1619775152054,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"97QP0vXTGqVc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every categorical feature correlate with the \"target\"\nplt.figure(figsize=(25,25), facecolor='white')\nplotnumber = 1\n\nfor column in data1:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.countplot(x=data1[column].dropna(axis=0)\n 
,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n plotnumber+=1\nplt.tight_layout()","metadata":{"executionInfo":{"elapsed":6942,"status":"aborted","timestamp":1619775152055,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"K11mCBysGqVd"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"plt.figure(figsize=(40,25), facecolor='white')\nsns.countplot(x='JobRole',hue='Attrition',data=data1)\nplt.xlabel('JobRole',fontsize=40)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col = []\nfor column in data.columns:\n if data[column].dtype == int and len(data[column].unique()) >= 10:\n numerical_col.append(column)\n ","metadata":{"executionInfo":{"elapsed":6940,"status":"aborted","timestamp":1619775152056,"user":{"displayName":"DHANUSH Appala","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GjkVBriqzTRKHlFTY43_1IwgVqGUQnUlVPkfsvVpRg=s64","userId":"17800607208793960461"},"user_tz":-330},"id":"4aBbGbVoGqVd"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.info()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col","metadata":{"scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Discrete data","metadata":{"id":"xMnzw7AxGqVd"}},{"cell_type":"code","source":"discrete_col = []\nfor column in data.columns:\n if data[column].dtype == int and len(data[column].unique()) <= 10:\n discrete_col.append(column)\n ","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"discrete_col","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data3=data[['Education',\n 'EmployeeCount',\n 'EnvironmentSatisfaction',\n 'JobInvolvement',\n 'JobLevel',\n 'JobSatisfaction',\n 
'NumCompaniesWorked',\n 'PerformanceRating',\n 'RelationshipSatisfaction',\n 'StandardHours',\n 'StockOptionLevel',\n 'TrainingTimesLastYear',\n 'WorkLifeBalance']]","metadata":{"id":"NTWZl0nTGqVe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every discrete feature correlate with the \"target\"\nplt.figure(figsize=(20,25), facecolor='white')\nplotnumber = 1\n\nfor column in data3:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.countplot(x=data3[column].dropna(axis=0)\n ,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n plotnumber+=1\nplt.tight_layout()","metadata":{"id":"HNzZGjgTGqVe","outputId":"a4453bd4-53e0-4037-c7aa-eae0dcc3d216"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"numerical_col","metadata":{"id":"0_mHNEfJGqVe","outputId":"32e15048-df45-4877-80ef-2be800b38ef5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data2=data[['Age',\n 'DailyRate',\n 'DistanceFromHome',\n 'EmployeeNumber',\n 'HourlyRate',\n 'MonthlyIncome',\n 'MonthlyRate',\n 'NumCompaniesWorked',\n 'PercentSalaryHike',\n 'TotalWorkingYears',\n 'YearsAtCompany',\n 'YearsInCurrentRole',\n 'YearsSinceLastPromotion',\n 'YearsWithCurrManager']]","metadata":{"id":"5DwOl-OhGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plotting how every numerical feature correlate with the \"target\"\nplt.figure(figsize=(20,25), facecolor='white')\nplotnumber = 1\n\nfor column in data2:\n if plotnumber<=16 :\n ax = plt.subplot(4,4,plotnumber)\n sns.histplot(x=data2[column].dropna(axis=0)\n ,hue=data.Attrition)\n plt.xlabel(column,fontsize=20)\n plt.ylabel('Attrition',fontsize=20)\n 
plotnumber+=1\nplt.tight_layout()","metadata":{"id":"R82tDyP2GqVf","outputId":"aad28ca6-7152-4015-a2f5-24faa0767ffb"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{"id":"ZPq5HV-GGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Final conclusions\n# BusinessTravel : Workers who travel a lot are more likely to quit than other employees.\n\n# Department : Workers in Research & Development are more likely to stay than workers in other departments.\n\n# EducationField : Workers with a Human Resources or Technical Degree background are more likely to quit than employees from other fields of education.\n\n# Gender : Males are more likely to quit.\n\n# JobRole : Workers in the Laboratory Technician, Sales Representative, and Human Resources roles are more likely to quit than workers in other positions.\n\n# MaritalStatus : Workers with Single marital status are more likely to quit than the Married and Divorced.\n\n# OverTime : The attrition rate is almost equal","metadata":{"id":"NnQIsJzMGqVf"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Data Preprocessing","metadata":{"id":"Q9k1w33zGqVg"}},{"cell_type":"code","source":"## Checking missing values\ndata.isnull().sum()","metadata":{"id":"Sa-TlTe8GqVg","outputId":"5f8c541e-50a0-4cb2-b110-93fc9a19cdbe"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Categorical data conversion\ndata1.head()","metadata":{"id":"_J_u9FdIGqVg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Manual encoding of the Attrition feature\ndata.Attrition=data.Attrition.map({'Yes':1,'No':0})","metadata":{"id":"g8jvPI4xGqVg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Attrition.unique()","metadata":{"id":"n2ddHOLsGqVh","outputId":"ebe1af6b-c561-429d-a9cc-73f1e05e0034"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding BusinessTravel; this feature suggests that workers who travelled 
frequently are more likely to quit, so let's do the\n## manual encoding\ndata.BusinessTravel=data.BusinessTravel.map({'Travel_Frequently':2,'Travel_Rarely':1,'Non-Travel':0})\n","metadata":{"id":"V2APHYAdGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"mVUQ5MzcGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Department=data.Department.map({'Research & Development':2,'Sales':1,'Human Resources':0})\n","metadata":{"id":"YHE92tgcGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Department\n","metadata":{"id":"6E51sFceGqVh"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.EducationField=data.EducationField.map({'Life Sciences':5,'Medical':4,'Marketing':3,'Technical Degree':2,'Other':1,'Human Resources':0 })\n \n ","metadata":{"id":"450XCLRIGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"5qLTareTGqVi","outputId":"66d31fd4-0b43-4d17-e27e-b9631de4c502","collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding Gender\ndata.Gender=pd.get_dummies(data.Gender,drop_first=True)","metadata":{"id":"VKyEmGfDGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Gender.value_counts()","metadata":{"id":"APyj9DcRGqVi","outputId":"27a620e0-cb69-4dfb-c2a6-264142d2bfd1","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.Gender","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding JobRole\ndata.JobRole=data.JobRole.map({'Laboratory Technician':8,'Sales Executive':7,'Research Scientist':6,'Sales Representative':5,\n 'Human Resources':4,'Manufacturing Director':3,'Healthcare Representative':2,'Manager':1,'Research Director':0 })\n \n \n 
","metadata":{"id":"9FVbyyqkGqVi"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.JobRole","metadata":{"id":"ehJ6tv84GqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding MaritalStatus\n\nfrom sklearn.preprocessing import LabelEncoder\n\nlabel = LabelEncoder()\ndata.MaritalStatus=label.fit_transform(data.MaritalStatus)","metadata":{"id":"8Lkw25rOGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.MaritalStatus","metadata":{"id":"bswiSdTYGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Encoding OverTime\ndata.OverTime=label.fit_transform(data.OverTime)","metadata":{"id":"9fhBprtnGqVj"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.head()","metadata":{"id":"xhSE2-g4GqVk"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Feature Selection","metadata":{"id":"eY7sHt4EGqVk"}},{"cell_type":"code","source":"## Checking correlation\n\nplt.figure(figsize=(30, 30))\nsns.heatmap(data2.corr(), annot=True, cmap=\"RdYlGn\", annot_kws={\"size\":15})","metadata":{"id":"eplBC4upGqVk","outputId":"9261a36a-9cf0-44fd-afec-ba6625b76d99","collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Removing constant features\ndata.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis=\"columns\", inplace=True)","metadata":{"id":"0IZ7NazaGqVk"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"data.describe()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Model Creation","metadata":{"id":"TNccjAhwGqVl"}},{"cell_type":"code","source":"## Creating independent and dependent variable\nX = data.drop('Attrition', axis=1)\ny = 
data.Attrition","metadata":{"id":"hstSuG5WGqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Balancing the data\nfrom collections import Counter\nfrom imblearn.over_sampling import SMOTE\nsm=SMOTE()\nprint(Counter(y))\nX_sm,y_sm=sm.fit_resample(X,y)\nprint(Counter(y_sm))","metadata":{"id":"2-vfEK3qGqVl","outputId":"8cc6621c-c747-45d0-b4dc-cd7fd0e6f512"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Preparing training and testing data\nfrom sklearn.model_selection import train_test_split\n\n\n\nX_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.25, random_state=42)","metadata":{"id":"ENFDdjN6GqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.tree import DecisionTreeClassifier\ndt=DecisionTreeClassifier()\ndt.fit(X_train,y_train)\ny_hat=dt.predict(X_test)","metadata":{"id":"bO4z0WfgGqVl"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## Evaluating the model\nfrom sklearn.metrics import accuracy_score,classification_report,f1_score\n## Training score\ntrain_predict=dt.predict(X_train)\ncc_train=accuracy_score(y_train,train_predict)\ncc_train","metadata":{"id":"oLebPUzFGqVm"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_train,train_predict))","metadata":{"id":"gcZKrmmXGqVm","outputId":"6f8978e5-59ae-45a2-bcec-c8859417599f"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.crosstab(y_train,train_predict)","metadata":{"id":"_nWcvr9AGqVm","outputId":"c6724240-4648-4bba-ac63-86934197e889"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## test 
acc\ntest_acc=accuracy_score(y_test,y_hat)\ntest_acc","metadata":{"id":"SHIIwd0tGqVm","outputId":"de310265-7fce-4677-c0b0-f0c5d67864a8","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{"id":"OH29Hq0GGqVn"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"## test score\ntest_f1=f1_score(y_test,y_hat)\ntest_f1","metadata":{"id":"ndTtoD1dGqVn","outputId":"a345da1f-30bf-4ead-f594-1622fb401d72"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_hat))","metadata":{"id":"uzKcjjRBGqVn","outputId":"a3f5c57c-c04d-4cfa-89c4-6a474a050da7"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"pd.crosstab(y_test,y_hat)","metadata":{"id":"YGdOtp55GqVn","outputId":"1a17d8de-7e1c-4beb-9672-c00a1bd3b542"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Hyperparameters of DecisionTree","metadata":{"id":"xHp4KJoMGqVn"}},{"cell_type":"markdown","source":"* criterion: The function to measure the quality of a split. Supported criteria are \"gini\" for the Gini impurity and \"entropy\" for the information gain.\n\n\n* splitter: The strategy used to choose the split at each node. 
Supported strategies are \"best\" to choose the best split and \"random\" to choose the best random split.\n\n* max_depth: The maximum depth of the tree, i.e. how deep the decision tree can grow. The deeper the tree, the more splits it has and the more information it captures from the data. In general, a decision tree overfits for large depth values: it fits the training data perfectly but fails to generalize to the test data.\n\n* min_samples_split: The minimum number of samples required to split an internal node. A common range is 2 to 40.\n\n* min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to min_samples_split, but applied at the leaves, the base of the tree. A common range is 1 to 20.\n","metadata":{"id":"3sXgNhQKGqVo"}},{"cell_type":"code","source":"# Reference: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn.model_selection import GridSearchCV","metadata":{"id":"hhVF_1VgGqVo"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"params = {\n \"criterion\": (\"gini\", \"entropy\"),\n \"splitter\": (\"best\", \"random\"),\n \"max_depth\": list(range(1, 20)),\n \"min_samples_split\": [2, 3, 4],\n \"min_samples_leaf\": list(range(1, 20)),\n}\n\ntree_clf = DecisionTreeClassifier(random_state=3)\ntree_cv = GridSearchCV(tree_clf, params, scoring=\"f1\", n_jobs=-1, verbose=1, cv=3)\ntree_cv.fit(X_train,y_train)\nbest_params = tree_cv.best_params_\nprint(f\"Best parameters: {best_params}\")","metadata":{"id":"CI6CDPDyGqVo","outputId":"232b8759-cdc3-4fc9-c0ef-f3ef47931b60","scrolled":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Output:\n# Fitting 3 folds for each of 4332 candidates, totalling 12996 fits\n# Best parameters: {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 
'random'})\n","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"tree_cv.best_params_","metadata":{"id":"G02O7SWoGqVo"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"tree_cv.best_score_","metadata":{"id":"NtJLO8QRGqVo","outputId":"a2ed2226-83ba-4c4f-9db8-1f1aa7124661"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"dt1=DecisionTreeClassifier(criterion='gini',\n max_depth=13,min_samples_leaf=1,\n min_samples_split=2,splitter='random')","metadata":{"id":"BsjmSDTHGqVp"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"dt1.fit(X_train,y_train)","metadata":{"id":"QIpcP_d4GqVp","outputId":"25114b13-8270-4d29-acf8-2145a7581ad8"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"y_hat1=dt1.predict(X_test)","metadata":{"id":"hgxF9WufGqVp"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"test_accuracy=accuracy_score(y_test,y_hat1)\ntest_accuracy","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"test_f1=f1_score(y_test,y_hat1)\ntest_f1","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_hat1))","metadata":{"id":"SNcxd2n4GqVp","outputId":"4f32d1e2-421c-4fb5-848f-0a9ca1217d73"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## RandomForest Implementation","metadata":{"id":"eOhvjBIxGqVp"}},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestClassifier\n\nrf_clf = 
RandomForestClassifier(n_estimators=100)\nrf_clf.fit(X_train,y_train)","metadata":{"id":"CREynTLyGqVq","outputId":"0dccb0c3-84b6-4106-f441-494ea08fe631"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"y_predict=rf_clf.predict(X_test)","metadata":{"id":"K6yeNrbLGqVq"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(classification_report(y_test,y_predict))","metadata":{"id":"2PxP7zz7GqVq","outputId":"7e5549f1-4429-4e55-99d9-c27ce91f2f40"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"f_Score=f1_score(y_test,y_predict)\nf_Score","metadata":{"id":"Q1693iiVGqVq","outputId":"844968d4-550b-4540-9d34-7267d5bac380"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Hyperparameter Tuning","metadata":{"id":"GGQzgT2GGqVq"}},{"cell_type":"markdown","source":"* n_estimators = number of trees in the forest\n* max_features = max number of features considered at each split\n* max_depth = max number of levels in each decision tree\n* min_samples_split = min number of data points placed in a node before the node is split\n* min_samples_leaf = min number of data points allowed in a leaf node\n* bootstrap = method for sampling data points (with or without replacement)","metadata":{"id":"J2NDxMIKGqVq"}},{"cell_type":"code","source":"from sklearn.model_selection import RandomizedSearchCV\n\nn_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]\nmax_features = ['sqrt', 'log2']  # 'auto' was removed in newer scikit-learn versions\nmax_depth = [int(x) for x in np.linspace(10, 110, num=11)]\nmax_depth.append(None)\nmin_samples_split = [2, 5, 10]\nmin_samples_leaf = [1, 2, 4]\nbootstrap = [True, False]\n\nrandom_grid = {'n_estimators': n_estimators, 'max_features': max_features,\n 'max_depth': max_depth, 'min_samples_split': min_samples_split,\n 'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}\n\nrf_clf1 = RandomForestClassifier(random_state=42)\n\nrf_cv = RandomizedSearchCV(estimator=rf_clf1, 
scoring='f1',param_distributions=random_grid, n_iter=100, cv=3, \n verbose=3, random_state=42, n_jobs=-1)\n\nrf_cv.fit(X_train, y_train)\nrf_best_params = rf_cv.best_params_\nprint(f\"Best parameters: {rf_best_params}\")","metadata":{"id":"9xUOaCSsGqVr","outputId":"29e21a0f-eadc-4d00-9c96-183a9563a346"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"rf_clf2 = RandomForestClassifier(**rf_best_params)\nrf_clf2.fit(X_train, y_train)\ny_predict=rf_clf2.predict(X_test)\n# Use a distinct name so the f1_score function is not shadowed\nrf_best_f1=f1_score(y_test,y_predict)","metadata":{"id":"HXDZsw6OGqVr"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"rf_best_f1","metadata":{"id":"e9vcwyZiGqVr","outputId":"38f9e72a-9ba9-4526-a1d5-566414c653af"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Decision Tree Visualization","metadata":{}},{"cell_type":"code","source":"import matplotlib.pyplot as plt","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from sklearn import tree\nplt.figure(figsize=(15,10))\ntree.plot_tree(dt,filled=True)","metadata":{"collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"! pip install graphviz","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import graphviz\n# DOT data\ndot_data = tree.export_graphviz(dt, \n feature_names=X.columns, \n class_names=[str(c) for c in dt.classes_],\n filled=True)\n\n# Draw graph\ngraph = graphviz.Source(dot_data, format=\"png\") \ngraph","metadata":{"collapsed":true,"jupyter":{"outputs_hidden":true}},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{},"execution_count":null,"outputs":[]}]}
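The notebook above balances the attrition labels with imblearn's SMOTE before splitting and fitting. As a dependency-free sketch of the balancing idea (plain Python only; `random_oversample` is a hypothetical helper, not the imblearn API), random oversampling simply duplicates minority-class rows until the class counts match — note that real SMOTE instead synthesizes new minority points by interpolating between nearest neighbours:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Balance a binary dataset by randomly duplicating minority-class rows.

    Simplified stand-in for SMOTE, which synthesizes new points by
    interpolating between minority-class neighbours instead of duplicating.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    majority, minority = sorted(counts, key=counts.get, reverse=True)
    deficit = counts[majority] - counts[minority]
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    X_bal, y_bal = list(X), list(y)
    for _ in range(deficit):
        X_bal.append(rng.choice(minority_rows))
        y_bal.append(minority)
    return X_bal, y_bal

# Imbalanced toy data: 6 "No" vs 2 "Yes"
X = [[0], [1], [2], [3], [4], [5], [10], [11]]
y = ["No"] * 6 + ["Yes"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 6 samples
```

As with SMOTE, balancing should be applied only to training data, never to the held-out test set.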
--------------------------------------------------------------------------------
/random-forest-social-network.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","id":"b8c80528","metadata":{"execution":{"iopub.execute_input":"2023-01-10T16:31:59.598789Z","iopub.status.busy":"2023-01-10T16:31:59.598342Z","iopub.status.idle":"2023-01-10T16:31:59.62834Z","shell.execute_reply":"2023-01-10T16:31:59.6266Z","shell.execute_reply.started":"2023-01-10T16:31:59.598756Z"},"papermill":{"duration":0.008579,"end_time":"2023-01-12T13:50:23.814162","exception":false,"start_time":"2023-01-12T13:50:23.805583","status":"completed"},"tags":[]},"source":["
\n"," In this project, I am using the Random Forest algorithm on the Social Network Ads data to predict whether a sale was successful or not.\n"," \n","
\n","
\n"," Some things to note:\n","
The Random Forest classifier builds a set of decision trees (DTs), each trained on a randomly selected subset of the training set, and then collects the votes from the individual trees to decide the final prediction.\n","
"]},{"cell_type":"code","execution_count":1,"id":"8a1ed862","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:23.873133Z","iopub.status.busy":"2023-01-12T13:50:23.872328Z","iopub.status.idle":"2023-01-12T13:50:25.58106Z","shell.execute_reply":"2023-01-12T13:50:25.5802Z"},"papermill":{"duration":1.720153,"end_time":"2023-01-12T13:50:25.583717","exception":false,"start_time":"2023-01-12T13:50:23.863564","status":"completed"},"tags":[]},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","import seaborn as sns\n","import matplotlib.pyplot as plt\n","import statsmodels.formula.api as smf\n","from sklearn.model_selection import train_test_split\n","from sklearn.preprocessing import StandardScaler\n","from sklearn.tree import DecisionTreeClassifier\n","from sklearn.ensemble import RandomForestClassifier\n","from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, classification_report"]},{"cell_type":"markdown","id":"fd356bd4","metadata":{"papermill":{"duration":0.008211,"end_time":"2023-01-12T13:50:25.599087","exception":false,"start_time":"2023-01-12T13:50:25.590876","status":"completed"},"tags":[]},"source":["
\n"," Loading the data\n","
"]},{"cell_type":"code","execution_count":2,"id":"941634e5","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.615353Z","iopub.status.busy":"2023-01-12T13:50:25.614339Z","iopub.status.idle":"2023-01-12T13:50:25.633752Z","shell.execute_reply":"2023-01-12T13:50:25.632982Z"},"papermill":{"duration":0.030404,"end_time":"2023-01-12T13:50:25.636371","exception":false,"start_time":"2023-01-12T13:50:25.605967","status":"completed"},"tags":[]},"outputs":[],"source":["#Reading the data file\n","social_data = pd.read_csv('/kaggle/input/social-network/Social_Network_Ads.csv')"]},{"cell_type":"code","execution_count":3,"id":"06b49c59","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.652417Z","iopub.status.busy":"2023-01-12T13:50:25.652052Z","iopub.status.idle":"2023-01-12T13:50:25.672265Z","shell.execute_reply":"2023-01-12T13:50:25.671056Z"},"papermill":{"duration":0.031072,"end_time":"2023-01-12T13:50:25.6747","exception":false,"start_time":"2023-01-12T13:50:25.643628","status":"completed"},"tags":[]},"outputs":[{"data":{"text/html":["
"],"text/plain":[" User ID Gender Age EstimatedSalary Purchased\n","0 15624510 Male 19 19000 0\n","1 15810944 Male 35 20000 0\n","2 15668575 Female 26 43000 0\n","3 15603246 Female 27 57000 0\n","4 15804002 Male 19 76000 0\n","5 15728773 Male 27 58000 0\n","6 15598044 Female 27 84000 0\n","7 15694829 Female 32 150000 1\n","8 15600575 Male 25 33000 0\n","9 15727311 Female 35 65000 0\n","10 15570769 Female 26 80000 0"]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["#Printing the head for data\n","social_data.head(11)"]},{"cell_type":"code","execution_count":4,"id":"2c3d07a5","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.691067Z","iopub.status.busy":"2023-01-12T13:50:25.690639Z","iopub.status.idle":"2023-01-12T13:50:25.714479Z","shell.execute_reply":"2023-01-12T13:50:25.712914Z"},"papermill":{"duration":0.035015,"end_time":"2023-01-12T13:50:25.717108","exception":false,"start_time":"2023-01-12T13:50:25.682093","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["\n","RangeIndex: 400 entries, 0 to 399\n","Data columns (total 5 columns):\n"," # Column Non-Null Count Dtype \n","--- ------ -------------- ----- \n"," 0 User ID 400 non-null int64 \n"," 1 Gender 400 non-null object\n"," 2 Age 400 non-null int64 \n"," 3 EstimatedSalary 400 non-null int64 \n"," 4 Purchased 400 non-null int64 \n","dtypes: int64(4), object(1)\n","memory usage: 15.8+ KB\n"]}],"source":["#Checking info for the data\n","social_data.info()"]},{"cell_type":"markdown","id":"03cf9d87","metadata":{"papermill":{"duration":0.007016,"end_time":"2023-01-12T13:50:25.731691","exception":false,"start_time":"2023-01-12T13:50:25.724675","status":"completed"},"tags":[]},"source":["
\n"," Selecting data using iloc - Here we select all rows but only the required columns\n","
"]},{"cell_type":"code","execution_count":5,"id":"b4729237","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.747892Z","iopub.status.busy":"2023-01-12T13:50:25.747463Z","iopub.status.idle":"2023-01-12T13:50:25.753799Z","shell.execute_reply":"2023-01-12T13:50:25.75267Z"},"papermill":{"duration":0.016979,"end_time":"2023-01-12T13:50:25.755995","exception":false,"start_time":"2023-01-12T13:50:25.739016","status":"completed"},"tags":[]},"outputs":[],"source":["#Separating features and target variable\n","X = social_data.iloc[:, [2,3]].values\n","y = social_data.iloc[:,4].values"]},{"cell_type":"code","execution_count":6,"id":"585dc6a8","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.772257Z","iopub.status.busy":"2023-01-12T13:50:25.771914Z","iopub.status.idle":"2023-01-12T13:50:25.776359Z","shell.execute_reply":"2023-01-12T13:50:25.775345Z"},"papermill":{"duration":0.01506,"end_time":"2023-01-12T13:50:25.778527","exception":false,"start_time":"2023-01-12T13:50:25.763467","status":"completed"},"tags":[]},"outputs":[],"source":["#Checking the columns\n","#print(X), print(y)"]},{"cell_type":"markdown","id":"27bf42c5","metadata":{"papermill":{"duration":0.006953,"end_time":"2023-01-12T13:50:25.792902","exception":false,"start_time":"2023-01-12T13:50:25.785949","status":"completed"},"tags":[]},"source":["
\n"," Splitting the dataset into the Training set and Test set\n","
\n"," Building the DT Model using the Training data\n","
Here we create a DT classifier object\n","
"]},{"cell_type":"code","execution_count":12,"id":"7c8f99a6","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:25.978255Z","iopub.status.busy":"2023-01-12T13:50:25.977587Z","iopub.status.idle":"2023-01-12T13:50:25.981513Z","shell.execute_reply":"2023-01-12T13:50:25.980718Z"},"papermill":{"duration":0.015287,"end_time":"2023-01-12T13:50:25.983487","exception":false,"start_time":"2023-01-12T13:50:25.9682","status":"completed"},"tags":[]},"outputs":[],"source":["#Fitting the model on training data\n","#classifier.fit(X_train, y_train)"]},{"cell_type":"code","execution_count":13,"id":"fafac011","metadata":{"execution":{"iopub.execute_input":"2023-01-12T13:50:26.00157Z","iopub.status.busy":"2023-01-12T13:50:26.000933Z","iopub.status.idle":"2023-01-12T13:50:26.004564Z","shell.execute_reply":"2023-01-12T13:50:26.003829Z"},"papermill":{"duration":0.014905,"end_time":"2023-01-12T13:50:26.006506","exception":false,"start_time":"2023-01-12T13:50:25.991601","status":"completed"},"tags":[]},"outputs":[],"source":["#Creating the instance\n","#classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)"]},{"cell_type":"markdown","id":"61d9f4f3","metadata":{"papermill":{"duration":0.007868,"end_time":"2023-01-12T13:50:26.02249","exception":false,"start_time":"2023-01-12T13:50:26.014622","status":"completed"},"tags":[]},"source":["
\n"," Building the model - Random Forest Classifier\n","
\n"," The model can be tuned further; keep an eye out for an updated version of this notebook!\n","
"]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.12"},"papermill":{"default_parameters":{},"duration":11.826605,"end_time":"2023-01-12T13:50:27.035887","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2023-01-12T13:50:15.209282","version":"2.3.4"}},"nbformat":4,"nbformat_minor":5}
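The notebook above describes Random Forest as a collection of decision trees, each fit on a random subset of the training set, whose votes are combined into the final prediction. A minimal pure-Python sketch of that voting scheme (toy one-feature "stumps" on made-up ad-spend data; names like `train_stump` and `random_forest` are illustrative, not any library API):

```python
import random

def majority_vote(votes):
    """Final ensemble prediction: the label most trees voted for."""
    return max(set(votes), key=votes.count)

def train_stump(sample):
    """Fit a one-split 'tree' (a stump) on (feature, label) pairs."""
    threshold = sum(x for x, _ in sample) / len(sample)
    # Predict the majority label on each side of the threshold.
    right = [lbl for x, lbl in sample if x >= threshold]
    left = [lbl for x, lbl in sample if x < threshold]
    right_lbl = max(set(right), key=right.count) if right else 0
    left_lbl = max(set(left), key=left.count) if left else 0
    return lambda x: right_lbl if x >= threshold else left_lbl

def random_forest(data, n_trees=25, seed=42):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    stumps = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return lambda x: majority_vote([s(x) for s in stumps])

# Toy data: low ad spend -> no purchase (0), high ad spend -> purchase (1)
data = [(10, 0), (12, 0), (14, 0), (80, 1), (85, 1), (90, 1)]
predict = random_forest(data)
print(predict(5), predict(95))
```

A real `RandomForestClassifier` adds two key refinements on top of this: full-depth trees instead of stumps, and a random subset of features (`max_features`) considered at each split.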
--------------------------------------------------------------------------------
/titanic-competition-submission.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"source":"","metadata":{},"cell_type":"markdown","outputs":[],"execution_count":0},{"cell_type":"markdown","id":"6bb1e547","metadata":{"papermill":{"duration":0.004232,"end_time":"2022-12-01T18:35:22.824111","exception":false,"start_time":"2022-12-01T18:35:22.819879","status":"completed"},"tags":[]},"source":["Titanic_competition_submission"]},{"cell_type":"code","execution_count":1,"id":"021a3c18","metadata":{"_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","execution":{"iopub.execute_input":"2022-12-01T18:35:22.832906Z","iopub.status.busy":"2022-12-01T18:35:22.832402Z","iopub.status.idle":"2022-12-01T18:35:22.847605Z","shell.execute_reply":"2022-12-01T18:35:22.846287Z"},"papermill":{"duration":0.023231,"end_time":"2022-12-01T18:35:22.850783","exception":false,"start_time":"2022-12-01T18:35:22.827552","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["/kaggle/input/titanic/train.csv\n","/kaggle/input/titanic/test.csv\n","/kaggle/input/titanic/gender_submission.csv\n"]}],"source":["# This Python 3 environment comes with many helpful analytics libraries installed\n","# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n","# For example, here's several helpful packages to load\n","\n","import numpy as np # linear algebra\n","import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n","\n","# Input data files are available in the read-only \"../input/\" directory\n","# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n","\n","import os\n","for dirname, _, filenames in os.walk('/kaggle/input'):\n"," for filename in filenames:\n"," print(os.path.join(dirname, filename))\n","\n","# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n","# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session"]},{"cell_type":"code","execution_count":2,"id":"c870dd3b","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:22.859571Z","iopub.status.busy":"2022-12-01T18:35:22.859159Z","iopub.status.idle":"2022-12-01T18:35:25.09791Z","shell.execute_reply":"2022-12-01T18:35:25.096837Z"},"papermill":{"duration":2.245887,"end_time":"2022-12-01T18:35:25.100641","exception":false,"start_time":"2022-12-01T18:35:22.854754","status":"completed"},"tags":[]},"outputs":[],"source":["# Data manipulation imports\n","import numpy as np\n","import pandas as pd\n","\n","# Visualization imports\n","import matplotlib.pyplot as plt\n","import plotly.express as px\n","\n","# Modeling imports\n","from sklearn.model_selection import train_test_split\n","from sklearn.impute import SimpleImputer\n","from sklearn.preprocessing import OneHotEncoder, StandardScaler\n","from sklearn.compose import ColumnTransformer\n","from sklearn.pipeline import Pipeline\n","from sklearn.neighbors import KNeighborsClassifier\n","from sklearn.metrics import accuracy_score, 
ConfusionMatrixDisplay"]},{"cell_type":"code","execution_count":3,"id":"f45b0f58","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.108876Z","iopub.status.busy":"2022-12-01T18:35:25.108461Z","iopub.status.idle":"2022-12-01T18:35:25.148511Z","shell.execute_reply":"2022-12-01T18:35:25.147343Z"},"papermill":{"duration":0.047113,"end_time":"2022-12-01T18:35:25.151031","exception":false,"start_time":"2022-12-01T18:35:25.103918","status":"completed"},"tags":[]},"outputs":[{"data":{"text/html":["
"],"text/plain":[" PassengerId Survived Pclass \\\n","0 1 0 3 \n","1 2 1 1 \n","2 3 1 3 \n","3 4 1 1 \n","4 5 0 3 \n","5 6 0 3 \n","6 7 0 1 \n","7 8 0 3 \n","8 9 1 3 \n","9 10 1 2 \n","10 11 1 3 \n","\n"," Name Sex Age SibSp \\\n","0 Braund, Mr. Owen Harris male 22.0 1 \n","1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n","2 Heikkinen, Miss. Laina female 26.0 0 \n","3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n","4 Allen, Mr. William Henry male 35.0 0 \n","5 Moran, Mr. James male NaN 0 \n","6 McCarthy, Mr. Timothy J male 54.0 0 \n","7 Palsson, Master. Gosta Leonard male 2.0 3 \n","8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n","9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n","10 Sandstrom, Miss. Marguerite Rut female 4.0 1 \n","\n"," Parch Ticket Fare Cabin Embarked \n","0 0 A/5 21171 7.2500 NaN S \n","1 0 PC 17599 71.2833 C85 C \n","2 0 STON/O2. 3101282 7.9250 NaN S \n","3 0 113803 53.1000 C123 S \n","4 0 373450 8.0500 NaN S \n","5 0 330877 8.4583 NaN Q \n","6 0 17463 51.8625 E46 S \n","7 1 349909 21.0750 NaN S \n","8 2 347742 11.1333 NaN S \n","9 0 237736 30.0708 NaN C \n","10 1 PP 9549 16.7000 G6 S "]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["#loading train_data\n","titanic_data = pd.read_csv(\"/kaggle/input/titanic/train.csv\")\n","titanic_data.head(11)"]},{"cell_type":"code","execution_count":4,"id":"17b6198f","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.159704Z","iopub.status.busy":"2022-12-01T18:35:25.159297Z","iopub.status.idle":"2022-12-01T18:35:25.182487Z","shell.execute_reply":"2022-12-01T18:35:25.181426Z"},"papermill":{"duration":0.030802,"end_time":"2022-12-01T18:35:25.185425","exception":false,"start_time":"2022-12-01T18:35:25.154623","status":"completed"},"tags":[]},"outputs":[{"data":{"text/plain":["( PassengerId Pclass Name Sex \\\n"," 0 892 3 Kelly, Mr. James male \n"," 1 893 3 Wilkes, Mrs. 
James (Ellen Needs) female \n"," 2 894 2 Myles, Mr. Thomas Francis male \n"," 3 895 3 Wirz, Mr. Albert male \n"," 4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n"," \n"," Age SibSp Parch Ticket Fare Cabin Embarked \n"," 0 34.5 0 0 330911 7.8292 NaN Q \n"," 1 47.0 1 0 363272 7.0000 NaN S \n"," 2 62.0 0 0 240276 9.6875 NaN Q \n"," 3 27.0 0 0 315154 8.6625 NaN S \n"," 4 22.0 1 1 3101298 12.2875 NaN S ,\n"," (418, 11))"]},"execution_count":4,"metadata":{},"output_type":"execute_result"}],"source":["#loading test_data\n","titanic_test = pd.read_csv(\"/kaggle/input/titanic/test.csv\")\n","titanic_test.head(), titanic_test.shape"]},{"cell_type":"code","execution_count":5,"id":"f8b36fab","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.194472Z","iopub.status.busy":"2022-12-01T18:35:25.194064Z","iopub.status.idle":"2022-12-01T18:35:25.206425Z","shell.execute_reply":"2022-12-01T18:35:25.205321Z"},"papermill":{"duration":0.020541,"end_time":"2022-12-01T18:35:25.209739","exception":false,"start_time":"2022-12-01T18:35:25.189198","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["% of women who survived: 0.7420382165605095\n"]}],"source":["#Checking pattern for women\n","women = titanic_data.loc[titanic_data.Sex == 'female'][\"Survived\"]\n","rate_women = sum(women)/len(women)\n","\n","print(\"% of women who survived:\", rate_women)"]},{"cell_type":"code","execution_count":6,"id":"218d65df","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.219923Z","iopub.status.busy":"2022-12-01T18:35:25.219493Z","iopub.status.idle":"2022-12-01T18:35:25.227441Z","shell.execute_reply":"2022-12-01T18:35:25.226154Z"},"papermill":{"duration":0.015976,"end_time":"2022-12-01T18:35:25.230242","exception":false,"start_time":"2022-12-01T18:35:25.214266","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["% of men who survived: 
0.18890814558058924\n"]}],"source":["#Checking pattern for men\n","men = titanic_data.loc[titanic_data.Sex == 'male'][\"Survived\"]\n","rate_men = sum(men)/len(men)\n","\n","print(\"% of men who survived:\", rate_men)"]},{"cell_type":"code","execution_count":7,"id":"6a6da08e","metadata":{"execution":{"iopub.execute_input":"2022-12-01T18:35:25.239736Z","iopub.status.busy":"2022-12-01T18:35:25.239303Z","iopub.status.idle":"2022-12-01T18:35:25.52152Z","shell.execute_reply":"2022-12-01T18:35:25.520193Z"},"papermill":{"duration":0.289828,"end_time":"2022-12-01T18:35:25.523961","exception":false,"start_time":"2022-12-01T18:35:25.234133","status":"completed"},"tags":[]},"outputs":[{"name":"stdout","output_type":"stream","text":["Your submission was successfully saved!\n"]}],"source":["from sklearn.ensemble import RandomForestClassifier\n","\n","y = titanic_data[\"Survived\"]\n","\n","features = [\"Pclass\", \"Sex\", \"SibSp\", \"Parch\"]\n","X = pd.get_dummies(titanic_data[features])\n","X_test = pd.get_dummies(titanic_test[features])\n","\n","model = RandomForestClassifier(n_estimators=101, max_depth=5, random_state=123)\n","model.fit(X, y)\n","predictions = model.predict(X_test)\n","\n","output = pd.DataFrame({'PassengerId': titanic_test.PassengerId, 'Survived': predictions})\n","output.to_csv('submission.csv', index=False)\n","print(\"Your submission was successfully saved!\")"]},{"cell_type":"code","execution_count":null,"id":"223bb683","metadata":{"papermill":{"duration":0.003424,"end_time":"2022-12-01T18:35:25.531277","exception":false,"start_time":"2022-12-01T18:35:25.527853","status":"completed"},"tags":[]},"outputs":[],"source":[]}],"metadata":{"kernelspec":{"display_name":"Python 
3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.12"},"papermill":{"default_parameters":{},"duration":13.313314,"end_time":"2022-12-01T18:35:26.359542","environment_variables":{},"exception":null,"input_path":"__notebook__.ipynb","output_path":"__notebook__.ipynb","parameters":{},"start_time":"2022-12-01T18:35:13.046228","version":"2.3.4"}},"nbformat":4,"nbformat_minor":5}
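The gender baseline in the notebook above filters with pandas `.loc` and divides `sum` by `len`. The same per-group survival-rate computation can be sketched without pandas (the records below are made-up toy rows, not the Kaggle data):

```python
# Hypothetical miniature of the Titanic data: one dict per passenger.
records = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
]

def survival_rate(rows, sex):
    """Fraction of passengers of the given sex who survived."""
    outcomes = [r["Survived"] for r in rows if r["Sex"] == sex]
    return sum(outcomes) / len(outcomes)

print("% of women who survived:", survival_rate(records, "female"))
print("% of men who survived:", survival_rate(records, "male"))
```

Because `Survived` is 0/1, the mean of the column is exactly the survival rate, which is all the pandas version computes as well.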
--------------------------------------------------------------------------------