├── data.zip
├── README.md
├── LICENSE
├── Aula7-Regressao-Linear-Simples.ipynb
└── Aula4-Classificacao-Naive Bayes.ipynb
/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/prof-francisco-rodrigues/ciencia-de-dados/HEAD/data.zip
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Ciência de Dados (Data Science)
2 | Francisco Aparecido Rodrigues, francisco@icmc.usp.br.
3 | Universidade de São Paulo, São Carlos, Brasil.
4 | https://sites.icmc.usp.br/francisco
5 | Copyright: Creative Commons
6 |
7 |
8 | This repository contains several Python notebooks that describe the main concepts of Data Science.
9 |
10 | The notebooks are the following:
11 |
12 | **Lecture 0 - Introduction to Python Programming**
13 |
14 | **Lecture 1 - Data preparation and transformation**: We present examples of how to clean and prepare data so that it can be handled by Data Science algorithms. We also cover basic methods for data preprocessing and transformation.
15 | Lecture on YouTube: https://bit.ly/3em6BA0
16 |
17 | **Lecture 2 - Exploratory data analysis**: covers summary statistics and correlation measures (Pearson and Spearman).
18 | Lectures on YouTube:
19 | Summary statistics: https://bit.ly/37HwebF
20 | Correlation: https://bit.ly/2UXDhIe
21 |
22 | **Lecture 3 - Bayesian Decision Theory and the Bayesian Classifier**.
23 | Lectures on YouTube:
24 | Bayesian Decision Theory: https://www.youtube.com/watch?v=8zAKWEOdGsg
25 | Bayesian Classifier: https://www.youtube.com/watch?v=Rq_hXHrdkbc
26 |
27 | **Lecture 4 - Naive Bayes Classifier**.
28 | Lecture on YouTube:
29 | https://www.youtube.com/watch?v=Bk2mSIMw_XE&list=PLSc7xcwCGNh1PJrPfLaH4MMjfDl48tmGM&index=11&ab_channel=FranciscoRodrigues
30 |
31 | **Lecture 5 - k-nearest neighbors algorithm**.
32 | Lecture on YouTube:
33 | https://www.youtube.com/watch?v=7WySJWL2o_4&list=PLSc7xcwCGNh1PJrPfLaH4MMjfDl48tmGM&index=12&ab_channel=FranciscoRodrigues
34 |
35 | **Lecture 6 - Logistic Regression**.
36 | Lecture on YouTube:
37 | https://www.youtube.com/watch?v=EoP2wN0yuHA&list=PLSc7xcwCGNh1PJrPfLaH4MMjfDl48tmGM&index=13&ab_channel=FranciscoRodrigues
38 |
39 | **Lecture 7 - Simple Linear Regression**.
40 | Lecture on YouTube:
41 | https://www.youtube.com/watch?v=xRIc81AeLOg&t=44s&ab_channel=FranciscoRodrigues
42 |
43 | If you use this material in your classes, please cite the source: https://github.com/franciscoicmc
44 |
45 | New notebooks will be added periodically.
46 |
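As a quick taste of what `Aula7-Regressao-Linear-Simples.ipynb` covers, the following is a minimal sketch of a simple linear regression fit on the same toy data used in that notebook. It is not the notebook's own code: the vectorized form and the variable names (`s_xy`, `s_xx`, `rse`, `r2`) are choices made here for illustration, and the `np.polyfit` call is an assumption added only as an independent cross-check.

```python
import numpy as np

# Same toy data as in Aula7-Regressao-Linear-Simples.ipynb
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Closed-form least-squares estimates:
#   b1 = S_xy / S_xx,  b0 = mean(y) - b1 * mean(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
s_xx = np.sum((x - x.mean()) ** 2)
b1 = s_xy / s_xx
b0 = y.mean() - b1 * x.mean()

# Residual standard error (n - 2 degrees of freedom) and R^2
residuals = y - (b0 + b1 * x)
rse = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Cross-check (assumption: np.polyfit is not used in the notebook itself).
# For deg=1 it returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

print(f"b_0 = {b0:.4f}, b_1 = {b1:.4f}, RSE = {rse:.4f}, R2 = {r2:.4f}")
# prints: b_0 = 1.2364, b_1 = 1.1697, RSE = 0.8385, R2 = 0.9525
```

The centered sums used here are algebraically equivalent to the `np.sum(y*x) - n*m_y*m_x` form in the notebook; centering first tends to be slightly more robust numerically when the values are large.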
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Creative Commons Legal Code
2 |
3 | CC0 1.0 Universal
4 |
5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
12 | HEREUNDER.
13 |
14 | Statement of Purpose
15 |
16 | The laws of most jurisdictions throughout the world automatically confer
17 | exclusive Copyright and Related Rights (defined below) upon the creator
18 | and subsequent owner(s) (each and all, an "owner") of an original work of
19 | authorship and/or a database (each, a "Work").
20 |
21 | Certain owners wish to permanently relinquish those rights to a Work for
22 | the purpose of contributing to a commons of creative, cultural and
23 | scientific works ("Commons") that the public can reliably and without fear
24 | of later claims of infringement build upon, modify, incorporate in other
25 | works, reuse and redistribute as freely as possible in any form whatsoever
26 | and for any purposes, including without limitation commercial purposes.
27 | These owners may contribute to the Commons to promote the ideal of a free
28 | culture and the further production of creative, cultural and scientific
29 | works, or to gain reputation or greater distribution for their Work in
30 | part through the use and efforts of others.
31 |
32 | For these and/or other purposes and motivations, and without any
33 | expectation of additional consideration or compensation, the person
34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she
35 | is an owner of Copyright and Related Rights in the Work, voluntarily
36 | elects to apply CC0 to the Work and publicly distribute the Work under its
37 | terms, with knowledge of his or her Copyright and Related Rights in the
38 | Work and the meaning and intended legal effect of CC0 on those rights.
39 |
40 | 1. Copyright and Related Rights. A Work made available under CC0 may be
41 | protected by copyright and related or neighboring rights ("Copyright and
42 | Related Rights"). Copyright and Related Rights include, but are not
43 | limited to, the following:
44 |
45 | i. the right to reproduce, adapt, distribute, perform, display,
46 | communicate, and translate a Work;
47 | ii. moral rights retained by the original author(s) and/or performer(s);
48 | iii. publicity and privacy rights pertaining to a person's image or
49 | likeness depicted in a Work;
50 | iv. rights protecting against unfair competition in regards to a Work,
51 | subject to the limitations in paragraph 4(a), below;
52 | v. rights protecting the extraction, dissemination, use and reuse of data
53 | in a Work;
54 | vi. database rights (such as those arising under Directive 96/9/EC of the
55 | European Parliament and of the Council of 11 March 1996 on the legal
56 | protection of databases, and under any national implementation
57 | thereof, including any amended or successor version of such
58 | directive); and
59 | vii. other similar, equivalent or corresponding rights throughout the
60 | world based on applicable law or treaty, and any national
61 | implementations thereof.
62 |
63 | 2. Waiver. To the greatest extent permitted by, but not in contravention
64 | of, applicable law, Affirmer hereby overtly, fully, permanently,
65 | irrevocably and unconditionally waives, abandons, and surrenders all of
66 | Affirmer's Copyright and Related Rights and associated claims and causes
67 | of action, whether now known or unknown (including existing as well as
68 | future claims and causes of action), in the Work (i) in all territories
69 | worldwide, (ii) for the maximum duration provided by applicable law or
70 | treaty (including future time extensions), (iii) in any current or future
71 | medium and for any number of copies, and (iv) for any purpose whatsoever,
72 | including without limitation commercial, advertising or promotional
73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
74 | member of the public at large and to the detriment of Affirmer's heirs and
75 | successors, fully intending that such Waiver shall not be subject to
76 | revocation, rescission, cancellation, termination, or any other legal or
77 | equitable action to disrupt the quiet enjoyment of the Work by the public
78 | as contemplated by Affirmer's express Statement of Purpose.
79 |
80 | 3. Public License Fallback. Should any part of the Waiver for any reason
81 | be judged legally invalid or ineffective under applicable law, then the
82 | Waiver shall be preserved to the maximum extent permitted taking into
83 | account Affirmer's express Statement of Purpose. In addition, to the
84 | extent the Waiver is so judged Affirmer hereby grants to each affected
85 | person a royalty-free, non transferable, non sublicensable, non exclusive,
86 | irrevocable and unconditional license to exercise Affirmer's Copyright and
87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the
88 | maximum duration provided by applicable law or treaty (including future
89 | time extensions), (iii) in any current or future medium and for any number
90 | of copies, and (iv) for any purpose whatsoever, including without
91 | limitation commercial, advertising or promotional purposes (the
92 | "License"). The License shall be deemed effective as of the date CC0 was
93 | applied by Affirmer to the Work. Should any part of the License for any
94 | reason be judged legally invalid or ineffective under applicable law, such
95 | partial invalidity or ineffectiveness shall not invalidate the remainder
96 | of the License, and in such case Affirmer hereby affirms that he or she
97 | will not (i) exercise any of his or her remaining Copyright and Related
98 | Rights in the Work or (ii) assert any associated claims and causes of
99 | action with respect to the Work, in either case contrary to Affirmer's
100 | express Statement of Purpose.
101 |
102 | 4. Limitations and Disclaimers.
103 |
104 | a. No trademark or patent rights held by Affirmer are waived, abandoned,
105 | surrendered, licensed or otherwise affected by this document.
106 | b. Affirmer offers the Work as-is and makes no representations or
107 | warranties of any kind concerning the Work, express, implied,
108 | statutory or otherwise, including without limitation warranties of
109 | title, merchantability, fitness for a particular purpose, non
110 | infringement, or the absence of latent or other defects, accuracy, or
111 | the present or absence of errors, whether or not discoverable, all to
112 | the greatest extent permissible under applicable law.
113 | c. Affirmer disclaims responsibility for clearing rights of other persons
114 | that may apply to the Work or any use thereof, including without
115 | limitation any person's Copyright and Related Rights in the Work.
116 | Further, Affirmer disclaims responsibility for obtaining any necessary
117 | consents, permissions or other rights required for any use of the
118 | Work.
119 | d. Affirmer understands and acknowledges that Creative Commons is not a
120 | party to this document and has no duty or obligation with respect to
121 | this CC0 or use of the Work.
122 |
--------------------------------------------------------------------------------
/Aula7-Regressao-Linear-Simples.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Simple linear regression"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Francisco Aparecido Rodrigues, francisco@icmc.usp.br. \n",
15 | "Universidade de São Paulo, São Carlos, Brasil. \n",
16 | "https://sites.icmc.usp.br/francisco \n",
17 | "Copyright: Creative Commons"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | " "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "Suppose we have the data set below."
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "data": {
41 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAELCAYAAAA7h+qnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAPBUlEQVR4nO3db4xldX3H8fdnWf8N1mjD9I/g7mBjrZa0wd40Ko21YhsaKfqgtZqRAG2dNGkVjYn/9gEx6T5oNFYfGOMEEK1XTIu0EmKt1j+xTZR2FqyCq5aAO66ijDVV6yQi4dsH926ZGRiZgdnzO3vP+5WQM/c3Z3c/OWHvZ865Z883VYUkadj2tQ4gSWrPMpAkWQaSJMtAkoRlIEkC9rcO8HCdccYZtbCw0DqGJJ1Sjhw58t2qmt+6fsqWwcLCAisrK61jSNIpJcmxB1v3MpEkyTKQJFkGkiQsA0kSloEkiY7LIMnVSe5OcuuGtbcm+UqSLyb5hyRP7DKTJJ0qxmNYWIB9+ybb8Xjvfu+uzwyuAS7YsvYJ4Jyq+jXga8CbOs4kSb03HsPSEhw7BlWT7dLS3hVCp2VQVZ8Fvrdl7eNVde/05eeBs7rMJEmngkOHYH1989r6+mR9L/TtM4M/Af5pu28mWUqykmRlbW2tw1iS1Nbq6u7Wd6s3ZZDkEHAvsO1JT1UtV9Woqkbz8w/419SSNLMOHNjd+m71ogySXAJcCCyWo9ck6QEOH4a5uc1rc3OT9b3QvAySXAC8AbioqtYfan9JGqLFRVhehoMHIZlsl5cn63shXf4gnuRa4PnAGcB3gCuY3D30GOC/p7t9vqr+/KF+r9FoVD6oTpJ2J8mRqhptXe/0qaVV9fIHWb6qywySpAdqfplIktSeZSBJsgwkSZaBJAnLQJKEZSBJwjKQJGEZSJKwDCRJWAaSJCwDSRKWgSQJy0CShGUgScIykCRhGUiSsAwkSVgGkiQsA0kSloEkCctAkoRlIEnCMpAkYRlIkrAMJElYBpIkLANJEh2XQZKrk9yd5NYNaz+b5BNJ/mu6fVKXmSTpoYzHsLAA+/ZNtuNx60R7r+szg2uAC7asvRH4ZFU9Dfjk9LUk9cJ4DEtLcOwYVE22S0uzVwidlkFVfRb43pblFwPvm379PuAlXWaSpJ/m0CFYX9+8tr4+WZ8lffjM4Oer6i6A6fbnttsxyVKSlSQra2trnQWUNFyrq7tbP1X1oQx2rKqWq2pUVaP5+fnWcSQNwIEDu1s/VfWhDL6T5BcBptu7G+eRpP93+DDMzW1em5ubrM+SPpTBDcAl068vAT7SMIskbbK4CMvLcPAgJJPt8vJkfZakqrr7w5JrgecDZwDfAa4A/hH4O+AAsAr8UVVt/ZD5AUajUa2srJy8sJI0g5IcqarR1vX9XYaoqpdv863zu8whSdqsD5eJJEmNWQaSJMtAkmQZSJKwDCRJWAaSJCwDSRKWgSQJy0CShGUgqceGMGGsLzp9HIUk7dSJCWMnBsucmDAGs/eQuD7wzEBSLw1lwlhfWAaSemkoE8b6wjKQ1EtDmTDWF5aBpF4ayoSxvrAMJPXSUCaM9YV3E0nqrcVF3/y74pmBJMkykCRZBpIkLANJEpaBJAnLQJKEZSBJwjKQJGEZSJKwDCRJ9KgMkrw2yW1Jbk1ybZLHts4kda0vk736kkPd6UUZJDkTeDUwqqpzgNOAl7VNJXXrxGSvY8eg6v7JXl2/Efclh7rVizKY2g88Lsl+YA74VuM8Uqf6MtmrLznUrV6UQVV9E3gbsArcBXy/qj6+db8kS0lWkqysra11HVM6qfoy2asvOdStXpRBkicBLwbOBp4MnJ7kFVv3q6rlqhpV1Wh+fr7rmNJJ1ZfJXn3JoW71ogyAFwJ3VtVaVf0EuB54buNMUqf6MtmrLznUrb6UwSrw7CRzSQKcDxxtnEnqVF8me/Ulh7qVqmqdAYAkb
wH+GLgXuAX4s6r68Xb7j0ajWllZ6SqeJM2EJEeqarR1vTdjL6vqCuCK1jkkaYj6cplIktSQZSBJsgwkSZaBJAnLQJKEZSBJwjKQJGEZSJKwDCRJWAYS4GQvqTePo5BaOTHZ68RAlxOTvcCHs2k4PDPQ4DnZS7IMJCd7SVgGkpO9JCwDycleEpaB5GQvCe8mkoDJG79v/hoyzwwkSTsvgyQXJrE8JGkG7ebN/SPAN5P8dZJnnKxAkqTu7aYMfglYBl4K3Jrkc0lemeQJJyeaJKkrOy6Dqvp6VV1RVWcDvwvcDvwNcFeSv03yOycrpCTp5HpYnwFU1aeq6mLgl4EjwCLwL0nuTPLaJN6lJEmnkIdVBkl+O8k1wFeBc4B3Ab8H/D3wFuD9exVQknTy7fgn+CQHgUum/y0AnwGWgOur6sfT3T6Z5HPAB/Y2piTpZNrN5Zw7gG8B1wBXV9Wd2+x3G/DvjzCXJKlDuymDPwA+VlX3/bSdquprgB8mS9IpZDd3E330oYrgkUjyxCTXJflKkqNJnnOy/ixJ0mZ9uuvnnUzOPP4wyaOBuYf6BZKkvdGLMpj+w7XnAZcCVNU9wD0tM0nSkPTlWUNPBdaA9ya5JcmVSU7fulOSpSQrSVbW1ta6TylJM6ovZbAfeBbw7qo6F/gR8MatO1XVclWNqmo0Pz/fdUZJmll9KYPjwPGqumn6+jom5SBJ6kAvyqCqvg18I8nTp0vnA19uGEmSBqUXHyBPvQoYT+8kugO4rHEeSRqM3pRBVX0BGLXOIUlD1IvLRJKktiwDSZJlIEmyDCRJWAaSJCwDSRKWgSQJy0CShGUgScIyUA+Mx7CwAPv2TbbjcetE0vD05nEUGqbxGJaWYH198vrYsclrgMXFdrmkofHMQE0dOnR/EZywvj5Zl9Qdy0BNra7ubl3SyWEZqKkDB3a3LunksAzU1OHDMDe3eW1ubrIuqTuWgZpaXITlZTh4EJLJdnnZD4+lrnk3kZpbXPTNX2rNMwNJkmUgSbIMJElYBpIkLANJEpaBJAnLQJKEZSBJwjKQJGEZSJLoWRkkOS3JLUlubJ1FkoakV2UAXA4cbR1CkoamN2WQ5CzgRcCVrbNI0tD0pgyAdwCvB+7bbockS0lWkqysra11l0ySZlwvyiDJhcDdVXXkp+1XVctVNaqq0fz8fEfpJGn29aIMgPOAi5J8HfgQ8IIkH2gbSZKGoxdlUFVvqqqzqmoBeBnwqap6ReNYkjQYvSgDSVJbvRt7WVWfAT7TOIYkDYpnBpIky0CSZBlIkrAMJElYBpIkLANJEpaBJAnLQJKEZSBJwjJoYjyGhQXYt2+yHY9bJ5I0dL17HMWsG49haQnW1yevjx2bvAZYXGyXS9KweWbQsUOH7i+CE9bXJ+uS1Ipl0LHV1d2tS1IXLIOOHTiwu3VJ6oJl0LHDh2FubvPa3NxkXZJasQw6trgIy8tw8CAkk+3ysh8eS2rLu4kaWFz0zV9Sv3hmIEmyDCRJloEkCctAkoRlIEnCMpAkYRlIkrAMJElYBpIkLANJEj0pgyRPSfLpJEeT3Jbk8taZhsCJa5JO6Muzie4FXldVNyf5GeBIkk9U1ZdbB5tVTlyTtFEvzgyq6q6qunn69Q+Bo8CZbVPNNieuSdqoF2WwUZIF4Fzgpgf53lKSlSQra2trXUebKU5ck7RRr8ogyeOBDwOvqaofbP1+VS1X1aiqRvPz890HnCFOXJO0UW/KIMmjmBTBuKqub51n1jlxTdJGvSiDJAGuAo5W1dtb5xkCJ65J2ihV1ToDSX4L+FfgS8B90+U3V9VHt/s1o9GoVlZWuognSTMjyZGqGm1d78WtpVX1b0Ba55CkoerFZSJJUluWgSTJMpAkWQaSJCwDSRKWgSQJy0CShGUgScIykCQxsDJwspckPbhePI6iC072kqTtDebMwMlekrS9wZSBk70kaXuDKQMne0nS9gZTBk72kqTtDaYMnOwlSdsbzN1EMHnj981fk
h5oMGcGkqTtWQaSJMtAkmQZSJKwDCRJQKqqdYaHJckacOxh/vIzgO/uYZxTncfjfh6LzTwem83C8ThYVfNbF0/ZMngkkqxU1ah1jr7weNzPY7GZx2OzWT4eXiaSJFkGkqThlsFy6wA94/G4n8diM4/HZjN7PAb5mYEkabOhnhlIkjawDCRJwyuDJBck+WqS25O8sXWeVpI8JcmnkxxNcluSy1tn6oMkpyW5JcmNrbO0luSJSa5L8pXp/yfPaZ2plSSvnf49uTXJtUke2zrTXhtUGSQ5DXgX8PvAM4GXJ3lm21TN3Au8rqqeATwb+IsBH4uNLgeOtg7RE+8EPlZVvwL8OgM9LknOBF4NjKrqHOA04GVtU+29QZUB8JvA7VV1R1XdA3wIeHHjTE1U1V1VdfP06x8y+Yt+ZttUbSU5C3gRcGXrLK0leQLwPOAqgKq6p6r+p22qpvYDj0uyH5gDvtU4z54bWhmcCXxjw+vjDPwNECDJAnAucFPbJM29A3g9cF/rID3wVGANeO/0stmVSU5vHaqFqvom8DZgFbgL+H5Vfbxtqr03tDLIg6wN+t7aJI8HPgy8pqp+0DpPK0kuBO6uqiOts/TEfuBZwLur6lzgR8AgP2NL8iQmVxDOBp4MnJ7kFW1T7b2hlcFx4CkbXp/FDJ7u7VSSRzEpgnFVXd86T2PnARcl+TqTy4cvSPKBtpGaOg4cr6oTZ4vXMSmHIXohcGdVrVXVT4Drgec2zrTnhlYG/wE8LcnZSR7N5EOgGxpnaiJJmFwPPlpVb2+dp7WqelNVnVVVC0z+v/hUVc3cT387VVXfBr6R5OnTpfOBLzeM1NIq8Owkc9O/N+czgx+m728doEtVdW+SvwT+mckdAVdX1W2NY7VyHnAx8KUkX5iuvbmqPtowk/rlVcB4+oPTHcBljfM0UVU3JbkOuJnJXXi3MIOPpfBxFJKkwV0mkiQ9CMtAkmQZSJIsA0kSloEkCctAkoRlIEnCMpAkYRlIj9h0CMzxJO/fsn5Dkq8lmWuVTdopy0B6hKbP+f9T4OIkLwFIchmT2QiXVtV6y3zSTvg4CmmPJHkP8BLgAuDTwHuq6g1tU0k7YxlIe2Q6G+KLTJ55fzvwG1X147appJ3xMpG0R6rqf4EbgccAV1kEOpV4ZiDtkSQj4HPAl4CDwK9O5wJIvWcZSHsgyWOZPO/+DuClwH8yGRx0UdNg0g55mUjaG38F/ALwyundQ5cAL0pyadNU0g55ZiA9QknOAz4LXFxVH9yw/lbglcA5VXW8VT5pJywDSZKXiSRJloEkCctAkoRlIEnCMpAkYRlIkrAMJElYBpIk4P8AjNEyvizrObsAAAAASUVORK5CYII=\n",
42 | "text/plain": [
43 | ""
44 | ]
45 | },
46 | "metadata": {
47 | "needs_background": "light"
48 | },
49 | "output_type": "display_data"
50 | }
51 | ],
52 | "source": [
53 | "import matplotlib.pyplot as plt\n",
54 | "%matplotlib inline \n",
55 | "import numpy as np\n",
56 | "\n",
57 | "# define the data\n",
58 | "x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
59 | "y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])\n",
60 | "plt.plot(x, y, 'bo')\n",
61 | "plt.xlabel(\"x\", fontsize = 15)\n",
62 | "plt.ylabel(\"y\", fontsize = 15)\n",
63 | "plt.show()"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "We see that there is a relationship between the variables $X$ and $Y$. We want to find the model that best describes this relationship. We can assume a simple linear regression model such as:\n",
71 | "$$\n",
72 | "y_i \\approx \\beta_0 + \\beta_1x_i\n",
73 | "$$"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "Since factors other than $x_i$ also affect the values of $y_i$, we can write:\n",
81 | "$$\n",
82 | "y_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i,\n",
83 | "$$\n",
84 | "where $\\epsilon_i$ is a random variable representing the approximation error. The goal of regression methods is to find the values of $\\beta_0$ and $\\beta_1$ that minimize the fitting error. In other words, we want the line in the $(x, y)$ plane that best fits the observed data."
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Estimating the coefficients by the method of moments or by least squares, we obtain:\n",
92 | " $$\n",
93 | " \\begin{cases}\n",
94 | " \\hat{\\beta}_1=\\frac{\\sum_{i=1}^n (x_i- \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^n (x_i - \\bar{x})^2} = \\frac{S_{xy}}{S_{xx}}\\\\\n",
95 | " \\hat{\\beta}_0= \\bar{y}-\\hat{\\beta}_1\\bar{x}\n",
96 | " \\end{cases}\n",
97 | " $$"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "Let us implement a function to perform the estimation."
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 3,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "import math  # used by the RSE function defined in a later cell\n",
114 | "import numpy as np\n",
115 | "\n",
116 | "def estimate_coef(x, y):\n",
117 | "    # number of observations (points)\n",
118 | "    n = np.size(x)\n",
119 | "\n",
120 | "    # means of x and y\n",
121 | "    m_x, m_y = np.mean(x), np.mean(y)\n",
122 | "\n",
123 | "    # cross-deviation S_xy and deviation about x, S_xx\n",
124 | "    SS_xy = np.sum(y*x) - n*m_y*m_x\n",
125 | "    SS_xx = np.sum(x*x) - n*m_x*m_x\n",
126 | "\n",
127 | "    # regression coefficients\n",
128 | "    b_1 = SS_xy / SS_xx\n",
129 | "    b_0 = m_y - b_1*m_x\n",
130 | "\n",
131 | "    return b_0, b_1\n",
132 | "\n",
133 | "# plot the data and the fitted line\n",
134 | "def plot_regression_line(x, y, b):\n",
135 | "    # scatter plot of the data\n",
136 | "    plt.scatter(x, y, color = \"b\", marker = \"o\", s = 50)\n",
137 | "\n",
138 | "    # predicted values\n",
139 | "    y_pred = b[0] + b[1]*x\n",
140 | "\n",
141 | "    # regression line\n",
142 | "    plt.plot(x, y_pred, color = \"r\")\n",
143 | "\n",
144 | "    plt.xlabel('x', fontsize = 15)\n",
145 | "    plt.ylabel('y', fontsize = 15)\n",
146 | "    plt.show()"
147 | ]
148 | },
149 | {
150 | "cell_type": "markdown",
151 | "metadata": {},
152 | "source": [
153 | "Applying this to the data set:"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 4,
159 | "metadata": {},
160 | "outputs": [
161 | {
162 | "name": "stdout",
163 | "output_type": "stream",
164 | "text": [
165 | "Estimated coefficients:\n",
166 | "b_0 = 1.2363636363636363 \n",
167 | "b_1 = 1.1696969696969697\n"
168 | ]
169 | },
170 | {
171 | "data": {
172 | "image/png": "\n",
173 | "text/plain": [
174 | ""
175 | ]
176 | },
177 | "metadata": {
178 | "needs_background": "light"
179 | },
180 | "output_type": "display_data"
181 | }
182 | ],
183 | "source": [
184 | "import numpy as np\n",
185 | "\n",
186 | "# estimate the coefficients\n",
187 | "b = estimate_coef(x, y) \n",
188 | "print(\"Estimated coefficients:\\nb_0 = {} \\nb_1 = {}\".format(b[0], b[1])) \n",
189 | " \n",
190 | "# plot the linear fit\n",
191 | "plot_regression_line(x, y, b) "
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "To quantify the accuracy of the model, we use the residual standard error (RSE):\n",
199 | "$$\n",
200 | "RSE = \\sqrt{\\frac{1}{n-2}\\sum_{i=1}^n (y_i-\\hat{y}_i)^2}\n",
201 | "$$"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 5,
207 | "metadata": {},
208 | "outputs": [
209 | {
210 | "name": "stdout",
211 | "output_type": "stream",
212 | "text": [
213 | "RSE: 0.8384690232980003\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "# function that computes the RSE\n",
219 | "def RSE(x, y, b):\n",
220 | "    n = len(y)\n",
221 | "    rss = 0  # residual sum of squares\n",
222 | "    for i in range(0, n):\n",
223 | "        y_pred = b[0] + x[i]*b[1]  # predicted value\n",
224 | "        rss = rss + (y[i] - y_pred)**2\n",
225 | "    return math.sqrt(rss/(n - 2))\n",
226 | "\n",
227 | "print('RSE:', RSE(x, y, b))"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "Another important measure is the coefficient of determination $R^2$, which measures the proportion of the variability in $Y$ that can be explained by $X$.\n",
235 | "$$\n",
236 | "R^2 = 1-\\frac{\\sum_{i=1}^n (y_i-\\hat{y}_i)^2}{\\sum_{i=1}^n(y_i-\\bar{y})^2}, \\quad 0\\leq R^2\\leq 1\n",
237 | "$$"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 6,
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "name": "stdout",
247 | "output_type": "stream",
248 | "text": [
249 | "R2: 0.952538038613988\n"
250 | ]
251 | }
252 | ],
253 | "source": [
254 | "def R2(x, y, b):\n",
255 | "    n = len(y)\n",
256 | "    c1 = 0  # residual sum of squares\n",
257 | "    c2 = 0  # total sum of squares\n",
258 | "    ym = np.mean(y)\n",
259 | "    for i in range(0, n):\n",
260 | "        y_pred = b[0] + x[i]*b[1]  # predicted value\n",
261 | "        c1 = c1 + (y[i] - y_pred)**2\n",
262 | "        c2 = c2 + (y[i] - ym)**2\n",
263 | "    r2 = 1 - c1/c2\n",
264 | "    return r2\n",
265 | "\n",
266 | "print('R2:', R2(x, y, b))"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "The closer $R^2$ is to one, the better the linear regression fit."
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "The complete code:"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 19,
286 | "metadata": {},
287 | "outputs": [
288 | {
289 | "name": "stdout",
290 | "output_type": "stream",
291 | "text": [
292 | "Coeficientes:\n",
293 | "b_0 = 1.2363636363636363 \n",
294 | "b_1 = 1.1696969696969697\n",
295 | "R2: 0.952538038613988\n"
296 | ]
297 | },
298 | {
299 | "data": {
300 | "image/png": "\n",
301 | "text/plain": [
302 | ""
303 | ]
304 | },
305 | "metadata": {
306 | "needs_background": "light"
307 | },
308 | "output_type": "display_data"
309 | }
310 | ],
311 | "source": [
312 | "import matplotlib.pyplot as plt\n",
313 | "import numpy as np\n",
314 | "\n",
315 | "# define the data\n",
316 | "x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n",
317 | "y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])\n",
318 | "\n",
319 | "# function that estimates the coefficients\n",
320 | "def estimate_coef(x, y):\n",
321 | "    # number of observations\n",
322 | "    n = np.size(x)\n",
323 | "    # sample means\n",
324 | "    m_x, m_y = np.mean(x), np.mean(y)\n",
325 | "    SS_xy = np.sum(y*x) - n*m_y*m_x\n",
326 | "    SS_xx = np.sum(x*x) - n*m_x*m_x\n",
327 | "    # regression coefficients\n",
328 | "    b_1 = SS_xy / SS_xx\n",
329 | "    b_0 = m_y - b_1*m_x\n",
330 | "    return b_0, b_1\n",
331 | "\n",
332 | "# plot the data and the fitted line\n",
333 | "def plot_regression_line(x, y, b):\n",
334 | "    # scatter plot of the data\n",
335 | "    plt.scatter(x, y, color = \"b\", marker = \"o\", s = 50)\n",
336 | "    # predicted values\n",
337 | "    y_pred = b[0] + b[1]*x\n",
338 | "    # regression line\n",
339 | "    plt.plot(x, y_pred, color = \"r\")\n",
340 | "    plt.xlabel('x', fontsize = 15)\n",
341 | "    plt.ylabel('y', fontsize = 15)\n",
342 | "    plt.show()\n",
343 | "\n",
344 | "# coefficient of determination R^2\n",
345 | "def R2(x, y, b):\n",
346 | "    n = len(y)\n",
347 | "    c1 = 0  # residual sum of squares\n",
348 | "    c2 = 0  # total sum of squares\n",
349 | "    ym = np.mean(y)\n",
350 | "    for i in range(0, n):\n",
351 | "        y_pred = b[0] + x[i]*b[1]  # predicted value\n",
352 | "        c1 = c1 + (y[i] - y_pred)**2\n",
353 | "        c2 = c2 + (y[i] - ym)**2\n",
354 | "    r2 = 1 - c1/c2\n",
355 | "    return r2\n",
356 | "\n",
357 | "# estimate the coefficients\n",
358 | "b = estimate_coef(x, y)\n",
359 | "print(\"Coeficientes:\\nb_0 = {} \\nb_1 = {}\".format(b[0], b[1]))\n",
360 | "print('R2:', R2(x, y, b))\n",
361 | "\n",
362 | "\n",
363 | "# plot the linear fit\n",
364 | "plt.plot(x, y, 'bo')\n",
365 | "plt.plot(x, b[0] + b[1]*x, 'r-')\n",
366 | "# save the figure to a file\n",
367 | "plt.savefig('plot.eps')"
368 | ]
369 | }
370 | ],
371 | "metadata": {
372 | "kernelspec": {
373 | "display_name": "Python 3",
374 | "language": "python",
375 | "name": "python3"
376 | },
377 | "language_info": {
378 | "codemirror_mode": {
379 | "name": "ipython",
380 | "version": 3
381 | },
382 | "file_extension": ".py",
383 | "mimetype": "text/x-python",
384 | "name": "python",
385 | "nbconvert_exporter": "python",
386 | "pygments_lexer": "ipython3",
387 | "version": "3.7.4"
388 | }
389 | },
390 | "nbformat": 4,
391 | "nbformat_minor": 2
392 | }
393 |
--------------------------------------------------------------------------------
/Aula4-Classificacao-Naive Bayes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Naive Bayes classifier"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Francisco Aparecido Rodrigues, francisco@icmc.usp.br. \n",
15 | "Universidade de São Paulo, São Carlos, Brasil. \n",
16 | "https://sites.icmc.usp.br/francisco \n",
17 | "Copyright: Creative Commons"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | " "
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "In the Naive Bayes classifier, we can assume that the attributes are normally distributed."
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [
39 | {
40 | "name": "stdout",
41 | "output_type": "stream",
42 | "text": [
43 | "Número de linhas e colunas na matriz de atributos: (150, 5)\n"
44 | ]
45 | },
46 | {
47 | "data": {
159 | "text/plain": [
160 | " sepal_length sepal_width petal_length petal_width species\n",
161 | "0 5.1 3.5 1.4 0.2 setosa\n",
162 | "1 4.9 3.0 1.4 0.2 setosa\n",
163 | "2 4.7 3.2 1.3 0.2 setosa\n",
164 | "3 4.6 3.1 1.5 0.2 setosa\n",
165 | "4 5.0 3.6 1.4 0.2 setosa\n",
166 | "5 5.4 3.9 1.7 0.4 setosa\n",
167 | "6 4.6 3.4 1.4 0.3 setosa\n",
168 | "7 5.0 3.4 1.5 0.2 setosa\n",
169 | "8 4.4 2.9 1.4 0.2 setosa\n",
170 | "9 4.9 3.1 1.5 0.1 setosa"
171 | ]
172 | },
173 | "execution_count": 1,
174 | "metadata": {},
175 | "output_type": "execute_result"
176 | }
177 | ],
178 | "source": [
179 | "import random\n",
180 | "random.seed(42) # define the seed (important to reproduce the results)\n",
181 | "import pandas as pd\n",
182 | "import numpy as np\n",
183 | "import matplotlib.pyplot as plt\n",
184 | "\n",
185 | "#data = pd.read_csv('data/vertebralcolumn-3C.csv', header=(0))\n",
186 | "data = pd.read_csv('data/Iris.csv', header=(0))\n",
187 | "\n",
188 | "data = data.dropna(axis='rows') # remove rows containing NaN\n",
189 | "# store the class names\n",
190 | "classes = np.array(pd.unique(data[data.columns[-1]]), dtype=str) \n",
191 | "\n",
192 | "print(\"Número de linhas e colunas na matriz de atributos:\", data.shape)\n",
193 | "attributes = list(data.columns)\n",
194 | "data.head(10)"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 2,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "data = data.to_numpy()\n",
204 | "nrow,ncol = data.shape\n",
205 | "y = data[:,-1]\n",
206 | "X = data[:,0:ncol-1]"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "Selecting the training and test sets."
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 3,
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "from sklearn.model_selection import train_test_split\n",
223 | "p = 0.7 # fraction of elements in the training set\n",
224 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = p, random_state = 42)"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "### Classification: implementing the method"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "First, we define a function to compute the class-conditional joint probability density. Under the naive assumption that the attributes are conditionally independent given the class, it factorizes as: $$p(\\vec{x}|C_i) = \\prod_{j=1}^d p(x_j|C_i), \\quad i=1,\\ldots, k$$ \n",
239 | "where $C_i$ are the classes. If the distribution is normal, each attribute $X_j$ has the following probability density function for each class:\n",
240 | "$$\n",
241 | "p(x_j|C_i) = \\frac{1}{\\sqrt{2\\pi\\sigma_{j,C_i}^2}}\\exp \\left[ -\\frac{1}{2}\\left( \\frac{x_j-\\mu_{j,C_i}}{\\sigma_{j,C_i}}\\right)^2 \\right], \\quad i=1,2,\\ldots, k,\n",
242 | "$$\n",
243 | "where $\\mu_{j,C_i}$ and $\\sigma_{j,C_i}$ are the mean and standard deviation of attribute $X_j$ within class $C_i$. We then define a function that computes the likelihood."
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 4,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "def likelyhood(y, Z):\n",
253 | "    def gaussian(x, mu, sig):\n",
254 | "        return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.))) / (np.sqrt(2 * np.pi) * sig)\n",
255 | "    prob = 1  # product of the per-attribute densities\n",
256 | "    for j in np.arange(0, Z.shape[1]):\n",
257 | "        m = np.mean(Z[:,j])  # mean of attribute j within the class\n",
258 | "        s = np.std(Z[:,j])  # standard deviation of attribute j within the class\n",
259 | "        prob = prob*gaussian(y[j], m, s)\n",
260 | "    return prob"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {},
266 | "source": [
267 | "Next, we perform the estimation for each class:"
268 | ]
269 | },
270 | {
271 | "cell_type": "code",
272 | "execution_count": 5,
273 | "metadata": {},
274 | "outputs": [],
275 | "source": [
276 | "P = pd.DataFrame(data=np.zeros((X_test.shape[0], len(classes))), columns = classes) \n",
277 | "for i in np.arange(0, len(classes)):\n",
278 | "    elements = np.where(y_train == classes[i])[0]  # indices of the training observations in class i\n",
279 | "    Z = X_train[elements,:]\n",
280 | "    for j in np.arange(0,X_test.shape[0]):\n",
281 | "        x = X_test[j,:]\n",
282 | "        pj = likelyhood(x,Z)  # likelihood p(x|C_i)\n",
283 | "        P.loc[j, classes[i]] = pj*len(elements)/X_train.shape[0]  # likelihood times the prior P(C_i)"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "For each observation in the test set, the product of the likelihood and the prior of each class, which is proportional to the posterior probability of belonging to that class:"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 6,
296 | "metadata": {},
297 | "outputs": [
298 | {
299 | "data": {
389 | "text/plain": [
390 | " setosa versicolor virginica\n",
391 | "0 1.824344e-90 4.440479e-03 4.107993e-05\n",
392 | "1 1.652256e-04 1.196823e-16 4.171552e-23\n",
393 | "2 6.741862e-287 3.100666e-17 2.363765e-05\n",
394 | "3 1.609452e-93 4.042876e-03 2.146958e-04\n",
395 | "4 2.453031e-106 8.057427e-04 3.352704e-04\n",
396 | "5 1.491009e-03 5.335375e-15 1.159063e-22\n",
397 | "6 1.585589e-53 3.230445e-03 2.400042e-07\n",
398 | "7 5.666865e-172 6.868359e-10 3.319537e-03\n",
399 | "8 3.375351e-96 8.399067e-04 1.117962e-05\n",
400 | "9 7.866680e-60 6.780257e-03 6.579053e-07"
401 | ]
402 | },
403 | "execution_count": 6,
404 | "metadata": {},
405 | "output_type": "execute_result"
406 | }
407 | ],
408 | "source": [
409 | "P.head(10)"
410 | ]
411 | },
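{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of the values above are extremely small (of order $10^{-287}$ in one case), and a product of densities over more attributes could underflow to zero. A minimal sketch of the same computation carried out with logarithms is given below. The function `gaussian_log_scores` and the synthetic demo data are hypothetical names and values introduced only for illustration, and the sketch assumes numeric (float) arrays, so the Iris attribute matrix would first need something like `X.astype(float)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def gaussian_log_scores(X_tr, y_tr, X_te):\n",
"    # returns the class labels and, for each test row, log P(C_i) + sum_j log p(x_j|C_i)\n",
"    labels = np.unique(y_tr)\n",
"    scores = np.zeros((X_te.shape[0], len(labels)))\n",
"    for i, c in enumerate(labels):\n",
"        Z = X_tr[y_tr == c]              # training observations in class c\n",
"        mu = Z.mean(axis=0)              # per-attribute mean\n",
"        sig = Z.std(axis=0)              # per-attribute standard deviation\n",
"        log_prior = np.log(Z.shape[0] / X_tr.shape[0])\n",
"        log_lik = -0.5 * np.log(2 * np.pi * sig**2) - (X_te - mu)**2 / (2 * sig**2)\n",
"        scores[:, i] = log_prior + log_lik.sum(axis=1)\n",
"    return labels, scores\n",
"\n",
"# synthetic demo data (assumption): two well-separated Gaussian classes\n",
"rng = np.random.default_rng(0)\n",
"X_demo = np.vstack((rng.normal(0.0, 1.0, size=(50, 4)),\n",
"                    rng.normal(10.0, 1.0, size=(50, 4))))\n",
"y_demo = np.array(['a'] * 50 + ['b'] * 50)\n",
"\n",
"demo_labels, demo_scores = gaussian_log_scores(X_demo, y_demo, X_demo)\n",
"demo_pred = demo_labels[np.argmax(demo_scores, axis=1)]\n",
"print('Accuracy on the demo data:', np.mean(demo_pred == y_demo))"
]
},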
412 | {
413 | "cell_type": "code",
414 | "execution_count": 7,
415 | "metadata": {},
416 | "outputs": [
417 | {
418 | "name": "stdout",
419 | "output_type": "stream",
420 | "text": [
421 | "Accuracy: 0.9555555555555556\n"
422 | ]
423 | }
424 | ],
425 | "source": [
426 | "from sklearn.metrics import accuracy_score\n",
427 | "\n",
428 | "y_pred = []\n",
429 | "for i in np.arange(0, P.shape[0]):\n",
430 | " c = np.argmax(np.array(P.iloc[[i]]))\n",
431 | " y_pred.append(P.columns[c])\n",
432 | "y_pred = np.array(y_pred, dtype=str)\n",
433 | "\n",
434 | "score = accuracy_score(y_pred, y_test)\n",
435 | "print('Accuracy:', score)"
436 | ]
437 | },
438 | {
439 | "cell_type": "markdown",
440 | "metadata": {},
441 | "source": [
442 | "### Classification: using the scikit-learn library"
443 | ]
444 | },
445 | {
446 | "cell_type": "markdown",
447 | "metadata": {},
448 | "source": [
449 | "We can perform the classification using the `GaussianNB` class available in the scikit-learn library."
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": 8,
455 | "metadata": {},
456 | "outputs": [
457 | {
458 | "name": "stdout",
459 | "output_type": "stream",
460 | "text": [
461 | "Accuracy: 0.9777777777777777\n"
462 | ]
463 | }
464 | ],
465 | "source": [
466 | "from sklearn.naive_bayes import GaussianNB\n",
467 | "from sklearn import metrics\n",
468 | "\n",
469 | "model = GaussianNB()\n",
470 | "model.fit(X_train, y_train)\n",
471 | "\n",
472 | "y_pred = model.predict(X_test)\n",
473 | "score = accuracy_score(y_pred, y_test)\n",
474 | "print('Accuracy:', score)"
475 | ]
476 | },
477 | {
478 | "cell_type": "markdown",
479 | "metadata": {},
480 | "source": [
481 | "Another way to perform the classification is to assume that the attributes follow a distribution other than the normal. "
482 | ]
483 | },
484 | {
485 | "cell_type": "markdown",
486 | "metadata": {},
487 | "source": [
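488 | "One possibility is to assume that the data follow a Bernoulli distribution. Note that `BernoulliNB` is designed for binary attributes: by default it binarizes each attribute at the threshold 0.0 (`binarize=0.0`), so every positive Iris measurement is mapped to 1 and the attributes carry no information about the class, which explains the low accuracy obtained here. "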
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": 9,
494 | "metadata": {},
495 | "outputs": [
496 | {
497 | "name": "stdout",
498 | "output_type": "stream",
499 | "text": [
500 | "Accuracy: 0.28888888888888886\n"
501 | ]
502 | }
503 | ],
504 | "source": [
505 | "from sklearn.naive_bayes import BernoulliNB\n",
506 | "\n",
507 | "model = BernoulliNB()\n",
508 | "model.fit(X_train, y_train)\n",
509 | "\n",
510 | "y_pred = model.predict(X_test)\n",
511 | "score = accuracy_score(y_pred, y_test)\n",
512 | "print('Accuracy:', score)"
513 | ]
514 | },
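{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the effect of the binarization threshold, the sketch below compares the default `BernoulliNB` with a variant in which each attribute is binarized at its training-set median. The median threshold is an assumption made here for illustration, not a recommendation from the original text. The example loads Iris through `sklearn.datasets.load_iris` so that it is self-contained, and uses its own variable names to avoid overwriting the ones defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import BernoulliNB\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"X_iris, y_iris = load_iris(return_X_y=True)\n",
"Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, train_size=0.7, random_state=42)\n",
"\n",
"# default behaviour: binarize=0.0 maps every positive measurement to 1\n",
"acc_default = accuracy_score(yte, BernoulliNB().fit(Xtr, ytr).predict(Xte))\n",
"\n",
"# assumption: binarize each attribute at its training-set median instead\n",
"medians = np.median(Xtr, axis=0)\n",
"Xtr_bin = (Xtr > medians).astype(int)\n",
"Xte_bin = (Xte > medians).astype(int)\n",
"# binarize=None tells BernoulliNB that the input is already binary\n",
"model_bin = BernoulliNB(binarize=None).fit(Xtr_bin, ytr)\n",
"acc_median = accuracy_score(yte, model_bin.predict(Xte_bin))\n",
"\n",
"print('Accuracy (default threshold):', acc_default)\n",
"print('Accuracy (median threshold):', acc_median)"
]
},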
515 | {
516 | "cell_type": "markdown",
517 | "metadata": {},
518 | "source": [
519 | "Complete code."
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": 10,
525 | "metadata": {},
526 | "outputs": [
527 | {
528 | "name": "stdout",
529 | "output_type": "stream",
530 | "text": [
531 | "Acuracia: 1.0\n"
532 | ]
533 | }
534 | ],
535 | "source": [
536 | "import random\n",
537 | "import pandas as pd\n",
538 | "import numpy as np\n",
539 | "import matplotlib.pyplot as plt\n",
540 | "from sklearn.model_selection import train_test_split\n",
541 | "from sklearn.preprocessing import StandardScaler\n",
542 | "from sklearn.naive_bayes import GaussianNB\n",
543 | "from sklearn.metrics import accuracy_score\n",
544 | "\n",
545 | "random.seed(42) \n",
546 | "\n",
547 | "data = pd.read_csv('data/Iris.csv', header=(0))\n",
548 | "\n",
549 | "classes = np.array(pd.unique(data[data.columns[-1]]), dtype=str) \n",
550 | "\n",
551 | "# Convert to a NumPy matrix and vector\n",
552 | "data = data.to_numpy()\n",
553 | "nrow,ncol = data.shape\n",
554 | "y = data[:,-1]\n",
555 | "X = data[:,0:ncol-1]\n",
556 | "\n",
557 | "# Standardize the data to zero mean and unit variance\n",
558 | "#scaler = StandardScaler().fit(X)\n",
559 | "#X = scaler.transform(X)\n",
560 | "\n",
561 | "# Select the training and test sets\n",
562 | "p = 0.8 # fraction of elements in the training set\n",
563 | "X_train, X_test, y_train, y_test = train_test_split(X, y, \n",
564 | " train_size = p, random_state = 42)\n",
565 | "\n",
566 | "# fit the Naive Bayes classifier to the data\n",
567 | "model = GaussianNB()\n",
568 | "model.fit(X_train, y_train)\n",
569 | "# make the predictions\n",
570 | "y_pred = model.predict(X_test)\n",
571 | "# compute the accuracy\n",
572 | "score = accuracy_score(y_pred, y_test)\n",
573 | "print('Acuracia:', score)"
574 | ]
575 | },
576 | {
577 | "cell_type": "markdown",
578 | "metadata": {},
579 | "source": [
580 | "## Decision regions"
581 | ]
582 | },
583 | {
584 | "cell_type": "markdown",
585 | "metadata": {},
586 | "source": [
587 | "By selecting two attributes, we can visualize the decision regions. To plot them, we need to install the mlxtend library: http://rasbt.github.io/mlxtend/installation/ \n",
588 | "For example: conda install -c conda-forge mlxtend"
589 | ]
590 | },
591 | {
592 | "cell_type": "markdown",
593 | "metadata": {},
594 | "source": [
595 | "For the Naive Bayes classifier:"
596 | ]
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": 11,
601 | "metadata": {},
602 | "outputs": [],
603 | "source": [
604 | "from mlxtend.plotting import plot_decision_regions\n",
605 | "import numpy as np\n",
606 | "import matplotlib.pyplot as plt\n",
607 | "from sklearn import datasets\n",
608 | "from sklearn.neighbors import KNeighborsClassifier\n",
609 | "import sklearn.datasets as skdata\n",
610 | "from matplotlib import pyplot\n",
611 | "from pandas import DataFrame\n",
612 | "\n",
613 | "# Generate the data in two dimensions\n",
614 | "n_samples = 100 # number of observations\n",
615 | "# centers of the groups\n",
616 | "centers = [(-4, 0), (0, 0), (3, 3)]\n",
617 | "X, y = skdata.make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0, centers=centers, \n",
618 | " shuffle=False, random_state=42)\n",
619 | "\n",
620 | "# build the data matrix (attributes and class labels)\n",
621 | "d = np.column_stack((X,np.transpose(y)))\n",
622 | "# convert to a pandas DataFrame\n",
623 | "data = DataFrame(data = d, columns=['X1', 'X2', 'y'])\n",
624 | "features_names = ['X1', 'X2']\n",
625 | "class_labels = np.unique(y)"
626 | ]
627 | },
628 | {
629 | "cell_type": "code",
630 | "execution_count": 12,
631 | "metadata": {},
632 | "outputs": [
633 | {
634 | "data": {
635 | "image/png": "\n",
636 | "text/plain": [
637 | ""
638 | ]
639 | },
640 | "metadata": {
641 | "needs_background": "light"
642 | },
643 | "output_type": "display_data"
644 | },
645 | {
646 | "data": {
647 | "image/png": "\n",
648 | "text/plain": [
649 | ""
650 | ]
651 | },
652 | "metadata": {
653 | "needs_background": "light"
654 | },
655 | "output_type": "display_data"
656 | }
657 | ],
658 | "source": [
659 | "from mlxtend.plotting import plot_decision_regions\n",
660 | "import matplotlib.pyplot as plt\n",
661 | "from sklearn import datasets\n",
662 | "from sklearn.naive_bayes import GaussianNB\n",
663 | "\n",
664 | "\n",
665 | "# plot the data, colored according to the classes\n",
666 | "colors = ['red', 'blue', 'green', 'black']\n",
667 | "aux = 0\n",
668 | "for c in class_labels:\n",
669 | " ind = np.where(y == c)\n",
670 | " plt.scatter(X[ind,0][0], X[ind,1][0], color = colors[aux], label = c)\n",
671 | " aux = aux + 1\n",
672 | "plt.legend()\n",
673 | "plt.show()\n",
674 | "\n",
675 | "# Training a classifier\n",
676 | "model = GaussianNB()\n",
677 | "model.fit(X, y)\n",
678 | "\n",
679 | "# Plotting decision regions\n",
680 | "plot_decision_regions(X, y, clf=model, legend=2)\n",
681 | "\n",
682 | "plt.xlabel('X1')\n",
683 | "plt.ylabel('X2')\n",
684 | "plt.title('Decision Regions')\n",
685 | "plt.show()"
686 | ]
687 | },
688 | {
689 | "cell_type": "markdown",
690 | "metadata": {},
691 | "source": [
692 | "### Practice exercises"
693 | ]
694 | },
695 | {
696 | "cell_type": "markdown",
697 | "metadata": {},
698 | "source": [
699 | "1 - Repeat all the steps above for the BreastCancer dataset."
700 | ]
701 | },
702 | {
703 | "cell_type": "markdown",
704 | "metadata": {},
705 | "source": [
706 | "2 - Consider the vertebralcolumn-3C dataset and compare the classifiers: Naive Bayes, the parametric Bayesian classifier, and the non-parametric Bayesian classifier."
707 | ]
708 | },
709 | {
710 | "cell_type": "markdown",
711 | "metadata": {},
712 | "source": [
713 | "3 - Using the Vehicle dataset, project the data onto two dimensions with PCA and show the decision regions as done above."
714 | ]
715 | },
716 | {
717 | "cell_type": "markdown",
718 | "metadata": {},
719 | "source": [
720 | "4 - Classify the artificially generated data produced by the code below. Compare the results for the Naive Bayes, parametric Bayesian, and non-parametric Bayesian classifiers."
721 | ]
722 | },
723 | {
724 | "cell_type": "code",
725 | "execution_count": 13,
726 | "metadata": {},
727 | "outputs": [
728 | {
729 | "data": {
730 | "image/png": "\n",
731 | "text/plain": [
732 | ""
733 | ]
734 | },
735 | "metadata": {
736 | "needs_background": "light"
737 | },
738 | "output_type": "display_data"
739 | }
740 | ],
741 | "source": [
742 | "from sklearn import datasets\n",
743 | "plt.figure(figsize=(6,4))\n",
744 | "\n",
745 | "n_samples = 1000\n",
746 | "\n",
747 | "data = datasets.make_moons(n_samples=n_samples, noise=.05)\n",
748 | "X = data[0]\n",
749 | "y = data[1]\n",
750 | "plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis', s=50, alpha=0.7)\n",
751 | "plt.show()"
752 | ]
753 | },
754 | {
755 | "cell_type": "markdown",
756 | "metadata": {},
757 | "source": [
758 | "5 - Find the decision regions for the data of the previous exercise using the Naive Bayes method."
759 | ]
760 | },
761 | {
762 | "cell_type": "markdown",
763 | "metadata": {},
764 | "source": [
765 | "6 - (Optional) Choose other probability distributions and implement a general Naive Bayes algorithm. That is, the algorithm performs the classification using several distributions, keeps the best result, and also reports which distribution is the most suitable."
766 | ]
767 | },
768 | {
769 | "cell_type": "markdown",
770 | "metadata": {},
771 | "source": [
772 | "7 - (Food for thought) Is it possible to implement a heterogeneous Naive Bayes, that is, one with a different distribution for each attribute?"
773 | ]
774 | },
775 | {
776 | "cell_type": "markdown",
777 | "metadata": {},
778 | "source": [
779 | "8 - (Challenge) Generate data with different levels of correlation between the variables and check whether the performance of the algorithm changes with the correlation."
780 | ]
781 | },
782 | {
783 | "cell_type": "markdown",
784 | "metadata": {},
785 | "source": [
786 | "## Complete code"
787 | ]
788 | },
789 | {
790 | "cell_type": "code",
791 | "execution_count": 14,
792 | "metadata": {},
793 | "outputs": [
794 | {
795 | "name": "stdout",
796 | "output_type": "stream",
797 | "text": [
798 | "Accuracy: 1.0\n"
799 | ]
800 | }
801 | ],
802 | "source": [
803 | "import random\n",
804 | "import pandas as pd\n",
805 | "import numpy as np\n",
806 | "from sklearn.model_selection import train_test_split\n",
807 | "from sklearn.metrics import accuracy_score\n",
808 | "\n",
809 | "random.seed(42) \n",
810 | "data = pd.read_csv('data/Iris.csv', header=(0))\n",
811 | "# classes: setosa, virginica and versicolor\n",
812 | "classes = pd.unique(data[data.columns[-1]])\n",
813 | "classes = np.array(classes, dtype=str) \n",
814 | "# convert to NumPy arrays\n",
815 | "data = data.to_numpy()\n",
816 | "nrow,ncol = data.shape\n",
817 | "y = data[:,-1]\n",
818 | "X = data[:,0:ncol-1]\n",
819 | "# Select the training and test sets\n",
820 | "p = 0.7 \n",
821 | "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = p)\n",
822 | "\n",
823 | "# function to compute the likelihood\n",
824 | "def likelyhood(y, Z):\n",
825 | "    def gaussian(x, mu, sig):\n",
826 | "        return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.))) / (np.sqrt(2 * np.pi) * sig)\n",
827 | "    prob = 1\n",
828 | "    for j in np.arange(0, Z.shape[1]):\n",
829 | "        m = np.mean(Z[:,j])\n",
830 | "        s = np.std(Z[:,j]) \n",
831 | "        prob = prob*gaussian(y[j], m, s)\n",
832 | "    return prob\n",
833 | "\n",
834 | "# matrix that stores the product of the likelihood and the prior\n",
835 | "P = pd.DataFrame(data=np.zeros((X_test.shape[0], len(classes))), columns = classes) \n",
836 | "for i in np.arange(0, len(classes)):\n",
837 | "    elements = np.where(y_train == classes[i])[0]\n",
838 | "    Z = X_train[elements,:]\n",
839 | "    for j in np.arange(0,X_test.shape[0]):\n",
840 | "        x = X_test[j,:]\n",
841 | "        pj = likelyhood(x,Z) # likelihood\n",
842 | "        pc = len(elements)/X_train.shape[0] # prior\n",
843 | "        P.loc[j, classes[i]] = pj*pc\n",
844 | " \n",
845 | "# classify according to Bayes' rule\n",
846 | "y_pred = []\n",
847 | "for i in np.arange(0, P.shape[0]):\n",
848 | " c = np.argmax(np.array(P.iloc[[i]]))\n",
849 | " y_pred.append(P.columns[c])\n",
850 | "y_pred = np.array(y_pred, dtype=str)\n",
851 | "# compute the classification accuracy\n",
852 | "score = accuracy_score(y_pred, y_test)\n",
853 | "print('Accuracy:', score)"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": null,
859 | "metadata": {},
860 | "outputs": [],
861 | "source": []
862 | }
863 | ],
864 | "metadata": {
865 | "kernelspec": {
866 | "display_name": "Python 3",
867 | "language": "python",
868 | "name": "python3"
869 | },
870 | "language_info": {
871 | "codemirror_mode": {
872 | "name": "ipython",
873 | "version": 3
874 | },
875 | "file_extension": ".py",
876 | "mimetype": "text/x-python",
877 | "name": "python",
878 | "nbconvert_exporter": "python",
879 | "pygments_lexer": "ipython3",
880 | "version": "3.7.4"
881 | }
882 | },
883 | "nbformat": 4,
884 | "nbformat_minor": 2
885 | }
886 |
--------------------------------------------------------------------------------