├── README.md
├── Chapter_03
    └── Chapter_03.Rmd
├── Chapter_01
    └── Chapter_01.Rmd
└── Chapter_02
    └── Chapter_02.Rmd

/README.md:
--------------------------------------------------------------------------------
1 | # Introduction to Data Mining: Solutions
2 | Author: Paige Bailey
3 |
4 | ------
5 |
6 | ## Chapter 1: Introduction
7 | * What is data mining?
8 | * Motivating Challenges
9 | * The Origins of Data Mining
10 | * Data Mining Tasks
11 |
12 | ## Chapter 2: Data
13 | * Types of Data
14 | * Data Quality
15 | * Data Preprocessing
16 | * Measures of Similarity and Dissimilarity
17 |
18 | ## Chapter 3: Exploring Data
19 | * Summary Statistics
20 | * Visualization
21 | * OLAP and Multidimensional Data Analysis
22 |
23 | ## Chapter 4: Classification - Basic Concepts, Decision Trees, and Model Evaluation
24 | * General Approach to Solving a Classification Problem
25 | * Decision Tree Induction
26 | * Model Overfitting
27 | * Evaluating the Performance of a Classifier
28 | * Methods for Comparing Classifiers
29 |
30 | ## Chapter 5: Classification - Alternative Techniques
31 | * Rule-Based Classifier
32 | * Nearest-Neighbor Classifiers
33 | * Bayesian Classifiers
34 | * Artificial Neural Network (ANN)
35 | * Support Vector Machine (SVM)
36 | * Ensemble Methods
37 | * Class Imbalance Problems
38 | * Multiclass Problems
39 |
40 | ## Chapter 6: Association Analysis - Basic Concepts and Algorithms
41 | * Frequent Itemset Generation
42 | * Rule Generation
43 | * Compact Representation of Frequent Itemsets
44 | * Alternative Methods for Generating Frequent Itemsets
45 | * FP-Growth Algorithm
46 | * Evaluation of Association Patterns
47 | * Effect of Skewed Support Distribution
48 |
49 | ## Chapter 7: Association Analysis - Advanced Concepts
50 | * Handling Categorical Attributes
51 | * Handling Continuous Attributes
52 | * Handling a Concept Hierarchy
53 | * Sequential Patterns
54 | * Subgraph Patterns
55 | * Infrequent Patterns
56 |
57 | ## Chapter 8: Cluster Analysis - Basic Concepts and Algorithms
58 | * K-Means
59 | * Agglomerative Hierarchical Clustering
60 | * DBSCAN
61 | * Cluster Evaluation
62 |
63 | ## Chapter 9: Cluster Analysis - Additional Issues and Algorithms
64 | * Characteristics of Data, Clusters, and Clustering Algorithms
65 | * Prototype-Based Clustering
66 | * Density-Based Clustering
67 | * Graph-Based Clustering
68 | * Scalable Clustering Algorithms
69 | * Which Clustering Algorithm?
70 |
71 | ## Chapter 10: Anomaly Detection
72 | * Statistical Approaches
73 | * Proximity-Based Outlier Detection
74 | * Density-Based Outlier Detection
75 | * Clustering-Based Techniques
76 |
--------------------------------------------------------------------------------
/Chapter_03/Chapter_03.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Chapter 3: Exploring Data'
3 | author: "Paige Bailey"
4 | date: "April 8, 2017"
5 | output:
6 |   html_document:
7 |     theme: "sandstone"
8 |     toc: true
9 |     toc_float: true
10 | ---
11 |
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = TRUE)
14 | ```
15 |
16 | -----
17 |
18 | ## Question 1
19 | Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible.
20 |
21 | -----
22 |
23 | ## Question 2
24 | Identify at least two advantages and two disadvantages of using color to visually represent information.
25 |
26 | -----
27 |
28 | ## Question 3
29 | What are the arrangement issues that arise with respect to three-dimensional plots?
30 |
31 | -----
32 |
33 | ## Question 4
34 | Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
35 |
36 | -----
37 |
38 | ## Question 5
39 | Describe how you would create visualizations to display information that describes the following types of systems. Be sure to address the following issues:
40 |
41 | * **Representation**. How will you map objects, attributes, and relationships to visual elements?
42 | * **Arrangement**. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.
43 | * **Selection**. How will you handle a large number of attributes and data objects?
44 |
45 | ------
46 |
47 | ## Question 6
48 | Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.
49 |
50 | ------
51 |
52 | ## Question 7
53 | How might you address the problem that a histogram depends on the number and location of the bins?
54 |
55 | ------
56 |
57 | ## Question 8
58 | Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?
59 |
60 | -----
61 |
62 | ## Question 9
63 | Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.
64 |
65 | -----
66 |
67 | ## Question 10
68 | Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
69 |
70 | -----
71 |
72 | ## Question 11
73 | Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.
74 |
75 | ------
76 |
77 | ## Question 12
78 | Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.
79 |
80 | -----
81 |
82 | ## Question 13
83 | Simple line plots, such as that displayed in Figure 2.12, which shows two time series, can be used to effectively display high-dimensional data. For example, it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?
84 |
85 | -----
86 |
87 | ## Question 14
88 | Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.
89 |
90 | -----
91 |
92 | ## Question 15
93 | How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?
94 |
95 | ------
96 |
97 | ## Question 16
98 | Construct a data cube from Table 3.1. Is this a dense or a sparse data cube? If it is sparse, identify the cells that are empty.
99 |
100 | -----
101 |
102 | ## Question 17
103 | Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.
--------------------------------------------------------------------------------
/Chapter_01/Chapter_01.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Chapter 1: Introduction'
3 | author: "Paige Bailey"
4 | date: "April 8, 2017"
5 | output:
6 |   html_document:
7 |     theme: "sandstone"
8 |     toc: true
9 |     toc_float: true
10 | ---
11 |
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = TRUE)
14 | ```
15 |
16 | -----
17 |
18 | ## Question 1
19 | _Discuss whether or not each of the following activities is a data mining task._
20 |
21 | * Dividing the customers of a company according to their gender.
22 | * Dividing the customers of a company according to their profitability.
23 | * Computing the total sales of a company.
24 | * Sorting a student database based on student identification numbers.
25 | * Predicting the outcome of tossing a (fair) pair of dice.
26 | * Predicting the future stock price of a company using historical records.
27 | * Monitoring the heart rate of a patient for abnormalities.
28 | * Monitoring seismic waves for earthquake activities.
29 | * Extracting the frequencies of a sound wave.
30 |
31 | ##### **Answers**:
32 |
33 | A) **No**, this would not be a data mining task, just a query to the company's database.
34 | B) **No**, this would be a sum of a customer's purchases, and then the application of thresholds.
35 | C) **No**, this would just be the sum of the company's sales.
36 | D) **No**, this task would be a sort on an ID column in the student database.
37 | E) **No**. Since the dice are fair, this would be considered a probability calculation.
38 | F) **Yes**! This would be a data mining task, since we would be building a predictive model for future values of the stock price.
39 | G) **Yes**, this would be considered a data mining task because we would have to build a model for normal heart behavior and create an alert when unusual heart-related behavior occurs.
40 | H) **Yes**. Similar to the heart rate example above, we would need to build a model for seismic wave behavior associated with earthquake activity, then send an alert when those conditions are observed.
41 | I) **Nope**, this would be _digital signal processing_. You might be interested in taking a gander at [Signals and Systems](http://signalsand.systems).
42 |
43 | -----
44 |
45 | ## Question 2
46 | Explore various online applications for data mining. Categorize what are the three applications where predictive tasks are required. Also try to identify the applications in a daily routine where descriptive tasks are required.
47 |
48 | ##### **Answer**:
49 |
50 | Three applications for _predictive analytics_ would be:
51 |
52 | * **Cross-selling**: Provided you are a corporation that collects and maintains customer records, you could analyze customers' spending, usage, and other behavior to sell additional products to your current client base. Amazon does this on the regs.
53 | * **Customer retention**: Predicting (and preventing!) customer churn.
54 | * **Fraud detection**: Similar to the examples listed in _Question 1_, your company could build a model for characteristics of fraud and send an alert when those characteristics are observed in customer transactions.
55 |
56 | Descriptive tasks such as classification could assign news articles into pre-defined categories, like "Entertainment", "Sports", "Politics", "Foreign Affairs", etc. Descriptive analytics can also be used to classify products, and incorporated into marketing strategies for your customer base.
57 |
58 | -----
59 |
60 | ## Question 3
61 | For each of the following data sets, explain whether or not data privacy is an important issue.
62 | * Census data collected from 1900 - 1950.
63 | * IP addresses and visit times of web users who visit your website.
64 | * Images from earth-orbiting satellites.
65 | * Names and addresses of people from the telephone book.
66 | * Names and email addresses collected from the internet.
67 |
68 | ##### **Answer**:
69 |
70 | 1) **No**. Census data is freely available [here](https://www.census.gov/data.html), and does not contain sensitive personal information.
71 | 2) **Yes**! IP addresses are highly sensitive information, and should always be protected.
72 | 3) **Nope**. NASA makes these images open, and you can find some of 'em [here](https://worldview.earthdata.nasa.gov/).
73 | 4) **No**. If they were sensitive, they wouldn't be in a public phone book!
74 | 5) **Nope** - same story as #4.
--------------------------------------------------------------------------------
/Chapter_02/Chapter_02.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Chapter 2: Data'
3 | author: "Paige Bailey"
4 | date: "April 8, 2017"
5 | output:
6 |   html_document:
7 |     theme: "sandstone"
8 |     toc: true
9 |     toc_float: true
10 | ---
11 |
12 | ```{r setup, include=FALSE}
13 | knitr::opts_chunk$set(echo = TRUE)
14 | ```
15 |
16 | -----
17 |
18 | ## Question 1
19 | In the initial example of Chapter 2, the statistician says "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
20 |
21 | Field 1 | Field 2 | Field 3 | Field 4 | Field 5
22 | --- | --- | --- | --- | ---
23 | 012 | 232 | 33.5 | 0 | 10.7
24 | 020 | 121 | 16.9 | 2 | 210.1
25 | 027 | 165 | 24.0 | 0 | 427.6
26 |
27 | ##### **Answer:**
28 |
29 | Upon inspection of the table above, we can see that the second column (containing the values 232, 121, and 165) divided by $\approx$ 7 would create the values in the third column (33.5, 16.9, 24.0).
30 |
31 | ``` {r}
32 | field_2 <- c(232, 121, 165)
33 | field_3 <- c(33.5, 16.9, 24.0)
34 | field_2 / field_3
35 | ```
36 |
37 | -----
38 |
39 | ## Question 2
40 | Classify the following attributes as _binary_, _discrete_, or _continuous_. Also classify them as qualitative (**nominal** or **ordinal**) or quantitative (**interval** or **ratio**). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
41 |
42 | A) Time in terms of AM or PM.
43 | B) Brightness, as measured by a light meter.
44 | C) Brightness, as measured by people's judgments.
45 | D) Angles as measured in degrees between 0 and 360.
46 | E) Bronze, silver, and gold medals as awarded at the Olympics.
47 | F) Height above sea level.
48 | G) Number of patients in a hospital.
49 | H) ISBN numbers for books.
50 | I) Ability to pass light in terms of the following values: opaque, translucent, transparent.
51 | J) Military rank.
52 | K) Distance from the center of campus.
53 | L) Density of a substance in grams per cubic centimeter.
54 | M) Coat check number.
55 |
56 | ##### **Answers:**
57 |
58 | * Binary, qualitative, ordinal.
59 | * Continuous, quantitative, ratio.
60 | * Discrete, qualitative, ordinal.
61 | * Continuous, quantitative, ratio.
62 | * Discrete, qualitative, ordinal.
63 | * Continuous, quantitative, interval.
64 | * Discrete, quantitative, ratio.
65 | * Discrete, qualitative, nominal.
66 | * Discrete, qualitative, ordinal.
67 | * Discrete, qualitative, ordinal.
68 | * Continuous, quantitative, interval.
69 | * Continuous, quantitative, ratio.
70 | * Discrete, qualitative, nominal.
71 |
72 | -----
73 |
74 | ## Question 3
75 | You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows:
76 |
77 | > "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"
78 |
79 | A) **Who is right, the marketing director or his boss? If you answered 'his boss', what would you do to fix the measure of satisfaction?**
80 |
81 | * The boss is correct - the marketing director is not taking into account the number of sales of each product. If I have sold 1,000 widgets and 20 whangdoodles, I will almost certainly have a higher number of complaints for the former. It would be better to gauge customer satisfaction using the complaint rate per sale, where a lower value indicates higher satisfaction:
82 |
83 | $$ Satisfaction_{product} = \frac{n_{complaints}}{n_{sales}} $$
84 |
85 | B) **What can you say about the attribute type of the original product satisfaction attribute?**
86 |
87 | You can't really say anything about the _attribute type_ of the original measure based on the product satisfaction (see _Example 2_). For example, two products might have the same level of satisfaction from customers, but different numbers of complaints or sales.
88 |
89 | -----
90 |
91 | ## Question 4
92 | A few months later, you are again approached by the same marketing director as in Question 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains,
93 |
94 | > "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use the comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"
95 |
96 | A) **Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.**
97 |
98 | B) **Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?**
99 |
100 | C) **For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?**
101 |
102 | -----
103 |
104 | ## Question 5
105 | Can you think of a situation in which identification numbers would be useful for prediction?
106 |
107 | Phone numbers could provide a good prediction of owner location (provided that country codes and area codes were included); and student identification numbers could be good predictors of matriculation year, or graduation year.
108 |
109 | -----
110 |
111 | ## Question 6
112 | An educational psychologist wants to use _association analysis_ to analyze test results. The test consists of 100 questions with four possible answers each.
113 |
114 | A) **How would you convert this data into a form suitable for association analysis?**
115 |
116 | B) **In particular, what type of attributes would you have and how many of them are there?**
117 |
118 | -----
119 |
120 | ## Question 7
121 | Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
122 |
123 | -----
124 |
125 | ## Question 8
126 | Discuss why a document-term matrix is an example of a data set that has _asymmetric discrete_ or _asymmetric continuous_ features.
127 |
128 | -----
129 |
130 | ## Question 9
131 | Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.
132 |
133 | -----
134 |
135 | ## Question 10
136 | Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.
137 |
138 | -----
139 |
140 | ## Question 11
141 | Give at least two advantages to working with data stored in text files instead of a binary format.
142 |
143 | -----
144 |
145 | ## Question 12
146 | Distinguish between _noise_ and _outliers_. Be sure to consider the following questions:
147 |
148 | A) Is noise ever interesting or desirable? Outliers?
149 | B) Can noise objects be outliers?
150 | C) Are noise objects always outliers?
151 | D) Are outliers always noise objects?
152 | E) Can noise make a typical value into an unusual one, or vice versa?
153 |
154 | -----
155 |
156 | ## Question 13
157 | Consider the problem of finding the $K$ nearest neighbors of a data object. A programmer designs Algorithm 2.1 for this task.
158 |
159 | ##### Algorithm 2.1: Algorithm for finding K nearest neighbors
160 | 1) **for** $i$ = 1 to _number of data objects_ **do**
161 | 2) Find the distances of the $i^{th}$ object to all other objects.
162 | 3) Sort these distances in decreasing order.
163 | 4) **return** the objects associated with the first $K$ distances of the sorted list
164 | 5) **end for**
165 |
166 | A) **Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.**
167 |
168 | B) **How would you fix this problem?**
169 |
170 | -----
171 |
172 | ## Question 14
173 | The following attributes are measured for members of a herd of Asian elephants: _weight, height, tusk length, trunk length,_ and _ear area_. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
174 |
175 | -----
176 |
177 | ## Question 15
178 | You are given a set of $m$ objects that is divided into $K$ groups, where the $i^{th}$ group is size $m_{i}$. If the goal is to obtain a sample of size $n < m$, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)
179 |
180 | A) We randomly select $n*m_{i}/m$ elements from each group.
181 | B) We randomly select $n$ elements from the data set, without regard for the group to which an object belongs.
182 |
183 | -----
184 |
185 | ## Question 16
186 | Consider a document-term matrix, where $tf_{ij}$ is the frequency of the $i^{th}$ word (term) in the $j^{th}$ document and $m$ is the number of documents. Consider the variable transformation that is defined by:
187 |
188 | $$ tf'_{ij} = tf_{ij} * \log\frac{m}{df_{i}}, $$
189 | where $df_{i}$ is the number of documents in which the $i^{th}$ term appears and is known as the **document frequency** of the term. This transformation is known as the **inverse document frequency** transformation.
190 |
191 | A) What is the effect of this transformation if a term occurs in one document? In every document?
192 |
193 | B) What might be the purpose of this transformation?
194 |
195 | -----
196 |
197 | ## Question 17
198 | Assume that we apply a square root transformation to a ratio attribute $x$ to obtain the new attribute $x^{*}$. As part of your analysis, you identify an interval $(a,b)$ in which $x^{*}$ has a linear relationship to another attribute $y$.
199 |
200 | A) What is the corresponding interval $(a,b)$ in terms of $x$?
201 | B) Give an equation that relates $y$ to $x$. In this interval, $y = x^{2}$.
202 |
203 | -----
204 |
205 | ## Question 18
206 | This exercise compares and contrasts some similarity and distance measures.
207 |
208 | A) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.
209 |
210 | x = 0101010001
211 | y = 0100011000
212 |
213 | B) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain.
214 |
215 | C) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe what measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain.
216 |
217 | D) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain.
218 |
219 | -----
220 |
221 | ## Question 19
222 | For the following vectors, **x** and **y**, calculate the indicated similarity or distance measures.
223 |
224 | A)
225 |
226 | -----
227 |
228 |
--------------------------------------------------------------------------------
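For part A of Question 18 in Chapter 2, the two measures can be checked with a few lines of base R. This is a minimal sketch, not the author's solution: the variable names (`hamming`, `f11`, `not_both_zero`, `jaccard`) are illustrative, and it assumes the usual definition of Jaccard similarity as the number of 1-1 matches divided by the number of positions where at least one vector has a 1.

```r
# Binary vectors from Question 18, part A, entered digit by digit.
x <- c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1)
y <- c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0)

# Hamming distance: number of positions where the two vectors differ.
hamming <- sum(x != y)

# Jaccard similarity: 1-1 matches divided by the number of positions
# where at least one of the two vectors has a 1 (0-0 matches are ignored).
f11 <- sum(x == 1 & y == 1)
not_both_zero <- sum(x == 1 | y == 1)
jaccard <- f11 / not_both_zero

hamming
jaccard
```

Because Jaccard ignores the 0-0 positions, it behaves differently from a simple count of mismatches when the vectors are mostly zeros, which is the contrast parts B through D of the question ask about.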