├── README.md
├── device failure prediction-imbalanced-undersampling.ipynb
├── device failure prediction.ipynb
└── failures_prediction.csv


/README.md:
--------------------------------------------------------------------------------
 1 | # Device-failure-prediction
 2 | Liling Lu
 3 | April 20, 2019
 4 | ## Overview
 5 | * EDA and data engineering were carried out on the given CSV data set, and several high-performing models were built at the end.
 6 | * Challenging parts: (1) dealing with an imbalanced dataset; (2) exploring the 'date' feature; (3) some devices are removed and then put back into use at different time periods; (4) deciding whether oversampling should be done before or within cross-validation.
 7 | ## Project background
 8 | * The company has a fleet of devices transmitting daily aggregated telemetry attributes. Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to predict when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
 9 | ## Exploratory Data Analysis
10 | 
11 | ![WeChat18b25b902e804c5765dc34bf1a476bd8](https://user-images.githubusercontent.com/40584525/56835120-2ded0780-6828-11e9-9a7a-1cb0aba4d1b3.png)
12 | ![WeChatf462cae8f8f9172e5f1b5de73c87be61](https://user-images.githubusercontent.com/40584525/56835125-30e7f800-6828-11e9-938c-471d42a15a4c.png)
13 | * This dataset has 124494 rows and no missing values. All attributes are of integer data type.
14 | * It is an imbalanced data set: the failure class is only about 0.1% of the size of the non-failure class. Oversampling is used here to deal with the imbalance.
15 | * Some attributes have a limited number of distinct values and are very sparse, indicating that they are likely categorical variables, such as attributes 3, 5, 7 and 9.
16 | * Attributes 7 and 8 appear to be exactly the same, so we can drop one of them.
17 | * Attributes 2, 3, 4, 7 and 9 are highly skewed.
18 | * Attributes differ in their magnitudes, so scaling or centering is required.
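
The duplicate-column check and the scaling step mentioned above can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual code; the column names (`attribute6`, `attribute7`, `attribute8`) and the toy values are assumptions made for the example.

```python
from statistics import mean, pstdev

# Toy rows standing in for the CSV; names and values are hypothetical.
rows = [
    {"attribute6": 10, "attribute7": 0, "attribute8": 0},
    {"attribute6": 20, "attribute7": 0, "attribute8": 0},
    {"attribute6": 30, "attribute7": 8, "attribute8": 8},
    {"attribute6": 40, "attribute7": 0, "attribute8": 0},
]

def columns_identical(rows, a, b):
    """True if columns a and b hold the same value in every row."""
    return all(r[a] == r[b] for r in rows)

def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

# Drop attribute8 only if it really duplicates attribute7.
duplicate = columns_identical(rows, "attribute7", "attribute8")
if duplicate:
    for r in rows:
        del r["attribute8"]

scaled = standardize([r["attribute6"] for r in rows])
print(duplicate, [round(v, 3) for v in scaled])
```

In a real pipeline the scaling parameters would be estimated on the training split only and then applied to the validation split.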
19 | ![WeChat4bfdb29bd5291137c9b8812f27227d45](https://user-images.githubusercontent.com/40584525/56835547-60e3cb00-6829-11e9-832c-cf98064e35af.png)
20 | * The number of devices decreases as time goes on, but there is a big jump in the middle. This may be because some devices come back into use after failing.
21 | 
22 | ![WeChat86a1951cc2f1259bab19346c19aa7e1c](https://user-images.githubusercontent.com/40584525/56835953-81f8eb80-682a-11e9-9998-9cb5e3a6ac17.png)
23 | 
24 | * For devices that come back into use after a failure, the failure date differs from their 'max date'.
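
One way to find such devices is to compare each device's failure date with the last date on which it reported. The sketch below uses toy records; the field names (`device`, `date`, `failure`) and the values are assumptions for illustration and may not match the CSV's actual columns.

```python
from collections import defaultdict

# Toy telemetry records: (device, date, failure). Hypothetical values.
records = [
    ("A", "2015-01-01", 0),
    ("A", "2015-01-02", 1),  # A fails here ...
    ("A", "2015-01-05", 0),  # ... but reports again later
    ("B", "2015-01-01", 0),
    ("B", "2015-01-03", 1),  # B fails on its last reporting day
    ("C", "2015-01-01", 0),  # C never fails
]

max_date = {}
failure_dates = defaultdict(list)
for device, date, failure in records:
    # ISO-formatted dates compare correctly as plain strings.
    max_date[device] = max(max_date.get(device, date), date)
    if failure == 1:
        failure_dates[device].append(date)

# Devices whose first failure is earlier than their last reporting date.
returned = sorted(
    d for d, dates in failure_dates.items() if min(dates) < max_date[d]
)
print(returned)
```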
25 | 
26 | ![WeChat18f5d9cd268981f4cef9df3bf283f909](https://user-images.githubusercontent.com/40584525/56836286-a1dcdf00-682b-11e9-964c-4adee3267e07.png)
27 | 
28 | * For attributes 3, 4, 5, 7 and 9, most values are zero, so converting them to categorical features may make more sense.
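
As a rough illustration of this idea, a mostly-zero attribute can be reduced to a zero / non-zero indicator. This is only one possible encoding and is an assumption of this sketch, as are the toy values; the notebook may bin the values differently.

```python
# Hypothetical values for one sparse attribute.
values = [0, 0, 0, 7, 0, 0, 120, 0, 0, 0]

# Share of rows where the attribute is zero.
zero_share = values.count(0) / len(values)

# Binary categorical encoding: 0 if the reading is zero, 1 otherwise.
is_nonzero = [int(v != 0) for v in values]

print(zero_share, is_nonzero)
```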
29 | 
30 | ## Oversampling
31 | * I tried two ways of oversampling. If I upsample the dataset before splitting it into a train and validation set, I can end up with the same observation in both sets. As a result, a complex enough model will be able to perfectly predict the value for those observations on the validation set, inflating the metrics.
32 | * The results show that we should upsample within the cross-validation, which means we oversample only the data used to train the model. The validation set remains unseen.
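
The leakage described above can be made concrete with a small standard-library sketch. It uses plain random oversampling with replacement on toy labels (both are assumptions of this example, not the notebook's exact procedure) and counts how many original rows end up on both sides of a fold under each strategy.

```python
import random

def random_oversample(indices, labels, rng):
    """Repeat minority-class row ids (with replacement) until classes balance."""
    indices = list(indices)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return indices
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

def kfold(n, k, rng):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    ids = list(range(n))
    rng.shuffle(ids)
    return [ids[i::k] for i in range(k)]

rng = random.Random(0)
labels = [1] * 10 + [0] * 190  # toy data, far less imbalanced than the real set
n = len(labels)

# Oversample BEFORE splitting: copies of one failure row can land in both sets.
upsampled = random_oversample(range(n), labels, rng)  # original row ids, repeated
leaks_before = 0
folds = kfold(len(upsampled), 5, rng)
for f, fold in enumerate(folds):
    valid = {upsampled[p] for p in fold}
    train = {upsampled[p] for g, other in enumerate(folds) if g != f for p in other}
    leaks_before += len(train & valid)

# Oversample WITHIN cross-validation: only the training part is resampled.
leaks_within = 0
folds = kfold(n, 5, rng)
for f, valid in enumerate(folds):
    train = [i for g, other in enumerate(folds) if g != f for i in other]
    train = random_oversample(train, labels, rng)
    leaks_within += len(set(train) & set(valid))

print("rows shared by train and validation, oversample first:", leaks_before)
print("rows shared by train and validation, oversample in CV:", leaks_within)
```

The second count is zero by construction, because resampling only ever repeats row ids that are already in the training part of the fold.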
33 | ![WeChat0f41fef78358121010eaf33e3be0a696](https://user-images.githubusercontent.com/40584525/56836325-bf11ad80-682b-11e9-82e2-c7285d6beeef.png)
34 | * Here I drew the ROC curves for the top 4 models generated above.
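
For reference, an ROC curve is traced by lowering the decision threshold over the predicted scores and recording the false-positive and true-positive rates at each step. The sketch below is a generic standard-library illustration on made-up labels and scores, not the models or results from the notebook; the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Made-up example: two non-failures and two failures.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```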
35 |  
36 | ## Deployment
37 | ![WeChatd008517b5a41192a946e6f51c6b15f17](https://user-images.githubusercontent.com/40584525/56856573-813b8480-6912-11e9-8056-ebe49627a9cf.png)
38 | 
39 | * The results are good, which suggests that the fitted model generalizes well.
40 | ## Future work
41 | * Try other classification models, for example a neural network.
42 | 


--------------------------------------------------------------------------------