├── README.md
├── device failure prediction-imbalanced-undersampling.ipynb
├── device failure prediction.ipynb
└── failures_prediction.csv

/README.md:
--------------------------------------------------------------------------------
# Device-failure-prediction
Liling Lu
April 20, 2019
## Overview
* EDA and data engineering were performed on the given CSV data set, and several high-performance models were built at the end.
* Challenging parts: (1) dealing with the imbalanced dataset; (2) exploring the 'date' feature; (3) some devices are removed and later put back into use at different time periods; (4) deciding whether oversampling should be done before or within cross-validation.
## Project background
* The company has a fleet of devices transmitting daily aggregated telemetry attributes. Predictive maintenance techniques are designed to help determine the condition of in-service equipment in order to predict when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.
## Exploratory Data Analysis

![WeChat18b25b902e804c5765dc34bf1a476bd8](https://user-images.githubusercontent.com/40584525/56835120-2ded0780-6828-11e9-9a7a-1cb0aba4d1b3.png)
![WeChatf462cae8f8f9172e5f1b5de73c87be61](https://user-images.githubusercontent.com/40584525/56835125-30e7f800-6828-11e9-938c-471d42a15a4c.png)
* This dataset has 124494 rows and no missing values. All attributes are of integer type.
* The data set is imbalanced: the failure class is about 0.1% of the size of the non-failure class. Oversampling is used here to deal with the imbalance.
* Some attributes, such as attributes 3, 5, 7 and 9, have a limited number of distinct values and are very sparse, which suggests they are likely categorical variables.
* Attributes 7 and 8 appear to be identical, so one of them can be dropped.
* Attributes 2, 3, 4, 7 and 9 are highly skewed.
* Attributes differ in their magnitudes. Scaling or centering is required.
![WeChat4bfdb29bd5291137c9b8812f27227d45](https://user-images.githubusercontent.com/40584525/56835547-60e3cb00-6829-11e9-832c-cf98064e35af.png)
* The number of devices decreases over time, with a big jump in the middle. This may be because some devices return to service after an earlier failure.

![WeChat86a1951cc2f1259bab19346c19aa7e1c](https://user-images.githubusercontent.com/40584525/56835953-81f8eb80-682a-11e9-9998-9cb5e3a6ac17.png)

* For the devices that return to service after a failure, the failure date differs from their 'max date'.

![WeChat18f5d9cd268981f4cef9df3bf283f909](https://user-images.githubusercontent.com/40584525/56836286-a1dcdf00-682b-11e9-964c-4adee3267e07.png)

* For attributes 3, 4, 5, 7 and 9, most values are zero, so converting them to categorical features may make more sense.

## Oversampling
* I tried two ways of oversampling. If I upsample the dataset before splitting it into a training set and a validation set, the same observation can end up in both sets. As a result, a sufficiently complex model can perfectly predict those observations in the validation set, inflating the metrics.
* The results show that we should upsample within cross-validation, meaning that only the data used to train the model is oversampled, while the validation set remains unseen.
![WeChat0f41fef78358121010eaf33e3be0a696](https://user-images.githubusercontent.com/40584525/56836325-bf11ad80-682b-11e9-82e2-c7285d6beeef.png)
* Here I drew the ROC curves for the top 4 models generated above.
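
To make the "oversample within cross-validation" point concrete, here is a minimal standard-library sketch. It is not the code from the notebooks: the function names (`stratified_folds`, `oversample_minority`, `cv_splits_with_oversampling`) and the toy labels are assumptions made for illustration. It splits row indices into stratified folds first, then duplicates minority-class rows only inside each training split, so no validation row can appear in the training data.

```python
import random
from collections import Counter


def stratified_folds(labels, n_splits, seed=0):
    """Split row indices into n_splits folds, keeping the class ratio in each."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


def oversample_minority(indices, labels, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for cls_idx in by_class.values():
        extra = [rng.choice(cls_idx) for _ in range(target - len(cls_idx))]
        balanced.extend(cls_idx + extra)
    return balanced


def cv_splits_with_oversampling(labels, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; only train_idx is oversampled."""
    folds = stratified_folds(labels, n_splits, seed)
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield oversample_minority(train_idx, labels, seed + k), val_idx


# Toy labels: 10 failures among 1000 rows (hypothetical, not the real data).
labels = [1] * 10 + [0] * 990
for train_idx, val_idx in cv_splits_with_oversampling(labels):
    counts = Counter(labels[i] for i in train_idx)
    assert counts[0] == counts[1]             # training split is balanced
    assert not set(train_idx) & set(val_idx)  # no row leaks into validation
```

Oversampling before the split would instead let copies of the same failure row land on both sides, which is the leakage described above. Each validation fold here keeps the original class ratio, so metrics computed on it reflect the real imbalance.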

## Deployment
![WeChatd008517b5a41192a946e6f51c6b15f17](https://user-images.githubusercontent.com/40584525/56856573-813b8480-6912-11e9-8056-ebe49627a9cf.png)

* The results are pretty good, which suggests that the fitted model generalizes well.
## Future work
Try other classification models, for example a neural network.

--------------------------------------------------------------------------------