├── assets
│   └── predictive_asset_maintenance_e2e_flow.png
├── env
│   └── intel_env.yml
├── SECURITY.md
├── LICENSE
├── run_dataset.sh
├── .gitignore
├── src
│   ├── generate_data_pandas.py
│   ├── train_predict_pam.py
│   └── daal_xgb_model.py
├── PredictiveMaintenance.ipynb
└── README.md
/assets/predictive_asset_maintenance_e2e_flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oneapi-src/predictive-asset-health-analytics/HEAD/assets/predictive_asset_maintenance_e2e_flow.png
--------------------------------------------------------------------------------
/env/intel_env.yml:
--------------------------------------------------------------------------------
1 | name: predictive_maintenance_intel
2 | channels:
3 | - intel
4 | - conda-forge
5 | dependencies:
6 | - python=3.10
7 | - intelpython3_full=2024.0.0
8 | - modin-all=0.24.1
9 |
10 |
--------------------------------------------------------------------------------
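Note: the following is an editorial, minimal sketch (not a file in the kit) to confirm that the key packages pulled in by this environment resolve once `predictive_maintenance_intel` is activated. The imports mirror what the scripts under `src/` actually use; the version lookup via `getattr` is a defensive assumption in case a package does not expose `__version__`.

```python
# Hypothetical sanity check for the predictive_maintenance_intel environment.
# Each import corresponds to a package exercised by the scripts in src/.
import daal4py
import modin
import pandas
import sklearnex
import xgboost

for name, mod in [("pandas", pandas), ("modin", modin), ("xgboost", xgboost),
                  ("daal4py", daal4py), ("sklearnex", sklearnex)]:
    print(f"{name}: {getattr(mod, '__version__', 'unknown')}")
```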
/SECURITY.md:
--------------------------------------------------------------------------------
1 | # Security Policy
2 | Intel is committed to rapidly addressing security vulnerabilities affecting our customers and providing clear guidance on the solution, impact, severity and mitigation.
3 |
4 | ## Reporting a Vulnerability
5 | Please report any security vulnerabilities in this project [utilizing the guidelines here](https://www.intel.com/content/www/us/en/security-center/vulnerability-handling-guidelines.html).
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2024, Intel Corporation
2 |
3 | Redistribution and use in source and binary forms, with or without
4 | modification, are permitted provided that the following conditions are met:
5 |
6 | * Redistributions of source code must retain the above copyright notice,
7 | this list of conditions and the following disclaimer.
8 | * Redistributions in binary form must reproduce the above copyright
9 | notice, this list of conditions and the following disclaimer in the
10 | documentation and/or other materials provided with the distribution.
11 | * Neither the name of Intel Corporation nor the names of its contributors
12 | may be used to endorse or promote products derived from this software
13 | without specific prior written permission.
14 |
15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
16 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
17 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
18 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
19 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
20 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
21 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
22 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
23 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
24 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 |
--------------------------------------------------------------------------------
/run_dataset.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Copyright (C) 2024 Intel Corporation
4 | # SPDX-License-Identifier: BSD-3-Clause
5 |
6 | [[ -z "$WORKSPACE" ]] && { echo "Error: \$WORKSPACE not declared"; exit 1; }
7 | [[ -z "$DATA_DIR" ]] && { echo "Error: \$DATA_DIR not declared"; exit 1; }
8 | [[ -z "$OUTPUT_DIR" ]] && { echo "Error: \$OUTPUT_DIR not declared"; exit 1; }
9 |
10 | # Dataset sizes
11 | SIZE_ARRAY=(25000 50000 100000 200000 400000 800000 1000000 2000000 4000000 8000000 10000000)
12 | SIZE_ELEMENTS=${#SIZE_ARRAY[@]}
13 |
14 | # Hyperparameter tuning
15 | TUNING_ARRAY=(notuning hyperparametertuning)
16 | TUNING_ELEMENTS=${#TUNING_ARRAY[@]}
17 |
18 | for (( i=0;i<$SIZE_ELEMENTS;i++)); do
19 | echo -e "\t$i. ${SIZE_ARRAY[${i}]}"
20 | done
21 | echo -e "Select dataset size: \c"
22 | read opt
23 | DATASIZE=${SIZE_ARRAY[${opt}]}
24 |
25 | for (( i=0;i<$TUNING_ELEMENTS;i++)); do
26 | echo -e "\t$i. ${TUNING_ARRAY[${i}]}"
27 | done
28 | echo -e "Select tuning option: \c"
29 | read opt
30 | TUNING="0"
31 | if [[ "${TUNING_ARRAY[${opt}]}" = "${TUNING_ARRAY[1]}" ]]; then
32 | TUNING="1"
33 | fi
34 |
35 | echo -e "Number of CPU cores to be used for the training: \c"
36 | read opt
37 | CPU=${opt}
38 |
39 | mkdir -p $OUTPUT_DIR/logs
40 |
41 | mkdir -p $DATA_DIR
42 | DATASET=$DATA_DIR/data_$DATASIZE.pkl
43 |
44 | ## Generate PAM data if it does not exist
45 | if [ ! -e $DATASET ]; then
46 | OF=$OUTPUT_DIR/logs/logfile_pandas_${DATASIZE}_$(date +%Y%m%d%H%M%S).log
47 | echo "+++++++++++++++++++++++++++++++++DataSet++++++++++++++++++++++++++++++++++++++++++++++++++++" >> $OF
48 | echo -e "Generating Data... \n"
49 |     python3 $WORKSPACE/src/generate_data_pandas.py -s $DATASIZE -f $DATASET 2>&1 | tee -a $OF
50 | echo "Dataset Generation logfile stored in $OF"
51 | sync
52 | echo "Generating Data: DONE"
53 | fi
54 |
55 | ## Training and prediction
56 | OF=$OUTPUT_DIR/logs/logfile_train_predict_${DATASIZE}_$(date +%Y%m%d%H%M%S).log
57 | echo "++++++++++++++++++++++++++Training and Prediction++++++++++++++++++++++++++++++++++++++++++++" >> $OF
58 | echo -e "Training and Prediction... \n"
59 | python3 $WORKSPACE/src/train_predict_pam.py -t $TUNING -p "pandas" -f $DATASET -ncpu $CPU 2>&1 | tee -a $OF
60 | sync
61 | echo "Dataset stored in $DATASET"
62 | echo "Train Prediction logfile stored in $OF"
63 | echo "Training and Prediction: DONE"
64 |
--------------------------------------------------------------------------------
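Note: `run_dataset.sh` reads its three choices (dataset-size index, tuning index, CPU count) from standard input, so it can also be driven non-interactively. The snippet below is an illustrative sketch only, and assumes `WORKSPACE`, `DATA_DIR`, and `OUTPUT_DIR` have already been exported as described in the README; the answer values are examples.

```python
# Illustrative only: drive run_dataset.sh without interactive prompts by
# piping the three answers it expects (size index, tuning index, CPU count).
import os
import subprocess

answers = "3\n0\n8\n"  # index 3 = 200000 rows, index 0 = notuning, 8 CPU cores
subprocess.run(
    ["bash", os.path.expandvars("$WORKSPACE/run_dataset.sh")],
    input=answers,
    text=True,
    check=True,
)
```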
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 |
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 |
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 |
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | local_settings.py
60 | db.sqlite3
61 | db.sqlite3-journal
62 |
63 | # Flask stuff:
64 | instance/
65 | .webassets-cache
66 |
67 | # Scrapy stuff:
68 | .scrapy
69 |
70 | # Sphinx documentation
71 | docs/_build/
72 |
73 | # PyBuilder
74 | .pybuilder/
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | # For a library or package, you might want to ignore these files since the code is
86 | # intended to run in multiple environments; otherwise, check them in:
87 | # .python-version
88 |
89 | # pipenv
90 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
91 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
92 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
93 | # install all needed dependencies.
94 | #Pipfile.lock
95 |
96 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
97 | __pypackages__/
98 |
99 | # Celery stuff
100 | celerybeat-schedule
101 | celerybeat.pid
102 |
103 | # SageMath parsed files
104 | *.sage.py
105 |
106 | # Environments
107 | .env
108 | .venv
109 | venv/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
131 | # pytype static type analyzer
132 | .pytype/
133 |
134 | # Cython debug symbols
135 | cython_debug/
136 |
--------------------------------------------------------------------------------
/src/generate_data_pandas.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2024 Intel Corporation
2 | # SPDX-License-Identifier: BSD-3-Clause
3 |
4 | '''
5 | Module to generate dataset for Predictive Asset Health Analytics
6 | '''
7 | # !/usr/bin/env python
8 | # coding: utf-8
9 | import warnings
10 | import argparse
11 | import logging
12 | import time
13 | import pandas as pd
14 | import numpy as np
15 |
16 | parser = argparse.ArgumentParser()
17 | parser.add_argument('-s',
18 | '--size',
19 | type=int,
20 | required=False,
21 | default=25000,
22 | help='data size')
23 | parser.add_argument('-f',
24 | '--file',
25 | type=str,
26 | required=False,
27 | default='asset_data_pandas.pkl',
28 | help='output pkl file name')
29 | parser.add_argument('-d',
30 | '--debug',
31 | action='store_true',
32 | help='changes logging level from INFO to DEBUG')
33 |
34 | FLAGS = parser.parse_args()
35 | dsize = FLAGS.size
36 |
37 | if FLAGS.debug:
38 | logging_level=logging.DEBUG
39 | else:
40 | logging_level=logging.INFO
41 |
42 | logging.basicConfig(level=logging_level)
43 | logger = logging.getLogger(__name__)
44 | warnings.filterwarnings("ignore")
45 |
46 | # Generating our data
47 | start = time.time()
48 | logger.info('Generating data with the size %d', dsize)
49 | np.random.seed(1)
50 | manufacturer_list = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
51 | species_list = ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7']
52 | district_list = ['N', 'NE', 'NW', 'E', 'W', 'S', 'SE', 'SW']
53 | treatment_list = ['Oil', 'Pentachlorophenol', 'Untreated', 'Creosote', 'UNK', 'Cellon']
54 | data = pd.DataFrame({"Age": np.random.choice(range(1, 101), dsize, replace=True),
55 | "Elevation": np.random.randint(low=-300, high=4500, size=dsize),
56 | "Pole_Height": np.random.normal(60, 15, size=dsize),
57 | "Measured_Length": np.random.randint(low=1, high=2000, size=dsize),
58 | "Manufacturer": np.random.choice(manufacturer_list, dsize, replace=True),
59 | "Species": np.random.choice(species_list, dsize, replace=True),
60 | "Number_Repairs": np.random.choice(range(1, 7), dsize, replace=True),
61 | "District": np.random.choice(district_list, dsize, replace=True),
62 | "Tele_Attached": np.random.choice(range(0, 2), dsize, replace=True),
63 | "Original_Treatment": np.random.choice(treatment_list, dsize, replace=True)})
64 |
65 |
66 | # changing Tele_Attached into an object variable
67 | logger.info('changing Tele_Attached into an object variable')
68 | data[['Number_Repairs', 'Tele_Attached']] = data[['Number_Repairs', 'Tele_Attached']].astype('object')
69 |
70 |
71 | # Generating our target variable Asset_Label
72 | logger.info('Generating our target variable Asset_Label')
73 | data['Asset_Label'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.99, 0.01])
74 |
75 |
76 | # Creating correlation between our variables and our target variable
77 | # When age is between 1-5 or over 45 change Asset_Label to 1
78 | logger.info('Creating correlation between our variables and our target variable')
79 | logger.info('When age is between 1-5 or over 45 change Asset_Label to 1')
80 | data['Asset_Label'] = np.where(((data.Age > 0) & (data.Age <= 5)) | (data.Age > 45),
81 | 1, data.Asset_Label)
82 |
83 | # When elevation is between -300 and 1400 change Asset_Label to 1
84 | logger.info('When elevation is between -300 and 1400 change Asset_Label to 1')
85 | data['Asset_Label'] = np.where((data.Elevation >= -300) & (data.Elevation <= 1400),
86 | 1, data.Asset_Label)
87 |
88 | # When Manufacturer is A, E, or H change Asset_Label to have 95% 0's
89 | logger.info("When Manufacturer is A, E, or H change Asset_Label to have 95% 0's")
90 | data['Temp_Var'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.95, 0.05])
91 | data['Asset_Label'] = np.where((data.Manufacturer == 'A') |
92 | (data.Manufacturer == 'E') |
93 | (data.Manufacturer == 'H'),
94 | data.Temp_Var,
95 | data.Asset_Label)
96 |
97 | # When Species is C2 or C5 change Asset_Label to have 90% 0's
98 | logger.info("When Species is C2 or C5 change Asset_Label to have 90% 0's")
99 | data['Temp_Var'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.9, 0.1])
100 | data['Asset_Label'] = np.where((data.Species == 'C2') | (data.Species == 'C5'),
101 | data.Temp_Var,
102 | data.Asset_Label)
103 |
104 |
105 | # When District is NE or W change Asset_Label to have 95% 0's
106 | logger.info("When District is NE or W change Asset_Label to have 95% 0's")
107 | data['Temp_Var'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.95, 0.05])
108 | data['Asset_Label'] = np.where((data.District == 'NE') | (data.District == 'W'),
109 | data.Temp_Var, data.Asset_Label)
110 |
111 |
112 | # When Original_Treatment is Untreated change Asset_Label to have 75% 1's
113 | logger.info("When Original_Treatment is Untreated change Asset_Label to have 75% 1's")
114 | data['Temp_Var'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.25, 0.75])
115 | data['Asset_Label'] = np.where((data.Original_Treatment == 'Untreated'),
116 | data.Temp_Var,
117 | data.Asset_Label)
118 |
119 |
120 | # When Age is at least 20, Elevation is at most 1000, Original_Treatment is Oil
121 | # and Pole_Height is at least 60, change Asset_Label to have 95% 1's
122 | logger.info("When Age is at least 20, Elevation is at most 1000, Original_Treatment is Oil" \
123 |             " and Pole_Height is at least 60, change Asset_Label to have 95% 1's")
124 | data['Temp_Var'] = np.random.choice(range(0, 2), dsize, replace=True, p=[0.05, 0.95])
125 | data['Asset_Label'] = np.where((data.Age >= 20) &
126 | (data.Elevation <= 1000) &
127 | (data.Original_Treatment == 'Oil') &
128 | (data.Pole_Height >= 60),
129 | data.Temp_Var,
130 | data.Asset_Label)
131 |
132 | data.drop('Temp_Var', axis=1, inplace=True)
133 |
134 |
135 | # Quick look at the generated data (visible when the --debug flag is used)
136 | logger.debug(data.head())
137 |
138 | Categorical_Variables = pd.get_dummies(
139 | data[[
140 | 'Manufacturer',
141 | 'District',
142 | 'Species',
143 | 'Original_Treatment']],
144 | drop_first=True, dtype='int8')
145 | data = pd.concat([data, Categorical_Variables], axis=1)
146 | data.drop(['Manufacturer', 'District', 'Species', 'Original_Treatment'], axis=1, inplace=True)
147 |
148 | data = data.astype({'Tele_Attached': 'int32', 'Number_Repairs': 'float64'})
149 |
150 | etime = time.time() - start
151 | datasize = data.shape
152 | logger.info('=====> Time taken %f secs for data generation for the size of %s',
153 | etime, datasize)
154 |
155 |
156 | pklfile = FLAGS.file
157 | logger.info('Saving the data to %s ...', pklfile)
158 | data.to_pickle(pklfile)
159 | logger.info('DONE')
160 |
--------------------------------------------------------------------------------
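Note: an optional, editorial sketch for inspecting a pickle produced by this script. The file name below is an assumption (substitute whatever path was passed via `-f`); the only column it relies on, `Asset_Label`, is written by the script.

```python
# Optional inspection of a generated dataset; the path below is an assumption.
import pandas as pd

data = pd.read_pickle("data_25000.pkl")
print(data.shape)                                         # rows x one-hot encoded columns
print(data.dtypes.head(10))                               # numeric and encoded dtypes
print(data["Asset_Label"].value_counts(normalize=True))   # class balance of the target
```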
/src/train_predict_pam.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2024 Intel Corporation
2 | # SPDX-License-Identifier: BSD-3-Clause
3 |
4 | '''
5 | Module to train and predict using the XGBoost classifier
6 | '''
7 | # !/usr/bin/env python
8 | # coding: utf-8
9 | # pylint: disable=import-error
10 | import time
11 | from datetime import datetime
12 | import warnings
13 | import argparse
14 | import sys
15 | import logging
16 | import numpy as np
17 | import xgboost as xgb
18 |
19 | if __name__ == "__main__":
20 | # Data size
21 | parser = argparse.ArgumentParser()
22 | parser.add_argument('-f',
23 | '--file',
24 | type=str,
25 | required=False,
26 | default='data_25000.pkl',
27 | help='input pkl file name')
28 | parser.add_argument('-p',
29 | '--package',
30 | type=str,
31 | required=False,
32 | default='pandas',
33 | help='data package to be used (pandas, modin)')
34 | parser.add_argument('-t',
35 | '--tuning',
36 | type=str,
37 | required=False,
38 | default='0',
39 | help='hyper parameter tuning (0/1)')
40 | parser.add_argument('-cv',
41 | '--cross-validation',
42 | type=int,
43 | required=False,
44 | default=2,
45 | help='cross validation iteration')
46 | parser.add_argument('-ncpu',
47 | '--num-cpu',
48 | type=int,
49 | required=True,
50 | default=4,
51 | help='number of cpu cores, default 4')
52 | parser.add_argument('-d',
53 | '--debug',
54 | action='store_true',
55 | help='changes logging level from INFO to DEBUG')
56 |
57 | FLAGS = parser.parse_args()
58 |
59 | if FLAGS.debug:
60 | logging_level=logging.DEBUG
61 | else:
62 | logging_level=logging.INFO
63 |
64 | logging.basicConfig(level=logging_level)
65 | logger = logging.getLogger(__name__)
66 | warnings.filterwarnings("ignore")
67 |
68 | pkg = FLAGS.package
69 | cv_val = FLAGS.cross_validation
70 | TUNING = False
71 | if FLAGS.tuning == "1":
72 | TUNING = True
73 |
74 | from sklearnex import patch_sklearn
75 | patch_sklearn()
76 |
77 | from sklearn.model_selection import GridSearchCV
78 | from sklearn.model_selection import train_test_split
79 | from sklearn.preprocessing import RobustScaler
80 |
81 | if pkg == "pandas":
82 | import pandas as pd # noqa: F811
83 |
84 | if pkg == "modin":
85 | import modin.config as cfg
86 | cfg.Engine.put('ray')
87 | import modin.pandas as pd # noqa: F811
88 |
89 | # Generating our data
90 | logger.info('Reading the dataset from %s...', FLAGS.file)
91 | try:
92 | data = pd.read_pickle(FLAGS.file)
93 | except FileNotFoundError:
94 | sys.exit('Dataset file not found')
95 |
96 | datasize = data.shape
97 |
98 | start = time.time()
99 | X = data.drop('Asset_Label', axis=1)
100 | y = data.Asset_Label
101 |
102 | # original split .25
103 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
104 |
105 | df_num_train = X_train.select_dtypes(['float', 'int', 'int32'])
106 | df_num_test = X_test.select_dtypes(['float', 'int', 'int32'])
107 | robust_scaler = RobustScaler()
108 | X_train_scaled = robust_scaler.fit_transform(df_num_train)
109 | X_test_scaled = robust_scaler.transform(df_num_test)
110 |
111 | # Making them pandas dataframes
112 | X_train_scaled_transformed = pd.DataFrame(X_train_scaled,
113 | index=df_num_train.index,
114 | columns=df_num_train.columns)
115 | X_test_scaled_transformed = pd.DataFrame(X_test_scaled,
116 | index=df_num_test.index,
117 | columns=df_num_test.columns)
118 |
119 | del X_train_scaled_transformed['Number_Repairs']
120 | del X_train_scaled_transformed['Tele_Attached']
121 |
122 | del X_test_scaled_transformed['Number_Repairs']
123 | del X_test_scaled_transformed['Tele_Attached']
124 |
125 | # Dropping the unscaled numerical columns
126 | X_train = X_train.drop(['Age', 'Elevation', 'Pole_Height', 'Measured_Length'], axis=1)
127 | X_test = X_test.drop(['Age', 'Elevation', 'Pole_Height', 'Measured_Length'], axis=1)
128 |
129 | # Creating train and test data with scaled numerical columns
130 | X_train_scaled_transformed = pd.concat([X_train_scaled_transformed, X_train], axis=1)
131 | X_test_scaled_transformed = pd.concat([X_test_scaled_transformed, X_test], axis=1)
132 |
133 | def fit_xgb_model(x_data, y_data):
134 |         """Use an XGBClassifier for this problem."""
135 | # prepare data for xgboost training
136 | dtrain = xgb.DMatrix(x_data, y_data, nthread=FLAGS.num_cpu)
137 | label = dtrain.get_label()
138 | ratio = float(np.sum(label == 0)) / np.sum(label == 1)
139 | # Set xgboost parameters
140 | parameters = {'scale_pos_weight': ratio.round(2), 'n_jobs': FLAGS.num_cpu, 'tree_method': 'hist'}
141 |
142 | # define the model to use
143 | if TUNING is False:
144 | xg_cl = xgb.XGBClassifier(use_label_encoder=False)
145 | else:
146 | xg_cl = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
147 | xg_cl.set_params(**parameters)
148 | # Train the model
149 | xg_cl.fit(x_data, y_data)
150 | return xg_cl
151 |
152 | X_train_scaled_transformed = X_train_scaled_transformed.astype(
153 | {'Tele_Attached': 'float64',
154 | 'Number_Repairs': 'float64'})
155 | X_test_scaled_transformed = X_test_scaled_transformed.astype(
156 | {'Tele_Attached': 'float64',
157 | 'Number_Repairs': 'float64'})
158 |
159 | # Training
160 | tstart = time.time()
161 | xgb_model = fit_xgb_model(X_train_scaled_transformed, y_train)
162 | ttime = time.time() - tstart
163 |
164 | if TUNING is True:
165 | logger.info("Starting hyper parameter tuning:")
166 | # GridSearchCV
167 | def timer(start_time=None): # pylint: disable=missing-function-docstring
168 | if not start_time:
169 | start_time = datetime.now()
170 | return start_time
171 | if start_time:
172 | thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
173 | tmin, tsec = divmod(temp_sec, 60)
174 |                 logger.info('Time taken: %i hours %i minutes and %s seconds.',
175 |                             thour, tmin, round(tsec, 2))
176 | return 0
177 |
178 | # Hyper parameters for tuning
179 |         # n_jobs=-1 uses all available CPU cores; set it according to the underlying hardware
180 | params = {
181 | 'min_child_weight': [1, 5, 10],
182 | 'gamma': [0.5, 1, 1.5, 2, 5],
183 | # 'subsample': [0.6, 0.8, 1.0],
184 | # 'colsample_bytree': [0.6, 0.8, 1.0],
185 | 'max_depth': [3, 4, 5],
186 | 'n_jobs': [-1]
187 | # 'learning_rate': [0.001, 0.01]
188 | }
189 |
190 | xgb_model = GridSearchCV(xgb_model, param_grid=params, cv=cv_val, verbose=10)
191 | xgb_model.fit(X_train_scaled_transformed, y_train)
192 |
193 | pstart = time.time()
194 | xgb_prediction = xgb_model.predict(X_test_scaled_transformed)
195 | ptime = time.time() - pstart
196 | xgb_errors_count = np.count_nonzero(xgb_prediction - np.ravel(y_test))
197 |
198 | xgb_total = ptime
199 |
200 | y_test = np.ravel(y_test)
201 |
202 | etime = time.time() - start
203 | accuracy_scr = 1 - xgb_errors_count / xgb_prediction.shape[0]
204 | sys.stdout.flush()
205 | sys.stderr.flush()
206 | logger.info('=====> Total Time: %f secs for data size %s', etime, datasize)
207 | logger.info('=====> Training Time %f secs', ttime)
208 | logger.info('=====> Prediction Time %f secs', ptime)
209 | logger.info('=====> XGBoost accuracy score %f', accuracy_scr)
210 |
211 | logger.info('DONE')
212 |
--------------------------------------------------------------------------------
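Note: for reference, the accuracy the script derives from the error count (`1 - errors / n`) is the same quantity as scikit-learn's `accuracy_score`. The following is a small, self-contained illustrative check with stand-in arrays (the real `y_test` / `xgb_prediction` live inside the script).

```python
# Illustrative check that 1 - errors/n matches sklearn's accuracy_score.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # stand-in for np.ravel(y_test)
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 1])   # stand-in for xgb_prediction
errors = np.count_nonzero(y_pred - y_true)     # same error count the script uses
manual_accuracy = 1 - errors / y_pred.shape[0]

assert np.isclose(manual_accuracy, accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))        # optional per-class breakdown
```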
/PredictiveMaintenance.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "68ca0fdc",
6 | "metadata": {},
7 | "source": [
8 | "# Predictive Asset Health Analytics\n",
9 | "\n",
10 | "## Introduction\n",
11 | "Create an end-to-end predictive asset maintenance solution with XGBoost* from Intel® oneAPI AI Analytics Toolkit (oneAPI). Check out more workflow examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).\n",
12 | "\n",
13 | "## Solution Technical Overview\n",
14 | "\n",
15 | "Predictive asset maintenance is a method that uses data analysis tools to predict defects and anomalies before they happen. Solutions of huge scale typically require operating across multiple hardware architectures. Accelerating training for the ever-increasing size of datasets and machine learning models is a major challenge while adopting AI (Artificial Intelligence).\n",
16 | "\n",
17 | "In an industrial scenario, it is important to improve the MLOps (Machine Learning Operations) time for developing and deploying new models; this can be challenging due to the ever-increasing size of datasets over time. The XGBoost* classifier with the hist tree method addresses this problem by improving the overall training/tuning and validation time. A model serving large batch workloads requires fast prediction with low accuracy loss; daal4py helps the XGBoost* machine learning model meet these criteria.\n",
18 | "\n",
19 | "For more details, visit the [Predictive Asset Maintenance](https://github.com/oneapi-src/predictive-asset-health-analytics) GitHub repository.\n",
20 | "\n",
21 | "## Validated Hardware Details \n",
22 | "\n",
23 | "Intel® oneAPI is used to achieve quick results even when the data for a model is huge. It provides the capability to reuse code written in different languages so that hardware utilization is optimized to deliver these results.\n",
24 | "\n",
25 | "| Recommended Hardware | Precision |\n",
26 | "| ---------------------------- | ---------- |\n",
27 | "| Intel® 4th Gen Xeon® Scalable Performance processors|BF16 |\n",
28 | "| Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors| FP32 |\n",
29 | "\n",
30 | "## How it Works\n",
31 | "\n",
32 | "This reference kit generates a dataset of a given row size for a predictive asset maintenance analytics use case and stores it in `.pkl` format; the data is then split for training and testing, where we train a model built on the XGBoost* algorithm and predict on the test data.\n",
33 | "\n",
34 | ""
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "id": "3671f658",
40 | "metadata": {},
41 | "source": [
42 | "## Run Using Jupyter Notebook\n",
43 | "### Run Workflow\n",
44 | "The following cell provides the variables needed to execute the workflow scripts. \n",
45 | "If the user did not previously define the path to `WORKSPACE` from the console, or wants to use another `WORKSPACE` location, replace `` with the new path. Before using a new `WORKSPACE` directory, make sure that the procedure described in `Get Started` in the `README.md` file has been followed for the new `WORKSPACE` location."
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "id": "b729b884",
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "#Setting path variables.\n",
56 | "import os\n",
57 | "workspace = os.getenv(\"WORKSPACE\", \"\")\n",
58 | "data_dir = workspace+'/data'\n",
59 | "output_dir = workspace+'/output'\n",
60 | "print(\"workspace path: {}\".format(workspace))\n",
61 | "\n",
62 | "#Setting parameter values.\n",
63 | "dataset_size = 200000\n",
64 | "datapkl_path = data_dir+f\"/data_{dataset_size}.pkl\"\n",
65 | "ncpu = 20\n",
66 | "tuning = 0 #Hyperparameter tuning, 0 for no tuning\n",
67 | "data_package = \"pandas\" #Valid options are pandas and modin\n",
68 | "cross_validation = 5"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "id": "0b2f9b30",
74 | "metadata": {},
75 | "source": [
76 | "The following script can be executed with the parameters provided below for generating the test dataset with the active environment.\n",
77 | "\n",
78 | "```\n",
79 | "usage: src/generate_data_pandas.py [-h] [-s SIZE] [-f FILE]\n",
80 | "\n",
81 | "optional arguments:\n",
82 | " -h, --help show this help message and exit\n",
83 | " -s SIZE, --size SIZE data size which is number of rows\n",
84 | " -f FILE, --file FILE output pkl file name\n",
85 | " -d, --debug changes logging level from INFO to DEBUG\n",
86 | "```"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "id": "dda29c11",
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "%run {workspace}/src/generate_data_pandas.py -s {dataset_size} -f {datapkl_path}"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "id": "093b4673",
102 | "metadata": {},
103 | "source": [
104 | "Training and prediction along with hyperparameter tuning can also be executed independently:\n",
105 | "```\n",
106 | "usage: src/train_predict_pam.py [-h] [-f FILE] [-p PACKAGE] [-t TUNING] [-cv CROSS_VALIDATION]\n",
107 | " -ncpu NUM_CPU\n",
108 | "\n",
109 | "optional arguments:\n",
110 | " -h, --help show this help message and exit\n",
111 | " -f FILE, --file FILE input pkl file name\n",
112 | " -p PACKAGE, --package PACKAGE\n",
113 | " data package to be used (pandas, modin)\n",
114 | " -t TUNING, --tuning TUNING\n",
115 | " hyper parameter tuning (0/1)\n",
116 | " -cv CROSS_VALIDATION, --cross-validation CROSS_VALIDATION\n",
117 | " cross validation iteration, default 2.\n",
118 | " -ncpu NUM_CPU, --num-cpu NUM_CPU\n",
119 | " number of cpu cores, default 4.\n",
120 | " -d, --debug \n",
121 | " changes logging level from INFO to DEBUG\n",
122 | "```"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "id": "2b57cb84",
129 | "metadata": {
130 | "scrolled": true
131 | },
132 | "outputs": [],
133 | "source": [
134 | "%run {workspace}/src/train_predict_pam.py -t {tuning} -p {data_package} -f {datapkl_path} -ncpu {ncpu} -cv {cross_validation}"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "id": "98e02eb1",
140 | "metadata": {},
141 | "source": [
142 | "### XGBoost* with oneDAL Python Wrapper (daal4py) model\n",
143 | "To further improve prediction time for the trained XGBoost* machine learning model, it can be converted to a daal4py model. daal4py speeds up execution of the XGBoost* model on the underlying hardware by utilizing the Intel® oneAPI Data Analytics Library (oneDAL)."
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "id": "7577f325",
149 | "metadata": {},
150 | "source": [
151 | "The generated '.pkl' file is used as input for this Python script. \n",
152 | "```\n",
153 | "usage: src/daal_xgb_model.py [-h] [-f FILE]\n",
154 | "\n",
155 | "optional arguments:\n",
156 | " -h, --help show this help message and exit\n",
157 | " -f FILE, --file FILE input pkl file name\n",
158 | " -d, --debug changes logging level from INFO to DEBUG\n",
159 | "```\n",
160 | "Run the following command to train the model with the given dataset, convert it to daal4py format, and measure the prediction time performance."
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "id": "85b5b144",
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "%run {workspace}/src/daal_xgb_model.py -f {datapkl_path}"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "id": "06f6a6c0",
176 | "metadata": {},
177 | "source": [
178 | "## Expected Output\n",
179 | "A successful execution of ```generate_data_pandas.py``` should return similar results as shown below:\n",
180 | "```\n",
181 | "INFO:__main__:Generating data with the size 800000\n",
182 | "INFO:__main__:changing Tele_Attached into an object variable\n",
183 | "INFO:__main__:Generating our target variable Asset_Label\n",
184 | "INFO:__main__:Creating correlation between our variables and our target variable\n",
185 | "INFO:__main__:When age is between 1-5 or over 45 change Asset_Label to 1\n",
186 | "INFO:__main__:When elevation is between -300 and 1400 change Asset_Label to 1\n",
187 | "INFO:__main__:When Manufacturer is A, E, or H change Asset_Label to have 95% 0's\n",
188 | "INFO:__main__:When Species is C2 or C5 change Asset_Label to have 90% 0's\n",
189 | "INFO:__main__:When District is NE or W change Asset_Label to have 95% 0's\n",
190 | "INFO:__main__:When Original_Treatment is Untreated change Asset_Label to have 75% 1's\n",
191 | "INFO:__main__:When Age is at least 20, Elevation is at most 1000, Original_Treatment is Oil and Pole_Height is at least 60, change Asset_Label to have 95% 1's\n",
192 | "INFO:__main__:=====> Time taken 1.431621 secs for data generation for the size of (800000, 34)\n",
193 | "INFO:__main__:Saving the data to /localdisk/aagalleg/frameworks.ai.platform.sample-apps.predictive-health-analytics/data/data_800000.pkl ...\n",
194 | "INFO:__main__:DONE\n",
195 | "```\n",
196 | "\n",
197 | "A successful execution of ```train_predict_pam.py``` should return similar results as shown below:\n",
198 | "\n",
199 | "```\n",
200 | "INFO:__main__:=====> Total Time:\n",
201 | "6.791231 secs for data size (800000, 34)\n",
202 | "INFO:__main__:=====> Training Time 3.459683 secs\n",
203 | "INFO:__main__:=====> Prediction Time 0.281359 secs\n",
204 | "INFO:__main__:=====> XGBoost accuracy score 0.921640\n",
205 | "INFO:__main__:DONE\n",
206 | "```\n",
207 | "\n",
208 | "A successful execution of ```daal_xgb_model.py``` should return similar results as shown below:\n",
209 | "\n",
210 | "```\n",
211 | "INFO:__main__:Reading the dataset from ./intel_python/data_800000.pkl...\n",
212 | "INFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU\n",
213 | "INFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU\n",
214 | "INFO:__main__:XGBoost training time (seconds): 74.001453\n",
215 | "INFO:__main__:XGBoost inference time (seconds): 0.054897\n",
216 | "INFO:__main__:DAAL conversion time (seconds): 0.366412\n",
217 | "INFO:__main__:DAAL inference time (seconds): 0.017998\n",
218 | "INFO:__main__:XGBoost errors count: 15622\n",
219 | "INFO:__main__:XGBoost accuracy: 0.921890\n",
220 | "INFO:__main__:Daal4py errors count: 15622\n",
221 | "INFO:__main__:Daal4py accuracy: 0.921890\n",
222 | "INFO:__main__:XGBoost Prediction Time: 0.054897\n",
223 | "INFO:__main__:daal4py Prediction Time: 0.017998\n",
224 | "INFO:__main__:daal4py time improvement relative to XGBoost: 0.672158\n",
225 | "INFO:__main__:Accuracy Difference 0.000000\n",
226 | "```"
227 | ]
228 | }
229 | ],
230 | "metadata": {
231 | "kernelspec": {
232 | "display_name": "Python [conda env:predictive_maintenance_intel] *",
233 | "language": "python",
234 | "name": "conda-env-predictive_maintenance_intel-py"
235 | },
236 | "language_info": {
237 | "codemirror_mode": {
238 | "name": "ipython",
239 | "version": 3
240 | },
241 | "file_extension": ".py",
242 | "mimetype": "text/x-python",
243 | "name": "python",
244 | "nbconvert_exporter": "python",
245 | "pygments_lexer": "ipython3",
246 | "version": "3.10.13"
247 | }
248 | },
249 | "nbformat": 4,
250 | "nbformat_minor": 5
251 | }
252 |
--------------------------------------------------------------------------------
/src/daal_xgb_model.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2024 Intel Corporation
2 | # SPDX-License-Identifier: BSD-3-Clause
3 |
4 | '''
5 | Module to convert XGBoost trained model to optimized daal4py version
6 | '''
7 | # !/usr/bin/env python
8 | # coding: utf-8
9 | import time
10 | import warnings
11 | # import matplotlib.pyplot as plt
12 | import logging
13 | from typing import Deque, Dict, Any
14 | from collections import deque
15 | import json
16 | import argparse
17 | import sys
18 | import numpy as np
19 | import pandas as pd
20 | import xgboost as xgb
21 | import daal4py as d4p
22 | from sklearn.model_selection import train_test_split
23 |
24 |
25 | parser = argparse.ArgumentParser()
26 | parser.add_argument('-f',
27 | '--file',
28 | type=str,
29 | required=False,
30 | default='data_25000.pkl',
31 | help='input pkl file name')
32 | parser.add_argument('-d',
33 | '--debug',
34 | action='store_true',
35 | help='changes logging level from INFO to DEBUG')
36 |
37 | FLAGS = parser.parse_args()
38 |
39 | if FLAGS.debug:
40 | logging_level=logging.DEBUG
41 | else:
42 | logging_level=logging.INFO
43 |
44 | logging.basicConfig(level=logging_level)
45 | logger = logging.getLogger(__name__)
46 | warnings.filterwarnings("ignore")
47 |
48 | start = time.time()
49 | logger.info('Reading the dataset from %s...', FLAGS.file)
50 | try:
51 | data = pd.read_pickle(FLAGS.file)
52 | except FileNotFoundError:
53 | sys.exit('Dataset file not found')
54 |
55 | data['Original_Treatment_Untreated'].describe()
56 |
57 | datasize = data.shape
58 |
59 | X = data.drop('Asset_Label', axis=1)
60 | y = data.Asset_Label
61 |
62 | X = X.rename(columns={x: y for x, y in zip(X.columns, range(0, len(X.columns)))}) # pylint: disable=unnecessary-comprehension
63 |
64 | X[4] = X[4].astype('int32')
65 |
66 | # original split .25
67 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
68 |
69 | # Datasets creation
70 | xgb_train = xgb.DMatrix(X_train, label=np.array(y_train))
71 | xgb_test = xgb.DMatrix(X_test, label=np.array(y_test))
72 |
73 | train_start = time.time()
74 | # training parameters setting
75 | params = {
76 | 'max_bin': 256,
77 | 'scale_pos_weight': 2,
78 | 'lambda_l2': 1,
79 | 'alpha': 0.9,
80 | 'max_depth': 8,
81 | 'num_leaves': 2**8,
82 | 'verbosity': 0,
83 | 'objective': 'multi:softmax',
84 | 'learning_rate': 0.3,
85 | 'num_class': 5,
86 | }
87 |
88 | # Training
89 | xgb_model = xgb.train(params, xgb_train, num_boost_round=100)
90 |
91 | total_train_time = time.time() - train_start
92 | logger.info('XGBoost training time (seconds): %f', total_train_time)
93 |
94 | props = dict(boxstyle='round', facecolor='cyan', alpha=0.5)
95 |
96 | # Training - Training Time Benchmark
97 | left = [1]
98 |
99 | rounded_train_time = round(total_train_time, 5)
100 |
101 | tick_label = ['XGBoost Training Model']
102 |
103 | # XGBoost prediction (for accuracy comparison)
104 | xgb_start_time = time.time()
105 | xgb_prediction = xgb_model.predict(xgb_test)
106 | xgb_total = time.time() - xgb_start_time
107 |
108 | xgb_errors_count = np.count_nonzero(xgb_prediction - np.ravel(y_test))
109 |
110 | logger.info('XGBoost inference time (seconds): %f', xgb_total)
111 |
112 |
113 | # pylint: disable=too-many-branches
114 | # pylint: disable=too-many-locals
115 | def get_gbt_model_from_xgboost(booster: Any) -> Any: # pylint: disable=too-many-statements
116 | '''
117 | Class Node
118 | '''
119 | class Node: # pylint: disable=too-few-public-methods
120 | """Class representing a Node"""
121 | def __init__(self, tree: Dict, parent_id: int, position: int):
122 | self.tree = tree
123 | self.parent_id = parent_id
124 | self.position = position
125 |
126 | # Release Note for XGBoost 1.5.0: Python interface now supports configuring
127 | # constraints using feature names instead of feature indices.
128 | if booster.feature_names is None:
129 | lst = [*range(booster.num_features())]
130 | booster.feature_names = [str(i) for i in lst]
131 |     # Re-map feature names to their indices so split features can be parsed as
132 |     # integers. This also helps with pandas input with set feature names.
133 |     lst = [*range(booster.num_features())]
134 |     booster.feature_names = [str(i) for i in lst]
135 |
136 | trees_arr = booster.get_dump(dump_format="json")
137 | xgb_config = json.loads(booster.save_config())
138 | n_features = int(xgb_config["learner"]["learner_model_param"]["num_feature"])
139 | n_classes = int(xgb_config["learner"]["learner_model_param"]["num_class"])
140 | base_score = float(xgb_config["learner"]["learner_model_param"]["base_score"])
141 | is_regression = False
142 | objective_fun = xgb_config["learner"]["learner_train_param"]["objective"]
143 | if n_classes > 2:
144 | if objective_fun not in ["multi:softprob", "multi:softmax"]:
145 | raise TypeError(
146 | "multi:softprob and multi:softmax are only supported for multiclass classification")
147 | elif objective_fun.find("binary:") == 0:
148 | if objective_fun in ["binary:logistic", "binary:logitraw"]:
149 | n_classes = 2
150 | else:
151 | raise TypeError(
152 | "binary:logistic and binary:logitraw are only supported for binary classification")
153 | else:
154 | is_regression = True
155 | n_iterations = int(len(trees_arr) / (n_classes if n_classes > 2 else 1))
156 | # Create + base iteration
157 | if is_regression:
158 | m_b = d4p.gbt_reg_model_builder(n_features=n_features, n_iterations=n_iterations + 1) # pylint: disable=undefined-variable
159 | tree_id = m_b.create_tree(1)
160 | m_b.add_leaf(tree_id=tree_id, response=base_score)
161 | else:
162 | m_b = d4p.gbt_clf_model_builder( # pylint: disable=no-member
163 | n_features=n_features, n_iterations=n_iterations, n_classes=n_classes)
164 | class_label = 0
165 | iterations_counter = 0
166 | mis_eq_yes = None
167 | for tree in trees_arr:
168 | n_nodes = 1
169 | # find out the number of nodes in the tree
170 | for node in tree.split("nodeid")[1:]:
171 | node_id = int(node[3:node.find(",")])
172 | if node_id + 1 > n_nodes:
173 | n_nodes = node_id + 1
174 | if is_regression:
175 | tree_id = m_b.create_tree(n_nodes)
176 | else:
177 | tree_id = m_b.create_tree(n_nodes=n_nodes, class_label=class_label)
178 | iterations_counter += 1
179 | if iterations_counter == n_iterations:
180 | iterations_counter = 0
181 | class_label += 1
182 | sub_tree = json.loads(tree)
183 | # root is leaf
184 | if "leaf" in sub_tree:
185 | m_b.add_leaf(tree_id=tree_id, response=sub_tree["leaf"])
186 | continue
187 | # add root
188 | try:
189 | feature_index = int(sub_tree["split"])
190 | except ValueError as typeerror:
191 | raise TypeError("Feature names must be integers") from typeerror
192 | feature_value = np.nextafter(np.single(sub_tree["split_condition"]), np.single(-np.inf))
193 | parent_id = m_b.add_split(tree_id=tree_id, feature_index=feature_index,
194 | feature_value=feature_value)
195 | # create queue
196 | yes_idx = sub_tree["yes"]
197 | no_idx = sub_tree["no"]
198 | mis_idx = sub_tree["missing"]
199 | if mis_eq_yes is None:
200 | if mis_idx == yes_idx:
201 | mis_eq_yes = True
202 | elif mis_idx == no_idx:
203 | mis_eq_yes = False
204 | else:
205 | raise TypeError(
206 | "Missing values are not supported in daal4py Gradient Boosting Trees")
207 | elif mis_eq_yes and mis_idx != yes_idx or not mis_eq_yes and mis_idx != no_idx:
208 | raise TypeError("Missing values are not supported in daal4py Gradient Boosting Trees")
209 | node_queue: Deque[Node] = deque() # pylint: disable=undefined-variable
210 | node_queue.append(Node(sub_tree["children"][0], parent_id, 0))
211 | node_queue.append(Node(sub_tree["children"][1], parent_id, 1))
212 | # bfs through it
213 | while node_queue:
214 | sub_tree = node_queue[0].tree
215 | parent_id = node_queue[0].parent_id
216 | position = node_queue[0].position
217 | node_queue.popleft()
218 | # current node is leaf
219 | if "leaf" in sub_tree:
220 | m_b.add_leaf(
221 | tree_id=tree_id, response=sub_tree["leaf"],
222 | parent_id=parent_id, position=position)
223 | continue
224 | # current node is split
225 | try:
226 | feature_index = int(sub_tree["split"])
227 | except ValueError as typeerror:
228 | raise TypeError("Feature names must be integers") from typeerror
229 | feature_value = np.nextafter(np.single(sub_tree["split_condition"]), np.single(-np.inf))
230 | parent_id = m_b.add_split(
231 | tree_id=tree_id, feature_index=feature_index, feature_value=feature_value,
232 | parent_id=parent_id, position=position)
233 | # append to queue
234 | yes_idx = sub_tree["yes"]
235 | no_idx = sub_tree["no"]
236 | mis_idx = sub_tree["missing"]
237 | if mis_eq_yes and mis_idx != yes_idx or not mis_eq_yes and mis_idx != no_idx:
238 | raise TypeError(
239 | "Missing values are not supported in daal4py Gradient Boosting Trees")
240 | node_queue.append(Node(sub_tree["children"][0], parent_id, 0))
241 | node_queue.append(Node(sub_tree["children"][1], parent_id, 1))
242 | return m_b.model()
243 |
244 |
245 | # Conversion to daal4py
246 | daal_conv_stime = time.time()
247 | daal_model = d4p.get_gbt_model_from_xgboost(xgb_model) # pylint: disable=no-member
248 | daal_conv_etime = time.time()
249 |
250 | daal_conv_total = daal_conv_etime - daal_conv_stime
251 | logger.info('DAAL conversion time (seconds): %f', daal_conv_total)
252 |
253 | # daal4py prediction
254 | daal_predict_algo = d4p.gbt_classification_prediction( # pylint: disable=no-member
255 | nClasses=params["num_class"],
256 | resultsToEvaluate="computeClassLabels",
257 | fptype='float'
258 | )
259 | daal_start_time = time.time()
260 | daal_prediction = daal_predict_algo.compute(X_test, daal_model)
261 | d4p_total = time.time() - daal_start_time
262 |
263 | daal_errors_count = np.count_nonzero(daal_prediction.prediction[:, 0] - np.ravel(y_test))
264 | #logger.info(daal_errors_count)
265 |
266 | logger.info('DAAL inference time (seconds): %f', d4p_total)
267 |
268 | logger.info("XGBoost errors count: %d", xgb_errors_count)
269 | xgb_acc = abs((xgb_errors_count / xgb_prediction.shape[0]) - 1)
270 | logger.info("XGBoost accuracy: %f", xgb_acc)
271 |
272 | logger.info("Daal4py errors count: %d", daal_errors_count)
273 | d4p_acc = abs((daal_errors_count / xgb_prediction.shape[0]) - 1)
274 | logger.info("Daal4py accuracy: %f", d4p_acc)
275 |
276 | logger.info("XGBoost Prediction Time: %f", xgb_total)
277 | logger.info("daal4py Prediction Time: %f", d4p_total)
278 |
279 | # Performance - Prediction Time
280 | rounded_xgb = round(xgb_total, 4)
281 | rounded_daal = round(d4p_total, 4)
282 |
283 | left = [1, 2]
284 | pred_times = [rounded_xgb, rounded_daal]
285 | tick_label = ['XGBoost Prediction', 'daal4py Prediction']
286 |
287 | # Performance - Prediction Time Benchmark
288 | left = [1]
289 | perf_bench = abs((d4p_total/xgb_total) - 1)
290 |
291 | tick_label = ['daal4py Prediction']
292 |
293 | logger.info("daal4py time improvement relative to XGBoost: %f", perf_bench)
294 |
295 | # Accuracy
296 | left = [1, 2]
297 | xgb_acc = abs((xgb_errors_count / xgb_prediction.shape[0]) - 1)
298 |
299 | d4p_acc = abs((daal_errors_count / xgb_prediction.shape[0]) - 1)
300 |
301 | pred_acc = [xgb_acc, d4p_acc]
302 | tick_label = ['XGBoost Prediction', 'daal4py Prediction']
303 |
304 | logger.info("Accuracy Difference %f", xgb_acc-d4p_acc)
305 |
--------------------------------------------------------------------------------
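Note: the essential convert-and-predict flow exercised by `daal_xgb_model.py`, distilled into a minimal, self-contained editorial sketch. The synthetic data and the booster parameters here are illustrative assumptions, not the kit's settings; the daal4py calls mirror the ones used in the script above.

```python
# Minimal illustration of converting a trained XGBoost booster to daal4py
# and running inference with it; data and parameters are synthetic.
import daal4py as d4p
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype(np.float32)
y = (X[:, 0] + rng.normal(scale=0.1, size=1000) > 0).astype(int)

booster = xgb.train({"objective": "binary:logistic", "max_depth": 4},
                    xgb.DMatrix(X, label=y), num_boost_round=20)

# Convert the booster and predict class labels with oneDAL.
daal_model = d4p.get_gbt_model_from_xgboost(booster)
predict_algo = d4p.gbt_classification_prediction(
    nClasses=2, resultsToEvaluate="computeClassLabels", fptype="float")
daal_pred = predict_algo.compute(X, daal_model).prediction[:, 0]

# Compare against native XGBoost predictions (probabilities thresholded at 0.5).
xgb_pred = (booster.predict(xgb.DMatrix(X)) > 0.5).astype(int)
print("label mismatches:", int(np.count_nonzero(daal_pred - xgb_pred)))
```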
/README.md:
--------------------------------------------------------------------------------
1 | PROJECT NOT UNDER ACTIVE MANAGEMENT
2 |
3 | This project will no longer be maintained by Intel.
4 |
5 | Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
6 |
7 | Intel no longer accepts patches to this project.
8 |
9 | If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
10 |
11 | Contact: webadmin@linux.intel.com
12 | # Predictive Asset Health Analytics
13 |
14 | ## Introduction
15 | Create an end-to-end predictive asset maintenance solution to predict defects and anomalies before they happen with XGBoost* from [Intel® oneAPI AI Analytics Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html) (oneAPI). Check out more workflow examples in the [Developer Catalog](https://developer.intel.com/aireferenceimplementations).
16 |
17 | ## **Table of Contents**
18 |
19 | - [Solution Technical Overview](#solution-technical-overview)
20 | - [Validated Hardware Details](#validated-hardware-details)
21 | - [How it Works](#how-it-works)
22 | - [Get Started](#get-started)
23 | - [Download the Workflow Repository](#download-the-workflow-repository)
24 | - [Set Up Conda](#set-up-conda)
25 | - [Set Up Environment](#set-up-environment)
26 | - [Ways to run this reference use case](#ways-to-run-this-reference-use-case)
27 | - [Run Using Bare Metal](#run-using-bare-metal)
28 | - [Run Using Jupyter Notebook](#run-using-jupyter-notebook)
29 | - [Expected Output](#expected-output)
30 | - [Summary and Next Steps](#summary-and-next-steps)
31 | - [Learn More](#learn-more)
32 | - [Support](#support)
33 | - [Appendix](#appendix)
34 |
35 |
36 | ## Solution Technical Overview
37 |
38 | Predictive asset maintenance is a method that uses data analysis tools to predict defects and anomalies before they happen. Solutions of huge scale typically require operating across multiple hardware architectures. Accelerating training for the ever-increasing size of datasets and machine learning models is a major challenge while adopting AI (Artificial Intelligence).
39 |
40 | In an industrial scenario, it is important to improve the MLOps (Machine Learning Operations) time for developing and deploying new models; this can be challenging due to the ever-increasing size of datasets over time. The XGBoost* classifier with the hist tree method addresses this problem by improving the overall training/tuning and validation time. A model serving large batch workloads requires fast prediction with low accuracy loss; daal4py helps the XGBoost* machine learning model meet these criteria.
41 |
42 | The solution contained in this repo uses the following Intel® packages:
43 |
44 | * ***Intel® Distribution for Python****
45 |
46 | The [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z) provides:
47 |
48 | * Scalable performance using all available CPU cores on laptops, desktops, and powerful servers
49 | * Support for the latest CPU instructions
50 | * Near-native performance through acceleration of core numerical and machine learning packages with libraries like the Intel® oneAPI Math Kernel Library (oneMKL) and Intel® oneAPI Data Analytics Library
51 | * Productivity tools for compiling Python code into optimized instructions
52 | * Essential Python bindings for easing integration of Intel® native tools with your Python* project
53 |
54 | * ***Intel® Distribution of Modin****
55 |
56 | Modin* is a drop-in replacement for pandas, enabling data scientists to scale to distributed DataFrame processing without having to change API code. [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html) adds optimizations to further accelerate processing on Intel hardware.
57 |
58 | For more details, visit [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z), [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html), the [Predictive Asset Health Analytics](https://github.com/oneapi-src/predictive-asset-health-analytics) GitHub repository, the [XGBoost* documentation webpage](https://xgboost.readthedocs.io/en/stable/) and the [daal4py documentation webpage](https://intelpython.github.io/daal4py/).
59 |
60 | ## Validated Hardware Details
61 |
62 | [Intel® oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html#gs.52tat6) is used to achieve quick results even when the data for a model is huge. It provides the capability to reuse code written in different languages so that hardware utilization is optimized to deliver these results.
63 |
64 | | Recommended Hardware
65 | | ----------------------------
66 | | CPU: Intel® 2nd Gen Xeon® Platinum 8280 CPU @ 2.70GHz or higher
67 | | RAM: 187 GB
68 | | Recommended Free Disk Space: 20 GB or more
69 |
70 | Code was tested on Ubuntu\* 22.04 LTS.
71 |
72 | ## How it Works
73 |
74 | This reference kit generates a dataset of a given row size for a predictive asset maintenance analytics use case and stores it in `.pkl` format. The data is split into two subsets: the first subset is used to train the XGBoost* model and the second is used to test the model's prediction capabilities.
75 |
76 | The below diagram presents the different stages that compose the end-to-end workflow.
77 |
78 | 
79 |
80 |
81 | ## Get Started
82 | Start by defining environment variables that will store the workspace, data, and output paths; these directories will be created in later steps and will be used for all the commands executed using absolute paths.
83 |
84 | [//]: # (capture: baremetal)
85 | ```bash
86 | export WORKSPACE=$PWD/predictive-health-analytics
87 | export DATA_DIR=$WORKSPACE/data
88 | export OUTPUT_DIR=$WORKSPACE/output
89 | ```
90 | ### Download the Workflow Repository
91 | Create a working directory for the workflow and clone the [Main
92 | Repository](https://github.com/oneapi-src/predictive-asset-health-analytics) repository into your working
93 | directory.
94 |
95 | [//]: # (capture: baremetal)
96 | ```bash
97 | mkdir -p $WORKSPACE && cd $WORKSPACE
98 | ```
99 |
100 | ```
101 | git clone https://github.com/oneapi-src/predictive-asset-health-analytics.git $WORKSPACE
102 | ```
103 |
104 | [//]: # (capture: baremetal)
105 | ```bash
106 | mkdir -p $DATA_DIR $OUTPUT_DIR/logs
107 | ```
108 | ### Set Up Conda
109 | To learn more, please visit [install anaconda on Linux](https://docs.anaconda.com/free/anaconda/install/linux/).
110 | ```bash
111 | wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
112 | bash Miniconda3-latest-Linux-x86_64.sh
113 | ```
114 |
115 | ### Set Up Environment
116 | The conda yaml dependencies are kept in `$WORKSPACE/env/intel_env.yml`.
117 |
118 | | **Packages required in YAML file:** | **Version:**
119 | | :--- | :--
120 | | `python` | 3.10
121 | | `intelpython3_full` | 2024.0.0
122 | | `modin-all` | 0.24.1
123 |
124 | Follow the next steps to set up the Intel® Distribution for Python* inside a conda environment:
125 | ```bash
126 | conda env create -f $WORKSPACE/env/intel_env.yml --no-default-packages
127 | ```
128 |
129 | Environment setup is required only once. This step does not clean up an existing environment with the same name, so make sure no conda environment with that name already exists. During this setup, a new conda environment will be created with the dependencies listed in the YAML configuration.
130 |
131 | Once the environment has been created in the previous step, activate it using the conda command given below:
132 | ```bash
133 | conda activate predictive_maintenance_intel
134 | ```
135 |
136 | ## Ways to run this reference use case
137 | You can execute the reference pipelines using the following environments:
138 | * Bare Metal
139 | * Jupyter Notebook
140 |
141 | ---
142 |
143 | ### Run Using Bare Metal
144 |
145 | #### Set Up System Software
146 | Our examples use the `conda` package and environment on your local computer. If you don't already have `conda` installed or the `conda` environment created, go to [Set Up Conda*](#set-up-conda) or see the [Conda* Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).
147 |
148 |
149 | #### Run Workflow
150 | The bash script below, located in ```$WORKSPACE```, needs to be executed to create the test dataset and train the model using pandas/modin.
151 | ```sh
152 | bash $WORKSPACE/run_dataset.sh
153 | ```
154 | | **Option** | **Values**
155 | | :-- | :--
156 | | Dataset Size | `25K to 10M`
157 | | Hyperparameter tuning | `notuning` - Training without hyperparameter tuning <br> `hyperparametertuning` - Training with hyperparameter tuning
158 | | Number of CPU cores | Based on the total number of cores available on the execution environment
159 |
160 | This stage invokes two Python scripts to generate the test dataset with the chosen size and to train the model with the selected data package. The data generation process will create a folder named after the active conda environment, where all dataset and log files will be captured. The dataset file will be saved in pickle format and reused in further test runs on this same environment for the same dataset size.
161 |
162 | An example option selection for pandas with a 1M dataset size is given below:
163 |
164 | ```
165 | 0. 25000
166 | 1. 50000
167 | 2. 100000
168 | 3. 200000
169 | 4. 400000
170 | 5. 800000
171 | 6. 1000000
172 | 7. 2000000
173 | 8. 4000000
174 | 9. 8000000
175 | 10. 10000000
176 | Select dataset size: 6
177 | 0. notuning
178 | 1. hyperparametertuning
179 | Select tuning option: 0
180 | Number of CPU cores to be used for the training: 8
181 | ```
182 |
183 | Log file will be generated in the below location:
184 | ```bash
185 | $OUTPUT_DIR/logs/logfile_pandas_<datasize>_<timestamp>.log
186 | $OUTPUT_DIR/logs/logfile_train_predict_<datasize>_<timestamp>.log
187 | ```
188 | Test data pickle file will be generated in the below location:
189 | ```bash
190 | $DATA_DIR/data_<datasize>.pkl
191 | ```
192 | Alternatively, the user can run the `generate_data_pandas.py` and `train_predict_pam.py` scripts, described below, instead of `run_dataset.sh`; running each Python script independently gives the user more options to experiment. `generate_data_pandas.py` creates the dataset, and `train_predict_pam.py` runs training and prediction with the previously generated dataset.
193 |
194 | The dataset generation script uses the following optional arguments:
195 |
196 | ```bash
197 | usage: src/generate_data_pandas.py [-h] [-s SIZE] [-f FILE]
198 |
199 | optional arguments:
200 | -h, --help show this help message and exit
201 | -s SIZE, --size SIZE data size which is number of rows
202 | -f FILE, --file FILE output pkl file name
203 | -d, --debug Changes logging level from INFO to DEBUG
204 | ```
205 |
206 | For example, the command below generates a dataset of 25K rows and saves the log file.
207 |
208 | [//]: # (capture: baremetal)
209 | ```bash
210 | export DATASIZE=25000
211 | export OF=$OUTPUT_DIR/logs/logfile_pandas_${DATASIZE}_$(date +%Y%m%d%H%M%S).log
212 | python $WORKSPACE/src/generate_data_pandas.py -s ${DATASIZE} -f $DATA_DIR/dataset_${DATASIZE}.pkl 2>&1 | tee $OF
213 | echo "Logfile saved: $OF"
214 | ```
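
For orientation, here is a heavily simplified sketch of the kind of rule-based labeling that the generator's log output (see Expected Output below) describes; the column names, ranges, and rules are assumptions inferred from those log lines, not the script's actual logic.

```python
# Simplified sketch of rule-based target generation similar to what the generator logs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 25_000
df = pd.DataFrame({
    "Age": rng.integers(0, 100, n),
    "Elevation": rng.integers(0, 3000, n),
    "Asset_Label": np.zeros(n, dtype=int),
})
# Correlate the target with the features, e.g. flag older or mid-elevation assets.
df.loc[df["Age"].between(60, 70) | (df["Age"] > 95), "Asset_Label"] = 1
df.loc[df["Elevation"].between(500, 1500), "Asset_Label"] = 1
df.to_pickle("dataset_25000.pkl")
```
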
215 | Training and prediction, along with hyperparameter tuning, can also be executed independently with the following arguments:
216 | ```bash
217 | usage: src/train_predict_pam.py [-h] [-f FILE] [-p PACKAGE] [-t TUNING] [-cv CROSS_VALIDATION] [-patch PATCH_SKLEARN]
218 | -ncpu NUM_CPU
219 |
220 | optional arguments:
221 | -h, --help show this help message and exit
222 | -f FILE, --file FILE input pkl file name
223 | -p PACKAGE, --package PACKAGE
224 | data package to be used (pandas, modin)
225 | -t TUNING, --tuning TUNING
226 | hyper parameter tuning (0/1)
227 | -cv CROSS_VALIDATION, --cross-validation CROSS_VALIDATION
228 | cross validation iteration
229 | -ncpu NUM_CPU, --num-cpu NUM_CPU
230 | number of cpu cores, default 4.
231 | -d, --debug
232 | changes logging level from INFO to DEBUG
233 | ```
234 | For example, the command below takes the 25K-row dataset pkl file generated in the previous example and performs training and prediction using the XGBoost* classifier algorithm.
235 |
236 | [//]: # (capture: baremetal)
237 | ```bash
238 | export PACKAGE="pandas"
239 | export TUNING=0
240 | export NCPU=20
241 | export CROSS_VAL=4
242 | export OF=$OUTPUT_DIR/logs/logfile_train_predict_${DATASIZE}_$(date +%Y%m%d%H%M%S).log
243 | python $WORKSPACE/src/train_predict_pam.py -f $DATA_DIR/dataset_${DATASIZE}.pkl -t $TUNING -ncpu $NCPU -p $PACKAGE -cv $CROSS_VAL 2>&1 | tee -a $OF
244 | echo "Logfile saved: $OF"
245 | ```
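
Conceptually, the `-t`, `-cv`, and `-ncpu` options map to the kind of XGBoost* training flow sketched below; synthetic data and the parameter grid are assumptions for illustration, and the script's actual estimator settings may differ.

```python
# Illustrative sketch of training with optional hyperparameter tuning.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=25_000, n_features=34, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

NCPU, TUNING, CV = 4, True, 4                       # analogous to -ncpu, -t, -cv
clf = xgb.XGBClassifier(n_jobs=NCPU, tree_method="hist")
if TUNING:
    grid = GridSearchCV(clf, {"max_depth": [4, 6, 8]}, cv=CV)
    clf = grid.fit(X_train, y_train).best_estimator_
else:
    clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```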
246 |
247 | #### XGBoost* with oneDAL Python Wrapper (daal4py) model
248 | To further improve prediction time for the trained XGBoost* machine learning model, it can be converted to a daal4py model. daal4py speeds up prediction with the trained XGBoost* model on the underlying hardware by utilizing the Intel® oneAPI Data Analytics Library (oneDAL).
249 |
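A minimal sketch of this XGBoost*-to-daal4py conversion pattern is shown below; the synthetic data and model settings are stand-ins for illustration, and `daal_xgb_model.py` may differ in the exact calls it makes.

```python
# Convert a trained XGBoost model to daal4py and run oneDAL inference with it.
import daal4py as d4p
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, n_features=34, random_state=42)
booster = xgb.XGBClassifier(n_estimators=50, max_depth=6).fit(X, y).get_booster()

daal_model = d4p.get_gbt_model_from_xgboost(booster)        # one-time conversion
predictor = d4p.gbt_classification_prediction(nClasses=2)   # oneDAL GBT inference kernel
labels = predictor.compute(X, daal_model).prediction        # same labels, faster prediction
print(labels[:5].ravel())
```
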
250 | The previously generated pkl file is used as input for this Python script.
251 | ```bash
252 | usage: src/daal_xgb_model.py [-h] [-f FILE]
253 |
254 | optional arguments:
255 | -h, --help show this help message and exit
256 | -f FILE, --file FILE input pkl file name
257 | -d, --debug changes logging level from INFO to DEBUG
258 | ```
259 | Run the following command to train the model with the given dataset, convert it to daal4py format, and measure the prediction time performance.
260 |
261 | [//]: # (capture: baremetal)
262 | ```bash
263 | python $WORKSPACE/src/daal_xgb_model.py -f $DATA_DIR/dataset_${DATASIZE}.pkl
264 | ```
265 | #### Clean Up Bare Metal
266 | Before proceeding with cleanup, it is strongly recommended to back up any data you want to keep. To clean the previously downloaded and generated data, run the following commands:
267 | ```bash
268 | conda deactivate # Run this line if predictive_maintenance_intel is active
269 | conda env remove -n predictive_maintenance_intel
270 | rm $OUTPUT_DIR $DATA_DIR $WORKSPACE -rf
271 | ```
272 |
273 | ---
274 | ### Run Using Jupyter Notebook
275 | Before continuing, complete the steps described in [Get Started](#get-started).
276 |
277 | #### Create and activate conda environment
278 | To be able to run `PredictiveMaintenance.ipynb`, a [conda environment](#set-up-environment) must be created and activated, and JupyterLab installed into it:
279 | ```bash
280 | conda activate predictive_maintenance_intel
281 | conda install -c intel -c conda-forge nb_conda_kernels jupyterlab -y
282 | ```
283 | Run the following commands inside the project root directory. The environment variables described in [Get Started](#get-started) must be set in the same terminal that will run Jupyter Notebook.
284 | ```bash
285 | cd $WORKSPACE
286 | jupyter lab
287 | ```
288 | Open Jupyter Notebook in a web browser, select `PredictiveMaintenance.ipynb`, and choose `conda env:predictive_maintenance_intel` as the Jupyter kernel. Now you can follow the notebook's instructions step by step.
289 |
290 | #### Clean Up Jupyter Notebook
291 | To clean up after running the Jupyter Notebook, follow the instructions described in [Clean Up Bare Metal](#clean-up-bare-metal).
292 |
293 | ## Expected Output
294 | A successful execution of `generate_data_pandas.py` should return results similar to those shown below:
295 |
296 | ```
297 | INFO:__main__:Generating data with the size 25000
298 | INFO:__main__:changing Tele_Attatched into an object variable
299 | INFO:__main__:Generating our target variable Asset_Label
300 | INFO:__main__:Creating correlation between our variables and our target variable
301 | INFO:__main__:When age is 60-70 and over 95 change Asset_Label to 1
302 | INFO:__main__:When elevation is between 500-1500 change Asset_Label to 1
303 | INFO:__main__:When Manufacturer is A, E, or H change Asset_Label to have 95% 0's
304 | INFO:__main__:When Species is C2 or C5 change Asset_Label to have 90% to 0's
305 | INFO:__main__:When District is NE or W change Asset_Label to have 90% to 0's
306 | INFO:__main__:When District is Untreated change Asset_Label to have 70% to 1's
307 | INFO:__main__:When Age is greater than 90 and Elevaation is less than 1200 and Original_treatment is Oil change Asset_Label to have 90% to 1's
308 | INFO:__main__:=====> Time taken 0.049012 secs for data generation for the size of (25000, 34)
309 | INFO:__main__:Saving the data to /localdisk/aagalleg/frameworks.ai.platform.sample-apps.predictive-health-analytics/predictive-health-analytics/data/dataset_25000.pkl ...
310 | INFO:__main__:DONE
311 | ```
312 |
313 | A successful execution of `train_predict_pam.py` should return results similar to those shown below:
314 |
315 | ```
316 | INFO:__main__:=====> Total Time:
317 | 6.791231 secs for data size (800000, 34)
318 | INFO:__main__:=====> Training Time 3.459683 secs
319 | INFO:__main__:=====> Prediction Time 0.281359 secs
320 | INFO:__main__:=====> XGBoost accuracy score 0.921640
321 | INFO:__main__:DONE
322 | ```
323 |
324 | A successful execution of `daal_xgb_model.py` should return results similar to those shown below:
325 |
326 | ```
327 | INFO:__main__:Reading the dataset from ./data/data_800000.pkl...
328 | INFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU
329 | INFO:root:sklearn.model_selection.train_test_split: running accelerated version on CPU
330 | INFO:__main__:XGBoost training time (seconds): 74.001453
331 | INFO:__main__:XGBoost inference time (seconds): 0.054897
332 | INFO:__main__:DAAL conversion time (seconds): 0.366412
333 | INFO:__main__:DAAL inference time (seconds): 0.017998
334 | INFO:__main__:XGBoost errors count: 15622
335 | INFO:__main__:XGBoost accuracy: 0.921890
336 | INFO:__main__:Daal4py errors count: 15622
337 | INFO:__main__:Daal4py accuracy: 0.921890
338 | INFO:__main__:XGBoost Prediction Time: 0.054897
339 | INFO:__main__:daal4py Prediction Time: 0.017998
340 | INFO:__main__:daal4py time improvement relative to XGBoost: 0.672158
341 | INFO:__main__:Accuracy Difference 0.000000
342 | ```
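
Two notes on reading this output: the `running accelerated version on CPU` lines appear to come from scikit-learn being patched with Intel® Extension for Scikit-learn* (daal4py's scikit-learn patching), and the reported time improvement is consistent with the relative reduction in prediction time. The snippet below is a standalone illustration of both points, not the script's actual code; how `daal_xgb_model.py` enables the patching may differ.

```python
# Patching scikit-learn so that supported calls dispatch to oneDAL-accelerated versions.
from sklearnex import patch_sklearn
patch_sklearn()                                   # must be called before sklearn imports

from sklearn.model_selection import train_test_split
train, test = train_test_split(list(range(10)), test_size=0.2, random_state=0)

# The reported "time improvement relative to XGBoost" matches the relative
# reduction in prediction time:
print(1 - 0.017998 / 0.054897)                    # ~0.672, matching the logged value
```
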
343 |
344 | ## Summary and Next Steps
345 |
346 | Large-scale predictive asset maintenance solutions typically require accelerated training and prediction on ever-increasing dataset sizes, without changing the existing computing resources, in order to remain feasible and economically attractive for utility customers. This reference kit implementation provides a performance-optimized guide for utility asset maintenance use cases that can easily be scaled across similar use cases.
347 |
348 |
349 | ## Learn More
350 | For more information about predictive asset maintenance or to read about other relevant workflow examples, see these guides and software resources:
351 |
352 | - [Intel® AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)
353 | - [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z)
354 | - [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html)
355 | - [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)
356 | - [Fast, Scalable and Easy Machine Learning With DAAL4PY](https://intelpython.github.io/daal4py/)
357 |
358 | ## Support
359 |
360 | The End-to-end Predictive Asset Health Analytics team tracks both bugs and
361 | enhancement requests using [GitHub
362 | issues](https://github.com/oneapi-src/predictive-asset-health-analytics/issues).
363 | Before submitting a suggestion or bug report, search the existing [GitHub
364 | issues](https://github.com/oneapi-src/predictive-asset-health-analytics/issues) to
365 | see if your issue has already been reported.
366 |
367 | ## Appendix
368 |
369 | \*Names and brands that may be claimed as the property of others. [Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html).
370 |
371 | ### Disclaimers
372 |
373 | To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.
374 |
375 | Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.
376 |
--------------------------------------------------------------------------------