# Data_reprocessing

1. [NLP](#nlp)
2. [Computer Vision](#computer-vision)

I have created this repository to cover the large majority of the preprocessing steps you will need for any dataset, with a brief explanation of each step and example code to demonstrate how it is done.

I would like to emphasize two important points:

- I welcome any feedback or suggestions, because my main goal is to assist and help.
- Some of the techniques included in the repository may not be necessary for certain tasks. For example, in sentiment analysis it may not be suitable to apply stop word removal, because it can remove important negations like "not" that carry significant meaning. It is crucial to understand the data and its characteristics before applying any preprocessing technique.

This README provides an overview of common data preprocessing steps using Python. Each step is accompanied by an example code snippet.

## NLP
## Steps

1. **Data Collection**:
   Data collection is the systematic process of gathering, organizing, and analyzing information to gain insights and make informed decisions.
   - Using pandas to read data from CSV files:
     ```python
     import pandas as pd
     data = pd.read_csv('data.csv')
     ```

2. **Data Cleaning**:
   Data cleaning refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies from a dataset to ensure its quality and reliability for analysis and decision-making purposes.
   - Handling missing values using pandas:
     ```python
     data.dropna()        # Remove instances with missing values
     data.fillna(value)   # Impute missing values with a specific value
     ```

3. **Data Integration**:
   Data integration is the process of combining and merging data from multiple sources or systems into a unified and cohesive format, enabling comprehensive analysis and a holistic view of the information.
   - Merging datasets using pandas:
     ```python
     merged_data = pd.concat([data1, data2], axis=1)        # Concatenate horizontally
     merged_data = pd.merge(data1, data2, on='key_column')  # Merge based on a common column
     ```

4. **Data Transformation**:
   Data transformation refers to the process of converting or altering data from its original format or structure into a standardized or desired format, allowing for improved compatibility, analysis, and usability.
   - Normalizing data using scikit-learn:
     ```python
     from sklearn.preprocessing import MinMaxScaler
     scaler = MinMaxScaler()
     normalized_data = scaler.fit_transform(data)
     ```

5. **Feature Selection/Extraction**:
   Feature selection/extraction is the process of identifying and selecting the most relevant and informative features from a dataset, or creating new features, in order to improve the performance and efficiency of machine learning models and reduce dimensionality.
   - Selecting top-K features based on feature importance using scikit-learn:
     ```python
     from sklearn.feature_selection import SelectKBest, f_regression
     selector = SelectKBest(score_func=f_regression, k=5)
     selected_features = selector.fit_transform(data, target)
     ```

6. **Handling Categorical Data**:
   Handling categorical data involves converting categorical variables into a numerical representation, such as one-hot encoding or label encoding, to enable their effective utilization in machine learning algorithms and statistical analyses.
   - One-hot encoding categorical variables using pandas:
     ```python
     encoded_data = pd.get_dummies(data, columns=['categorical_column'])
     ```

7. **Handling Text Data**:
   Handling text data involves preprocessing and transforming textual information into a numerical representation, commonly through techniques such as tokenization, stemming or lemmatization, removing stop words, and applying methods like TF-IDF or word embeddings, to facilitate natural language processing tasks like text classification, sentiment analysis, or information retrieval.
   - Text preprocessing using the NLTK library:
     ```python
     import nltk
     from nltk.corpus import stopwords
     from nltk.tokenize import word_tokenize
     nltk.download('stopwords')
     nltk.download('punkt')

     stop_words = set(stopwords.words('english'))
     preprocessed_text = []

     for text in data['text_column']:
         tokens = word_tokenize(text)
         filtered_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
         preprocessed_text.append(filtered_tokens)
     ```
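   The description above also mentions TF-IDF; as a complement to the tokenization snippet, here is a minimal sketch of turning the same (hypothetical) `text_column` into TF-IDF features with scikit-learn:
     ```python
     from sklearn.feature_extraction.text import TfidfVectorizer

     vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
     tfidf_features = vectorizer.fit_transform(data['text_column'])  # sparse matrix: documents x terms
     ```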
8. **Dimensionality Reduction**:
   Dimensionality reduction is a technique used to reduce the number of features or variables in a dataset while retaining the most relevant information, typically through methods like principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE), which can help alleviate the curse of dimensionality and improve computational efficiency in data analysis and machine learning tasks.
   - Applying Principal Component Analysis (PCA) using scikit-learn:
     ```python
     from sklearn.decomposition import PCA
     pca = PCA(n_components=2)
     reduced_data = pca.fit_transform(data)
     ```

9. **Splitting the Dataset**:
   Splitting the dataset refers to dividing the available data into separate subsets, typically into training, validation, and testing sets, to evaluate and validate the performance of a machine learning model, ensuring generalizability and avoiding overfitting by using distinct data for training, evaluation, and final testing.
   - Dividing the preprocessed dataset into training, validation, and testing sets:
     ```python
     from sklearn.model_selection import train_test_split
     X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.2, random_state=42)
     ```

10. **Data Sampling**:
    Data sampling is the process of selecting a subset of data points from a larger dataset in order to gain insights, make inferences, or build models on a representative portion of the data, often using techniques such as random sampling, stratified sampling, or oversampling/undersampling to address class imbalance or specific sampling requirements.
    - Selecting a subset of the data using random sampling:
      ```python
      sampled_data = data.sample(n=100, random_state=42)
      ```

11. **Data Visualization**:
    Data visualization is the graphical representation of data using charts, graphs, or other visual elements to effectively communicate patterns, trends, and relationships within the data, making it easier for humans to understand and interpret complex information.
    - Plotting data using matplotlib:
      ```python
      import matplotlib.pyplot as plt
      plt.scatter(data['x'], data['y'])
      plt.xlabel('X')
      plt.ylabel('Y')
      plt.show()
      ```

12. **Data Auditing**:
    Data auditing involves the systematic examination and evaluation of data to ensure its accuracy, completeness, consistency, and adherence to predefined standards or rules, often performed through data profiling, validation checks, and data quality assessments to identify and address any data anomalies or issues.
    - Checking data accuracy and completeness:
      ```python
      data.describe()       # Summary statistics
      data.isnull().sum()   # Count missing values per column
      ```
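    A slightly fuller audit sketch along the same lines, assuming `data` is a pandas DataFrame (the checks shown are illustrative rather than exhaustive):
      ```python
      import pandas as pd

      print(data.describe(include='all'))  # summary statistics for every column, not just numeric ones
      print(data.isnull().sum())           # missing values per column
      print(data.duplicated().sum())       # number of exact duplicate rows
      print(data.dtypes)                   # column types, to catch mis-parsed fields
      ```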
13. **Data Documentation**:
    Data documentation refers to the process of creating comprehensive and detailed documentation that describes various aspects of a dataset, including its structure, variables, meanings, data sources, data collection methods, data transformations, and any other relevant information, to facilitate understanding, reproducibility, and proper usage of the data by others.
    - Create documentation that describes the data, including its sources, format, and limitations:
      ```markdown
      ## Dataset Description

      - **Source:** [Provide the source of the dataset]
      - **Format:** [Describe the format of the dataset]
      - **Limitations:** [Highlight any limitations or known issues with the dataset]

      [Provide additional information or instructions for other researchers using the data]
      ```

14. **Outlier Detection and Handling**:
    Outlier detection is the process of identifying data points that significantly deviate from the normal patterns or behavior of a dataset. Outliers can be detected using statistical methods, such as the z-score or the interquartile range, or using machine learning algorithms designed for anomaly detection. Once outliers are detected, they can be handled by removing them from the dataset, replacing them with more representative values, or treating them separately in the analysis, depending on the specific context and goals of the data analysis.
    - Univariate outlier detection using the z-score:
      The first stage detects outliers in each variable individually, without considering the relationships between variables. The z-score is a widely used statistical method for this purpose. It measures how many standard deviations a data point deviates from the mean of the variable. A commonly used threshold is a z-score of 3, which treats any data point beyond three standard deviations as an outlier.

    - Identify and handle outliers in the data:
      ```python
      import numpy as np
      from scipy import stats
      z_scores = stats.zscore(data)
      outliers = (np.abs(z_scores) > 3).any(axis=1)
      cleaned_data = data[~outliers]
      ```
    - Multivariate outlier detection using PCA (Principal Component Analysis):
      The second stage considers the relationships between variables and identifies outliers based on their collective behavior. PCA is a dimensionality reduction technique that can also be used for outlier detection. By transforming the data into a new set of uncorrelated variables (principal components), PCA can help identify outliers that deviate significantly from the overall patterns observed in the dataset. The outliers detected at this stage may capture more complex interactions and dependencies between variables.
      ```python
      import numpy as np
      from sklearn.decomposition import PCA
      # Assume 'data' is your cleaned_data from the previous stage (univariate outlier removal)
      # Fit PCA and project the data onto the principal components
      pca = PCA()
      scores = pca.fit_transform(data)
      # Mahalanobis distance in PCA space: scale each component score by its standard deviation
      mahalanobis_dist = np.sqrt(((scores / np.sqrt(pca.explained_variance_)) ** 2).sum(axis=1))
      # Set a threshold for outlier detection (e.g., 3 standard deviations above the mean distance)
      threshold = mahalanobis_dist.mean() + 3 * mahalanobis_dist.std()
      # Identify outliers whose distance exceeds the threshold
      outliers = np.where(mahalanobis_dist > threshold)[0]
      # Print the indices of the outlier data points
      print("Indices of outliers:", outliers)
      ```

15. **Imbalanced Data Handling**:

    Imbalanced data handling refers to addressing the issue of imbalanced class distribution in a dataset, where one class has significantly more or fewer instances than the others.
Techniques for handling imbalanced data include resampling methods such as oversampling (increasing the minority class samples) or undersampling (reducing the majority class samples), using different performance metrics like F1-score or area under the receiver operating characteristic curve (AUC-ROC) to evaluate model performance, applying algorithmic approaches like cost-sensitive learning or ensemble methods, or utilizing synthetic data generation techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to balance the class distribution and improve the performance of machine learning models 174 | - Address class imbalance issues in the dataset: 175 | ```python 176 | from imblearn.over_sampling import SMOTE 177 | smote = SMOTE(random_state=42) 178 | balanced_data, balanced_labels = smote.fit_resample(data, labels) 179 | ``` 180 | 181 | 16. **Feature Scaling**: 182 | Feature scaling is the process of transforming numerical features in a dataset to a common scale or range to ensure that they have comparable magnitudes and do not disproportionately influence the learning algorithm. Common techniques for feature scaling include standardization (subtracting the mean and dividing by the standard deviation) or normalization (scaling values to a specific range, such as 0 to 1), which help improve the convergence and performance of machine learning models, particularly those based on distance or gradient-based optimization algorithms. 183 | - Scale numerical features to a similar range or distribution: 184 | ```python 185 | from sklearn.preprocessing import MinMaxScaler 186 | scaler = MinMaxScaler() 187 | scaled_data = scaler.fit_transform(data) 188 | ``` 189 | 190 | 17. **Handling Time-Series Data**: 191 | Handling time-series data involves analyzing and modeling data points that are collected at successive time intervals. Some common techniques for handling time-series data include: 192 | 193 | Time-series decomposition: Separating the data into its trend, seasonality, and residual components to better understand and model the underlying patterns. 194 | - Smoothing techniques: Applying moving averages or exponential smoothing methods to reduce noise and identify long-term trends. 195 | - Feature engineering: Creating additional features such as lagged variables or rolling statistics to capture temporal dependencies and improve predictive modeling. 196 | - Time-series forecasting: Utilizing techniques like autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA), or machine learning algorithms such as recurrent neural networks (RNNs) or - - long short-term memory (LSTM) networks for predicting future values based on historical patterns. 197 | - Handling irregular time intervals: If the time-series data has irregular intervals, interpolation or resampling methods can be employed to align the data to a regular time grid. 198 | - Visualization: Plotting time-series data using line charts, scatter plots, or heatmaps to identify trends, seasonality, anomalies, and relationships between variables. 199 | - Time-series evaluation: Assessing the performance of time-series models using metrics like mean absolute error (MAE), root mean squared error (RMSE), or forecasting accuracy measures like mean absolute percentage error (MAPE). 
200 | - Preprocess time-series data by handling irregularities, missing values, and aligning time steps: 201 | ```python 202 | import pandas as pd 203 | df = pd.read_csv('time_series_data.csv', parse_dates=['timestamp']) 204 | df = df.set_index('timestamp') 205 | df = df.resample('D').mean() 206 | ``` 207 | 208 | 18. **Handling Noisy Data**: 209 | Handling noisy data involves addressing the presence of unwanted or irrelevant variations, errors, or outliers in a dataset. Here are some approaches for handling noisy data: 210 | 211 | - Data cleansing: Applying techniques like outlier detection and removal, error correction, or imputation to mitigate the impact of noise on the dataset. 212 | - Smoothing techniques: Employing filters or averaging methods such as moving averages, median filters, or low-pass filters to reduce random fluctuations and smooth out noisy signals. 213 | - Robust statistics: Utilizing statistical methods that are less sensitive to outliers, such as robust estimators (e.g., median instead of mean) or robust regression techniques like RANSAC (Random Sample Consensus). 214 | - Feature selection: Identifying and selecting the most informative and robust features that are less affected by noise to improve the performance of machine learning models. 215 | - Ensemble methods: Utilizing ensemble techniques like bagging or boosting that combine multiple models to reduce the impact of noise and enhance overall performance. 216 | - Data augmentation: Generating additional synthetic data points based on existing data by applying transformations, perturbations, or adding noise within reasonable bounds to increase the robustness of the model. 217 | - Model-based approaches: Employing specific models designed to handle noisy data, such as robust regression models, noise-tolerant clustering algorithms, or outlier detection algorithms. 218 | - Domain knowledge: Leveraging expert knowledge or domain-specific insights to identify and handle noise appropriately, such as using known constraints or physical limitations to filter out unrealistic data points. 219 | - Identify and handle noisy data in the dataset: 220 | ```python 221 | from scipy.signal import medfilt 222 | filtered_data = medfilt(data, kernel_size=3) 223 | ``` 224 | 225 | 19. **Handling Skewed Data**: 226 | 227 | Handling skewed data involves addressing the issue of imbalanced distribution or skewness in the target variable or predictor variables. Here are some approaches for handling skewed data: 228 | 229 | - Logarithmic transformation: Applying logarithmic transformation (e.g., taking the logarithm of the values) to reduce the impact of extreme values and compress the range of skewed variables. 230 | - Power transformation: Using power transformations like Box-Cox or Yeo-Johnson to achieve a more symmetric distribution and reduce skewness. 231 | - Winsorization: Replacing extreme values with less extreme values, often by capping or truncating the outliers to a certain percentile of the distribution. 232 | - Binning or discretization: Grouping continuous variables into bins or discrete categories to reduce the impact of extreme values and create more balanced distributions. 233 | - Data augmentation: Generating synthetic data points, particularly for the minority or skewed class, through techniques like oversampling or SMOTE to balance the class distribution and provide more representative samples. 
234 | - Weighted sampling or cost-sensitive learning: Assigning higher weights to underrepresented or minority class samples during model training to give them more importance and address the imbalance issue. 235 | - Ensemble methods: Employing ensemble techniques like bagging or boosting that can handle imbalanced data by combining multiple models or adjusting class weights to improve classification performance. 236 | - Resampling techniques: Using undersampling (reducing the majority class samples) or oversampling (increasing the minority class samples) methods to balance the class distribution and mitigate the impact of skewness. 237 | - Algorithm selection: Choosing algorithms that are inherently robust to class imbalance or skewed data, such as decision trees, random forests, or support vector machines with appropriate class weights or sampling techniques. 238 | - Address skewed data distributions: 239 | ```python 240 | import numpy as np 241 | log_transformed_data = np.log(data) 242 | ``` 243 | 244 | 20. **Handling Duplicate Data**: 245 | Handling duplicate data involves identifying and managing instances in a dataset that are identical or nearly identical to one another. Here are some approaches for handling duplicate data: 246 | 247 | - Identifying duplicates: Conducting a thorough analysis to identify duplicate records based on key attributes or a combination of attributes that define uniqueness in the dataset. 248 | - Removing exact duplicates: Removing instances that are exact duplicates, where all attributes have identical values, to ensure data integrity and avoid redundancy. 249 | - Fuzzy matching: Using fuzzy matching algorithms or similarity measures to identify approximate duplicates that may have slight variations or inconsistencies in the attribute values. 250 | - Deduplication based on business rules: Applying domain-specific business rules or logical conditions to identify and remove duplicates that meet certain criteria or conditions. 251 | - Key attribute selection: Choosing a subset of key attributes that uniquely define each instance and comparing records based on those attributes to identify duplicates. 252 | - Record merging: If duplicates are identified, merging or consolidating the duplicate records into a single representative record by combining or aggregating the relevant information. 253 | - Duplicate tracking: Maintaining a separate identifier or flag to track and manage duplicates, allowing for traceability and auditability of the data cleaning process. 254 | - Prevention strategies: Implementing data validation rules, unique constraints, or duplicate prevention mechanisms at the data entry stage to minimize the occurrence of duplicate data. 255 | - Identify and remove duplicate instances from the dataset: 256 | ```python 257 | deduplicated_data = data.drop_duplicates() 258 | ``` 259 | 260 | 21. **Feature Engineering**: 261 | Feature engineering is the process of creating new, informative, and representative features from existing data to enhance the performance and predictive power of machine learning models. 262 | - Create new features from existing ones or domain knowledge: 263 | ```python 264 | data['new_feature'] = data['feature1'] + data['feature2'] 265 | ``` 266 | 267 | 22. **Handling Missing Data**: 268 | Handling missing data involves strategies and techniques to address the presence of missing values in a dataset. 
Common approaches for handling missing data include deletion of missing values, imputation (filling in missing values with estimated or imputed values), or using advanced techniques such as multiple imputation or modeling-based imputation to retain the integrity and completeness of the dataset during analysis or modeling tasks. 269 | - Handle missing values by imputing them: 270 | ```python 271 | from sklearn.impute import SimpleImputer 272 | imputer = SimpleImputer(strategy='mean') 273 | imputed_data = imputer.fit_transform(data) 274 | ``` 275 | 276 | 23. **Data Normalization**: 277 | Data normalization, also known as data standardization, is the process of rescaling or transforming numerical data to a common scale or range, typically between 0 and 1 or with a mean of 0 and a standard deviation of 1, to ensure that different variables have comparable magnitudes and distributions. It helps to prevent certain variables from dominating the analysis or modeling process due to their larger scales and facilitates better interpretation, convergence, and performance of machine learning algorithms. 278 | - Normalize the data to a standard scale or range: 279 | ```python 280 | from sklearn.preprocessing import StandardScaler 281 | scaler = StandardScaler() 282 | normalized_data = scaler.fit_transform(data) 283 | ``` 284 | 285 | 24. **Addressing Data Privacy and Security**: 286 | Addressing data privacy and security involves implementing measures to protect sensitive data from unauthorized access, ensuring compliance with privacy regulations, and safeguarding against potential threats or breaches. 287 | - Implement techniques to protect sensitive information and ensure data privacy and security: 288 | - Encrypt sensitive data 289 | - Apply access controls and permissions 290 | - Anonymize or de-identify personal information 291 | 292 | 293 | 25. **Handling Multicollinearity**: 294 | Handling multicollinearity refers to addressing the issue of high correlation or interdependency between predictor variables in a regression or modeling context by applying techniques such as feature selection, variable transformation, or using advanced methods like principal component analysis (PCA) or ridge regression to mitigate the negative impact of multicollinearity on the model's interpretability and stability. 295 | - Identify and handle multicollinearity among predictor variables: 296 | ```python 297 | from statsmodels.stats.outliers_influence import variance_inflation_factor 298 | vif = pd.DataFrame() 299 | vif["Feature"] = X.columns 300 | vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] 301 | ``` 302 | 303 | 26. **Handling Seasonality and Trend**: 304 | Handling seasonality and trend involves identifying and modeling the repetitive patterns and long-term directional movements in time series data to understand their impact and make accurate predictions or forecasts. 305 | - Handle seasonality and trend components in time-series data: 306 | ```python 307 | from statsmodels.tsa.seasonal import seasonal_decompose 308 | decomposition = seasonal_decompose(data, model='additive', period=12) 309 | ``` 310 | 311 | 27. **Handling Skewed Target Variables**: 312 | 313 | Handling skewed target variables involves addressing the issue of imbalanced or skewed distributions in the outcome variable of a predictive modeling task. 
Common approaches for handling skewed target variables include log-transformations, using appropriate evaluation metrics (e.g., mean absolute error or area under the receiver operating characteristic curve) to assess model performance, applying algorithms designed for imbalanced data (e.g., cost-sensitive learning or ensemble methods), or employing resampling techniques like oversampling or undersampling to balance the class distribution and improve the performance of machine learning models. 314 | - Apply transformations to make the target variable more symmetric: 315 | ```python 316 | import numpy as np 317 | transformed_target = np.log1p(target) 318 | ``` 319 | 320 | 28. **Data Partitioning for Cross-Validation**: 321 | Data partitioning for cross-validation involves splitting a dataset into training and validation subsets, allowing for iterative model training and evaluation to assess its generalization performance and mitigate overfitting. 322 | - Divide the dataset into multiple folds for cross-validation: 323 | ```python 324 | from sklearn.model_selection import KFold 325 | kf = KFold(n_splits=5, shuffle=True, random_state=42) 326 | for train_index, val_index in kf.split(X): 327 | X_train, X_val = X[train_index], X[val_index] 328 | y_train, y_val = y[train_index], y[val_index] 329 | ``` 330 | 331 | 29. **Handling Sparse Data**: 332 | Handling sparse data involves managing datasets where the majority of values are zeros or missing, often through techniques such as feature selection, data imputation, or sparse matrix representations, to effectively utilize and analyze the available information. 333 | - Handle sparse datasets using techniques like sparse matrix representation or dimensionality reduction: 334 | ```python 335 | from scipy.sparse import csr_matrix 336 | sparse_matrix = csr_matrix(data) 337 | ``` 338 | 339 | 30. **Handling Time Delays**: 340 | Handling time delays refers to addressing the temporal relationship between variables in a time series or sequential data analysis, taking into account the lagged effects or dependencies over different time periods by incorporating lagged variables, time shifting, or using time series forecasting models to capture and account for the time delay in the data. 341 | - Account for time delays or lags in time-series analysis: 342 | ```python 343 | import pandas as pd 344 | df['lag_1'] = df['target'].shift(1) 345 | ``` 346 | 347 | 31. **Handling Non-Numeric Data**: 348 | 349 | Handling non-numeric data involves converting or transforming categorical or qualitative data into a numerical representation that can be processed by machine learning algorithms, typically through techniques such as one-hot encoding, label encoding, or embedding methods. 350 | - Preprocess non-numeric data such as categorical variables or text data: 351 | ```python 352 | from sklearn.preprocessing import OneHotEncoder 353 | encoder = OneHotEncoder() 354 | encoded_data = encoder.fit_transform(data) 355 | ``` 356 | 357 | 32. **Handling Incomplete Data**: 358 | Handling incomplete data involves addressing the issue of missing or partially available values in a dataset by applying techniques such as data imputation, deletion of missing values, or using advanced methods like multiple imputation or modeling-based imputation to handle missing data and retain the integrity and usefulness of the dataset for analysis or modeling tasks. 
359 | - Handle incomplete or missing records in the dataset: 360 | ```python 361 | from sklearn.impute import SimpleImputer 362 | imputer = SimpleImputer(strategy='mean') 363 | imputed_data = imputer.fit_transform(data) 364 | ``` 365 | 366 | 33. **Handling Long-Tailed Distributions**: 367 | Handling long-tailed distributions involves addressing the presence of imbalanced or heavily skewed data distributions, typically characterized by a large number of infrequent occurrences or outliers, by applying techniques such as resampling methods (e.g., oversampling or undersampling), data augmentation, using appropriate evaluation metrics (e.g., precision-recall curve), or applying specialized algorithms designed to handle imbalanced data to improve the model's performance and mitigate the impact of the long tail. 368 | - Normalize distributions with long tails using techniques like log-transformations or power-law transformations: 369 | ```python 370 | transformed_data = np.log1p(data) 371 | ``` 372 | 373 | 34. **Data Discretization**: 374 | 375 | Data discretization, also known as binning, is the process of transforming continuous or numerical data into discrete intervals or categories. This can be achieved through various techniques such as equal-width binning (dividing the data into bins of equal width), equal-frequency binning (dividing the data into bins with an equal number of data points), or more advanced methods like clustering-based binning or decision tree-based discretization. Discretization can help simplify data analysis, reduce the impact of outliers, and enable the use of algorithms that require categorical or ordinal data. 376 | - Convert continuous variables into categorical or ordinal variables through data discretization: 377 | 378 | ```python 379 | from sklearn.preprocessing import KBinsDiscretizer 380 | discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') 381 | discretized_data = discretizer.fit_transform(data) 382 | ``` 383 | 384 | 35. **Handling Data Dependencies**: 385 | Handling data dependencies involves addressing the relationships or dependencies between variables in a dataset to ensure accurate modeling and analysis. This can be done through various techniques, such as feature engineering to create new derived features that capture the dependencies, applying dimensionality reduction techniques to eliminate redundant or highly correlated variables, using specialized models or algorithms that explicitly handle dependencies (e.g., Bayesian networks or Markov models), or incorporating time series analysis methods to capture temporal dependencies in sequential data. Effective handling of data dependencies helps to improve the interpretability, predictive accuracy, and generalizability of the models. 
   - Consider and handle dependencies or relationships between different observations or instances in the dataset:
     ```python
     # Example for time-series analysis
     df['lag_1'] = df['target'].shift(1)
     df['lag_2'] = df['target'].shift(2)
     ```

# Computer-Vision

## Steps

In this section we will walk through the basic steps of image processing. The main purpose of image processing is to make the image clearer and easier for a machine to work with.

![](.images/2023-05-24_16-50.png)

Image processing is mainly divided into two types:
**Spatial domain:** we work on the image itself, i.e. we change the pixels of the image directly using kernels.
**Frequency domain:** we transform the image into the frequency domain using the Fourier transform, modify the frequencies, and then apply the inverse Fourier transform to get the image back.

1. **Read The Image**:
   Reading the image means loading it into the computer's memory so we can process it.
   - Read an image from a file using OpenCV:
     ```python
     import cv2
     image = cv2.imread('image.jpg')
     ```
   - Read an image from a file using PIL:
     ```python
     from PIL import Image
     image = Image.open('image.jpg')
     ```
   - Use matplotlib to read and show the image:
     ```python
     import matplotlib.pyplot as plt
     image = plt.imread('image.jpg')  # matplotlib can also read images
     plt.imshow(image)
     plt.show()
     ```
   - When you read an image with OpenCV it is in BGR order, so you need to convert it to RGB:
     ```python
     image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
     ```
   ![](.images/BGRvsRGB.png)

2. **Resize The Image**:
   Resizing the image means changing its size to make it smaller or bigger, or changing its aspect ratio.
   - Resize an image to a specific width and height:
     ```python
     image = cv2.resize(image, (width, height))
     ```
   - Resize an image to a target width while maintaining the aspect ratio:
     ```python
     scale = width / image.shape[1]                  # target width divided by original width
     new_size = (width, int(image.shape[0] * scale))
     image = cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)  # INTER_AREA works well for shrinking
     ```
   - Resize an image while maintaining the aspect ratio and ensuring it fits within the specified dimensions:
     ```python
     scale = min(width / image.shape[1], height / image.shape[0])
     new_size = (int(image.shape[1] * scale), int(image.shape[0] * scale))
     image = cv2.resize(image, new_size, interpolation=cv2.INTER_AREA)
     ```
   ![](.images/resize.png)
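   If a model needs an exact input size, a common follow-up (not covered in the snippets above) is to pad the aspect-ratio-preserving resize up to the target dimensions with `cv2.copyMakeBorder`; a minimal sketch, assuming `width` and `height` are the target size and `image` is the scaled image from the previous step:
     ```python
     import cv2

     pad_w = width - image.shape[1]    # remaining pixels along x
     pad_h = height - image.shape[0]   # remaining pixels along y
     image = cv2.copyMakeBorder(image,
                                pad_h // 2, pad_h - pad_h // 2,   # top, bottom
                                pad_w // 2, pad_w - pad_w // 2,   # left, right
                                cv2.BORDER_CONSTANT, value=(0, 0, 0))
     ```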
3. **Grayscale Conversion**:
   Grayscale conversion refers to converting an image from color to grayscale, which is a single-channel image containing only shades of gray. This can be done by applying a grayscale conversion formula or by using a built-in function in a library like OpenCV.
   - Convert an image from color to grayscale using a formula:
     ```python
     # assumes the channels are in RGB order
     image = 0.299 * image[:, :, 0] + 0.587 * image[:, :, 1] + 0.114 * image[:, :, 2]
     ```
   - The weights in the formula reflect how sensitive the human eye is to each color: red contributes 0.299, green 0.587, and blue 0.114. **An image is a matrix of pixels with shape (height, width, channels).**
   - Convert an image from color to grayscale using OpenCV:
     ```python
     image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
     ```
   - Convert an image from color to grayscale using PIL:
     ```python
     image = image.convert('L')
     ```
   ![](.images/gray.png)

4. **Binary Thresholding**:
   Binary thresholding refers to converting an image from grayscale to binary by applying a threshold value to each pixel in the image. This can be done by applying a thresholding formula or by using a built-in function in a library like OpenCV.
   - Convert an image from grayscale to binary using a formula:
     ```python
     image = (image > threshold).astype('uint8') * 255
     ```
   - Convert an image from grayscale to binary using OpenCV:
     ```python
     _, image = cv2.threshold(image, threshold, 255, cv2.THRESH_BINARY)
     ```
   ![](.images/binary.png)

5. **Smoothing**:
   Smoothing refers to reducing the noise in an image. There are many ways to do this, such as a Gaussian blur, a median blur, or a bilateral filter.
   For example, we will use the **median blur**, a non-linear filter that replaces each pixel in the image with the median value of its neighboring pixels. This can be done with SciPy or with a built-in function in a library like OpenCV.

   - Smooth an image using SciPy's median filter:
     ```python
     from scipy.ndimage import median_filter
     image = median_filter(image, size=kernel_size)
     ```
   - Smooth an image using OpenCV:
     ```python
     image = cv2.medianBlur(image, kernel_size)  # kernel_size must be an odd number, e.g. 3, 5, 7, 9, ...

     image = cv2.GaussianBlur(image, (5, 5), 0)  # Gaussian blur with a 5x5 kernel, for example
     ```
   ![](.images/noisy.png)
   ![](.images/smooth.png)

6. **Edge Detection**:
   Edge detection refers to detecting the edges in an image. We can do this with a Sobel filter, a Laplacian filter, or Canny edge detection.
   For example, we will use **Canny edge detection**, a multi-stage algorithm that detects a wide range of edges in images, available as a built-in function in OpenCV.

   - Detect edges in an image using OpenCV's Canny detector:
     ```python
     image = cv2.Canny(image, threshold1, threshold2)
     ```
   ![](.images/edge.png)
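   Since the Sobel filter is mentioned above but not shown, here is a minimal sketch of edge detection with OpenCV's Sobel operator (gradient magnitude from the x and y derivatives); it assumes `image` is a BGR image as read by `cv2.imread`:
     ```python
     import cv2

     gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)       # Sobel is usually applied to grayscale
     grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # horizontal derivative
     grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # vertical derivative
     magnitude = cv2.magnitude(grad_x, grad_y)            # gradient magnitude
     edges = cv2.convertScaleAbs(magnitude)               # back to 8-bit for display
     ```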
7. **Image Segmentation**:
   Image segmentation refers to dividing an image into multiple segments. This can be done by applying a segmentation technique.
   For example, we will use **K-means clustering**, a clustering algorithm that divides the pixels of an image into clusters.

   - Segment an image using k-means clustering in OpenCV:
     ```python
     # reshape the image to a 2D array of pixels and 3 color values (RGB)
     pixel_vals = image.reshape((-1, 3))
     # convert to np.float32
     pixel_vals = np.float32(pixel_vals)
     # define criteria, number of clusters (K) and apply kmeans()
     criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
     _, labels, centers = cv2.kmeans(pixel_vals, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
     # convert back to 8 bit values
     centers = np.uint8(centers)
     # flatten the labels array
     labels = labels.flatten()
     # convert all pixels to the color of the centroids
     segmented_image = centers[labels]
     # reshape back to the original image dimensions
     segmented_image = segmented_image.reshape(image.shape)
     ```
   ![](.images/segment.png)

   **I used this code to segment**
   ```python
   # image segmentation using k-means clustering with a loop to find the best k
   # read the image using opencv
   img = cv2.imread('/kaggle/input/intel-image-classification/seg_test/seg_test/sea/20077.jpg')
   img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

   for k in range(2, 6):
       # reshape the image to a 2D array of pixels and 3 color values (RGB)
       pixel_vals = img.reshape((-1, 3))
       # convert to float type
       pixel_vals = np.float32(pixel_vals)

       # define stopping criteria
       criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)

       # perform k-means clustering
       ret, label, center = cv2.kmeans(pixel_vals, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

       # convert data into 8-bit values
       center = np.uint8(center)
       res = center[label.flatten()]
       res2 = res.reshape(img.shape)

       # plot the segmented image for each k
       plt.subplot(2, 2, k - 1)
       plt.axis('off')
       plt.imshow(res2)
       plt.title('k = {}'.format(k))
   plt.show()
   ```
   You can see the difference between the segmented images for different values of k.

8. **Image rotation**:
   Image rotation refers to rotating an image by a certain angle.

   - Rotate an image using a rotation matrix:
     ```python
     # calculate the center of the image
     center = (image.shape[1] / 2, image.shape[0] / 2)
     # rotate the image by 90 degrees
     M = cv2.getRotationMatrix2D(center, 90, 1.0)
     # 90 is the angle of rotation
     # 1.0 is the scale of rotation
     rotated_image = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
     ```
   ![](.images/90_center.png)
   **OR**
     ```python
     # rotate the image by 90 degrees
     rotated_image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
     # cv2.ROTATE_90_CLOCKWISE is the angle of rotation
     # it can also be cv2.ROTATE_90_COUNTERCLOCKWISE or cv2.ROTATE_180; check the documentation
     ```
   ![](.images/90.png)

9. **Image flipping**:
   Image flipping refers to flipping an image horizontally or vertically.

   - Flip an image using OpenCV:
     ```python
     # flip the image horizontally
     flipped_image = cv2.flip(image, 1)
     # 1 is the code for flipping the image horizontally
     # 0 is the code for flipping the image vertically
     # -1 is the code for flipping the image both horizontally and vertically
     ```
   ![](.images/fli1.png)
   ![](.images/fli0.png)
   ![](.images/fli-1.png)
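   Flips and rotations like these are often combined into a simple augmentation step before training; a minimal sketch built from the OpenCV calls shown above (the probabilities are arbitrary placeholder values):
     ```python
     import random
     import cv2

     def augment(image):
         # flip horizontally half of the time
         if random.random() < 0.5:
             image = cv2.flip(image, 1)
         # rotate by 90 degrees a quarter of the time
         if random.random() < 0.25:
             image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
         return image
     ```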
10. **Image translation**:
    Image translation refers to shifting an image by a certain distance.

    - Translate an image using a translation matrix:
      ```python
      # translate the image by (25, 50) pixels
      # 25 is the number of pixels to shift along the x-axis
      # 50 is the number of pixels to shift along the y-axis
      M = np.float32([[1, 0, 25], [0, 1, 50]])  # the translation matrix used for shifting the image
      translated_image = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
      # (image.shape[1], image.shape[0]) is the size of the output image
      ```
    ![](.images/trans.png)

11. **Saving an image**:
    Saving an image refers to writing an image to a file.

    - Save an image using an OpenCV function:
      ```python
      # save the image
      cv2.imwrite('image.jpg', image)
      # 'image.jpg' is the name of the file
      # image is the image that we want to save
      ```
    **OR using matplotlib**
      ```python
      # save the image
      plt.imsave('image.jpg', image)
      ```
    **OR using PIL**
      ```python
      # save the image
      image.save('image.jpg')
      ```

12. **Additional tips**:
    - **Read a folder of images**:
      ```python
      # read a folder of images
      import os
      for filename in os.listdir('folder_name'):
          image = cv2.imread(os.path.join('folder_name', filename))
          # do something with the image
      ```
    - **OR using glob**:
      ```python
      # read a folder of images
      import glob
      for filename in glob.glob('folder_name/*.jpg'):
          image = cv2.imread(filename)
          # do something with the image
      ```
    - **Read a video**:
      ```python
      # read a video
      cap = cv2.VideoCapture('video.mp4')
      while cap.isOpened():
          ret, frame = cap.read()
          if not ret:  # stop when the video ends
              break
          # do something with the frame
      cap.release()
      ```
    - **Read a webcam**:
      ```python
      # read a webcam
      cap = cv2.VideoCapture(0)
      while cap.isOpened():
          ret, frame = cap.read()
          if not ret:
              break
          # do something with the frame
      cap.release()
      ```
    - **Read a video, extract frames and save them**:
      ```python
      # read a video, extract frames and save them
      cap = cv2.VideoCapture('video.mp4')
      count = 0
      while cap.isOpened():
          ret, frame = cap.read()
          if ret:
              cv2.imwrite('frame_{}.jpg'.format(count), frame)
              count += 1
          else:
              break
      cap.release()
      ```
    - **Get video information**:
      ```python
      # get video information
      cap = cv2.VideoCapture('video.mp4')
      width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
      height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
      fps = cap.get(cv2.CAP_PROP_FPS)
      frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
      ```
    > **Tip:** Since images can take up a lot of memory, it's not practical to load all of them at once. To address this issue, packages like PyTorch and TensorFlow use generators to load images in batches. This approach is more efficient than loading all images simultaneously.
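    As a plain-Python illustration of that idea (independent of PyTorch or TensorFlow), a minimal batch generator could look like the sketch below; `folder_name` and the batch size are placeholders:
      ```python
      import os
      import cv2

      def image_batches(folder, batch_size=32):
          """Yield lists of images, keeping only one batch in memory at a time."""
          batch = []
          for filename in sorted(os.listdir(folder)):
              image = cv2.imread(os.path.join(folder, filename))
              if image is None:  # skip files that are not images
                  continue
              batch.append(image)
              if len(batch) == batch_size:
                  yield batch
                  batch = []
          if batch:  # yield the final, possibly smaller batch
              yield batch
      ```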
    - **Build a custom dataset and dataset loader using PyTorch**
      The main use of the dataset class is to get the length of the dataset and to get the item at a specific index. The main use of the dataset loader is to load the data in batches.

      1. **Build a custom dataset**:
         ```python
         # build a custom dataset
         import torch
         from torch.utils.data import Dataset
         import pandas as pd
         import os
         from PIL import Image

         class CustomDataset(Dataset):
             def __init__(self, csv_file, root_dir, transform=None):
                 self.annotations = pd.read_csv(csv_file)
                 self.root_dir = root_dir
                 self.transform = transform

             def __len__(self):
                 return len(self.annotations)

             def __getitem__(self, index):
                 img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
                 image = Image.open(img_path)
                 y_label = torch.tensor(int(self.annotations.iloc[index, 1]))

                 if self.transform:
                     image = self.transform(image)

                 return (image, y_label)
         ```
         - **csv_file**: the path to the csv file that contains the image names and their labels.
         - **root_dir**: the path to the folder that contains the images.
         - **transform**: the transformation that we want to apply to the images.

      2. **Build a custom dataset loader**:
         ```python
         # build a custom dataset loader
         from torch.utils.data import DataLoader
         import torchvision.transforms as transforms

         dataset = CustomDataset('data.csv', 'images/', transforms.ToTensor())
         # 'data.csv' is the path to the csv file that contains the image names and their labels
         # 'images/' is the path to the folder that contains the images
         # transforms.ToTensor() is the transformation that we want to apply to the images

         dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
         # dataset is the dataset that we want to load
         # batch_size is the number of images that we want to load in each batch
         # shuffle is a boolean that indicates whether to shuffle the data or not
         ```
         - **dataset**: the dataset that we want to load.
         - **batch_size**: the number of images that we want to load in each batch.
         - **shuffle**: a boolean that indicates whether to shuffle the data or not.

      3. **Transform**:
         The transform is the transformation that we want to apply to the images; this usually includes resizing, normalizing, and converting the images to tensors.
         ```python
         # transform
         import torchvision.transforms as transforms

         transform = transforms.Compose([
             transforms.Resize((100, 100)),
             transforms.ToTensor(),
             transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
         ])
         # transforms.Resize((100, 100)) resizes the image to 100x100
         # transforms.ToTensor() converts the image to a tensor
         # transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) normalizes the image
         ```
         - **transforms.Resize((100, 100))**: resizes the image to 100x100.
         - **transforms.ToTensor()**: converts the image to a tensor.
         - **transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))**: normalizes the image.
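      A short usage sketch for the pieces above (training code is omitted; `dataloader` is the object created in the previous snippets):
         ```python
         # iterate over the DataLoader in batches
         for images, labels in dataloader:
             print(images.shape, labels.shape)  # images: [batch, channels, height, width], labels: [batch]
             break  # just inspect the first batch
         ```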
    - **Build a custom dataset using TensorFlow**
      1. **Build a custom dataset**:
         ```python
         # build a custom dataset
         import tensorflow as tf
         import pandas as pd
         import numpy as np
         import cv2
         import os

         class CustomDataset(tf.keras.utils.Sequence):
             def __init__(self, csv_file, root_dir, batch_size=32, shuffle=True):
                 self.batch_size = batch_size
                 self.shuffle = shuffle
                 self.annotations = pd.read_csv(csv_file)
                 self.root_dir = root_dir
                 self.on_epoch_end()

             def __len__(self):
                 return len(self.annotations) // self.batch_size

             def __getitem__(self, index):
                 batch = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
                 X, y = self.__data_generation(batch)
                 return X, y

             def on_epoch_end(self):
                 self.indexes = np.arange(len(self.annotations))
                 if self.shuffle:
                     np.random.shuffle(self.indexes)

             def __data_generation(self, batch):
                 X = []
                 y = []
                 for i in batch:
                     img_path = os.path.join(self.root_dir, self.annotations.iloc[i, 0])
                     image = cv2.imread(img_path)
                     image = cv2.resize(image, (100, 100))
                     X.append(image)
                     y.append(self.annotations.iloc[i, 1])
                 return np.array(X), np.array(y)
         ```
         - **csv_file**: the path to the csv file that contains the image names and their labels.
         - **root_dir**: the path to the folder that contains the images.
         - **batch_size**: the number of images that we want to load in each batch.
         - **shuffle**: a boolean that indicates whether to shuffle the data or not.

      2. **Build a custom dataset loader**:
         ```python
         # build a custom dataset loader
         dataset = CustomDataset('data.csv', 'images/', batch_size=32, shuffle=True)
         # 'data.csv' is the path to the csv file that contains the image names and their labels
         # 'images/' is the path to the folder that contains the images
         # batch_size is the number of images that we want to load in each batch
         # shuffle is a boolean that indicates whether to shuffle the data or not
         ```
         - **dataset**: the dataset that we want to load.

> **Note**: The main difference between the two approaches is that the PyTorch `Dataset` returns one `(image, label)` pair at a time and the `DataLoader` batches them, while the TensorFlow `Sequence` returns a whole batch at once as two arrays (`X` and `y`).

> **Note**: You can do whatever you want in the `__data_generation` function: you can apply any transformation to the images, and you can also load the images from a different source. The same applies to the `__getitem__` function: you can return the images and their labels in any format you want.

> **Reference**: the documentation of [Pytorch](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html), [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence), and [opencv](https://docs.opencv.org/4.5.2/d6/d0f/group__dnn.html#ga29f34df9376379a603acd8df581ac8d7).