├── Computer Vision ├── Pose Estimation & Squat Counter │ ├── Pose Estimation and squat counter using MoveNet.ipynb │ ├── Readme.md │ ├── ezgif.com-gif-maker.gif │ ├── images.jpg │ ├── jump.gif │ └── requirments └── Real Time Sign Language Interpretation App │ ├── IBM_cloud_configuration.md.txt │ ├── ReactComputerVisionTemplate │ ├── Public │ │ ├── favicon.ico │ │ ├── index.html │ │ ├── logo192.png │ │ ├── logo512.png │ │ ├── manifest.json │ │ ├── readme.md │ │ └── robots.txt │ ├── Readme.md │ ├── package-lock.json │ ├── package.json │ ├── src │ │ ├── App.css │ │ ├── App.js │ │ ├── index.css │ │ ├── index.js │ │ ├── readme.md │ │ └── utilities.js │ └── yarn.lock │ ├── Readme.md │ └── Sign-language_detection.ipynb ├── Data Visualization └── Python │ ├── Immigration_to_Canda_Data_Visualization.ipynb │ ├── Readme.dm │ └── Spatial visualization of San Francisco incidents.ipynb ├── Deep Learning └── Classification │ └── Melenoma_Classification │ ├── Readme.md │ ├── deep-learning-models │ ├── CNN_model.py │ ├── __init__.py │ ├── main.py │ ├── readme.md │ └── training.py │ ├── evaluation-metrics │ ├── __init__.py │ ├── classification_metrics.py │ ├── f1_score.py │ └── readme.md │ ├── loading and storing │ ├── __init__.py │ ├── loading_images.py │ ├── loading_storing_h5py.py │ └── readme.md │ ├── main.py │ ├── preprocessing │ ├── __init__.py │ ├── exploration.py │ ├── preprocessing.py │ └── readme.md │ └── readme.md ├── Machine Learning ├── Classification │ ├── Alzhimers CV-BOLD Classification │ │ ├── Best_mask.py │ │ ├── Best_mask2.py │ │ ├── Model.py │ │ ├── confidence_interval_mask.py │ │ ├── data_preprocessing.py │ │ ├── deep learning │ │ │ ├── CNN_based_models │ │ │ │ ├── AlexNet.py │ │ │ │ ├── CNN.py │ │ │ │ ├── CNN_feature_extractor.py │ │ │ │ ├── DenseNet121.py │ │ │ │ ├── InceptionResNetV2.py │ │ │ │ ├── LeNet.py │ │ │ │ ├── ResNet50.py │ │ │ │ ├── VGG.py │ │ │ │ ├── VGG_pretrained.py │ │ │ │ ├── ZFNet.py │ │ │ │ ├── optimizers.py │ │ │ │ ├── readme.md │ │ │ │ └── 
simple_model.py │ │ │ ├── evaluation │ │ │ │ ├── metrics.py │ │ │ │ ├── model_evaluation.py │ │ │ │ └── readme.md │ │ │ ├── main.py │ │ │ ├── preprocessing │ │ │ │ ├── data_augmentation.py │ │ │ │ ├── data_preprocessing.py │ │ │ │ ├── preprocessing_methods.py │ │ │ │ └── readme.md │ │ │ └── storing_loading │ │ │ │ ├── generate_result_.py │ │ │ │ ├── load_data.py │ │ │ │ └── readme.md │ │ ├── generate_result.py │ │ ├── hyper_opt.py │ │ ├── load_data.py │ │ ├── load_models.py │ │ ├── main.py │ │ ├── pykliep.py │ │ ├── readme.md │ │ ├── sample_test.py │ │ ├── shuffle.py │ │ └── writing.py │ └── Sensor-activity-recognition │ │ ├── Sensor Activity Recognition.pdf │ │ ├── codes │ │ ├── classes_accuarcy.m │ │ ├── classification.m │ │ ├── create_feature_map.m │ │ ├── main.m │ │ ├── performance_evaluation.m │ │ ├── readme.md │ │ └── scalingANDoutliers.m │ │ └── readme.md ├── Clustering │ ├── Customer identification for mail order products │ │ ├── Identify Customer Segments.ipynb │ │ ├── LICENSE │ │ └── README.md │ ├── Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym │ │ ├── Finding best neighborhood for new gym opening in toronto city.ipynb │ │ ├── LICENSE │ │ ├── Project report.pdf │ │ └── Readme.md │ └── Readme.md └── Regression │ ├── Automobile price prediction │ ├── Automobile Price Prediction .ipynb │ └── Readme.md │ └── Readme.md ├── Natural_Language_processing ├── Data-Science-Resume-Selector │ ├── Resume Selector with Naive Bayes .ipynb │ ├── readme.md │ └── resume.csv ├── Sentiment-analysis │ ├── README.md │ ├── SageMaker Project.ipynb │ ├── sevre │ │ ├── model.py │ │ ├── predict.py │ │ ├── requirements.txt │ │ └── utils.py │ ├── train │ │ ├── model.py │ │ ├── requirements.txt │ │ └── train.py │ └── website │ │ └── index.html └── plagiarism-detector-web-app │ ├── 1_Data_Exploration.ipynb │ ├── 2_Plagiarism_Feature_Engineering.ipynb │ ├── 3_Training_a_Model.ipynb │ ├── README.md │ ├── helpers.py │ ├── palagrism_data │ ├── test.csv │ └── train.csv │ ├── 
problem_unittests.py │ ├── source_pytorch │ ├── model.py │ ├── predict.py │ └── train.py │ └── source_sklearn │ └── train.py ├── Readme.md ├── Spark ├── Cluster Analysis of the San Diego Weather Data │ ├── Cluster Analysis of the San Diego Weather Data.ipynb │ └── readme.md └── San Diego Rainforest Fire Predicition │ ├── Readme.md │ └── San Diego Rainforest Fire Prediction.ipynb └── time-series-analysis ├── Power-consumption-forecasting ├── Energy_Consumption_Solution.ipynb ├── json_energy_data │ ├── readme.md │ ├── test.json │ └── train.json ├── readme.md └── txt_preprocessing.py └── readme.md /Computer Vision/Pose Estimation & Squat Counter/Readme.md: -------------------------------------------------------------------------------- 1 | ## Pose Estimation & Squat Counter 2 | 3 | ### Introduction ### 4 | 5 | Pose estimation is a computer vision problem: detecting human figures in images and videos so that one can determine, for example, where someone’s elbow appears in an image. Note that pose estimation merely estimates where key body joints are; it does not recognize who is in an image or video. The pose estimation model takes a processed camera image as input and outputs information about keypoints. The detected keypoints are indexed by a part ID, with a confidence score between 0.0 and 1.0 indicating the probability that a keypoint exists at that position. An example is shown in the video below. 6 | 7 | ![alt-text](https://github.com/youssefHosni/Data-Science-Portofolio/blob/main/Computer%20Vision/Pose%20Estimation%20%26%20Squat%20Counter/jump.gif) 8 | 9 | Based on the results of the pose estimation, the squat movement was detected, counted, and printed on the screen, as shown in the video below. 
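As a rough illustration of how such a counter can work (the actual logic lives in the notebook; the helper names and threshold values below are hypothetical), the knee angle can be computed from the hip, knee, and ankle keypoints, and one repetition counted each time the angle drops below a "down" threshold and then rises back above an "up" threshold:

```python
import numpy as np

# MoveNet keypoint indices for the left leg (see the table under Methods)
LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 11, 13, 15

def joint_angle(a, b, c):
    """Angle at point b, in degrees, between segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def count_squats(frames, down_thresh=100.0, up_thresh=160.0):
    """frames: iterable of (17, 2) keypoint arrays, one per video frame.
    The thresholds are illustrative guesses, not the notebook's values."""
    count, is_down = 0, False
    for keypoints in frames:
        angle = joint_angle(keypoints[LEFT_HIP],
                            keypoints[LEFT_KNEE],
                            keypoints[LEFT_ANKLE])
        if angle < down_thresh:              # deep in the squat
            is_down = True
        elif is_down and angle > up_thresh:  # back to standing: one rep
            count += 1
            is_down = False
    return count
```

In the real pipeline the keypoints would come from MoveNet.Lightning frame by frame, and keypoints whose confidence score is low should be skipped before computing the angle.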
10 | 11 | ![alt-text](https://github.com/youssefHosni/Data-Science-Portofolio/blob/main/Computer%20Vision/Pose%20Estimation%20%26%20Squat%20Counter/ezgif.com-gif-maker.gif) 12 | --- 13 | 14 | ### Methods 15 | 16 | The model used is MoveNet, which is available in two flavors: 17 | 18 | * MoveNet.Lightning is smaller and faster, but less accurate, than the Thunder version. It can run in real time on modern smartphones. 19 | * MoveNet.Thunder is more accurate, but also larger and slower, than Lightning. It is useful for use cases that require higher accuracy. 20 | 21 | MoveNet.Lightning is used here. 22 | 23 | MoveNet is a state-of-the-art pose estimation model that can detect these 17 keypoints: 24 | 25 | * Nose 26 | * Left and right eye 27 | * Left and right ear 28 | * Left and right shoulder 29 | * Left and right elbow 30 | * Left and right wrist 31 | * Left and right hip 32 | * Left and right knee 33 | * Left and right ankle 34 | 35 | The various body joints detected by the pose estimation model are tabulated below: 36 | 37 | | Id | Part | 38 | | --- | ----------- | 39 | | 0 | nose | 40 | | 1 | leftEye | 41 | | 2 | rightEye | 42 | | 3 | leftEar | 43 | | 4 | rightEar | 44 | | 5 | leftShoulder | 45 | | 6 | rightShoulder | 46 | | 7 | leftElbow | 47 | | 8 | rightElbow | 48 | | 9 | leftWrist | 49 | | 10 | rightWrist | 50 | | 11 | leftHip | 51 | | 12 | rightHip | 52 | | 13 | leftKnee | 53 | | 14 | rightKnee | 54 | | 15 | leftAnkle | 55 | | 16 | rightAnkle | 56 | 57 | 58 | --- 59 | 60 | ### Install dependencies 61 | 62 | ``` 63 | pip install -r requirments 64 | ``` 65 | 66 | -------------------------------------------------------------------------------- /Computer Vision/Pose Estimation & Squat Counter/ezgif.com-gif-maker.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & 
Squat Counter/ezgif.com-gif-maker.gif -------------------------------------------------------------------------------- /Computer Vision/Pose Estimation & Squat Counter/images.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & Squat Counter/images.jpg -------------------------------------------------------------------------------- /Computer Vision/Pose Estimation & Squat Counter/jump.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & Squat Counter/jump.gif -------------------------------------------------------------------------------- /Computer Vision/Pose Estimation & Squat Counter/requirments: -------------------------------------------------------------------------------- 1 | tensorflow 2 | tensorflow_hub 3 | opencv-python 4 | numpy 5 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/IBM_cloud_configuration.md.txt: -------------------------------------------------------------------------------- 1 | ibmcloud login 2 | 3 | ibmcloud target -r eu-de 4 | 5 | # configurations 6 | 7 | ibmcloud cos bucket-cors-put --bucket tensorflowjsrealtimesign --cors-configuration file://corsconfig.json -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation 
App/ReactComputerVisionTemplate/Public/favicon.ico -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 12 | 13 | 17 | 18 | 27 | React App 28 | 29 | 30 | 31 |
32 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo192.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo192.png -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo512.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo512.png -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/manifest.json: -------------------------------------------------------------------------------- 1 | { 2 | "short_name": "React App", 3 | "name": "Create React App Sample", 4 | "icons": [ 5 | { 6 | "src": "favicon.ico", 7 | "sizes": "64x64 32x32 24x24 16x16", 8 | "type": "image/x-icon" 9 | }, 10 | { 11 | "src": "logo192.png", 12 | "type": "image/png", 13 | "sizes": "192x192" 14 | }, 15 | { 16 | "src": "logo512.png", 17 | "type": "image/png", 18 | "sizes": "512x512" 19 | } 20 | ], 21 | "start_url": ".", 22 | "display": "standalone", 23 | "theme_color": "#000000", 24 | "background_color": "#ffffff" 25 | } 26 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/readme.md: 
-------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/robots.txt: -------------------------------------------------------------------------------- 1 | # https://www.robotstxt.org/robotstxt.html 2 | User-agent: * 3 | Disallow: 4 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Readme.md: -------------------------------------------------------------------------------- 1 | This project was bootstrapped with [Create React App](https://github.com/facebook/create-react-app). 2 | 3 | ## Available Scripts 4 | 5 | In the project directory, you can run: 6 | 7 | ### `yarn start` 8 | 9 | Runs the app in the development mode.
10 | Open [http://localhost:3000](http://localhost:3000) to view it in the browser. 11 | 12 | The page will reload if you make edits.
13 | You will also see any lint errors in the console. 14 | 15 | ### `yarn test` 16 | 17 | Launches the test runner in the interactive watch mode.
18 | See the section about [running tests](https://facebook.github.io/create-react-app/docs/running-tests) for more information. 19 | 20 | ### `yarn build` 21 | 22 | Builds the app for production to the `build` folder.
23 | It correctly bundles React in production mode and optimizes the build for the best performance. 24 | 25 | The build is minified and the filenames include the hashes.
26 | Your app is ready to be deployed! 27 | 28 | See the section about [deployment](https://facebook.github.io/create-react-app/docs/deployment) for more information. 29 | 30 | ### `yarn eject` 31 | 32 | **Note: this is a one-way operation. Once you `eject`, you can’t go back!** 33 | 34 | If you aren’t satisfied with the build tool and configuration choices, you can `eject` at any time. This command will remove the single build dependency from your project. 35 | 36 | Instead, it will copy all the configuration files and the transitive dependencies (webpack, Babel, ESLint, etc) right into your project so you have full control over them. All of the commands except `eject` will still work, but they will point to the copied scripts so you can tweak them. At this point you’re on your own. 37 | 38 | You don’t have to ever use `eject`. The curated feature set is suitable for small and middle deployments, and you shouldn’t feel obligated to use this feature. However we understand that this tool wouldn’t be useful if you couldn’t customize it when you are ready for it. 39 | 40 | ## Learn More 41 | 42 | You can learn more in the [Create React App documentation](https://facebook.github.io/create-react-app/docs/getting-started). 43 | 44 | To learn React, check out the [React documentation](https://reactjs.org/). 
45 | 46 | ### Code Splitting 47 | 48 | This section has moved here: https://facebook.github.io/create-react-app/docs/code-splitting 49 | 50 | ### Analyzing the Bundle Size 51 | 52 | This section has moved here: https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size 53 | 54 | ### Making a Progressive Web App 55 | 56 | This section has moved here: https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app 57 | 58 | ### Advanced Configuration 59 | 60 | This section has moved here: https://facebook.github.io/create-react-app/docs/advanced-configuration 61 | 62 | ### Deployment 63 | 64 | This section has moved here: https://facebook.github.io/create-react-app/docs/deployment 65 | 66 | ### `yarn build` fails to minify 67 | 68 | This section has moved here: https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify 69 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "handpose", 3 | "version": "0.1.0", 4 | "private": true, 5 | "dependencies": { 6 | "@tensorflow/tfjs": "^3.1.0", 7 | "@testing-library/jest-dom": "^4.2.4", 8 | "@testing-library/react": "^9.3.2", 9 | "@testing-library/user-event": "^7.1.2", 10 | "react": "^16.13.1", 11 | "react-dom": "^16.13.1", 12 | "react-scripts": "3.4.3", 13 | "react-webcam": "^5.2.0" 14 | }, 15 | "scripts": { 16 | "start": "react-scripts start", 17 | "build": "react-scripts build", 18 | "test": "react-scripts test", 19 | "eject": "react-scripts eject" 20 | }, 21 | "eslintConfig": { 22 | "extends": "react-app" 23 | }, 24 | "browserslist": { 25 | "production": [ 26 | ">0.2%", 27 | "not dead", 28 | "not op_mini all" 29 | ], 30 | "development": [ 31 | "last 1 chrome version", 32 | "last 1 firefox version", 33 | "last 1 safari version" 34 
| ] 35 | } 36 | } 37 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/App.css: -------------------------------------------------------------------------------- 1 | .App { 2 | text-align: center; 3 | } 4 | 5 | .App-logo { 6 | height: 40vmin; 7 | pointer-events: none; 8 | } 9 | 10 | @media (prefers-reduced-motion: no-preference) { 11 | .App-logo { 12 | animation: App-logo-spin infinite 20s linear; 13 | } 14 | } 15 | 16 | .App-header { 17 | background-color: #282c34; 18 | min-height: 100vh; 19 | display: flex; 20 | flex-direction: column; 21 | align-items: center; 22 | justify-content: center; 23 | font-size: calc(10px + 2vmin); 24 | color: white; 25 | } 26 | 27 | .App-link { 28 | color: #61dafb; 29 | } 30 | 31 | @keyframes App-logo-spin { 32 | from { 33 | transform: rotate(0deg); 34 | } 35 | to { 36 | transform: rotate(360deg); 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/App.js: -------------------------------------------------------------------------------- 1 | // Import dependencies 2 | import React, { useRef, useState, useEffect } from "react"; 3 | import * as tf from "@tensorflow/tfjs"; 4 | import Webcam from "react-webcam"; 5 | import "./App.css"; 6 | import { nextFrame } from "@tensorflow/tfjs"; 7 | // 2. TODO - Import drawing utility here 8 | // e.g. import { drawRect } from "./utilities"; 9 | import { drawRect } from "./utilities"; 10 | 11 | function App() { 12 | const webcamRef = useRef(null); 13 | const canvasRef = useRef(null); 14 | 15 | // Main function 16 | const runCoco = async () => { 17 | // 3. TODO - Load network 18 | // e.g. 
const net = await cocossd.load(); 19 | // https://tensorflowjsrealtimesign.s3.eu-de.cloud-object-storage.appdomain.cloud/model.json 20 | const net = await tf.loadGraphModel('https://tensorflowjsrealtimesign.s3.eu-de.cloud-object-storage.appdomain.cloud/model.json') 21 | 22 | // Loop and detect hands 23 | setInterval(() => { 24 | detect(net); 25 | }, 16.7); 26 | }; 27 | 28 | const detect = async (net) => { 29 | // Check data is available 30 | if ( 31 | typeof webcamRef.current !== "undefined" && 32 | webcamRef.current !== null && 33 | webcamRef.current.video.readyState === 4 34 | ) { 35 | // Get Video Properties 36 | const video = webcamRef.current.video; 37 | const videoWidth = webcamRef.current.video.videoWidth; 38 | const videoHeight = webcamRef.current.video.videoHeight; 39 | 40 | // Set video width 41 | webcamRef.current.video.width = videoWidth; 42 | webcamRef.current.video.height = videoHeight; 43 | 44 | // Set canvas height and width 45 | canvasRef.current.width = videoWidth; 46 | canvasRef.current.height = videoHeight; 47 | 48 | // 4. TODO - Make Detections 49 | const img = tf.browser.fromPixels(video) 50 | const resized = tf.image.resizeBilinear(img, [640, 480]) 51 | const casted = resized.cast('int32') 52 | const expanded = casted.expandDims(0) 53 | const obj = await net.executeAsync(expanded) 54 | console.log(obj) 55 | 56 | const boxes = await obj[1].array() 57 | const classes = await obj[6].array() 58 | const scores = await obj[5].array() 59 | 60 | // Draw mesh 61 | const ctx = canvasRef.current.getContext("2d"); 62 | 63 | // 5. TODO - Update drawing utility 64 | // drawSomething(obj, ctx) 65 | window.requestAnimationFrame(() => { drawRect(boxes[0], classes[0], scores[0], 0.9, videoWidth, videoHeight, ctx) }); 66 | 67 | tf.dispose(img) 68 | tf.dispose(resized) 69 | tf.dispose(casted) 70 | tf.dispose(expanded) 71 | tf.dispose(obj) 72 | 73 | } 74 | }; 75 | 76 | useEffect(() => { runCoco() }, []); 77 | 78 | return ( 79 |
80 |
81 | 96 | 97 | 111 |
112 |
113 | ); 114 | } 115 | 116 | export default App; -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/index.css: -------------------------------------------------------------------------------- 1 | body { 2 | margin: 0; 3 | font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen', 4 | 'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue', 5 | sans-serif; 6 | -webkit-font-smoothing: antialiased; 7 | -moz-osx-font-smoothing: grayscale; 8 | } 9 | 10 | code { 11 | font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New', 12 | monospace; 13 | } 14 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/index.js: -------------------------------------------------------------------------------- 1 | import React from 'react'; 2 | import ReactDOM from 'react-dom'; 3 | import './index.css'; 4 | import App from './App'; 5 | 6 | ReactDOM.render( 7 | 8 | 9 | , 10 | document.getElementById('root') 11 | ); -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/utilities.js: -------------------------------------------------------------------------------- 1 | // Define our labelmap 2 | const labelMap = { 3 | 1: { name: 'Hello', color: 'red' }, 4 | 2: { name: 'Yes', color: 'yellow' }, 5 | 3: { name: 'NO', color: 'lime' }, 6 | 4: { name: 'Thank_you', color: 'blue' }, 7 | 5: { name: 'I_Love_You', color: 'purple' }, 8 | } 
9 | 10 | let width_scale = 2 // default box scales; reassigned per label below 11 | let length_scale = 1.5 12 | 13 | 14 | // Define a drawing function 15 | export const drawRect = (boxes, classes, scores, threshold, imgWidth, imgHeight, ctx) => { 16 | for (let i = 0; i < boxes.length; i++) { 17 | if (boxes[i] && classes[i] && scores[i] > threshold) { 18 | // Extract variables 19 | const [y, x, height, width] = boxes[i] 20 | const text = classes[i] 21 | 22 | // Set styling 23 | ctx.strokeStyle = labelMap[text]['color'] 24 | ctx.lineWidth = 10 25 | ctx.fillStyle = 'white' 26 | ctx.font = '30px Arial' 27 | 28 | if (labelMap[text]['name'] === "Hello"){ 29 | 30 | width_scale = 2.5 31 | length_scale = 2 32 | 33 | } else if (labelMap[text]['name'] === "Yes") { 34 | width_scale = 1.5 35 | length_scale = 2.5 36 | 37 | } else if (labelMap[text]['name'] === "NO"){ 38 | 39 | width_scale = 1.5 40 | length_scale = 2.5 41 | 42 | } else if (labelMap[text]['name'] === "Thank_you"){ 43 | 44 | width_scale = 2 45 | length_scale = 2.5 46 | } else if (labelMap[text]['name'] === "I_Love_You"){ 47 | 48 | width_scale = 1.2 49 | length_scale = 1.5 50 | } else { 51 | 52 | width_scale = 2 53 | length_scale = 1.5 54 | 55 | } 56 | 57 | // DRAW!! 
58 | ctx.beginPath() 59 | ctx.fillText(labelMap[text]['name'] + ' - ' + Math.round(scores[i] * 100) / 100, x * imgWidth, y * imgHeight - 10) 60 | 61 | ctx.rect(x * imgWidth, y * imgHeight, width * imgWidth / width_scale, height * imgHeight / length_scale); 62 | ctx.stroke() 63 | } 64 | } 65 | } -------------------------------------------------------------------------------- /Computer Vision/Real Time Sign Language Interpretation App/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Data Visualization/Python/Readme.dm: -------------------------------------------------------------------------------- 1 | Data visualization projects implemented in Python. 2 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/Readme.md: -------------------------------------------------------------------------------- 1 | # Melanoma Classification 2 | 3 | ## 1. Problem Statement 4 | 5 | Classifying melanoma skin lesion images into nine diagnostic classes using deep learning models. 6 | 7 | ## 2. Methods 8 | 9 | ### 2.1. Dataset 10 | The dataset used is the [Skin Lesion Images for Melanoma Classification on Kaggle](https://www.kaggle.com/datasets/andrewmvd/isic-2019). This dataset contains the training data for the ISIC 2019 challenge; note that it already includes data from previous years (2017 and 2018). 
The dataset for ISIC 2019 contains 25,331 images available for the classification of dermoscopic images among nine different diagnostic categories: 11 | 12 | * Melanoma 13 | * Melanocytic nevus 14 | * Basal cell carcinoma 15 | * Actinic keratosis 16 | * Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis) 17 | * Dermatofibroma 18 | * Vascular lesion 19 | * Squamous cell carcinoma 20 | * None of the above 21 | 22 | ![Screenshot 2023-03-30 183038](https://user-images.githubusercontent.com/72076328/228887596-f9be3bed-ad19-4469-8fda-09a8725bb246.png) 23 | 24 | ### 2.2. Data Preprocessing 25 | 26 | 27 | ### 2.3. Feature Engineering 28 | 29 | 30 | ### 2.4. Models 31 | Fine-tuned pretrained CNN-based models. The models used are: 32 | 33 | * VGG-16 34 | * VGG-19 35 | * ResNet-50 36 | * Mobile-Net 37 | 38 | 39 | ## 3. Results 40 | Since the data is imbalanced, the F1 score is used as the evaluation metric. 41 | 42 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/deep-learning-models/CNN_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 19:00:48 2021 4 | 5 | @author: youss 6 | """ 7 | import tensorflow.compat.v1 as tf 8 | from tensorflow.keras.models import Sequential 9 | from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten 10 | from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Activation 11 | 12 | # https://keras.io/api/applications/ 13 | def simple_CNN(train_data_shape,n_classes): 14 | # building a linear stack of layers with the sequential model 15 | 16 | model = Sequential() 17 | model.add(Conv2D(32, (3, 3), input_shape=train_data_shape[1:])) 18 | model.add(Activation('relu')) 19 | model.add(MaxPooling2D(pool_size=(2, 2))) 20 | 21 | model.add(Conv2D(64, (3, 3))) 22 | model.add(Activation('relu')) 23 | model.add(MaxPooling2D(pool_size=(2, 2))) 24 | 
25 | model.add(Conv2D(128, (3, 3))) 26 | model.add(Activation('relu')) 27 | model.add(MaxPooling2D(pool_size=(2, 2))) 28 | 29 | model.add(Flatten()) 30 | model.add(Dense(64)) 31 | model.add(Activation('relu')) 32 | model.add(Dropout(0.5)) 33 | model.add(Dense(n_classes)) 34 | model.add(Activation('sigmoid')) 35 | 36 | model.summary() 37 | return model 38 | 39 | def MobileNet(num_classes, is_trainable ): 40 | 41 | pretrained_model=tf.keras.applications.MobileNet( 42 | input_shape=(224, 224, 3), 43 | alpha=1.0, 44 | depth_multiplier=1, 45 | dropout=0.001, 46 | include_top=False, 47 | weights="imagenet") 48 | 49 | for layer in pretrained_model.layers[0:18]: 50 | layer.trainable = is_trainable 51 | 52 | model = Sequential() 53 | # first (and only) set of FC => RELU layers 54 | model.add(Flatten()) 55 | model.add(Dense(200, activation='relu')) 56 | model.add(Dropout(0.5)) 57 | model.add(BatchNormalization()) 58 | model.add(Dense(400, activation='relu')) 59 | model.add(Dropout(0.5)) 60 | model.add(BatchNormalization()) 61 | 62 | # softmax classifier 63 | model.add(Dense(num_classes,activation='softmax')) 64 | pretrainedInput = pretrained_model.input 65 | pretrainedOutput = pretrained_model.output 66 | output = model(pretrainedOutput) 67 | model = tf.keras.models.Model(pretrainedInput, output) 68 | model.summary() 69 | return model 70 | 71 | def VGG_16(num_classes,is_trainable): 72 | from tensorflow.keras.applications.vgg16 import VGG16 73 | 74 | pretrained_model = VGG16( 75 | include_top=False, 76 | input_shape=(224, 224, 3), 77 | weights='imagenet') 78 | 79 | for layer in pretrained_model.layers: 80 | layer.trainable = is_trainable 81 | 82 | model = Sequential() 83 | # first (and only) set of FC => RELU layers 84 | model.add(Flatten()) 85 | model.add(Dense(200, activation='relu')) 86 | model.add(Dropout(0.5)) 87 | model.add(BatchNormalization()) 88 | model.add(Dense(400, activation='relu')) 89 | model.add(Dropout(0.5)) 90 | model.add(BatchNormalization()) 91 | 92 | # 
softmax classifier 93 | model.add(Dense(num_classes,activation='softmax')) 94 | pretrainedInput = pretrained_model.input 95 | pretrainedOutput = pretrained_model.output 96 | output = model(pretrainedOutput) 97 | model = tf.keras.models.Model(pretrainedInput, output) 98 | model.summary() 99 | return model 100 | 101 | def Inception_v3(num_classes,is_trainable): 102 | pretrained_model= tf.keras.applications.InceptionV3( 103 | include_top=False, 104 | weights="imagenet", 105 | input_tensor=None, 106 | input_shape=(224, 224, 3), 107 | pooling='max') 108 | for layer in pretrained_model.layers[0:150]: 109 | layer.trainable = is_trainable 110 | model = Sequential() 111 | # first (and only) set of FC => RELU layers 112 | model.add(Flatten()) 113 | model.add(Dense(200, activation='relu')) 114 | model.add(Dropout(0.5)) 115 | model.add(BatchNormalization()) 116 | model.add(Dense(400, activation='relu')) 117 | model.add(Dropout(0.5)) 118 | model.add(BatchNormalization()) 119 | 120 | # softmax classifier 121 | model.add(Dense(num_classes,activation='softmax')) 122 | pretrainedInput = pretrained_model.input 123 | pretrainedOutput = pretrained_model.output 124 | output = model(pretrainedOutput) 125 | model = tf.keras.models.Model(pretrainedInput, output) 126 | model.summary() 127 | return model 128 | 129 | def InceptionResNetV2(num_classes,is_trainable): 130 | pretrained_model=tf.keras.applications.InceptionResNetV2( 131 | include_top=False, 132 | weights="imagenet", 133 | input_tensor=None, 134 | input_shape=(224,224,3)) 135 | 136 | for layer in pretrained_model.layers[0:450]: 137 | layer.trainable = is_trainable 138 | model = Sequential() 139 | # first (and only) set of FC => RELU layers 140 | model.add(Flatten()) 141 | model.add(Dense(32, activation='relu')) 142 | model.add(Dropout(0.5)) 143 | model.add(BatchNormalization()) 144 | 145 | model.add(Dense(64, activation='relu')) 146 | model.add(Dropout(0.5)) 147 | model.add(BatchNormalization()) 148 | 149 | model.add(Dense(128, 
activation='relu')) 150 | model.add(Dropout(0.5)) 151 | model.add(BatchNormalization()) 152 | 153 | model.add(Dense(256, activation='relu')) 154 | model.add(Dropout(0.5)) 155 | model.add(BatchNormalization()) 156 | 157 | model.add(Dense(512, activation='relu')) 158 | model.add(Dropout(0.5)) 159 | model.add(BatchNormalization()) 160 | 161 | 162 | # softmax classifier 163 | model.add(Dense(num_classes,activation='softmax')) 164 | pretrainedInput = pretrained_model.input 165 | pretrainedOutput = pretrained_model.output 166 | output = model(pretrainedOutput) 167 | model = tf.keras.models.Model(pretrainedInput, output) 168 | model.summary() 169 | return model 170 | 171 | 172 | 173 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/deep-learning-models/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 19:29:24 2021 4 | 5 | @author: youss 6 | """ 7 | 8 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/deep-learning-models/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Mar 22 04:58:28 2021 4 | 5 | @author: youss 6 | """ 7 | import sys 8 | 9 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models') 10 | from CNN_model import Inception_v3 11 | from CNN_model import VGG_16 12 | from CNN_model import simple_CNN 13 | from CNN_model import MobileNet 14 | from CNN_model import InceptionResNetV2 15 | 16 | def select_CNN_model(model_name,num_classes,trainable,input_shape): 17 | 18 | if model_name == 'simple_CNN': 19 | model= simple_CNN(input_shape,num_classes) 20 | 21 | elif model_name=='MobileNet': 22 | model=MobileNet(num_classes,trainable) 23 | 24 | elif model_name=='VGG-16': 25 | 
model=VGG_16(num_classes,trainable) 26 | 27 | elif model_name=='Inception-v3': 28 | model=Inception_v3(num_classes,trainable) 29 | 30 | elif model_name=='InceptionResNetV2': 31 | model=InceptionResNetV2(num_classes,trainable) 32 | 33 | else: 34 | print("Value error: there is no model with the name",model_name) 35 | return 36 | 37 | return model 38 | 39 | def getLayerIndexByName(model, layername): 40 | 41 | for idx, layer in enumerate(model.layers): 42 | if layer.name == layername: 43 | return idx 44 | return None 45 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/deep-learning-models/readme.md: -------------------------------------------------------------------------------- 1 | The deep learning models used in this project. 2 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/deep-learning-models/training.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 19:31:01 2021 4 | 5 | @author: youss 6 | """ 7 | import numpy as np 8 | import sys 9 | 10 | import matplotlib.pyplot as plt 11 | from keras.callbacks import EarlyStopping 12 | from sklearn.utils import class_weight 13 | from tensorflow.keras.optimizers import SGD 14 | 15 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/evaluation metrics') 16 | from f1_score import f1,f1_micro 17 | from classification_metrics import confusion_matrix_calc 18 | 19 | def training_model(model,train_data,train_labels,val_data,val_labels,test_data,test_labels,num_epoch,batch_size,num_classes,evaulation_metric): 20 | 21 | class_weights = class_weight.compute_class_weight('balanced',np.unique(train_labels.values.argmax(axis=1)),train_labels.values.argmax(axis=1)) 22 | sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True) 23 | es = 
EarlyStopping(monitor='val_'+evaulation_metric, mode='max', verbose=1,patience=10,baseline=0.5,min_delta=0.1) 24 | 25 | if evaulation_metric=='accuracy': 26 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=['accuracy']) 27 | elif evaulation_metric=='f1': 28 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=[f1]) 29 | elif evaulation_metric=='f1_micro': 30 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=[f1_micro]) 31 | 32 | 33 | history=model.fit(train_data,train_labels, validation_data=(val_data,val_labels),epochs=num_epoch, batch_size=batch_size,class_weight=dict(enumerate(class_weights)),callbacks=[es]) 34 | score=model.evaluate(test_data,test_labels) 35 | print(f'Test loss: {score[0]} / Test ' + evaulation_metric + f' score: {score[1]}') 36 | plotting_train_val_metrics(history,evaulation_metric) 37 | predicted_train_labels=model.predict(train_data) 38 | predicted_val_labels=model.predict(val_data) 39 | predicted_test_labels=model.predict(test_data) 40 | confusion_matrix_calc(predicted_train_labels,train_labels,num_classes,'confusion matrix of the training data') 41 | confusion_matrix_calc(predicted_val_labels,val_labels,num_classes,'confusion matrix of the validation data') 42 | confusion_matrix_calc(predicted_test_labels,test_labels,num_classes,'confusion matrix of the test data') 43 | 44 | return None 45 | 46 | def plotting_train_val_metrics(history,evaulation_metric): 47 | plt.plot(history.history[evaulation_metric]) 48 | plt.plot(history.history['val_'+ evaulation_metric]) 49 | plt.title('training and val ' + evaulation_metric) 50 | plt.ylabel(evaulation_metric + ' score') 51 | plt.xlabel('epoch') 52 | plt.legend(['train', 'val'], loc='upper left') 53 | plt.show() 54 | # summarize history for loss 55 | plt.plot(history.history['loss']) 56 | plt.plot(history.history['val_loss']) 57 | plt.title('model loss') 58 | plt.ylabel('loss') 59 | plt.xlabel('epoch') 60 | plt.legend(['train', 'val'], loc='upper left') 61 | plt.show() 62 | 
return None 63 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 19:21:12 2021 4 | 5 | @author: youss 6 | """ 7 | 8 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/classification_metrics.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Mar 17 13:57:42 2021 4 | 5 | @author: youss 6 | """ 7 | from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay 8 | import numpy as np 9 | import matplotlib.pyplot as plt 10 | 11 | def confusion_matrix_calc(predicted_labels,true_labels,num_classes,title): 12 | positions = np.arange(0,num_classes) 13 | classes = np.arange(0,num_classes) 14 | cm=confusion_matrix(true_labels.values.argmax(axis=1), predicted_labels.argmax(axis=1),labels=classes) # sklearn expects y_true first, then y_pred 15 | disp=ConfusionMatrixDisplay(cm,display_labels=classes) 16 | classes_name = true_labels.columns.values 17 | fig, ax = plt.subplots(figsize=(10,10)) 18 | disp.plot(ax=ax) 19 | plt.xticks(positions, classes_name) 20 | plt.yticks(positions, classes_name) 21 | plt.title(title) 22 | return 23 | 24 | 25 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/f1_score.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 19:10:17 2021 4 | 5 | @author: youss 6 | """ 7 | import tensorflow.keras.backend as K 8 | import tensorflow.compat.v1 as tf 9 | 10 | def f1(y_true, y_pred): 11 | y_pred = K.round(y_pred) 12 | tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0) 13 
| tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0) 14 | fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0) 15 | fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0) 16 | 17 | p = tp / (tp + fp + K.epsilon()) 18 | r = tp / (tp + fn + K.epsilon()) 19 | 20 | f1 = 2*p*r / (p+r+K.epsilon()) 21 | f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1) 22 | return K.mean(f1) 23 | 24 | def f1_loss(y_true, y_pred): 25 | 26 | tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0) 27 | tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0) 28 | fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0) 29 | fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0) 30 | 31 | p = tp / (tp + fp + K.epsilon()) 32 | r = tp / (tp + fn + K.epsilon()) 33 | 34 | f1 = 2*p*r / (p+r+K.epsilon()) 35 | f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1) 36 | return 1 - K.mean(f1) 37 | 38 | def f1_micro(y_true, y_pred): 39 | y_pred = K.round(y_pred) 40 | tp_per_class = K.sum(K.cast(y_true*y_pred, 'float'), axis=1) 41 | tn_per_class = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=1) 42 | fp_per_class = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=1) 43 | fn_per_class = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=1) 44 | 45 | p_per_class = tp_per_class / (tp_per_class + fp_per_class + K.epsilon()) 46 | r_per_class = tp_per_class / (tp_per_class + fn_per_class + K.epsilon()) 47 | 48 | f1_per_class = 2*p_per_class*r_per_class / (p_per_class+r_per_class+K.epsilon()) 49 | f1_total= K.sum(f1_per_class*K.sum(y_true,axis=1))/ K.sum(y_true) 50 | 51 | return f1_total 52 | 53 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/readme.md: -------------------------------------------------------------------------------- 1 | The evaluation metrics used to evaluate the classification models 2 | -------------------------------------------------------------------------------- /Deep 
Learning/Classification/Melenoma_Classification/loading and storing/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 17:23:51 2021 4 | 5 | @author: youss 6 | """ 7 | 8 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/loading and storing/loading_images.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Mar 24 02:40:27 2021 4 | 5 | @author: youss 6 | """ 7 | import cv2 8 | import os 9 | 10 | 11 | def load_images_from_folder(folder,width,height): 12 | images = [] 13 | i=0 14 | 15 | for filename in os.listdir(folder): 16 | img = cv2.imread(os.path.join(folder,filename)) 17 | if img is not None: # check before resizing; cv2.imread returns None on failure 18 | img=cv2.resize(img,(width,height)) 19 | images.append(img) 20 | i=i+1 21 | return images 22 | 23 | 24 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/loading and storing/loading_storing_h5py.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 16:55:49 2021 4 | 5 | @author: youss 6 | """ 7 | import numpy as np 8 | import h5py 9 | import os 10 | 11 | 12 | def storing_h5py(input_data,hdf5_dir):​ 13 | for i in range (len(input_data)): 14 | image_id=i 15 | image= input_data[i] 16 | file = h5py.File(os.path.join(hdf5_dir,str(image_id)+'.h5'), "w") 17 | dataset = file.create_dataset("image", np.shape(image), h5py.h5t.STD_U8BE, data=image) 18 | file.close() 19 | 20 | def read_h5py(hdf5_dir,num_images): 21 | images=[] 22 | for i in range(num_images): 23 | image_id=i 24 | file = h5py.File(os.path.join(hdf5_dir,str(image_id)+'.h5'), "r") 25 | image = np.array(file["/image"]).astype("uint8") 26 | images.append(image) 27 | 
return images 28 | 29 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/loading and storing/readme.md: -------------------------------------------------------------------------------- 1 | Loading the inputs and storing the output 2 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 16:54:05 2021 4 | 5 | @author: youssef Hosni 6 | """ 7 | import pandas as pd 8 | import numpy as np 9 | import sys 10 | 11 | import tensorflow.compat.v1 as tf 12 | 13 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/loading and storing') 14 | from loading_images import load_images_from_folder 15 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/preprocessing') 16 | from exploration import bar_plot,class_counts_proportions 17 | from preprocessing import splitting_normalization 18 | from preprocessing import splitting_classes 19 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models') 20 | from main import select_CNN_model 21 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models') 22 | from training import training_model 23 | 24 | 25 | print('Using:') 26 | print('\t\u2022 TensorFlow version:', tf.__version__) 27 | print('\t\u2022 tf.keras version:', tf.keras.__version__) 28 | print('\t\u2022 Running on GPU' if tf.config.list_physical_devices('GPU') else '\t\u2022 GPU device not found. 
Running on CPU') 29 | 30 | #%% Loading the png data and splitting and normalizing it 31 | images_dir = "D:\\work & study\\Nawah\\Datasets\\ISIC_2019_Training_Input\\ISIC_2019_Training_Input" 32 | width = 224 33 | height = 224 34 | input_data = load_images_from_folder(images_dir,width,height) 35 | #%% preprocessing 36 | labels = pd.read_csv("D:/work & study/Nawah/Datasets/ISIC_2019_Training_GroundTruth.csv") 37 | labels=labels.iloc[:,1:] 38 | labels.head() 39 | #%% splitting the data and normalizing it 40 | train_data,train_labels, val_data,val_labels,test_data,test_labels = splitting_normalization( 41 | input_data, 42 | labels 43 | ) 44 | 45 | #%% dividing the data into two datasets: one with two classes and one with 8 classes 46 | [train_data_small_classes, 47 | train_labels_small_classes, 48 | train_data_labels_two_classes] = splitting_classes(train_data,train_labels) 49 | 50 | [val_data_small_classes, 51 | val_labels_small_classes, 52 | val_data_labels_two_classes] = splitting_classes(val_data,val_labels) 53 | 54 | [test_data_small_classes, 55 | test_labels_small_classes, 56 | test_data_labels_two_classes] = splitting_classes(test_data,test_labels) 57 | 58 | 59 | #%% undersampling the data 60 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/preprocessing') 61 | from preprocessing import resampling 62 | resampling_stragey = {1:4500} 63 | resampled_training_data, resampled_labels = resampling( 64 | train_data, 65 | train_labels, 66 | 'under_sampling', 67 | resampling_stragey 68 | ) 69 | resampled_training_data = resampled_training_data.reshape( 70 | resampled_training_data.shape[0], 71 | 224, 72 | 224, 73 | 3 74 | ) 75 | resampled_labels=pd.DataFrame(resampled_labels) 76 | resampled_labels.columns = train_labels.columns[0:-1] 77 | resampled_labels['UNK'] = 0 78 | 79 | #%% Oversampling the data 80 | resampling_stragey = {2:2000,4:1700,5:1700,6:2000} 81 | resampled_training_data_small, resampled_labels_small= resampling( 82 | 
train_data_small_classes,train_labels_small_classes, 83 | 'over_sampling',resampling_stragey 84 | ) 85 | resampled_training_data_small = resampled_training_data_small.reshape( 86 | resampled_training_data_small.shape[0], 87 | 224,224,3 88 | ) 89 | resampled_labels_small = pd.DataFrame(resampled_labels_small) 90 | resampled_labels_small.columns=train_labels_small_classes.columns[0:-1] 91 | resampled_labels_small['UNK']=0 92 | #%% Building the selected CNN model and training it 93 | models_name_list = [ 94 | 'simple_CNN', 95 | 'MobileNet', 96 | 'VGG-16', 97 | 'Inception-v3', 98 | 'InceptionResNetV2' 99 | ] 100 | model_name=models_name_list[3] 101 | is_trainable=False 102 | epoch_num=100 103 | batch_num=32 104 | evaluation_metrics_list=[ 105 | 'accuracy', 106 | 'f1', 107 | 'f1_micro' 108 | ] 109 | evaluation_metric = evaluation_metrics_list[2] 110 | model = select_CNN_model(model_name,8,is_trainable, np.shape(resampled_training_data_small)) 111 | training_model(model,resampled_training_data_small,resampled_labels_small,val_data_small_classes,val_labels_small_classes,test_data_small_classes, 112 | test_labels_small_classes,epoch_num,batch_num,8,evaluation_metric) 113 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/preprocessing/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Sun Mar 21 06:45:28 2021 4 | 5 | @author: youss 6 | """ 7 | 8 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/preprocessing/exploration.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Sun Mar 21 04:05:02 2021 4 | 5 | @author: youss 6 | """ 7 | import pandas as pd 8 | import matplotlib.pyplot as plt 9 | 10 | def bar_plot(input_data,title): 11 | 
12 | df=pd.DataFrame() 13 | df["disease"]=input_data.columns 14 | number_cases_per_class=input_data.sum() 15 | df["number_of_cases"]=number_cases_per_class.values 16 | plt.figure() 17 | plt.bar(df["disease"],df["number_of_cases"]) 18 | plt.title(title) 19 | 20 | def class_counts_proportions(labels): 21 | df=pd.DataFrame() 22 | df["Label"]=labels.columns 23 | number_cases_per_class=labels.sum() 24 | df["number_of_cases_each_class"]=number_cases_per_class.values 25 | df["percentage_of_classes"]=number_cases_per_class.values/sum(number_cases_per_class.values) 26 | return df 27 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/preprocessing/preprocessing.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Mar 16 22:20:13 2021 4 | @author: youss 5 | """ 6 | 7 | import numpy as np 8 | import time 9 | import cv2 10 | 11 | from multiprocessing.dummy import Pool 12 | from multiprocessing.sharedctypes import Value 13 | from ctypes import c_int 14 | from sklearn.model_selection import train_test_split 15 | from imblearn.over_sampling import SMOTE, RandomOverSampler 16 | from imblearn.under_sampling import RandomUnderSampler, TomekLinks,NearMiss 17 | from imblearn.under_sampling import OneSidedSelection 18 | 19 | def splitting_normalization(input_data, labels): 20 | train_data,val_data,train_labels,val_labels = train_test_split(input_data,labels, test_size=0.3, random_state=42) 21 | val_data,test_data,val_labels,test_labels = train_test_split(val_data,val_labels, test_size=0.33, random_state=42) 22 | 23 | # converting the image lists to numpy arrays 24 | train_data=np.array(train_data) 25 | val_data=np.array(val_data) 26 | test_data=np.array(test_data) 27 | 28 | train_data = train_data.astype('float32') 29 | val_data = val_data.astype('float32') 30 | test_data = 
test_data.astype('float32') 31 | print(train_data.shape) 32 | print(val_data.shape) 33 | print(test_data.shape) 34 | 35 | # normalizing the data to help with the training 36 | train_data /= 255 37 | val_data /= 255 38 | test_data /= 255 39 | return train_data,train_labels, val_data,val_labels,test_data,test_labels 40 | 41 | 42 | 43 | def splitting_classes(input_data,input_labels): 44 | 45 | """ 46 | Parameters 47 | ---------- 48 | input_data : Array 49 | The input data with all classes . 50 | input_labels : DataFrame 51 | The input labels of data with all classes. 52 | 53 | Returns 54 | ------- 55 | ouput_data_small_classes : Array 56 | The input data with all classes . 57 | output_labels_small_classes : DataFrame 58 | The input labels of data with all classes. 59 | output_labels_two_classes : DataFrame 60 | The input labels of data with all classes. 61 | 62 | """ 63 | input_labels.reset_index(drop=True,inplace=True) 64 | output_labels_small_classes=input_labels[input_labels['NV']==0] 65 | output_labels_small_classes.drop(columns='NV',inplace=True) 66 | ouput_data_small_classes=input_data[output_labels_small_classes.index.values] 67 | 68 | 69 | output_labels_two_classes=input_labels.copy() 70 | output_labels_two_classes['other_classes']=0 71 | output_labels_two_classes.iloc[output_labels_small_classes.index.values,9]=1 72 | 73 | labels_to_drop_index=[0,2,3,4,5,6,7,8] 74 | output_labels_two_classes.drop(columns=output_labels_two_classes.columns[labels_to_drop_index],inplace=True) 75 | 76 | return ouput_data_small_classes, output_labels_small_classes,output_labels_two_classes 77 | 78 | def resizing_data(input_data,width, height): 79 | resized_data=[] 80 | def read_imagecv2(img, counter): 81 | img = cv2.resize(img, (width, height)) 82 | resized_data.append(img) 83 | with counter.get_lock(): #processing pools give no way to check up on progress, so we make our own 84 | counter.value += 1 85 | 86 | # start 4 worker processes 87 | with Pool(processes=2) as pool: #this 
should be the same as your processor cores (or less) 88 | counter = Value(c_int, 0) #using sharedctypes with mp.dummy isn't needed anymore, but we already wrote the code once... 89 | chunksize = 4 #making this larger might improve speed (less important the longer a single function call takes) 90 | resized_test_data = pool.starmap_async(read_imagecv2, ((img, counter) for img in input_data) , chunksize) #how many jobs to submit to each worker at once 91 | while not resized_test_data.ready(): #print out progress to indicate program is still working. 92 | #with counter.get_lock(): #you could lock here but you're not modifying the value, so nothing bad will happen if a write occurs simultaneously 93 | #just don't `time.sleep()` while you're holding the lock 94 | print("\rcompleted {} images ".format(counter.value), end='') 95 | time.sleep(.5) 96 | print('\nCompleted all images') 97 | return resized_data 98 | 99 | 100 | 101 | def resampling(train_data,train_labels,resampling_type,resampling_stragey): 102 | train_data_new=np.reshape(train_data,(train_data.shape[0],train_data.shape[1]*train_data.shape[2]*train_data.shape[3])) 103 | if resampling_type == 'SMOTE': 104 | train_data_resampled,train_labels_resampled = SMOTE(random_state=42).fit_resample(train_data_new, train_labels.values) 105 | 106 | elif resampling_type=='over_sampling': 107 | over_sampler=RandomOverSampler(sampling_strategy=resampling_stragey) 108 | train_data_resampled, train_labels_resampled = over_sampler.fit_resample(train_data_new,train_labels.values) 109 | 110 | elif resampling_type== 'under_sampling': 111 | under_sampler=RandomUnderSampler(sampling_strategy=resampling_stragey) 112 | train_data_resampled, train_labels_resampled = under_sampler.fit_resample(train_data_new,train_labels.values) 113 | 114 | elif resampling_type == 'tomelinks': 115 | t1= TomekLinks( sampling_strategy=resampling_stragey) 116 | train_data_resampled, train_labels_resampled = t1.fit_resample(train_data_new,train_labels.values ) 
117 | 118 | elif resampling_type=='near_miss_neighbors': 119 | undersample = NearMiss(version=1, n_neighbors=3) 120 | train_data_resampled, train_labels_resampled = undersample.fit_resample(train_data_new,train_labels.values ) 121 | 122 | elif resampling_type=='one_sided_selection': 123 | undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200) 124 | train_data_resampled, train_labels_resampled = undersample.fit_resample(train_data_new,train_labels.values ) 125 | 126 | return train_data_resampled, train_labels_resampled 127 | 128 | 129 | 130 | 131 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/preprocessing/readme.md: -------------------------------------------------------------------------------- 1 | The preprocessing for the data 2 | -------------------------------------------------------------------------------- /Deep Learning/Classification/Melenoma_Classification/readme.md: -------------------------------------------------------------------------------- 1 | # Melenoma Classification 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/AlexNet.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | from keras.layers import LeakyReLU 12 | import os 13 | import CNN_feature_extractor 14 | import model_evaluation 15 | from sklearn.model_selection import train_test_split 16 | 17 | def 
model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 18 | 19 | 20 | 21 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 22 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 23 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 24 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 25 | batch_size=round(train_data.shape[0]/batch_size_factor) 26 | 27 | #Instantiate an empty model 28 | classification_model = Sequential() 29 | 30 | # 1st Convolutional Layer 31 | classification_model.add(Conv2D(filters=96, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]), kernel_size=(11,11), strides=(4,4), padding='valid', activation='tanh')) 32 | #classification_model.add(LeakyReLU(alpha=0.01)) 33 | 34 | # Max Pooling 35 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid')) 36 | classification_model.add(Dropout(0.25)) 37 | 38 | # 2nd Convolutional Layer 39 | classification_model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid',activation='sigmoid')) 40 | #classification_model.add(LeakyReLU(alpha=0.01)) 41 | 42 | # Max Pooling 43 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid')) 44 | classification_model.add(Dropout(0.25)) 45 | 46 | # 3rd Convolutional Layer 47 | classification_model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid')) 48 | #classification_model.add(LeakyReLU(alpha=0.01)) 49 | 50 | # 4th Convolutional Layer 51 | classification_model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid')) 52 | #classification_model.add(LeakyReLU(alpha=0.01)) 53 | 54 | 55 | # 5th Convolutional 
Layer 56 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid')) 57 | #classification_model.add(LeakyReLU(alpha=0.01)) 58 | 59 | # Max Pooling 60 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid')) 61 | classification_model.add(Dropout(0.25)) 62 | 63 | # Passing it to a Fully Connected layer 64 | classification_model.add(Flatten()) 65 | # 1st Fully Connected Layer 66 | classification_model.add(Dense(4096,activation='sigmoid')) 67 | #classification_model.add(LeakyReLU(alpha=0.01)) 68 | 69 | 70 | # Add Dropout to prevent overfitting 71 | classification_model.add(Dropout(0.5)) 72 | 73 | # 2nd Fully Connected Layer 74 | classification_model.add(Dense(4096,activation='sigmoid')) 75 | 76 | # Add Dropout 77 | classification_model.add(Dropout(0.5)) 78 | 79 | # 3rd Fully Connected Layer 80 | classification_model.add(Dense(1000,activation='sigmoid')) 81 | #classification_model.add(LeakyReLU(alpha=0.01)) 82 | 83 | # Add Dropout 84 | classification_model.add(Dropout(0.5)) 85 | 86 | # Output Layer 87 | classification_model.add(Dense(num_classes,activation='softmax')) 88 | 89 | classification_model.summary() 90 | 91 | # Compile the classification_model 92 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy']) 93 | 94 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 95 | min_delta=0, 96 | patience=5000, 97 | verbose=1, mode='auto') 98 | 99 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 100 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 101 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es]) 102 | best_model=load_model(os.path.join(result_path,'best_model.h5')) 103 | file_name=os.path.split(result_path)[1] 104 | 
date=os.path.split(os.path.split(result_path)[0])[1] 105 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'AlexNet_model.h5')) 106 | 107 | 108 | if feature_extraction==1: 109 | feature_extractor_parameters['CNN_model']=classification_model 110 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 111 | return 112 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'Alexnet', result_path,epoch) 113 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/DenseNet121.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | 16 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 17 | 18 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 19 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 20 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 21 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 22 | ''' 23 | 
train_data=data_preprocessing.depth_reshapeing(train_data) 24 | test_data=data_preprocessing.depth_reshapeing(test_data) 25 | valid_data=data_preprocessing.depth_reshapeing(valid_data) 26 | 27 | train_data = data_preprocessing.size_editing(train_data, 224) 28 | valid_data= data_preprocessing.size_editing(valid_data, 224) 29 | test_data = data_preprocessing.size_editing(test_data, 224) 30 | ''' 31 | batch_size=round(train_data.shape[0]/batch_size_factor) 32 | input_shape= (224,224,3) 33 | densenet121_model=keras.applications.densenet.DenseNet121(include_top=False, weights='imagenet', input_shape=input_shape, pooling=None, classes=num_classes) 34 | 35 | 36 | 37 | layer_dict = dict([(layer.name, layer) for layer in densenet121_model.layers]) 38 | # Getting the output tensor of the last DenseNet layer that we want to include 39 | x = layer_dict[list(layer_dict.keys())[-1]].output 40 | x = MaxPooling2D(pool_size=(2, 2))(x) 41 | 42 | x = Flatten()(x) 43 | x = Dense(4096, activation='relu')(x) 44 | x = Dropout(0.5)(x) 45 | x = Dense(4096, activation='relu')(x) 46 | x = Dropout(0.5)(x) 47 | x = Dense(num_classes, activation='softmax')(x) 48 | classification_model = Model(inputs=densenet121_model.input, outputs=x) 49 | 50 | 51 | for layer in classification_model.layers: 52 | layer.trainable = True 53 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy']) 54 | es = keras.callbacks.EarlyStopping(monitor='val_acc', 55 | min_delta=0, 56 | patience=500, 57 | verbose=1, mode='auto') 58 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto', 59 | save_best_only=True) 60 | 61 | 62 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc ]) 64 | file_name=os.path.split(result_path)[1] 65 | date=os.path.split(os.path.split(result_path)[0])[1] 66 | 
classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'DenseNet121.h5')) 67 | #best_model=load_model(os.path.join(result_path,'best_model.h5')) 68 | best_model=classification_model 69 | if feature_extraction==1: 70 | feature_extractor_parameters['CNN_model']=classification_model 71 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 72 | return 73 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'DenseNet121', result_path,epoch) 74 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/InceptionResNetV2.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | 16 | 17 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 18 | 19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 20 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 21 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 22 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 23 | 24 | 
''' 25 | train_data=data_preprocessing.depth_reshapeing(train_data) 26 | test_data=data_preprocessing.depth_reshapeing(test_data) 27 | valid_data=data_preprocessing.depth_reshapeing(valid_data) 28 | 29 | train_data = data_preprocessing.size_editing(train_data, 224) 30 | valid_data= data_preprocessing.size_editing(valid_data, 224) 31 | test_data = data_preprocessing.size_editing(test_data, 224) 32 | ''' 33 | input_shape= (224,224,3) 34 | batch_size=round(train_data.shape[0]/batch_size_factor) 35 | inception_model=keras.applications.inception_resnet_v2.InceptionResNetV2(include_top=False, weights='imagenet',input_shape=input_shape, pooling=None) 36 | 37 | 38 | layer_dict = dict([(layer.name, layer) for layer in inception_model.layers]) 39 | # Getting the output tensor of the last InceptionResNetV2 layer that we want to include 40 | x = layer_dict[list(layer_dict.keys())[-1]].output 41 | x = MaxPooling2D(pool_size=(2, 2))(x) 42 | 43 | x = Flatten()(x) 44 | x = Dense(4096, activation='relu')(x) 45 | x = Dropout(0.5)(x) 46 | x = Dense(4096, activation='relu')(x) 47 | x = Dropout(0.5)(x) 48 | x = Dense(num_classes, activation='softmax')(x) 49 | classification_model = Model(inputs=inception_model.input, outputs=x) 50 | 51 | 52 | for layer in classification_model.layers: 53 | layer.trainable = True 54 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy']) 55 | 56 | es = keras.callbacks.EarlyStopping(monitor='val_acc', 57 | min_delta=0, 58 | patience=500, 59 | verbose=1, mode='auto') 60 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto', 61 | save_best_only=True) 62 | 63 | 64 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 65 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc ]) 66 | 67 | #best_model=load_model(os.path.join(result_path,'best_model.h5')) 68 | best_model=classification_model 69 |
file_name=os.path.split(result_path)[1] 70 | date=os.path.split(os.path.split(result_path)[0])[1] 71 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'InceptionResNet_model.h5')) 72 | 73 | if feature_extraction==1: 74 | feature_extractor_parameters['CNN_model']=classification_model 75 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 76 | return 77 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'InceptionResNetV2', result_path,epoch) 78 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/LeNet.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | from keras.layers import LeakyReLU 12 | import os 13 | import CNN_feature_extractor 14 | import model_evaluation 15 | from sklearn.model_selection import train_test_split 16 | 17 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 18 | 19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 20 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 21 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 22 | 
test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 23 | batch_size=round(train_data.shape[0]/batch_size_factor) 24 | classification_model = Sequential() 25 | # C1 Convolutional Layer 26 | classification_model.add(layers.Conv2D(6, kernel_size=(5, 5), strides=(1, 1),activation='relu', input_shape=(train_data.shape[1], train_data.shape[2], train_data.shape[3]), padding='same')) 27 | 28 | classification_model.add(Dropout(0.7)) 29 | 30 | # S2 Pooling Layer 31 | classification_model.add(layers.AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid')) 32 | # C3 Convolutional Layer 33 | classification_model.add(layers.Conv2D(16, kernel_size=(5, 5), strides=(1, 1), padding='valid',activation='relu')) 34 | 35 | classification_model.add(Dropout(0.7)) 36 | 37 | # S4 Pooling Layer 38 | classification_model.add(layers.AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')) 39 | # C5 Fully Connected Convolutional Layer 40 | classification_model.add(layers.Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid',activation='relu')) 41 | 42 | classification_model.add(Dropout(0.8)) 43 | #Flatten the CNN output so that we can connect it with fully connected layers 44 | classification_model.add(layers.Flatten()) 45 | # FC6 Fully Connected Layer 46 | classification_model.add(layers.Dense(84,activation='relu')) 47 | 48 | classification_model.add(Dropout(0.8)) 49 | 50 | #Output Layer with softmax activation 51 | classification_model.add(layers.Dense(num_classes, activation='softmax')) 52 | 53 | 54 | classification_model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy']) 55 | 56 | 57 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 58 | min_delta=0, 59 | patience=100, 60 | verbose=1, mode='auto') 61 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 62 | classification_train = classification_model.fit(train_data, 
train_labels_one_hot, batch_size=batch_size, epochs=epoch, 63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc]) 64 | best_model=load_model(os.path.join(result_path,'best_model.h5')) 65 | file_name=os.path.split(result_path)[1] 66 | date=os.path.split(os.path.split(result_path)[0])[1] 67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'leNet_model.h5')) 68 | if feature_extraction==1: 69 | feature_extractor_parameters['CNN_model']=classification_model 70 | CNN_feature_extractor.CNN_feature_extraction_classsification(train_data_whole,train_labels_whole,test_data,test_labels,feature_extractor_parameters,result_path) 71 | return 72 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,'LeNet',result_path,epoch) -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/ResNet50.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 16 | 17 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 18 | 
train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 19 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 20 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 21 | 22 | ''' 23 | train_data=data_preprocessing.depth_reshapeing(train_data) 24 | test_data=data_preprocessing.depth_reshapeing(test_data) 25 | valid_data=data_preprocessing.depth_reshapeing(valid_data) 26 | 27 | train_data = data_preprocessing.size_editing(train_data, 224) 28 | valid_data= data_preprocessing.size_editing(valid_data, 224) 29 | test_data = data_preprocessing.size_editing(test_data, 224) 30 | ''' 31 | batch_size=round(train_data.shape[0]/batch_size_factor) 32 | input_shape= (224,224,3) 33 | resnet50_model=keras.applications.resnet50.ResNet50(include_top=False, weights='imagenet', input_shape=input_shape, pooling=None) 34 | 35 | layer_dict = dict([(layer.name, layer) for layer in resnet50_model.layers]) 36 | #print(layer_dict) 37 | # Getting the output tensor of the last ResNet50 layer that we want to include 38 | x = layer_dict[list(layer_dict.keys())[-1]].output 39 | x = MaxPooling2D(pool_size=(2, 2))(x) 40 | 41 | x = Flatten()(x) 42 | x = Dense(4096, activation='relu')(x) 43 | x = Dropout(0.5)(x) 44 | x = Dense(4096, activation='relu')(x) 45 | x = Dropout(0.5)(x) 46 | x = Dense(num_classes, activation='softmax')(x) 47 | classification_model = Model(inputs=resnet50_model.input, outputs=x) 48 | 49 | for layer in classification_model.layers[:len(list(layer_dict.keys()))-50]: 50 | layer.trainable = False 51 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy']) 52 | es = keras.callbacks.EarlyStopping(monitor='val_acc', 53 | min_delta=0, 54 | patience=500, 55 | verbose=1, mode='auto') 56 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto', 57 | save_best_only=True) 58 | 59 | classification_train = classification_model.fit(train_data,
train_labels_one_hot, batch_size=batch_size, epochs=epoch, 60 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc]) 61 | #best_model=load_model(os.path.join(result_path,'best_model.h5')) 62 | best_model=classification_model 63 | file_name=os.path.split(result_path)[1] 64 | date=os.path.split(os.path.split(result_path)[0])[1] 65 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'ResNet50_model.h5')) 66 | 67 | if feature_extraction==1: 68 | feature_extractor_parameters['CNN_model']=classification_model 69 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 70 | return 71 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'ResNet50', result_path,epoch) 72 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/VGG.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 16 | 17 | 18 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 19 | 
train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 20 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 21 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 22 | 23 | 24 | batch_size=round(train_data.shape[0]/batch_size_factor) 25 | 26 | #Instantiate an empty model 27 | classification_model = Sequential() 28 | 29 | # 1st Convolutional Layer 30 | classification_model.add(Conv2D(filters=64, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]),activation='relu', kernel_size=(3,3), strides=(1,1), padding='same')) 31 | classification_model.add(Conv2D(filters=64,activation='relu', kernel_size=(3,3), strides=(1,1), padding='same')) 32 | classification_model.add(Dropout(0.4)) 33 | # Max Pooling 34 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 35 | 36 | # 2nd Convolutional Layer 37 | classification_model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 38 | classification_model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 39 | classification_model.add(Dropout(0.4)) 40 | # Max Pooling 41 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 42 | 43 | # 3rd Convolutional Layer 44 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 45 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 46 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 47 | classification_model.add(Dropout(0.4)) 48 | # Max Pooling 49 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2))) 50 | 51 | # 4th Convolutional Layer 52 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 53 | 
classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 54 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 55 | classification_model.add(Dropout(0.4)) 56 | # 5th Convolutional Layer 57 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 58 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 59 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu')) 60 | classification_model.add(Dropout(0.4)) 61 | 62 | # Max Pooling 63 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid')) 64 | 65 | # Passing it to a Fully Connected layer 66 | classification_model.add(Flatten()) 67 | # 1st Fully Connected Layer 68 | classification_model.add(Dense(4096,activation='relu')) 69 | # Add Dropout to prevent overfitting 70 | classification_model.add(Dropout(0.5)) 71 | 72 | # 2nd Fully Connected Layer 73 | classification_model.add(Dense(4096,activation='relu')) 74 | 75 | # Add Dropout 76 | classification_model.add(Dropout(0.5)) 77 | # Output Layer 78 | classification_model.add(Dense(num_classes,activation='softmax')) 79 | classification_model.summary() 80 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy']) 81 | 82 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 83 | min_delta=0, 84 | patience=1000, 85 | verbose=1, mode='auto') 86 | 87 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 88 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 89 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es]) 90 | 
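Every model() in these scripts derives the batch size from `batch_size_factor` instead of taking it directly; a minimal sketch of that convention (the array shape below is illustrative, not a real fMRI volume):

```python
import numpy as np

# batch_size_factor is roughly the number of batches per epoch, so the
# batch size is the training-set size divided by that factor.
train_data = np.zeros((120, 224, 224, 3))  # placeholder stand-in for the loaded volumes
batch_size_factor = 4
batch_size = round(train_data.shape[0] / batch_size_factor)  # 120 / 4 -> 30
```

Note that for very small factors this can produce a single large batch, which interacts with the aggressive EarlyStopping patience values used above.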
best_model=load_model(os.path.join(result_path,'best_model.h5')) 91 | file_name=os.path.split(result_path)[1] 92 | date=os.path.split(os.path.split(result_path)[0])[1] 93 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'VGG_model.h5')) 94 | 95 | if feature_extraction==1: 96 | feature_extractor_parameters['CNN_model']=classification_model 97 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 98 | return 99 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'VGG', result_path,epoch) 100 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/VGG_pretrained.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | from keras.applications.vgg16 import VGG16 16 | 17 | 18 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 19 | 20 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 21 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 22 | 
valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 23 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 24 | batch_size=round(train_data.shape[0]/batch_size_factor) 25 | 26 | input_shape= (224,224,3) 27 | vgg_model = VGG16(weights='imagenet', 28 | include_top=False, 29 | input_shape=input_shape) 30 | # Creating dictionary that maps layer names to the layers 31 | layer_dict = dict([(layer.name, layer) for layer in vgg_model.layers]) 32 | # Getting output tensor of the last VGG layer that we want to include 33 | x = layer_dict['block2_pool'].output 34 | 35 | x = Conv2D(filters=64, kernel_size=(3, 3), activation='relu',padding='same')(x) 36 | x = MaxPooling2D(pool_size=(2, 2))(x) 37 | x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu',padding='same')(x) 38 | x = MaxPooling2D(pool_size=(2, 2))(x) 39 | x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu',padding='same')(x) 40 | x = MaxPooling2D(pool_size=(2, 2))(x) 41 | 42 | x = Flatten()(x) 43 | x = Dense(4096, activation='relu')(x) 44 | x = Dropout(0.5)(x) 45 | x = Dense(4096, activation='relu')(x) 46 | x = Dropout(0.5)(x) 47 | x = Dense(num_classes, activation='softmax')(x)  # use the num_classes argument rather than a hard-coded 2 48 | 49 | # Creating new model. Please note that this is NOT a Sequential() model.
50 | classification_model = Model(inputs=vgg_model.input, outputs=x) 51 | for layer in classification_model.layers[:7]: 52 | layer.trainable = True  # NB: layers are trainable by default, so this is a no-op; set False here to freeze the first VGG blocks 53 | 54 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 55 | min_delta=0, 56 | patience=5000, 57 | verbose=1, mode='auto') 58 | 59 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 60 | 61 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy']) 62 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es]) 64 | best_model=load_model(os.path.join(result_path,'best_model.h5')) 65 | file_name=os.path.split(result_path)[1] 66 | date=os.path.split(os.path.split(result_path)[0])[1] 67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'VGGpretrained_model.h5')) 68 | 69 | if feature_extraction==1: 70 | feature_extractor_parameters['CNN_model']=classification_model 71 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 72 | return 73 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'VGG_pretrained', result_path,epoch) 74 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/ZFNet.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten 11 | import os 12 | import CNN_feature_extractor 13 | import model_evaluation 14 | from sklearn.model_selection import train_test_split 15 | 16 | 17 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 18 | 19 | train_data_whole = data_preprocessing.size_editing(train_data_whole, 224) 20 | test_data = data_preprocessing.size_editing(test_data, 224) 21 | 22 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13) 23 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 24 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 25 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 26 | 27 | 28 | 29 | batch_size=round(train_data.shape[0]/batch_size_factor) 30 | 31 | #Instantiate an empty model 32 | classification_model = Sequential() 33 | 34 | # 1st Convolutional Layer 35 | classification_model.add(Conv2D(filters=96, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]),activation='relu', kernel_size=(7,7), strides=(2,2), padding='valid')) 36 | 37 | # Max Pooling 38 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid')) 39 | 40 | # 2nd Convolutional Layer 41 | classification_model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid',activation='relu')) 42 | # Max Pooling 43 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid')) 44 | 45 | # 3rd Convolutional Layer 46 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='valid',activation='relu')) 47 | 48 | # 4th Convolutional Layer 49 | classification_model.add(Conv2D(filters=1024, kernel_size=(3,3), strides=(1,1), 
padding='valid',activation='relu')) 50 | 51 | 52 | # 5th Convolutional Layer 53 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='valid',activation='relu')) 54 | # Max Pooling 55 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid')) 56 | 57 | # Passing it to a Fully Connected layer 58 | classification_model.add(Flatten()) 59 | # 1st Fully Connected Layer 60 | classification_model.add(Dense(4096,activation='relu'))  # input_shape is ignored on a non-first layer, so it is omitted here 61 | # Add Dropout to prevent overfitting 62 | classification_model.add(Dropout(0.4)) 63 | 64 | # 2nd Fully Connected Layer 65 | classification_model.add(Dense(4096,activation='relu')) 66 | 67 | # Add Dropout 68 | classification_model.add(Dropout(0.4)) 69 | 70 | # 3rd Fully Connected Layer 71 | classification_model.add(Dense(1000,activation='relu')) 72 | # Add Dropout 73 | classification_model.add(Dropout(0.4)) 74 | 75 | # Output Layer 76 | classification_model.add(Dense(num_classes,activation='softmax')) 77 | 78 | classification_model.summary() 79 | 80 | # Compile the classification_model 81 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy']) 82 | 83 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 84 | min_delta=0, 85 | patience=100, 86 | verbose=1, mode='auto') 87 | 88 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 89 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 90 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es]) 91 | 92 | best_model=load_model(os.path.join(result_path,'best_model.h5')) 93 | file_name=os.path.split(result_path)[1] 94 | date=os.path.split(os.path.split(result_path)[0])[1] 95 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'ZNet_model.h5')) 96 | 97 | if
feature_extraction==1: 98 | feature_extractor_parameters['CNN_model']=classification_model 99 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path) 100 | return 101 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'ZFNet', result_path,epoch) 102 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/optimizers.py: -------------------------------------------------------------------------------- 1 | import keras 2 | from keras import optimizers 3 | import sys 4 | 5 | 6 | def choosing(optimizer): 7 | if optimizer=='adam': 8 | opt=keras.optimizers.Adam(lr=0.000001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) 9 | elif optimizer=='adamax': 10 | opt= keras.optimizers.Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0) 11 | elif optimizer=='Nadam': 12 | opt= keras.optimizers.Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004) 13 | elif optimizer=='adadelta': 14 | opt= optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0) 15 | elif optimizer=='adagrad': 16 | opt= keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0) 17 | elif optimizer=='sgd': 18 | opt = optimizers.SGD(lr=0.01, clipnorm=1.)
19 | elif optimizer=='RMSprop': 20 | opt=keras.optimizers.RMSprop(lr=0.0006, rho=0.9, epsilon=None, decay=0.0) 21 | elif optimizer is None: 22 | return 23 | else: 24 | print('ValueError: optimizer took an unexpected value') 25 | sys.exit() 26 | 27 | return opt -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/simple_model.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import keras 3 | from keras.models import Sequential 4 | from keras.layers import Dropout 5 | from keras.callbacks import ModelCheckpoint 6 | from keras.models import load_model 7 | from keras import layers 8 | from keras.layers import Conv2D, MaxPooling2D 9 | from keras.models import Sequential, Input, Model 10 | from keras.layers import Dense, Dropout, Flatten 11 | from keras.layers import LeakyReLU 12 | import os 13 | import CNN_feature_extractor 14 | import model_evaluation 15 | from sklearn.model_selection import train_test_split 16 | 17 | 18 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters): 19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.2, random_state=13) 20 | 21 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels) 22 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels) 23 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels) 24 |
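`labels_convert_one_hot` lives in the project's `data_preprocessing` module, which is not included in this chunk; a minimal numpy sketch of what such a helper presumably does (the function name and exact behavior are assumptions, modeled on `keras.utils.to_categorical`):

```python
import numpy as np

def labels_to_one_hot(labels, num_classes=None):
    # Hypothetical stand-in for data_preprocessing.labels_convert_one_hot:
    # map integer class labels to one-hot rows.
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    one_hot = np.zeros((labels.shape[0], num_classes), dtype=float)
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot
```

With binary Con/AD labels this yields two columns, which is why model_evaluation.py can recover the integer labels from column 1 of the one-hot array.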
batch_size=round(train_data.shape[0]/batch_size_factor) 25 | classification_model = Sequential() 26 | classification_model.add(Conv2D(32, kernel_size=(3, 3), padding='same',activation='relu', 27 | input_shape=(train_data.shape[1], train_data.shape[2], train_data.shape[3]))) 28 | #classification_model.add(BatchNormalization()) 29 | 30 | classification_model.add(MaxPooling2D((2, 2), padding='same')) 31 | classification_model.add(Dropout(0.5)) 32 | classification_model.add(Conv2D(64, (3,3), padding='same')) 33 | classification_model.add(LeakyReLU(alpha=0.1)) 34 | #classification_model.add(BatchNormalization()) 35 | 36 | classification_model.add(MaxPooling2D(pool_size=(2, 2), padding='same')) 37 | classification_model.add(Dropout(0.5)) 38 | classification_model.add(Conv2D(128, (3,3), padding='same')) 39 | classification_model.add(LeakyReLU(alpha=0.1)) 40 | #classification_model.add(BatchNormalization()) 41 | classification_model.add(MaxPooling2D(pool_size=(2, 2), padding='same')) 42 | classification_model.add(Dropout(0.5)) 43 | classification_model.add(Flatten()) 44 | classification_model.add(Dense(128)) 45 | classification_model.add(LeakyReLU(alpha=0.1)) 46 | #classification_model.add(BatchNormalization()) 47 | classification_model.add(Dense(128,)) 48 | classification_model.add(Dropout(0.5)) 49 | classification_model.add(LeakyReLU(alpha=0.1)) 50 | 51 | classification_model.add(Dense(num_classes, activation='softmax')) 52 | classification_model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, 53 | metrics=['accuracy']) 54 | 55 | es=keras.callbacks.EarlyStopping(monitor='val_acc', 56 | min_delta=0, 57 | patience=50, 58 | verbose=1, mode='auto',baseline=0.9) 59 | 60 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True) 61 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch, 62 | verbose=1, validation_data=(valid_data, 
valid_labels_one_hot),callbacks=[es,mc]) 63 | best_model=load_model(os.path.join(result_path,'best_model.h5')) 64 | file_name=os.path.split(result_path)[1] 65 | date=os.path.split(os.path.split(result_path)[0])[1] 66 | 67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'simple_model.h5')) 68 | 69 | if feature_extraction==1: 70 | feature_extractor_parameters['CNN_model']=classification_model 71 | CNN_feature_extractor.CNN_feature_extraction_classsification(train_data_whole,train_labels_whole,test_data,test_labels,feature_extractor_parameters,result_path) 72 | return 73 | 74 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,'simple_architecture',result_path,epoch) 75 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/metrics.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | def auc_roc(y_true, y_pred): 4 | # any tensorflow metric 5 | value, update_op = tf.contrib.metrics.streaming_auc(y_pred, y_true) 6 | 7 | # find all variables created for this metric 8 | metric_vars = [i for i in tf.local_variables() if 'auc_roc' in i.name.split('/')[1]] 9 | 10 | # Add metric variables to GLOBAL_VARIABLES collection. 11 | # They will be initialized for new session.
12 | for v in metric_vars: 13 | tf.add_to_collection(tf.GraphKeys.GLOBAL_VARIABLES, v) 14 | 15 | # force to update metric values 16 | with tf.control_dependencies([update_op]): 17 | value = tf.identity(value) 18 | return value -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/model_evaluation.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import generate_result_ 3 | from sklearn.metrics import precision_score 4 | from sklearn.metrics import recall_score 5 | from sklearn.metrics import f1_score 6 | from sklearn.metrics import cohen_kappa_score 7 | from sklearn.metrics import roc_auc_score 8 | from sklearn.metrics import confusion_matrix 9 | 10 | 11 | def testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,model_name,results_path,epoch): 12 | #classification_model.summary() 13 | test_eval = classification_model.evaluate(test_data, test_labels_one_hot, verbose=1) 14 | prediction=classification_model.predict(test_data) 15 | predicted_classes=prediction.argmax(axis=1)  # predict_classes only exists on Sequential models, so derive classes from predict() 16 | print(predicted_classes) 17 | test_labels=test_labels_one_hot[:,1] 18 | 19 | print(test_labels_one_hot) 20 | print(prediction) 21 | print('Test loss:', test_eval[0]) 22 | print('Test accuracy:', test_eval[1]) 23 | precision = precision_score(test_labels, predicted_classes) 24 | print('Precision: %f' % precision) 25 | # recall: tp / (tp + fn) 26 | recall = recall_score(test_labels, predicted_classes) 27 | print('Recall: %f' % recall) 28 | # f1: 2 tp / (2 tp + fp + fn) 29 | f1 = f1_score(test_labels, predicted_classes) 30 | print('F1 score: %f' % f1) 31 | 32 | matrix = confusion_matrix(test_labels, predicted_classes) 33 | print(matrix) 34 | 35 | 36 | 37 | #print('AUC on test data:',test_eval[2]) 38 | print('the number of epochs',
epoch) 39 | accuracy = classification_train.history['acc'] 40 | val_accuracy = classification_train.history['val_acc'] 41 | loss = classification_train.history['loss'] 42 | val_loss = classification_train.history['val_loss'] 43 | epochs = range(len(accuracy)) 44 | plt.plot(epochs, accuracy, 'bo', label='Training accuracy') 45 | plt.plot(epochs, val_accuracy, 'b', label='Validation accuracy') 46 | plt.title('Training and validation accuracy') 47 | plt.legend() 48 | plt.figure() 49 | plt.plot(epochs, loss, 'bo', label='Training loss') 50 | plt.plot(epochs, val_loss, 'b', label='Validation loss') 51 | plt.title('Training and validation loss') 52 | plt.legend() 53 | plt.show() 54 | 55 | 56 | test_acc=best_model.evaluate(test_data,test_labels_one_hot) 57 | print('best model test accuracy is ',test_acc) 58 | generate_result_.cnn_save_result(test_eval[1],classification_model,model_name,results_path) 59 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/main.py: -------------------------------------------------------------------------------- 1 | import CNN 2 | import load_data 3 | from numpy import load 4 | import numpy as np 5 | import data_preprocessing 6 | import preprocessing_methods 7 | import generate_result_ 8 | import os 9 | from scipy.signal import resample_poly 10 | 11 | def main(): 12 | 13 | 14 | 15 | train_data_path='/data/fmri/Folder/AD_classification/Data/input_data/preprocessed_data/CV_OULU_Con_AD_preprocessed.npz' 16 | train_data_classifer = load(train_data_path)['masked_voxels'] 17 | 
train_data_path='/data/fmri/Folder/AD_classification/Data/input_data/Augmented_data/CV_OULU_Con_AD_aug.npz' 18 | train_data_CNN = load(train_data_path)['masked_voxels'] 19 | test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/CV_ADNI_Con_AD.npz' 20 | test_data_CNN = load(test_data_path)['masked_voxels'] 21 | test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/preprocessed_data/CV_ADNI_Con_AD_preprocessed.npz' 22 | test_data_classifer = load(test_data_path)['masked_voxels'] 23 | 24 | transposing_order=[3,0,2,1] 25 | train_data_CNN=data_preprocessing.transposnig(train_data_CNN,transposing_order) 26 | test_data_CNN=data_preprocessing.transposnig(test_data_CNN,transposing_order) 27 | 28 | train_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/train_labels_aug_data.npz' 29 | train_labels_CNN=load(train_labels_path)['labels'] 30 | shuffling_indicies = np.random.permutation(len(train_labels_CNN)) 31 | temp = train_data_CNN[shuffling_indicies, :, :, :] 32 | train_data_CNN=temp 33 | train_labels_CNN = train_labels_CNN[shuffling_indicies] 34 | 35 | 36 | 37 | 38 | 39 | train_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/train_labels.npz' 40 | train_labels_classifer=load(train_labels_path)['labels'] 41 | shuffling_indicies = np.random.permutation(len(train_labels_classifer)) 42 | temp = train_data_classifer[shuffling_indicies, :, :, :] 43 | train_data_classifer=temp 44 | train_labels_classifer = train_labels_classifer[shuffling_indicies] 45 | 46 | #test_data_path = load_data.find_path(test_data_file_name) 47 | #test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/CV_ADNI_Con_AD.npz' 48 | #test_data = load(test_data_path)['masked_voxels'] 49 | #test_labels_path=load_data.find_path(test_labels_file_name) 50 | test_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/test_labels.npz' 51 | test_labels=load(test_labels_path)['labels'] 52 | shuffling_indicies = 
np.random.permutation(len(test_labels)) 53 | test_data_CNN = test_data_CNN[shuffling_indicies, :, :, :] 54 | test_data_classifer = test_data_classifer[shuffling_indicies, :, :, :] 55 | 56 | test_labels = test_labels[shuffling_indicies] 57 | 58 | train_data_CNN,test_data_CNN,train_labels_CNN,test_labels=preprocessing_methods.preprocessing(train_data_CNN,test_data_CNN,train_labels_CNN,test_labels,4,0,None,None) 59 | 60 | factors=[(224,45),(224,45),(3,54)] 61 | train_data_CNN=resample_poly(train_data_CNN, factors[0][0], factors[0][1], axis=1) 62 | train_data_CNN=resample_poly(train_data_CNN, factors[1][0], factors[1][1], axis=2) 63 | #train_data_CNN=resample_poly(train_data_CNN, factors[2][0], factors[2][1], axis=3) 64 | 65 | test_data_CNN=resample_poly(test_data_CNN, factors[0][0], factors[0][1], axis=1) 66 | test_data_CNN=resample_poly(test_data_CNN, factors[1][0], factors[1][1], axis=2) 67 | #test_data_CNN=resample_poly(test_data_CNN, factors[2][0], factors[2][1], axis=3) 68 | 69 | 70 | train_CNN=0 71 | feature_extraction=1 72 | 73 | if train_CNN==1 and feature_extraction==1: 74 | line1='CNN model is trained and saved and then used as feature extractor' 75 | line2='CNN model used for feature extraction is :' 76 | elif train_CNN==1 and feature_extraction==0: 77 | line1 ='CNN model is trained and used to test the test data' 78 | line2='CNN model used is :' 79 | elif train_CNN==0 and feature_extraction==1: 80 | line1 ='using a saved model to extract features' 81 | line2='The model used is a saved model' 82 | else: 83 | raise ValueError('train_CNN and feature_extraction cannot have these values') 84 | 85 | 86 | results_directory='Results' 87 | num_classes=2 88 | epoch=1000 89 | batch_size_factor=1 90 | optimizer='adam' 91 | CNN_models=['VGG16','VGG19'] 92 | #intermedidate_layer=[7,7,7,16] 93 | hyperparameters={'dropouts':[0.25,0.5,0.5],'activation_function':['relu','relu','relu','sigmoid'],'epoch':10,'opt':'adam','penalty':'l1','C':100,'neighbors':50} 94 | 
data={'train_data':train_data_CNN,'test_data':test_data_CNN,'train_labels':train_labels_CNN,'test_labels':test_labels} 95 | preprocessing_method='method 4' 96 | 97 | for CNN_model in CNN_models: 98 | result_path = generate_result_.create_results_dir(results_directory) 99 | print(CNN_model) 100 | feature_extractor_parameters={'data':data,'hyperparameters':hyperparameters,'model_type':'pretrained','CNN_model':CNN_model,'intermediate_layer':7,'classifer_name':'all'} 101 | CNN.CNN_main(train_data_CNN,test_data_CNN,result_path,train_labels_CNN,test_labels,num_classes,epoch,batch_size_factor,optimizer,CNN_model,train_CNN,feature_extraction,feature_extractor_parameters) 102 | f = open(os.path.join(result_path, 'README'), "w+") 103 | 104 | line3=CNN_model 105 | line4='The preprocessing method used is '+preprocessing_method 106 | line5='The number of epochs used to train the CNN_model is '+str(epoch) 107 | line6='the optimizer used is '+optimizer 108 | f.write("{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" .format(line1,line2,line3,line4,line5,line6)) 109 | f.close() 110 | 111 | if __name__=='__main__': 112 | main() 113 | 114 | 115 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/data_augmentation.py: -------------------------------------------------------------------------------- 1 | import random 2 | from scipy import ndarray 3 | import skimage as sk 4 | from skimage import transform 5 | from skimage import util 6 | from sklearn.svm import SVC 7 | from sklearn.model_selection import StratifiedKFold 8 | from sklearn.feature_selection import RFECV 9 | from sklearn.datasets import make_classification 10 | from sklearn.model_selection import train_test_split 11 | #from sklearn import decomposition 12 | from sklearn.gaussian_process import GaussianProcessClassifier 13 | from sklearn.gaussian_process.kernels import RBF 14 | from 
sklearn import decomposition 15 | from sklearn.feature_selection import SelectFromModel 16 | from sklearn.svm import LinearSVC 17 | from sklearn.metrics import roc_auc_score 18 | from sklearn.metrics import f1_score 19 | from sklearn.metrics import confusion_matrix 20 | from sklearn.ensemble import RandomForestClassifier 21 | from sklearn.utils import resample 22 | from sklearn.utils import shuffle 23 | from sklearn.preprocessing import StandardScaler 24 | from sklearn.preprocessing import Normalizer 25 | from sklearn import preprocessing 26 | from scipy import ndimage 27 | import cv2 28 | import nibabel as nib 29 | import numpy as np 30 | import os 31 | import load_data 32 | 33 | def load_obj(obj): 34 | # Load subjects 35 | in_img = nib.load(obj) 36 | in_shape = in_img.shape 37 | print('Shape: ', in_shape) 38 | in_array = in_img.get_fdata() 39 | return in_array 40 | 41 | 42 | 43 | def flipping(img,axis): 44 | flipped_img = np.flip(img,axis=axis) 45 | return flipped_img 46 | 47 | 48 | 49 | def flipping_HV(img): 50 | flipped_img = np.fliplr(img) 51 | return flipped_img 52 | 53 | def rotate(img,angle): 54 | 55 | img=ndimage.interpolation.rotate(img,angle) 56 | return img 57 | 58 | def shifting(img,shift_amount): 59 | 60 | img=ndimage.interpolation.shift(img,shift_amount) 61 | 62 | return img 63 | 64 | def zooming(img,zooming_amount): 65 | img=ndimage.interpolation.zoom(img,zooming_amount) 66 | return img 67 | 68 | 69 | def add_gaussian_noise(X_imgs): 70 | gaussian_noise_imgs = [] 71 | row, col,depth,number_of_samples = X_imgs.shape 72 | # Gaussian distribution parameters 73 | mean = 0 74 | var = 0.1 75 | sigma = var ** 0.5 76 | gaussian_noise_imgs=np.empty(X_imgs.shape) 77 | for i in range(number_of_samples): 78 | gaussian_img=np.zeros((row,col,depth)) 79 | gaussian = np.random.random((row, col, 1)).astype(np.float64) 80 | gaussian = np.tile(gaussian,(1,1,depth)) 81 | gaussian_img = cv2.addWeighted(X_imgs[:,:,:,i], 0.75, 0.25 * gaussian, 0.25, 0 
,dtype=cv2.CV_64F) 82 | gaussian_noise_imgs[:,:,:,i]=gaussian_img 83 | gaussian_noise_imgs = np.array(gaussian_noise_imgs, dtype = np.float32) 84 | return gaussian_noise_imgs 85 | 86 | 87 | def transposnig(input_data,order): 88 | return input_data.transpose(order) 89 | 90 | 91 | 92 | def mask_print(input,mask,name): 93 | #remained_feature_indices=np.where(mask==1) 94 | masking_img = nib.load('/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz') 95 | 96 | masking_shape = masking_img.shape 97 | print(masking_shape) 98 | masking = np.empty(masking_shape, dtype=float) 99 | masking[:,:,:] = masking_img.get_data().astype(float) 100 | for i in range (np.shape(input)[3]): 101 | input[:,:,:,i]=mask*input[:,:,:,i] 102 | #input[:,:,:,i]=input[:,:,:,i] 103 | hdr = masking_img.header 104 | aff = masking_img.affine 105 | out_img = nib.Nifti1Image(input, aff, hdr) 106 | # Save to disk 107 | out_file_name = '/data/fmri/Folder/AD_classification/Data/input_data/Augmented_data/mask_'+name+'.nii.gz' 108 | nib.save(out_img, out_file_name) 109 | 110 | def slicing(len1,len2 ): 111 | diff=abs(len2-len1)/2 112 | 113 | if (round(diff)>diff): 114 | return round(diff),len2-round(diff)+1 115 | 116 | else: 117 | return int(diff),int(len2-diff) 118 | 119 | 120 | ''' 121 | Oulu_data_ad_path = '/data/fmri/Folder/AD_classification/Data/Raw_data/Oulu_Data/CV_OULU_AD.nii.gz' 122 | Oulu_data_con_path='/data/fmri/Folder/AD_classification/Data/Raw_data/Oulu_Data/CV_OULU_CON.nii.gz' 123 | adni_data_ad_path='/data/fmri/Folder/AD_classification/Data/Raw_data/ADNI_Data/CV_ADNI_AD.nii.gz' 124 | adni_data_con_path='/data/fmri/Folder/AD_classification/Data/Raw_data/ADNI_Data/CV_ADNI_CON.nii.gz' 125 | masking_data='/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz' 126 | 127 | 128 | 129 | Oulu_data_ad=load_obj(Oulu_data_ad_path) 130 | Oulu_data_con=load_obj(Oulu_data_con_path) 131 | adni_data_ad=load_obj(adni_data_ad_path) 132 | 
adni_data_con=load_obj(adni_data_con_path) 133 | mask=load_obj(masking_data) 134 | 135 | order_data=(0,2,1,3) 136 | order_mask=(0,2,1) 137 | 138 | Oulu_data_con_transposed=transposnig(Oulu_data_con,order_data) 139 | Oulu_data_ad_transposed=transposnig(Oulu_data_ad,order_data) 140 | mask_transposed=transposnig(mask,order_mask) 141 | 142 | 143 | #Rotation 144 | angles=[30,-30,60,-60,45,-45] 145 | for i in angles: 146 | Oulu_data_ad_rotated=rotate(Oulu_data_ad_transposed,i) 147 | mask_rotated=rotate(mask_transposed,i) 148 | start,end=slicing(Oulu_data_ad.shape[0],Oulu_data_ad_rotated.shape[0]) 149 | 150 | Oulu_data_ad_rotated=Oulu_data_ad_rotated[0:Oulu_data_ad.shape[0],0:Oulu_data_ad.shape[0],:,:] 151 | mask_rotated=mask_rotated[0:Oulu_data_ad.shape[0],0:Oulu_data_ad.shape[0],:] 152 | 153 | Oulu_data_ad_rotated_transposed=transposnig(Oulu_data_ad_rotated,order_data) 154 | mask_rotated_transposed=transposnig(mask_rotated,order_mask) 155 | print(Oulu_data_ad_rotated_transposed.shape) 156 | mask_print(Oulu_data_ad_rotated_transposed,mask_rotated_transposed,'rotated_' +str(i)+ '_Oulu_data_ad') 157 | 158 | 159 | 160 | # adding gussian noise 161 | 162 | Oulu_data_ad_noised=add_gaussian_noise(Oulu_data_ad_transposed) 163 | Oulu_data_ad_noised_transposed=transposnig(Oulu_data_ad_noised,order_data) 164 | mask_print(Oulu_data_ad_noised_transposed,mask,'Oulu_data_ad_gussian_noised') 165 | 166 | 167 | #shifting 168 | shift_amount_data=[0,20,0,0] 169 | shift_amount_mask=[0,20,0] 170 | Oulu_data_con_shifted=shifting(Oulu_data_con_transposed,shift_amount_data) 171 | Oulu_data_con_shifted_transposed=transposnig(Oulu_data_con_shifted,order_data) 172 | mask_shifted=shifting(mask_transposed,shift_amount_mask) 173 | mask_shifted_transposed=transposnig(mask_shifted,order_mask) 174 | mask_print(Oulu_data_con_shifted_transposed,mask_shifted_transposed,'down_Oulu_data_con') 175 | 176 | 177 | # flipping 178 | Oulu_data_ad_flipped=flipping(Oulu_data_ad_transposed,0) 179 | 
Oulu_data_ad_flipped_transposed=transposnig(Oulu_data_ad_flipped,order_data) 180 | mask_tranposed=transposnig(mask,order_mask) 181 | mask_flipped=flipping(mask_tranposed,1) 182 | mask_flipped_transposed=transposnig(mask_flipped,order_mask) 183 | mask_print(Oulu_data_ad_flipped_transposed,mask_flipped_transposed,'vertical_flipped_Oulu_data_ad') 184 | ''' 185 | 186 | 187 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/preprocessing_methods.py: -------------------------------------------------------------------------------- 1 | import data_preprocessing 2 | import numpy as np 3 | import load_data 4 | 5 | def preprocessing(train_data,test_data,train_labels,test_labels,method,save,file_name,output_dir): 6 | 7 | 8 | 9 | dim0_train=train_data.shape[0] 10 | dim1_train=train_data.shape[1] 11 | dim2_train=train_data.shape[2] 12 | dim3_train=train_data.shape[3] 13 | 14 | dim0_test=test_data.shape[0] 15 | dim1_test=test_data.shape[1] 16 | dim2_test=test_data.shape[2] 17 | dim3_test=test_data.shape[3] 18 | 19 | if method==0: 20 | return train_data,test_data,train_labels,test_labels 21 | elif method==1: 22 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train) 23 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test*dim3_test) 24 | train_data=data_preprocessing.MinMax_scaler(train_data) 25 | test_data=data_preprocessing.MinMax_scaler(test_data) 26 | train_data,test_data=data_preprocessing.standarization(train_data,test_data) 27 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,800) 28 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train) 29 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test) 30 | 31 | elif method==2: 32 | for i in range(train_data.shape[0]): 33 | for j in range(train_data.shape[3]): 34 | train_data[i, :, :, j] = data_preprocessing.standarization(train_data[i, :, :, j]) 35 | train_data[i, :, 
:, j] = data_preprocessing.MinMax_scaler(train_data[i, :, :, j]) 36 | 37 | if i < test_data.shape[0]: 38 | test_data[i, :, :, j] = data_preprocessing.standarization(test_data[i, :, :, j]) 39 | 40 | test_data[i, :, :, j] = data_preprocessing.MinMax_scaler(test_data[i, :, :, j]) 41 | 42 | 43 | 44 | 45 | 46 | 47 | elif method==3: 48 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train) 49 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test*dim3_test) 50 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,500) 51 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train) 52 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test) 53 | 54 | 55 | 56 | elif method==4: 57 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train,dim3_train) 58 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test,dim3_test) 59 | for i in range (dim3_train): 60 | train_data[:,:,i]=data_preprocessing.MinMax_scaler(train_data[:,:,i]) 61 | train_data[:,:,i]=data_preprocessing.standarization(train_data[:,:,i]) 62 | 63 | test_data[:,:,i]=data_preprocessing.MinMax_scaler(test_data[:,:,i]) 64 | test_data[:,:,i]=data_preprocessing.standarization(test_data[:,:,i]) 65 | 66 | train_data[:,:,i],test_data[:,:,i]=data_preprocessing.KSTest(train_data[:,:,i],test_data[:,:,i],800) 67 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train) 68 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test) 69 | 70 | elif method==5: 71 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train) 72 | train_data,train_labels,index=data_preprocessing.outliers(train_data,train_labels,1) 73 | train_data=train_data.reshape(dim0_train-np.size(index),dim1_train,dim2_train,dim3_train) 74 | if save==0: 75 | 76 | return train_data,test_data,train_labels,test_labels 77 | else: 78 | 79 | transposing_order = [1,3,2,0] 80 | train_data = data_preprocessing.transposnig(train_data, 
transposing_order) 81 | test_data = data_preprocessing.transposnig(test_data, transposing_order) 82 | output_path=load_data.find_path(output_dir) 83 | np.savez(output_path+file_name+'train_data.npz',masked_voxels=train_data) 84 | np.savez(output_path+file_name+'test_data.npz',masked_voxels=test_data) 85 | 86 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/storing_loading/load_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy import load 3 | import os 4 | import nibabel as nib 5 | from pathlib import Path 6 | import warnings 7 | warnings.filterwarnings("ignore") 8 | 9 | 10 | def find_path(file_name): 11 | data_path=None 12 | current_dir_path=os.getcwd() 13 | p=Path(current_dir_path) 14 | root_dir=p.parts[0]+p.parts[1] 15 | for r,d,f in os.walk(root_dir): 16 | for files in f: 17 | if files == file_name: 18 | data_path=os.path.join(r,files) 19 | 20 | else: 21 | for dir in d : 22 | if dir == file_name: 23 | data_path = os.path.join(r, dir) 24 | if data_path is not None: 25 | return data_path 26 | else: 27 | os.makedirs('./'+file_name) 28 | return './'+file_name 29 | 30 | 31 | 32 | def train_data_3d(train_Con_file_name, train_AD_file_name): 33 | train_data_Con_path = find_path(train_Con_file_name) 34 | train_data_AD_path = find_path(train_AD_file_name) 35 | train_data_Con = load(train_data_Con_path)['masked_voxels'] 36 | train_data_AD = load(train_data_AD_path)['masked_voxels'] 37 | train_data=np.concatenate((train_data_Con,train_data_AD),axis=3) 38 | train_labels = np.hstack((np.zeros(train_data_Con.shape[3]), 
np.ones(train_data_AD.shape[3]))) 39 | 40 | 41 | return train_data, train_labels 42 | 43 | def test_data_3d(test_Con_file_name,test_AD_file_name): 44 | test_data_Con_path=find_path(test_Con_file_name) 45 | test_data_AD_path = find_path(test_AD_file_name) 46 | test_data_Con = load(test_data_Con_path)['masked_voxels'] 47 | test_data_AD = load(test_data_AD_path)['masked_voxels'] 48 | test_data = np.concatenate((test_data_Con, test_data_AD), axis=3) 49 | test_labels = np.hstack((np.zeros(test_data_Con.shape[3]), np.ones(test_data_AD.shape[3]))) 50 | 51 | 52 | return test_data,test_labels 53 | 54 | 55 | def mask(mask_name): 56 | mask_path = find_path(mask_name) 57 | original_mask = nib.load(mask_path) 58 | return original_mask 59 | 60 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/storing_loading/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/load_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy import load 3 | import os 4 | import nibabel as nib 5 | from pathlib import Path 6 | import warnings 7 | warnings.filterwarnings("ignore") 8 | 9 | 10 | def find_path(file_name): 11 | data_path=None 12 | current_dir_path=os.getcwd() 13 | p=Path(current_dir_path) 14 | root_dir=p.parts[0]+p.parts[1] 15 | for r,d,f in os.walk(root_dir): 16 | for files in f: 17 | if files == file_name: 18 | data_path=os.path.join(r,files) 19 | 20 | else: 21 | for dir in d : 22 | if dir == file_name: 23 | data_path = os.path.join(r, dir) 24 | if data_path is not None: 25 | return data_path 26 | else: 27 | os.makedirs('./'+file_name) 28 | return './'+file_name 29 | 30 | 31 | 32 | def 
train_data_3d(train_Con_file_name, train_AD_file_name): 33 | train_data_Con_path = find_path(train_Con_file_name) 34 | train_data_AD_path = find_path(train_AD_file_name) 35 | train_data_Con = load(train_data_Con_path)['masked_voxels'] 36 | train_data_AD = load(train_data_AD_path)['masked_voxels'] 37 | train_data=np.concatenate((train_data_Con,train_data_AD),axis=3) 38 | train_labels = np.hstack((np.zeros(train_data_Con.shape[3]), np.ones(train_data_AD.shape[3]))) 39 | 40 | 41 | return train_data, train_labels 42 | 43 | def test_data_3d(test_Con_file_name,test_AD_file_name): 44 | test_data_Con_path=find_path(test_Con_file_name) 45 | test_data_AD_path = find_path(test_AD_file_name) 46 | test_data_Con = load(test_data_Con_path)['masked_voxels'] 47 | test_data_AD = load(test_data_AD_path)['masked_voxels'] 48 | test_data = np.concatenate((test_data_Con, test_data_AD), axis=3) 49 | test_labels = np.hstack((np.zeros(test_data_Con.shape[3]), np.ones(test_data_AD.shape[3]))) 50 | 51 | 52 | return test_data,test_labels 53 | 54 | 55 | def mask(mask_name): 56 | mask_path = find_path(mask_name) 57 | original_mask = nib.load(mask_path) 58 | return original_mask 59 | 60 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/load_models.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import load_data 3 | import data_preprocessing 4 | import numpy as np 5 | import nibabel as nib 6 | import generate_result 7 | 8 | #define paths 9 | train_Con_file_name = 'CV_OULU_CON.npz' 10 | train_AD_file_name = 'CV_OULU_AD.npz' 11 | test_Con_file_name = 'CV_ADNI_CON.npz' 12 | test_AD_file_name = 'CV_ADNI_AD.npz' 13 | mask_name = '4mm_brain_mask_bin.nii.gz' 14 | created_mask_high_certainity_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz' 15 | 
created_mask_outlier_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz' 16 | high_certainity_model_name='./Output_results_directory/2019-08-10/1/high_certainity_model.sav' 17 | low_certainty_model_name='./Output_results_directory/2019-08-10/1/low_certainty_model.sav' 18 | outliers_model_name='./Output_results_directory/2019-08-10/1/outliers_model.sav' 19 | #define variables 20 | number_of_neighbours = 1 21 | 22 | #load data 23 | train_data,train_labels=load_data.train_data_3d(train_Con_file_name,train_AD_file_name) 24 | test_data, test_labels = load_data.test_data_3d(test_Con_file_name, test_AD_file_name) 25 | 26 | #load masks 27 | mask_4mm = load_data.mask(mask_name) 28 | created_mask_high_certainity = nib.load(created_mask_high_certainity_file_name) 29 | created_mask_outlier = nib.load(created_mask_outlier_file_name) 30 | 31 | 32 | #data preprocessing 33 | train_data = np.moveaxis(train_data.copy(), 3, 0) 34 | test_data = np.moveaxis(test_data.copy(), 3, 0) 35 | original_mask=mask_4mm.get_fdata() 36 | train_data = train_data * original_mask 37 | test_data = test_data * original_mask 38 | created_mask_high_certainity=created_mask_high_certainity.get_fdata() 39 | created_mask_outlier=created_mask_outlier.get_fdata() 40 | orignal_mask_flatten = data_preprocessing.flatten(original_mask[np.newaxis, :, :, :].copy()) 41 | orignal_mask_flatten = np.reshape(orignal_mask_flatten, (-1)) 42 | created_mask_high_certainity_flatten = data_preprocessing.flatten(created_mask_high_certainity[np.newaxis, :, :, :].copy()) 43 | created_mask_high_certainity_flatten = np.reshape(created_mask_high_certainity_flatten, (-1)) 44 | created_mask_outlier_flatten = data_preprocessing.flatten(created_mask_outlier[np.newaxis, :, :, :].copy()) 45 | created_mask_outlier_flatten = np.reshape(created_mask_outlier_flatten, (-1)) 46 | train_data_flattened = data_preprocessing.flatten(train_data.copy()) 47 | test_data_flattened = 
data_preprocessing.flatten(test_data.copy()) 48 | train_data_flattened = data_preprocessing.MinMax_scaler(train_data_flattened.copy()) 49 | test_data_flattened = data_preprocessing.MinMax_scaler(test_data_flattened.copy()) 50 | 51 | 52 | train_data_inlier, train_labels_inlier, outlier_indices_train = data_preprocessing.outliers(train_data_flattened, 53 | train_labels, 54 | number_of_neighbours) 55 | test_data_inlier, test_labels_inlier, outlier_indices_test = data_preprocessing.novelty(train_data_inlier, 56 | train_labels_inlier, 57 | test_data_flattened, 58 | test_labels, 59 | number_of_neighbours) 60 | 61 | test_data_inlier_brain=test_data_inlier[:,np.squeeze(np.where(orignal_mask_flatten>0),axis=0)] 62 | test_data_outlier_brain=(test_data_flattened[outlier_indices_test])[:,np.squeeze(np.where(orignal_mask_flatten>0),axis=0)] 63 | test_data_masked_high_certainity=test_data_inlier_brain* created_mask_high_certainity_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)] 64 | test_data_inlier_CVspace = data_preprocessing.coefficient_of_variance(test_data_masked_high_certainity)[:,np.newaxis] 65 | test_data_outlier_cv = data_preprocessing.coefficient_of_variance( 66 | test_data_outlier_brain *created_mask_outlier_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)])[:, np.newaxis] 67 | #load models 68 | high_certainity_model = pickle.load(open(high_certainity_model_name, 'rb')) 69 | low_certainty_model = pickle.load(open(low_certainty_model_name, 'rb')) 70 | outliers_model = pickle.load(open(outliers_model_name, 'rb')) 71 | #output results 72 | test_accuracy_high_certainity,F1_score_high_certainity,auc_high_certainity,low_confidence_indices=generate_result.out_result_highprob(test_data_inlier_CVspace, 73 | test_labels_inlier,orignal_mask_flatten,created_mask_high_certainity_flatten,high_certainity_model) 74 | 
test_accuracy_low_certainty,F1_score_low_certainty,auc_low_certainty=generate_result.out_result(test_data_inlier_CVspace[low_confidence_indices], 75 | test_labels_inlier[low_confidence_indices],orignal_mask_flatten,created_mask_high_certainity_flatten,low_certainty_model) 76 | test_accuracy_outlier,F1_score_outlier,auc_outlier= generate_result.out_result(test_data_outlier_cv , 77 | test_labels[outlier_indices_test], orignal_mask_flatten, 78 | created_mask_outlier_flatten, outliers_model) 79 | 80 | 81 | #print results 82 | print('total_test_accuracy>',(test_accuracy_high_certainity+test_accuracy_low_certainty+test_accuracy_outlier)/3) 83 | print('total_F1_score>',(F1_score_high_certainity+F1_score_low_certainty+F1_score_outlier)/3) 84 | print('total_AUC_score>',(auc_high_certainity+auc_low_certainty+auc_outlier)/3) -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/main.py: -------------------------------------------------------------------------------- 1 | #from hyper_opt import create_mask,model,model_1D 2 | import load_data 3 | import data_preprocessing 4 | import generate_result 5 | from Model import create_mask 6 | from pathlib import Path 7 | 8 | 9 | 10 | def main(): 11 | train_Con_file_name = 'whole_brain_Oulu_Con.npz' 12 | train_AD_file_name = 'whole_brain_Oulu_AD.npz' 13 | test_Con_file_name = 'whole_brain_ADNI_Con.npz' 14 | test_AD_file_name = 'whole_brain_ADNI_AD.npz' 15 | root_dir='/data' 16 | mask_name='4mm_brain_mask_bin.nii.gz' 17 | results_directory='Results' 18 | results_path=load_data.find_path(results_directory) 19 | number_of_cv=5 20 | feature_selection_type='recursion' 21 | data_preprocessing_method='kstest and standarization and Normalization and Density ratio estimation' 22 | Hyperparameter=(4000,10) 23 | train_data,train_labels=load_data.train_data(train_Con_file_name,train_AD_file_name) 24 | test_data, test_labels = 
load_data.test_data(test_Con_file_name, test_AD_file_name) 25 | #sample_weight = data_preprocessing.density_ratio_estimation(train_data,test_data) 26 | original_mask=load_data.mask(mask_name,root_dir) 27 | 28 | #created_mask,model_,model_name,weights=create_mask(train_data,labels_train,number_of_cv,feature_selection_type, 29 | #Hyperparameter,mask_threshold=2,model_type='gaussian_process') 30 | 31 | #test_data = data_preprocessing.coefficient_of_variance(test_data) 32 | 33 | #model_, model_name=model_1D(train_data,labels_train,created_mask,data_validation=None,labels_validation=None,model_type='gaussian_process') 34 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,step=Hyperparameter[1]) 35 | 36 | 37 | train_data = data_preprocessing.standarization(train_data) 38 | test_data = data_preprocessing.standarization(test_data) 39 | train_data = data_preprocessing.standarization(train_data) 40 | test_data = data_preprocessing.standarization(test_data) 41 | 42 | train_data = data_preprocessing. MinMax_scaler(train_data) 43 | test_data= data_preprocessing. 
MinMax_scaler(test_data) 44 | 45 | sample_weights=data_preprocessing.density_ratio_estimation(train_data,test_data) 46 | 47 | created_mask,model,model_name,weights =create_mask(train_data,train_labels,number_of_cv,feature_selection_type,Hyperparameter[0],1,model_type='Random_forest', sample_weights=sample_weights) 48 | 49 | generate_result.print_result(test_data, test_labels, original_mask, created_mask, model, model_name, weights, 50 | results_path,feature_selection_type,Hyperparameter,data_preprocessing_method) 51 | 52 | 53 | 54 | 55 | 56 | 57 | if __name__=='__main__': 58 | main() -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/readme.md: -------------------------------------------------------------------------------- 1 | This project is about utilizing resting-state fMRI to classify patients with Alzheimer's disease from controls. 2 | The project started in June 2019 as part of a summer internship at Oulu University, in collaboration with Oulu University 3 | Hospital. 4 | 5 | --- 6 | ## Installation 7 | ___ 8 | ### Dependencies 9 | * Python(>=3.5) 10 | * Keras==2.2.4 11 | * nilearn==0.5.2 12 | * scipy==1.2.1 13 | * nibabel==2.4.1 14 | * numpy==1.16.2 15 | * imbalanced_learn==0.5.0 16 | * imblearn==0.0 17 | * scikit_learn==0.21.2 18 | * densratio==0.2.2 19 | * skimage==0.0 20 | * matplotlib==3.0.3 21 | 22 | Download all using: 23 | 24 | ```bash 25 | pip3 install -r requirements.txt 26 | ``` 27 | 28 | ## Data 29 | 30 | In this project, Oulu University data were used for training the model and [ADNI](http://adni.loni.usc.edu/data-samples/) data were used for testing the performance. 31 | The 4mm brain mask used for extracting the brain from the scalp was extracted using [FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki). 32 | 33 | --- 34 | ## Model 35 | 36 | The coefficient of variation over time of the BOLD signal was used as the input data for the model. 
The model consists of three main 37 | steps: 38 | 1. Loading and preprocessing the input data 39 | 1. Voxel selection using cross-validation to create a mask of the most effective voxels 40 | 1. Classifying the masked data using several classifiers and printing the results (the Gaussian process classifier provides the best 41 | performance in high-dimensional space). 42 | 43 | 44 | --- 45 | ### Running the code 46 | ```bash 47 | python3 main.py 48 | ``` 49 | Change input parameters and variables in the main.py file to fit your requirements and preferences. 50 | The default script runs in under 10 minutes on an average laptop using under 1 GB of memory. 51 | 52 | --- 53 | ## Results 54 | 55 | The results folder should be created by you, and its name assigned to the variable 'Results_directory'. It contains the measured performance in **Results.txt**, the used parameters and chosen classifier in **README.txt**, the mask of effective voxels, and the voxel importance weights. 56 | The provided pretrained model in **Output_results_directory** gives the following confidence levels: 57 | 58 | | | Median | Min(.95 CL) | Max(.95 CL) | 59 | |----------|--------|-------------|-------------| 60 | | Accuracy | .702 | .555 | .835 | 61 | | F1_score | .723 | .595 | .841 | 62 | | AUC | .781 | .721 | .856 | 63 | 64 | #### Note: the mask and weights files are NIFTI images; you can use FSL utils or the nibabel library to process the data, or FSLeyes to visualize it. 
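Confidence levels like those in the table above can be estimated by bootstrapping a metric over resampled test subjects. A hedged numpy sketch — the resampling scheme and the synthetic correct/incorrect vector are assumptions for illustration, not the project's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-subject correctness of a classifier on 40 test scans.
correct = rng.random(40) < 0.7

# Bootstrap the accuracy: resample subjects with replacement many times.
boot = np.array([
    np.mean(rng.choice(correct, size=correct.size, replace=True))
    for _ in range(2000)
])

median = np.median(boot)
lo, hi = np.percentile(boot, [2.5, 97.5])  # .95 confidence level
print(f'Accuracy median={median:.3f}, 95% CL=({lo:.3f}, {hi:.3f})')
```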
65 | #### Note: always give the *Results* directory a unique name (unique in your root directory) 66 | 67 | 68 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/sample_test.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import load_data 3 | import data_preprocessing 4 | import numpy as np 5 | import nibabel as nib 6 | from sklearn.neighbors import LocalOutlierFactor 7 | from scipy.stats import variation 8 | 9 | train_Con_file_name = 'CV_OULU_CON.npz' 10 | train_AD_file_name = 'CV_OULU_AD.npz' 11 | mask_name = '4mm_brain_mask_bin.nii.gz' 12 | created_mask_high_certainity_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz' 13 | created_mask_outlier_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz' 14 | high_certainity_model_name='./Output_results_directory/2019-08-10/1/high_certainity_model.sav' 15 | low_certainty_model_name='./Output_results_directory/2019-08-10/1/low_certainty_model.sav' 16 | outliers_model_name='./Output_results_directory/2019-08-10/1/outliers_model.sav' 17 | scaler_name='scaler.sav' 18 | number_of_neighbours = 1 19 | model_type='gaussian_process' 20 | 21 | # load the nii file 22 | sample_name='CV_ADNI_AD.nii.gz' 23 | sample_path=load_data.find_path(sample_name) 24 | sample = nib.load(sample_path) 25 | sample = sample.get_fdata() 26 | sample=sample[:,:,:,7] # comment out if using 3D data 27 | 28 | # load the necessary files 29 | mask_4mm = load_data.mask(mask_name) 30 | original_mask=mask_4mm.get_fdata() 31 | orignal_mask_flatten = data_preprocessing.flatten(original_mask[np.newaxis, :, :, :].copy()) 32 | orignal_mask_flatten = np.reshape(orignal_mask_flatten, (-1)) 33 | created_mask_high_certainity = nib.load(created_mask_high_certainity_file_name) 34 | created_mask_outlier = nib.load(created_mask_outlier_file_name) 35 | 
created_mask_high_certainity=created_mask_high_certainity.get_fdata() 36 | created_mask_outlier=created_mask_outlier.get_fdata() 37 | created_mask_high_certainity_flatten = data_preprocessing.flatten(created_mask_high_certainity[np.newaxis, :, :, :].copy()) 38 | created_mask_high_certainity_flatten = np.reshape(created_mask_high_certainity_flatten, (-1)) 39 | created_mask_outlier_flatten = data_preprocessing.flatten(created_mask_outlier[np.newaxis, :, :, :].copy()) 40 | created_mask_outlier_flatten = np.reshape(created_mask_outlier_flatten, (-1)) 41 | train_data,train_labels=load_data.train_data_3d(train_Con_file_name,train_AD_file_name) 42 | train_data = np.moveaxis(train_data.copy(), 3, 0) 43 | train_data = train_data * original_mask 44 | train_data_flattened = data_preprocessing.flatten(train_data.copy()) 45 | train_data_flattened = data_preprocessing.MinMax_scaler(train_data_flattened.copy()) 46 | 47 | 48 | 49 | #preprocessing 50 | sample_pre=np.array(sample)*original_mask 51 | sample_pre=np.reshape(sample_pre,(1,-1)) 52 | scaler = pickle.load(open(scaler_name, 'rb')) 53 | sample_pre=scaler.transform(sample_pre) 54 | 55 | 56 | #load_models 57 | high_certainity_model = pickle.load(open(high_certainity_model_name, 'rb')) 58 | low_certainty_model = pickle.load(open(low_certainty_model_name, 'rb')) 59 | outliers_model = pickle.load(open(outliers_model_name, 'rb')) 60 | 61 | #prediction 62 | neigh = LocalOutlierFactor(n_neighbors=number_of_neighbours,novelty=True) 63 | neighbours=neigh.fit(train_data_flattened) 64 | inlier_outlier_state=neighbours.predict(sample_pre.copy()) 65 | if (inlier_outlier_state==1):#inlier 66 | sample_pre = variation(sample_pre[:,np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)]* 67 | created_mask_high_certainity_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)],axis=1)[:,np.newaxis] 68 | 69 | 70 | if (model_type!='ensamble classifer'): 71 | sample_prob=high_certainity_model.predict_proba(sample_pre) 72 | if 
((sample_prob[0,0]>.64)|(sample_prob[0,0]<.35)): 73 | sample_pred=high_certainity_model.predict(sample_pre) 74 | print('high certainty prediction') 75 | 76 | else: 77 | sample_pred = low_certainty_model.predict(sample_pre) 78 | print('low certainty prediction') 79 | 80 | else: 81 | sample_pred = low_certainty_model.predict(sample_pre) 82 | 83 | else: 84 | sample_pre=variation(sample_pre[:,np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)]* 85 | created_mask_outlier_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)],axis=1)[:,np.newaxis] 86 | sample_pred = outliers_model.predict(sample_pre) 87 | print('an outlier prediction') 88 | if sample_pred: 89 | print('sample-prediction: AD') 90 | else: 91 | print('sample-prediction: CON') 92 | 93 | 94 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/shuffle.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy import load 3 | import os 4 | 5 | 6 | oulu_con_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_Oulu_Con.npz')['masked_voxels'] 7 | oulu_ad_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_Oulu_AD.npz')['masked_voxels'] 8 | adni_con_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_ADNI_Con.npz')['masked_voxels'] 9 | adni_ad_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_ADNI_AD.npz')['masked_voxels'] 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | idx = np.random.permutation(np.shape(oulu_con_data)[0]) 18 | oulu_con_data= (oulu_con_data)[idx,:] 19 | oulu_ad_data=(oulu_ad_data)[idx,:] 20 | adni_con_data=(adni_con_data)[idx,:] 21 | adni_ad_data=(adni_ad_data)[idx,:] 22 | 23 | print(idx) 24 | print(np.shape(idx)) 25 | os.mkdir('./data') 26 | np.savez('./data/oulu_con_data', masked_voxels=oulu_con_data) 27 | 
np.savez('./data/oulu_ad_data',masked_voxels=oulu_ad_data) 28 | np.savez('./data/adni_con_data',masked_voxels=adni_con_data) 29 | np.savez('./data/adni_ad_data',masked_voxels=adni_ad_data) 30 | np.savez('./data/key',idx) 31 | 32 | npzfile = np.load('data/key.npz') 33 | npzfile=np.asarray(npzfile['arr_0']) 34 | print(npzfile) 35 | print(np.shape(npzfile)) -------------------------------------------------------------------------------- /Machine Learning/Classification/Alzhimers CV-BOLD Classification/writing.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from numpy import load 3 | import nibabel as nib 4 | 5 | masking_img = nib.load('/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz') 6 | masking_shape = masking_img.shape 7 | 8 | masking = np.empty(masking_shape, dtype=float) 9 | masking[:,:,:] = masking_img.get_data().astype(float) 10 | print(masking.shape) 11 | tmp=np.where(masking[2,:,:]>0) 12 | print(((tmp))) 13 | 14 | 15 | 16 | ''' 17 | import os 18 | from datetime import date 19 | import glob 20 | 21 | today = date.today() 22 | x='hello' 23 | f=open('test.txt',"w+") 24 | f.write(x+'world') 25 | 26 | os.mkdir('writing') 27 | LatestFile = sorted(os.listdir('/data/fmri/Folder/AD_classification/codes/model/writing'),reverse = True) 28 | 29 | x=int(LatestFile[0][0]) 30 | x=x+1 31 | print(x) 32 | 33 | ''' -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/Sensor Activity Recognition.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Machine Learning/Classification/Sensor-activity-recognition/Sensor Activity Recognition.pdf -------------------------------------------------------------------------------- /Machine 
Learning/Classification/Sensor-activity-recognition/codes/classes_accuarcy.m: -------------------------------------------------------------------------------- 1 | function [class_accuarcy] = classes_accuarcy(test_labels,predicted_labels) 2 | % Given the predicted labels and the ground-truth labels, this function 3 | % returns the classification accuracy of each class (activity) 4 | 5 | misclassified_index=find(test_labels~=predicted_labels); 6 | 7 | misclassified_labels=test_labels(misclassified_index); 8 | 9 | unique_labels=unique(test_labels); 10 | class_accuarcy=[]; 11 | for i =1: length(unique_labels) 12 | counter=length(find(misclassified_labels==unique_labels(i))); 13 | class_accuarcy=[class_accuarcy counter]; 14 | end 15 | class_accuarcy=1-(class_accuarcy/length(test_labels)); 16 | 17 | 18 | end 19 | 20 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/classification.m: -------------------------------------------------------------------------------- 1 | function [acc,class_accuarcy] = classification(train_features,train_labels,test_features,test_labels,classifier_name) 2 | % Train the selected classifier (KNN, LDA or QDA) on the training data and 3 | % evaluate its accuracy on the test data 4 | 5 | 6 | if strcmp(classifier_name,'KNN') 7 | clear classifer_model; 8 | classifer_model=fitcknn(train_features,train_labels,'NumNeighbors',5); 9 | predicted_labels=predict(classifer_model,test_features); 10 | performance_evaluaion=classperf(test_labels); 11 | classperf(performance_evaluaion,predicted_labels); 12 | acc=length(find(predicted_labels==test_labels))/length(predicted_labels); 13 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels); 14 | 15 | elseif strcmp(classifier_name,'LDA') 16 | clear classifer_model; 17 | classifer_model=fitcdiscr(train_features,train_labels,'DiscrimType','linear'); 18 | predicted_labels=predict(classifer_model,test_features); 19 | 
acc=length(find(predicted_labels==test_labels))/length(predicted_labels); 20 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels); 21 | 22 | 23 | elseif strcmp(classifier_name,'QDA') 24 | classifer_model=fitcdiscr(train_features,train_labels,'DiscrimType','quadratic'); 25 | predicted_labels=predict(classifer_model,test_features); 26 | acc=length(find(predicted_labels==test_labels))/length(predicted_labels); 27 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels); 28 | 29 | end 30 | 31 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/create_feature_map.m: -------------------------------------------------------------------------------- 1 | function [feature_map] = create_feature_map(activity_data,window_size) 2 | %UNTITLED2 Summary of this function goes here 3 | % Detailed explanation goes here 4 | struct_fields=fieldnames(activity_data); 5 | feature_map=struct; 6 | for i=1:length(struct_fields) 7 | participant=getfield(activity_data,cell2mat(struct_fields(i))); 8 | labels=participant.labels; 9 | new_label_index=find((labels(2:end)-labels(1:end-1))~=0); 10 | new_label_index=[1 ; new_label_index ; length(labels)]; 11 | participant_field_names=fieldnames(participant); 12 | labels_col=[]; 13 | feat_map_participant=[]; 14 | feature_map_all_positions=[]; 15 | for j=1:length(participant_field_names) 16 | feat_map_all_labels_one_position=[]; 17 | if ~strcmp(cell2mat(participant_field_names(j)), 'labels') && ~strcmp(cell2mat(participant_field_names(j)), 'time') 18 | position_data=getfield(participant,cell2mat(participant_field_names(j))); 19 | labels_col=[]; 20 | for k=1:length(new_label_index)-1 21 | feat_map_all_axis_one_label=[]; 22 | for c=1:9 23 | if k==length(new_label_index)-1 24 | strip=position_data(new_label_index(k):new_label_index(k+1),c); 25 | current_label=labels(new_label_index(k+1)); 26 | 27 | else 28 | 
strip=position_data(new_label_index(k):new_label_index(k+1),c); 29 | current_label=labels(new_label_index(k+1)); 30 | end 31 | feat_map_axis=features(strip,window_size); 32 | feat_map_all_axis_one_label=horzcat(feat_map_all_axis_one_label,feat_map_axis); 33 | 34 | end 35 | labels_col_temp=ones(size(feat_map_all_axis_one_label,1),1)*current_label; 36 | labels_col=vertcat(labels_col,labels_col_temp); 37 | feat_map_all_labels_one_position=vertcat(feat_map_all_labels_one_position,feat_map_all_axis_one_label); 38 | end 39 | feature_map_all_positions=horzcat(feature_map_all_positions,feat_map_all_labels_one_position); 40 | 41 | end 42 | end 43 | feat_map_participant=horzcat(labels_col,feature_map_all_positions); 44 | feature_map.(cell2mat(struct_fields(i)))=feat_map_participant; 45 | end 46 | save([ 'feature_map_' int2str(window_size) 's.mat'],'feature_map') 47 | end 48 | 49 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/main.m: -------------------------------------------------------------------------------- 1 | %% create new feature map and use it 2 | clear all 3 | clc; 4 | activity_data=load('dataActivity'); 5 | classification_accuarcy=[]; 6 | k_folds=10; % number of folds used for cross validation 7 | window_size=8; % window size for creating the feature map 8 | create_new_feature_map=1; % if ==1 then a new feature map will be created , else a saved one will be used 9 | saved_feature_map_file_name='feature_map_3s.mat'; % the name of the feature map file 10 | scaling=0; % scaling should be ==1 if you would like to scale the data and 0 if not 11 | outliers=0; % outliers should be ==1 if you would like to remove the outliers and 0 if not. 
12 | %[activity_data] = scalingANDoutliers(activity_data,scaling,outliers); 13 | 14 | % Check whether a new feature map will be created or a saved one should be 15 | % used 16 | if create_new_feature_map==1 17 | feature_map=create_feature_map(activity_data,window_size); 18 | participant_names=fieldnames(feature_map); 19 | else 20 | feature_map=load(saved_feature_map_file_name); 21 | feature_map=feature_map.feature_map; 22 | participant_names=fieldnames(feature_map); 23 | end 24 | 25 | 26 | classifier_name='KNN'; % classifier name; to change the classifier used, check the names of the available classifiers in the classification function 27 | train_labels=[]; 28 | train_features=[]; 29 | cross_validation_each_posiiton_acc=[]; 30 | max_class_acc_each_position=[]; 31 | max_class_acc_value=[]; 32 | 33 | 34 | time_starting_feature_index=1; 35 | frequency_starting_feature_index=7; 36 | 37 | time_ending_feature_index=6; 38 | frequency_ending_feature_index=11; 39 | 40 | time_features=0; 41 | classification_accuarcy=[]; 42 | class_acc_all_folds=[]; 43 | for j=1:k_folds 44 | train_features=[]; 45 | test_features=[]; 46 | train_labels=[]; 47 | test_labels=[]; 48 | for i=1:length(participant_names) 49 | 50 | if i==length(participant_names)-(j-1) 51 | participant=getfield(feature_map,cell2mat(participant_names(i))); 52 | test_labels= participant(:,1); 53 | test_features=participant(:,2:end); 54 | else 55 | participant=getfield(feature_map,cell2mat(participant_names(i))); 56 | train_labels=vertcat(train_labels,participant(:,1)); 57 | train_features=vertcat(train_features,participant(:,2:end)); 58 | end 59 | end 60 | %%%% selecting certain features, for example certain positions or certain 61 | %%%% axes, should be done here before giving the data to the classifier. 
62 | position_feature_index_all=[]; 63 | for sensor=0:8 64 | for position=0:4 65 | if time_features==1 66 | starting_index=time_starting_feature_index; 67 | ending_index= time_ending_feature_index; 68 | elseif time_features==0 69 | starting_index=frequency_starting_feature_index; 70 | ending_index= frequency_ending_feature_index; 71 | else 72 | starting_index=1; 73 | ending_index=11; 74 | end 75 | position_feature_index=linspace(starting_index,ending_index,ending_index-starting_index+1)+11*sensor+position*99; 76 | position_feature_index_all=[position_feature_index_all position_feature_index]; 77 | end 78 | end 79 | 80 | train_features_certain_positions=train_features(:,position_feature_index_all); 81 | test_features_certain_positions=test_features(:,position_feature_index_all); 82 | [train_features_certain_positions] = scalingANDoutliers(train_features_certain_positions,scaling,outliers); 83 | [test_features_certain_positions] = scalingANDoutliers(test_features_certain_positions,scaling,outliers); 84 | [acc,class_acc] =classification(train_features_certain_positions,train_labels,test_features_certain_positions,test_labels,classifier_name); 85 | classification_accuarcy=[classification_accuarcy acc]; 86 | class_acc_all_folds=[class_acc_all_folds ; class_acc]; 87 | end 88 | cross_validation_acc=mean(classification_accuarcy); 89 | class_acc_one_position=mean(class_acc_all_folds,1); 90 | [max_class_acc,max_class_label]=max(class_acc_one_position); 91 | 92 | 93 | 94 | 95 | 96 | 97 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/performance_evaluation.m: -------------------------------------------------------------------------------- 1 | function [cp] = performance_evaluation(classifier_model,test_data,test_labels) 2 | % Evaluate the classifier performance on the test data using classperf, 3 | % given a trained classifier model, test data and test labels 4 | predicted_labels=predict(classifier_model,test_data); 5 | 
cp=classperf(test_labels); 6 | classperf(cp,predicted_labels) 7 | 8 | 9 | 10 | end 11 | 12 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/codes/scalingANDoutliers.m: -------------------------------------------------------------------------------- 1 | function [scaledANDcleanedData] = scalingANDoutliers(data,scaling,outliers) 2 | % Scales the activity data person-wise and position-wise, and optionally 3 | % marks outliers as NaN 4 | 5 | %positions = {'leftPocket','rightPocket','belt','wrist','upperArm'}; 6 | 7 | %scaling 8 | 9 | 10 | if scaling == 1 % min-max scaling 11 | 12 | minVal = min(data,[],2); 13 | maxVal = max(data,[],2); 14 | 15 | data = (data - minVal)./(maxVal - minVal); 16 | 17 | 18 | else 19 | ; 20 | end 21 | 22 | if outliers == 1 23 | 24 | % the original struct-field indexing was broken for matrix input; 25 | % values above the threshold (2) are treated as outliers and set to NaN 26 | data(data > 2) = NaN; 27 | 28 | 29 | else 30 | ; 31 | end 32 | 33 | 34 | scaledANDcleanedData = data; 35 | 36 | end 37 | 38 | 39 | -------------------------------------------------------------------------------- /Machine Learning/Classification/Sensor-activity-recognition/readme.md: -------------------------------------------------------------------------------- 1 | # Sensor Activity Recognition 2 | 3 | 4 | ### Dataset 5 | The dataset can be downloaded __[here](https://www.kaggle.com/youssef19/sensor-activity-dataset)__ 6 | 7 | The feature matrix has 381 columns and a number of rows that depends on the window size 8 | 9 | The first column is the labels column; the remaining 380 columns are arranged as follows: 10 | 380 = 9*5*8 11 | 12 | 9: (3 axes for each of the three sensors: accelerometer, linear accelerometer and gyroscope) 13 | 14 | 5: (five positions, in this order: left pocket, right pocket, belt, wrist, upper arm) 15 | 16 | 8: (eight features, in this order: 17 | 18 | 1. Mean 19 | 20 | 2. Standard deviation 21 | 22 | 3. Median 23 | 24 | 4. Variance 25 | 26 | 5. Zero crossings 27 | 28 | 6. Root mean square value 29 | 30 | 7. Sum of FFT coefficients 31 | 32 | 8. Signal energy medium) 33 | 34 | 35 | So the first 8 columns are the features of the x-axis of the accelerometer for the left pocket position, and so on. 36 | 37 | -------------------------------------------------------------------------------- /Machine Learning/Clustering/Customer identification for mail order products/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 youssefHosni 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Machine Learning/Clustering/Customer identification for mail order products/README.md: -------------------------------------------------------------------------------- 1 | # project summary 2 | In this project, real-life data from Bertelsmann partners AZ Direct and Arvato Finance Solution were used. The data here concerns a company that performs mail-order sales in Germany. Their main question of interest is to identify facets of the population that are most likely to be purchasers of their products for a mailout campaign. As a data scientist I will use unsupervised learning techniques to organize the general population into clusters, then use those clusters to see which of them comprise the main user base for the company. Prior to applying the machine learning methods, I cleaned the data to convert it into a usable form. 3 | 4 | ## steps 5 | ### Step 1: Preprocessing 6 | When you start an analysis, you must first explore and understand the data that you are working with. In this (and the next) step of the project, you’ll be working with the general demographics data. As part of your investigation of dataset properties, you must attend to a few key points: 7 | 8 | How are missing or unknown values encoded in the data? Are there certain features (columns) that should be removed from the analysis because of missing data? Are there certain data points (rows) that should be treated separately from the rest? 9 | Consider the level of measurement for each feature in the dataset (e.g. categorical, ordinal, numeric). What assumptions must be made in order to use each feature in the final analysis? Are there features that need to be re-encoded before they can be used? Are there additional features that can be dropped at this stage? 10 | You will create a cleaning procedure that you will apply first to the general demographic data, then later to the customers data. 
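The preprocessing decisions above can be sketched with pandas; the column names, sentinel codes, and thresholds below are hypothetical — the real dataset has its own missing-value encodings documented in its data dictionary:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics table; column names are hypothetical.
df = pd.DataFrame({
    'age_group': [1, 2, 9, 3],   # 9 encodes "unknown" in this sketch
    'gender': ['M', 'F', 'F', 'M'],
    'income': [3, -1, 2, 4],     # -1 encodes "missing"
})

# 1) Recode sentinel values to NaN so pandas treats them as missing.
df = df.replace({'age_group': {9: np.nan}, 'income': {-1: np.nan}})

# 2) Drop columns, then rows, with too much missing data.
df = df.loc[:, df.isna().mean() < 0.5]
df = df[df.isna().mean(axis=1) < 0.5]

# 3) Re-encode categoricals as dummy variables for downstream models.
df = pd.get_dummies(df, columns=['gender'])
print(df.columns.tolist())
```

The same cleaning function would then be reused unchanged on the customers table, as Step 1 requires.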
11 | 12 | ### Step 2: Feature Transformation 13 | Now that your data is clean, you will use dimensionality reduction techniques to identify relationships between variables in the dataset, resulting in the creation of a new set of variables that account for those correlations. In this stage of the project, you will attend to the following points: 14 | 15 | The first technique that you should perform on your data is feature scaling. What might happen if we don’t perform feature scaling before applying later techniques you’ll be using? 16 | Once you’ve scaled your features, you can then apply principal component analysis (PCA) to find the vectors of maximal variability. How much variability in the data does each principal component capture? Can you interpret associations between original features in your dataset based on the weights given on the strongest components? How many components will you keep as part of the dimensionality reduction process? 17 | You will use the sklearn library to create objects that implement your feature scaling and PCA dimensionality reduction decisions. 18 | 19 | ### Step 3: Clustering 20 | Finally, on your transformed data, you will apply clustering techniques to identify groups in the general demographic data. You will then apply the same clustering model to the customers dataset to see how market segments differ between the general population and the mail-order sales company. You will tackle the following points in this stage: 21 | 22 | Use the k-means method to cluster the demographic data into groups. How should you make a decision on how many clusters to use? 23 | Apply the techniques and models that you fit on the demographic data to the customers data: data cleaning, feature scaling, PCA, and k-means clustering. Compare the distribution of people by cluster for the customer data to that of the general population. Can you say anything about which types of people are likely consumers for the mail-order sales company? 
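The pipeline described in Steps 2 and 3 — feature scaling, PCA, then k-means with an elbow check — might look like this with scikit-learn (random data stands in for the private demographics; the component count and range of k are illustrative choices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))  # stand-in for the cleaned demographics table

# Feature scaling first: PCA and k-means are both variance/distance based.
X_scaled = StandardScaler().fit_transform(X)

# PCA: keep enough components to capture most of the variance.
pca = PCA(n_components=5).fit(X_scaled)
X_pca = pca.transform(X_scaled)
print('variance captured:', pca.explained_variance_ratio_.sum().round(2))

# Elbow method: fit k-means for a range of k and inspect the inertia curve.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
            for k in range(2, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The fitted scaler, PCA, and k-means objects would then be applied as-is to the customers data to compare cluster proportions.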
24 | 25 | ## Requirements 26 | * NumPy 27 | * pandas 28 | * Sklearn / scikit-learn 29 | * Matplotlib (for data visualization) 30 | * Seaborn (for data visualization) 31 | 32 | ## Data used 33 | 34 | Demographic data for the general population of Germany; 891211 persons (rows) x 85 features (columns). 35 | Demographic data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns). 36 | The data is not provided as it is private data. 37 | 38 | -------------------------------------------------------------------------------- /Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 youssefHosni 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Project report.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Project report.pdf -------------------------------------------------------------------------------- /Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Readme.md: -------------------------------------------------------------------------------- 1 | # Finding the best neighborhood to open a new gym 2 | 3 | ## Introduction 4 | Finding the best neighborhood in Toronto to open a new gym, given the demographic, geographic, and venue information data. More information can be found [here](https://github.com/youssefHosni/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/blob/main/Project%20report.pdf) 5 | 6 | ## Data 7 | 8 | The dataset used to solve this problem has the following information: 9 | * The demographic information: the **total population**, the **15-45 population**, the **number of educated people** and the **number of employers** in each neighborhood. 10 | * The geographical data (lat, long) for each neighborhood, used to get the venue information for each neighborhood from the Foursquare API. 11 | 12 | The neighbourhoods on the map are shown in the figure below 13 | ![neighborhood_map](https://user-images.githubusercontent.com/72076328/109424179-4bbec100-79eb-11eb-9a71-6557010e2ee3.PNG) 14 | From the Foursquare API, the number of venues and the number of Gym/Fitness centers per neighbourhood were calculated and then merged with the demographics data; the final data used is as shown in the figure below. 
15 | ![total_data](https://user-images.githubusercontent.com/72076328/109424261-a9530d80-79eb-11eb-807c-49864647abc6.PNG) 16 | The final dataset can be found [here](https://www.kaggle.com/youssef19/toronto-neighborhoods-inforamtion) 17 | 18 | ## Methodology 19 | ### Data preprocessing 20 | The data were normalized using min-max normalization. This is an important step because the k-means algorithm depends on distance measurements, so the data should be on a similar scale. The formula of the min-max scaler is the following: 21 | `(feature − min(feature)) / (max(feature) − min(feature))` 22 | The neighborhood names and the geographical data were dropped from the data as they will not be used by the clustering algorithm. 23 | ### K-means clustering 24 | The best k was found using the elbow method, in which the average distance from the cluster centers is calculated for different values of k and the best k is the one at the elbow. The best k was found to be 3. 25 | 26 | ## Results 27 | The neighborhoods are clustered into three clusters as shown in the figure below. The red color is the first cluster, violet is the second cluster, and green is the third cluster. 28 | ![clsuters on the map](https://user-images.githubusercontent.com/72076328/113056147-18bb4900-91b4-11eb-9e33-8ccf83fa5fca.PNG) 29 | 30 | ## Conclusion 31 | Using the demographics data and the venue information for each neighborhood obtained from the Foursquare API, I was able to cluster the neighborhoods into three clusters using the K-means clustering algorithm. The number of gyms was found to be correlated with the number of venues. The neighborhoods with a large number of venues and gyms are clustered into the third cluster, so the most suitable neighborhood out of this cluster is the **Trinity-Bellwoods neighborhood**. The first cluster contains neighborhoods with a large population, a small number of gyms, and a moderate number of venues. 
The **Church-Yonge Corridor neighborhood** is the best choice in this cluster, as it contains 98 venues and a large population. Its number of venues is almost the same as that of Trinity-Bellwoods while its population is double, making it the best neighborhood to open a new gym in Toronto city. 32 | 33 | ## License & Copyright 34 | 35 | © Youssef Hosni 36 | 37 | Licensed under the [MIT License](https://github.com/youssefHosni/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/blob/main/LICENSE). 38 | -------------------------------------------------------------------------------- /Machine Learning/Clustering/Readme.md: -------------------------------------------------------------------------------- 1 | The clustering projects 2 | -------------------------------------------------------------------------------- /Machine Learning/Regression/Automobile price prediction/Readme.md: -------------------------------------------------------------------------------- 1 | # Automobile Price Prediction # 2 | 3 | ## 1. Background ## 4 | 5 | In this project I predict the price of old automobiles. The following steps were executed: 6 | 7 | * Loading the data 8 | * Preprocessing the data 9 | * Exploring the features, or characteristics, that predict the price of a car 10 | * Developing prediction models 11 | * Evaluating and refining prediction models 12 | 13 | ## 2. Methods 14 | 15 | ### 2.1. Data 16 | The dataset used is the [Automobile Dataset](https://www.kaggle.com/datasets/premptk/automobile-data-changed) from Kaggle. 17 | 18 | ### 2.2. Data Preprocessing 19 | 20 | #### Identify and handle missing values #### 21 | 22 | #### Correct data format #### 23 | 24 | ### 2.3. Feature Engineering ### 25 | 26 | #### Data Standardization #### 27 | 28 | 29 | #### Data Normalization #### 30 | 31 | 32 | #### Binning #### 33 | 34 | 35 | ### 2.4. 
Data Exploration ### 36 | 37 | 38 | 39 | -------------------------------------------------------------------------------- /Machine Learning/Regression/Readme.md: -------------------------------------------------------------------------------- 1 | Regression projects 2 | -------------------------------------------------------------------------------- /Natural_Language_processing/Data-Science-Resume-Selector/readme.md: -------------------------------------------------------------------------------- 1 | ## Data Science Resume Selector 2 | 3 | Selecting the resumes that are eligible for data scientist positions. The dataset used contains 125 resumes, in the resumetext column. Resumes were queried from Indeed.com with the keyword 'data scientist' and the location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified resume version at a later date. If it is 'flagged', the applicant is invited to interview. 4 | The data can be downloaded from __[here](https://www.kaggle.com/samdeeplearning/deepnlp)__ 5 | -------------------------------------------------------------------------------- /Natural_Language_processing/Data-Science-Resume-Selector/resume.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Natural_Language_processing/Data-Science-Resume-Selector/resume.csv -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/README.md: -------------------------------------------------------------------------------- 1 | # Sentiment Analysis Web App 2 | 3 | The notebook and Python files provided here, once completed, result in a simple web app which interacts with a deployed recurrent neural network performing sentiment analysis on movie reviews. 
This project assumes some familiarity with SageMaker; the mini-project, Sentiment Analysis using XGBoost, should provide enough background. 4 | 5 | Please see the [README](https://github.com/udacity/sagemaker-deployment/tree/master/README.md) in the root directory for instructions on setting up a SageMaker notebook and downloading the project files (as well as the other notebooks). 6 | -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/sevre/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | class LSTMClassifier(nn.Module): 4 | """ 5 | This is the simple RNN model we will be using to perform Sentiment Analysis. 6 | """ 7 | 8 | def __init__(self, embedding_dim, hidden_dim, vocab_size): 9 | """ 10 | Initialize the model by setting up the various layers. 11 | """ 12 | super(LSTMClassifier, self).__init__() 13 | 14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0) 15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim) 16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1) 17 | self.sig = nn.Sigmoid() 18 | 19 | self.word_dict = None 20 | 21 | def forward(self, x): 22 | """ 23 | Perform a forward pass of our model on some input. 
24 | """ 25 | x = x.t() 26 | lengths = x[0,:] 27 | reviews = x[1:,:] 28 | embeds = self.embedding(reviews) 29 | lstm_out, _ = self.lstm(embeds) 30 | out = self.dense(lstm_out) 31 | out = out[lengths - 1, range(len(lengths))] 32 | return self.sig(out.squeeze()) -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/sevre/predict.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import pickle 5 | import sys 6 | import sagemaker_containers 7 | import pandas as pd 8 | import numpy as np 9 | import torch 10 | import torch.nn as nn 11 | import torch.optim as optim 12 | import torch.utils.data 13 | 14 | from model import LSTMClassifier 15 | 16 | from utils import review_to_words, convert_and_pad 17 | 18 | def model_fn(model_dir): 19 | """Load the PyTorch model from the `model_dir` directory.""" 20 | print("Loading model.") 21 | 22 | # First, load the parameters used to create the model. 23 | model_info = {} 24 | model_info_path = os.path.join(model_dir, 'model_info.pth') 25 | with open(model_info_path, 'rb') as f: 26 | model_info = torch.load(f) 27 | 28 | print("model_info: {}".format(model_info)) 29 | 30 | # Determine the device and construct the model. 31 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 32 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size']) 33 | 34 | # Load the stored model parameters. 35 | model_path = os.path.join(model_dir, 'model.pth') 36 | with open(model_path, 'rb') as f: 37 | model.load_state_dict(torch.load(f)) 38 | 39 | # Load the saved word_dict. 
40 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl') 41 | with open(word_dict_path, 'rb') as f: 42 | model.word_dict = pickle.load(f) 43 | 44 | model.to(device).eval() 45 | 46 | print("Done loading model.") 47 | return model 48 | 49 | def input_fn(serialized_input_data, content_type): 50 | print('Deserializing the input data.') 51 | if content_type == 'text/plain': 52 | data = serialized_input_data.decode('utf-8') 53 | return data 54 | raise Exception('Requested unsupported ContentType in content_type: ' + content_type) 55 | 56 | def output_fn(prediction_output, accept): 57 | print('Serializing the generated output.') 58 | return str(prediction_output) 59 | 60 | def predict_fn(input_data, model): 61 | print('Inferring sentiment of input data.') 62 | 63 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 64 | 65 | if model.word_dict is None: 66 | raise Exception('Model has not been loaded properly, no word_dict.') 67 | 68 | # TODO: Process input_data so that it is ready to be sent to our model. 69 | # You should produce two variables: 70 | # data_X - A sequence of length 500 which represents the converted review 71 | # data_len - The length of the review 72 | 73 | words = review_to_words(input_data) 74 | data_X, data_len = convert_and_pad(model.word_dict, words) 75 | 76 | # Using data_X and data_len we construct an appropriate input tensor. Remember 77 | # that our model expects input data of the form 'len, review[500]'. 78 | data_pack = np.hstack((data_len, data_X)) 79 | data_pack = data_pack.reshape(1, -1) 80 | 81 | data = torch.from_numpy(data_pack) 82 | data = data.to(device) 83 | 84 | # Make sure to put the model into evaluation mode 85 | model.eval() 86 | 87 | # TODO: Compute the result of applying the model to the input data. 
The variable `result` should 88 | # be a numpy array which contains a single integer which is either 1 or 0 89 | 90 | with torch.no_grad(): 91 | output = model.forward(data) 92 | 93 | result = int(np.round(output.cpu().numpy()))  # move to CPU before converting to numpy 94 | return result 95 | -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/sevre/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | nltk 4 | beautifulsoup4 5 | html5lib -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/sevre/utils.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | from nltk.corpus import stopwords 3 | from nltk.stem.porter import * 4 | 5 | import re 6 | from bs4 import BeautifulSoup 7 | 8 | import pickle 9 | 10 | import os 11 | import glob 12 | 13 | def review_to_words(review): 14 | nltk.download("stopwords", quiet=True) 15 | stemmer = PorterStemmer() 16 | 17 | text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags 18 | text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case 19 | words = text.split() # Split string into words 20 | words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords 21 | words = [stemmer.stem(w) for w in words] # Stem each word 22 | 23 | return words 24 | 25 | def convert_and_pad(word_dict, sentence, pad=500): 26 | NOWORD = 0 # We will use 0 to represent the 'no word' category 27 | INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict 28 | 29 | working_sentence = [NOWORD] * pad 30 | 31 | for word_index, word in enumerate(sentence[:pad]): 32 | if word in word_dict: 33 | working_sentence[word_index] = word_dict[word] 34 | else: 35 | working_sentence[word_index] = INFREQ 36 | 37 | return working_sentence, min(len(sentence), 
pad) -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/train/model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | class LSTMClassifier(nn.Module): 4 | """ 5 | This is the simple RNN model we will be using to perform Sentiment Analysis. 6 | """ 7 | 8 | def __init__(self, embedding_dim, hidden_dim, vocab_size): 9 | """ 10 | Initialize the model by setting up the various layers. 11 | """ 12 | super(LSTMClassifier, self).__init__() 13 | 14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0) 15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim) 16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1) 17 | self.sig = nn.Sigmoid() 18 | 19 | self.word_dict = None 20 | 21 | def forward(self, x): 22 | """ 23 | Perform a forward pass of our model on some input. 24 | """ 25 | x = x.t() 26 | lengths = x[0,:] 27 | reviews = x[1:,:] 28 | embeds = self.embedding(reviews) 29 | lstm_out, _ = self.lstm(embeds) 30 | out = self.dense(lstm_out) 31 | out = out[lengths - 1, range(len(lengths))] 32 | return self.sig(out.squeeze()) -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/train/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | nltk 4 | beautifulsoup4 5 | html5lib -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/train/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import pickle 5 | import sys 6 | import sagemaker_containers 7 | import pandas as pd 8 | import torch 9 | import torch.optim as optim 10 | import torch.utils.data 11 | 12 | from model import LSTMClassifier 13 | 14 | 
def model_fn(model_dir): 15 | """Load the PyTorch model from the `model_dir` directory.""" 16 | print("Loading model.") 17 | 18 | # First, load the parameters used to create the model. 19 | model_info = {} 20 | model_info_path = os.path.join(model_dir, 'model_info.pth') 21 | with open(model_info_path, 'rb') as f: 22 | model_info = torch.load(f) 23 | 24 | print("model_info: {}".format(model_info)) 25 | 26 | # Determine the device and construct the model. 27 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 28 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size']) 29 | 30 | # Load the stored model parameters. 31 | model_path = os.path.join(model_dir, 'model.pth') 32 | with open(model_path, 'rb') as f: 33 | model.load_state_dict(torch.load(f)) 34 | 35 | # Load the saved word_dict. 36 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl') 37 | with open(word_dict_path, 'rb') as f: 38 | model.word_dict = pickle.load(f) 39 | 40 | model.to(device).eval() 41 | 42 | print("Done loading model.") 43 | return model 44 | 45 | def _get_train_data_loader(batch_size, training_dir): 46 | print("Get train data loader.") 47 | 48 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None) 49 | 50 | train_y = torch.from_numpy(train_data[[0]].values).float().squeeze() 51 | train_X = torch.from_numpy(train_data.drop([0], axis=1).values).long() 52 | 53 | train_ds = torch.utils.data.TensorDataset(train_X, train_y) 54 | 55 | return torch.utils.data.DataLoader(train_ds, batch_size=batch_size) 56 | 57 | 58 | def train(model, train_loader, epochs, optimizer, loss_fn, device): 59 | """ 60 | This is the training method that is called by the PyTorch training script. The parameters 61 | passed are as follows: 62 | model - The PyTorch model that we wish to train. 63 | train_loader - The PyTorch DataLoader that should be used during training. 
64 | epochs - The total number of epochs to train for. 65 | optimizer - The optimizer to use during training. 66 | loss_fn - The loss function used for training. 67 | device - Where the model and data should be loaded (gpu or cpu). 68 | """ 69 | 70 | # TODO: Paste the train() method developed in the notebook here. 71 | 72 | for epoch in range(1, epochs + 1): 73 | model.train() 74 | total_loss = 0 75 | for batch in train_loader: 76 | batch_X, batch_y = batch 77 | 78 | batch_X = batch_X.to(device) 79 | batch_y = batch_y.to(device) 80 | 81 | # TODO: Complete this train method to train the model provided. 82 | 83 | output = model(batch_X) 84 | loss = loss_fn(output, batch_y) 85 | optimizer.zero_grad() 86 | loss.backward() 87 | optimizer.step() 88 | 89 | total_loss += loss.data.item() 90 | print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader))) 91 | 92 | 93 | if __name__ == '__main__': 94 | # All of the model parameters and training parameters are sent as arguments when the script 95 | # is executed. Here we set up an argument parser to easily access the parameters. 
96 | 97 | parser = argparse.ArgumentParser() 98 | 99 | # Training Parameters 100 | parser.add_argument('--batch-size', type=int, default=512, metavar='N', 101 | help='input batch size for training (default: 512)') 102 | parser.add_argument('--epochs', type=int, default=10, metavar='N', 103 | help='number of epochs to train (default: 10)') 104 | parser.add_argument('--seed', type=int, default=1, metavar='S', 105 | help='random seed (default: 1)') 106 | 107 | # Model Parameters 108 | parser.add_argument('--embedding_dim', type=int, default=32, metavar='N', 109 | help='size of the word embeddings (default: 32)') 110 | parser.add_argument('--hidden_dim', type=int, default=100, metavar='N', 111 | help='size of the hidden dimension (default: 100)') 112 | parser.add_argument('--vocab_size', type=int, default=5000, metavar='N', 113 | help='size of the vocabulary (default: 5000)') 114 | 115 | # SageMaker Parameters 116 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS'])) 117 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST']) 118 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 119 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING']) 120 | parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS']) 121 | 122 | args = parser.parse_args() 123 | 124 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 125 | print("Using device {}.".format(device)) 126 | 127 | torch.manual_seed(args.seed) 128 | 129 | # Load the training data. 130 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir) 131 | 132 | # Build the model. 
133 | model = LSTMClassifier(args.embedding_dim, args.hidden_dim, args.vocab_size).to(device) 134 | 135 | with open(os.path.join(args.data_dir, "word_dict.pkl"), "rb") as f: 136 | model.word_dict = pickle.load(f) 137 | 138 | print("Model loaded with embedding_dim {}, hidden_dim {}, vocab_size {}.".format( 139 | args.embedding_dim, args.hidden_dim, args.vocab_size 140 | )) 141 | 142 | # Train the model. 143 | optimizer = optim.Adam(model.parameters()) 144 | loss_fn = torch.nn.BCELoss() 145 | 146 | train(model, train_loader, args.epochs, optimizer, loss_fn, device) 147 | 148 | # Save the parameters used to construct the model 149 | model_info_path = os.path.join(args.model_dir, 'model_info.pth') 150 | with open(model_info_path, 'wb') as f: 151 | model_info = { 152 | 'embedding_dim': args.embedding_dim, 153 | 'hidden_dim': args.hidden_dim, 154 | 'vocab_size': args.vocab_size, 155 | } 156 | torch.save(model_info, f) 157 | 158 | # Save the word_dict 159 | word_dict_path = os.path.join(args.model_dir, 'word_dict.pkl') 160 | with open(word_dict_path, 'wb') as f: 161 | pickle.dump(model.word_dict, f) 162 | 163 | # Save the model parameters 164 | model_path = os.path.join(args.model_dir, 'model.pth') 165 | with open(model_path, 'wb') as f: 166 | torch.save(model.cpu().state_dict(), f) 167 | -------------------------------------------------------------------------------- /Natural_Language_processing/Sentiment-analysis/website/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Sentiment Analysis Web App 5 | 6 | 7 | 8 | 9 | 10 | 11 | 32 | 33 | 34 | 35 | 36 |
37 | Is your review positive, or negative? 38 | Enter your review below and click submit to find out... 39 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 
50 | 51 | 52 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/README.md: -------------------------------------------------------------------------------- 1 | # Plagiarism Detector Web App 2 | This repository contains code and associated files for deploying a plagiarism detector using AWS SageMaker. 3 | 4 | ## Project Overview 5 | 6 | In this project, I build a plagiarism detector that examines a text file and performs binary classification, labeling that file as either *plagiarized* or *not*, depending on how similar that text file is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial, and the differences between paraphrased answers and original work are often not so obvious. 7 | 8 | This project is broken down into three main notebooks: 9 | 10 | **Notebook 1: Data Exploration** 11 | * Load in the corpus of plagiarism text data. 12 | * Explore the existing data features and the data distribution. 13 | * This first notebook is **not** required in your final project submission. 14 | 15 | **Notebook 2: Feature Engineering** 16 | 17 | * Clean and pre-process the text data. 18 | * Define features for comparing the similarity of an answer text and a source text, and extract similarity features. 19 | * Select "good" features, by analyzing the correlations between different features. 20 | * Create train/test `.csv` files that hold the relevant features and class labels for train/test data points. 21 | 22 | **Notebook 3: Train and Deploy Your Model in SageMaker** 23 | 24 | * Upload your train/test feature data to S3. 25 | * Define a binary classification model and a training script. 26 | * Train your model and deploy it using SageMaker. 27 | * Evaluate your deployed classifier. 
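The Notebook 3 workflow above can be sketched locally in plain scikit-learn, leaving SageMaker out. The feature rows below are made-up illustrations of the train/test `.csv` layout (column 0 is the class label, the remaining columns are similarity features); `GaussianNB` is used here only as a stand-in classifier, not necessarily the model the notebook itself trains.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature rows: [class label, containment, lcs_ratio]
train = np.array([
    [1, 0.95, 0.90],
    [1, 0.80, 0.75],
    [1, 0.88, 0.70],
    [0, 0.40, 0.10],
    [0, 0.35, 0.05],
    [0, 0.50, 0.12],
])
test = np.array([
    [1, 0.92, 0.85],
    [0, 0.45, 0.08],
])

# Split off the label column, mirroring how the .csv files are laid out
X_train, y_train = train[:, 1:], train[:, 0]
X_test, y_test = test[:, 1:], test[:, 0]

# Fit the binary classifier and evaluate it on the held-out rows
clf = GaussianNB().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")  # → test accuracy: 1.00
```

In the real project the same fit/score step runs inside a SageMaker training script, and evaluation hits the deployed endpoint instead of a local `score` call.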
28 | 29 | --- 30 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/helpers.py: -------------------------------------------------------------------------------- 1 | import re 2 | import pandas as pd 3 | import operator 4 | 5 | # Add 'datatype' column that indicates if the record is original wiki answer as 0, training data 1, test data 2, onto 6 | # the dataframe - uses stratified random sampling (with seed) to sample by task & plagiarism amount 7 | 8 | # Use function to label datatype for training 1 or test 2 9 | def create_datatype(df, train_value, test_value, datatype_var, compare_dfcolumn, operator_of_compare, value_of_compare, 10 | sampling_number, sampling_seed): 11 | # Subsets dataframe by condition relating to statement built from: 12 | # 'compare_dfcolumn' 'operator_of_compare' 'value_of_compare' 13 | df_subset = df[operator_of_compare(df[compare_dfcolumn], value_of_compare)] 14 | df_subset = df_subset.drop(columns = [datatype_var]) 15 | 16 | # Prints counts by task and compare_dfcolumn for subset df 17 | #print("\nCounts by Task & " + compare_dfcolumn + ":\n", df_subset.groupby(['Task', compare_dfcolumn]).size().reset_index(name="Counts") ) 18 | 19 | # Sets all datatype to value for training for df_subset 20 | df_subset.loc[:, datatype_var] = train_value 21 | 22 | # Performs stratified random sample of subset dataframe to create new df with subset values 23 | df_sampled = df_subset.groupby(['Task', compare_dfcolumn], group_keys=False).apply(lambda x: x.sample(min(len(x), sampling_number), random_state = sampling_seed)) 24 | df_sampled = df_sampled.drop(columns = [datatype_var]) 25 | # Sets all datatype to value for test_value for df_sampled 26 | df_sampled.loc[:, datatype_var] = test_value 27 | 28 | # Prints counts by compare_dfcolumn for selected sample 29 | #print("\nCounts by "+ compare_dfcolumn + ":\n", 
df_sampled.groupby([compare_dfcolumn]).size().reset_index(name="Counts") ) 30 | #print("\nSampled DF:\n",df_sampled) 31 | 32 | # Labels all datatype_var columns as train_value which will be overwritten to 33 | # test_value in next for loop for all test cases chosen with stratified sample 34 | for index in df_sampled.index: 35 | # Labels all datatype_var columns with test_value for stratified test sample 36 | df_subset.loc[index, datatype_var] = test_value 37 | 38 | #print("\nSubset DF:\n",df_subset) 39 | # Adds test_value and train_value for all relevant data in main dataframe 40 | for index in df_subset.index: 41 | # Labels all datatype_var columns in df with train_value/test_value based upon 42 | # stratified test sample and subset of df 43 | df.loc[index, datatype_var] = df_subset.loc[index, datatype_var] 44 | 45 | # returns nothing because dataframe df already altered 46 | 47 | def train_test_dataframe(clean_df, random_seed=100): 48 | 49 | new_df = clean_df.copy() 50 | 51 | # Initialize datatype as 0 initially for all records - after function 0 will remain only for original wiki answers 52 | new_df.loc[:,'Datatype'] = 0 53 | 54 | # Creates test & training datatypes for plagiarized answers (1,2,3) 55 | create_datatype(new_df, 1, 2, 'Datatype', 'Category', operator.gt, 0, 1, random_seed) 56 | 57 | # Creates test & training datatypes for NON-plagiarized answers (0) 58 | create_datatype(new_df, 1, 2, 'Datatype', 'Category', operator.eq, 0, 2, random_seed) 59 | 60 | # creating a dictionary of categorical:numerical mappings for plagiarism categories 61 | mapping = {0:'orig', 1:'train', 2:'test'} 62 | 63 | # traversing through dataframe and replacing categorical data 64 | new_df.Datatype = [mapping[item] for item in new_df.Datatype] 65 | 66 | return new_df 67 | 68 | 69 | # helper function for pre-processing text given a file 70 | def process_file(file): 71 | # put text in all lower case letters 72 | all_text = file.read().lower() 73 | 74 | # remove all non-alphanumeric 
chars 75 | all_text = re.sub(r"[^a-zA-Z0-9]", " ", all_text) 76 | # remove newlines/tabs, etc. so it's easier to match phrases, later 77 | all_text = re.sub(r"\t", " ", all_text) 78 | all_text = re.sub(r"\n", " ", all_text) 79 | all_text = re.sub(r" +", " ", all_text) # collapse runs of spaces into one 80 | 81 | 82 | return all_text 83 | 84 | 85 | def create_text_column(df, file_directory='data/'): 86 | '''Reads in the files listed in a df and returns that df with an additional column, `Text`. 87 | :param df: A dataframe of file information including a column for `File` 88 | :param file_directory: the main directory where files are stored 89 | :return: A dataframe with processed text ''' 90 | 91 | # create copy to modify 92 | text_df = df.copy() 93 | 94 | # store processed text 95 | text = [] 96 | 97 | # for each file (row) in the df, read in the file 98 | for row_i in df.index: 99 | filename = df.iloc[row_i]['File'] 100 | #print(filename) 101 | file_path = file_directory + filename 102 | with open(file_path, 'r', encoding='utf-8', errors='ignore') as file: 103 | 104 | # standardize text using helper function 105 | file_text = process_file(file) 106 | # append processed text to list 107 | text.append(file_text) 108 | 109 | # add column to the copied dataframe 110 | text_df['Text'] = text 111 | 112 | return text_df 113 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/palagrism_data/test.csv: -------------------------------------------------------------------------------- 1 | 1,1.0,0.9222797927461139,0.8207547169811321 2 | 1,0.7653061224489796,0.5896551724137931,0.6217105263157895 3 | 1,0.8844444444444445,0.18099547511312217,0.597457627118644 4 | 1,0.6190476190476191,0.043243243243243246,0.42783505154639173 5 | 1,0.92,0.39436619718309857,0.775 6 | 1,0.9926739926739927,0.9739776951672863,0.9930555555555556 7 | 0,0.4126984126984127,0.0,0.3466666666666667 8 | 
0,0.4626865671641791,0.0,0.18932038834951456 9 | 0,0.581151832460733,0.0,0.24742268041237114 10 | 0,0.5842105263157895,0.0,0.29441624365482233 11 | 0,0.5663716814159292,0.0,0.25833333333333336 12 | 0,0.48148148148148145,0.022900763358778626,0.2789115646258503 13 | 1,0.6197916666666666,0.026595744680851064,0.3415841584158416 14 | 1,0.9217391304347826,0.6548672566371682,0.9294117647058824 15 | 1,1.0,0.9224806201550387,1.0 16 | 1,0.8615384615384616,0.06282722513089005,0.5047169811320755 17 | 1,0.6261682242990654,0.22397476340694006,0.5585585585585585 18 | 1,1.0,0.9688715953307393,0.9966996699669967 19 | 0,0.3838383838383838,0.010309278350515464,0.178743961352657 20 | 1,1.0,0.9446494464944649,0.8546712802768166 21 | 0,0.6139240506329114,0.0,0.2983425414364641 22 | 1,0.9727626459143969,0.8300395256916996,0.9270833333333334 23 | 1,0.9628099173553719,0.6890756302521008,0.9098039215686274 24 | 0,0.4152542372881356,0.0,0.1774193548387097 25 | 0,0.5321888412017167,0.017467248908296942,0.24583333333333332 26 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/palagrism_data/train.csv: -------------------------------------------------------------------------------- 1 | 0,0.39814814814814814,0.0,0.1917808219178082 2 | 1,0.8693693693693694,0.44954128440366975,0.8464912280701754 3 | 1,0.5935828877005348,0.08196721311475409,0.3160621761658031 4 | 0,0.5445026178010471,0.0,0.24257425742574257 5 | 0,0.32950191570881227,0.0,0.16117216117216118 6 | 0,0.5903083700440529,0.0,0.30165289256198347 7 | 1,0.7597765363128491,0.24571428571428572,0.484304932735426 8 | 0,0.5161290322580645,0.0,0.2708333333333333 9 | 0,0.44086021505376344,0.0,0.22395833333333334 10 | 1,0.9794520547945206,0.7887323943661971,0.9 11 | 1,0.9513888888888888,0.5214285714285715,0.8940397350993378 12 | 1,0.9764705882352941,0.5783132530120482,0.8232044198895028 13 | 1,0.8117647058823529,0.28313253012048195,0.45977011494252873 14 | 
0,0.4411764705882353,0.0,0.3055555555555556 15 | 0,0.4888888888888889,0.0,0.2826086956521739 16 | 1,0.813953488372093,0.6341463414634146,0.7888888888888889 17 | 0,0.6111111111111112,0.0,0.3246753246753247 18 | 1,1.0,0.9659090909090909,1.0 19 | 1,0.634020618556701,0.005263157894736842,0.36893203883495146 20 | 1,0.5829383886255924,0.08695652173913043,0.4166666666666667 21 | 1,0.6379310344827587,0.30701754385964913,0.4898785425101215 22 | 0,0.42038216560509556,0.0,0.21875 23 | 1,0.6877637130801688,0.07725321888412018,0.5163934426229508 24 | 1,0.6766467065868264,0.11042944785276074,0.4725274725274725 25 | 1,0.7692307692307693,0.45084745762711864,0.6064516129032258 26 | 1,0.7122641509433962,0.08653846153846154,0.536697247706422 27 | 1,0.6299212598425197,0.28,0.39436619718309857 28 | 1,0.7157360406091371,0.0051813471502590676,0.3431372549019608 29 | 0,0.3320610687022901,0.0,0.15302491103202848 30 | 1,0.7172131147540983,0.07916666666666666,0.4559386973180077 31 | 1,0.8782608695652174,0.47345132743362833,0.82 32 | 1,0.5298013245033113,0.31543624161073824,0.45 33 | 0,0.5721153846153846,0.0,0.22935779816513763 34 | 0,0.319672131147541,0.0,0.16535433070866143 35 | 0,0.53,0.0,0.26046511627906976 36 | 1,0.78,0.6071428571428571,0.6699029126213593 37 | 0,0.6526946107784432,0.0,0.3551912568306011 38 | 0,0.4439461883408072,0.0,0.23376623376623376 39 | 1,0.6650246305418719,0.18090452261306533,0.3492647058823529 40 | 1,0.7281553398058253,0.034653465346534656,0.3476190476190476 41 | 1,0.7620481927710844,0.2896341463414634,0.5677233429394812 42 | 1,0.9470198675496688,0.2857142857142857,0.774390243902439 43 | 1,0.3684210526315789,0.0,0.19298245614035087 44 | 0,0.5328947368421053,0.0,0.21818181818181817 45 | 0,0.6184971098265896,0.005917159763313609,0.26666666666666666 46 | 0,0.5103092783505154,0.010526315789473684,0.22110552763819097 47 | 0,0.5798319327731093,0.0,0.2289156626506024 48 | 0,0.40703517587939697,0.0,0.1722488038277512 49 | 0,0.5154639175257731,0.0,0.23684210526315788 50 | 
1,0.5845410628019324,0.04926108374384237,0.29493087557603687 51 | 1,0.6171875,0.1693548387096774,0.5037593984962406 52 | 1,1.0,0.84251968503937,0.9117647058823529 53 | 1,0.9916666666666667,0.8879310344827587,0.9923076923076923 54 | 0,0.550561797752809,0.0,0.2833333333333333 55 | 0,0.41935483870967744,0.0,0.2616822429906542 56 | 1,0.8351648351648352,0.034482758620689655,0.6470588235294118 57 | 1,0.9270833333333334,0.29347826086956524,0.85 58 | 0,0.4928909952606635,0.0,0.2350230414746544 59 | 1,0.7087378640776699,0.3217821782178218,0.6619718309859155 60 | 1,0.8633879781420765,0.30726256983240224,0.7911111111111111 61 | 1,0.9606060606060606,0.8650306748466258,0.9298245614035088 62 | 0,0.4380165289256198,0.0,0.2230769230769231 63 | 1,0.7336683417085427,0.07179487179487179,0.4900990099009901 64 | 1,0.5138888888888888,0.0,0.25203252032520324 65 | 0,0.4861111111111111,0.0,0.22767857142857142 66 | 1,0.8451882845188284,0.3021276595744681,0.6437246963562753 67 | 1,0.485,0.0,0.24271844660194175 68 | 1,0.9506726457399103,0.7808219178082192,0.8395061728395061 69 | 1,0.551219512195122,0.23383084577114427,0.2830188679245283 70 | 0,0.3612565445026178,0.0,0.16176470588235295 71 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/problem_unittests.py: -------------------------------------------------------------------------------- 1 | from unittest.mock import MagicMock, patch 2 | import sklearn.naive_bayes 3 | import numpy as np 4 | import pandas as pd 5 | import re 6 | 7 | # test csv file 8 | TEST_CSV = 'data/test_info.csv' 9 | 10 | class AssertTest(object): 11 | '''Defines general test behavior.''' 12 | def __init__(self, params): 13 | self.assert_param_message = '\n'.join([str(k) + ': ' + str(v) + '' for k, v in params.items()]) 14 | 15 | def test(self, assert_condition, assert_message): 16 | assert assert_condition, assert_message + '\n\nUnit Test Function Parameters\n' + 
self.assert_param_message 17 | 18 | def _print_success_message(): 19 | print('Tests Passed!') 20 | 21 | # test clean_dataframe 22 | def test_numerical_df(numerical_dataframe): 23 | 24 | # test result 25 | transformed_df = numerical_dataframe(TEST_CSV) 26 | 27 | # Check that the return type is a DataFrame 28 | assert isinstance(transformed_df, pd.DataFrame), 'Returned type is {}.'.format(type(transformed_df)) 29 | 30 | # check columns 31 | column_names = list(transformed_df) 32 | assert 'File' in column_names, 'No File column found.' 33 | assert 'Task' in column_names, 'No Task column found.' 34 | assert 'Category' in column_names, 'No Category column found.' 35 | assert 'Class' in column_names, 'No Class column found.' 36 | 37 | # check conversion values 38 | assert transformed_df.loc[0, 'Category'] == 1, '`heavy` plagiarism mapping test failed.' 39 | assert transformed_df.loc[2, 'Category'] == 0, '`non` plagiarism mapping test failed.' 40 | assert transformed_df.loc[30, 'Category'] == 3, '`cut` plagiarism mapping test failed.' 41 | assert transformed_df.loc[5, 'Category'] == 2, '`light` plagiarism mapping test failed.' 42 | assert transformed_df.loc[37, 'Category'] == -1, 'original file mapping test failed; should have Category = -1.' 43 | assert transformed_df.loc[41, 'Category'] == -1, 'original file mapping test failed; should have Category = -1.' 
44 | 45 | _print_success_message() 46 | 47 | 48 | def test_containment(complete_df, containment_fn): 49 | 50 | # check basic format and value 51 | # for n = 1 and just the fifth file 52 | test_val = containment_fn(complete_df, 1, 'g0pA_taske.txt') 53 | 54 | assert isinstance(test_val, float), 'Returned type is {}.'.format(type(test_val)) 55 | assert test_val<=1.0, 'It appears that the value is not normalized; expected a value <=1, got: '+str(test_val) 56 | 57 | # known vals for first few files 58 | filenames = ['g0pA_taska.txt', 'g0pA_taskb.txt', 'g0pA_taskc.txt', 'g0pA_taskd.txt'] 59 | ngram_1 = [0.39814814814814814, 1.0, 0.86936936936936937, 0.5935828877005348] 60 | ngram_3 = [0.0093457943925233638, 0.96410256410256412, 0.61363636363636365, 0.15675675675675677] 61 | 62 | # results for comparison 63 | results_1gram = [] 64 | results_3gram = [] 65 | 66 | for i in range(4): 67 | val_1 = containment_fn(complete_df, 1, filenames[i]) 68 | val_3 = containment_fn(complete_df, 3, filenames[i]) 69 | results_1gram.append(val_1) 70 | results_3gram.append(val_3) 71 | 72 | # check correct results 73 | assert all(np.isclose(results_1gram, ngram_1, rtol=1e-04)), \ 74 | 'n=1 calculations are incorrect. Double check the intersection calculation.' 75 | # check correct results 76 | assert all(np.isclose(results_3gram, ngram_3, rtol=1e-04)), \ 77 | 'n=3 calculations are incorrect.' 
78 | 79 | _print_success_message() 80 | 81 | def test_lcs(df, lcs_word): 82 | 83 | test_index = 10 # file 10 84 | 85 | # get answer file text 86 | answer_text = df.loc[test_index, 'Text'] 87 | 88 | # get text for orig file 89 | # find the associated task type (one character, a-e) 90 | task = df.loc[test_index, 'Task'] 91 | # we know that source texts have Class = -1 92 | orig_rows = df[(df['Class'] == -1)] 93 | orig_row = orig_rows[(orig_rows['Task'] == task)] 94 | source_text = orig_row['Text'].values[0] 95 | 96 | # calculate LCS 97 | test_val = lcs_word(answer_text, source_text) 98 | 99 | # check type 100 | assert isinstance(test_val, float), 'Returned type is {}.'.format(type(test_val)) 101 | assert test_val<=1.0, 'It appears that the value is not normalized; expected a value <=1, got: '+str(test_val) 102 | 103 | # known vals for first few files 104 | lcs_vals = [0.1917808219178082, 0.8207547169811321, 0.8464912280701754, 0.3160621761658031, 0.24257425742574257] 105 | 106 | # results for comparison 107 | results = [] 108 | 109 | for i in range(5): 110 | # get answer and source text 111 | answer_text = df.loc[i, 'Text'] 112 | task = df.loc[i, 'Task'] 113 | # we know that source texts have Class = -1 114 | orig_rows = df[(df['Class'] == -1)] 115 | orig_row = orig_rows[(orig_rows['Task'] == task)] 116 | source_text = orig_row['Text'].values[0] 117 | # calc lcs 118 | val = lcs_word(answer_text, source_text) 119 | results.append(val) 120 | 121 | # check correct results 122 | assert all(np.isclose(results, lcs_vals, rtol=1e-05)), 'LCS calculations are incorrect.' 
123 | 124 | _print_success_message() 125 | 126 | def test_data_split(train_x, train_y, test_x, test_y): 127 | 128 | # check types 129 | assert isinstance(train_x, np.ndarray),\ 130 | 'train_x is not an array, instead got type: {}'.format(type(train_x)) 131 | assert isinstance(train_y, np.ndarray),\ 132 | 'train_y is not an array, instead got type: {}'.format(type(train_y)) 133 | assert isinstance(test_x, np.ndarray),\ 134 | 'test_x is not an array, instead got type: {}'.format(type(test_x)) 135 | assert isinstance(test_y, np.ndarray),\ 136 | 'test_y is not an array, instead got type: {}'.format(type(test_y)) 137 | 138 | # should hold all 95 submission files 139 | assert len(train_x) + len(test_x) == 95, \ 140 | 'Unexpected amount of train + test data. Expecting 95 answer text files, got ' +str(len(train_x) + len(test_x)) 141 | assert len(test_x) > 1, \ 142 | 'Unexpected amount of test data. There should be multiple test files.' 143 | 144 | # check shape 145 | assert train_x.shape[1]==2, \ 146 | 'train_x should have as many columns as selected features, got: {}'.format(train_x.shape[1]) 147 | assert len(train_y.shape)==1, \ 148 | 'train_y should be a 1D array, got shape: {}'.format(train_y.shape) 149 | 150 | _print_success_message() 151 | 152 | 153 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/model.py: -------------------------------------------------------------------------------- 1 | # torch imports 2 | import torch.nn.functional as F 3 | import torch.nn as nn 4 | 5 | 6 | ## TODO: Complete this classifier 7 | class BinaryClassifier(nn.Module): 8 | """ 9 | Define a neural network that performs binary classification. 10 | The network should accept your number of features as input, and produce 11 | a single sigmoid value, that can be rounded to a label: 0 or 1, as output. 12 | 13 | Notes on training: 14 | To train a binary classifier in PyTorch, use BCELoss. 
15 | BCELoss is binary cross entropy loss, documentation: https://pytorch.org/docs/stable/nn.html#torch.nn.BCELoss 16 | """ 17 | 18 | ## Define the init function; the input params are required (for the loading code in train.py to work) 19 | def __init__(self, input_features, hidden_dim, output_dim): 20 | """ 21 | Initialize the model by setting up linear layers. 22 | Use the input parameters to help define the layers of your model. 23 | :param input_features: the number of input features in your training/test data 24 | :param hidden_dim: helps define the number of nodes in the hidden layer(s) 25 | :param output_dim: the number of outputs you want to produce 26 | """ 27 | super(BinaryClassifier, self).__init__() 28 | 29 | # a minimal two-layer network that satisfies the spec above 30 | self.fc1 = nn.Linear(input_features, hidden_dim) 31 | self.fc2 = nn.Linear(hidden_dim, output_dim) 32 | self.sig = nn.Sigmoid() 33 | ## Define the feedforward behavior of the network 34 | def forward(self, x): 35 | """ 36 | Perform a forward pass of our model on input features, x. 37 | :param x: A batch of input features of size (batch_size, input_features) 38 | :return: A single, sigmoid-activated value as output 39 | """ 40 | # hidden layer with ReLU activation, then a sigmoid-activated output 41 | x = F.relu(self.fc1(x)) 42 | x = self.sig(self.fc2(x)) 43 | return x.squeeze(-1) # shape (batch_size,), matching the BCELoss targets in train.py 44 | 45 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/predict.py: -------------------------------------------------------------------------------- 1 | # import libraries 2 | import os 3 | import numpy as np 4 | import torch 5 | from io import BytesIO 6 | 7 | # import model from model.py, by name 8 | from model import BinaryClassifier 9 | 10 | # default content type is numpy array 11 | NP_CONTENT_TYPE = 'application/x-npy' 12 | 13 | 14 | # Provided model load function 15 | def model_fn(model_dir): 16 | """Load the PyTorch model from the `model_dir` directory.""" 17 | print("Loading model.") 18 | 19 | # First, load the parameters used to create the model. 
20 | model_info = {} 21 | model_info_path = os.path.join(model_dir, 'model_info.pth') 22 | with open(model_info_path, 'rb') as f: 23 | model_info = torch.load(f) 24 | 25 | print("model_info: {}".format(model_info)) 26 | 27 | # Determine the device and construct the model. 28 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 29 | model = BinaryClassifier(model_info['input_features'], model_info['hidden_dim'], model_info['output_dim']) 30 | 31 | # Load the stored model parameters. 32 | model_path = os.path.join(model_dir, 'model.pth') 33 | with open(model_path, 'rb') as f: 34 | model.load_state_dict(torch.load(f)) 35 | 36 | # Prep for testing 37 | model.to(device).eval() 38 | 39 | print("Done loading model.") 40 | return model 41 | 42 | 43 | # Provided input data loading 44 | def input_fn(serialized_input_data, content_type): 45 | print('Deserializing the input data.') 46 | if content_type == NP_CONTENT_TYPE: 47 | stream = BytesIO(serialized_input_data) 48 | return np.load(stream) 49 | raise Exception('Requested unsupported ContentType in content_type: ' + content_type) 50 | 51 | # Provided output data handling 52 | def output_fn(prediction_output, accept): 53 | print('Serializing the generated output.') 54 | if accept == NP_CONTENT_TYPE: 55 | stream = BytesIO() 56 | np.save(stream, prediction_output) 57 | return stream.getvalue(), accept 58 | raise Exception('Requested unsupported ContentType in Accept: ' + accept) 59 | 60 | 61 | # Provided predict function 62 | def predict_fn(input_data, model): 63 | print('Predicting class labels for the input data...') 64 | 65 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 66 | 67 | # Process input_data so that it is ready to be sent to our model. 
68 | data = torch.from_numpy(input_data.astype('float32')) 69 | data = data.to(device) 70 | 71 | # Put the model into evaluation mode 72 | model.eval() 73 | 74 | # Compute the result of applying the model to the input data 75 | # The variable `out_label` should be a rounded value, either 1 or 0 76 | out = model(data) 77 | out_np = out.cpu().detach().numpy() 78 | out_label = out_np.round() 79 | 80 | return out_label -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import pandas as pd 5 | import torch 6 | import torch.optim as optim 7 | import torch.utils.data 8 | 9 | # imports the model in model.py by name 10 | from model import BinaryClassifier 11 | 12 | def model_fn(model_dir): 13 | """Load the PyTorch model from the `model_dir` directory.""" 14 | print("Loading model.") 15 | 16 | # First, load the parameters used to create the model. 17 | model_info = {} 18 | model_info_path = os.path.join(model_dir, 'model_info.pth') 19 | with open(model_info_path, 'rb') as f: 20 | model_info = torch.load(f) 21 | 22 | print("model_info: {}".format(model_info)) 23 | 24 | # Determine the device and construct the model. 25 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 26 | model = BinaryClassifier(model_info['input_features'], model_info['hidden_dim'], model_info['output_dim']) 27 | 28 | # Load the stored model parameters. 
29 | model_path = os.path.join(model_dir, 'model.pth') 30 | with open(model_path, 'rb') as f: 31 | model.load_state_dict(torch.load(f)) 32 | 33 | # set to eval mode, could use no_grad 34 | model.to(device).eval() 35 | 36 | print("Done loading model.") 37 | return model 38 | 39 | # Gets training data in batches from the train.csv file 40 | def _get_train_data_loader(batch_size, training_dir): 41 | print("Get train data loader.") 42 | 43 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None) 44 | 45 | train_y = torch.from_numpy(train_data[[0]].values).float().squeeze() 46 | train_x = torch.from_numpy(train_data.drop([0], axis=1).values).float() 47 | 48 | train_ds = torch.utils.data.TensorDataset(train_x, train_y) 49 | 50 | return torch.utils.data.DataLoader(train_ds, batch_size=batch_size) 51 | 52 | 53 | # Provided training function 54 | def train(model, train_loader, epochs, criterion, optimizer, device): 55 | """ 56 | This is the training method that is called by the PyTorch training script. The parameters 57 | passed are as follows: 58 | model - The PyTorch model that we wish to train. 59 | train_loader - The PyTorch DataLoader that should be used during training. 60 | epochs - The total number of epochs to train for. 61 | criterion - The loss function used for training. 62 | optimizer - The optimizer to use during training. 63 | device - Where the model and data should be loaded (gpu or cpu). 64 | """ 65 | 66 | # training loop is provided 67 | for epoch in range(1, epochs + 1): 68 | model.train() # Make sure that the model is in training mode. 
69 | 70 | total_loss = 0 71 | 72 | for batch in train_loader: 73 | # get data 74 | batch_x, batch_y = batch 75 | 76 | batch_x = batch_x.to(device) 77 | batch_y = batch_y.to(device) 78 | 79 | optimizer.zero_grad() 80 | 81 | # get predictions from model 82 | y_pred = model(batch_x) 83 | 84 | # perform backprop 85 | loss = criterion(y_pred, batch_y) 86 | loss.backward() 87 | optimizer.step() 88 | 89 | total_loss += loss.item() 90 | 91 | print("Epoch: {}, Loss: {}".format(epoch, total_loss / len(train_loader))) 92 | 93 | 94 | ## TODO: Complete the main code 95 | if __name__ == '__main__': 96 | 97 | # All of the model parameters and training parameters are sent as arguments 98 | # when this script is executed, during a training job 99 | 100 | # Here we set up an argument parser to easily access the parameters 101 | parser = argparse.ArgumentParser() 102 | 103 | # SageMaker parameters, like the directories for training data and saving models; set automatically 104 | # Do not need to change 105 | parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) 106 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 107 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN']) 108 | 109 | # Training Parameters, given 110 | parser.add_argument('--batch-size', type=int, default=10, metavar='N', 111 | help='input batch size for training (default: 10)') 112 | parser.add_argument('--epochs', type=int, default=10, metavar='N', 113 | help='number of epochs to train (default: 10)') 114 | parser.add_argument('--seed', type=int, default=1, metavar='S', 115 | help='random seed (default: 1)') 116 | 117 | # Model Parameters: input_features, hidden_dim, output_dim 118 | parser.add_argument('--input_features', type=int, default=2, metavar='IN', help='number of input features (default: 2)') 119 | parser.add_argument('--hidden_dim', type=int, default=10, metavar='H', help='size of the hidden layer (default: 10)') 120 | parser.add_argument('--output_dim', type=int, default=1, metavar='OUT', help='number of outputs (default: 1)') 121 | # args holds all passed-in arguments 122 | args = parser.parse_args() 123 | 124 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 125 | 
print("Using device {}.".format(device)) 126 | 127 | torch.manual_seed(args.seed) 128 | 129 | # Load the training data. 130 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir) 131 | 132 | 133 | ## --- Your code here --- ## 134 | 135 | # Build the model by passing in the input params, and move it to the device 136 | # (params come from the parser, e.g. args.epochs or args.hidden_dim) 137 | 138 | model = BinaryClassifier(args.input_features, args.hidden_dim, args.output_dim).to(device) 139 | 140 | # Define an optimizer and loss function for training 141 | optimizer = optim.Adam(model.parameters(), lr=0.001) # Adam is one reasonable choice here 142 | criterion = torch.nn.BCELoss() # binary cross entropy, as noted in model.py 143 | 144 | # Trains the model (given line of code, which calls the above training function) 145 | train(model, train_loader, args.epochs, criterion, optimizer, device) 146 | 147 | # Save the three parameters used to construct the model 148 | # Keep the keys of this dictionary as they are 149 | model_info_path = os.path.join(args.model_dir, 'model_info.pth') 150 | with open(model_info_path, 'wb') as f: 151 | model_info = { 152 | 'input_features': args.input_features, 153 | 'hidden_dim': args.hidden_dim, 154 | 'output_dim': args.output_dim, 155 | } 156 | torch.save(model_info, f) 157 | 158 | ## --- End of your code --- ## 159 | 160 | 161 | # Save the model parameters 162 | model_path = os.path.join(args.model_dir, 'model.pth') 163 | with open(model_path, 'wb') as f: 164 | torch.save(model.cpu().state_dict(), f) 165 | -------------------------------------------------------------------------------- /Natural_Language_processing/plagiarism-detector-web-app/source_sklearn/train.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import argparse 4 | import os 5 | import pandas as pd 6 | 7 | # sklearn.externals.joblib is deprecated in 0.21 and was removed in 0.23. 
8 | try: 9 | import joblib # joblib is a standalone package in scikit-learn >= 0.23 10 | except ImportError: 11 | from sklearn.externals import joblib # fallback for scikit-learn < 0.23 12 | 13 | ## TODO: Import any additional libraries you need to define a model 14 | from sklearn.ensemble import RandomForestClassifier 15 | 16 | # Provided model load function 17 | def model_fn(model_dir): 18 | """Load model from the model_dir. This is the same model that is saved 19 | in the main if statement. 20 | """ 21 | print("Loading model.") 22 | 23 | # load using joblib 24 | model = joblib.load(os.path.join(model_dir, "model.joblib")) 25 | print("Done loading model.") 26 | 27 | return model 28 | 29 | 30 | ## TODO: Complete the main code 31 | if __name__ == '__main__': 32 | 33 | # All of the model parameters and training parameters are sent as arguments 34 | # when this script is executed, during a training job 35 | 36 | # Here we set up an argument parser to easily access the parameters 37 | parser = argparse.ArgumentParser() 38 | 39 | # SageMaker parameters, like the directories for training data and saving models; set automatically 40 | # Do not need to change 41 | parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) 42 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) 43 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN']) 44 | #parser.add_argument('--max_depth', type=int, default=10) 45 | 46 | ## TODO: Add any additional arguments that you will need to pass into your model 47 | 48 | # args holds all passed-in arguments 49 | args = parser.parse_args() 50 | 51 | # Read in csv training file 52 | training_dir = args.data_dir 53 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None) 54 | 55 | # Labels are in the first column 56 | train_y = train_data.iloc[:,0] 57 | train_x = train_data.iloc[:,1:] 58 | 59 | 60 | ## --- Your code here --- ## 61 | 62 | ## TODO: Define a model 63 | model = RandomForestClassifier() 64 | 
65 | ## TODO: Train the model 66 | model.fit(train_x, train_y) 67 | 68 | 69 | ## --- End of your code --- ## 70 | 71 | 72 | # Save the trained model 73 | 74 | joblib.dump(model, os.path.join(args.model_dir, "model.joblib")) 75 | 76 | 77 | -------------------------------------------------------------------------------- /Spark/Cluster Analysis of the San Diego Weather Data/readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Spark/San Diego Rainforest Fire Predicition/Readme.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /time-series-analysis/Power-consumption-forecasting/json_energy_data/readme.md: -------------------------------------------------------------------------------- 1 | The json energy file 2 | -------------------------------------------------------------------------------- /time-series-analysis/Power-consumption-forecasting/readme.md: -------------------------------------------------------------------------------- 1 |

# Power Consumption Forecasting

2 | 3 |

## Overview

4 | 5 | Power consumption data is a time series: data collected periodically, over time. Time series forecasting is the task of predicting future data points given some historical data. It is commonly used in a variety of tasks, from weather forecasting, retail and sales forecasting, and stock market prediction to behavior prediction (such as predicting the flow of car traffic over a day). There is a lot of time series data out there, and recognizing patterns in that data is an active area of machine learning research! 6 |

## Motivation and the problem

7 | 8 | Take the household power consumption data from 2007-2009 and use it to accurately predict the average Global active power usage for the next several months of 2010. 9 | 10 |

## Data

11 | Energy Consumption Data 12 | The data we'll be working with in this notebook is household electric power consumption data, collected over the globe. The dataset is available on [kaggle](https://www.kaggle.com/uciml/electric-power-consumption-data-set) and represents power consumption collected over several years, from 2006 to 2010. With such a large dataset, we can aim to predict over long periods of time: days, weeks, or months. Predicting energy consumption can be a useful task for a variety of reasons, including determining seasonal prices for power consumption and efficiently delivering power to people according to their predicted usage. 13 | Interesting read: an inversely-related project, recently done by Google and DeepMind, uses machine learning to predict the generation of power by wind turbines and efficiently deliver power to the grid. You can read about that research in this post. 14 | 15 |
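Before training, the raw readings are typically resampled to a coarser granularity (e.g. daily means) so the model can predict over long horizons. A minimal pandas sketch of that step — the column name `Global_active_power` comes from the dataset, but the synthetic series below is only a stand-in for the real file:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly 'Global_active_power' readings
# (the real file is read with sep=';' and na_values=['nan', '?']).
idx = pd.date_range('2007-01-01', periods=48, freq='h')
hourly = pd.Series(np.arange(48, dtype=float), index=idx, name='Global_active_power')
hourly.iloc[5] = np.nan  # simulate a missing reading

# Fill gaps with the column mean, then resample to daily averages,
# the granularity used for forecasting.
hourly = hourly.fillna(hourly.mean())
daily_power = hourly.resample('D').mean()
print(daily_power)
```

The same resampling call works unchanged on the full multi-year dataset; only the index range grows.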

## DeepAR model

16 | 17 | DeepAR utilizes a recurrent neural network (RNN), which is designed to accept some sequence of data points as historical input and produce a predicted sequence of points. So, how does this model learn? 18 | During training, you'll provide a training dataset (made of several time series) to a DeepAR estimator. The estimator looks at all the training time series and tries to identify similarities across them. It trains by randomly sampling training examples from the training time series. 19 | * Each training example consists of a pair of adjacent context and prediction windows of fixed, predefined lengths. 20 | * The `context_length` parameter controls how far in the past the model can see. 21 | * The `prediction_length` parameter controls how far in the future predictions can be made. 22 | You can find more details in this **[documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar_how-it-works.html)**. 23 | 24 |
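Concretely, each training series is uploaded to S3 as one JSON line containing a `start` timestamp and a `target` list of observed values, per the DeepAR input format. A small sketch (the helper name is illustrative):

```python
import json

# DeepAR expects JSON Lines input: one object per series, with a
# 'start' timestamp string and a 'target' list of observed values.
def series_to_json_line(start_timestamp, values):
    return json.dumps({'start': str(start_timestamp), 'target': list(values)})

line = series_to_json_line('2007-01-01 00:00:00', [1.2, 0.9, 1.5])
print(line)
```

One such line is written per time series, and the resulting `.json` files are what get uploaded to S3 for the estimator.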

## Notebook outline

25 | 26 | * Loading and exploring the data 27 | * Creating training and test sets of time series 28 | * Formatting data as JSON files and uploading to S3 29 | * Instantiating and training a DeepAR estimator 30 | * Deploying a model and creating a predictor 31 | * Evaluating the predictor 32 | -------------------------------------------------------------------------------- /time-series-analysis/Power-consumption-forecasting/txt_preprocessing.py: -------------------------------------------------------------------------------- 1 | # This file was used to process the initial, raw text: household_power_consumption.txt 2 | import pandas as pd 3 | 4 | 5 | ## 1. Raw data processing 6 | 7 | # The 'household_power_consumption.txt' file has the following attributes: 8 | # * Each data point has a date and time (hour) of recording 9 | # * The data points are separated by semicolons (;) 10 | # * Some values are 'nan' or '?', and we'll treat these both as `NaN` values when making a DataFrame 11 | 12 | # A helper function to read the file in and create a DataFrame, indexed by 'Date Time' 13 | def create_df(text_file, sep=';', na_values=['nan','?']): 14 | '''Reads in a text file and converts it to a dataframe, indexed by 'Date Time'.''' 15 | 16 | df = None 17 | 18 | # check that the file is the expected text file 19 | expected_file='household_power_consumption.txt' 20 | if(text_file != expected_file): 21 | print('Unexpected file: '+str(text_file)) 22 | return df 23 | 24 | # read in the text file 25 | # each data point is separated by a semicolon 26 | df = pd.read_csv('household_power_consumption.txt', sep=sep, 27 | parse_dates={'Date-Time' : ['Date', 'Time']}, infer_datetime_format=True, 28 | low_memory=False, na_values=na_values, index_col='Date-Time') # indexed by Date-Time 29 | 30 | return df 31 | 32 | ## 2. Managing `NaN` values 33 | 34 | # This DataFrame does include some data points that have missing values. 
35 | # So far, we've mainly been dropping these values, but there are other ways to handle `NaN` values. 36 | # One technique is to fill the missing values with the *mean* value of the column; 37 | # this way the added value is likely to be realistic. 38 | 39 | # A helper function to fill NaN values with a column average 40 | def fill_nan_with_mean(df): 41 | '''Fills NaN values in a given dataframe with the average values in a column. 42 | This technique works well for filling missing, hourly values 43 | that will later be averaged into energy stats over a day (24hrs).''' 44 | 45 | # fill NaN values in each column with that column's mean 46 | num_cols = len(list(df.columns.values)) 47 | for col in range(num_cols): 48 | df.iloc[:,col] = df.iloc[:,col].fillna(df.iloc[:,col].mean()) 49 | 50 | return df 51 | 52 | -------------------------------------------------------------------------------- /time-series-analysis/readme.md: -------------------------------------------------------------------------------- 1 | The time series analysis projects 2 | --------------------------------------------------------------------------------